Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

Improving Candidate Gene Discovery by Combining Multiple Genetic Mapping Datasets


Rubén Rellán-Álvarez*, Nirwan Tandukar, Fausto Rodríguez-Zapata, Jung-Ying Tzeng


North Carolina State University


  1. Exploit domain knowledge on phosphorus action in roots to identify strong predictors for phosphorus data and employ an XGBoost model to predict phosphorus levels for ~2,000 georeferenced sorghum landraces distributed in Africa.
  2. Perform an environmental genome-wide association study (GWAS) in those landraces that have already been genotyped using the predicted phosphorus data as the phenotypes for the GWAS analysis.
  3. Characterize the genetic architecture of lipid content during the early stages of sorghum development using the Sorghum Association Panel (SAP). Researchers will perform a GWAS on lipid content under low temperature and low phosphorus.
  4. Develop algorithms that incorporate all the different types of information the group collects (i.e., metabolite levels, GWAS candidate genes, selection signals) to improve researchers’ ability to detect signals of small effects and increase their confidence in the selection of candidate genes. The algorithms and pipelines developed here will be made available to the community as R packages.


Phosphorus (P) is one of the three primary nutrients in commercial fertilizers and is essential for plant growth and development. Excessive use of P-rich fertilizers in agriculture leads to leaching and runoff into water bodies, harming aquatic life. Limited global reserves of rock phosphate, together with this water pollution, make it necessary to find sustainable solutions that use P efficiently and minimize leaching and runoff. Landrace varieties adapted to soils with varying levels of P availability likely possess unique genetic mechanisms to cope with P scarcity.

Environmental GWAS on genotyped, georeferenced landrace accessions offers a way to identify candidate genes. Using this approach, the group seeks to identify genes and pathways associated with P in plants that will help researchers overcome these obstacles. Accurate phenotype measurement is central to the success of any GWAS. To this end, researchers have developed an XGBoost model that predicts P availability in the soil from domain-based knowledge; it surpasses current models in prediction accuracy and in its ability to discern lower-end values. Using high-dimensional genetic datasets of georeferenced Sorghum bicolor in Africa, the team will conduct an environmental GWAS with the new P availability data as the phenotype. The team will then use a linear regression-based p-value combination method (MAGMA) to aggregate multiple small effects at the gene level. A previous study by the group identified lipid variations, specifically in phosphatidylcholine, in maize adapted to low-P conditions in the Mexican highlands.

Additionally, researchers have obtained lipid profiles for 400 SAP accessions grown in normal and low-P conditions. Analyzing the corresponding lipid dynamics will play a significant role in understanding P utilization. A GWAS will then be conducted on the candidate lipids identified in this analysis. By using the Cauchy combination test to combine the findings from the P GWAS and the lipid GWAS, the team aims to re-rank candidate genes by their combined evidence. This integrative analysis will facilitate the identification of candidates that are linked to both lipids and P efficiency.
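
The combination step described above can be illustrated with a minimal sketch of the Cauchy combination test. It assumes that each GWAS has already been summarized to one p-value per gene and that the two studies are weighted equally; the function name, the gene identifiers, and the p-values are hypothetical placeholders for illustration, not results or code from this project.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Each p-value is mapped to a standard Cauchy variate,
    tan((0.5 - p) * pi); the weighted sum of these variates is then
    converted back to a p-value using the Cauchy tail probability.
    Weights default to equal (an assumption of this sketch) and are
    normalized to sum to 1.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    n = len(p_values)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    total = sum(weights)
    weights = [w / total for w in weights]

    stat = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p < 1.0:
            raise ValueError("p-values must lie strictly between 0 and 1")
        if p < 1e-16:
            # tan() loses precision for extremely small p;
            # use the approximation tan((0.5 - p) * pi) ~ 1 / (p * pi).
            stat += w / (p * math.pi)
        else:
            stat += w * math.tan((0.5 - p) * math.pi)

    if stat > 1e15:
        # Tail approximation of the Cauchy survival function.
        return 1.0 / (stat * math.pi)
    return 0.5 - math.atan(stat) / math.pi


# Hypothetical gene-level p-values: (P GWAS, lipid GWAS).
gene_pvalues = {
    "gene_A": (1e-6, 0.04),
    "gene_B": (0.02, 0.03),
    "gene_C": (0.60, 1e-4),
}

combined = {g: cauchy_combination(list(ps)) for g, ps in gene_pvalues.items()}
# Rank genes from strongest to weakest combined evidence.
ranking = sorted(combined, key=combined.get)
for gene in ranking:
    print(gene, combined[gene])
```

One reason this test suits the setting is that its tail approximation remains valid under arbitrary dependence between the input p-values, which matters when two GWAS analyses share the same genotypes. Unequal weights could be passed to favor one study, but that choice is not specified in the text above.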


This research was supported by the DOE Office of Science, BER program (grant no. DE‐SC0021889).