Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

Machine Learning-Assisted Genome-Wide Association Study Uncovers Copy-Number Variations of Tandem Paralogs Driving Stress Tolerance Evolution in Issatchenkia orientalis


Ping-Hung Hsieh1,2*, Yusuke Sasaki1,2, Chu-I Yang3, Zong-Yen Wu1,2, Andrei S. Steindorff4, Sajeet Haridas4, Jing Ke4, Zia Fatma1,5, Zhiying Zhao4, Dana A. Opulente6, Siwen Deng7, Chris Todd Hittinger6, Igor V. Grigoriev2,4,7, Bruce Dien1,8, Huimin Zhao1,5,9, Yi-Pei Li3, Yasuo Yoshikuni1,2,4,10,11,12, Andrew Leakey1


1DOE Center for Bioenergy and Bioproducts Innovation; 2Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory; 3Department of Chemical Engineering, National Taiwan University, Taipei, Taiwan; 4The U.S. DOE Joint Genome Institute, Lawrence Berkeley National Laboratory; 5Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign; 6Laboratory of Genetics, DOE Great Lakes Bioenergy Research Center, Center for Genomic Science Innovation, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin–Madison; 7Department of Plant and Microbial Biology, University of California–Berkeley; 8USDA, Agricultural Research Service, National Center for Agricultural Utilization Research, Bioenergy Research Unit, Peoria, IL; 9Department of Chemistry, Biochemistry, and Bioengineering, University of Illinois Urbana-Champaign; 10Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory; 11Global Institution for Collaborative Research and Education, Hokkaido University, Japan; 12Institute of Global Innovation Research, Tokyo University of Agriculture and Technology, Tokyo, Japan



This project aims to exploit the genetic and phenotypic diversity of Issatchenkia orientalis through population genomics and machine learning approaches to: (1) explore the potential molecular mechanisms behind its multi-stress tolerance; (2) develop predictive models for various stress tolerances; and (3) identify key genes associated with tolerance to industrial stresses and resistance to different fungicides.


The environmental yeast I. orientalis plays a dual role in human society, owing to its multi-stress tolerance. Its ability to withstand common industrial stressors, such as low pH and high temperatures, makes it an ideal candidate for engineered biosynthesis of bioproducts. However, it also poses significant health risks as a multidrug-resistant pathogen capable of causing invasive fungal diseases and is recognized by the World Health Organization as a priority fungal pathogen. Deciphering the molecular mechanisms and evolutionary paths of its stress tolerance is key for managing risks and exploiting the benefits of this species. Here, researchers report the potential mechanisms driving the adaptive evolution of I. orientalis to various stressors using population genomics and machine learning approaches.

The team conducted whole-genome sequencing of 170 I. orientalis isolates and assessed the growth of 161 isolates under 57 different stressors, including heat, low pH, organic acids, lignocellulosic inhibitors, and fungicides from different families (e.g., azoles, polyenes, and echinocandins). The team also developed a machine-learning-assisted analysis pipeline, Machine-Learning-Assisted Engineering of Stress-Tolerance Rational Optimization (MAESTRO), to streamline the analysis. MAESTRO revealed that pleiotropic effects of copy-number variations (CNVs) among a small set of genes (less than 3.5%) play a significant role in I. orientalis’ stress tolerance. Using CNVs as features, the team successfully correlated genetic variation with phenotypic variation in stress tolerance, achieving a median R² of 0.67 and a median Pearson’s correlation of 0.92 across 57 stress conditions when comparing actual fitness to predicted fitness values. Notably, many of these genes (17 to 23%) were tandem repeat paralogs (TRPs), a genomic configuration known to act as a hot spot for gene amplification, reduction, or even shuffling to invent new activities. Additionally, TRPs were significantly enriched with transporters (52%, compared to the genome-wide average of 3%), likely composing the resistome network. Finally, as a proof of concept, the team deleted four TRPs to engineer a strain with enhanced tolerance to the lignocellulosic inhibitor 5-hydroxymethylfurfural but lower resistance to the fungicide fluconazole. This work demonstrates the potential of leveraging fungal genetic variation to predict the risks fungi pose to society and to design more robust industrial strains for a sustainable bioeconomy.
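
The two agreement metrics reported above, the coefficient of determination and Pearson's correlation, measure different things, which is why their medians can differ. The following is a minimal sketch in Python, not the MAESTRO implementation: the function names, the condition labels, and all fitness values are hypothetical toy numbers chosen only to illustrate how each metric is computed per condition and then summarized by a median across conditions.

```python
from math import sqrt
from statistics import mean, median


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mu) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def pearson_r(actual, predicted):
    """Pearson's correlation between actual and predicted values."""
    mu_a, mu_p = mean(actual), mean(predicted)
    cov = sum((a - mu_a) * (p - mu_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mu_a) ** 2 for a in actual)
    var_p = sum((p - mu_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


# Hypothetical toy data: condition -> (actual fitness, predicted fitness).
# These values are illustrative assumptions, not results from the study.
toy_conditions = {
    "condition_A": ([0.2, 0.4, 0.6, 0.8], [0.35, 0.45, 0.55, 0.65]),
    "condition_B": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
    "condition_C": ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]),
}

# Score each condition separately, then summarize with the median.
r2_by_condition = {
    name: r_squared(actual, predicted)
    for name, (actual, predicted) in toy_conditions.items()
}
r_by_condition = {
    name: pearson_r(actual, predicted)
    for name, (actual, predicted) in toy_conditions.items()
}

median_r2 = median(r2_by_condition.values())
median_r = median(r_by_condition.values())

for name in toy_conditions:
    print(f"{name}: R2={r2_by_condition[name]:.2f}, r={r_by_condition[name]:.2f}")
print(f"median R2={median_r2:.2f}, median r={median_r:.2f}")
```

In the first toy condition the predictions are perfectly correlated with the actual values but compressed toward the mean, so Pearson's correlation is at its maximum while the coefficient of determination is lower. A gap between the two metrics can therefore arise when predictions rank isolates well but are off in scale or offset.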

Funding Information

This work was supported by the U.S. DOE Center for Advanced Bioenergy and Bioproducts Innovation (DOE, Office of Science contracts DE-SC0018420 and DE-AC02-05CH1123) and the Biosystems Design program (DOE, Office of Science contracts DE-SC0018260 and DE-AC02-05CH1123). The work (proposal: 10.46936/10.25585/60001019) conducted by the U.S. DOE Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. DOE operated under Contract No. DE-AC02-05CH11231. Y.-P.L. is supported by the Taiwan NSTC Young Scholar Fellowship Einstein Program (111-2636-E-002-025). Research in the Hittinger Laboratory is supported by the U.S. National Science Foundation under Grant No. DEB-2110403, the USDA National Institute of Food and Agriculture (Hatch Project 1020204), in part by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-SC0018409), and an H. I. Romnes Faculty Fellowship, supported by the Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. The team is grateful that the U.S. Department of Agriculture, Agricultural Research Service Culture Collection (Northern Regional Research Laboratory [NRRL]) Database provided researchers with nearly 60 strains free of charge.