Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

A Gapless and Phased Diploid Genome Assembly for Auxenochlorella protothecoides Facilitates Metabolic Modeling and Proteomics Analyses


Rory J. Craig1, Dimitrios J. Camacho1, Sean D. Gallaher1, Carrie D. Nicora3, Samuel O. Purvine3, Jacob Tamburro2, Mark Vigliotti2, Jeffrey L. Moseley1, Mary S. Lipton3, Nanette R. Boyle2*, and Sabeeha S. Merchant1


1University of California–Berkeley; 2Colorado School of Mines; and 3Pacific Northwest National Laboratory


Auxenochlorella protothecoides, a Trebouxiophyte oleaginous alga, is a reference for discovery and a platform for photosynthesis-driven synthetic biology and sustainable bio-production. Researchers will expand transformation markers, regulatory sequences and reporter genes, improve transformation efficiency, and develop RNP-mediated gene-editing methods for genome modification. Systems analyses and metabolic modeling approaches will inform genome modifications for rational improvement of photosynthetic carbon fixation and strain engineering to produce cyclopropane fatty acids. Regulatory factors and signaling pathways responsible for activating fatty acid and triacylglycerol biosynthesis will be identified, and researchers will manipulate them to increase lipid productivity. Non-photochemical quenching and a regulatory circuit for maintaining photosynthesis under Cu-limitation, both of which are absent in A. protothecoides, will be introduced to improve photosynthetic resilience, and the performance of engineered strains will be modeled.


High-quality reference genomes and structural annotations are the foundation of many systems and synthetic biology approaches. Researchers have produced a gapless and phased genome assembly for the diploid A. protothecoides strain UTEX 250 using high-coverage Pacific Biosciences (PacBio) HiFi sequencing and Omni-C linked read sequencing. The haploid length of the UTEX 250 nuclear genome is 22 Mb, arranged on 12 chromosomes ranging from 0.5 to 4.1 Mb. The genome is GC-rich (64%) and generally highly heterozygous; the two haplotypes differ at ~3% of sites, enabling allele-specific transformation and the quantification of allele-specific gene expression. However, approximately a third of the genome is homozygous, including three entire chromosomes, suggesting widespread loss-of-heterozygosity events as observed in other vegetative diploids (e.g., yeasts). Complete circular plastome (84.6 kb) and mitogenome (54.0 kb) assemblies have also been produced.

To produce highly accurate structural annotations, researchers have sequenced PacBio Iso-Seq libraries from mixotrophic and heterotrophic conditions, and ~60 million paired-end and stranded RNA-seq reads from several other growth conditions. Using these data, researchers have annotated ~7,600 gene models per haploid genome, approximately 70% of which are supported by full-length Iso-Seq reads. More than 200 complex gene models were corrected by manual annotation. The quality of the annotations is supported by a Benchmarking Universal Single-Copy Orthologs (BUSCO) score of ~99% completeness, with all missing BUSCOs manually confirmed to be biologically absent from the genome. The A. protothecoides genome is remarkably compact with respect to gene content and features <5% repeats, although a handful of potentially active DNA transposons have been identified.

The A. protothecoides haploid gene number is less than half that of the reference green alga Chlamydomonas reinhardtii. Researchers are presently performing comparative analyses among several high-quality algal genomes to functionally characterize gene presence and absence, with particular focus on the GreenCut proteins, which are typically found in photosynthetic plants and green algae. Researchers are using the structural annotations to improve synthetic biology approaches, including the identification of potential condition-specific promoters and the refinement of Kozak sequence and codon bias optimization. A. protothecoides is also unusual among green algae in that several geographically and environmentally diverse isolates are available in culture. Researchers are targeting a pan-genome of ~20 strains, with a view to identifying standing genetic variation that may facilitate strain improvement, e.g., growth in brackish water.

The new genome sequence will be used to update and improve the first draft metabolic network of A. protothecoides that the team published previously. To model growth in a variety of conditions, a complete biomass analysis will be performed for growth in heterotrophic, autotrophic, and mixotrophic conditions. The experimentally obtained data will be used to develop accurate biomass objective equations for each growth condition. Constraint-based metabolic modeling approaches, such as flux balance analysis (FBA), will be used to predict growth and yields in different environmental and genetic backgrounds. These simulations will also be used to identify gene targets to further improve production of cyclopropane fatty acids. Isotope-assisted metabolic flux analysis (13C-MFA) will also be used to quantify fluxes in different growth conditions and mutant strains.
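As an illustration of the constraint-based approach, FBA can be posed as a linear program: maximize a biomass flux subject to steady-state mass balance (S v = 0) and flux bounds. The following is a minimal sketch on a hypothetical four-reaction toy network, not the A. protothecoides metabolic model; the reaction names, stoichiometry, and bounds are all assumptions chosen for illustration, and SciPy's general-purpose `linprog` solver stands in for a dedicated modeling package.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (NOT the A. protothecoides model):
#   R_uptake :      -> A   (substrate uptake)
#   R_convert:  A   -> B
#   R_biomass:  B   ->     (biomass objective)
#   R_waste  :  A   ->     (by-product export)
REACTIONS = ["R_uptake", "R_convert", "R_biomass", "R_waste"]

# Stoichiometric matrix S: rows are metabolites, columns are reactions.
S = np.array([
    [1.0, -1.0,  0.0, -1.0],  # metabolite A
    [0.0,  1.0, -1.0,  0.0],  # metabolite B
])


def max_biomass_flux(bounds):
    """Maximize R_biomass subject to S v = 0 and per-reaction flux bounds."""
    c = np.zeros(len(REACTIONS))
    # linprog minimizes, so negate the objective coefficient to maximize.
    c[REACTIONS.index("R_biomass")] = -1.0
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return -res.fun


# "Environmental background": uptake capped at 10 (arbitrary flux units).
wild_type = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]

# "Genetic background": knock out R_convert by forcing its flux to zero.
knockout = list(wild_type)
knockout[REACTIONS.index("R_convert")] = (0.0, 0.0)

print("wild type:", max_biomass_flux(wild_type))
print("knockout :", max_biomass_flux(knockout))
```

In this toy case the biomass flux is limited by the uptake bound in the unmodified network and falls to zero when the only route to B is removed. A condition-specific biomass objective equation would replace the single R_biomass column with measured biomass stoichiometry.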

Assessment and quantification of protein production is a key step in evaluating engineering outcomes. Researchers have extended proteomics pipelines to A. protothecoides using state-of-the-art technologies resident at PNNL, and in a single pilot experiment the team captured >6,100 A. protothecoides proteins, representing around 80% of the ~7,600 proteins encoded in the genome. More detailed analyses are now possible due to the completion of the genome. For applications where it is critical to know the exact abundance of proteins, researchers will employ targeted proteomics with selected reaction monitoring (SRM) to determine the absolute amount of a subset of proteins. For instance, metabolic models of flux from carbon fixation to triacylglycerol biosynthesis will be more accurate when they can incorporate the concentration of active sites for key pathway enzymes. Transgenes from engineering efforts, orthologs of fatty acid and lipid biosynthesis enzymes, Calvin-Benson cycle (CBC) enzymes, PSI, PSII, and light-harvesting proteins in Auxenochlorella are candidate targets for the SRM approach.

Funding Information

This work was supported by the DOE Office of Science, Office of Biological and Environmental Research (BER), grant no. DE-SC0023027.