Scalable Computational Tools for Inference of Protein Annotation and Metabolic Models in Microbial Communities
Saeedeh Davoudi1* (firstname.lastname@example.org), Janaka N. Edirisinghe2* (email@example.com), Mikayla A. Borton3 (firstname.lastname@example.org), Michael Shaffer3, Zahmeeth Sakkaff2, Filipe Liu2, Rory M. Flynn3, Lucia S. Guatney1,4, Derick Singleton1, James Stegen5, Byron C. Crump6, James J. Davis2, Farnoush Banaei-Kashani1, Kelly C. Wrighton3, Christopher Henry2, and Christopher S. Miller1
1University of Colorado–Denver; 2Argonne National Laboratory; 3Colorado State University; 4University of Colorado–Anschutz Medical Campus; 5Pacific Northwest National Laboratory; and 6Oregon State University
High-throughput omics technologies have made the assembly of microbial genomes recovered from the environment routine. Computational inference of the protein products encoded by these genomes, and of their associated biochemical functions, should allow for the accurate prediction and modeling of microbial metabolism, organismal interactions, and ecosystem processes. However, a lack of scalable, probabilistic protein annotation tools keeps metabolic modeling from reaching its full potential. This project's approach to inferring improved models relies on developing new computational tools in three main areas: 1) improving protein annotations, 2) iterating between gap-filling metabolic models with improved protein annotations and using those models to inform probabilistic protein annotations, and 3) integrating improved protein annotations with community-level flux balance metabolic models. Researchers aim to make these tools broadly accessible via the DOE Systems Biology Knowledgebase (KBase; Arkin et al. 2018).
In the past year, researchers continued to improve the genome annotation and modeling capabilities emerging from this project, both by advancing traditional homology-based approaches and by developing new approaches that leverage genome-scale metabolic models and methods from the field of natural language processing.
The DRAM (Shaffer et al. 2020) app in KBase was significantly improved, enhancing the reliability, usability, and overall quality of its output annotations, all within the KBase framework, which allows these annotations to interoperate with other annotation and modeling apps. This work culminated in a new publication highlighting the utility of DRAM within KBase (Shaffer et al. submitted). The team also extended DRAM to predict microbial traits (e.g., nitrate reducer, aerobe, fermenter) from protein annotations. These traits were developed and validated through extensive expert curation (described below).
Beyond DRAM, researchers also deployed a new tool called GLM4EC, for which they trained and fine-tuned a modified Generalized Language Model (GLM; also known as a Large Language Model) based on ProteinBERT (Brandes et al. 2022) for the task of annotating microbial proteins with Enzyme Commission (E.C.) numbers. This model was trained to learn sequence embeddings and annotation classification from a subset of UniRefKB annotated with E.C. numbers. On held-out test sets, the model predicts E.C. numbers with high precision and recall, regardless of input sequence length. Researchers are now testing new models that have additional global features beyond E.C. numbers available for pretraining and are exploring the utility of alternative model architectures. Integration of GLM4EC into KBase, along with the improved version of DRAM and other existing annotation pipelines, provides multiple hypotheses for the function of gene products within genomes and metagenome-assembled genomes (MAGs) in KBase, all of which can be integrated or explored in a common, interoperable framework with other KBase tools.
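As an illustration of how precision and recall can be computed for E.C. number prediction on a held-out set, the following is a minimal sketch. It assumes micro-averaging over (protein, E.C. number) pairs; the abstract does not state which averaging scheme was used for GLM4EC, and the function name, protein identifiers, and toy labels below are hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    truth: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (protein, E.C. number) pairs.

    Only proteins present in `truth` are scored; predictions for other
    proteins are ignored.
    """
    tp = fp = fn = 0
    for protein, true_ecs in truth.items():
        pred_ecs = predicted.get(protein, set())
        tp += len(true_ecs & pred_ecs)   # correctly predicted E.C. numbers
        fp += len(pred_ecs - true_ecs)   # predicted but not annotated
        fn += len(true_ecs - pred_ecs)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy, made-up held-out set (not data from the project).
truth = {"p1": {"1.1.1.1"}, "p2": {"2.7.1.1", "2.7.1.2"}, "p3": {"3.5.1.5"}}
predicted = {"p1": {"1.1.1.1"}, "p2": {"2.7.1.1"}, "p3": {"3.5.1.5", "4.2.1.1"}}

precision, recall = micro_precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

To check for dependence on input sequence length, the same function could be applied separately to subsets of the held-out proteins binned by length.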
To aid in determining which of the potentially alternative functions a gene product actually performs, researchers developed machine learning classifiers to predict growth phenotypes based on multiple functional annotations. These classifiers are now loaded into KBase, along with apps that allow them to be applied to predict phenotypes for any KBase genome or MAG. Further, in collaboration with the Hofmockel Science Focus Area (SFA), researchers have integrated apps in KBase that reconcile metabolic models with predicted phenotypes, using the alternative functions proposed by the new annotation apps to associate gene candidates with gapfilled reactions. This reconciliation improves model accuracy from 56% to 72% on average. It also yields numerous corrected annotations across all genomes, particularly in MAGs, where many genes and associated functions are often missing due to incomplete assemblies. Combined with the recent enhancements to the ModelSEED pipeline to improve energy biosynthesis prediction in methanogens (a key target for this project), researchers now have models that predict both core and periphery metabolism with greatly improved accuracy.
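The step of associating gene candidates with gapfilled reactions can be sketched as follows. This is a simplified illustration, not the KBase implementation: it assumes each annotation tool reports a confidence score that is comparable across tools, and it simply picks the highest-scoring candidate per reaction. The `Candidate` type, the `assign_genes` function, and all reaction, gene, and score values are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Candidate(NamedTuple):
    gene: str
    source: str   # annotation tool that proposed the function
    score: float  # confidence, assumed comparable across tools


def assign_genes(
    gapfilled: List[str], candidates: Dict[str, List[Candidate]]
) -> Dict[str, Optional[Candidate]]:
    """For each gapfilled reaction, pick the highest-scoring gene candidate.

    Reactions with no candidate from any annotation source map to None,
    i.e. they remain gapfilled without gene evidence.
    """
    assignments: Dict[str, Optional[Candidate]] = {}
    for rxn in gapfilled:
        options = candidates.get(rxn, [])
        assignments[rxn] = max(options, key=lambda c: c.score) if options else None
    return assignments


# Toy, made-up inputs.
gapfilled = ["rxn_A", "rxn_B"]
candidates = {
    "rxn_A": [
        Candidate("gene_1", "DRAM", 0.6),
        Candidate("gene_2", "GLM4EC", 0.9),
    ],
}

assignments = assign_genes(gapfilled, candidates)
for rxn, cand in assignments.items():
    print(rxn, "->", cand.gene if cand else "no gene candidate")
```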
With all these components in place, researchers are applying this improved pipeline to build models for over 2,000 MAGs loaded into KBase from the Genome Resolved Open Watersheds (GROW) project. The MAG models from this analysis now have many more reactions and annotations, particularly for poorly annotated clades. Researchers are now assembling these MAG models into compartmentalized community metabolic models for each of the 178 GROW samples. Researchers are loading hand-curated microbial traits (in part aided by DRAM inference) and corresponding trophic interaction networks into KBase as phenotypes, enabling the gapfilling of community models to replicate these hypothesized expert-curated trophic webs. The degree of gapfilling required to replicate trophic webs provides valuable feedback for identification of potential errors in these webs. The resulting models can also be tested against a growing collection of metatranscriptome data gathered from these samples, measuring agreement between reaction flux and associated gene expression.
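One simple way to measure agreement between reaction flux and associated gene expression is to compare, reaction by reaction, whether the model carries flux and whether the associated genes are expressed. The sketch below assumes a binary active/inactive comparison with fixed thresholds; the abstract does not specify the metric actually used, and the function name, thresholds, and values are hypothetical.

```python
from typing import Dict


def flux_expression_agreement(
    flux: Dict[str, float],
    expression: Dict[str, float],
    flux_eps: float = 1e-6,
    expr_threshold: float = 1.0,
) -> float:
    """Fraction of shared reactions whose flux state matches expression state.

    A reaction is 'active' if |flux| > flux_eps and 'expressed' if its
    associated expression value is >= expr_threshold.
    """
    shared = flux.keys() & expression.keys()
    if not shared:
        return 0.0
    matches = sum(
        (abs(flux[r]) > flux_eps) == (expression[r] >= expr_threshold)
        for r in shared
    )
    return matches / len(shared)


# Toy, made-up values: predicted fluxes and per-reaction expression levels.
flux = {"r1": 2.0, "r2": 0.0, "r3": -1.5, "r4": 0.0}
expression = {"r1": 10.0, "r2": 0.2, "r3": 0.1, "r4": 5.0}

agreement = flux_expression_agreement(flux, expression)
print(f"agreement={agreement:.2f}")
```

A rank correlation between absolute flux and expression would be an alternative that avoids fixed thresholds.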
Arkin, A. P., et al. 2018. “KBase: The United States Department of Energy Systems Biology Knowledgebase.” Nature Biotechnology 36, 566–9.
Shaffer, M., et al. 2020. “DRAM for Distilling Microbial Metabolism to Automate the Curation of Microbiome Function.” Nucleic Acids Research 48, 8883–900.
Shaffer, M., et al. Submitted. “kb_DRAM: Distilled Genome Annotations and Metabolic Modeling in KBase.”
Brandes, N., et al. 2022. “ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function.” Bioinformatics 38(8), 2102–10.
This research was supported by the DOE Office of Science, Office of Biological and Environmental Research (BER), grant no. DE-SC0021350, and the DOE Joint Genome Institute Community Science Program. KBase was supported by Award Numbers DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886.