Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

Probabilistic Annotation and Ensemble Metabolic Modeling in KBase

Authors:

Patrik D’haeseleer1* (dhaeseleer2@llnl.gov), Jeffrey Kimbrel1, Ali Navid1, Chris Henry2, and Rhona Stuart1

Institutions:

1Lawrence Livermore National Laboratory (LLNL); and 2Argonne National Laboratory

Goals

Functional annotation tools such as Rapid Annotation using Subsystem Technology (RAST) or Kyoto Encyclopedia of Genes and Genomes (KEGG) don’t always agree, and it is not obvious how to best leverage them for metabolic modeling. This project is developing tools for the DOE Systems Biology Knowledgebase (KBase) to give users a principled way to weigh multiple sources of functional annotation against each other, enable better metabolic modeling of hard-to-annotate organisms and pathways, allow analysis of uncertainty in the resulting models’ network structure or behavior, and provide an infrastructure on which to build more sophisticated machine learning techniques in KBase.

Abstract

The µBiospheres Science Focus Area (SFA) at LLNL investigates metabolic interactions in bioenergy-relevant microbial communities. A critical part of this research is the development of genome-scale models of metabolism, which requires well-annotated genomes. By combining annotations from multiple sources, researchers can achieve a more complete metabolic network reconstruction, greatly reducing the effort required to curate quality metabolic models (Griesemer et al. 2018). In previous work, researchers developed a set of KBase apps to import, compare, and merge functional annotations from a wide range of annotation tools, yielding significantly improved metabolic models. These apps have proven very useful and are currently in daily use in this SFA and in several other research groups using KBase.

It is quite common for functional annotation tools to disagree on the function that should be assigned to certain genes, and this uncertainty can have significant consequences for the resulting metabolic networks and the behavior they predict for the organism. The team is now developing a set of tools to deal with these disagreements in a more systematic manner: by calculating the likelihood of metabolic reactions given the annotations from various sources, and then carrying those reaction likelihoods through into the modeling results. Researchers have modified the existing import app to support importing annotation scores and evidence codes, such as reaction probabilities, log likelihoods, and Basic Local Alignment Search Tool (BLAST) or hidden Markov model scores.

Researchers can use a Naive Bayes approach to estimate the probability of each reaction assigned to a gene, given the annotations from a range of different annotation tools. To do this, the team first needs to evaluate the reliability (false positive and false negative rates) of each of the major annotation tools (currently RAST, Prokka, Distilled and Refined Annotation of Metabolism, and KEGG) by running them on a reference dataset consisting of 15,000 enzymes in Swissprot that have experimental evidence codes. This rigorous validation effort has also led to some significant improvements in the ModelSEED biochemistry database, beyond the well-curated set of template reactions that are normally used by KBase’s metabolic modeling engine.
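The Naive Bayes combination described above can be illustrated with a minimal sketch. It assumes that each tool either calls a reaction for a gene or does not, that the tools err independently given the true function, and that a prior probability is available. The error rates, the prior, and the function name reaction_posterior are hypothetical placeholders for illustration, not measured values or the project's actual implementation.

```python
# Minimal Naive Bayes sketch: combine yes/no reaction calls from several
# annotation tools into one posterior probability for the reaction.
# All numbers below are HYPOTHETICAL placeholders, not benchmark results.
TOOL_RATES = {
    "RAST":   {"fpr": 0.05, "fnr": 0.30},
    "Prokka": {"fpr": 0.10, "fnr": 0.40},
    "KEGG":   {"fpr": 0.05, "fnr": 0.25},
}


def reaction_posterior(calls, rates=TOOL_RATES, prior=0.01):
    """Return P(reaction is real | tool calls).

    calls maps a tool name to True (the tool assigned the reaction to the
    gene) or False (the tool was run but did not assign it).
    Tools are assumed conditionally independent (the "naive" assumption).
    """
    odds = prior / (1.0 - prior)  # prior odds
    for tool, called in calls.items():
        fpr = rates[tool]["fpr"]
        fnr = rates[tool]["fnr"]
        if called:
            # Likelihood ratio of a positive call: sensitivity / FPR.
            odds *= (1.0 - fnr) / fpr
        else:
            # Likelihood ratio of a missing call: FNR / specificity.
            odds *= fnr / (1.0 - fpr)
    return odds / (1.0 + odds)  # convert odds back to a probability


# Example usage: agreement among tools raises the posterior.
p_all = reaction_posterior({"RAST": True, "Prokka": True, "KEGG": True})
p_one = reaction_posterior({"RAST": True, "Prokka": False, "KEGG": False})
```

Working in odds keeps the update to one multiplication per tool. For many tools or extreme rates, summing log likelihood ratios would be the numerically safer variant.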

Enzymes in Swissprot have historically been annotated using Enzyme Commission (EC) numbers, which are far from ideal for mapping to unique metabolic reactions for modeling. Some EC numbers are overly generic, forcing the team either to omit them altogether or to instantiate them as multiple unique reactions. The team is building support into the Ontology application programming interface (which translates from EC numbers and other annotation vocabularies to ModelSEED reactions) to filter out unbalanced, overly generic, or otherwise unsuitable reactions for metabolic modeling. In the longer term, the team may use the new Rhea reaction identifiers that are being curated into the Swissprot database as the reference dataset, which should provide a much more direct mapping to ModelSEED reactions.
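One way to picture the filtering step is the sketch below. It is not the Ontology API itself: the function names, the mapping table, the reaction identifiers, and the max_reactions cutoff are all hypothetical. It treats an EC number as overly generic when it is not fully specified (for example, it contains a "-" placeholder) or when it maps to more reactions than the chosen cutoff.

```python
# Hypothetical sketch of filtering EC numbers before mapping to reactions.
# This is NOT the KBase Ontology API; names and data are illustrative only.

def is_fully_specified(ec):
    """True if the EC number has four numeric fields, e.g. '1.1.1.1'.

    Partial numbers such as '1.1.1.-' (and, in this sketch, preliminary
    numbers such as '1.1.1.n1') are treated as not fully specified.
    """
    parts = ec.split(".")
    return len(parts) == 4 and all(p.isdigit() for p in parts)


def translate_ecs(ec_numbers, ec_to_reactions, max_reactions=3):
    """Split EC numbers into usable mappings and skipped entries.

    ec_to_reactions is an assumed lookup table: EC number -> reaction IDs.
    An EC is skipped if it is partial, unmapped, or maps to more than
    max_reactions reactions (an arbitrary cutoff for "overly generic").
    """
    kept, skipped = {}, []
    for ec in ec_numbers:
        rxns = ec_to_reactions.get(ec, [])
        if not is_fully_specified(ec) or not rxns or len(rxns) > max_reactions:
            skipped.append(ec)
        else:
            kept[ec] = list(rxns)
    return kept, skipped


# Example usage with made-up reaction identifiers.
table = {
    "1.1.1.1": ["rxn_a", "rxn_b"],
    "2.7.1.2": ["rxn_c"],
    "3.1.3.16": ["rxn_d", "rxn_e", "rxn_f", "rxn_g", "rxn_h"],
}
kept, skipped = translate_ecs(["1.1.1.1", "1.1.1.-", "3.1.3.16", "2.7.1.2"], table)
```

A real filter would also need reaction-level checks such as mass and charge balance, which this sketch does not attempt.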

Once reaction probabilities are associated with all the genes in a genome, researchers can then sample from those probabilities to create an ensemble of metabolic models (Medlock, Moutinho, and Papin 2020). Each of these models can then be analyzed using the existing gapfilling and modeling tools (including support for the next generation of modeling tools that the KBase team is developing), eventually resulting in an ensemble of Flux Balance Analysis solutions that reflects the uncertainty in the underlying enzyme annotations. The team will develop a set of analysis tools to study this ensemble of solutions, using clustering, averaging, analysis of alternative pathway solutions, etc. This will result not only in higher-quality metabolic network reconstructions, but also in much greater insight into the sources of uncertainty in the network, enabling prioritization of how to most efficiently reduce that uncertainty through additional manual curation or experimental data.
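The sampling step can be sketched as independent draws per reaction. This is a simplified illustration under stated assumptions, not the Medusa algorithm or the KBase implementation: each reaction is included in a draft model with its estimated probability, independently of the others, and gapfilling and Flux Balance Analysis are left out. The reaction identifiers and function names are hypothetical.

```python
import random

# Simplified ensemble sampling sketch (assumption: reactions are sampled
# independently; gapfilling and FBA of each draft model are not shown).

def sample_ensemble(reaction_probs, n_models=100, seed=0):
    """Draw n_models reaction sets from per-reaction probabilities."""
    rng = random.Random(seed)  # fixed seed for reproducible ensembles
    ensemble = []
    for _ in range(n_models):
        # Include each reaction with its own probability.
        model = {rxn for rxn, p in reaction_probs.items() if rng.random() < p}
        ensemble.append(model)
    return ensemble


def inclusion_frequency(ensemble):
    """Fraction of ensemble members that contain each sampled reaction."""
    counts = {}
    for model in ensemble:
        for rxn in model:
            counts[rxn] = counts.get(rxn, 0) + 1
    return {rxn: n / len(ensemble) for rxn, n in counts.items()}


# Example usage with made-up reaction identifiers and probabilities.
probs = {"rxn_a": 0.95, "rxn_b": 0.50, "rxn_c": 0.05}
ensemble = sample_ensemble(probs, n_models=200, seed=42)
freqs = inclusion_frequency(ensemble)
```

Reactions whose inclusion frequency sits near 0.5 are the ones whose presence differs most across the ensemble, which makes them natural first candidates when prioritizing manual curation.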

This work will provide the SFA and other KBase users a principled way to weigh annotation sources against each other, enable better metabolic modeling of hard-to-annotate organisms and pathways, allow analysis of uncertainty in the resulting models’ network structure or behavior, and provide an infrastructure on which to build more sophisticated machine learning techniques in KBase.

References

Griesemer, M., et al. 2018. “Combining Multiple Functional Annotation Tools Increases Coverage of Metabolic Annotation,” BMC Genomics 19(1), 948.

Medlock, G. L., T. J. Moutinho, and J. A. Papin. 2020. “Medusa: Software to Build and Analyze Ensembles of Genome-Scale Metabolic Network Reconstructions,” PLoS Computational Biology 16(4), e1007847.

Funding Information

This work was performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and supported by the Genomic Science program of the Biological and Environmental Research (BER) Program under the LLNL μBiospheres SFA, FWP SCW1039.