The KBase Knowledge Engine: Ecosystem Classification Prototype
Paramvir S. Dehal1* (firstname.lastname@example.org), Marcin P. Joachimiak1, Ziming Yang4, William J. Riehl1, Sean Jungbluth1, Meghan Drake3, Shane Canon1, Dylan Chivian1, Filipe Lui2, Christopher Neely1, Priya Ranjan3, Shinjae Yoo4, Benjamin Allen3, Jason Baumohl1, Kathleen Beilsmith2, David Dakota Blair4, John-Marc Chandonia1, Zachary Crockett3, Ellen G. Dow1, Janaka N. Edirisinghe2, José P. Faria2, Jason Fillman1, Tianhao Gu2, A. J. Ireland1, Roy Kamimura1, Keith Keller1, Dan Klos2, Miriam Land3, Erik Pearson1, Gavin Price1, Boris Sadkhin2, Samuel Seaver2, Alan Seleman2, Gwyneth Terry1, Sumin Wang1, Pamela Weisenhorn2, Qizhi Zhang2, Elisha M. Wood-Charlson1, Robert Cottingham3, Chris Henry2, and Adam P. Arkin1
1Lawrence Berkeley National Laboratory; 2Argonne National Laboratory; 3Oak Ridge National Laboratory; and 4Brookhaven National Laboratory
One of the primary goals of the Department of Energy Systems Biology Knowledgebase (KBase) is the generation and application of biological knowledge from analytical results. To that end, the KBase Knowledge Engine (KE) will leverage existing and novel machine learning and bioinformatics tools to build such knowledge from the growing body of publicly available results of analyses performed in KBase. Ultimately, the project seeks to predict key taxa, functions, and ecosystem features, as well as their interactions. To accomplish this, researchers are beginning by developing (1) classifiers for identification of key determinants of ecosystems; (2) phenotype and trait predictors; and (3) robust pangenomes and their relationships across the microbial tree of life.
Microbial life is a critical component of Earth’s ecosystems, and the taxonomic and functional information from environmental genomics can provide insights into microbial roles in the environment. However, comparing these data across metagenomes can be challenging, and abundance differences may not reflect important functional differences between environments. As a first KBase KE prototype, researchers aimed to use machine learning to: (1) build and evaluate robust ecosystem classification models using standardized data from ~32,000 metagenomes; and (2) identify important classification features and how they relate to the understanding of environments on Earth.
Relying on standardized data from the European Bioinformatics Institute MGnify resource, researchers constructed feature tables associating metagenome sample environment labels with Gene Ontology (GO) term abundance, InterPro (IPR) domain, and predicted taxonomy profiles. Through a series of reusable data preparation and cleaning techniques, input data were generated for reliable model training. Hyperparameter tuning was performed on the top multiclass classification methods, and model performance was assessed with cross-validation. Using a permutation analysis to extract feature importance from the top models, researchers obtained the features important for classification and used them to construct trees and networks relating different ecosystems. Using relationships from the environmental classification as well as sample ecosystem outliers, the team interpreted model errors, including misclassifications as hypernyms and hyponyms, and was able to account for most of them. This suggests future improvements through better incorporation of classification semantics into model training. Researchers also identified a series of model predictions that directly suggest sample relabeling, for example by providing more specific terms for samples labeled as Environmental: Aquatic: Marine. The results provide a high-performance metagenome ecosystem classification model and enable model interpretability for learning important ecosystem indicator functions as well as ecosystem and function relationships.
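One way to picture the interpretation of misclassifications as hypernyms and hyponyms is the minimal Python sketch below. It is an illustration only, not the authors' implementation. It assumes that environment labels are colon-delimited lineages (as in Environmental: Aquatic: Marine) and that a prediction is a hypernym when it is a strict ancestor of the true label and a hyponym when it is a strict descendant. The function names `split_label` and `classify_error`, and the example label "Oceanic", are hypothetical.

```python
def split_label(label):
    """Split a colon-delimited lineage into a tuple of terms.

    Assumption: labels look like 'Environmental: Aquatic: Marine'.
    """
    return tuple(part.strip() for part in label.split(":") if part.strip())


def classify_error(true_label, predicted_label):
    """Categorize a prediction relative to the true label's lineage.

    Returns one of 'correct', 'hypernym', 'hyponym', or 'unrelated'.
    """
    true_path = split_label(true_label)
    pred_path = split_label(predicted_label)
    if true_path == pred_path:
        return "correct"
    shorter = min(len(true_path), len(pred_path))
    if true_path[:shorter] == pred_path[:shorter]:
        # One lineage is a strict prefix of the other.
        if len(pred_path) < len(true_path):
            return "hypernym"  # prediction is a broader, ancestor term
        return "hyponym"  # prediction is a more specific, descendant term
    return "unrelated"  # lineages diverge at some level


if __name__ == "__main__":
    # Hypothetical (true, predicted) pairs for illustration.
    pairs = [
        ("Environmental: Aquatic: Marine", "Environmental: Aquatic"),
        ("Environmental: Aquatic: Marine", "Environmental: Aquatic: Marine: Oceanic"),
        ("Environmental: Aquatic: Marine", "Environmental: Terrestrial: Soil"),
    ]
    for true_label, predicted_label in pairs:
        print(true_label, "->", predicted_label, ":",
              classify_error(true_label, predicted_label))
```

Under these assumptions, a hyponym prediction for a sample with a broad label is a candidate for relabeling with a more specific term, while an unrelated prediction is a genuine model error or a possible sample outlier.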
This work is supported as part of the Biological and Environmental Research (BER) Program’s Genomic Science program. The DOE Systems Biology Knowledgebase (KBase) is funded by the U.S. Department of Energy, Office of Science, BER Program under Award Numbers DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886.