The KBase Knowledge Engine: Ecosystem Classification Prototype

Authors:

Paramvir S. Dehal¹* (psdehal@lbl.gov), Marcin P. Joachimiak¹, Ziming Yang⁴, William J. Riehl¹, Sean Jungbluth¹, Meghan Drake³, Shane Canon¹, Dylan Chivian¹, Filipe Lui², Christopher Neely¹, Priya Ranjan³, Shinjae Yoo⁴, Benjamin Allen³, Jason Baumohl¹, Kathleen Beilsmith², David Dakota Blair⁴, John-Marc Chandonia¹, Zachary Crockett³, Ellen G. Dow¹, Janaka N. Edirisinghe², José P. Faria², Jason Fillman¹, Tianhao Gu², A. J. Ireland¹, Roy Kamimura¹, Keith Keller¹, Dan Klos², Miriam Land³, Erik Pearson¹, Gavin Price¹, Boris Sadkhin², Samuel Seaver², Alan Seleman², Gwyneth Terry¹, Sumin Wang¹, Pamela Weisenhorn², Qizhi Zhang², Elisha M. Wood-Charlson¹, Robert Cottingham³, Chris Henry², and Adam P. Arkin¹

Institutions:

¹Lawrence Berkeley National Laboratory; ²Argonne National Laboratory; ³Oak Ridge National Laboratory; and ⁴Brookhaven National Laboratory

URLs:

https://kbase.us

Goals

One of the primary goals of the Department of Energy Systems Biology Knowledgebase (KBase) is the generation and application of biological knowledge from analytical results. To that end, the KBase Knowledge Engine (KE) will leverage existing and novel machine learning and bioinformatics tools to build such knowledge from the growing body of publicly available results of analyses performed in KBase. Ultimately, the project seeks to predict key taxa, functions, and ecosystem features, along with their interactions. To accomplish this, researchers begin by developing (1) classifiers for identification of key determinants of ecosystems; (2) phenotype and trait predictors; and (3) robust pangenomes and their relationships across the microbial tree of life.

Abstract

Microbial life is a critical component of Earth’s ecosystems, and the taxonomic and functional information from environmental genomics can provide insights into microbial roles in the environment. However, comparing these data across metagenomes can be challenging; furthermore, abundance differences may not reflect important functional differences between environments. As a first KBase KE prototype, researchers aimed to use machine learning to: (1) build and evaluate robust ecosystem classification models using standardized data from ~32,000 metagenomes; and (2) identify important classification features and how they relate to the understanding of environments on Earth.

Relying on standardized data from the European Bioinformatics Institute MGnify resource, researchers constructed feature tables associating metagenome sample environment labels with Gene Ontology (GO) term abundance, InterPro (IPR) domain, and predicted taxonomy profiles. A series of reusable data preparation and cleaning techniques produced input data suitable for reliable model training. Hyperparameter tuning was performed on the top multiclass classification methods, and model performance was assessed with cross-validation. Using a permutation analysis to extract feature importance from the top models, researchers obtained the features most important for classification and used them to construct trees and networks relating different ecosystems. Using relationships from the environmental classification as well as sample ecosystem outliers, the team interpreted model errors, including misclassifications as hypernyms and hyponyms, and was able to account for most of them. This suggests future improvements through better incorporation of classification semantics into model training. Researchers also identified a series of model predictions that directly suggest sample relabeling, for example by providing more specific terms for samples labeled as Environmental: Aquatic: Marine. The results provide a high-performance metagenome ecosystem classification model and enable model interpretability for learning important ecosystem indicator functions as well as ecosystem and function relationships.
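Two of the steps described above, permutation-based feature importance and the reading of misclassifications as hypernyms or hyponyms, can be illustrated with a minimal Python sketch. This is not the project's code: the abstract does not name the models or software used, so the toy threshold classifier, the synthetic rows, the function names (`permutation_importance`, `error_relation`), and the "Coastal" sub-label are all assumptions made for illustration. The only detail taken from the text is the colon-separated label form, as in Environmental: Aquatic: Marine.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the given label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def error_relation(true_label, predicted_label):
    """Relate a prediction to the true label in a colon-separated hierarchy."""
    true_path = [part.strip() for part in true_label.split(":")]
    pred_path = [part.strip() for part in predicted_label.split(":")]
    if pred_path == true_path:
        return "exact"
    if len(pred_path) < len(true_path) and true_path[:len(pred_path)] == pred_path:
        return "hypernym"  # prediction is a broader ancestor term
    if len(pred_path) > len(true_path) and pred_path[:len(true_path)] == true_path:
        return "hyponym"  # prediction is a more specific descendant term
    return "other"


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
ROWS = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.6, 0.9],
        [0.1, 0.2], [0.2, 0.8], [0.3, 0.4], [0.4, 0.6]]
LABELS = ["Marine"] * 4 + ["Soil"] * 4


def toy_predict(row):
    """Stand-in classifier (an assumption): thresholds feature 0 only."""
    return "Marine" if row[0] > 0.5 else "Soil"


if __name__ == "__main__":
    print(permutation_importance(toy_predict, ROWS, LABELS))
    print(error_relation("Environmental: Aquatic: Marine",
                         "Environmental: Aquatic: Marine: Coastal"))
```

In this toy setting the noise feature receives an importance of zero because the stand-in classifier never reads it, while shuffling the informative feature lowers accuracy. A prediction categorized as a hyponym of the recorded label corresponds to the relabeling case described above, where the model proposes a more specific term than the one attached to the sample.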

Funding Information

This work is supported as part of the Biological and Environmental Research (BER) Program’s Genomic Science program. The DOE Systems Biology Knowledgebase (KBase) is funded by the U.S. Department of Energy, Office of Science, BER Program under Award Numbers DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886.