KBase Science and Infrastructure Updates

Authors:

Benjamin Allen³, Jason Baumohl¹, Kathleen Beilsmith², David Dakota Blair⁴, John-Marc Chandonia¹, Dylan Chivian¹, Zachary Crockett³, Ellen G. Dow¹, Meghan Drake³, Janaka N. Edirisinghe², José P. Faria², Jason Fillman¹, Tianhao Gu², A. J. Ireland¹, Marcin P. Joachimiak¹, Sean Jungbluth¹, Roy Kamimura¹, Keith Keller¹, Dan Klos², Miriam Land³, Filipe Liu², Erik Pearson¹, Gavin Price¹, Priya Ranjan³, William Riehl¹, Boris Sadkhin², Samuel Seaver², Alan Seleman², Gwyneth Terry¹, Sumin Wang¹, Pamela Weisenhorn², Ziming Yang⁴, Shinjae Yoo⁴, Qizhi Zhang²; Shane Canon¹ (scanon@lbl.gov), Paramvir S. Dehal¹, Elisha M. Wood-Charlson¹* (elishawc@lbl.gov), Robert Cottingham³, Chris Henry², and Adam P. Arkin¹

Institutions:

¹Lawrence Berkeley National Laboratory; ²Argonne National Laboratory; ³Oak Ridge National Laboratory; and ⁴Brookhaven National Laboratory

URLs:

https://www.kbase.us

Goals

The Department of Energy Systems Biology Knowledgebase (KBase) is a knowledge creation and discovery environment designed for both biologists and bioinformaticians. KBase integrates a large variety of data and analysis tools, from DOE and other public services, into an easy-to-use platform that leverages scalable computing infrastructure to perform sophisticated systems biology analyses. KBase is a publicly available and developer extensible platform that enables scientists to analyze their own data within the context of public data and share their findings across the system.

Abstract

Science Updates

Science Focus Areas (SFAs) and university collaborators have been testing and releasing new data and functionality in KBase, especially around improving genome quality, functional prediction of microbial communities, and making data and tools accessible to everyone.

The Ecosystems and Networks Integrated with Genes and Molecular Assemblies (ENIGMA) SFA is integrating long-read sequencing and isolate polishing tools into KBase and is collaborating with KBase to host a training workshop on laboratory and bioinformatics methods that support the generation of high-quality isolate genomes. A second major area of collaborative development is model-driven phenotype prediction and mechanistic analysis within KBase. This pipeline starts with a wide range of new tools designed to predict potential functions from protein sequence. New genome annotation pipelines, Distilling and Refining Annotation of Metabolism (DRAM) and Snekmer, expand annotation to new specialty areas of metabolism and offer alternative function hypotheses for difficult-to-annotate genes. Improvements to the KBase infrastructure, developed with the Systems Biology Approach to Interactions and Resource Allocation in Bioenergy-Relevant Microbial Communities SFA, support multiple alternative theoretical annotations for genes. Growth phenotype data offer a means of discerning which of these alternative annotations is correct, but such data are unavailable for many genomes. The KBase Knowledge Engine (KE) team addressed this gap by developing machine-learning–based tools to predict phenotypes from genome annotations. Researchers can then determine which combinations of gene annotations lead to the best agreement with the predicted phenotypes. The Phenotypic Response of the Soil Microbiome to Environmental Perturbations SFA developed an algorithm for automatically fitting metabolic models to predicted (and observed) phenotype data. Proposed annotations can be validated further using protein structure–based evidence, which is now supported by a collection of KBase tools that import protein structure data for KBase proteins from the Research Collaboratory for Structural Bioinformatics Protein Data Bank.
Finally, all of these model-driven workflows are enhanced by significant improvements to the ModelSEED metabolic model reconstruction and analysis tools in KBase. Together, these tools interoperate to offer a greatly enhanced understanding of genome metabolism, with a more accurate and quantitative representation of energy metabolism, improving phenotype prediction accuracy from an average of 56% for draft models to 72%.
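
The idea of choosing among alternative annotations by their agreement with growth phenotypes can be illustrated with a minimal sketch. This is not the KBase or ModelSEED implementation. The function names, the dictionary-based data layout, and the simple match-fraction accuracy score are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical layout: each mapping goes from a growth condition name
# to True (growth) or False (no growth).
Phenotypes = Dict[str, bool]


def phenotype_accuracy(predicted: Phenotypes, observed: Phenotypes) -> float:
    """Fraction of shared conditions where predicted growth matches observed growth."""
    shared = predicted.keys() & observed.keys()
    if not shared:
        raise ValueError("no growth conditions in common")
    matches = sum(predicted[c] == observed[c] for c in shared)
    return matches / len(shared)


def best_annotation_set(candidates: Dict[str, Phenotypes], observed: Phenotypes) -> str:
    """Return the name of the candidate whose predictions agree best with the data.

    `candidates` maps a hypothetical annotation-set label to the phenotypes
    that a model built from that annotation set would predict.
    """
    return max(candidates, key=lambda name: phenotype_accuracy(candidates[name], observed))


if __name__ == "__main__":
    observed = {"glucose": True, "acetate": True, "xylose": False}
    candidates = {
        "annotation_A": {"glucose": True, "acetate": False, "xylose": False},
        "annotation_B": {"glucose": True, "acetate": True, "xylose": False},
    }
    print(best_annotation_set(candidates, observed))
```

In practice the reference phenotypes may themselves be machine-learning predictions rather than measurements, as described above, so a score of this kind reflects agreement with the reference rather than ground truth.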

This pipeline is being applied to a growing collection of high-quality datasets loaded into KBase from collaborators. For example, the Genome Resolved Open Watershed (GROW) project contains 178 metagenomes, 50 metatranscriptomes, and 2,093 metagenome-assembled genomes that are available and linked to rich sample metadata in KBase. Another example is the Plant-Microbe Interfaces SFA, which is working on KBase apps to simplify the isolate selection process for constructed community experiments, adding >550 isolate genomes to KBase, and supporting integration of high-value datasets (Biolog data, BacDive database).

Infrastructure Updates

KBase infrastructure has also improved over the past year. Bulk upload support continues to improve and now enables upload of large datasets using a spreadsheet to specify object names, filenames, and metadata.

A central goal of KBase is to put users’ data in the context of all data in the platform, allowing users to quickly find and prioritize all relevant data. To further this goal, the project is introducing “Collections,” which are high-quality curated datasets paired with an interface for rapidly matching and subselecting datasets based on their relationship to a user’s data. These collections will initially focus on high-quality, highly relevant datasets, especially from the DOE community, and will let users see relationships between their data and these collections to enable new insights and drive further analysis.

KBase is also working to ensure community contributions are tracked and credited appropriately. KBase assigns Digital Object Identifiers (DOIs) for Narrative workflows that have been made static and documented for publication, which connects KBase research products to the broader publishing infrastructure. KBase will soon enable researchers to have their KBase DOIs visible as part of their individual ORCID record, and KBase will soon be able to report data reuse numbers to DataCite for all Narratives that have received a DOI for publication.

Finally, in a co-development effort with the DOE Joint Genome Institute, KBase and Integrated Microbial Genomes (IMG) have generated a mapping of non-redundant protein sequences to UniRef100 clusters, which enables IMG users to identify and link directly to identical sequences in KBase. These platforms continue to work together to improve data connections, which will guarantee that data ownership and embargo periods are honored, even after data has been transferred between platforms.
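
Linking identical protein sequences across two platforms can be sketched by indexing each sequence under a digest of its normalized residues. This is an illustrative assumption, not the method used by KBase or IMG. The function names, the use of MD5 as a lookup key, and the input dictionaries are hypothetical, and the sketch handles exact identity only rather than actual UniRef100 cluster assignment.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def sequence_key(seq: str) -> str:
    """Digest of a protein sequence after simple normalization.

    Normalization (an assumption): strip whitespace, drop a trailing stop
    symbol '*', and uppercase. MD5 is used only as a compact lookup key.
    """
    normalized = seq.strip().rstrip("*").upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def link_identical(source: Dict[str, str], target: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each source protein ID to the target IDs with an identical sequence."""
    index: Dict[str, List[str]] = defaultdict(list)
    for target_id, seq in target.items():
        index[sequence_key(seq)].append(target_id)

    links: Dict[str, List[str]] = {}
    for source_id, seq in source.items():
        key = sequence_key(seq)
        if key in index:  # membership test does not create new entries
            links[source_id] = sorted(index[key])
    return links


if __name__ == "__main__":
    # Hypothetical IDs and toy sequences, for illustration only.
    img_proteins = {"img_1": "MKTAYIAK*", "img_2": "MSSQQ"}
    kbase_proteins = {"kb_a": "mktayiak", "kb_b": "MKTAYIAK", "kb_c": "MAAAA"}
    print(link_identical(img_proteins, kbase_proteins))
```

A production mapping would also need to account for sequence fragments, ambiguous residues, and access controls on embargoed data, none of which this sketch addresses.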

Funding Information

This work is supported as part of the Biological and Environmental Research (BER) Program’s Genomic Science program. The DOE Systems Biology Knowledgebase (KBase) is funded by the U.S. Department of Energy, Office of Science, BER Program under Award Numbers DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886.