The Genomic Science program (GSP) of the Office of Biological and Environmental Research (BER), within the U.S. Department of Energy Office of Science, supports systems biology research on microbes, plants, plant-microbe interactions, and environmental microbial communities. Understanding and harnessing the metabolic and regulatory networks of plants and microbes will enable their design and re-engineering for improved energy resilience and sustainability, including advanced biofuels and bioproducts.
The widespread adoption of high-throughput, multiomic techniques has revolutionized biological research, providing a broader view and deeper understanding of cellular processes and the biological systems they drive. In pursuit of predictive modeling and genome-scale engineering of complex biological systems important for bioenergy, GSP-supported research generates vast amounts of complex omics data from a wide range of analytical technologies and experimental approaches. These data span many spatiotemporal scales, reflecting the organizational complexities of biological systems, and present significant computational challenges for identifying causal variants that influence phenotype. Accurate modeling of the underlying systems biology depends on surmounting those challenges.
To construct coherent knowledge of the systems underpinning and governing the diverse phenomics and functioning of environmental and host-associated microbial communities, detailed characterizations of community genomic, transcriptomic, proteomic, and other systems processes are essential across a variety of samples and conditions. Such characterizations necessitate the ability to combine large, disparate datasets of heterogeneous types from multiple sources, integrated over time and space, and to represent emergent relationships in a coherent framework.
The breadth of plant and microbial community datasets and the complexities in the integration of different data layers present enormous challenges. Innovative approaches are needed to work effectively with, and glean useful insights from, complex, integrated molecular omics data. Computational simulation and rigorous hypothesis testing depend on the ability to incorporate multiple experimental and environmental conditions and associated metadata. Currently, the generation rate of multiomic datasets greatly exceeds the ability to analyze, integrate, and interpret them.
In fiscal year 2020, BER solicited applications proposing innovations in data integration approaches and new software frameworks for the management and analysis of large-scale, multimodal, and multiscale data. The program sought novel computational tools that will lead to scalable solutions for omics analysis, data mining, and knowledge extraction from complex datasets (experimental and calculated). Also sought were capabilities that are interoperable and effective for computationally intensive data processing and analyses for directing systems-level investigations. To aid the interpretation of multimodal data for environmental sciences, BER encouraged research focused on the enhancement of existing software or approaches already broadly used by the genomics community. Applications were requested on research topics focused on the development of novel computational, bioinformatic, statistical, algorithmic, or analytical approaches, toolkits, and software. These capabilities include:
Inferring Gene Function. [Courtesy Georgia Institute of Technology]
The ability to predict the function of a protein-coding gene from its sequence is a grand challenge in biology. The goal of this project is to create a computational infrastructure to infer gene function from gene sequence using informatics, multiscale simulation based on high-performance computing (HPC), and machine-learning pipelines. Accurate gene annotation using computational methods will facilitate genomic science breakthroughs essential to understanding and harnessing life processes in bacteria, fungi, and plants. The incorporation of information from state-of-the-art structural modeling and simulation methods, together with evolutionary analysis and systems biology databases, will make this possible. This synergy, paired with HPC-enabled bioinformatics and machine learning, will provide vastly more powerful methodologies, tools, and results in gene annotation than previously existed. This project’s success will advance one of BER’s primary missions, translating nature’s genetic code into predictive models of biological function, and will be facilitated by HPC resources provided by DOE leadership computing facilities.
Improving Predictions of Microbial Community Function. [Courtesy University of Colorado Denver]
Advances in DNA sequencing and associated genome-enabled, high-throughput technologies have made the assembly of microbial genomes and partial genomes recovered from the environment routine. In theory, computational inference of the protein products encoded by these genomes and the associated biochemical functions should allow for the accurate prediction and modeling of key microbial traits, organismal interactions, and ecosystem processes that drive biogeochemical cycles. In practice, however, a lack of scalable computational annotation tools means these outcomes are rarely achieved without expert manual curation, which scales extremely poorly. Scalable annotation should be informed directly by the time-consuming manual curation protocol of protein and metabolic annotations that researchers often enlist to understand and model microbial community metabolism.
The objective of this project is to innovate upon and scale up expert curation approaches to create a probabilistic annotation framework in the DOE Systems Biology Knowledgebase (KBase). This framework will enable easy-to-use and metabolism-centric protein annotation. Microbial proteins and metabolites function in the context of complex systems of inter- and intraorganismal interactions. To infer, understand, and model microbial and ecosystem traits and processes of relevance to biogeochemical cycling, protein and metabolite function need to be encoded, inferred, and studied in the context of this interconnected and dynamic network of interactions. Current approaches ignore these interactions and rely almost exclusively on limited sequence homology methods to infer protein function.
Finally, functional annotations are dynamic and can—and should—evolve when new evidence arises. Thus, annotations should be curatable, versioned, and probabilistic. Dissemination of new methods in KBase will make these annotations widely available to the scientific community. This research will ultimately enable prediction of microbial phenotypic traits and interorganism interactions in the context of community metabolic models for a wide variety of ecosystems important to biogeochemical cycling and bioenergy.
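As a minimal sketch of what a curatable, versioned, and probabilistic annotation could look like as a data structure, the Python example below keeps every revision of a protein's annotation together with a confidence value and the evidence behind it. All names here (`AnnotationVersion`, `ProteinAnnotation`, `revise`, the protein identifier, and the example functions and probabilities) are hypothetical illustrations, not the KBase data model or the project's actual design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AnnotationVersion:
    """One immutable revision of a functional annotation (hypothetical schema)."""
    version: int
    function: str
    probability: float  # confidence that the assigned function is correct
    evidence: str       # free-text note on what supports this revision


@dataclass
class ProteinAnnotation:
    """Annotation history for one protein; earlier versions are never overwritten."""
    protein_id: str
    history: List[AnnotationVersion] = field(default_factory=list)

    def revise(self, function: str, probability: float, evidence: str) -> AnnotationVersion:
        """Append a new version when new evidence arises."""
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        new_version = AnnotationVersion(len(self.history) + 1, function, probability, evidence)
        self.history.append(new_version)
        return new_version

    @property
    def current(self) -> AnnotationVersion:
        """Most recent version of the annotation."""
        return self.history[-1]


# Illustrative values only: a homology-based call later refined by curation.
annotation = ProteinAnnotation("protein_0001")
annotation.revise("putative hydrolase", 0.55, "sequence homology")
annotation.revise("putative hydrolase", 0.80, "manual curation of metabolic context")
print(annotation.current.version, annotation.current.probability)
```

Keeping each version immutable and appending rather than overwriting is one simple way to make the revision history auditable; how probabilities would actually be assigned or updated is left open here.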
Key outcomes will be integrated within a new “Annotation Collective” in KBase for robust protein function and metabolism inference. This collective will include:
Identifying Signaling Small Molecules. [Courtesy Carnegie Mellon University]
Microbial communities are regulated through the interactions between their microbial members. Most signal transduction pathways in the microbiome are known to be modulated through small-molecule products of microbial biosynthetic gene clusters (BGCs). Advances in 16S and shotgun metagenomics have revolutionized understanding of the microbial composition of various communities and their BGCs. Preliminary results from this and other research show that environmental metagenomes contain thousands of BGCs with uncharacterized small-molecule products that potentially play roles in signal transduction.
The aim of this project is to develop computational techniques for discovering these small molecules and characterizing their role in signal transduction. Recent analysis of tens of thousands of public isolate genomes and metagenomes has identified over 330,000 BGCs, which are included in the Integrated Microbial Genomes Atlas of Biosynthetic Gene Clusters (IMG-ABC). However, connecting these BGCs to their molecular products has not kept pace with the speed of microbial genome sequencing (fewer than 1% of the BGCs from IMG-ABC are connected to their molecular products). Discovering the chemical structure of these BGC products is the first step toward characterizing their activity. Moreover, many of these products might have novel chemistry or modifications, shedding light on the functionality of biosynthetic enzymes. The overarching goal of this project is to develop a high-throughput platform for determining the molecular products of BGCs in IMG-ABC using mass spectral data. The expected outcome is a catalog of microbial small molecules that play roles in signaling in plant-associated microbial communities, along with their BGCs.
The research team recently developed computational techniques, including Dereplicator and Dereplicator+, for the identification of known small molecules and their variants from tandem mass spectra. Searching billions of mass spectra from publicly available datasets, such as those in the Global Natural Products Social Molecular Networking (GNPS) infrastructure, has revealed thousands of known small molecules and their novel variants. However, the majority of GNPS spectra remain unannotated. As a step forward, the research team hypothesizes that many of these unannotated spectra are the product of BGCs in microbial genomes and metagenomes. Building upon the team’s previous work developing new computational tools for discovering natural products from mass spectral and genomic data, the immediate plan of this project is to develop new algorithms to elucidate the structure of novel signaling peptide natural products (PNPs) from microbial communities and to construct a catalog of signaling PNPs and their BGCs. This will be achieved by:
While the research team’s computational methods are designed for general discovery of novel PNPs, special emphasis will be placed on the discovery of signaling PNPs from plant-associated microbes. All the data, software, and results developed during this project will be available through the GNPS infrastructure.
Proposed Workflow for MetaboTandem Analysis Software Toolkit. [Courtesy University of Arizona and DOE EMSL]
During the past decade, advances in different omics technologies such as metagenomics, metatranscriptomics, metaproteomics, and metabolomics have revolutionized biological research by enabling high-throughput monitoring of biological processes at the molecular and organismal level and their responses to environmental perturbation. Metabolomics is a newer and fast-emerging technology in systems biology that aims to profile small compounds within a biological system that are often end products of complex biochemical cascades. Such compounds are the link from genome, transcriptome, and proteome to the phenotype. Thus, metabolomics provides a key tool in the discovery of the genetic basis of metabolic variation.
Despite advancements and increasing accessibility of multi-omics technologies, integration of multi-omics data in analysis pipelines remains a challenge, especially in the environmental field. In addition, there are still many associated bottlenecks to overcome in metabolomics before measurements will be considered robust. The overarching objective of this project is to optimize the analysis of complex and heterogeneous biological and environmental datasets by developing a user-friendly, open-source metabolomics data analysis pipeline that can be integrated with other multi-omics datasets. The toolkit will be written in Python and will incorporate well-established, community-specific software packages. Users can run the software as a stand-alone toolkit or through the DOE Systems Biology Knowledgebase (KBase). A website with a catalog of existing software products and best practices will be established. The website will link to DOE’s Environmental Molecular Sciences Laboratory (EMSL) and Joint Genome Institute (JGI), where the experimental data will be housed. The project’s large-scale multi-omics data integration approach is highly relevant to KBase’s mission of achieving a predictive understanding of the role of compounds in diverse biological and environmental systems and will allow the scientific community to improve biological and metabolic genome-based predictive models by integrating “true” metabolic evidence. This research will further promote a new streamlined, workflow-based approach for metabolomics and multi-omics data integration and interpretation that supports transparent data analysis and reduces the technical expertise required to perform data import and processing.
Improving the Understanding of Microbial Communities in Soil. [Courtesy University of Montana and Pacific Northwest National Laboratory]
Research for this project is motivated by the need to understand soil communities that play a key role in the plant-soil dynamic, with impacts on food- and fuel-crop production. To understand the roles of these microbial communities, it is vital to maximally annotate their genomic and functional capacity, yet the majority of data from newly acquired microbiomes remains unannotated.
This project will focus on the development of a novel method for incorporating nongenomic information into the process of annotating genomic sequence (Aim 1), and two complementary strategies building on recent advances in alignment-based and alignment-free labeling (Aims 2 and 3). In combination, these approaches are expected to substantially increase the completeness of labeling for difficult-to-annotate microbiome datasets. In addition to designing new methods, the research team will develop and release open-source software products that, where appropriate, will be integrated into existing frameworks such as the DOE Systems Biology Knowledgebase (KBase), for maximal benefit to the DOE community, and the European Bioinformatics Institute’s (EMBL-EBI) annotation systems, for broader reach.
Next-Generation Statistical Methods for Soil Microbial Communities. [Courtesy University of Wisconsin, Madison]
Microbial communities are among the main driving forces of biogeochemical processes in the biosphere. In particular, many critical soil processes such as mineral weathering and soil cycling of mineral-sorbed organic matter are governed by mineral-associated microbes. Understanding the composition of microbial communities and the environmental factors that play a role in shaping this composition is crucial to comprehending soil biological processes and predicting microbial responses to environmental changes.
To identify the driving factors in soil biological processes, researchers need robust statistical tools that can connect a set of predictors with a specific phenotype. However, innovations in statistical theory for biochemical and biophysical processes have not matched the increasing complexity of soil data. Existing statistical techniques have four main drawbacks: They (1) perform poorly on high-dimensional, highly sparse data, such as soil metagenomics; (2) ignore spatial correlation structure, which can be a key component in soil-related data; (3) do not provide valid p-values under high-dimensional settings, preventing detection of significant factors driving the phenotype of interest; and (4) tend to focus on abundance matrices to represent microbial compositions. Abundance data matrices are inherently flawed because they do not allow for easy propagation of statistical uncertainty in the data pipeline. For example, sequences are rarely a 100% match in the reference-based Operational Taxonomic Unit (OTU) tables, which is especially troublesome for soil samples due to high microbial diversity and uneven distribution. Moreover, compositional data are constrained to sum to one, which affects how proportions behave in different experimental settings (i.e., changes in proportions in the microbial composition do not necessarily reflect actual biological changes in the interactions).
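The effect of the sum-to-one constraint can be illustrated with a small, self-contained Python sketch. The taxon names and counts below are invented for illustration and are not drawn from the project's data. Only one taxon changes in absolute abundance, yet the proportions of all three taxa shift.

```python
def to_proportions(counts):
    """Convert absolute abundances to relative abundances that sum to one."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances in two samples.
# Only taxon_A changes; taxon_B and taxon_C are held constant.
before = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 200}
after = {"taxon_A": 500, "taxon_B": 100, "taxon_C": 200}

p_before = to_proportions(before)
p_after = to_proportions(after)

for taxon in before:
    print(f"{taxon}: {p_before[taxon]:.3f} -> {p_after[taxon]:.3f}")
# taxon_A: 0.250 -> 0.625
# taxon_B: 0.250 -> 0.125
# taxon_C: 0.500 -> 0.250
```

Here taxon_B and taxon_C appear to halve even though their absolute abundances are unchanged, which is the kind of artifact that makes proportion changes hard to interpret biologically.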
This project’s objective is to pioneer the development of next-generation statistical theory for soil omics data, accompanied by open-source, publicly available software. The research team’s novel statistical methods will overcome existing challenges in standard approaches in three ways:
By harnessing the power of big data through revolutionary new statistical theory in sparse learning and post-selection inference, the team’s work will produce tools that can better understand the drivers in soil biological phenotypes to provide new insights into fundamental biological processes. The deliverables of this project are: