Integration of Enzyme Function Initiative Tools in the KBase Platform

Authors:

Nils Oberg², Douglas Mitchell², John Gerlt², and Christopher Henry¹* (chenry@anl.gov)

Institutions:

¹Argonne National Laboratory; and ²University of Illinois Urbana–Champaign

Goals

Protein families of unknown function are a significant challenge facing the DOE BER research community because they prevent comprehensive metabolic reconstructions of both individual microorganisms and microbiome systems. While many tools in KBase and elsewhere permit the discovery of completely new protein families, very few tools exist, particularly in KBase, to study the function of these families. The Enzyme Function Initiative (EFI; http://enzymefunction.org) offers a suite of tools specifically designed to address this problem. In this project, researchers are working to fully integrate the EFI toolset into KBase, with complete ties to DOE BER sequencing sources, including all sequence data in KBase as well as the JGI IMG Database.

Abstract

Advances in computational methods and DNA sequencing now allow single projects to generate tens to hundreds of metagenome sequences and potentially tens of thousands of isolate or metagenomically assembled genomes (MAGs) from diverse ecosystems. In theory, computational inference of the protein products encoded by these genomes, and of the associated biochemical functions, should allow for the accurate prediction and modeling of key microbial traits, organismal interactions, and ecosystem processes that drive biogeochemical cycles. Unfortunately, the rate of generation of metagenomes, isolate genomes, and MAGs, along with related multiomic datasets, currently far outpaces the ability to translate these genome-enabled findings into ecosystem-informed predictive knowledge.

One of the most significant challenges currently inhibiting the understanding of complex biological systems from genomic and multiomic data is the staggering number of proteins that have completely unknown functions. About 50% of the proteins encoded by the genes in complete microbial genomes, and an even higher proportion of those encoded by microbial genes from environmental samples, cannot be reliably assigned a function. These unknown functions translate into large gaps in metabolic reconstructions, prevent researchers from explaining more than 25% of most metabolomes, and obscure the functional interdependencies that guide the structure of all microbiome systems. Despite these challenges, virtually all of the functional annotation tools currently available in KBase and other platforms focus largely on assigning functions to proteins that are very similar to other proteins of known function (e.g., via propagation of function based on close sequence homology). Because the sequence boundaries between functions cannot be specified in the absence of orthogonal information, homology-based annotations are often incorrect. Tools are needed that integrate multiple sources of evidence to decode the functions of uncharacterized protein families and to understand the limits of annotation propagation.

The Enzyme Function Initiative (EFI) toolkit is designed to fill this exact niche in protein function discovery. The EFI tool pipeline comprises three analysis steps: (1) generation of sequence similarity networks (SSNs), enabling the semi-automated reconstruction of high-quality protein families built around any protein sequence of interest (EFI-EST; https://efi.igb.illinois.edu/efi-est/); (2) parallel exploration of the genome neighborhood context of a protein family across a diverse set of input genomes to discover potential functionally linked gene products/enzymes that can be used to infer novel enzymatic functions and metabolic pathways (EFI-GNT; https://efi.igb.illinois.edu/efi-gnt/); and (3) determination of the metagenome abundance of clusters in the SSNs for protein families using chemically guided functional profiling (CGFP) to discover the physiological/environmental context in which the proteins are expressed (EFI-CGFP; https://efi.igb.illinois.edu/efi-cgfp/). The EFI tools also provide links to structure data in the Protein Data Bank (PDB) to gain further clues about protein function.
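To make the idea behind step (1) concrete, the following minimal Python sketch shows how a sequence similarity network can be represented and split into clusters. It is an illustration only and not the EFI-EST implementation: it assumes that pairwise alignment scores are already available, and the function names (build_ssn, clusters), the example identifiers, the scores, and the thresholds are all hypothetical. Sequences are nodes, an edge is kept when the pairwise score meets a chosen threshold, and each connected component is treated as a candidate protein family cluster.

```python
from collections import defaultdict


def build_ssn(pairwise_scores, min_score):
    """Build an undirected network from (seq_a, seq_b, score) tuples.

    An edge is kept only when score >= min_score. Every sequence that
    appears in the input stays in the network, even without edges.
    """
    network = defaultdict(set)
    for seq_a, seq_b, score in pairwise_scores:
        # Touch both keys so unconnected sequences remain as singletons.
        network[seq_a]
        network[seq_b]
        if seq_a != seq_b and score >= min_score:
            network[seq_a].add(seq_b)
            network[seq_b].add(seq_a)
    return dict(network)


def clusters(network):
    """Return connected components as sets, largest first."""
    seen = set()
    result = []
    for start in network:
        if start in seen:
            continue
        # Depth-first traversal collecting one component.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in network[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(component)
    # Sort by size (descending), then by member names for a stable order.
    return sorted(result, key=lambda c: (-len(c), sorted(c)))


# Hypothetical pairwise scores, for illustration only.
EXAMPLE_SCORES = [
    ("P1", "P2", 120.0),
    ("P2", "P3", 95.0),
    ("P3", "P4", 20.0),
    ("P4", "P5", 80.0),
    ("P1", "P5", 5.0),
]

loose = clusters(build_ssn(EXAMPLE_SCORES, min_score=50))
strict = clusters(build_ssn(EXAMPLE_SCORES, min_score=100))
```

In this toy example, raising the threshold from 50 to 100 breaks the network into smaller clusters. This mirrors the general trade-off in choosing a similarity cutoff: a permissive threshold merges proteins that may differ in function, while a strict one splits a family into many fragments.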

Researchers are deploying the EFI tools into the KBase platform, with re-engineering on the backend to permit seamless integration with the KBase database of isolate, reference genome, metagenome, and MAG sequences. This will increase the value of the EFI tools to the DOE BER research community; expand the ability of these tools to access more diverse sequence data and annotation sources (e.g., IMG); significantly ease the long-term maintenance of the EFI tools by linking to the KBase data update cycle; and greatly enhance the capacity for users of the KBase platform to study protein families of unknown function.

Funding Information

Argonne National Laboratory is managed by UChicago Argonne, LLC for the U.S. Department of Energy under contract no. DE-AC02-06CH11357. This program is supported by the U.S. Department of Energy, Office of Science, through the Genomic Science program, Office of Biological and Environmental Research, under FWP PRJ39217.