Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

From Bulk Organic Matter Profiling to Specific Metabolite Identification: Improving Metabolomics Data Analysis, Annotation, Interpretation, and Integration


Christian Ayala-Ortiz1*, Sumudu Rajakaruna1, Yuri E. Corilo2, Jordan Rabus2, Dalal Alharthi1, Malak Tfaily1


1University of Arizona; 2Environmental Molecular Sciences Laboratory (EMSL); and Pacific Northwest National Laboratory (PNNL)


Over the past decade, advances in omics technologies such as metagenomics, metatranscriptomics, metaproteomics, and metabolomics have revolutionized biological research by enabling high-throughput monitoring of biological processes at the molecular and organismal levels and of their responses to environmental perturbation. Metabolomics is a newer, fast-emerging technology in systems biology that aims to profile the small compounds within a biological system, which are often the end products of complex biochemical cascades. Despite its importance, metabolomics has long been overshadowed by other omics and has not always been considered a standard tool in environmental and microbiome science. It can nevertheless augment the power of genomics, transcriptomics, and proteomics by providing a functional snapshot of all upstream biological processes, thereby filling in gaps left behind by genomics and proteomics. The overarching goal of this project is to develop user-friendly, open-source tools (MetaboDirect and MetaboTandem) that optimize, streamline, and improve current data analysis pipelines for metabolomics data sets from complex and heterogeneous samples. All tools will be available to other researchers and practitioners, who can replicate or extend the work.


Metabolites constitute the chemical currency of environments, as they are used, transformed, and exchanged by the microorganisms found within a system. Understanding how different perturbations change the metabolome will improve knowledge of the mechanisms used by the microbial communities that drive ecosystem functions. However, fully characterizing the metabolome is challenging because of its complexity and heterogeneity, and often requires multiple approaches with different objectives. One approach is the bulk characterization of metabolites or organic compounds in a system through the use of Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS). While the high resolving power of this instrument allows for molecular-level characterization of organic matter from diverse environments, it generates hundreds of millions of data points that need to be processed and visualized. In response, the comprehensive, open-source, command-line tool MetaboDirect was developed, drawing on years of analytical expertise with diverse sample types. The current published version of MetaboDirect is fully compatible with the output of the molecular formula assignment software Formularity. Efforts are underway to integrate this pipeline with CoreMS, the comprehensive mass spectrometry framework being developed by EMSL at PNNL. Liquid chromatography tandem mass spectrometry (LC-MS/MS) is another approach to metabolome characterization of complex environmental samples and, unlike FT-ICR MS, is commonly used to establish the true identity of a metabolite. However, LC-MS/MS suffers from additional challenges related to data processing and to metabolite annotation and identification.
Researchers are developing MetaboTandem as a package and a Shiny app for the easy and fast analysis, annotation, and visualization of LC-MS/MS data. It places special emphasis on comprehensive metabolite annotation that goes beyond in-house databases to include a combination of publicly available databases and in silico prediction tools.

While MetaboDirect and MetaboTandem were developed, or are being developed, to address two challenges with metabolomics data (Challenge 1: Easy-to-use software; Challenge 2: Effective data visualization), team members acknowledge that the metabolomics field continues to struggle with two additional challenges (Challenge 3: Metabolite annotation and identification; Challenge 4: Big data). Researchers have therefore been combining analytical chemistry, computer science (ML/AI), and statistics to develop bioinformatics tools that address these two additional challenges. To that end, the team is currently testing multiple ML-based algorithms to putatively identify unannotated metabolites. Accessing the information hidden in the unidentified metabolites will be key to the effective integration of metabolomics with other omics data.

Funding Information

This research was supported by the DOE Office of Science, Office of Biological and Environmental Research (BER), grant no. DE-SC0021349.