Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

HypoRiPPAtlas: an Atlas of Hypothetical Natural Products for Mass Spectrometry Database Search


Yi-Yuan Lee1,2, Mustafa Guler1*, Desnor N. Chigumba3, Shen Wang1, Neel Mitta1, Cameron Miller1, Benjamin Krummenacher1, Haodong Liu1, Liu Cao1, Aditya Kannan1, Keshav Narayan1, Samuel T. Slocum4, Bryan L. Roth4, Alexey Gurevich5, Bahar Behsaz1, Roland D. Kersten3, and Hosein Mohimani1


1Carnegie Mellon University; 2Cornell University; 3University of Michigan; 4University of North Carolina; and 5St. Petersburg State University



Recent analysis of hundreds of thousands of public microbial genomes has resulted in the discovery of over a million biosynthetic gene clusters (BGCs; Hadjithomas et al. 2016; Blin et al. 2016; Kautsar et al. 2021). Gene-to-molecule approaches are therefore urgently needed for microbial and plant natural product (NP) discovery in light of rapidly growing microbial and plant genetic resources. Currently, the NPs produced by the majority of BGCs remain unknown. Meanwhile, the Global Natural Products Social (GNPS) molecular networking infrastructure harbors billions of mass spectra of NPs with unknown structures and biosynthetic genes. To bridge the gap between large-scale genome mining and mass spectral datasets for NP discovery, researchers developed HypoRiPPAtlas, an atlas of hypothetical NP structures that can be readily used for in silico database search of tandem mass spectra.

HypoRiPPAtlas is constructed by mining the genomes of 22,671 microbial strains from the RefSeq database using seq2ripp, a novel machine learning tool for prediction of ribosomally synthesized and post-translationally modified peptides (RiPPs). Seq2ripp outperforms existing RiPP mining tools in identifying known MIBiG RiPPs from genomic inputs. Searching the hypothetical molecules from the Atlas against 46 mass spectral datasets from GNPS resulted in the discovery of numerous RiPPs, including two novel lasso peptides and one lanthipeptide from Streptomyces sp. NRRL B-2660, WC-3904 and WC-3560. Moreover, seq2ripp discovered ten plant RiPPs, including elaeagnin, a member of a new BURP-domain-derived RiPP class with a novel post-translational modification (PTM) from silverberry (Elaeagnus pungens). By addressing the fundamental challenge of predicting structures from NP biosynthetic genes, the HypoRiPPAtlas approach therefore has the potential to close the gap between biosynthetic genes and their natural products in genomic NP discovery. It could be extended to other NP classes in the future by implementing the corresponding biosynthetic logic.
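As a rough illustration of how hypothetical structures can be matched to tandem mass spectra, the sketch below filters candidate molecules by precursor mass. It is not the seq2ripp or HypoRiPPAtlas scoring method, which this abstract does not describe. The function names, the 10 ppm tolerance, and the modeling of a PTM as a simple loss of water (as in Ser/Thr dehydration or macrolactam formation) are all assumptions made for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # monoisotopic mass of H2O (Da)
PROTON = 1.00728   # mass of a proton (Da)


def peptide_mass(core, n_water_losses=0):
    """Neutral monoisotopic mass of a linear core peptide.

    Hypothetical simplification: each modification is modeled as the
    loss of one water molecule.
    """
    linear = sum(RESIDUE_MASS[aa] for aa in core) + WATER
    return linear - n_water_losses * WATER


def precursor_mz(mass, charge):
    """Theoretical m/z of the [M + zH]^z+ ion."""
    return (mass + charge * PROTON) / charge


def candidate_matches(candidates, observed_mz, charge, tol_ppm=10.0):
    """Names of candidates whose theoretical m/z lies within tol_ppm."""
    hits = []
    for name, mass in candidates.items():
        theoretical = precursor_mz(mass, charge)
        error_ppm = abs(theoretical - observed_mz) / theoretical * 1e6
        if error_ppm <= tol_ppm:
            hits.append(name)
    return hits
```

In a fuller search, candidates that pass such a precursor filter would then be scored against the fragment peaks of the spectrum. That step is omitted here.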

HypoRiPPAtlas and the seq2ripp pipeline are both publicly available. Users can examine hypothetical RiPPs mined from publicly available genomes and upload their own paired genomic and mass spectral datasets to launch custom seq2ripp runs.


Hadjithomas, M., et al. 2016. “IMG-ABC: New Features for Bacterial Secondary Metabolism Analysis and Targeted Biosynthetic Gene Cluster Discovery in Thousands of Microbial Genomes.” Nucleic Acids Research 45, D560–5.

Blin, K., et al. 2016. “The AntiSMASH Database, a Comprehensive Database of Microbial Secondary Metabolite Biosynthetic Gene Clusters.” Nucleic Acids Research 45, D555–9.

Kautsar, S. A., et al. 2021. “BiG-SLiCE: A Highly Scalable Tool Maps the Diversity of 1.2 Million Biosynthetic Gene Clusters.” GigaScience 10(1), giaa154.

Funding Information

This research was supported by the U.S. Department of Energy award DE-SC0021340.