Crosslinking The Department of Energy Systems Biology Knowledgebase (DOE-KBase) and the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) to Support Protein Function Discovery
Qizhi Zhang1, Claudia Lerma-Ortiz1, Dennis Piehl2, Brinda Vallat2, Shuchismita Dutta2, Janaka N. Edirisinghe1, Samuel Seaver1, Stephen K. Burley2, and Christopher Henry1* (firstname.lastname@example.org)
1Argonne National Laboratory; and 2RCSB Protein Data Bank, Rutgers, The State University of New Jersey
The Systems Biology Knowledgebase (DOE-KBase) and the RCSB Protein Data Bank (RCSB PDB) offer synergistic functionality for investigating and engineering proteins. The collection of systems biology data and tools in KBase enables scientists to analyze their datasets in the context of public data and share their findings. RCSB PDB provides access to >200K experimentally determined, rigorously validated, and expertly biocurated 3D structures of proteins and nucleic acids within the open PDB archive. The RCSB.org web portal provides a variety of tools for searching, analyzing, and visualizing 3D biostructure data together with annotations from ~50 public resources. RCSB.org now also delivers >1M computed structure models from AlphaFold DB and RoseTTAFold alongside the experimental structures. The objective of this project was to lay the foundation for interoperation of these two resources, streamlining the ability of users to leverage structural biology data and workflows provided by RCSB PDB within the KBase platform.
Many projects currently funded by DOE BER aim to mechanistically understand a wide range of complex biological systems, with the ultimate goal of supporting the rational manipulation, prediction, and design of these systems. The large fraction of proteins with unknown or incompletely characterized function is one of the greatest impediments to this goal. Structural biology is central to resolving and understanding protein function, particularly with the advent of the AlphaFold2 and RoseTTAFold algorithms for rapidly predicting new computed structure models (CSMs) of proteins with accuracies comparable to those of low-resolution experimental structures. Yet structural biology approaches are greatly amplified when combined with systems biology data and tools. Toward this end, the KBase and RCSB PDB teams collaborated to develop a series of applications within the KBase platform that leverage the RCSB.org data-delivery and API services to integrate PDB data into systems biology and structural biology workflows across KBase and RCSB PDB.
Researchers demonstrate these newly developed workflows with two exemplar scenarios. In the first scenario, researchers identify genes encoded by the bacterium Micrococcus luteus that are responsible for producing proteins capable of degrading pyridine, a toxic compound found in coal tar. This workflow demonstrates the combined use of transcriptomics, mechanistic modeling, and chemoinformatics to propose candidate genes for a novel biochemical pathway for pyridine degradation. Researchers then apply the new KBase-RCSB PDB pipeline to: (1) rapidly search the PDB for experimental structures that are homologous to the candidate gene products; (2) seek experimental structures of proteins co-crystallized with pyridine to identify pyridine binding sites; and (3) import and view AlphaFold2-generated structures for the candidate gene products, comparing each predicted structure with the closest experimental structures represented in the PDB archive. Researchers then conduct structure motif searches at RCSB.org to further characterize the pyridine binding site, and perform structure comparisons of the AlphaFold2 predictions with the collection of experimental PDB structures and CSMs available at RCSB.org. Ultimately, the binding site analyses aid docking simulations in KBase that refine the gene candidates for the novel pathway.
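Step (1) of the pipeline above, the homology search against the PDB, can be sketched with the public RCSB PDB Search API. The following is a minimal illustration and not the implementation used by the KBase apps: the endpoint URL and JSON field names reflect the Search API (v2) as publicly documented at the time of writing and should be checked against the current documentation, and the function names and default cutoffs are hypothetical choices made for this example.

```python
import json
import urllib.request

# Assumption: public RCSB PDB Search API (v2) endpoint; verify against current docs.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_sequence_query(sequence, evalue_cutoff=0.1, identity_cutoff=0.3,
                         include_csms=False):
    """Build a Search API payload for a protein sequence-similarity search.

    The cutoffs are illustrative defaults, not values taken from the abstract.
    """
    content_types = ["experimental"]
    if include_csms:
        # Also return computed structure models (e.g. AlphaFold DB entries).
        content_types.append("computational")
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "sequence_type": "protein",
                "value": sequence,
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
            },
        },
        "return_type": "polymer_entity",
        "request_options": {
            "results_content_type": content_types,
            "scoring_strategy": "sequence",
        },
    }


def search_pdb(payload, timeout=30):
    """POST the payload and return a list of (identifier, score) pairs."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    if not body:  # an empty body is treated as "no matches"
        return []
    result = json.loads(body)
    return [(hit["identifier"], hit["score"]) for hit in result.get("result_set", [])]
```

A candidate gene product's amino acid sequence would be passed to build_sequence_query, and the resulting payload to search_pdb; setting include_csms=True asks the service to return computed structure models alongside experimental entries. Building the payload is kept separate from the network call so the query can be inspected or logged before it is sent.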
In the second scenario, researchers illustrate the concerted use of KBase and RCSB PDB to discover the unknown pyrimidine reductase (EC 1.1.1.193), a key enzyme in riboflavin (vitamin B2) biosynthesis in Arabidopsis thaliana. Using annotation, modeling, and gapfilling tools in KBase, the team confirmed that the gene encoding pyrimidine reductase was not identified in the Arabidopsis annotation selected for this scenario. Researchers then applied two newly developed KBase-RCSB PDB interface apps. The PDB-Import PDB Metadata into KBase Genome app enabled the team to query the PDB for experimental structures that match the sequence of any gene product in the entire Arabidopsis genome. Doing so exposed two Arabidopsis gene products with significant similarity to multiple experimental structures of microbial proteins in the PDB currently annotated as pyrimidine reductases. The team used the Query RCSB Databases for Protein Structures app to import and view these structures and to obtain links to views of them on RCSB.org. The team then used a capability within RCSB.org to compare the experimental structures of interest with the Arabidopsis AlphaFold2 CSMs now available on RCSB.org. Using the RCSB.org pairwise structure alignment tool, the team determined that the AlphaFold2 CSMs of both Arabidopsis candidate gene products aligned to distinct portions of a microbial pyrimidine reductase structure housed in the PDB. Although both Arabidopsis proteins are structurally similar, a detailed 3D analysis revealed that AT3G47390 lacks essential zinc-binding residues within its putative deaminase domain. Taken together with the observation that AT4G20960 had already been identified as the deaminase, this finding led to the hypothesis that AT3G47390 is the pyrimidine reductase (which was confirmed in publications).
The poster will display all KBase Narratives and RCSB tools applied in each of these scenarios, which are also described in detail in a publicly available training workshop on YouTube: https://www.youtube.com/watch?v=vs_UyhhtSFk&list=PLHib7JgKNUUf8Z8jSK57FsJrms94wpaL0&index=1Z
Argonne National Laboratory is managed by UChicago Argonne, LLC for the U.S. Department of Energy under contract no. DE-AC02-06CH11357. This program is supported by the U.S. Department of Energy, Office of Science, through the Genomic Science program, Office of Biological and Environmental Research, under FWP PRJ34888. RCSB PDB Core Operations are funded by the National Science Foundation (DBI-1832184), the U.S. Department of Energy (DE-SC0019749), and the National Cancer Institute, National Institute of Allergy and Infectious Diseases, and National Institute of General Medical Sciences of the National Institutes of Health under grant R01GM133198.