The scientific objectives of the Genomic Science program require a highly coordinated application of expertise that transcends traditional disciplinary boundaries. As such, one of the program's most challenging but critical goals is the creation of robust computational frameworks for data integration, analysis, and sharing that can accommodate the wide variety of heterogeneous data streams being generated across the Genomic Science community. These data streams include not only the various types of omics data (as well as meta-omics variations) discussed earlier in this report, but also data derived from nonomics-based analytical technologies for quantitative physiological analysis, physicochemical measurements of environmental factors, and a vast array of other experimental data types.
Data-specific needs for Genomic Science program research often revolve around tracking high-throughput experimental and contextual environmental data; developing tools for capturing and archiving large and complex datasets; and generating innovative new approaches for analysis, distillation, and integration of systems biology data. Tracking the data requires a Laboratory Information Management System (LIMS) appropriate for specific project needs to monitor experimental cycles, track samples and workflows, and collect internally compatible data from varying instrument types. Data capture and storage present considerable challenges, and the volume and complexity of data generated by systems biology research often require new technologies and bioinformatics approaches permitting rapid data storage, retrieval, and transfer at very large scales. The generation of raw data only begins the cycle of scientific inquiry. Improved data-distillation strategies for filtering out noise and compressing noncritical information, as well as identifying biologically meaningful data subsets, are critical for enabling subsequent cycles of analysis, integration, and modeling.
The process of modeling and simulation attempts to build a more integrated understanding of the dynamic nature of biological systems and enable scientists to test their knowledge via computerized "virtual experiments." Creating models that predict biosystem response to untested conditions requires continued emphasis on quantitative details such as kinetic constants, enzyme activities, and dynamic metabolic measurements underlying functional biological processes. Continuing to build on well-developed model organisms such as Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana is important, as is extending this area of research to develop predictive models of biological function in a broader class of organisms. Moving beyond the level of individual organisms, new mathematical and machine-learning methods are needed to address biological variables at the community scale and understand evolving interactions with external signals from the environment. As more powerful resources for high-performance computing become available, the amount of biological data produced by high-throughput experimental approaches grows at an even faster pace. Although this data continues to yield insights into and improve the quantitative understanding of biological systems, incorporating detailed molecular, biochemical, physiological, and structural information into biological models and simulations remains a major challenge.
Given the data-intensive nature of Genomic Science research, all supported projects are required to generate data management and integration plans that emphasize an iterative approach to data analysis and lead to a predictive understanding of the biological system(s) under investigation. Developing these types of plans not only serves the objectives of the individual project, but also facilitates the collaborative sharing of resulting data across the broader community via mechanisms such as the DOE Systems Biology Knowledgebase. The long-term success of the Genomic Science program, and systems biology in general, depends on achieving high levels of data and information integration and sharing. BER has established an information and data sharing policy requiring public accessibility to all publishable information.
Achieving a predictive understanding of biological systems is a daunting challenge and requires the integration of immense amounts of diverse information—functional descriptions assigned to DNA sequence, molecular interactions, images of molecules or physical structures within an organism, and details about the environment in which an organism lives. These information types typically have not been integrated and compared, and the heterogeneous mix of data emanating from the Genomic Science program will span diverse environmental conditions, spatial scales (nanometers to kilometers), and temporal scales (nanoseconds to decades). To address this grand challenge, DOE has funded a Systems Biology Knowledgebase (KBase) to facilitate a new level of scientific inquiry.
KBase is the first large-scale bioinformatics system to enable users to upload their own data, analyze it (along with collaborator or public data), build increasingly realistic models, and share and publish their workflows and conclusions. With features beyond those of a database or workbench, KBase aims to provide a knowledgebase: an integrated environment where knowledge and insights are created and multiplied. KBase's enterprise-class computing capabilities enable data integration and analysis at a powerful scale. This environment will foster the intellectual engagement and inherent creativity of a collaborative scientific community needed to decipher biological principles underlying complex systems. These concepts are driving KBase design: