Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

ENIGMA Long Read Sequencing and Assembly for Microbial Genomes: KBase Integration for Assembly and LISA Workshop

Authors:

Lauren Lui1* (lmlui@lbl.gov), Torben Nielsen1, John-Marc Chandonia1, and Paul D. Adams1,2

Institutions:

1Lawrence Berkeley National Laboratory (LBNL); and 2University of California–Berkeley

Goals

Achieving a causal understanding of a microbial system requires mapping the mechanisms by which organisms grow, cooperate, and compete in complex environments. These mechanisms include ecological phenomena and abiotic factors that influence behavior and survival. One critical requirement for reaching this level of understanding is fully resolving the genomes of the community so that the functional roles they specify can be assayed and discovered. While the challenges of gene functional annotation and of linking genotype to phenotype loom beyond simply obtaining genomes, the underlying challenge at present remains generating high-quality genomes for microbial isolates. The base genome, along with its relative abundance, constitutes the most important foundational data needed to infer and parameterize models of microbial system dynamics.

Abstract

The Ecosystems and Networks Integrated with Genes and Molecular Assemblies (ENIGMA) Science Focus Area (SFA) has been developing pipelines for sequencing and assembly of long-read data from microbial isolates and metagenomes to help achieve the goal of causal microbial ecology. Researchers have developed the capability to isolate diverse organisms, extract the high-molecular-weight DNA needed for single-molecule long-read sequencing, and perform the sequencing using Oxford Nanopore Technologies MinION and PromethION sequencers. To characterize the microbial diversity and activity at the Oak Ridge Reservation at Oak Ridge National Laboratory, ENIGMA anticipates isolating thousands of bacteria and archaea, as well as generating spatio-temporal series of fully resolved enrichments and metagenomes from the site. These sequencing projects support the goals of linking genotype to phenotype and understanding the temporal, dynamic, and complex factors influencing microbial community structure and activity at the research site. ENIGMA uses isolates to help link genotype to phenotype by analyzing genomes in conjunction with transposon mutant libraries, metabolomics, and growth condition data. High-quality genomes are essential for these types of experiments and for ENIGMA science.

Researchers are currently adding new functionality to the DOE Systems Biology KnowledgeBase (KBase) by implementing tools that use long-read data for isolate assembly and methylation detection. Developing these workflows within KBase will make the tools more broadly available across the ENIGMA SFA and to other scientists, especially those who do not specialize in computational methods. These apps and workflows will enable ENIGMA, as well as other DOE SFAs and microbiologists, to (1) address scientific questions that would otherwise be infeasible with isolate assemblies using only short reads, (2) track provenance of data and methods used for assembly, and (3) share assemblies across the SFA for collaborations. This new functionality will also provide a foundation for further KBase extensions that support developments in long-read technology. Currently, the genome assembler Unicycler has been released as a KBase app and Flye is in development. Filtlong, a read quality tool, and Polypolish, an assembly polishing tool, are in beta.
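As a rough illustration of the kind of read-quality and contiguity summary that often accompanies long-read assembly, the sketch below computes N50 for a list of sequence lengths and applies a simple minimum-length filter. It is a minimal, standalone example and not the implementation used by Filtlong, Unicycler, Flye, or any KBase app; the function names, the example lengths, and the 1,000-base threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases. Returns 0 for empty input.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length
    return 0


def filter_by_min_length(lengths: Iterable[int], min_length: int) -> List[int]:
    """Keep only lengths at or above a (hypothetical) minimum-length cutoff."""
    return [n for n in lengths if n >= min_length]


if __name__ == "__main__":
    # Hypothetical read lengths, in bases, for illustration only.
    read_lengths = [500, 1200, 3000, 8000, 15000, 22000]
    kept = filter_by_min_length(read_lengths, min_length=1000)
    print("reads kept:", len(kept))
    print("N50 before filtering:", n50(read_lengths))
    print("N50 after filtering:", n50(kept))
```

The same n50 function can be applied to contig lengths from an assembly; for a fully resolved single-chromosome isolate genome, the N50 would simply equal the chromosome length.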

In summer 2023, the team will hold the Long-read Isolate Sequencing and Assembly (LISA) Workshop at LBNL. This workshop will teach participants how to go from a microbial isolate to sequencing to an assembled genome. A wet lab session will cover how to extract high-molecular-weight DNA, make nanopore libraries, and run a MinION sequencer. A separate computational session will cover base calling of nanopore data, genome assembly using long-read sequencing data, and genome annotation using KBase. Scientists from computational, modeling, and bench backgrounds are encouraged to attend both sessions, and the workshop will be designed to accommodate learners from these diverse technical backgrounds.

Funding Information

This material by the ENIGMA SFA Program at Lawrence Berkeley National Laboratory is based upon work supported by the U.S. Department of Energy, Office of Science, Biological and Environmental Research (BER) Program under contract number DE-AC02-05CH11231.