Predicting Gene Functions in Plants with Single-Cell Genomic Data
Prakash Raj Timilsena1*, Sai Deepak Gattidi1, Jose Dinneny2, and Song Li1
1Virginia Polytechnic Institute and State University; and 2Stanford University
The rapid sequencing of genomes and transcriptomes in bioenergy crops and other plant species has outpaced the rate at which gene functions can be accurately annotated. Wet-bench validation of gene functions is laborious and time-consuming for non-model species. Even in a model organism like Arabidopsis thaliana, the majority of gene functions have not been validated with wet-lab experiments. Traditional computational methods for assigning gene functions largely rely on sequence homology, which cannot account for gene expression activity in different tissues or cell types. In this project, researchers tested whether single-cell gene expression data can be used to improve gene function annotation. The team compared bulk and single-cell RNA-seq datasets from roots and assessed the performance of seven machine learning algorithms for predicting gene functions. Researchers found that random forest performed best among these methods. The team further asked whether single-cell genomic data can provide additional information, because single-cell RNA-seq (scRNA-seq) captures expression data from more diverse cell populations than bulk RNA-seq. Surprisingly, bulk RNA-seq data were found to predict many gene functions more accurately than scRNA-seq data. A comparison of scRNA-seq datasets from different tissues showed that leaf scRNA-seq data provide higher accuracy than root scRNA-seq data in predicting chloroplast- and photosynthesis-related genes. This observation suggests that the specificity of the information content in single-cell datasets from different tissues is biologically relevant. Because of the diversity of cell types captured by scRNA-seq, researchers found that increasing the number of Uniform Manifold Approximation and Projection (UMAP) clusters may help to improve the prediction accuracy for single-cell data.
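The abstract does not state how single-cell data were turned into per-gene features for the classifiers. As a minimal sketch, assuming each gene is represented by its mean expression within each cell cluster (so that more clusters give a longer feature vector per gene), the feature construction could look like the following. The function name `cluster_mean_features`, the toy gene names, and the toy values are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


def cluster_mean_features(expression, clusters):
    """Build one feature vector per gene from single-cell data.

    expression: dict mapping gene name -> list of expression values, one per cell
    clusters:   list of cluster labels, one per cell (e.g. from clustering a UMAP embedding)

    Returns (labels, features), where labels is the sorted list of cluster
    labels and features maps each gene to its mean expression per cluster,
    in the same order as labels.
    """
    labels = sorted(set(clusters))

    # Group cell indices by their cluster label.
    cells_by_cluster = defaultdict(list)
    for cell_index, label in enumerate(clusters):
        cells_by_cluster[label].append(cell_index)

    features = {}
    for gene, values in expression.items():
        if len(values) != len(clusters):
            raise ValueError(f"{gene}: expected {len(clusters)} cells, got {len(values)}")
        # Average the gene's expression over the cells of each cluster.
        features[gene] = [
            sum(values[i] for i in cells_by_cluster[label]) / len(cells_by_cluster[label])
            for label in labels
        ]
    return labels, features


# Toy example (hypothetical values): six cells assigned to three clusters.
clusters = [0, 0, 1, 1, 2, 2]
expression = {
    "geneA": [4.0, 6.0, 0.0, 0.0, 1.0, 3.0],
    "geneB": [0.0, 0.0, 2.0, 4.0, 0.0, 0.0],
}
labels, features = cluster_mean_features(expression, clusters)
```

Under this assumption, the resulting vectors would serve as classifier input with known functional annotations as labels, and re-clustering the cells at a finer resolution would lengthen each gene's vector.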
Future work will incorporate stress-responsive scRNA-seq data and regulatory network information (DAP-seq) to further expand the prediction of novel gene functions in oilseed crops. Experimental validation of selected genes will be performed in the coming years.
This project is supported by DOE-BER, DE-SC0020358, and DE-SC0022985.