RNA-seq Science Case: NGS helps reveal complex differential gene and transcript expression
Life Sciences, Genomics
RNA-seq (RNA sequencing) uses next-generation sequencing to capture a snapshot of the RNA present in a cell at a particular time. The technique makes it possible to study gene expression, gene fusions, and mutations at higher resolution than was possible with microarray technology. Samples from two groups with different conditions are compared to determine genetic differences that might explain the observed phenotype differences. For example, samples can be obtained from a group of patients with a given disease and from another group without that disease, and then analyzed together to generate a list of candidate genes or mutations that might be associated with the disease. Another type of study compares data acquired before and after a specific treatment is applied.
Finding associations between genes, their expression, and phenotypes is essential to understanding disease processes, their diagnosis, and their treatment, for example with drugs. Findings from in silico analyses need to be confirmed with expensive wet-lab experiments, so there is great interest in optimizing the data analysis to yield a short list of highly relevant candidate genes.
The analysis of RNA-seq data involves various steps that can be roughly divided into two parts: discovery of candidate genes and annotation of genes. The first part begins with raw sequencing reads and produces a transcriptome assembly together with lists of differentially expressed and regulated genes and transcripts. These steps are implemented as a Galaxy workflow following the TUXEDO pipeline described in [Trapnell et al, 2012] (see figure and publication list below). This pipeline has been translated into a WS-PGRADE workflow to achieve higher scalability through execution on a grid infrastructure. The second part, gene annotation, is performed with a Taverna workflow; it enriches the list of genes generated by the first part with metadata from additional databases and ontologies.
RNA-seq experiments must be analyzed with robust, efficient and statistically principled algorithms. Fortunately, the bioinformatics community has been hard at work developing mathematics, statistics and computer science for RNA-seq and building these ideas into software tools. RNA-seq analysis tools generally fall into three categories: (i) those for read alignment; (ii) those for transcript assembly or genome annotation; and (iii) those for transcript and gene quantification. Two widely used tools together serve all three roles. TopHat (http://tophat.cbcb.umd.edu/) aligns reads to the genome and discovers transcript splice sites. These alignments are used in several ways during downstream analysis. Cufflinks (http://cufflinks.cbcb.umd.edu/) uses this map of reads against the genome to assemble the reads into transcripts. Cuffdiff, a part of the Cufflinks package, takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed, using a rigorous statistical analysis. These tools are gaining wide acceptance and have been used in a number of recent high-resolution transcriptome studies.
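The hand-off between the three tools can be sketched as a short Python script that only builds the command lines, without running them. This is an illustration under assumptions, not the Galaxy or WS-PGRADE workflow itself: the sample names, the `genome` index prefix and the GTF file names are hypothetical, and the directory layout (`<sample>_thout/accepted_hits.bam`) follows the convention used in the Trapnell et al. protocol. The transcript GTF passed to Cuffdiff would, in that protocol, be a merged assembly produced by a separate merging step that is not shown here.

```python
def tophat_cmd(sample, index_prefix, reads, annotation_gtf, threads=8):
    """Build the TopHat command that aligns one sample's reads to the genome.

    `reads` holds one FASTQ file (single-end) or two (paired-end).
    """
    return ["tophat", "-p", str(threads), "-G", annotation_gtf,
            "-o", f"{sample}_thout", index_prefix, *reads]


def cufflinks_cmd(sample, threads=8):
    """Build the Cufflinks command that assembles transcripts from the
    alignments TopHat wrote for this sample."""
    bam = f"{sample}_thout/accepted_hits.bam"
    return ["cufflinks", "-p", str(threads), "-o", f"{sample}_clout", bam]


def cuffdiff_cmd(conditions, transcripts_gtf, out_dir="diff_out", threads=8):
    """Build the Cuffdiff command comparing two or more conditions.

    `conditions` maps a condition label to the list of its replicate samples;
    replicates of one condition are passed as a comma-separated BAM list.
    """
    labels = ",".join(conditions)
    bam_groups = [
        ",".join(f"{s}_thout/accepted_hits.bam" for s in samples)
        for samples in conditions.values()
    ]
    return ["cuffdiff", "-o", out_dir, "-p", str(threads),
            "-L", labels, transcripts_gtf, *bam_groups]


# Hypothetical experiment: two conditions, two replicates each.
conditions = {"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}
commands = []
for samples in conditions.values():
    for s in samples:
        commands.append(tophat_cmd(s, "genome", [f"{s}_1.fq", f"{s}_2.fq"], "genes.gtf"))
        commands.append(cufflinks_cmd(s))
commands.append(cuffdiff_cmd(conditions, "merged.gtf"))

for cmd in commands:
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a machine where the tools are installed. Keeping command construction separate from execution makes it easier to inspect a plan before submitting it to a workflow system.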
RNA-seq experiments can serve many purposes. One of the most common use cases is a workflow that compares the transcriptome profiles of two or more biological conditions, such as wild-type versus mutant or control versus knockdown experiments. For simplicity, such an experiment may compare only two biological conditions, although the software is designed to support many more, including time-course experiments.
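To make the two-condition comparison concrete, the following minimal sketch computes a log2 fold change between the mean expression of a gene in two conditions. It is not Cuffdiff's statistical test, which also models variability across replicates; the expression values and the pseudocount of 1.0 (added to avoid division by zero) are assumptions chosen purely for illustration.

```python
import math


def log2_fold_change(values_a, values_b, pseudocount=1.0):
    """Return log2 of (mean of condition B / mean of condition A).

    A pseudocount is added to both means so that a gene with zero
    expression in one condition still yields a finite value.
    """
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Hypothetical expression values for one gene, two replicates per condition.
control = [1.0, 1.0]
knockdown = [3.0, 3.0]
print(log2_fold_change(control, knockdown))  # (3 + 1) / (1 + 1) = 2, so this prints 1.0
```

A positive value indicates higher expression in the second condition and a negative value indicates lower expression. Deciding whether such a difference is significant requires a statistical test over the replicates, which is the role Cuffdiff plays in the pipeline.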
- Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562–578 (2012)