RNAseq experimental design

From Bioinformatics Core Wiki

When designing your RNAseq Differential Expression experiment, please take the following into consideration:

Biological vs. Technical Replicates

Biological variation arises from differences in gene expression levels between individuals, tissues, or culture samples, even when environmental factors are kept as consistent as possible between them. Technical variation arises from imprecision in the measurement itself: measuring the same sample several times gives slightly different results each time. Generally, one is interested in expression level differences caused by different treatments (drugs, genotypes, environmental conditions, etc.), or the "treatment effect." One might also be interested in the degree of biological variation. The main problem to address when designing experiments is that every measurement is affected by biological variation, technical variation, and variation due to treatment, all at the same time. The trick is to design experiments so that one can make supportable claims about one or more of these distinct sources of variation.

The expensive solution to this problem is to sequence many technical replicates for each biological sample, and many biological samples (replicates) for each combination of treatments. In general, though (at least with mature, well-executed technology), biological variation is larger than technical variation. In practice, one also gains a feel for the technical variation associated with a particular technology over the course of multiple experiments. For RNAseq analyses, we therefore recommend favoring biological replicates over technical replicates. Without biological replicates, one cannot infer (at least, solely from RNAseq results) that any observed differential expression is due to the treatment, because expression differences could exist between individuals independent of the treatment effect. Furthermore, one cannot infer that any putative treatment effect generalizes to the larger population; other individuals may respond differently to the same treatment.
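The trade-off between biological and technical replicates can be illustrated with a minimal sketch. It assumes a simple additive model with independent errors, in which the variance of a treatment-group mean is sigma_bio^2 / n_bio + sigma_tech^2 / (n_bio * n_tech). The function name and the standard deviations below are made up for illustration and are not estimates from any real experiment.

```python
def variance_of_group_mean(sigma_bio, sigma_tech, n_bio, n_tech):
    """Variance of a treatment-group mean when n_bio individuals are each
    measured n_tech times (additive model, independent errors; an assumption)."""
    return sigma_bio ** 2 / n_bio + sigma_tech ** 2 / (n_bio * n_tech)


# Illustrative (made-up) standard deviations on a log-expression scale,
# with biological variation larger than technical variation.
sigma_bio, sigma_tech = 0.5, 0.2

# The same budget of 6 libraries per treatment group, spent two ways.
six_bio = variance_of_group_mean(sigma_bio, sigma_tech, n_bio=6, n_tech=1)
two_bio_three_tech = variance_of_group_mean(sigma_bio, sigma_tech, n_bio=2, n_tech=3)

print(f"6 biological x 1 technical: {six_bio:.4f}")
print(f"2 biological x 3 technical: {two_bio_three_tech:.4f}")
```

Under these assumptions, technical replicates only shrink the technical term, so when biological variation dominates, spending the same number of libraries on more individuals gives the more precise group mean.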


One strategy that can be employed to help separate (unconfound) the effects of treatment from biological variability is sample pooling. For example, if one is only interested in the effects of nutrition on jackalope antler development signaling pathways, one might pool the cDNA from three individuals fed a standard diet, and pool cDNA from three individuals fed a vitamin-enriched diet. While there is a fair chance that two individuals will display expression levels (for individual genes) that show a "false" trend (up-, instead of down-regulation with vitamins, for example) simply by random chance, it is less likely that the means of several individuals will display false trends. However, without knowing the extent of the per-subject variability of each gene (information that one loses upon pooling), there is no way to say (a priori) how many individuals need to be pooled to avoid seeing a false trend more than, say, 5% of the time. Therefore, unless one has prior knowledge about the extent of biological variability (which can change with treatment), we recommend against sequencing unreplicated pools. Using multiple pools for each treatment group may reduce bias from biological variability and increase statistical power, but this has not been studied extensively for RNAseq experiments.
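To see why the required pool size cannot be chosen without knowing the per-subject variability, consider the following sketch. It assumes that each subject's log-expression for a gene is normally distributed with the same standard deviation in both groups, and that a pool behaves like an equal-weight average of its subjects; both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def false_trend_probability(true_diff, sigma_subject, pool_size):
    """Probability that the difference between two pool means has the wrong
    sign, assuming normal per-subject values and equal-weight pooling."""
    # Standard error of the difference between two means of pool_size subjects.
    se = sigma_subject * math.sqrt(2.0 / pool_size)
    z = -abs(true_diff) / se
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smallest_pool_size(true_diff, sigma_subject, max_false_rate=0.05, limit=1000):
    """Smallest pool size whose false-trend probability is at most
    max_false_rate, or None if no size up to limit is enough."""
    for n in range(1, limit + 1):
        if false_trend_probability(true_diff, sigma_subject, n) <= max_false_rate:
            return n
    return None


# Same true effect, two different (made-up) levels of per-subject variability.
for sigma in (1.0, 2.0):
    print(sigma, smallest_pool_size(true_diff=1.0, sigma_subject=sigma))
```

In this model the required pool size grows with the square of the per-subject standard deviation, so doubling the variability roughly quadruples the number of individuals needed. Since that standard deviation is exactly what pooling hides, the calculation is only possible with prior knowledge.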


Barcoding means attaching a known sequence of 4 to 12 nucleotides to the 3' ends of the Illumina (or other NGS technology) adapter sequences. One can then mix several libraries together to be sequenced on one lane or run, and separate the sequences afterwards, in silico, by recognizing their barcodes. This may be desirable if a run produces several times more reads than one needs in order to sample a transcriptome well (this may be the case with SOLiD, but is not generally the case, for eukaryotes, with Illumina lanes). However, even if one wants a full lane's worth of reads per library, barcoding can still be useful. If one barcodes, say, 6 libraries and then sequences 6 lanes, each containing a mixture of all 6 libraries, one can separate and recombine all the reads in silico afterwards to get a full lane's worth of reads for each library. The benefit is that if any lane has sequencing problems, fails completely, or shows some unusual bias, then no single treatment group or replicate is affected more than any other. In other words, by barcoding and cross-pooling, one can unconfound the bias caused by lane or run effects.
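The separate-and-recombine step can be sketched as follows. This is a toy illustration, not a replacement for a real demultiplexing tool: it assumes the barcode appears as the first few bases of each read and must match exactly, whereas the actual barcode position and mismatch tolerance depend on the library preparation and sequencing platform. The function, library names, barcodes, and reads are all made up.

```python
from collections import defaultdict


def demultiplex(lanes, barcode_to_library, barcode_length):
    """Group reads by library across all lanes.

    lanes: dict mapping lane name -> list of read sequences (strings).
    barcode_to_library: dict mapping barcode sequence -> library name.
    Assumes the barcode is the first barcode_length bases of each read and
    requires an exact match; unmatched reads are labeled 'undetermined'.
    """
    reads_by_library = defaultdict(list)
    for lane_name, reads in lanes.items():
        for read in reads:
            barcode, insert = read[:barcode_length], read[barcode_length:]
            library = barcode_to_library.get(barcode, "undetermined")
            # Keep the lane of origin so lane effects can still be checked later.
            reads_by_library[library].append((lane_name, insert))
    return dict(reads_by_library)


# Made-up barcodes and reads: two libraries mixed on each of two lanes.
barcodes = {"ACGT": "control_1", "TGCA": "treated_1"}
lanes = {
    "lane1": ["ACGTGGGAAACC", "TGCATTTCCCGG"],
    "lane2": ["ACGTCCCTTTAA", "TGCAGGGTTTAA", "NNNNACGTACGT"],
}
by_library = demultiplex(lanes, barcodes, barcode_length=4)
for library, reads in by_library.items():
    print(library, reads)
```

Because every library receives reads from every lane, a poor lane removes a similar share of reads from each library rather than all of the reads from one.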

Reference Sequences

In general, a lane's worth of 85 bp paired-end (PE) Illumina reads (~150 bp per read pair times ~30M read pairs = ~4.5 Gb) is enough coverage to do de novo assembly and get a reasonably good reference transcriptome for a non-model organism. With that same data, reads can be counted to assess differential expression. On the other hand, if a good reference genome exists, 40 bp single-end (SE) reads are in general sufficient to map reads uniquely to genes and assess differential expression. A lane of 85 bp PE reads is more than twice as expensive as a lane of 40 bp SE reads (but please consult the DNA Technologies Core website for current pricing information).
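The yield arithmetic above can be reproduced with a short helper, which may be handy when substituting your own read counts. The function name is hypothetical, and the inputs are the approximate figures quoted in the text rather than guaranteed instrument output.

```python
def lane_yield_gb(read_pairs, bases_per_pair):
    """Total sequenced bases in gigabases (1 Gb = 1e9 bases)."""
    return read_pairs * bases_per_pair / 1e9


# Approximate figures from the text: ~30M read pairs, ~150 bp per pair.
paired_end_gb = lane_yield_gb(read_pairs=30_000_000, bases_per_pair=150)
print(f"{paired_end_gb:.1f} Gb per lane")
```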

Other Resources

  • The ENCODE Project's standards for RNAseq experiments.