Total RNA sequencing of FFPE samples detects long non-coding RNAs

Tech Note

Formalin fixation and paraffin embedding (FFPE) is a commonly used method to preserve tissue biopsies, resulting in a wide variety of globally accessible samples. These samples represent an almost endless biorepository for DNA, RNA and protein analyses waiting to be explored. During FFPE preservation, RNA undergoes chemical modification and fragmentation, with further degradation during prolonged storage. FFPE-derived RNA is therefore often of variable and typically low quality, with low yield and a high degree of degradation. This makes the analysis of RNA derived from FFPE samples particularly challenging.

Unlike other sequencing methods, total RNA sequencing provides an unbiased view of the entire transcriptome (excluding miRNAs) of a given sample, as it does not rely on specific capture probes, polyadenylation of transcripts or other modifications. Total RNA sequencing therefore enables the study of virtually all RNA species, including messenger RNAs and long non-coding RNAs (lncRNAs). lncRNAs are emerging as important regulators of tissue physiology and are associated with disease processes, including cancer. Analyses of the lncRNA landscape may help in understanding the biology of these disease processes and have potential for future biomarker development.

This tech note describes the technical assessment of a total RNA sequencing workflow for FFPE samples, with a focus on the quantification of lncRNAs.

Performing total RNA sequencing of FFPE samples

To evaluate the performance of a total RNA sequencing approach with FFPE samples, we conducted total RNA sequencing of 4 colorectal cancer and 4 matching normal colorectal tissue samples. As is typical for FFPE samples, the quality of the isolated RNA was very low, with DV200 values (the fraction of fragments longer than 200 nt among all fragments in a sample) between 15 and 30%, indicating strong RNA fragmentation (Figure 1).
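The DV200 definition above can be illustrated with a minimal sketch. In practice the value is reported by the instrument software from the electropherogram; the function name, the binned input format and the toy numbers below are assumptions made purely for illustration.

```python
def dv200(fragment_sizes_nt, abundances, threshold_nt=200):
    """Percentage of RNA signal found in fragments longer than threshold_nt.

    fragment_sizes_nt and abundances are parallel sequences describing a
    (hypothetical) binned fragment size distribution.
    """
    if len(fragment_sizes_nt) != len(abundances):
        raise ValueError("sizes and abundances must have the same length")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("total abundance must be positive")
    above = sum(a for s, a in zip(fragment_sizes_nt, abundances) if s > threshold_nt)
    return 100.0 * above / total


# Made-up size bins (nt) and signal per bin, not data from this tech note
sizes = [50, 100, 150, 250, 500, 1000]
signal = [30, 30, 20, 10, 6, 4]
print(dv200(sizes, signal))  # 20.0, i.e. within the 15 to 30% range described above
```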

RIN, DV200 and RNA electropherogram of FFPE samples
Figure 1: FFPE samples used in this tech note.
A, RIN and DV200 (Agilent Bioanalyzer) as RNA quality indicators of 4 matched tumor-normal FFPE samples.
B, Representative RNA electropherogram taken from colon cancer sample 1.

Libraries for total RNA sequencing were successfully generated from 100 ng of total RNA input using Illumina's TruSeq stranded total RNA sequencing chemistry, and were paired-end sequenced on a NextSeq 500 sequencer with a read length of 75 nucleotides. On average, more than 90 million reads were obtained per sample, ranging from 89 M to 95 M. Reads were aligned to a curated reference genome (based on Ensembl 75) extended with annotation of long non-coding RNAs (LNCipedia 3.1). Alignment was carried out using TopHat, quantification using HTSeq, and normalization as well as differential gene expression analysis using DESeq2. Due to the low quality of the RNA, only around 50% to 60% of the obtained reads could be mapped to the genome.
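The normalization step can be illustrated with the median-of-ratios approach that DESeq2 uses to estimate per-sample size factors. The following is a simplified NumPy sketch of that idea, not the DESeq2 implementation; the function name and the toy count matrix are hypothetical.

```python
import numpy as np


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a nonzero count in every sample
    expressed = np.all(counts > 0, axis=1)
    if not expressed.any():
        raise ValueError("no gene has nonzero counts in all samples")
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples serves as a pseudo-reference
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median ratio of a sample to the pseudo-reference
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


# Toy raw counts (4 genes x 2 samples); sample 2 was sequenced twice as deeply
raw = np.array([[100, 200], [50, 100], [10, 20], [0, 5]])
sf = size_factors(raw)
normalized = raw / sf  # divide each sample (column) by its size factor
print(sf)
print(normalized)
```

Dividing by the size factors puts all samples on a common scale, which is what allows read counts to be compared between samples of different sequencing depth.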

To assess the reproducibility of this workflow, library preparation for one FFPE sample was repeated and the resulting sequencing data analyzed as a technical replicate. A very high correlation (Pearson correlation coefficient r=0.99) between the normalized read counts of both replicates was observed (Figure 2 A).
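As a minimal sketch of this reproducibility check, the Pearson correlation coefficient between two vectors of normalized counts can be computed with NumPy. The replicate values below are invented for illustration and are not data from this tech note; whether the counts were log-transformed before correlation is not stated in the text, so the sketch uses them as given.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("both replicates must contain the same genes")
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical normalized read counts for five genes in two technical replicates
replicate_1 = [5, 12, 150, 900, 40]
replicate_2 = [7, 10, 160, 880, 45]
r = pearson_r(replicate_1, replicate_2)
print(f"r = {r:.3f}")
```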

Next, for expression analysis, data from all 8 samples were normalized to equal sequencing depth. At a sequencing depth of approximately 90 million reads per sample, over 15,000 mRNAs and over 9,000 lncRNAs were detected with more than 10 normalized read counts on average (Figure 2 B). Differential expression analysis comparing the 4 colorectal tumor and 4 normal samples identified 2,296 significantly differentially expressed mRNAs and 781 significantly differentially expressed lncRNAs (FDR <5%; Figure 2 C).
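The two thresholds used here, a mean of more than 10 normalized read counts for detection and an FDR below 5% for significance, can be sketched as follows. DESeq2 reports Benjamini-Hochberg adjusted p-values by default; that this is the FDR procedure behind the numbers above is an assumption, and the function names and toy inputs are hypothetical.

```python
import numpy as np


def count_detected(normalized_counts, min_mean=10):
    """Number of genes whose mean normalized count across samples exceeds min_mean."""
    means = np.asarray(normalized_counts, dtype=float).mean(axis=1)
    return int(np.sum(means > min_mean))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


# Toy genes x samples matrix of normalized counts
toy_counts = [[12, 15], [3, 4], [10, 10], [0, 30]]
print(count_detected(toy_counts))  # 2 (row means 13.5 and 15 exceed 10)

# Toy p-values for five genes
padj = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
significant = padj < 0.05
print(padj, significant)
```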

Among the differentially expressed mRNAs, several genes were identified that are known to play an essential role in colorectal cancer and are included in the KEGG colorectal cancer pathway* (Figure 2 C). Additionally, functional annotation using Gene Set Enrichment Analysis (GSEA) showed that the differentially expressed genes are significantly enriched in different cancer types or cancer-related pathways (e.g. cell cycle regulation, cell proliferation, hepatoblastoma pathway, etc.; FDR <5%).

Figure 2: Highly sensitive and reproducible detection of genes.
A, Technical reproducibility. Normalized read counts of a technical replicate of a low quality colon cancer sample. R = Pearson correlation coefficient. Blue = below 10 normalized read counts, green = more than 10 normalized read counts.
B, Detected and differentially expressed genes. Average number of detected genes with more than 10 normalized read counts (left) and number of differentially expressed genes (right). Light blue = mRNAs, dark blue = lncRNAs.
C, Volcano plot of differentially expressed genes. Grey = genes with FDR >5%, green = significantly differentially expressed mRNAs with FDR <5%, yellow = significantly differentially expressed lncRNAs with FDR <5%, pink = significantly differentially expressed mRNAs with FDR <5% included in the KEGG colorectal cancer pathway.
*http://www.genome.jp

Examples of induced expression for MYC, a gene involved in colorectal cancer, and for a lncRNA are shown in Figure 3. As total RNA sequencing also captures pre-mRNA transcripts from which introns have not yet been removed by splicing, a fraction of reads aligns to intronic regions (Figure 3). Furthermore, total RNA sequencing detects reads that cover splice junctions, which could be used to inspect alternative splicing events.

Figure 3: Example of induced expression in the MYC gene (A) and lncRNA AC078802.1-2 (B).
Sashimi plots visualize the expression in colon sample 1 (pink) in contrast to matching normal colon sample 1 (green). Gene models in grey, with boxes representing exonic regions. Connecting lines indicate reads that cover splice junctions.

Comparison to mRNA capture sequencing with FFPE samples

To further assess the performance of total RNA sequencing on FFPE samples, the results were compared to mRNA capture sequencing data of the same samples (see Tech Note "Study the transcriptome in FFPE tissue using mRNA capture sequencing"). To this end, mRNAs were selected that could be identified with both methods (n=20,814). This number is strongly reduced compared to the over 80,000 genes that can be detected with total RNA sequencing, because the probe-based mRNA capture approach does not cover all coding genes (it targets 98.3% of the RefSeq exome of the hg19 reference genome) and covers no lncRNAs.

Data for all 16 datasets (8 samples, each sequenced with both methods) were normalized according to the common mRNAs, and expression analysis was performed with DESeq2 (Figure 4). In total, more than 14,000 genes were reproducibly detected with more than 10 normalized read counts on average when sequenced at the same sequencing depth. Around 95% of the mRNAs were identified by both methods (Figure 4 A).
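The overlap can be recomputed from the counts given in the caption of Figure 4 A. The tech note does not state which denominator underlies the "around 95%" figure, so the sketch below, with hypothetical variable names, simply shows the shared fraction relative to several possible denominators.

```python
# Counts taken from the caption of Figure 4 A
both = 14121            # detected with both methods
total_rna_only = 857    # detected only with total RNA sequencing
capture_only = 502      # detected only with mRNA capture

detected_any = both + total_rna_only + capture_only
shared_of_any = both / detected_any                      # relative to either method
shared_of_total_rna = both / (both + total_rna_only)     # relative to total RNA seq
shared_of_capture = both / (both + capture_only)         # relative to mRNA capture

print(f"shared among genes detected by either method: {shared_of_any:.1%}")        # 91.2%
print(f"shared among genes detected by total RNA seq: {shared_of_total_rna:.1%}")  # 94.3%
print(f"shared among genes detected by mRNA capture:  {shared_of_capture:.1%}")    # 96.6%
```

Relative to the genes detected by each individual method, the shared fraction lies close to the reported value of around 95%.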

Figure 4: Comparison of mRNA capture and total RNA sequencing.
A, Detected mRNAs. Of 20,814 common mRNAs between mRNA capture and total RNA sequencing, 14,121 were identified with both methods with more than 10 normalized read counts on average, 857 were identified only with total RNA sequencing, 502 only with mRNA capture.
B, Concordance of identified mRNAs. Fold changes (tumor vs normal) of differentially expressed mRNAs identified with total RNA sequencing were compared to the corresponding fold changes obtained with mRNA capture. 89% of these matched both in direction (up- or downregulated) and in magnitude, differing by no more than 1 log2 fold change (grey); 11% were concordant in direction but differed by more than 1 log2 fold change (green); and below 1% were not concordant (pink).

Differential gene expression analysis was performed separately for the two datasets (mRNA capture and total RNA sequencing) and revealed 2,805 differentially expressed genes with total RNA sequencing. When compared to the corresponding values from mRNA capture sequencing, around 89% were concordant in both direction (up- or downregulated) and log2 fold change (differing by no more than 1), around 11% agreed in direction but differed by more than 1 in log2 fold change, and less than 1% were not concordant.
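The three concordance categories can be expressed as a small classification rule. This is a sketch under the assumption that "concordant" means the same sign and an absolute log2 fold change difference of at most 1; the function name, the labels and the example values are hypothetical, and fold changes of exactly zero are not treated specially.

```python
import numpy as np


def classify_concordance(lfc_total_rna, lfc_capture, max_diff=1.0):
    """Label each gene by agreement of its log2 fold change between two methods."""
    a = np.asarray(lfc_total_rna, dtype=float)
    b = np.asarray(lfc_capture, dtype=float)
    same_direction = np.sign(a) == np.sign(b)
    close = np.abs(a - b) <= max_diff
    return np.where(
        ~same_direction,
        "discordant",
        np.where(close, "concordant", "direction_only"),
    )


# Invented log2 fold changes (tumor vs normal) for four genes
total_rna = [2.0, -1.5, 3.2, 0.8]
capture = [1.6, -1.2, 1.9, -0.4]
labels = classify_concordance(total_rna, capture)
print(labels.tolist())
# ['concordant', 'concordant', 'direction_only', 'discordant']
```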

Taken together, mRNA capture and total RNA sequencing detect a highly similar and largely overlapping set of genes and identify them as differentially expressed, despite the drastically different underlying methods. This confirms the results obtained by total RNA sequencing on FFPE samples.

Conclusion

Total RNA sequencing provides an unbiased and comprehensive view of the transcriptome of a given sample, including RNA species that cannot be analyzed by other sequencing methods. Biogazelle has optimized a workflow utilizing total RNA sequencing for clinically relevant and challenging FFPE samples. Applying this approach to a matched colorectal cancer and normal colon tissue set, many genes were identified in a highly reproducible manner, including genes known to be involved in this cancer type. Many lncRNAs were also among the significantly differentially expressed genes. Despite emerging knowledge of their involvement in disease, lncRNAs remain to a large extent uncharted territory, with great potential to serve as future biomarkers.
