In this talk, hosted by Twist Bioscience during ESHG 2019 (Gothenburg, Sweden), Jo Vandesompele, PhD - CSO at Biogazelle, discusses the use of Twist Bioscience probes by Biogazelle, for the custom enrichment of non-coding RNAs from human biofluids.
Thanks very much for the kind introduction and for inviting me to present some of our work. Biogazelle is a genomic services company and we're focusing tremendously on RNA, and we truly believe in the power of RNA as it accurately reflects cellular states.
I sometimes refer to DNA, and please take no offense here, as a static painting. And here this is Clara, the five-year-old daughter of Peter Paul Rubens, a famous Flemish painter. And I refer to RNA as real-time still images from a video. And here I just show you a few of these snapshots from a video of a girl, Lotta, a Dutch girl. I'm not sure if some of you saw the video. It's available on YouTube, where the father has documented her every day by taking a few snapshots. While the DNA of the girl has not changed over time, her expression clearly has, and I think it's safe to state that RNA is probably one of the most informative and dynamic information layers to study.
Most of you are probably familiar with expressing or studying RNA as fold changes: a little bit more in one sample versus another sample, or a little bit less in one group versus another group. And then you can do differential gene expression, also for single genes or pathways. It's beautifully visualized as heat maps, or you can even do fancy gene set enrichment analysis to understand if a pathway is repressed or activated. And that's all good. But I think it's important to understand that there is also a structural component of RNA, and by doing RNA sequencing obviously we can deduce and learn from the structural information embedded, encoded in the transcriptome. We can detect mutations, alternative splicing, fusion genes, circular RNAs, etc., if we have the proper processing tools to do so. And at Biogazelle we definitely have workflows and quality control to exploit the structural information embedded in the RNA.
It's also safe to say that we have finally entered this era of precision medicine where we try to provide the right drug to the right patients at the right time. And obviously molecular tests are crucially important in that. And liquid biopsies are probably becoming, or have already established themselves as the cornerstone of precision medicine for obvious reasons; they are easy to obtain, they confer low risk to the patient, they enable serial profiling and really allow us to do longitudinal studies. For those of you in the field of oncology, a liquid biopsy is probably reflecting the entire tumor load compared to a very biased biopsy and the liquid is full of biomarker potential.
There is a flavor for everyone. We have cell-free nucleic acids, we obviously love RNA. DNA is there as well, circulating tumor cells, extracellular vesicles or exosomes, and tumor-educated platelets. But the challenge of measuring RNA in liquid biopsies is quite hard and I will address it in the next slide.
I just want to explain in one slide how the RNA is actually ending up in the liquid biopsy, and it's actually a combination of active processes and passive release of RNA. So obviously RNA is a sensitive molecule if there are RNases present, but some of the RNA is quite well protected: bound to proteins or encapsulated in vesicles or platelets or other containers. That's why it is possible to detect RNA in liquid biopsies. There is even some evidence that RNA may be circulating naked, especially the form circularised through back-splicing, which is pretty robust against exonucleases. RNA may have a communication function like hormones. Whatever its function is in that fluid extracellularly, I think it's important to understand that we can actually exploit RNA as a biomarker, not only from its abundance levels but also from the structural information encoded in that extracellular RNA. I already told you the challenges in messenger RNA sequencing of liquids are numerous. It's not too difficult to sequence small microRNAs, but these long messenger RNAs are harder because they are to some extent degraded or fragmented in the liquid biopsy. We have a lot of unwanted RNA that we don't want to sequence, such as hemoglobin RNA or ribosomal RNA. The input is typically low. We get on average samples that are 0.2 ml, so 200 μl. You can imagine there's not much nucleic acid in there, although from an RNA perspective it's at least 10 times as high as DNA, potentially offering some sensitivity advantages for doing liquid biopsy analysis focusing on RNA. And to complicate matters to the fullest, the genome is pervasively transcribed and we have sense and antisense overlapping transcripts, so we need library prep methods that can handle all these challenges.
The solution in our perspective is probe-based cDNA enrichment, or RNA capture sequencing, and we have not invented that technology. It was actually a decade ago that custom enrichment of cDNA fragments was published by Levin and colleagues, and then gradually the number of genes that were captured from RNA samples increased to actually doing the full exome only five years ago, but that was still on high-quality and large quantities of RNA. The first clinical sample was actually done in 2014, the first exome capture on RNA derived from FFPE, and at Biogazelle we were probably the first doing whole exome capture of RNA from a liquid biopsy (200 μl). So how does the technology really work? Before I proceed, I just want to indicate that all these capture-focused methods repeatedly document that you have higher sensitivity in your readout, that you can reveal higher transcriptome complexity, and that the structural information is faithfully maintained in the RNA sequencing data to allow you to do fusion gene analysis and mutation analysis.
So we in fact make a stranded, double-stranded cDNA library, and by stranded I mean that the transcriptional direction is preserved in the library to make a distinction between these sense and antisense overlapping transcripts. And then, like we do for exome sequencing at the DNA level, we enrich the exons of interest from that complex, high-dynamic-range cDNA for sequencing. It's a pretty laborious workflow, at least in the original implementation. Later in my talk, I will give you a few flavors of improvements to that workflow. We actually capture with about 400,000 probes to enrich the full human transcriptome. We typically sequence 50 million reads, which has been shown to be sufficient for a liquid biopsy because there is not much added value in doing more. There always is, but I think it's the law of diminishing returns, and we've done about a thousand cases so far, mainly in the field of oncology. Bear in mind that the original implementation of the workflow used a double capture round, which we are now abandoning in favor of a single capture round, but I will come back to that later on.
This was actually the first data that we generated three years ago, where we did matched FFPE, plasma, and serum from two lung cancer and two colon cancer patients, and we were actually shocked by the result. In green, you see the number of genes that are reproducibly detected in FFPE, but also in plasma, here in pink, and in dark blue, serum. We have about 7,000 to 8,000 genes reproducibly detected in only 200 microliters of fluid, and that was quite encouraging to continue with that workflow.
The reproducibility is pretty good. As you can see on the left, the trueness or the resolution is very good. In 80% of the genes we can detect a 25% difference, so that's really tiny. So the reproducibility and the resolution are pretty high.
The gene body coverage is excellent. Please note that we have lower coverage in the UTRs because the probes were not designed to capture UTRs, and the cumulative gene diversity indicated that we did not specifically enrich highly abundant genes, so the complexity was quite good in our sequencing data.
That's all good and well, but there are still, or there were, limitations to the original target enrichment workflow. To name a few, only protein-coding exons of messenger RNAs were enriched, at least in the original implementation. Only human RNA was enriched. Obviously these two limitations came from the probe set that we commercially acquired. There was no room for customization where we wanted more or fewer genes or other genes. I think most importantly, there were no probes against spike-in controls, and these have been crucial to do workflow controls and do some normalization, especially in the field of liquid biopsies.
So in the next part of my presentation I want to illustrate with three or four success cases where we have used Twist Bioscience probes to address some of these challenges. The first example is a custom probe set of about 3,000 probes directed against 170 spike-in RNAs to supplement the messenger RNA probe set, and I'll give you an example of how we apply that to fully document the human biofluid RNA Atlas and also an ongoing study called the Extracellular RNA Quality Control study. Then I will show you an example in which we used a custom probe set of about 450,000 Twist probes against 50,000 human long non-coding RNAs, to finish off with a benchmarking study in which we compared the Illumina enrichment with the Twist enrichment.
The human biofluid RNA Atlas project is a collaboration with Illumina, Ghent University, and a few other academic centers in which we want to deeply sequence all the different human biofluids. I wasn't even aware of so many biofluids being available. There are 22, and we want to deeply sequence and catalog all RNAs present extracellularly in these different fluids. The work is actually coordinated by a talented PhD student, Eva Hulstaert from Ghent University. You can imagine, as the fluids are so diverse, the RNA concentration is very different and the integrity of the RNA is quite different in these human biofluids, so we need controls to make data comparable.
That's why we spike in a set of 78 so-called sequin RNAs in the fluid prior to extraction, and then we also spike 92 well-known spike-in controls, the ERCC, into the RNA eluate. By looking at both the spikes that we add prior to extraction and the ones that we add after extraction, we can accurately quantify the purification efficiency as well as the relative RNA concentration, which is crucial if we start learning what the differences are in RNA amounts in these different biofluids.
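The logic of combining the two spike-in sets can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the actual analysis pipeline: the function names, the sample names, and the read counts are all made up, and the sketch assumes that both spike mixes are added in the same amount to every sample and that read counts scale linearly with the amount of RNA.

```python
# Illustrative only: hypothetical read counts, not data from the study.

def relative_rna_concentration(endogenous_reads, sequin_reads):
    """Endogenous reads per sequin read.

    Sequins are added to the fluid before extraction, so they go through
    the same purification as the endogenous RNA. Under the assumptions
    above, this ratio is a relative measure of RNA per volume of fluid.
    """
    if sequin_reads <= 0:
        raise ValueError("sequin_reads must be positive")
    return endogenous_reads / sequin_reads


def relative_extraction_efficiency(sequin_reads, ercc_reads):
    """Sequin reads per ERCC read.

    ERCC spikes are added to the eluate after extraction, so they do not
    suffer purification losses, while sequins do. A lower ratio suggests
    that more material was lost during extraction.
    """
    if ercc_reads <= 0:
        raise ValueError("ercc_reads must be positive")
    return sequin_reads / ercc_reads


# Two hypothetical fluids sequenced with the same spike-in amounts.
samples = {
    "fluid_A": {"endogenous": 800_000, "sequin": 100_000, "ercc": 100_000},
    "fluid_B": {"endogenous": 100_000, "sequin": 400_000, "ercc": 800_000},
}

results = {
    name: {
        "concentration": relative_rna_concentration(c["endogenous"], c["sequin"]),
        "efficiency": relative_extraction_efficiency(c["sequin"], c["ercc"]),
    }
    for name, c in samples.items()
}

for name, r in results.items():
    print(f"{name}: concentration={r['concentration']:.2f}, "
          f"efficiency={r['efficiency']:.2f}")
```

Both values are relative: they are only meaningful when compared between samples that received the same spike-in amounts, which is why the spikes have to be part of the capture probe set in the first place.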
Of course, I don't have time to discuss the entire outcome of the study, but I just want to give you one outcome, which is that there is a huge difference in RNA concentration in these different human biofluids, and this is only possible to see if you add spike-in RNAs to do that careful quantification. I want to draw your attention to aqueous humor (it's an eye fluid), cerebrospinal fluid, and sweat as prototype fluids which have a very low messenger RNA concentration, in contrast to breast milk, seminal plasma, and tears, which have about a thousand-fold higher concentration of RNA per same volume of fluid. What it means we don't know, but we are actually conducting studies together with key opinion leaders in these different fluids, and we try to come up with reasons why we would use some fluids for certain diagnostics or for some particular diseases, and we also try to see if the organ or the tissues that produce or convey the fluid are actually reflected in the RNA patterns in the fluid. This is an ongoing study but I think it's quite fascinating.
To finish this one off, I know and I do understand that plasma is a very important liquid biopsy fluid, but I want to just provoke your thoughts in that plasma may not always be the best fluid depending of course on the disease you're studying and this is actually the data from matched urine and plasma from prostate cancer patients, which is a logical example, but here we clearly document that if we look at prostate cancer or prostate specific RNAs, collected from the tissue databases that the RNA concentration, again controlled with the spike-in molecules is much higher in urine compared to plasma. So for some applications, and prostate cancer is perhaps one, it makes much more sense to look at urine compared to plasma.
Now let's switch gears a little bit. It's the Extracellular RNA Quality Control study, again using these 3,000 Twist probes to capture the spike-in RNAs, and it's actually a very large study, but today I will only discuss the RNA extraction, or the impact RNA extraction kits have on performance or the RNA pattern output. It's actually a study which consists of three phases. In the first phase we do so-called single-factor pre-experiments in which we want to document the impact that these pre-analytical variables have, to come to a phase two, an almost full factorial design in which we're going to combine these different pre-analytical variables, and finally do some real analysis like variance component analysis and a real biomarker study. We have actually finished phase one. It has already taken two years, and it's a project where 30 people are collaborating, so it's quite hefty, and after summer we'll start phase two.
As pre-analytical variables are key in liquid biopsy studies, I just want to provide you a sneak preview of the outcome of that study, and of course all data will be made publicly available. This is the summary of the experimental setup in which we compared nine different RNA extraction kits that are marketed for fluids, at their minimum input volume in yellow and their maximum input volume in dark yellow. I don't know if that's the real color, I'm colorblind, but you get the point. And then you see that we have different eluate volumes, again as specified by the kits. I just want to draw your attention to the fact that for some kits, this particular one for instance, we have about a 300-fold concentration going from the fluid down to the RNA eluate volume. For other kits, we don't enrich: 100 microliters is the input and 100 microliters of RNA is the output, so you don't enrich. And you can of course imagine that this has a dramatic impact on RNA concentration and yield, and it does.
Again, using the spike-in controls, we can actually quantify the amount of RNA that is co-purified in or purified in these different kits. There's about a 100 fold difference in RNA concentration from the best performing kit versus the worst performing kit.
Of course, this is RNA concentration. You can concentrate your sample if you wish to, for example by precipitation. So probably a fairer comparison is yield. But even then, so this is controlled for RNA eluate volume, there is still about a 40- to 50-fold difference in RNA yield depending on the kit you use.
So if you're into liquid biopsy studies where you have small input volumes, you better make sure you use the kit that will maximize your sample in terms of detection sensitivity. And finally, I think what probably matters a lot if you do sequencing is that you have a high gene yield. So you see that for the worst performing kit we have about 2,000 to 3,000 genes reproducibly detected in 200 microliters of plasma, versus up to 15,000 for the best, so the yield is quite different, and I think we definitely have made our choice as to which kits we will continue with.
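The distinction between concentration and yield is simple arithmetic, and a small Python sketch may make it concrete. Everything here is hypothetical: the kit names, the volumes for `kit_X` (chosen only so that the concentration factor comes out at 300), and the relative concentrations are invented for illustration and are not results from the study. Only the 100 microliters in, 100 microliters out case is taken from the talk.

```python
# Illustrative only: hypothetical kits and numbers, not study results.

def concentration_factor(input_volume_ul, eluate_volume_ul):
    """How many times the fluid volume is reduced going to the eluate."""
    return input_volume_ul / eluate_volume_ul


def relative_yield(relative_concentration, eluate_volume_ul):
    """Total RNA recovered, in arbitrary units: concentration x volume."""
    return relative_concentration * eluate_volume_ul


kits = {
    # rel_conc: spike-in-normalized RNA concentration in the eluate (a.u.)
    "kit_X": {"input_ul": 3000, "eluate_ul": 10, "rel_conc": 100.0},
    "kit_Y": {"input_ul": 100, "eluate_ul": 100, "rel_conc": 1.0},
}

summary = {
    name: {
        "factor": concentration_factor(k["input_ul"], k["eluate_ul"]),
        "yield": relative_yield(k["rel_conc"], k["eluate_ul"]),
    }
    for name, k in kits.items()
}

conc_ratio = kits["kit_X"]["rel_conc"] / kits["kit_Y"]["rel_conc"]
yield_ratio = summary["kit_X"]["yield"] / summary["kit_Y"]["yield"]

print(summary)
print(f"concentration differs {conc_ratio:.0f}-fold, "
      f"yield differs {yield_ratio:.0f}-fold")
```

In this made-up case a 100-fold difference in concentration shrinks to a 10-fold difference in yield once the eluate volume is taken into account, which is why yield is the fairer basis for comparing kits.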
Let's switch gears again. So these were the 3,000 Twist probes to enrich spike-in controls. Now we're going to switch to a very customized set of about half a million probes geared towards genes other than protein-coding genes. It may come as a surprise to some of you, but the majority of the human RNAs or genes actually do not code for proteins. Protein-coding genes are this green part, 21,000 genes. Then we have microRNAs in pink, but the vast majority is actually long non-coding RNAs, subdivided into two groups: the low-confidence ones, where you say, yeah, are these really genes, are they really non-coding, and then in bluish, 50,000 high-confidence long non-coding RNAs. We all love these genes because they have great biomarker potential and they're excellent therapeutic targets, but the problem is that in classic sequencing workflows they often escape detection because half of them are not polyadenylated. So we really need a workflow, and enrichment of course is perfect for this, to enrich them from the background, because on top of it they're also low abundant. So if you work with low-abundance RNAs in a liquid biopsy, which already has a low concentration of RNA that is also fragmented, you really need to rely on enrichment to pull out that low signal from the background.
So we designed probes against LNCipedia. It's a Ghent University effort, it's probably the most comprehensive human lncRNA catalog, and it's published in several papers in Nucleic Acids Research. It resulted in about 450,000 probes, which Twist synthesized for us, targeting about 52,000 long non-coding RNA genes.
So obviously we didn't want to sacrifice probes to the most challenging samples, which are liquid biopsies. We first did human brain, and I think we were pretty impressed with the performance. On the left side you see classic probes against human protein-coding genes, but also in pink here, now addressing these lncRNAs, we have pretty good reproducibility. We detect a lot of genes over a large dynamic range, and we're quite happy and confident to continue with liquid biopsy studies. And this is a snapshot of our results on plasma. Again in green are protein-coding genes, these are non-Twist probes. In pink are the Twist probes targeting the long non-coding RNAs.
I agree with you, the reproducibility is somewhat lower compared to protein coding genes, but bear in mind these are very low abundant genes in only 200 microliters of plasma. So I think we are pretty happy with the performance, especially if you look at the bottom where we say, look, we're only going to keep the data if it's reproducibly detected in two technical replicates in this study and above an FPKM value of one. I think we definitely have data that is pretty useful. We're now conducting large scale studies to see the power of sequencing lncRNAs in various fluids and see if they can help us do better diagnosis in different human diseases.
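The filter described here, keeping only genes detected in both technical replicates above an FPKM of one, can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function name, the gene identifiers, and the FPKM values are hypothetical.

```python
# Illustrative only: hypothetical gene IDs and FPKM values.

FPKM_THRESHOLD = 1.0


def reproducibly_detected(fpkm_rep1, fpkm_rep2, threshold=FPKM_THRESHOLD):
    """Return the sorted gene IDs with FPKM above `threshold` in both replicates.

    Genes missing from either replicate are treated as not detected.
    """
    shared = fpkm_rep1.keys() & fpkm_rep2.keys()
    return sorted(
        gene
        for gene in shared
        if fpkm_rep1[gene] > threshold and fpkm_rep2[gene] > threshold
    )


rep1 = {"lnc-A": 5.2, "lnc-B": 0.4, "lnc-C": 1.8, "lnc-D": 3.0}
rep2 = {"lnc-A": 4.7, "lnc-B": 2.1, "lnc-C": 1.1}

kept = reproducibly_detected(rep1, rep2)
print(kept)  # lnc-B fails in replicate 1, lnc-D is absent from replicate 2
```

Requiring both replicates is a conservative choice: it trades some sensitivity for confidence that a low-abundance signal is not a one-off sampling artifact.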
To finish this one off, just to show you the power of enrichment even more for low-abundance genes, this is a well-known long non-coding RNA called GAS5, where I show you the coverage of the different exons from plasma samples. Two replicates here, and you see that the enrichment is pretty good, on target. It's relatively uniform, and these arches really denote alternative splicing, and this is very much what we expected, because every single exon of a long non-coding RNA is alternatively spliced and you really create a lot of diversity in terms of splicing. This just jumps out at you from a liquid biopsy, from plasma, and it was quite encouraging to continue this type of research.
And finally I want to provide you some of the data and results we generate by comparing, let's say the latest and the greatest of Twist Bioscience enrichment workflow.
I think the differences with what we have done so far, which I presented, are quite numerous. So we used to work with single-plex, now we can do an 8-plex with Twist. The hybridization and the washing buffers are different and they're proprietary to Twist. We used to do a dual capture round, now we rely on a single capture, and we used to work with shorter probes, 80-mers, and now we work with 120-mer probes. To really make data comparable, and I think you can introduce a lot of biases in doing these kinds of benchmarking studies, it's really important that we rely on the same amount of data. So we did sub-sampling to 5 million reads, and this is actually work done by Annelien Morlion, another talented PhD student at Ghent University. We compared platelet-rich plasma with platelet-free plasma, so quite divergent in terms of RNA content, and we had a sufficient number of replicates to do a robust assessment comparing these two different workflows. So when we assessed the PCR duplication level, we already noticed differences. You may be shocked, and we've seen PCR duplication levels from the previous speaker. Bear in mind, this is extremely low input. So from a liquid biopsy, the duplication levels are high, typically around 80% for platelet-rich plasma, and 90% and higher in platelet-free plasma. We know it's there, but you can deal with it. You simply remove the PCR duplicates. But I want to draw your attention to the fact that with Twist we have about 5-7% lower PCR duplication levels, which is good. And then you may say, well, 5%, that's not a lot. Well, it actually matters a lot, especially at these high duplication levels. For instance, if we then use that information to remove the duplicates and only retain the unique fragments, you see that in platelet-free plasma we have about 40% more unique reads on target compared to the other enrichment technology.
At least for us, this suggests that the enrichment is much more sensitive and that we have a larger diversity of molecules and fragments in our ultimate library, and this is important if we want to use that type of data for, for instance, mutation analysis. The difference is smaller for platelet-rich plasma, which has a richer transcriptome, but it's still 30% more unique reads, comparing this complete Twist workflow with what we have been doing in the past.
Also, when we look at how deep we can sequence into the transcriptome and how many genes we detect, we have about 10% more genes with the Twist workflow, and we probe about 4-5% deeper into the transcriptome, which is actually 50% more in relative terms, most likely because the enrichment is more efficient and more sensitive.
From a high-level comparison, I think we see similar performance for some of the performance metrics such as strandedness and repeatability, but we definitely see better performance in terms of PCR duplication levels, gene detection count, and the depth into the transcriptome that we cover with this enrichment workflow. I think most importantly, why we like the Twist Bioscience workflow so much is the higher level of customization and the flexibility. So among the projects that are now planned or ongoing, because we've seen the successful benchmarking data, is a combined human and murine exome enrichment from plasma from murine PDX models. It's even more challenging because we have combined human and mouse RNA and only 60 microliters of plasma. Very challenging. And we want to do that combined enrichment and learn how much RNA from the human tumor that's implanted into the mouse model is actually leaking, and what we can learn from that. Another one is that we're going to completely redo the LNCipedia design, so it's LNCipedia 5.2. We are currently finishing the designs and will have them synthesized as soon as possible. And then also, because of the level of flexibility, we are doing a custom enrichment for rat genes. It's about 1,300 genes and you might say, well, that's easy, right? And it's even tissue. I just want to draw your attention to the fact that these are extremely rare transcripts. Collectively they consume 0.07% of all the reads if you do classic RNA sequencing. So we really have to rely on extreme enrichment to direct all our sequencing power to these rare genes.
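Why a few percentage points of PCR duplication matter so much at these levels can be shown with a short, hedged calculation in Python. This is a simplified sketch, not the benchmarking analysis itself: the function names are made up, the duplication rates are round illustrative values rather than the measured ones, the difference is interpreted as percentage points, and the on-target rate is ignored.

```python
# Illustrative only: round numbers, not the measured values from the benchmark.

def unique_reads(total_reads, duplication_rate):
    """Reads left after removing PCR duplicates."""
    return total_reads * (1.0 - duplication_rate)


def relative_gain(dup_old, dup_new):
    """Fractional increase in unique reads when duplication drops."""
    return (1.0 - dup_new) / (1.0 - dup_old) - 1.0


TOTAL_READS = 5_000_000  # the sub-sampling depth mentioned in the talk

# Same 5-point drop in duplication, starting from different levels.
gain_platelet_free = relative_gain(0.90, 0.85)  # high duplication
gain_platelet_rich = relative_gain(0.80, 0.75)  # lower duplication

print(f"unique reads at 90% duplication: {unique_reads(TOTAL_READS, 0.90):,.0f}")
print(f"unique reads at 85% duplication: {unique_reads(TOTAL_READS, 0.85):,.0f}")
print(f"gain from 90% to 85%: {gain_platelet_free:.0%}")
print(f"gain from 80% to 75%: {gain_platelet_rich:.0%}")
```

With only 10% of reads unique to begin with, every extra percentage point is a large relative gain, and the same drop yields a smaller gain when duplication starts lower. That is consistent with the direction of the platelet-free versus platelet-rich difference, even though these round numbers do not reproduce the exact figures.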
To conclude my contribution to the seminar, I hope I've convinced you that all human biofluids, and there are many, contain RNA that likely reflects health and specific disease states. We're actually doing follow-up studies with our collaborators. We have developed and benchmarked capture sequencing workflows to probe this extracellular RNA, and we offer that as a service. And then the Twist Bioscience enrichment provides a very flexible and high-quality solution for custom target enrichment. That leads me to thank my colleagues at Biogazelle and Ghent University, and also Illumina and Twist, for their valuable contributions and their interest in what we do. I may need to rush off soon after the seminar, so I'm happy to take questions now if you want to approach me. My email address is there if you have follow-up questions. So don't be shy to send an email. Thank you very much for your attention.