According to Gene Myers, (near) perfect genome assembly is within reach for any organism of your choice.
Time will tell if he’s right, but being an influential bioinformatician who has made key contributions to sequence comparison algorithms such as BLAST, to whole-genome shotgun sequencing and to genome assembly, one would think he knows what he’s talking about!
In a conference at the PRBB auditorium today, he explained to a mixed audience of biologists and computer scientists how, after a few years devoted to other topics (mostly image analysis), he is now coming back to sequencing with great excitement. The reason: the PacBio RS II. This sequencing device produces very long reads (of more than 10,000 bp!) and has a couple of other characteristics that could make full assembly possible: although error rates are high (10-15%), the errors are random, unlike with other techniques, which tend to make the same errors over and over. Sampling is also random. This randomness, together with the length of the reads, means that with enough sequencing coverage you can always recover the right sequence.
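A toy simulation (not Myers’ actual algorithm; the numbers are purely illustrative) shows why random, unbiased errors are so forgiving: a simple per-position majority vote over enough reads almost always recovers the true base, even at a 15% error rate. Systematic errors, which hit the same positions the same way in every read, would not average out like this.

```python
import random

random.seed(42)

def simulate_consensus(n_positions=1000, coverage=30, error_rate=0.15):
    """Majority-vote consensus accuracy over reads with *random* errors."""
    bases = "ACGT"
    truth = [random.choice(bases) for _ in range(n_positions)]
    correct = 0
    for t in truth:
        votes = {}
        for _ in range(coverage):
            # each observation is wrong with probability error_rate,
            # and the error lands on a random *other* base
            if random.random() < error_rate:
                observed = random.choice([b for b in bases if b != t])
            else:
                observed = t
            votes[observed] = votes.get(observed, 0) + 1
        if max(votes, key=votes.get) == t:
            correct += 1
    return correct / n_positions

for cov in (1, 5, 15, 30):
    print(f"coverage {cov:2d}: consensus accuracy {simulate_consensus(coverage=cov):.3f}")
```

At 1x coverage the accuracy is simply one minus the error rate, but by 30x the majority vote is essentially always right — which is the intuition behind “with enough coverage, you can always get the right sequence”.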
So now all we need, Myers says – apart from waiting for the cost of the PacBio to go down, which he promised will happen soon (4x in one year) – is to build an efficient assembler. He talked about what he and his colleagues have been doing in that direction. The main element is a ‘scrubber’ to clean and edit the reads while removing as little data as possible. His point was that even though people have been focusing on the assembly step, the real problem is the data: the contaminants, the chimeras, the excessive error rates… So he presented his own ‘data cleaner’, DAscrub, soon to be released.
You can read more details about his recent work on this in his blog.
In the meantime, his advice to the world: stop the 10,000 genomes project right away and wait a couple of years for better sequences!
Xavier Estivill and his Genomics and Disease research group at the CRG are trying to find the genetic causes of complex diseases using the latest genomic technologies. His group focuses on central nervous system diseases and on non-coding RNAs, and he is also involved in international sequencing projects such as the International Cancer Genome Consortium (ICGC). Hear him explain his research in this short video!
Fátima Al-Shahrour, from the CNIO in Madrid, came last week to the PRBB to give a talk entitled “Bioinformatics challenges for personalized medicine”. She explained what they do at her Translational Bioinformatics Unit in the Clinical Research Programme. And what they do is both exciting and promising.
They start with a biopsy of a tumour from a cancer patient who has relapsed after initial treatment – they concentrate mostly on pancreatic cancer, but in principle it would work with any. From this sample they derive cell lines, but also – and in this they are quite unique – they generate a personalised xenograft. That is, they implant the human tumour in an immunocompromised mouse, creating an ‘avatar’ of the patient. After passaging it from one mouse to another (they use about 60 mice per patient), they extract the tumour and analyse it by exome sequencing (and sometimes gene expression profiling, etc.). They then have about 8 weeks to find, using bioinformatics, druggable targets, which they then test on the avatar. The drugs that work on the mouse are then given to the patient.
The advantages of this system are many and obvious: not only can the in vivo model be used to validate the hypotheses generated by the genetic analysis, but we basically have a personalised cancer model for each patient in which we can try as many drugs as we want. It can be cryopreserved, so we have unlimited access to the sample. And since cancer is not a disease we can cure yet – patients must keep being monitored for possible relapses, metastases or resistance to treatment – keeping the mouse in parallel with the patient can help predict how the patient will react: whether they will develop resistance to a drug, which other mutations might appear, etc.
But there are several disadvantages, too. One is hinted at in Fátima’s talk title: the bioinformatic analysis of the tumours to find which mutations are important in the disease (the drivers) and which of them have drugs that target them is challenging, not least because an individual cancer genome can have hundreds to thousands of mutations.
Perhaps the biggest barrier is that, at the moment, making these avatars is inefficient, very expensive and slow. And since the patients who would benefit from this technology are already in a very bad clinical condition, many of them don’t live long enough to enjoy those benefits. But there are some successful cases, and Fátima mentioned a couple. In one, a man with pancreatic cancer who was treated with mitomycin after all the tests in his avatar survived more than 5 years, when he had been given 1 year at the most.
So there is hope in the field of personalised medicine, despite the fact that this is still not standard, and probably won’t be in the near future. And, as someone in the audience mentioned, in an ideal future we might even have personalised prevention, according to our genetic makeup. Wouldn’t that be great?
A report by Maruxa Martinez, Scientific Editor at the PRBB
Does your research involve dealing with a huge amount of high-throughput data? Are you worried about the interpretation of your Illumina sequencing data? Illumina’s Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. If you use them or are thinking of using them, you might be interested in having a look at the latest paper from Heinz Himmelbauer and his colleagues at the CRG ultrasequencing unit, published in Genome Biology. Find out about the errors and biases they report to make sure your data analysis is of the highest quality!
Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol. 2011 Nov 8;12(11):R112 [PDF]
Michael Snyder is the director of the Yale Center for Genomics and Proteomics, as well as Professor at Yale University. He studies protein function and regulatory networks using global approaches and high-throughput technologies, such as genomics and proteomics. During his visit to the PRBB he told us about the latest insights into human variation.
What are the pros and cons of high-throughput technologies?
There’s no question they are helping us advance in our knowledge. With genomics or proteomics experiments we discover things we would not have discovered by studying individual genes, and we have learned some basic principles out of these large datasets. Of course, there’s also an information overload and a lot of the data are still uninterpretable, but that makes it fun!
When will I be able to have my genome sequenced?
Nowadays you can already get it done, if you have enough money, and I am sure all of us will have the opportunity to have our own genomes sequenced at some point at a reasonable price.
And would that be useful?
Not that much right now, but the more genomes we have, the more useful they will be, because we can compare them and learn much more. Of course there are also ethical issues about the possibility of discrimination in employment and health insurance because of your genetic influences, and that is something that we will have to deal with first.
What have been the big surprises of biology in the last 15 years, after the human genome and Encode projects?
The extent of divergence in gene regulation has been a big surprise – there’s a plethora of transcription factor binding sites (TFBS) in the genome, many more than expected. And they change so quickly between species.
How can we be so different from chimps, if we are 98.5% identical at the genetic level?
I would say the difference between species is probably at the gene regulation level, rather than at the gene level. We have pretty much the same genes, but they are regulated differently and expressed at different times. They also interact with different proteins.
How about the differences between males and females – at the molecular level?
One curious thing we have found is a difference in the expression of genes involved in osmotic stress. This could explain the physiological differences between men and women with respect to heart attacks and other cardiovascular diseases, which tend to be more frequent in men.
You work on pretty much anything: from yeast to human, on genes, RNA and proteins, from a single protein to whole cellular networks… what do you find most fascinating?
I like them all, this is the nature of biology! We know almost nothing now compared with what we are going to learn in 20 years. There is so much to learn, and we follow whatever makes most sense to solve a problem. Yeast, for example, is very good for working out new technologies before using them in other models, or for solving basic problems. It’s naïve to just look at one thing. We have to look at nature at many levels.
This is an interview published in Ellipse, the monthly magazine of the PRBB.
Despite all having the same DNA content, each cell is different. The phenotypic differences observed between cells depend on differences in the RNA transcript content of the cell. And this variability in transcript abundance is the result not only of gene expression variability, which has been studied for many years and is usually measured using DNA arrays, but also of alternative splicing variability. Indeed, changes in splicing ratios, even without changes in overall gene expression, can have important phenotypic effects. However, little is known about the variability of alternative splicing among individuals and populations.
Taking advantage of the popular use of RNA-seq (or “Whole Transcriptome Shotgun Sequencing”), a technique that sequences cDNA in order to get information about a sample’s RNA content, a team of researchers at the CRG have recently published in Genome Research a statistical methodology to measure variability in splicing ratios between different conditions. They have applied this methodology to estimates of transcript abundances obtained from RNA-seq experiments in lymphoblastoid cells from Caucasian and Yoruban (Nigerian) individuals.
Their results show that protein coding genes exhibit low splicing variability within populations, with many genes exhibiting constant ratios across individuals. Genes involved in the regulation of splicing showed lower expression variability than the average, while transcripts with RNA binding functions, such as long non-coding RNAs, showed higher expression variability. The authors also found that up to 10% of the studied protein coding genes exhibit population-specific splicing ratios, and that variability in splicing is uncommon without variability in transcription.
While acknowledging the limitations of their work (e.g. RNA-seq is still very new and not completely understood, and the data on which they base their analysis belong to the first and only human RNA-seq studies published so far), the authors conclude that “given the low variability in the expression of protein coding genes, phenotypic differences between individuals in human populations are unlikely to be due to the turning on and off of entire sets of genes, nor to dramatic changes in their expression levels, but rather to modulated changes in transcript abundances”.
The researchers, led by Roderic Guigó, present in the same paper a new methodology to estimate the relative contributions of gene expression and splicing variability to the overall transcript variability. They estimate that about 60% of the total variability observed in the abundance of transcript isoforms can be explained by variability in transcription, and that a large fraction of the remaining variability likely results from variability in splicing.
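As a back-of-the-envelope illustration of this kind of decomposition (a toy model, not the estimator used in the paper, and with made-up parameters), suppose that on a log scale an isoform’s abundance is the sum of a gene-expression term and an independent splicing-ratio term. The variances then simply add, and the share attributable to transcription falls out directly:

```python
import random

random.seed(0)

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Toy model: log(isoform abundance) = log(gene expression) + log(splicing ratio),
# with the two terms drawn independently across simulated individuals.
n = 10000
log_expr  = [random.gauss(5.0, 1.0)  for _ in range(n)]   # gene-level expression
log_ratio = [random.gauss(-1.0, 0.8) for _ in range(n)]   # splicing-ratio term
log_isoform = [e + r for e, r in zip(log_expr, log_ratio)]

# Fraction of isoform-level variability explained by transcription
frac_transcription = variance(log_expr) / variance(log_isoform)
print(f"fraction explained by transcription: {frac_transcription:.2f}")
```

With these (arbitrary) spreads the transcription term accounts for roughly 60% of the total variance, which mirrors the shape of the paper’s conclusion, though their actual methodology works on estimated transcript abundances, not on a simulation like this.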
Guigó, last author of this paper, has recently received an ERC Advanced Grant, the most prestigious given to scientific projects in Europe, in the category of Physical Sciences and Engineering. The €2 million awarded over five years will allow his team to carry out the study of RNA using massively parallel sequencing techniques.
Gonzalez-Porta M, Calvo M, Sammeth M, Guigo R. Estimation of alternative splicing variability in human populations. Genome Res. 2011 Nov 23; [PDF]
On the second day of the conference there were some more interesting talks at the “Computational Biology of Molecular Sequences” X CRG Symposium, taking place at the PRBB Conference Hall. I will focus on one talk from each of the sessions (genome regulation, RNA analysis and genome annotation), although all were very interesting!
Ron Shamir (Tel Aviv University) presented Amadeus, a software platform for genome-scale detection of known and novel motifs in DNA sequences, and explained some of the findings they have made with it. He also presented his new book, “Bioinformatics for Biologists”, which will surely be very useful for the many biologists drowning in today’s sea of data and tools for analysing it.
Anna Tramontano (Sapienza University), the only woman among the 20 invited speakers and a very well-known figure in the protein world, gave her first ever RNA talk, as she herself put it. She talked about a new method of gene expression control: a long ncRNA that contains 2 miRNAs within its sequence and competes with those miRNAs for binding to their target genes.
Tim Hubbard (Sanger Institute) gave so much information in 45 minutes that it was hard to keep track of it all. He started with the catch-22 of reference genomes: we want the reference to be complete, but we don’t want it to change. The proposed solution: keep the reference genome and release patches with ‘novel’ information or with corrections (the ‘fix’ patches) whenever more information comes in. This means, he warned the audience, that alignment algorithms will need to be aware of patches.
He then moved on to the cost of sequencing a human genome (£5,000 as of October 2011) and said that every 2-4 years the cost drops 10-fold! With these ever-falling costs, he said, there has been quite a lot of movement in the UK regarding future policies on genomic medicine. And the main question is: what is the health-economic value of having all this information, of sequencing the whole population? Nobody knows yet, but according to Hubbard, one day the cost of sequencing will drop low enough, and the usefulness of the information will grow enough, that the two will meet and make it viable.
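Hubbard’s arithmetic is easy to sketch (a back-of-the-envelope projection using the figures from the talk; the fast and slow scenarios below are assumptions for illustration, not his forecast):

```python
def projected_cost(start_cost, years, tenfold_every):
    """Cost after `years`, if it drops 10-fold every `tenfold_every` years."""
    return start_cost / 10 ** (years / tenfold_every)

start = 5000  # pounds per genome, October 2011 (figure from the talk)
for years in (2, 4, 8):
    slow = projected_cost(start, years, tenfold_every=4)  # one drop per 4 years
    fast = projected_cost(start, years, tenfold_every=2)  # one drop per 2 years
    print(f"after {years} yr: between £{fast:,.0f} and £{slow:,.0f}")
```

Even in the slow scenario the price falls two orders of magnitude within a decade, which is why the cost curve and the usefulness curve are expected to cross eventually.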
He finally presented the ITFoM (IT Future of medicine) project, one of the six funded by the Future and Emerging Technologies (FET) flagship programme of the EU – which has the goal of “encouraging visionary, “mission-oriented” research with the potential to deliver breakthroughs in information technology with major benefits for European society and industry”. The ITFoM project is expected to run for at least 10 years and will receive funding of up to € 100 million per year. Considering what they aim to do – integrating all available biological data to construct computational models of the biological processes that occur in every individual human – they will certainly need that much money… Just consider this fact: to cover all the ‘cancer genomes’ appearing every day, we would need to sequence a new genome every 2 seconds!
So, that was my own pick of the day. Of course, much more happened at the meeting. You can find summaries of all talks and much more at the Symposium’s website http://2011symposium.crg.es/
And if you are interested in Computational Biology, don’t miss two upcoming events also at the PRBB:
- XIth Spanish Symposium on Bioinformatics 2012, 23-25 January 2012 (#jbi2012)
- Recomb2012 16th Annual International Conference on Research in Computational Molecular Biology, 21-24 April 2012
Report by Maruxa Martinez, Scientific Editor at the PRBB