April 12, 2013 § 4 Comments
Is it weird for a Spaniard to have red hair? The typical stereotype for a Mediterranean person is brown-skinned, not too tall and with dark hair. I do not seem to fit all those stereotypes very well, except for the dark hair. At least so I thought until I posted this picture of my beautiful family on my Facebook profile:
Of the five of us I am the only one without red hair. Seeing this picture really brought it home to me: it was strange that every one of my three children had inherited my wife’s ‘recessive’ red hair!
I did not give a lot of importance to this until my colleague and Facebook friend Dave Adams, who happens to lead a research group at the Wellcome Trust Sanger Institute, asked me whether I had checked the MC1R gene.
The protein encoded by the MC1R gene is found in melanocytes, the cells that give hair and skin their color. The variants associated with red hair alter the protein’s function, tipping the balance of pigment production in melanocytes from black-brown eumelanin to red-yellow pheomelanin.
Dave is well aware of my efforts to crowdsource the analysis of my genome data and that of my blood relatives (parents, siblings, aunts and uncles). Since I have had my exome sequenced, I followed Dave’s suggestion and looked for the amino acid changes he mentioned (R151C, R160W and D294H) in the MC1R gene. Below you can see some of the comments from our conversation on Facebook:
I have a VCF file with all the variants in my genome available on figshare for public download. I searched the file for the interval 89978527-89987385 on chromosome 16, where the MC1R gene is located, and found:
16 89986091 rs11547464 G A
This indicates that at position 89986091 there is a change of a single letter (SNP rs11547464) that makes my DNA differ from the human reference genome at that position. The reference genome has a G whereas I have an A.
I also looked at my 23andMe genotype using myKaryoView, which also includes this rs11547464 SNP, and found that my genotype is ‘AG’. Doing some research with this I found that AG at the rs11547464 SNP encodes a missense change in the protein sequence (R142H), making me a ‘carrier’ of ‘red hair’.
Further information about the relation of this SNP to the phenotype showed that this mutation has been reported to be deleterious and that this MC1R variant is “functional”.
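The search described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the function name `variants_in_region` is made up for the example, the sample records other than rs11547464 are invented placeholders, and the code assumes a plain, tab-separated, uncompressed VCF.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Return (chrom, pos, id, ref, alt) for VCF records inside [start, end]."""
    hits = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header/meta lines
        fields = line.rstrip("\n").split("\t")
        record_chrom, pos, var_id, ref, alt = fields[:5]
        # Some files write "chr16" rather than "16"; accept both spellings.
        if record_chrom.startswith("chr"):
            record_chrom = record_chrom[3:]
        if record_chrom == chrom and start <= int(pos) <= end:
            hits.append((record_chrom, int(pos), var_id, ref, alt))
    return hits


# Interval on chromosome 16 quoted in the post for the MC1R gene.
MC1R_START, MC1R_END = 89978527, 89987385

# Tiny in-memory stand-in for a VCF file. Only the rs11547464 record comes
# from the post; the other two records are hypothetical filler.
sample_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "16\t89000000\t.\tC\tT\t.\t.\t.",
    "16\t89986091\trs11547464\tG\tA\t.\t.\t.",
    "17\t89986091\t.\tG\tA\t.\t.\t.",
]

mc1r_hits = variants_in_region(sample_vcf, "16", MC1R_START, MC1R_END)
for hit in mc1r_hits:
    print(*hit)
```

To run this against a real file, pass an open file handle instead of `sample_vcf`, since the function only needs an iterable of lines.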
According to Dave, I am a carrier for this red hair mutation and presumably my wife is homozygous for another variant, with my kids being compound heterozygous (carrying a different variant on each copy of the gene). This means that perhaps my wife has another variant somewhere that also contributes to my children having red hair.
This explains, at least partly, why my offspring’s red hair is so strong, something that in principle should be self-evident from the picture above. There is something satisfying though about being able to confirm the obvious with scientific evidence.
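Dave’s scenario can be worked through as simple Mendelian arithmetic. The sketch below is only a toy model under loud assumptions: I carry R142H on one copy of MC1R and the reference allele (labelled `wt`) on the other, my wife carries the same unidentified variant (labelled `X`, a placeholder) on both copies, and each parent passes on either copy with equal probability.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(father, mother):
    """Return each possible child genotype with its Mendelian probability."""
    # Every child receives one allele from each parent; all pairings are
    # equally likely, so we simply count them.
    counts = Counter(tuple(sorted(pair)) for pair in product(father, mother))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


father = ("R142H", "wt")  # heterozygous carrier (from the post)
mother = ("X", "X")       # assumed homozygous for a hypothetical variant X

probabilities = offspring_genotypes(father, mother)
for genotype, probability in probabilities.items():
    print(genotype, probability)

# Chance that three children are all compound heterozygous (R142H plus X).
p_compound_het = probabilities[("R142H", "X")]
p_all_three = p_compound_het ** 3
print("all three children compound heterozygous:", p_all_three)
```

Under these assumptions each child has an even chance of inheriting R142H from me, so the model is a way of stating the hypothesis precisely rather than a confirmation of it.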
January 21, 2013 § Leave a Comment
Readers may remember the crowdfunding campaign that we ran to collect funds to sequence the genomes of the Corpas family. We are pleased to announce the immediate release of our personal exomes (the coding portions of our genes), currently under a CC-BY license for reasons of license compatibility. You have permission to use these data in any way you wish as long as you attribute them to the Corpas family.
Where is it available?
We have decided to make the data available through figshare because it makes the data immediately citable by providing a DOI. So here is where the trio data can be downloaded:
Please note that the above data include only the latest sequencing data from our family: exome data from mother, father and daughter. Previously released data from the son’s exome are here:
Why do we release our personal exomes?
When my family and I made our genotypes available on the Internet, we immediately received results from researchers around the world who took our data for analysis and came back with interesting insights. As a result, we have been able to learn much about ourselves. I have reported this in a previous entry on this blog entitled “Benefits for Publishing Family Genomes on the Internet”. We now follow the same principle: if we make our exomes available for people to analyse, we can expect that some researchers will come back with interesting results.
What new data do we actually release?
Fastq files for whole exome sequencing from the Corpas family: mother, father, daughter. The data comes from 3 saliva samples. Exome capture was performed using Agilent SureSelect Human All Exon 44.
The captured material was sequenced using Illumina’s HiSeq technology.
The data is expected to have 30X effective mean depth per sample after removal of adaptor contamination and low-quality sequence.
What do we ask in return?
We do appeal to the good will of potential users to report back to us anything interesting they might find.
How big are the files?
They are huge. On average they are about 1 Gb per file and we have 6 of these. That means that it can take several hours for each file to be downloaded. Please be patient!
Where can I get them?
The top link is for mother, father and daughter. The bottom link is for son.
How did we get our personal exome sequenced?
Completely independently. If you want to know the story of how I did it myself, please refer to my blog entries “Getting My Genome Sequencing Done” Part I and Part II. As implied there, we managed to get my personal genome sequenced by knocking on quite a few doors and then finding someone who would sponsor us to do so. In fact, part of the aim of this exercise was to prove that it is nowadays possible for ordinary citizens to get their genomes sequenced if they so wish. We now go a step further by publishing our whole exomes on the Internet.
August 6, 2012 § 1 Comment
After a little more than 40 days, we are delighted to close the fundraising collection for the crowdfunding genome project. We have been very lucky to raise $3,526.59 USD. Yes, this is not the $20,000 that we needed to carry out whole genome sequencing for 5 members of my family, but this money will be enough to do the exome sequencing of my parents and sister. Certainly this is a step forward in our adventure to understand ourselves better.
For those who would like to know where the funds come from, they all come from family and friends.
We are planning to create a website where all of the data, publications and stories related to our genome findings will be collected. This website will be made public sometime in the fall of 2012.
June 5, 2010 § 5 Comments
This is not an exhaustive list, but rather a compendium of current problems that I encounter on a regular basis. This post might be especially useful for students who want to find a challenging problem for their research or simply anyone interested to know some of the science that goes on at the Wellcome Trust Genome Campus and beyond.
- To understand genome variation. How to explain variation within and between species? What are the mechanisms that produced those changes? How can those changes explain different susceptibilities to diseases and traits?
- To predict a genotype given a phenotype. How to correlate phenotypic terms to specific mutations? How to encode phenotypes in a computationally friendly format?
- To understand genetic heritability of complex diseases like Alzheimer’s, Parkinson’s or stroke. Genome-wide association studies (GWAS) have shown that the contribution of any one gene to specific complex diseases is meager or marginal in most cases. What models are needed to describe how mutations lead to disease? What pieces of the puzzle are missing?
- To optimally manage the data resulting from large scale experiments. How to store this data and make it accessible? Where to store it? Locally? In the cloud? How to make sure that no important data is lost?
- To optimally integrate data from disparate sources for analysis. Should we use federated systems? How to combine the ever-growing number of formats? What software to use to make such analyses possible? How to visualize this data more intuitively?
- Data privacy and accessibility. As more and more sensitive data is produced for analysis of patients’ genomic disorders, how not to hamper reproducibility of experiments? At the same time, how can we protect the privacy of patients? How to secure systems where sensitive data is stored?
- Understanding the effects of epigenetics in molecular regulation and disease. What mechanisms are available for molecular regulation? How do they affect gene expression? What molecular agents are involved in epigenetic regulation?
- Understanding the role of RNAs as enzymes and regulatory entities. How many different kinds of RNA are there? What is their function? How did they evolve?
- How do transmembrane proteins fold? Given a protein sequence, can we predict its final 3D functional state? How does the cellular membrane affect the folding process? What helper molecules are involved to make sure that the protein folds correctly?
- Automatic extraction and text mining. Given the current mass of scientific literature, how can we extract automatically this knowledge from text? How close can we get for computers to “understand” human language? How to structure scientific literature to make it more machine-readable?
I am surely missing many other important topics, and I apologize for those I have left out. Feel free to add your own if you wish.
September 28, 2009 § Leave a Comment
Array-CGH (Comparative Genomic Hybridization) is becoming a common method for analysing patients’ genomes. Array-CGH works by taking a reference genome covering the whole human genome sequence, cutting it into thousands of pieces and attaching them in order to a chip. These pieces are called probes and are usually in the range of 500-2000 DNA bases long. A saliva or blood sample is then taken from the patient and the patient’s DNA is also chopped into thousands of pieces, suspended in a solvent. The array is then washed with the suspension containing the patient’s DNA.
DNA is a double chain of nucleotide bases where one chain complements the other. Knowing one chain of the DNA, it is possible to know the other chain. In its natural state, a single DNA chain will tend to bind to its complementary chain. Thus, washing the array probes with the patient’s suspension makes the patient’s DNA pieces bind to their complementary DNA on the array.
Array-CGH can be used to detect whether a patient has a region of the genome missing or duplicated. Probes attached to the chip emit a different color depending on their state of binding. Once the array is washed, most of the probe spots will appear yellow, meaning that those probes of the reference genome are bound to the patient’s DNA. If a DNA region is missing in the patient, the complementary spots in the array appear in red. These red spots appear consecutively, mapping to the stretch of the reference genome that is missing in the patient. Depending on the genes that overlap the deletion, different symptoms may appear in the patient.
The same happens if the array shows a series of green spots, indicating that a duplicated region of the genome has been found in the DNA of the patient. Because the gene content will be altered in the duplicated region, this may cause disease as a consequence of the over-expression of genes included in the duplicated regions.
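The reading of the array described above can be sketched as a small Python routine. It is a deliberately simplified illustration that follows the color scheme of this post (yellow for normal, red for missing, green for duplicated); real analysis software works on numeric intensity ratios rather than color names. The function name `call_imbalances`, the `min_probes` threshold and all coordinates are assumptions made up for the example.

```python
from itertools import groupby

# Colour scheme as described in this post.
CALLS = {"red": "deletion", "green": "duplication"}


def call_imbalances(probes, min_probes=2):
    """Report runs of consecutive red or green probes as candidate regions.

    `probes` is a list of (start, end, colour) tuples for one chromosome,
    sorted by position on the reference genome.
    """
    regions = []
    for colour, run in groupby(probes, key=lambda probe: probe[2]):
        run = list(run)
        # A single odd-coloured spot is treated as noise in this sketch.
        if colour in CALLS and len(run) >= min_probes:
            regions.append((CALLS[colour], run[0][0], run[-1][1], len(run)))
    return regions


# Hypothetical probe read-out along one chromosome.
probes = [
    (1000, 2000, "yellow"),
    (3000, 4000, "red"),
    (5000, 6000, "red"),
    (7000, 8000, "red"),
    (9000, 10000, "yellow"),
    (11000, 12000, "green"),
    (13000, 14000, "green"),
    (15000, 16000, "yellow"),
    (17000, 18000, "red"),
]

regions = call_imbalances(probes)
for kind, start, end, n_probes in regions:
    print(f"{kind}: {start}-{end} ({n_probes} probes)")
```

Requiring at least two consecutive probes is a design choice of this sketch: it reflects the idea that a genuine deletion or duplication shows up as a sequential stretch of spots rather than as one isolated spot.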
Thus, using array techniques, we are now able to find deletions or duplications in the genome of a patient beyond the microscopic level, i.e. changes not directly observable. We are all familiar with the features of a patient with Down’s Syndrome. This syndrome arises because there is an extra copy of chromosome 21 in the affected patient, due to a duplication of one of the two usual copies (Trisomy 21).
Most chromosomal deletions and duplications occur at the molecular level and, unlike the trisomy in Down’s Syndrome, are not identifiable with microscopic techniques. Until recently most patients suspected of suffering from genomic diseases, i.e. diseases caused by pathogenic deletions or duplications, went undiagnosed because techniques did not allow detection beyond big chromosomal changes (like whole chromosomes). Techniques such as array-CGH now allow detection of chromosomal changes a thousand times smaller in length.
For a price of about £100 per array one can have one’s genome screened for chromosomal changes. In fact, it seems that most of the genetic differences between any two people (in terms of number of DNA bases) are due to micro-deletions and micro-duplications (called Copy Number Variations), just the level we are now starting to handle with current analysis techniques. Next generation sequencing technologies are fast arriving that will allow the complete base-by-base sequencing of a person’s DNA at a price of $1000 in a short period of time.
H.V. Firth, S.M. Richards, A.P. Bevan, S. Clayton, M. Corpas, D. Rajan, S. Van Vooren, Y. Moreau, R.M. Pettett, N.P. Carter (2009). DECIPHER: DatabasE of Chromosomal Imbalance and Phenotype using Ensembl Resources. The American Journal of Human Genetics.
J.R. Lupski (2009). Genomic disorders ten years on. Genome Medicine.
E.R. Mardis (2006). Anticipating the 1,000 dollar genome. Genome Biology.
February 8, 2009 § 1 Comment
Cloud computing is becoming a technology mature enough for use in genome research experiments. Their large datasets, highly demanding algorithms and sudden need for computational resources make large-scale sequencing experiments an attractive test case for cloud computing. So far I have seen cloud computing demonstrated using R (1). However, a rigorous comparison of its performance using a BLAST (2) search, and of its ability to cope with ever-increasing databases and open source frameworks such as bioperl (3) or bioconductor (4), remains to be seen.
Cloud computing claims to be a resource where IT power is delivered over the Internet as you need it, rather than drawn from a desktop computer (5), in a fashion seemingly similar to having your own virtual servers available over the Internet (6). Some of the most important aspects of cloud computing are:
* Software as a Service (SaaS): where you buy a software license for a determined period of time.
* Utility Computing: storage and virtual servers that IT can access on demand.
* Web Services.
My first exposure to cloud computing came from an email from Matt Wood (7), a newly established group leader at the Sanger Institute (8), announcing the Cloud Computing Group (9) in Cambridge, UK. At that point I had no idea what it meant. When I attended the meeting at Cambridge University’s Centre for Mathematical Sciences (10), to my surprise I found a very select audience there, including the director of IT at Sanger, Phil Butcher (11), one of the Ensembl (12) software coordinators, Glenn Proctor (13), and quite a few local start-up companies.
Among the presenters, we had Simone Brunozzi, from Amazon’s Cloud Computing (14). I think he had an interesting story to tell: how Amazon, a well known company, is now involved in the business of cloud computing and selling it. Apparently, the technology they sell was developed for Amazon’s own business. Among their main challenges was addressing the capricious shopping habits of customers, with orders peaking around Christmas and staying quite flat the rest of the year. These trends required rapid adaptability of computational resources. The idea of cloud computing fitted well with their e-commerce business model: you don’t need to care about where your computation is done, the only thing you care about is that you have the resources you need and do not have to pay for them when you don’t need them. One of the things that struck me about Amazon’s presentation was that they would not tell us the number of processors they had at their disposal.
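The economic argument can be made concrete with a toy calculation. All numbers below are invented for illustration (they are not Amazon’s figures), and the sketch assumes the same unit price for owned and rented servers, which is unlikely to hold in practice.

```python
def fixed_capacity_cost(demand, price_per_server_month):
    """Cost of owning enough servers to cover the busiest month all year."""
    return max(demand) * len(demand) * price_per_server_month


def on_demand_cost(demand, price_per_server_month):
    """Cost of paying only for the servers actually used each month."""
    return sum(demand) * price_per_server_month


# Hypothetical demand in servers per month: flat all year, peaking in December.
monthly_demand = [10] * 11 + [100]
price = 1  # arbitrary currency units per server per month (assumption)

fixed = fixed_capacity_cost(monthly_demand, price)
elastic = on_demand_cost(monthly_demand, price)
print("provisioned for the peak:", fixed)
print("pay as you go:", elastic)
```

The gap between the two totals is the idle capacity paid for during the flat months; the more pronounced the peak, the larger that gap becomes.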
When it comes to using cloud computing for genomics research, costs may become quite high when they add up. The bioinformatics field, greatly influenced by the open-source movement, is not likely to rush to join Amazon’s cloud. Private efforts trying to make money out of human genome technology have remained rather unsuccessful to date: think of Celera Genomics or Lion Bioscience. I am skeptical of the bioinformatics community adopting cloud computing unless open source ideals are embraced: i) allowing people to develop and contribute to the technology if and when they want to, ii) allowing total openness in terms of its achievements and pitfalls and iii) making it free to use for everyone. I do not think that making it free means there is no margin for profit. Think of the profitability of free-to-use technologies such as Java (15) or MySQL (16), both components of Sun Microsystems’ (17) business.
Despite the promise of potential benefits for the bioinformatics community, the way the cloud is being portrayed does not conform to the ideals of free access and openness. Unless these ideals are implemented to some extent, I find it difficult to see the cloud taking root in the bioinformatics field and becoming a new standard platform for genome research.