Bye Sanger, Hello TGAC
February 14th, 2012 § Leave a Comment
After 3.5 exciting years at the Wellcome Trust Sanger Institute, working as senior developer for the DECIPHER database, it was time to start a new venture. As of February 13th, I am the Project Leader of Plant and Animal Genomes at The Genome Analysis Centre (TGAC). TGAC specializes in the study of plant, microbial and animal genomes, with a view to facilitating the development of new genomics-based biology in the academic and commercial sectors.
In my new role at TGAC I will be leading the group of computational biologists working on plant and animal genome analyses within the Genome Analysis Team. Our aims will include organizing the analysis of sequence data generated at TGAC and engaging with internal and external collaborators, both nationally and internationally. A lot of the work will focus on (but will not be restricted to) the analysis of RNA-seq, ChIP-seq and bisulphite sequencing data for the purpose of understanding how genes are regulated.
Coming Opportunities
I expect to have openings in my group in the near future for student projects (Masters and PhD) as well as for research associates and technicians. Meanwhile, if you are interested in joining or simply discussing ideas or potential projects in the broad areas of transcriptomics, epigenomics and gene regulation, you are always welcome to drop me a line.
Converting Genes and Genomic Features From NCBI36 to GRCh37
January 10th, 2012 § 2 Comments
The Human Genome is like a map onto which features and genes are placed. As techniques improve, the resolution of that map increases and new versions are released every few years. When a new coordinate reference map (or assembly) for the Human Genome is released, it produces lots of headaches for those who work in the field, as it means that the locations of genes, chromosomal bands and other features like Single Nucleotide Polymorphisms (SNPs) or Copy Number Variations (CNVs) change.
In order to have the most up-to-date set of genes and features for the Human Genome, it is sometimes necessary to convert from one assembly to another. In the past I have written a tutorial on how to remap from the NCBI36 to the GRCh37 human assembly using liftOver. In this tutorial I present a simple step-by-step guide for feature remapping using NCBI’s remapping tool.
Important:
Please make sure you know in advance the assembly to which your aberration data is currently mapped. If by mistake you remap an aberration that is already in GRCh37 as if it were in NCBI36, the tool will still return new coordinates, but they will be wrong for that region.
The NCBI provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their genome remapping service do the following:
- Make sure that your data is in BED format, e.g. “chr3 100000 999990 myId0000123” -> CNV aberration in NCBI36/hg18
- Please note that each field is separated by a tab and each line by a line break. Please follow this strictly or the remapping tool may throw an error.
- Add as many lines as aberrations you would like to remap
- Go to the NCBI Remap page:
- Select “Organism for source data” Homo Sapiens, “Source Assembly” NCBI36 (hg18) and “Target Assembly” GRCh37 (hg19)
- Please leave all “Remapping Options” (Minimum ratio of bases that must remap, etc) with default values
- Select for “Input format” BED, “Output format” Same as input
- Paste your aberration in the input box where it says “Paste data here” and hit submit at the bottom of the page
- Wait until results are returned
- To retrieve the results, download the “Mapping Report”, which is in Excel format, or alternatively look at the “Mapping Report Sample” on the results page
Please note that your aberration may remap to more than one location. I recommend that you manually check the coordinates and select the most appropriate location in the new assembly for any aberration that remaps more than once. Please also note that your aberration may not remap at all, because the region is partially or entirely deleted in the new assembly or is split in GRCh37. In this case I recommend that you try another start or end position, for example the start/end of alternative probes, until you find a region that maps.
Another possibility could be to look at the genes for the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions requires careful deliberation and may not be applicable to your particular case (e.g. genes in different chromosomes would not allow remapping based on genes).
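The input format described in the steps above can be sketched in a few lines of Python. This is only an illustration: the function name `to_bed` is hypothetical, and the single region used is the example from the list above. It shows the two formatting rules that matter for the remapping tool, one tab between fields and one line break after each aberration.

```python
def to_bed(regions):
    """Serialize (chrom, start, end, name) tuples as tab-separated BED lines."""
    lines = []
    for chrom, start, end, name in regions:
        start, end = int(start), int(end)
        # A region whose start is not before its end is almost certainly a typo.
        if start >= end:
            raise ValueError(f"start must be smaller than end for {name}")
        # Fields are joined by a single tab, as the remapping tool expects.
        lines.append("\t".join([chrom, str(start), str(end), name]))
    # Every aberration sits on its own line, terminated by a line break.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # The CNV used as an example in the steps above (NCBI36/hg18 coordinates).
    print(to_bed([("chr3", 100000, 999990, "myId0000123")]), end="")
```

The resulting text can be pasted directly into the input box of the remapping page.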
Scientific Announcements Don’t Get Noticed Where They Should
December 8th, 2011 § Leave a Comment
Wouldn’t it be nice if the event you are trying to promote needed to be posted only once? What if there was a central repository for dissemination of announcements that was accessible and permanently up-to-date? Wouldn’t it be great if your blog or website could show relevant professional announcements without having to enter them?
Unfortunately, people around the world are still trapped in the paper-based office paradigm when wanting to disseminate announcement information. Again and again they post their announcement to different places knowing that it will only reach a partial share of all potentially interested readers. They add data and clog online databases as no centralized repository is available for posting or getting information. Despite the great number of hours of work lost by millions of people trying to post, scientific organizations have been extremely slow to embrace community-shared announcement curation.
We (Rafael Jimenez and I) are promoting the creation of a community of organizations and people to lead iAnn, a centralized collaboration platform that coordinates curation efforts among scientific organizations. iAnn increases access to announcements through its dissemination tools, which have been designed specifically to integrate posts across many different websites with minimal effort. iAnn allows you to post your event, course or piece of news only once to a central repository, from which it is disseminated seamlessly to relevant scientific organizations or websites according to keywords, dates or geographical location.
If you think iAnn is of interest to you please contact me (see contact information on the right) or wait for future developments that are about to come in Manuel Corpas’ Blog. Currently we are in a development phase for the project and would like to hear from potential users or scientific organizations if they have any thoughts or suggestions on the matter. Our aim is to change the way anyone posts and finds relevant information about any given professional field. iAnn promises to help many users keep up-to-date with relevant announcements more effortlessly. Perhaps from now on websites will be better able to have most of the events, courses, seminars, news, etc. that users would expect to find in them.
Beware of Gene Names in Excel
November 5th, 2011 § 4 Comments
For the past few days I have been trying to compile the most complete list of gene names possible. To start with, I was given an initial list of genes in an Excel file that was taken from the HUGO Gene Nomenclature Committee (HGNC). Unfortunately, the gene names were pasted from the original source (HGNC) into an Excel spreadsheet without changing the default format of the column cells. This led to Excel trying to “help” with the formatting of the values inserted, changing those gene names that resemble dates into actual dates. In the bioinformatics field, misnaming a gene can lead to disastrous consequences such as misdiagnosis of a causal gene in a clinical setting. Thus:
Beware of pasting gene names in an Excel spreadsheet with a default format, as these may be changed into dates.
From the list of 19,026 genes that I have compiled so far, here are the names of the genes that have been automatically changed by Excel into dates. In the table below, the first column denotes the date the gene name is changed to, the middle column the Ensembl ID of the gene and the right column the actual name that was changed by Excel into a date.
Excel date   Ensembl ID        Gene name
Sep-01       ENSG00000180096   SEPT1
Sep-02       ENSG00000168385   SEPT2
Sep-03       ENSG00000100167   SEPT3
Sep-04       ENSG00000108387   SEPT4
Sep-05       ENSG00000184702   SEPT5
Sep-06       ENSG00000125354   SEPT6
Sep-07       ENSG00000122545   SEPT7
Sep-08       ENSG00000164402   SEPT8
Sep-09       ENSG00000184640   SEPT9
Sep-10       ENSG00000186522   SEPT10
Sep-11       ENSG00000138758   SEPT11
Sep-12       ENSG00000140623   SEPT12
Sep-14       ENSG00000154997   SEPT14
Mar-01       ENSG00000145416   MARCH1
Mar-02       ENSG00000099785   MARCH2
Mar-03       ENSG00000173926   MARCH3
Mar-04       ENSG00000144583   MARCH4
Mar-05       ENSG00000198060   MARCH5
Mar-06       ENSG00000145495   MARCH6
Mar-07       ENSG00000136536   MARCH7
Mar-08       ENSG00000165406   MARCH8
Mar-09       ENSG00000139266   MARCH9
Mar-10       ENSG00000173838   MARCH10
Mar-11       ENSG00000183654   MARCH11
Dec-01       ENSG00000173077   DEC1
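If a spreadsheet has already been damaged in this way, the date-like cells listed above can be spotted and reversed programmatically. The following is a minimal sketch, not a general solution: the function name `restore_gene_name` is hypothetical, it assumes the cells are displayed exactly in the “Mon-DD” form shown in the table, and it only knows about the three gene families seen there (SEPT, MARCH and DEC). Other symbols that Excel may convert are not covered.

```python
import re

# Month abbreviation shown by Excel -> gene symbol prefix.
# Assumption: only the three families that appear in the table above.
PREFIX_BY_MONTH = {"Sep": "SEPT", "Mar": "MARCH", "Dec": "DEC"}

# Matches cells such as "Sep-01" or "Mar-10" and nothing else.
DATE_LIKE = re.compile(r"^(Sep|Mar|Dec)-(\d{2})$")


def restore_gene_name(cell):
    """Return the gene symbol for a date-like cell, or the cell unchanged."""
    match = DATE_LIKE.match(cell.strip())
    if match is None:
        return cell
    month, day = match.groups()
    # int() drops the leading zero that Excel adds: "01" -> 1.
    return f"{PREFIX_BY_MONTH[month]}{int(day)}"


if __name__ == "__main__":
    for cell in ["Sep-01", "Mar-10", "Dec-01", "BRCA1"]:
        print(cell, "->", restore_gene_name(cell))
```

Checking the restored symbols against the Ensembl IDs in the middle column is still advisable, since the conversion in Excel discards the original text.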
myKaryoView Paper Out
October 27th, 2011 § Leave a Comment
As of October 26th 2011, a paper about the myKaryoView tool has been published in PLoS One. myKaryoView is a genome browser specifically designed for visualization of Direct-to-Consumer (DTC) personal genetic data. We look forward to receiving feedback from users visualizing their own personal genomes and developers willing to extend further the code or simply make use of myKaryoView in a different context.
The paper is freely available and open access.
Citation: Jimenez RC, Salazar GA, Gel B, Dopazo J, Mulder N, et al. (2011) myKaryoView: A Light-Weight Client for Visualization of Genomic Data. PLoS ONE 6(10): e26345. doi:10.1371/journal.pone.0026345
A Genome Blogger Manifesto
September 28th, 2011 § 9 Comments
Have you ever wondered why some people have no reservations about sharing their genetic profiles? Why do they openly talk about something supposedly so private? I believe that no contradiction exists between wanting to protect one’s privacy and sharing one’s genomic data with the world. I am more concerned about the information that Facebook collects about my profile than about my genome data (provided that I live in a country where there is public healthcare).
Sharing and comparing one’s genome with other personal genomes is a matter of necessity if one is to shed light on the meaning of one’s personal DNA.
This is why I became a genome blogger myself. Why should one be constrained by the information that genomic test reports provide? No personal genome analysis report can ever be complete; it will always be influenced by the biases of whoever is providing the report.
* * *
Although no formal document seems to have been produced yet on what the core values for genome blogging should be, the core beliefs driving personal genome-sharing should be made explicit. Here I present an initial and inherently imperfect attempt to put in writing what I believe genome blogger values could be. I do not expect every fellow blogger to agree with them, but I hope that at least they inspire some debate. These are not a fixed set of rules; on the contrary, I expect this thinking to evolve with the genomics technology itself. I base some of the ideas below on Marcus Wohlsen’s ‘Biopunk’ book, Meredith Patterson’s ‘biopunk manifesto’, Misha Angrist’s ‘Here is a human being’ book and Pekka Himanen’s ‘Hacker’s ethics’ book.
Core Values for Genome Blogging
- Intelligent exploration, experimentation and trial to push the boundaries of knowledge are a right for ordinary people. The days in which genetic science was only done by university professors or people working in corporate labs are now over. Now everyone should have the power and legitimacy to be able to discover, develop and find new things about their own genome data.
Getting My Genome Sequencing Done (Part I)
July 12th, 2011 § 4 Comments
Readers of this blog may have come across the experiment my family did with Direct-to-consumer (DTC) genetic testing. We analyzed all our samples using 23andMe kits and started sharing and writing about our personal genome data. This experience has changed me dramatically as a person and researcher. I started off as a bioinformatician with an interest in risks of genetic variants but now these experiences have helped me develop a real insight into the psychology of how these variants may impact on people’s reactions. As a family, we are truly experiencing a really positive and unexpected response from people contacting us via the Internet who are willing to tell us their findings about our family data.
After doing our whole genome genotypes, the next obvious step is to have our whole genomes sequenced. There is quite a lot of debate at the moment as to whether genome sequencing should be accessible to the general public and if so, to what extent. But I figured out that if “the rich and famous” can have their genome sequenced, perhaps with a bit of luck, the “ordinary and poor” (among which I include myself), could have a chance, even with zero budget. Zero budget for this exercise was an essential point of principle, given that we really would not be able to afford even a 10th of the price a genome currently costs (around $9,500; probably cheaper than this price by the time you are reading this).
I wasn’t sure how to do this, but I knew that it might be possible and that we would get it done if we could. So I went onto the Internet and searched for whole genome sequencing. I found three good potential candidates that could do it on demand: Complete Genomics, the Illumina personal sequencing services and the Beijing Genomics Institute (BGI). The first thing I did was to send them an email. Given that we had no money to spend and that there is no such thing as a free lunch, we thought that we needed to offer something substantial in return, since we were asking them to waive a fee of ~$50,000. The only substantial thing we could really offer was publicity, so the following proposal was sent to those three companies via their websites:
Dear Sir/Madam:
I would like to offer you a deal/proposal. My family would like to have their whole genome sequenced with your company. In exchange for releasing to the public openly and freely on the Internet our genomes we thought you could sponsor us. This action could attract *a lot of attention* to [company name], as this is a pioneering move. Currently a very limited set of people are actually interested in sequencing their genomes. The only way you can reach the ordinary citizen (sooner rather than later) is if ordinary people, like my family publish their experiences and pave the way.
My family, an ordinary family, constitutes an example of what this technology could do for any ordinary person, not just a scientist, etc. In addition to this, I want to fully research all of the social/ethical implications that publishing this information can bring. We also hope to share this information with the world. Currently all my family has genotyped their genomes with 23andMe and put all this data in the Internet for free download:
http://manuelcorpas.com/five-family-relatives-genome-download/
To our knowledge, this is the first time that anything like that has been done. In barely a month since this information has been published, four different analyses from specialists/hobbyists have reached us, making us learn that dad, for example, is lactose intolerant [1]. Our point is that now, with DNA sequence providers, the door opens for DIY genome mining. The power of the Internet and computers may bring this technology to computer savvy people.
For example, our 23andMe genomes are now been taken by SNPedia and several other ancestry projects such as Eurogenes and Artemis:
http://www.snpedia.com/index.php/User:Manuelcorpas
http://bga101.blogspot.com/2011/03/mds-analysis-of-southern-europe.html
http://dioegenesartemis.blogspot.com/2011/04/first-results.html
Although up to date the information provided by 23andMe has not revealed any nasty surprises about our genomes, we are aware that now anyone can report new findings that were not initially discovered in our genomes. We believe, however, that as a family we can gain a lot more than lose by sharing our genome data with the world. I believe my proposal could bring a lot of exposure to [company name] and therefore would request whether you could consider this offer.
Best wishes,
Manuel
Illumina never got back to us. Looking around we learned that their policy is that sequencing should be done with medical prescription. Fair enough.
I couldn’t wait long, so I continued researching the matter and found Complete’s contact phone number on their website, so I rang them. To my surprise I was put through, and the person was very polite with me and keen to listen to what I had to say. Since then I have learned that Complete had already sequenced and published 69 genomes, available via this website:
http://www.completegenomics.com/sequence-data/download-data/
Among these genomes there is a multigenerational family with a bigger pedigree than the one I was proposing. This obviously meant that our offer wasn’t as innovative as we had initially thought. It seems that Complete Genomics will not do “Direct-to-consumer” business (at least for the time being), but that their goal is still to become the “Intel Inside” of human genome sequencing efforts, the technology underlying most human genome analyses. I thought that that was a cool objective if attainable.
I still didn’t give up. I tried to see whether there was a chance that Complete might change their mind, so I wrote to them about our incredibly interesting experience of family dynamics and family communication issues while discussing our personal genomes. So far we have not been lucky enough to get our genomes sequenced for free. Despite not achieving our outcome, there is a lot we have learned on the way though. What an interesting experience.
This is the end of part one on Getting My Genome Sequencing Done.
[1] This information was actually available in our 23andMe reports, but we missed it initially. We learned about this condition with the SNPedia tool Promethease.
Benefits for Publishing Family Genomes on the Internet
June 6th, 2011 § 6 Comments
I have long been wanting to write about the work that Mike Cariaso, founder of SNPedia, has been doing with my family’s genotypes. Initially, he analyzed their data with Promethease to assign traits and annotation to observed SNPs. More recently, he has also developed a tool for visualization and comparison of genotypes between different people. He has used my family’s and Manu Sporny’s genotypes as test cases.
This is an unanticipated benefit we have experienced as a family for publishing our genomes on the Internet. Using Promethease’s report we were able to learn that dad is lactose intolerant. The fact that he did not like milk and had not taken milk in years kind of made sense when we discovered that his two SNPs rs4988235(C;C) and rs182549(C;C) make him unlikely to digest lactose with 70% probability. This result regarding lactose intolerance was in fact in the 23andMe report but we missed it.
It is clear that Direct-to-consumer genetic companies do try to cater to the non-expert, i.e. the majority of their customer base. The novel SNPedia visualization tool will be a useful addition for those of us who strive to make our own DIY discoveries about our personal genome data.
Using his visualization tool, when I compare all my SNPs with those of my sister’s, I find that 68% of mine are identical to hers, a total of 389,250 (see below).
Note that the graph is using a logarithmic scale. Of all our analyzed SNPs, 25% are halfmatches (i.e. one of the alleles is common to both of us) and about 2% are conflicts. Examples of conflicts may include different SNPs with the same position. This, according to Mike, may not be an accident. Because I know that we were analyzed on two different array platforms, version 2 and version 3 respectively, I can now tell the number of SNPs that are different between the two of us, i.e. present in only one of the two genotypes. Of the more than 0.5 million SNPs in my genome, 29,082 do not match hers.
The other nice feature this tool provides is an actual graphical representation of chromosomal SNPs in a map of pixels, colored consistently with the above notations: light blue means match, dark blue halfmatch, red conflict and grey different SNPs:
The above figure shows two representations of the chromosome/chromosome comparison between my chromosome 1 and my sister’s. Clearly most of the area is light blue, indicating complete match. The numbers of differences, halfmatches and conflicts are also reported. Clicking on any of these links, one can find the actual SNPs in conflict, getting an output that looks like this:
1   rs9729550    1  1125105  CC  AA
2   rs12142199   1  1239050  GG  AA
3   rs7531583    1  1696020  GG  AA
4   rs6681938    1  1771080  CC  TT
5   rs41307846   1  1949559  GG  --
6   rs3128296    1  2058766  TT  GG
7   rs262654     1  2079386  AA  GG
8   rs262688     1  2103425  GG  TT
9   rs6659405    1  2362949  TT  GG
10  rs4648482    1  2739781  CC  TT
11  rs2483266    1  3225901  CC  TT
12  rs868688     1  3290667  TT  CC
13  rs10492939   1  3292731  AA  GG
14  rs2493268    1  3298358  TT  CC
15  rs871822     1  3302774  GG  TT
16  rs12024847   1  3310659  TT  CC
17  rs2821017    1  3510731  GG  AA
18  rs3765761    1  3620336  CC  TT
19  rs3765766    1  3624520  TT  CC
20  rs4233262    1  4136842  CC  TT
21  rs966321     1  4215064  GG  TT
22  rs964715     1  4216644  TT  CC
23  rs1390136    1  4241703  CC  TT
24  rs4654545    1  4425464  TT  CC
25  rs446529     1  4695274  CC  TT
This table shows that for the first SNP, rs9729550, I have CC while my sister has AA.
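The four categories used in this comparison can be sketched in Python as follows. This is not Mike’s code, just my reading of the categories: a match means both genotypes are identical, a halfmatch means at least one allele is shared, a conflict is assumed to mean that no allele is shared (as in the table above), and “different” means the SNP is present in only one of the two genotype files. The function names and the two rsIDs marked as hypothetical are made up for illustration.

```python
from collections import Counter


def classify(mine, hers):
    """Classify one SNP from two unphased genotype calls such as "CC" or "AG"."""
    if mine is None or hers is None:
        return "different"  # SNP typed on only one of the two arrays
    if sorted(mine) == sorted(hers):
        return "match"      # allele order is ignored: "AG" equals "GA"
    if set(mine) & set(hers):
        return "halfmatch"  # at least one allele in common
    return "conflict"       # assumption: no allele in common


def compare(my_calls, her_calls):
    """Count the categories over every SNP seen in either genotype file."""
    counts = Counter()
    for rsid in my_calls.keys() | her_calls.keys():
        counts[classify(my_calls.get(rsid), her_calls.get(rsid))] += 1
    return counts


if __name__ == "__main__":
    # The first two entries come from the table above; the rest are hypothetical.
    me = {"rs9729550": "CC", "rs12142199": "GG", "rsHYPO1": "AG", "rsHYPO2": "AG"}
    sister = {"rs9729550": "AA", "rs12142199": "AA", "rsHYPO1": "GA", "rsHYPO3": "TT"}
    print(dict(compare(me, sister)))
```

Note that under these rules a no-call such as the “--” in row 5 of the table is counted as a conflict, which agrees with the output shown above.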
In conclusion, Promethease and the SNPedia visualization tool are helping me learn more about my SNP genotype results, complementing the information that I initially got from my Direct-to-consumer provider. Hopefully I will be able to do some additional research based on the results obtained here.
If you want to see my family’s genomes with Mike Cariaso’s tool you can find it here:
Don’t forget to send me any exciting findings that you might encounter!
A Warning Sign for Biomedical Databases
May 25th, 2011 § 9 Comments
Users of the highly popular OMIM database (Online Mendelian Inheritance in Man) [1] may have noticed that NCBI [2] is not providing further funds to sustain OMIM’s development. One of the reasons for halting the funding may be that curation work is not deemed worthy of funds. Funding agencies might thus have started a trend of being unwilling to dedicate funds to the curation of database entries.
The flip side of this is the nascent trend to outsource database annotation to the general public. Databases like Rfam [3] or Pfam [4], two popular RNA and protein family databases, have adopted the strategy of outsourcing their annotation to Wikipedia [5]. Realizing that it is impossible to keep up with the literature, Rfam made an attempt to seed Wikipedia with database-specific information. They then developed a system that periodically collects the Wikipedia text from the created entries and uses it to repopulate the corresponding RNA entry. The price they had to pay was losing control over what gets entered into the Wikipedia entry. However, the benefits seem to outstrip this loss of control, including ready access to an army of casual annotators and a dramatically increased exposure of the database itself (Wikipedia consistently ranks top of the list for most RNA family searches in Google). This means that their chances of having up-to-date content are increased, as is awareness of the resource, which helps to justify future cycles of funding.
Something that started as an experiment in Rfam seems to be spreading to other databases as they begin to assess how to address their annotation bottleneck. It seems that outsourcing the annotation of biomedical databases to Wikipedia is a solution worth considering as curation practices continue evolving to cope with current fund shortages. A generalized lack of funding for research and the establishment of community wiki-style annotation practices may mean that funding agencies become ever more reluctant to provide funding for database curation. Perhaps this is the time to start rethinking future plans for those of us who care about biological databases and their contents. Is the time now ripe for embracing Wikipedia to the full?
[1] http://www.ncbi.nlm.nih.gov/omim
[2] http://www.ncbi.nlm.nih.gov/
[3] http://rfam.sanger.ac.uk/
[4] http://pfam.sanger.ac.uk/
[5] http://www.wikipedia.org/


