Computational Biology in Wikipedia Needs Serious Work

May 19th, 2011

Wikipedia is one of the resources with the greatest impact in disseminating knowledge. A Google search for computational biology returns the Wikipedia entry at the top of the hit list (Figure 1). A search for biological databases again returns the corresponding Wikipedia entry at the top of the list. Search for genomics, proteomics or metabolomics: the top result is still Wikipedia.

Figure 1: Google search results for "computational biology"

Appearing at the top of the Google hit list happens for most topics covered in Wikipedia. In fact, Wikipedia currently ranks #5 in the list of most visited sites on the Internet. This prominence in Google searches is the result of the great number of links that compose any Wikipedia entry, which in turn is linked to by many other entries within and outside Wikipedia.

The success of Wikipedia, initially attributable to the experiment of engaging a community-wide effort to provide accurate and accessible information to the general public, has led to a massive development of the resource. Even in scientific circles it features as an important source of reasonably up-to-date reference knowledge [1].

Figure 2 shows a snapshot of the current computational biology article in Wikipedia. It does not mention any of the breakthroughs the field has experienced since the advent of the Human Genome Project, nor the journals or even the scientists who have shaped the field.

Figure 2: Wikipedia entry for computational biology

The ability to engage a community-wide effort and the high rank any Wikipedia entry occupies in a Google search make it an ideal vehicle for communicating the significance of computational biology. It is the people who work in this field who are ultimately responsible for making sure that their findings and work are known to the taxpayer.

[1] http://wellcometrust.wordpress.com/2011/05/18/being-a-scientist-in-the-age-of-wikipedia/

Experiences with Personal Genetics: A Family Journey

April 17th, 2011

The above is the title of a talk I will be delivering at this year’s OpenTech (21 May 2011), a conference whose objective is to provide a forum of discussion for “people who work on things that matter”. Here is an outline of what I’ll be presenting:

Direct-to-consumer genetic testing is a new field of commercial activity that makes genome screening available to the general public. Test results are delivered online via a password-protected account, contextualized with state-of-the-art inferences about the individual’s clinical features, disease risks and ancestry. Interpretation of results is limited to the information supplied by the provider and is usually not accompanied by genetic counseling. Custodians of genetic information may not have the necessary skills to interpret results, let alone interpret results for others. This talk presents the personal journey of a genome bioinformatician acting as genetic counselor for his whole family, yet with no formal training to do so. Becoming custodian of genetic information for a whole family resulted in unanticipated situations and reactions, which are presented here. As the use of these tests becomes ever more widespread, it is hoped that these experiences provide useful insights to new customers of genomic technology who try to understand their own genes.

For more information on this conference click on the image below.

Millions of Genomes

February 15th, 2011

This was the title of a talk recently given by Richard Durbin at the Wellcome Trust Sanger Institute. Excitement and expectation, reassured by a continuous trend of exponential growth, made inspired listeners feel the same way Google or Facebook employees must have felt at their company’s peak time.

Some numbers presented by Richard gave context to the startling prediction that by 2015 millions of individual genomes will have been sequenced. This is in fact the expected number if the current pattern of growth continues. Ten years have now passed since the draft of the first human genome was released in 2001. By 2006, with next generation sequencing in full swing and sequencing centers churning out many gigabases per week, tens of genomes had been sequenced. Today the number of individual genomes is in the order of thousands, and a 4-fold growth per year is predicted. Extrapolating this estimate to five years from now puts the number of genomes sequenced at 1024 times (4^5) our current number, hence millions of genomes.
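
The extrapolation is plain compound growth, so it is easy to check. Here is a minimal sketch; the starting counts of 1,000 and 5,000 are my own illustrative stand-ins for "thousands" and are not figures from the talk.

```python
def projected_genomes(current, fold_per_year, years):
    """Compound growth: current * fold_per_year ** years."""
    return current * fold_per_year ** years


# Five years of 4-fold yearly growth multiply the count by 4 ** 5.
growth_factor = 4 ** 5
print("growth factor over five years:", growth_factor)

# Illustrative starting points only (assumed, not taken from the talk).
for start in (1_000, 5_000):
    print(start, "genomes today ->", projected_genomes(start, 4, 5), "in five years")
```

Either starting point lands in the millions, which is all the prediction claims.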

Having such an incredible amount of data will clearly create challenges that we are only beginning to discover. How are we going to hold all this data when processing capacity in computers “only” grows 2-fold every year? The answer is that as more genomes become available, an individual’s data will not be stored in its totality; only the differences that define his/her particular variations will be kept.
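
To make the idea of storing only differences concrete, here is a toy sketch. It assumes a shared reference sequence and handles single-base substitutions only; the function names and the ten-base sequences are made up for illustration, and real variant formats also cover insertions, deletions and much more.

```python
from typing import Dict


def diff_against_reference(reference: str, sample: str) -> Dict[int, str]:
    """Return {position: sample_base} for every position where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only, so lengths must match")
    return {i: s for i, (r, s) in enumerate(zip(reference, sample)) if r != s}


def reconstruct(reference: str, differences: Dict[int, str]) -> str:
    """Rebuild the full sample sequence from the reference plus stored differences."""
    bases = list(reference)
    for position, base in differences.items():
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"  # made-up reference
sample = "ACGTTCGTAA"     # made-up individual differing at positions 4 and 9

differences = diff_against_reference(reference, sample)
print(differences)  # {4: 'T', 9: 'A'}
print(reconstruct(reference, differences) == sample)  # True
```

Only two entries are kept instead of ten bases; the saving grows with the length of the sequence and the similarity between individuals.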

Although many genomes may have been sequenced by now, accessing them is not a trivial matter. Stored in many different places, with different restrictions and inconsistent levels of detail, the bulk of this data is likely to remain at least mildly challenging to handle. Results of investigations will certainly be accessible, but think of the effort it could cost to access every single database containing public individual genome data. I do not believe that a great number of genomes will be optimally researched unless more straightforward and standardized access protocols are put in place, something that is currently lacking. The excitement is reasonably justified, yet base-pair-to-bedside medicine may be delayed if current data sharing procedures are not streamlined.

Personal Genomic Software: A Review of What Is Available

February 14th, 2011

Readers may have seen that a few previous entries in Manuel Corpas’ Blog have been dedicated to myKaryoView, a free software tool for personal genome visualization. In this post I review some of the software that is currently available for analysis of personal genomes. These are all free third-party packages independent of providers such as 23andMe, Navigenics or deCODEme.

Andrew Scheidecker’s Personal Genome Explorer is apparently the first piece of software that was created for analysis of 23andMe personal data. This is a console application that allows 23andMe data import, deCODEme data import, SNP database import from SNPedia, analysis of a genome based on SNPedia metadata and random genome generation based on population frequency data.

I found that Personal Genome Explorer is a lightweight application that can be easily downloaded and installed. A lot of potential information can be extracted and browsed from a database based on SNPedia data. I tried to upload my own 23andMe chromosome 16 with file extension ‘.txt’; unfortunately it did not recognize the file, nor did it give any clue as to what kind of extension it accepts.

Personal Genome Explorer showing randomly selected SNPs

SNPTips is a Firefox plugin that allows customers of 23andMe to access their SNP genotype information. SNPTips allows one to hover the mouse cursor over the SNP id in any article text or webpage. Clicking on the SNP icon it creates opens a pop-up window with one’s genotype (i.e. the DNA letters found in your analysis) and links to SNPedia, Google Scholar and dbSNP. I tried to upload my 23andMe chromosome 16 and it worked quickly and neatly. Unfortunately it does not allow simultaneous visualization of more than one personal genome.

Enlis Genome is another tool that can be downloaded as a console application. The interface is quite intuitive and it managed to upload my chromosome 16 SNPs in a couple of minutes. The report it gave back was very neat. However, it seemed to provide very similar information to what is already available to 23andMe customers. The main added value I could find in this tool was that it collated most of the information provided by a 23andMe customer report into a single document that can be easily handled. It was unfortunate, though, that the report concluded I was female. How it infers my gender when I only provided autosome data puzzles me slightly.

 

My results for Enlis Genome uploading my 23andMe chromosome 16.

myKaryoView is, to my knowledge, the only personal genomics tool that allows navigation and visualization of this genetic data directly as a genome browser. myKaryoView uses DAS technology, which makes it capable of representing any available DAS source together with one’s genome, such as known genes, OMIM genes, normally variant regions, etc. Currently, adding one’s genome to a DAS source is a process that requires expert knowledge of another tool called easyDAS. Once the DAS source for one’s genome is created, the URL where the DAS source is located can be added to myKaryoView for exploration via its interface. myKaryoView does not require any download or installation, as it is a web tool, and many personal genomes can be navigated at the same time.

myKaryoView showing my personal genome SNPs in green for a subregion of 10q11.23

The Perfect Tool

If I were able to pick the strengths of each of the reviewed packages and put them together into one piece, I would choose the richness of SNP information from Personal Genome Explorer, the ease of uploading one’s genome from SNPTips, the reporting capabilities of Enlis Genome and the navigation and visualization capabilities of myKaryoView. Since all of these implementations are already available, the winner of this software “market” will be the one that combines all of these strengths in a manner that is easily accessible to lay people. I think 23andMe has a lesson to teach in terms of making the ability to analyze one’s genome accessible to all of us and reporting the relevant information succinctly.

Conclusion

Several tools are now available specifically tailored to the analysis and discovery of information related to one’s personal genome. No single tool is perfect, and to some extent all require some computer and biology knowledge in order to operate and understand them properly. This is clearly not the ideal situation for lay people who are curious to know a bit more about their own personal genome. Certainly, if all the strong points of each of the above were combined, a much better tool and service to the community could be rendered. Personal genome coders: it’s time to join forces!

The Meaning of Red

February 5th, 2011

A previous post in Manuel Corpas’ Blog in March 2010 noted that there was disagreement among the leading Copy Number Variation (CNV) repositories in one small but significant detail. Some of them displayed gains in green, others in blue. The same was true of losses: no consensus existed on the way deleted regions were colored. Not agreeing on such an obvious standard was troublesome for users, especially when comparing data from different resources.

I am pleased to know that decision makers in DECIPHER, the Database of Genomic Variants, ISCA, the NCBI and the UCSC Genome Browser have finally agreed on a common color scheme that defines gains in blue and losses in red. To be more precise, here are the hexadecimal colors:

  • #0000FF (blue): gain
  • #FF0000 (red):  loss
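
For anyone drawing CNV tracks in their own code, the scheme boils down to a two-entry lookup. Below is a minimal sketch using the two hexadecimal values listed above; the names `CNV_COLORS`, `cnv_color` and `hex_to_rgb` are my own and do not come from any of the resources mentioned.

```python
# Agreed colors, as listed above: gains in blue, losses in red.
CNV_COLORS = {"gain": "#0000FF", "loss": "#FF0000"}


def cnv_color(kind):
    """Return the hexadecimal color for a CNV type ('gain' or 'loss')."""
    try:
        return CNV_COLORS[kind.lower()]
    except KeyError:
        raise ValueError("unknown CNV type: %r" % (kind,)) from None


def hex_to_rgb(color):
    """Convert '#RRGGBB' into an (r, g, b) tuple of integers in 0..255."""
    value = color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


print(cnv_color("gain"), hex_to_rgb(cnv_color("gain")))
print(cnv_color("loss"), hex_to_rgb(cnv_color("loss")))
```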

Part of the drive to agree on this standard came from color-blind users, who complained that they were not able to distinguish between red and green. This trigger accelerated the change.

Having a common standard coloring scheme will help not only color-blind people but all users. A consistent way of illustrating gains and losses means all users will be able to grasp the science more quickly. This is great news for the whole community and a cause for celebration.

Remapping from NCBI36/hg18 to GRCh37/hg19

February 2nd, 2011

Given the number of requests I get at work about remapping features to another assembly, I present here an adapted version of how to remap a feature from NCBI36/hg18 to GRCh37/hg19 using UCSC’s liftOver tool.

Important:

Please make sure you know in advance the assembly to which your aberration data is currently mapped. If by mistake you remap an aberration that is already in GRCh37 as if it were in NCBI36, you will get new coordinates that point to the wrong region.

UCSC’s Genome Browser provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their liftOver tool do the following:

  1. Make sure that your data is in BED format, e.g.  “chr3     100000  999990  myPatientId0000123” –> aberration in NCBI36/hg18
  2. Note that each field is separated by a tab and each line by a line break. Please follow this strictly or the remapping tool may throw an error.
  3. Add as many lines as aberrations you would like to remap.
  4. Go to the liftOver page
  5. Select “Original Assembly” Mar. 2006 (NCBI36/hg18) and “New Assembly” Feb. 2009 (GRCh37/hg19)
  6. Leave all other parameters (Minimum ratio of bases that must remap, etc) with default values
  7. Paste your aberration in the input box where it says “Paste in data” and hit submit
  8. To get results, scroll down the page and click on the “View Conversions” link.
  9. Here is the result I get:
chr3  125000      1024990     myPatientId0000123
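
Because the tab-separated layout in steps 1 and 2 is easy to get wrong when typing by hand, here is a minimal sketch for producing and checking the input lines before pasting them into liftOver. The helper names `bed_line` and `parse_bed_line` are my own; the sketch only prepares input and does not do any remapping.

```python
def bed_line(chrom, start, end, name):
    """Build one tab-separated BED line: chromosome, start, end, name."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return "\t".join([chrom, str(start), str(end), name])


def parse_bed_line(line):
    """Split a BED line on tabs and check the first three fields are present."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("fewer than 3 tab-separated fields: %r" % (line,))
    name = fields[3] if len(fields) > 3 else ""
    return fields[0], int(fields[1]), int(fields[2]), name


# The example aberration from step 1 (coordinates in NCBI36/hg18).
line = bed_line("chr3", 100000, 999990, "myPatientId0000123")
print(line)
print(parse_bed_line(line))
```

A line typed with spaces instead of tabs fails the check in `parse_bed_line`, which is the mistake step 2 warns about.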

Please note that your feature may not remap because the region is partially or entirely deleted in the new assembly or split in GRCh37. In this case I recommend that you use another start or end position, perhaps the start/end of alternative probes, until you find a region that maps. Another possibility would be to look at the genes for the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions requires careful deliberation and may not be applicable to your particular case (e.g. genes on different chromosomes would not allow remapping based on genes).

I hope this is helpful.

A Genetic Poem

December 24th, 2010

A poetry contest is being organized by 23andMe. The five lucky winners each get a free entry to the 2011 edition of the Personalized Medicine World Conference. The rules are simple: include at least 5 words from a list provided and send it by December 31st. Any format and any number of tries are allowed. Unfortunately I am not eligible because it is only open to US residents and travel costs are not covered.

Since Christmas is a time that tends to be free of distractions for me :-) I decided to give it a try and write a genetic poem, even though I am not eligible for the prize. I will share it for now but I might take it down later if I feel more embarrassed about it.

My Genes and Me

Genes between probes
Vary my transcription
Condition their relation
And length of ear lobes

My credit card spelled out
The questionnaires went
I too gave consent
To have my test done, no doubt

A saliva drop fell
As the parcel came home
I shook it all well
The lab got it, I logged on

Risks and Ancestry I found
Some were fine, some were high
I scrolled up, I scrolled down
Browsed data through the night

Then something else happened
A 5th cousin had chatted
That shared my haplogroup
Was she part of my troupe?

Yet it was all the genotypes
That 23andMe found alike
Without it I wouldn’t know
Such tales of my chromosomes

For how can a little SNP count
If these traits are all mine
My phenotypes all defined
When I had just DNA out

And if this was not enough
Hope you’ll let me certify
GWAS is good for some stuff
Even if it’s hard to identify

myKaryoView v2: Navigate Your Own Genome!

December 16th, 2010

Following on the release of a Nature article on the rise of genome bloggers, in which Manuel Corpas’ Blog is linked, I would like to take this opportunity to announce the release of myKaryoView v2, an open source visualization software for personal genomics. Combining Rafael Jimenez’s and my own efforts, we have significantly augmented myKaryoView’s capabilities to allow users to visualize their personal genomes.

Visualization of one’s own personal genome is done via Bernat Gel’s easyDAS tool. This tool converts files with biological annotations into a DAS source. DAS sources can be thought of as tracks in a genome browser. The beauty of DAS is that it does not require any data to be stored locally and, as long as the reference coordinates are the same, any kind of biological features can be easily integrated.

Exploring My Own Genome With myKaryoView

23andMe’s analysis results report that I have a 28.1% risk of developing prostate cancer, as opposed to a 17.8% average risk in males. This risk is calculated by analysing the genotypes of 12 SNPs. The SNP marker rs10993994 shows the greatest risk among the 12 reported markers, a 1.3-fold increase in odds. This SNP is located in 10q11, near the MSMB gene, and the allele found (T) has been shown to affect its expression levels, decreasing its cancer suppressor function [1].

Having no history of prostate cancer in close relatives, I wanted to find more information about this SNP in order to confirm results. My whole genome profile, containing > 570,000 SNPs, was downloaded from 23andMe and a DAS source was created using easyDAS. The resulting data source was held privately in my newly created easyDAS account. Once easyDAS creates a new DAS source, the data is available through a URL. I pasted the URL for my genome data into the myKaryoView interface, selecting all accompanying tracks to be shown in its zoom view.
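
Before building a DAS source it can be handy to inspect or subset the downloaded file. Here is a minimal sketch, assuming the raw download is a tab-separated text file with `#` comment lines and four columns (rsid, chromosome, position, genotype), which is my understanding of the 23andMe export. The three rows below are made up for illustration, and nothing here reflects what easyDAS itself expects as input.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Snp(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def read_snps(handle: TextIO) -> Iterator[Snp]:
    """Yield one Snp per data line, skipping blank lines and '#' comments."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield Snp(rsid, chromosome, int(position), genotype)


# Made-up rows in the assumed layout; a real file would be opened with open(path).
example = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t10\t51000100\tCT\n"
    "rs0000002\t10\t51300200\tTT\n"
    "rs0000003\t16\t12345\tAG\n"
)

snps = list(read_snps(example))

# Keep only SNPs inside an arbitrary example window on chromosome 10.
in_region = [
    s for s in snps
    if s.chromosome == "10" and 51_000_000 <= s.position <= 51_200_000
]
print(len(snps), "SNPs read;", len(in_region), "in the selected region")
```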

I typed ‘MSMB’ into myKaryoView’s query search box and, once results were returned, I zoomed out to have a better overview of my 10q11 chromosome region, shown below.

Visualization of MSMB region with myKaryoView

My genomic profile is the bottom track, with SNPs in green. The top track, in purple, corresponds to genes involved in Mendelian inheritance diseases (taken from OMIM); all existing genes are in red, normal CNV regions in blue and green, and somatic mutations found in cancer (from the COSMIC database) in yellow. I clicked on the gene and SNP feature bars to find further information. Clicking on the MSMB gene feature, I found that this gene’s start position is 51219559, only 57 bp after the rs10993994 SNP position. The track with yellow features (COSMIC) also contained four reported mutations for MSMB (MSMB:ENST00000358559), indicative of the involvement of this gene in cancer, but all of them within the gene’s exons, i.e. not outside the gene. Following MSMB’s link to OMIM also revealed its implication in prostate cancer.

Seeing all these data sources in myKaryoView, I feel more confident about the validity of 23andMe’s reported risk. It is true that all the sources visualized in myKaryoView can be found by searching the Internet. The merit of this tool is, I think, that it provides a one-stop shop for a first step in analyzing original data sources for one’s personal genomic results.

About myKaryoView

myKaryoView is a web tool for visualization of genomic data, specifically designed for direct-to-consumer genomic tests, that uses publicly available data distributed throughout the Internet. It does not require data to be held locally and it is capable of rendering any feature as long as it conforms to a standard protocol named DAS. Configuration and addition of sources in myKaryoView can be done through the interface. myKaryoView should be considered a prototype and not a finalized tool. Here we offer a proof of principle of myKaryoView’s ability to display personal genomics data with 23andMe genome data sources. Prior to publication, please acknowledge Rafael Jimenez and Manuel Corpas if using myKaryoView.

[1] Proc Natl Acad Sci U S A. 2009 May 12;106(19):7933-8. Epub 2009 Apr 21.

myKaryoView: First Open Source Visualization Software for 23andMe Data

September 1st, 2010

myKaryoView Logo

Following my previous post on the First Publicly Available Genome Via DAS, I would like to present an open source tool that Rafael Jimenez and I have developed for visualization of genomic data. Here we have it configured to display 23andMe data as a test case. We call it myKaryoView and it is available for free use and download. Its website is located at the following address:

http://mykaryoview.com

myKaryoView works in most contemporary browsers without lengthy installations and uses publicly available data distributed throughout the Internet via DAS. This means that there is no need to hold the data locally and that it is capable of visualizing any data as long as it is available via DAS. In order to visualize 23andMe data, myKaryoView requires a DAS source to be set up, which currently limits myKaryoView’s usage to those familiar with this technology. However, configuration and addition of sources are extremely simple, and the amount of data that can be displayed is limited only by the time needed to complete the request and render the data.

Here we show myKaryoView displaying personal genomics data with a dummy 23andMe genome data source. This source is based on real 23andMe results data from my own genome, randomly modified so that it is no longer recognizable.

The myKaryoView website shows an implementation that allows search of genome data via gene name or genome coordinates. For example, type in the search box 1:2000000,6000000 and hit “Submit Query”.
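
The coordinate query above has the shape chromosome:start,end, which is also how a segment is written in a DAS features request as I understand the DAS specification (`<server>/das/<source>/features?segment=chromosome:start,end`). Here is a minimal sketch of parsing such a query and building that request URL; the server `http://example.org`, the source name `my_genome` and both function names are hypothetical, and this is not how myKaryoView itself is implemented.

```python
import re
from urllib.parse import urlencode

# chromosome:start,end, e.g. "1:2000000,6000000"
REGION = re.compile(r"^(?P<chrom>[0-9XYMT]+):(?P<start>\d+),(?P<end>\d+)$")


def parse_region(query):
    """Parse 'chromosome:start,end' into (chromosome, start, end)."""
    match = REGION.match(query.strip())
    if match is None:
        raise ValueError("expected chromosome:start,end, got %r" % (query,))
    start, end = int(match["start"]), int(match["end"])
    if end < start:
        raise ValueError("end must not be smaller than start")
    return match["chrom"], start, end


def das_features_url(server, source, chrom, start, end):
    """Build a features request URL (URL shape assumed from the DAS spec)."""
    segment = "%s:%d,%d" % (chrom, start, end)
    # Keep ':' and ',' unescaped so the segment stays readable.
    query = urlencode({"segment": segment}, safe=":,")
    return "%s/das/%s/features?%s" % (server.rstrip("/"), source, query)


chrom, start, end = parse_region("1:2000000,6000000")
print(das_features_url("http://example.org", "my_genome", chrom, start, end))
```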

myKaryoView Zoom and Chromosome views.

The figure above shows the results of that query, with two tracks containing the 23andMe source with dummy data plus genes for a subchromosomal region in chromosome 1, Start: 2000000, End: 6000000. Gene names and SNP data are shown in red and blue respectively. Different color shades indicate the density of annotation for any given point. If the “Gene Names” data track name is clicked, a popup window appears with a link “Display Original Data Source” that allows the download of the raw data from its DAS source. Any feature can be clicked for retrieval of specific information contained in the DAS source. Here a blue SNP mark is clicked and a popup window appears describing the selected SNP with a link to its corresponding dbSNP entry.

A simple manual explaining how to install and configure myKaryoView to show different data sources is provided on the website. myKaryoView is still in beta testing and any feedback is welcome. We have some plans for myKaryoView in the near future, which we will reveal in due time. Meanwhile I hope you find it interesting and useful.

By the way, the claim that this is the First Open Source Visualization for 23andMe data is, of course, arguable.

