Visualizing Your DNA Genotype Profile in a Blanket

March 6th, 2012

I have come across a rather ingenious idea on Ben Landau’s blog. You may have read that Mike Cariaso, founder of SNPedia, created a visualization of my family’s genotypes in which the genotypes of family members were compared against each other. The resulting patterns compare each chromosome against every other. To show how this actually looks, I have taken from Mike’s tool an image that compares the 23andMe genotypes of my mom and dad (x and y axes respectively), with each pixel being a SNP and different colors representing a match (light blue), a half match (dark blue) or a conflict (red).

Chromosome 1. Comparison of Corpas mum and dad 23andMe genotypes using SNPedia's visualization tool.
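
As a rough illustration of the three color categories, here is a minimal sketch of how two genotype calls for the same SNP could be classified. The rule used (count the alleles the two calls share, ignoring order) is my assumption and is not taken from the SNPedia tool; the function name and the "no call" category are likewise hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def compare_genotypes(a: str, b: str) -> str:
    """Classify two unphased diploid calls, e.g. 'AG' vs 'GG'.

    Assumed rule (not taken from the SNPedia tool): count the alleles
    the two calls have in common, ignoring allele order.
    """
    # Anything that is not two plain bases (e.g. '--') is treated as a no-call.
    for call in (a, b):
        if len(call) != 2 or not set(call) <= VALID_BASES:
            return "no call"
    # Multiset intersection: how many alleles appear in both calls.
    shared = sum((Counter(a) & Counter(b)).values())
    return {2: "match", 1: "half match", 0: "conflict"}[shared]
```

Under this rule, AG against GA is a match, AG against GG is a half match, and AA against GG is a conflict.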

It seems that Ben has taken this idea further and designed a blanket that incorporates chromosomal patterns for a complete 23andMe genotype. I quote here the description of this blanket from Ben’s blog:

First Gift is a precious blanket which compares the digital DNA data of a child with their parents. If the child’s genes are edited, these changes will mask the parent’s DNA with synthesized DNA. The blanket itself represents a sacred and fragile heirloom, where tampering with it could potentially lead to frayed edges and uncertain outcomes. This first genetic gift will be with the child for life, and will also be inherited by future generations.

Although technically speaking this visualization shows comparisons between any two individuals, and not between the two parents and child as stated in the blog, I am still amazed at the craftiness and ingenuity of this idea. And since it uses data from the Corpas family dataset, I hereby report it in Manuel Corpas’ Blog. Here is another unexpected and surprising effect of publishing our family genomes on the Internet.

To finish this blog entry, I borrow from Ben a complete profile view of my sister’s genotype patterned in his ‘First Gift’ blanket. According to him, the weaving of the blanket was done at the Textile Museum in Tilburg.

Blanket showing genotype pattern comparisons in my sister's 23andMe genotype. Each shape in theory corresponds to a chromosome comparison, although I still need to understand which chromosome each of the 25 shapes above represents (humans have 22 autosomes + 1 pair of sex chromosomes + mitochondrial DNA).

Scientific Announcements Don’t Get Noticed Where They Should

December 8th, 2011

Wouldn’t it be nice if the event you are trying to promote needed to be posted only once? What if there were a central repository for dissemination of announcements that was accessible and permanently up-to-date? Wouldn’t it be great if your blog or website could show relevant professional announcements without your having to enter them?

Unfortunately, people around the world are still trapped in the paper-based office paradigm when wanting to disseminate announcement information. Again and again they post their announcement to different places knowing that it will only reach a partial share of all potentially interested readers. They add data and clog online databases as no centralized repository is available for posting or getting information. Despite the great number of hours of work lost by millions of people trying to post, scientific organizations have been extremely slow to embrace community-shared announcement curation.

We (Rafael Jimenez and I) are promoting the creation of a community of organizations and people to lead iAnn, a centralized collaboration platform that coordinates curation efforts among scientific organizations. iAnn increases access to announcements through its dissemination tools, which have been designed specifically to integrate posts across many different websites with minimal effort. iAnn allows you to post your event, course or piece of news only once to a central repository, from which it is disseminated seamlessly to relevant scientific organizations or websites according to keywords, dates or geographical location.

If you think iAnn is of interest to you please contact me (see contact information on the right) or wait for future developments that are about to come in Manuel Corpas’ Blog. Currently we are in a development phase for the project and would like to hear from potential users or scientific organizations if they have any thoughts or suggestions on the matter. Our aim is to change the way anyone posts and finds relevant information about any given professional field. iAnn promises to help many users keep up-to-date with relevant announcements more effortlessly. Perhaps from now on websites will be better able to have most of the events, courses, seminars, news, etc. that users would expect to find in them.

Genomic Technologies in the Clinic: Challenges and Opportunities

October 9th, 2011

Next Generation Sequencing (NGS) offers the promise of revolutionizing our ability to diagnose genetic disorders. Fuelled by the exponential decrease in the cost of sequencing, NGS can now be outsourced, making it accessible to labs with modest budgets. A personal exome (the sum of all coding regions in a genome) is currently priced at $999 by some providers. Although not as comprehensive as whole genome sequencing, exomes can shed light on the origin of causative mutations lying within genes.

Exome Sequencing (by SarahKusala, CC-BY 3.0)

Getting the raw sequence data is the easy part. The challenging part is to extract and clinically interpret the genomic variation found in the raw data. The extraction of variants from raw NGS data can be influenced by many factors such as the sequence read depth, the alignment of reads and the variant calling algorithm. If one is to find the variants that may be of clinical relevance, filtering is required. This filtering may be performed by comparing genome data against the “normal” variation recorded in the 1000 Genomes Project and dbSNP. Depending on the length of the mutation, there are three main kinds of variants: SNPs, indels and CNVs. SNPs are single point mutations (one DNA base), indels are insertions or deletions of up to about 1 kb, and CNVs are deletions or duplications from 1 kb to many megabases long.
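
To make the size classes and the filtering step concrete, here is a minimal sketch. It assumes each variant is a plain dictionary with hypothetical "id" and "length_bp" keys, treats "about 1 kb" as exactly 1,000 bases, and represents the known "normal" variation as a simple set of identifiers. Real pipelines work on VCF files and are considerably more involved.

```python
INDEL_MAX_LENGTH_BP = 1_000  # "about 1 kb" in the text; exact cut-off is an assumption


def classify_variant(length_bp: int) -> str:
    """Return 'SNP', 'indel' or 'CNV' according to the variant length in bases."""
    if length_bp < 1:
        raise ValueError("variant length must be at least 1 base")
    if length_bp == 1:
        return "SNP"
    if length_bp <= INDEL_MAX_LENGTH_BP:
        return "indel"
    return "CNV"


def filter_known(variants, known_ids):
    """Keep only variants whose id is absent from the set of known variation."""
    return [v for v in variants if v["id"] not in known_ids]


# Made-up placeholder records, for illustration only.
variants = [
    {"id": "known_1", "length_bp": 1},
    {"id": "novel_1", "length_bp": 35},
    {"id": "novel_2", "length_bp": 250_000},
]
candidates = filter_known(variants, known_ids={"known_1"})
labels = [classify_variant(v["length_bp"]) for v in candidates]
```

Here `candidates` holds the two variants not seen in the known set, and `labels` records one as an indel and the other as a CNV.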

It is well known, however, that many SNPs fall into locations that are far from genes, yet they can cause phenotypic effects. But assuming that one is looking at coding regions, many pieces of software have been developed to predict the effects of SNP mutations: stop codons, missense mutations and frameshifts.

Indels and CNVs are slightly harder to interpret clinically. CNVs can encompass many genes and their phenotypic effect cannot be clearly established unless several patients have been observed with a similar CNV. It is not uncommon for a normal individual to carry hundreds of indels and CNVs.

Challenges

One of the most important challenges in the clinic when implementing genomics is going to be how to deal with the huge amounts of data produced. A great number of patients are going to be sequenced, each of them producing a huge number of genomic features of unknown significance. Given that evidence from several patients is needed in order to confidently interpret a rare variant, it is not surprising that another big challenge is how this information is going to be shared. A lot more data about a patient means that the chances of personal identification are increased even if this information is anonymized. Judging by a few tests routinely carried out today, it is possible to uniquely identify a person with only a handful of SNPs. Imagine what happens when one possesses thousands of genomic variants from one patient.
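
A back-of-the-envelope calculation shows why a handful of SNPs can be enough. The sketch below rests on strong simplifying assumptions of mine: every SNP has three possible genotypes, all equally likely, and SNPs are independent of each other. Neither holds in real populations, so this is only a lower bound on the number of SNPs needed; the population figure of 7 billion is also just an approximation.

```python
def min_snps_to_distinguish(population: int, genotypes_per_snp: int = 3) -> int:
    """Smallest n such that genotypes_per_snp ** n >= population."""
    n = 0
    combinations = 1
    while combinations < population:
        combinations *= genotypes_per_snp
        n += 1
    return n


# 3 ** 20 is about 3.5 billion and 3 ** 21 is about 10.5 billion,
# so 21 idealized SNPs already exceed 7 billion distinct combinations.
snps_needed = min_snps_to_distinguish(7_000_000_000)
```

With these idealized assumptions `snps_needed` is 21, which is tiny compared with the hundreds of thousands of SNPs in a typical genotyping result.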

Moreover, if this data is to be shared, a big challenge is going to be how it is compared. Different labs have different Quality Control (QC) standards and different platforms. Each sequencing run may have different read depths and different levels of confidence in terms of whether a called variant is true. Another issue will be how the annotation of phenotypes is carried out. There are phenotypic ontologies, like the Human Phenotype Ontology, that allow a reasonably complete set of clinical descriptions. Nevertheless, there is no guarantee that phenotypic descriptions, even when using the same ontology, will have the same level of annotation. All these factors are going to need consideration when interpreting NGS in the clinic.

One of the main hurdles to bringing NGS into the clinic can also be the health system of the country. The UK seems to have been able, for now, to get many state funded clinical labs to work together. Unfortunately, this would be unthinkable in countries like Spain, where instead of 1 unique health system there are 17, as many as there are autonomous regions. Sequencing technologies require many different sectors to coordinate in order to set up the appropriate platforms that guarantee access to the technology, its proper interpretation and the protection of the patient’s privacy.

Opportunities

The other side of the coin is that these technologies are going to become increasingly affordable, not just for the rich countries but also for the emerging ones. The accessibility of this technology will make it ubiquitous in many labs around the world, not just in those looking to diagnose patients with genomic disorders. Expect sequencing to be routinely performed for cancer tissues and even at birth. Based on current estimates, it is likely that by 2020 there will be hundreds of millions of genomes sequenced.

Conclusion

Sequencing is going to revolutionize clinical practice. The degree to which it will revolutionize it depends on how we harness the challenges described above. There will be technical problems but also institutional ones that are more problematic to solve. The race for harnessing NGS in the clinical setting is on.

myKaryoView: First Open Source Visualization Software for 23andMe Data

September 1st, 2010

myKaryoView Logo

Following my previous post on the First Publicly Available Genome Via DAS, I would like to present an open source tool that Rafael Jimenez and I have developed for the visualization of genomic data. Here we have it configured to display 23andMe data as a test case. We call it myKaryoView and it is available for free use and download. Its website is located at the following address:

http://mykaryoview.com

myKaryoView works in most contemporary browsers without lengthy installations and uses publicly available data distributed throughout the Internet via DAS. This means that there is no need to hold the data locally and that it is capable of visualizing any data as long as it is available via DAS. In order to visualize 23andMe data, myKaryoView requires the setup of a DAS source, which currently limits myKaryoView’s usage to those familiar with this technology. However, configuring and adding sources is extremely simple, and the amount of data that can be displayed is limited only by the time needed to complete the request and render the data.

Here we use myKaryoView to display personal genomics data from a dummy 23andMe genome data source. This source is based on real 23andMe results data from my own genome, randomly modified beyond recognition.

The myKaryoView website shows an implementation that allows search of genome data via gene name or genome coordinates. For example, type in the search box 1:2000000,6000000 and hit “Submit Query”.

myKaryoView Zoom and Chromosome views.

The figure above shows the results of that query, with two tracks containing the source from 23andMe with dummy data plus genes for a subchromosomal region in chromosome 1, Start: 2000000, End: 60000000. Gene names and SNP data are shown in red and blue respectively. Different color shades indicate the density of annotation at any given point. If the “Gene Names” data track name is clicked, a popup window appears with a link “Display Original Data Source” that allows the download of the raw data from its DAS source. Any feature can be clicked to retrieve the specific information contained in the DAS source. Here a blue SNP mark is clicked and a popup window appears describing the selected SNP together with a link to its corresponding dbSNP entry.

A simple manual explaining how to install and configure myKaryoView to show different data sources is provided on the website. myKaryoView is still in beta testing and any feedback is welcome. We have some plans for the near future for myKaryoView, which we will reveal in due time. Meanwhile I hope you find it interesting and useful.

By the way, the claim that this is the First Open Source Visualization for 23andMe data is, of course, arguable.

First Publicly Available Personal Genome Via DAS

August 26th, 2010

You may have heard stories about well known people who have released their genomes for public use. I would like to convince you that you no longer need to have a lot of money or be a public figure in order to do that. Companies like 23andMe and Navigenics provide the ability to get one’s genome tested for not a lot of money and to get the results via a password protected website. The problem is that our current understanding of what these results mean on their own is rather limited. Thus having open collaboration platforms for citizen science using genomic data may be a step forward in helping people understand their genetic testing results. Initiatives like DIYgenomics are already working on this concept.

You may wonder why releasing one’s genome is useful. The answer is that, in practical terms, it is not. However, I consider the concept of being able to do so a very interesting one. After all, one’s genome data on its own is hardly informative, but when compared with information like known genes, pathways or even other people’s genomes, it becomes much more interesting and opens up the possibility of real discoveries.

With this post I hope to prove that genomes can now be put on the web in a standard format like the Distributed Annotation System (DAS), where people can share and integrate them with other public data sources mappable to genome coordinates. DAS is an environment that is open source, decentralized and unregulated. So what is different here from what is being done already? Why is this significant? I can think of at least three reasons. 1) Flexibility: pretty much any genome annotation can be put up; 2) Integration capabilities: anything can be combined with anything else as long as they share the same coordinate system and 3) Data outsourcing: data is stored and maintained by DAS source owners elsewhere. Here is my story:

Last year I decided to get a 23andMe kit to have my genome analyzed. After results were delivered, I decided to download the data in raw format, consisting of >0.5M SNPs (single nucleotide polymorphisms) mapped to the NCBI36 genome assembly.

I wanted to experiment with this data from a bioinformatics point of view, so I decided to put my “genome” on the web for public access. Well, almost. I did not put up my real genome; I created a randomly shuffled version of it (i.e. it bears no recognizable trace of the real data). I put up this unreal data to make a point of principle.

Anyone in the world can thus access my randomly shuffled genome using a URL like this:

http://mykaryoview.com:9000/das/mykaryoview/features?segment=1:1,2000000

where after the token “segment=” in the above URL a chromosome is specified [1-22, X, Y], followed by a colon, followed by the start and end positions, separated by a comma. Try the above URL with different chromosome numbers and coordinates and see what results you get!

This is what you get when requesting a valid URL with the genome coordinates

In the above figure you see different columns, denoting the SNP id, start and end positions, the genotype under the “Notes” heading and a link to the SNP’s corresponding entry in dbSNP.
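
The URL pattern described above is easy to generate programmatically. The following is a minimal sketch that only builds the request string and performs no network access; the function name and the validation rules are my own additions for illustration, while the base address and the "segment=" syntax are the ones shown in this post.

```python
BASE_URL = "http://mykaryoview.com:9000/das/mykaryoview/features"
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def das_features_url(chromosome, start: int, end: int, base: str = BASE_URL) -> str:
    """Build a features request: <base>?segment=<chromosome>:<start>,<end>."""
    chrom = str(chromosome).upper()
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    if not 1 <= start <= end:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    return f"{base}?segment={chrom}:{start},{end}"


example_url = das_features_url(1, 1, 2_000_000)
```

Calling it with chromosome 1 and positions 1 to 2000000 reproduces the example URL given earlier.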

Now that this genome is in a standard format, it can easily be integrated with any other publicly available data in DAS. As of this writing (26th August 2010) there are 139 data sources available in the DAS registry mapped to Human Genome coordinates. I may not be interested in them all, but certainly this is one of the greatest repositories of genomic data in just one shop. Leading providers of publicly available genomic DAS sources include Ensembl, the Database of Genomic Variants and ENCODE. The potential permutations of this data provide a range of possibilities for interrogating biological hypotheses that is probably unparalleled.

Now this shuffled genome is available for public use via a DAS web service. It will probably not be the last one to be put up and soon real 23andMe genomes will follow.

Open Tech 2010

July 25th, 2010

I will be speaking at Open Tech 2010 in London (UK) on Saturday 11 September 2010. My talk, entitled ‘Who Owns my Genome Data’, will be delivered in the Seminar Room (First Session, 10:30 a.m.). If you are planning to attend Open Tech 2010 this year, let me know and be sure to attend my talk!

The Power of Incidental Findings

July 11th, 2010

Imagine that you are a geneticist who receives a patient with a rare genetic disorder in your clinic. Your patient is a 3 year old girl with severe learning difficulties. After looking at her sample under the microscope you find nothing of note and so you dismiss the case as inconclusive. Up until recently, that was the normal scenario for most cases in the genetics clinic. Today however, with next generation sequencing techniques, we are able to look at base pair level resolution, about 10,000 times the resolution microscopy can afford. At a later date, the resources become available to you and you decide to carry out a next generation sequencing analysis for the patient and both parents.

With the results in hand, you identify a candidate mutation most likely to have caused the observed symptoms. You look then at the parents’ genotype to see whether the mutation is inherited and find that the father is not the real biological one, i.e. you are faced with an incidental finding of non-paternity. This is one example of pieces of information contained in the genetic material, not related to the diagnosis, but necessary to carry out the analysis. Incidental findings like these have the potential of dramatically affecting the lives of patients and their families.

How should we deal with incidental findings capable of dramatically affecting families? The answer is not simple. For instance, when non-paternity is identified, the case could be dismissed as non-analyzable to avoid the complex ethical ramifications of such a revelation on the family.

Let’s consider another incidental finding, such as detecting that one of the parents is a carrier of the mutation that is causing the disease in the patient. Such a finding may imply that future children of the couple may also inherit the mutation. Should parents be informed or should they be allowed to choose whether they find out or not?

There are many other incidental findings where strong arguments can be made either for or against informing the family. One example is the discovery of a malignant variant for the APOE gene in the patient, implicated in early onset Alzheimer’s. This is something for which the patient was not tested, but still it has the potential of having a life changing effect on the patients or their parents. Another example is the finding of a malignant mutation for BRCA2, which increases by 70% the chances of developing breast cancer in female carriers. There are many other incidental findings such as these where practitioners are undecided on how to manage that information sensibly.

What is clear, though, is that incidental findings will have to be dealt with on an individual basis. Informing the family will depend on the possible effects of the information found as well as on personal circumstances. Here I have limited my exposition to a few common examples of current challenges, but as technology progresses and more patients and families are sequenced, many new conundrums are likely to appear, requiring new ethical debate.

Sending Sensitive Data Encrypted

July 8th, 2010

The other day I was asked to find a way to send sensitive clinical data to another institute. How can one make sure that the data is protected and only accessible to the right people? There are two aspects of protecting data, reflecting the different risks to which the data may be exposed:

  • data in transit (email “in flight”, web or FTP downloads, data sets on USB disks shipped by FedEx, etc)
  • data at rest (email arrived in recipient’s inbox, data copied to collaborator’s working disk, etc)

Here we will only explore the requirements for encrypting data in transit. The security of the data at rest is assumed to be taken care of by the collaborator or their IT staff, since it is outside one’s control.

There are various possible file transfer methods:

  • email – suitable for small files (typically up to 5MB although different sites impose different limits); no automatic encryption in transit
  • FTP or non-SSL password-protected web site – suitable for large files (in the GB range); no automatic encryption in transit
  • scp – suitable for large files; intrinsic encryption in transit; likely to encounter firewall issues
  • password-protected SSL web site – suitable for large files; intrinsic encryption in transit
  • USB disk – suitable for very large data sets (TB range); no automatic encryption in transit

When encryption is mandated (e.g. by a data access agreement) and the file transfer method does not provide encryption intrinsically, it is necessary to encrypt the data separately and transfer the encrypted file by the chosen method.

For ad-hoc or one-off data encryption, it is appropriate to encrypt a data set with a password (“symmetric encryption”, because the same password is used to encrypt and decrypt) which will be sent to the recipient by a separate means to the actual data. For example, if the data is shipped on a USB disk, the password could be sent by email, or given over the phone. Sending the password with the encrypted data defeats the object of encrypting it!

For regular or scheduled data transfers, public-key encryption may be suitable – and removes the need to send a password – but that will not be explored here due to the extra work in creating and managing keys.

A suitable encryption tool on Linux systems is gpg (the GNU Privacy Guard). The simplest usage is to prepare a single file containing the data in question using tar or zip, and then to encrypt that:

$ gpg -c bigfile.tar
gpg: gpg-agent is not available in this session
Enter passphrase:
Repeat passphrase:

$ ls bigfile.tar*
bigfile.tar    bigfile.tar.gpg

At this point, "bigfile.tar.gpg" is the encrypted file which is safe to transfer by email, FTP, or any other non-encrypted method. Note that the passphrase is not displayed while it is being entered; and that the encrypted file is typically smaller than the original due to compression in the encryption process. However it is necessary to have enough disk space to contain both the original and the encrypted data simultaneously, which may make this approach unsuitable for very large (TB) datasets.

The passphrase should be chosen with the same care as a computer login password. The Linux utility "pwgen" produces a selection of random passwords which may be useful in selecting a suitable passphrase.
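
Where pwgen is not installed, a random passphrase can also be generated with Python's standard library. This is a minimal sketch and not part of the original procedure; the function name, the 24-character default length and the letters-and-digits alphabet are arbitrary choices of mine, not recommendations from any policy.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def make_passphrase(length: int = 24) -> str:
    """Return a random passphrase drawn from letters and digits.

    Uses the secrets module, which is intended for security-sensitive
    randomness (unlike the random module).
    """
    if length < 1:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


passphrase = make_passphrase()
```

The resulting string can be typed at the gpg "Enter passphrase:" prompt and then sent to the recipient by a separate channel, as described above.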

The recipient will decrypt the file in a similar way:

$ gpg bigfile.tar.gpg
gpg: CAST5 encrypted data
gpg: gpg-agent is not available in this session
Enter passphrase:
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected

Note that if the passphrase is lost then it is vanishingly unlikely that the encrypted data can be recovered. Unless the passphrase is easily guessable, the encryption is sufficiently strong as to defeat most attempts to break it.

Written by Dr David Holland (WTSI), adapted by Manuel Corpas. Posted with Dr Holland's permission.

Biomedical Community-Wide Annotation Using Wikipedia

June 3rd, 2010

The pace of data generation is leaving our ability to convert this data into usable knowledge far behind. Even well funded biomedical databases find it increasingly difficult to keep up. In order to tackle this problem, some databases have opted for increasing automation in the way data is deposited, reducing the time needed for interpreting results. The problem with this approach is that the resulting knowledge is less accurate and of lower quality than manually annotated entries. Another potential solution has been to engage leading experts, creating a sort of consortium where they give some of their time to curate data entries that match their specialties. Unfortunately, engaging world experts in curating biomedical resources has not had a lot of success, with a few contributing a lot and many hardly ever dedicating any time to curation no matter how much they were chased.

A new revolutionary idea has come from Alex Bateman’s group to engage not just the community of experts but the whole of the Internet, using Wikipedia. One of his group’s databases, Rfam, which characterises RNA families, is now providing all of its annotation via Wikipedia. Wikipedia is already the leading reference resource for all kinds of information. It possesses the know-how and capability to mediate the curation of database entries, and it has had resounding success in gathering reasonably high quality knowledge.

After having a persuasive discussion with Alex, I decided to give it a try myself and add my very first entry to Wikipedia, which I thought could potentially help the database I develop outsource the annotation of its public/non-sensitive data.

I copied, edited and formatted parts of a non-sensitive entry (a Syndrome description) to Wikipedia. I learnt (contrary to what I expected) that as long as one has an account and no entry exists on the topic, a page can be added on the fly. So I added a page and started editing, copying and pasting.

It took me a bit of time to get used to some of the conventions and formatting tags used by Wikipedia, but very early on I had help from Wikipedia ‘agents’. It really surprised me how quickly these agents picked up my entry and immediately let me know the criteria for making sure this Wikipedia entry achieves a high standard.

I learnt about important concepts in the Wikipedia context such as Notability and Conflict of Interest. Apparently one cannot write about oneself, for example, and personal opinions or articles are not accepted. So far this was OK for me, although problems came when one of these agents pointed out some copyright issues: I was trying to copy an entry from a website/database.

Blatant copying of public content from another website is considered a copyright violation unless a correct license is in place and one ‘owns’ the data. In our case, the Creative Commons License we hold was not OK because, although it permits public use of the information, it does not allow alteration. This means that people would not be able to edit my Wikipedia entry.

I must admit I felt intimidated at this point. Despite that, I was extremely impressed with the efficacy with which agents acted as well as how quickly they responded to my queries. I can understand why they have to be so tough so that they prevent abuse.

Overall I feel quite satisfied with what I have learnt in the process and I am extremely eager to keep exploring the use of Wikipedia for database curation. Of course this is just a try and our adopted solution for keeping up with current annotation may be something different in the end. However, it is worth a try.

Is Tide of Privacy War Turning Against Facebook?

May 14th, 2010

A series of articles in the NY Times and elsewhere is being written about people’s increasing disgust with Facebook’s privacy policies. Apparently, one has to work through a bewildering tangle of options in order to set the privacy configuration of one’s own profile. What is more, a group of 4 nerds in NYC have launched a call to arms against Facebook, promising to develop a social network called Diaspora* that will not require users to surrender their privacy to be sold to third parties.

Whether this is going to be the beginning of a long battle for the holy grail of social network dominance or just another trifling spark against the giant remains to be seen. What is clear is that a public clamor for privacy protection is mounting. The proof is that Diaspora* has been raising funds for this new venture on Kickstarter and, with 18 days to go before this round closes, they have already been pledged 1273% of the money they initially asked for.

Perhaps there are now clear signs that cyberusers are getting tired of having rules imposed on them by the big monopoly, or perhaps they would simply like to see new blood providing more self-control options. It is clear though that the battle for personal privacy on the web continues and that the Tide of War might be turning against Facebook. What is not so clear is how all this will affect the end user.

