Visualizing Your DNA Genotype Profile in a Blanket
March 6th, 2012 § 1 Comment
I have come across a rather ingenious idea on Ben Landau’s blog. You may have read that Mike Cariaso, founder of SNPedia, created a visualization in which the genotypes of my family members were compared against each other. Patterns were generated for each chromosome comparison. To show what this looks like, I have taken from Mike’s tool an image comparing the 23andMe genotypes of my mom and dad (x and y axis respectively), with each pixel being a SNP and different colors representing a match (light blue), a half match (dark blue) and a conflict (red).

Chromosome 1. Comparison of Corpas mum and dad 23andMe genotypes using SNPedia's visualization tool.
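To make the three colors more concrete, here is a minimal Python sketch of how two genotype calls at the same SNP could be classified. The rule used (two shared alleles is a match, one is a half match, none is a conflict) is my assumption of what the categories mean, and the function name `compare_genotypes` is made up for illustration; it is not taken from SNPedia's tool.

```python
from collections import Counter

# Colors as described for the figure; the classification rule itself is an assumption.
COLORS = {"match": "light blue", "half match": "dark blue", "conflict": "red"}


def compare_genotypes(a: str, b: str) -> str:
    """Classify two diploid genotype calls such as "AG" and "GG".

    Allele order is ignored, so "AG" and "GA" count as the same genotype.
    """
    # Multiset intersection counts how many alleles the two calls share.
    shared = sum((Counter(a) & Counter(b)).values())
    if shared == 2:
        return "match"
    if shared == 1:
        return "half match"
    return "conflict"


if __name__ == "__main__":
    for pair in [("AG", "GA"), ("AG", "GG"), ("AA", "GG")]:
        category = compare_genotypes(*pair)
        print(pair, category, COLORS[category])
```

No-calls (reported as "--" in 23andMe raw data) are not handled here and would need their own category.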
It seems that Ben has taken this idea further and designed a blanket that incorporates chromosomal patterns for a complete 23andMe genotype. I quote here the description of this blanket from Ben’s blog:
First Gift is a precious blanket which compares the digital DNA data of a child with their parents. If the child’s genes are edited, these changes will mask the parent’s DNA with synthesized DNA. The blanket itself represents a sacred and fragile heirloom, where tampering with it could potentially lead to frayed edges and uncertain outcomes. This first genetic gift will be with the child for life, and will also be inherited by future generations.
Although, technically speaking, this visualization shows comparisons between any two individuals, and not between the two parents and a child as stated on the blog, I am still amazed at the craft and ingenuity of this idea. And since it uses data from the Corpas family dataset, I hereby report it in Manuel Corpas’ Blog. Here is another unpredicted and surprising effect of publishing our family genomes on the Internet.
To finish this blog entry, I borrow from Ben a complete profile view of my sister’s genotype patterned in his ‘First Gift’ blanket. According to him, the blanket was woven at the Textile Museum in Tilburg, the Netherlands.

Blanket showing genotype pattern comparisons in my sister's 23andMe genotype. Each shape in theory corresponds to a chromosome comparison, although I still need to work out which chromosome each of the 25 shapes above represents (humans have 22 autosomes, the X and Y sex chromosomes and mitochondrial DNA).
Scientific Announcements Don’t Get Noticed Where They Should
December 8th, 2011 § Leave a Comment
Wouldn’t it be nice if the event you are trying to promote needed to be posted only once? What if there was a central repository for dissemination of announcements that was accessible and permanently up-to-date? Wouldn’t it be great if your blog or website could show relevant professional announcements without having to enter them?
Unfortunately, people around the world are still trapped in the paper-based office paradigm when wanting to disseminate announcement information. Again and again they post their announcement to different places knowing that it will only reach a partial share of all potentially interested readers. They add data and clog online databases as no centralized repository is available for posting or getting information. Despite the great number of hours of work lost by millions of people trying to post, scientific organizations have been extremely slow to embrace community-shared announcement curation.
We (Rafael Jimenez and I) are promoting the creation of a community of organizations and people to lead iAnn, a centralized collaboration platform that coordinates curation efforts among scientific organizations. iAnn increases access to announcements through its dissemination tools, which have been designed specifically to integrate posts across many different websites with minimal effort. iAnn allows you to post your event, course or piece of news only once to a central repository, from which it is disseminated seamlessly to relevant scientific organizations or websites according to keywords, dates or geographical location.
If you think iAnn is of interest to you please contact me (see contact information on the right) or wait for future developments that are about to come in Manuel Corpas’ Blog. Currently we are in a development phase for the project and would like to hear from potential users or scientific organizations if they have any thoughts or suggestions on the matter. Our aim is to change the way anyone posts and finds relevant information about any given professional field. iAnn promises to help many users keep up-to-date with relevant announcements more effortlessly. Perhaps from now on websites will be better able to have most of the events, courses, seminars, news, etc. that users would expect to find in them.
myKaryoView: First Open Source Visualization Software for 23andMe Data
September 1st, 2010 § Leave a Comment
Following my previous post on the First Publicly Available Genome Via DAS, I would like to present open source software that Rafael Jimenez and I have developed for the visualization of genomic data. Here we have it configured to display 23andMe data as a test case. We call it myKaryoView and it is available for free use and download. Its website is located at the following address:
myKaryoView works in most contemporary browsers without lengthy installations and uses publicly available data distributed throughout the Internet via DAS. This means that there is no need to hold the data locally and that it can visualize any data as long as it is available via DAS. In order to visualize 23andMe data, myKaryoView requires a DAS source to be set up, which currently limits its usage to those familiar with this technology. However, configuring and adding sources is extremely simple, and the amount of data that can be displayed is limited only by the time needed to complete the request and render the data.
Here we use myKaryoView to display personal genomics data from a dummy 23andMe genome data source. This source is based on real 23andMe results from my own genome, randomly modified so that it is no longer recognizable.
The myKaryoView website shows an implementation that allows search of genome data via gene name or genome coordinates. For example, type in the search box 1:2000000,6000000 and hit “Submit Query”.
The figure above shows the results of that query, with two tracks containing the 23andMe source with dummy data plus genes for a subchromosomal region in chromosome 1, Start: 2000000, End: 6000000. Gene names and SNP data are shown in red and blue respectively. Different color shades indicate the density of annotation at any given point. If the “Gene Names” data track name is clicked, a popup window appears with a link, “Display Original Data Source”, that allows the raw data to be downloaded from its DAS source. Any feature can be clicked to retrieve the specific information contained in the DAS source. Here a blue SNP mark is clicked and a popup window appears describing the selected SNP, with a link to its corresponding dbSNP entry.
A simple manual explaining how to install and configure myKaryoView to show different data sources is provided from the website. myKaryoView is still in beta testing and any feedback is welcome. We have some plans for the near future for myKaryoView, which we will reveal in due time. Meanwhile I hope you find it interesting and useful.
By the way, the claim that this is the First Open Source Visualization for 23andMe data is, of course, arguable.
First Publicly Available Personal Genome Via DAS
August 26th, 2010 § 3 Comments
You may have heard stories about some well known people who have released their genome for public use. I would like to convince you that you no longer need a lot of money or to be a public figure in order to do that. Companies like 23andMe and Navigenics provide the ability to get one’s genome tested for not a lot of money and to get the results via a password protected website. The problem is that our current understanding of what these results mean on their own is rather limited. Thus having open collaboration platforms for citizen science using genomic data may be a step forward in helping people understand their genetic testing results. Initiatives like DIYgenomics are already working on this concept.
You may wonder why releasing one’s genome is useful. The answer is that, in practical terms, it is not. However, I consider the concept of being able to do so a very interesting one. After all, one’s genome data on its own is hardly informative, but when compared with information like known genes, pathways or even other people’s genomes, it becomes much more interesting and opens up the possibility of real discoveries.
With this post I hope to prove that genomes can now be put on the web in a standard format like the Distributed Annotation System (DAS), where people can share and integrate them with other public data sources mappable to genome coordinates. DAS is an environment that is open source, decentralized and unregulated. So what is different here from what is being done already? Why is this significant? I can think of at least three reasons. 1) Flexibility: pretty much any genome annotation can be put up; 2) Integration capabilities: anything can be combined with anything else as long as they share the same coordinate system and 3) Data outsourcing: data is stored and maintained by DAS source owners elsewhere. Here is my story:
Last year I decided to get a 23andMe kit to have my genome analyzed. After results were delivered, I decided to download the data in raw format, consisting of >0.5M SNPs (single nucleotide polymorphisms) mapped to the NCBI36 genome assembly.
I wanted to experiment with this data from a bioinformatics point of view, so I decided to put my “genome” on the web for public access. Well, almost. I did not put up my real genome; I created a randomly shuffled version of it (i.e. it bears no recognizable resemblance to the real data). I put up this unreal data to make a point of principle.
Anyone in the world can thus access my randomly shuffled genome using a URL like this:
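For readers who want to try something similar, below is a minimal Python sketch of one possible way to scramble a raw 23andMe file. It assumes the usual tab-separated raw format (comment lines starting with "#", then rsid, chromosome, position and genotype) and simply permutes the genotype column across SNPs. This is an illustration of the idea only, not the procedure I actually used, and it should not be taken as a vetted anonymization method. The sample records and function names are made up.

```python
import random

# Made-up records in the tab-separated layout of a 23andMe raw data file.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\t2\t3000\tCC\n"
    "rs0000004\tX\t4000\tGT\n"
)


def parse_raw(text):
    """Return (rsid, chromosome, position, genotype) tuples, skipping comments."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        records.append((rsid, chrom, int(pos), genotype))
    return records


def shuffle_genotypes(records, seed=None):
    """Permute the genotype column while keeping SNP ids and coordinates in place."""
    rng = random.Random(seed)
    genotypes = [genotype for _, _, _, genotype in records]
    rng.shuffle(genotypes)
    return [
        (rsid, chrom, pos, new_genotype)
        for (rsid, chrom, pos, _), new_genotype in zip(records, genotypes)
    ]


if __name__ == "__main__":
    for record in shuffle_genotypes(parse_raw(SAMPLE), seed=42):
        print(*record, sep="\t")
```

Because only the genotype column is permuted, every SNP still sits at its real coordinate, which is what allows the shuffled file to be served through a coordinate-based system.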
http://mykaryoview.com:9000/das/mykaryoview/features?segment=1:1,2000000
where, after the token “segment=” in the above URL, a chromosome is specified [1-22, X, Y], followed by a colon, followed by the start and end positions separated by a comma. Try the above URL with a different chromosome number and coordinates and see what results you get!
In the above figure you see different columns, denoting the SNP id, start and end positions, the genotype under the “Notes” heading and a link to the SNP’s corresponding entry in dbSNP.
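The request and the columns described above can also be handled programmatically. The Python sketch below builds a features URL in the pattern shown earlier and extracts the id, start, end, note and link from a DAS-style XML response. The sample XML is made up and only loosely modeled on the DAS features format, so the element names should be checked against a real response; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

BASE_URL = "http://mykaryoview.com:9000/das/mykaryoview/features"
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Made-up response; element names are an assumption modeled on DAS features XML.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="1" start="1" stop="2000000">
      <FEATURE id="rs0000001" label="rs0000001">
        <TYPE id="SNP">SNP</TYPE>
        <START>1000</START>
        <END>1000</END>
        <NOTE>AG</NOTE>
        <LINK href="http://example.org/snp/rs0000001">dbSNP</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def das_features_url(chromosome, start, end, base_url=BASE_URL):
    """Build a URL of the form ...features?segment=CHROMOSOME:START,END."""
    chromosome = str(chromosome).upper()
    if chromosome not in VALID_CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome}")
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    return f"{base_url}?segment={chromosome}:{start},{end}"


def parse_features(xml_text):
    """Return one dict per FEATURE element with the columns described above."""
    rows = []
    for feature in ET.fromstring(xml_text).iter("FEATURE"):
        link = feature.find("LINK")
        rows.append({
            "id": feature.get("id"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
            "note": feature.findtext("NOTE"),
            "link": link.get("href") if link is not None else None,
        })
    return rows


if __name__ == "__main__":
    print(das_features_url(1, 1, 2000000))
    print(parse_features(SAMPLE_RESPONSE))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose, since the server shown may not be reachable.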
Now that this genome is in a standard format, it can easily be integrated with any other publicly available data in DAS. As of this writing (26th August 2010) there are 139 data sources available in the DAS registry mapped to Human Genome coordinates. I may not be interested in them all, but certainly this is one of the greatest repositories of genomic data in just one shop. Leading providers of publicly available genomic DAS sources include Ensembl, the Database of Genomic Variants and ENCODE. The potential permutations of this data provide a range of possibilities for interrogating biological hypotheses that is probably unparalleled.
Now this shuffled genome is available for public use via a DAS web service. It will probably not be the last one to be put up and soon real 23andMe genomes will follow.
Open Tech 2010
July 25th, 2010 § Leave a Comment
I will be speaking at Open Tech 2010 in London (UK) on Saturday 11 September 2010. My talk, entitled ‘Who Owns my Genome Data’, will be delivered in the Seminar Room (First Session, 10:30 a.m.). If you are planning to attend Open Tech 2010 this year, let me know and be sure to attend my talk!
Sending Sensitive Data Encrypted
July 8th, 2010 § Leave a Comment
The other day I was asked to find a way to send sensitive clinical data to another institute. How can one make sure that the data is protected and only accessible to the right people? There are two aspects of protecting data, reflecting the different risks to which the data may be exposed:
- data in transit (email “in flight”, web or FTP downloads, data sets on USB disks shipped by FedEx, etc)
- data at rest (email arrived in recipient’s inbox, data copied to collaborator’s working disk, etc)
Here we will only explore the requirements for encrypting data in transit. The security of the data at rest is assumed to be taken care of by the collaborator or their IT staff, since it is outside one’s control.
There are various possible file transfer methods:
- email – suitable for small files (typically up to 5MB although different sites impose different limits); no automatic encryption in transit
- FTP or non-SSL password-protected web site – suitable for large files (in the GB range); no automatic encryption in transit
- scp – suitable for large files; intrinsic encryption in transit; likely to encounter firewall issues
- password-protected SSL web site – suitable for large files; intrinsic encryption in transit
- USB disk – suitable for very large data sets (TB range); no automatic encryption in transit
When encryption is mandated (e.g. by a data access agreement) and the file transfer method does not provide encryption intrinsically, it is necessary to encrypt the data separately and transfer the encrypted file by the chosen method.
For ad-hoc or one-off data encryption, it is appropriate to encrypt a data set with a password (“symmetric encryption”, because the same password is used to encrypt and decrypt) which will be sent to the recipient by a separate means to the actual data. For example, if the data is shipped on a USB disk, the password could be sent by email, or given over the phone. Sending the password with the encrypted data defeats the object of encrypting it!
For regular or scheduled data transfers, public-key encryption may be suitable – and removes the need to send a password – but that will not be explored here due to the extra work in creating and managing keys.
A suitable encryption tool on Linux systems is gpg (the GNU Privacy Guard). The simplest usage is to prepare a single file containing the data in question using tar or zip, and then to encrypt that:
$ gpg -c bigfile.tar
gpg: gpg-agent is not available in this session
Enter passphrase:
Repeat passphrase:
$ ls bigfile.tar*
bigfile.tar bigfile.tar.gpg
At this point, "bigfile.tar.gpg" is the encrypted file which is safe to transfer by email, FTP, or any other non-encrypted method. Note that the passphrase is not displayed while it is being entered; and that the encrypted file is typically smaller than the original due to compression in the encryption process. However it is necessary to have enough disk space to contain both the original and the encrypted data simultaneously, which may make this approach unsuitable for very large (TB) datasets.
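If the encryption step needs to be scripted, the same command can be wrapped from Python. The sketch below is only an illustration, assuming gpg is installed and on the PATH: it builds the argument list for symmetric encryption (the long form of `-c` is `--symmetric`) and leaves the passphrase prompt to gpg itself. The helper names are made up.

```python
import subprocess


def gpg_symmetric_command(path, output=None):
    """Return the argument list equivalent to `gpg -c path`."""
    output = output or path + ".gpg"
    # --symmetric is the long form of -c; --output names the encrypted file.
    return ["gpg", "--symmetric", "--output", output, path]


def encrypt_file(path):
    """Run gpg, which prompts for the passphrase; raises if gpg exits with an error."""
    subprocess.run(gpg_symmetric_command(path), check=True)
    return path + ".gpg"


if __name__ == "__main__":
    print(" ".join(gpg_symmetric_command("bigfile.tar")))
```

Keeping the passphrase out of the script, and out of the command line, avoids leaving it in shell history or process listings.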
The passphrase should be chosen with the same care as a computer login password. The Linux utility "pwgen" produces a selection of random passwords which may be useful in selecting a suitable passphrase.
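Where pwgen is not available, a comparable random passphrase can be produced with Python's standard `secrets` module. This is a minimal sketch; the default length of 20 characters, the minimum of 12 and the letters-and-digits alphabet are arbitrary choices of mine rather than a recommendation from any policy.

```python
import secrets
import string

# Letters and digits only, so the passphrase is easy to read out over the phone.
ALPHABET = string.ascii_letters + string.digits


def random_passphrase(length=20):
    """Return a random passphrase drawn from a cryptographically secure source."""
    if length < 12:
        raise ValueError("passphrase too short to be useful")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_passphrase())
```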
The recipient will decrypt the file in a similar way:
$ gpg bigfile.tar.gpg
gpg: CAST5 encrypted data
gpg: gpg-agent is not available in this session
Enter passphrase:
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected
Note that if the passphrase is lost then it is vanishingly unlikely that the encrypted data can be recovered. Unless the passphrase is easily guessable, the encryption is sufficiently strong as to defeat most attempts to break it.
Written by Dr David Holland (WTSI), adapted by Manuel Corpas. Posted with Dr Holland's permission.
Biomedical Community-Wide Annotation Using Wikipedia
June 3rd, 2010 § 9 Comments
The pace of data generation is far outstripping our ability to convert this data into usable knowledge. Even well funded biomedical databases find it increasingly difficult to keep up. In order to tackle this problem, some databases have opted for increasing automation in the way data is deposited, reducing the time needed for interpreting results. The problem with this approach is that the resulting knowledge is less accurate and of lower quality than manually annotated entries. Another potential solution has been to engage leading experts, creating a sort of consortium where they give some of their time to curate data entries that match their specialties. Unfortunately, engaging world experts in curating biomedical resources has not had a lot of success, with a few contributing a lot and many hardly ever dedicating any time to curation, however often they were approached.
A new revolutionary idea has come from Alex Bateman’s group: to engage not just the community of experts but the whole of the Internet, using Wikipedia. One of his group’s databases, Rfam, which characterises RNA families, is now providing all of its annotation via Wikipedia. Wikipedia is already the leading reference resource for all kinds of information. It possesses the know-how and capability to mediate the curation of database entries, and it has been resoundingly successful at gathering knowledge of reasonably high quality.
After a persuasive discussion with Alex, I decided to give it a try myself and add my very first entry to Wikipedia, which I thought could potentially help the database I develop outsource the annotation of its public, non-sensitive data.
I copied, edited and formatted parts of a non-sensitive entry (a Syndrome description) to Wikipedia. I learnt, contrary to what I expected, that as long as one has an account and no entry exists on the topic, a page can be added on the fly. So I added a page and started editing, copying and pasting.
It took me a bit of time to get used to some of the conventions and formatting tags used by Wikipedia, but very early on I had help from Wikipedia ‘agents’. It really surprised me how quickly these agents picked up my entry and immediately let me know the criteria for making sure this Wikipedia entry achieves a high standard.
I learnt about important concepts in the Wikipedia context such as Notability and Conflict of Interest. Apparently one cannot write about oneself, for example, and personal opinions or articles are not accepted. So far this was OK for me, although problems came when one of these agents pointed out some copyright issues: I was trying to copy an entry from a website/database.
Blatant copying of public content from another website is considered a copyright violation unless a suitable license is in place and one ‘owns’ the data. In our case, the Creative Commons License we hold was not acceptable because, although it permits public use of the information, it does not allow alteration. This means that people would not be able to edit my Wikipedia entry.
I must admit I felt intimidated at this point. Despite that, I was extremely impressed with the efficacy with which agents acted as well as how quickly they responded to my queries. I can understand why they have to be so tough so that they prevent abuse.
Overall I feel quite satisfied with what I have learnt in the process and I am extremely eager to keep exploring the use of Wikipedia for database curation. Of course this is just a try and our adopted solution for keeping up with current annotation may be something different in the end. However, it is worth a try.
Is Tide of Privacy War Turning Against Facebook?
May 14th, 2010 § 2 Comments
A series of articles in the NY Times and elsewhere is being written about people's growing disgust with Facebook’s privacy policies. Apparently, one has to navigate a bewildering tangle of options in order to configure the privacy settings of one's own profile. What is more, a group of 4 nerds in NYC has issued a call to arms against Facebook, promising to develop a social network called Diaspora* that will not require users to surrender their privacy so that it can be sold to third parties.
Whether this is going to be the beginning of a long battle for the holy grail of social network dominance or just another trifling spark against the giant remains to be seen. What is clear is that a social clamor for privacy protection is mounting. The proof is that Diaspora* has been raising funds for this new venture on Kickstarter and, with 18 days to go before this round closes, has already been pledged 1273% of the money initially asked for.
Perhaps there are now clear signs that cyberusers are getting tired of having rules imposed on them by the big monopoly, or perhaps they would simply like to see new blood providing more self-control options. It is clear, though, that the battle for personal privacy on the web continues and that the Tide of War might be turning against Facebook. What is not so clear is how all this will affect the end user.




