Manuel Corpas' Blog

Genomes, Internet, Bioethics and More…

I don’t want privacy to be dead

“Privacy is dead. Get over it”. People who have famously said this or something similar include Scott McNealy (CEO Sun Microsystems), Eric Schmidt (CEO Google) and Mark Zuckerberg (Founder of Facebook).

Some of the reasons that justify why not having privacy is OK include “if you do something that you want nobody to know, maybe you should not do it on the first place” or “people are ever more comfortable with sharing information about themselves”.

These arguments are all well and good. However, the other side of the coin is when personal information is used by strangers to take advantage of the person, let alone potential misinterpretations or simply gossip.

Is this interconnectedness worth exchanging for personal privacy? ‘Clever’ algorithms are constantly crawling the web in search of personal information. The degree to which these algorithms are more effective at spamming you is proportional to the amount of public information about you on the web.

Have you ever said anything or joined an internet group you would rather not join now? Unfortunately it is likely that this information will never disappear. Even if you delete it from your profile, it is probable that some web crawling algorithm has stored that information somewhere.

It has only been until recently when users have a tighter control over the information they make available to the web in Facebook for instance. Default settings are indeed terrifying in terms of the information of one’s profile made available to search engines.

Despite the possibility of being able to control how much information one makes public, this is not the end of the story: I have found situations where pieces of information in my profile were picked by a friend without me knowing it. True, it is specified in the settings how much you want to make public. Nevertheless, even though I have unselected information from my public profile, who knows who would have looked at it.

Not that I have anything to hide but I would rather keep quiet about my personal interests rather than sharing them widely. Surely a better knowledge of my personal profile could facilitate the way to finding more easily passwords and even breaking into my bank account.

Have we now reached a point of no return in privacy?

Probably.

Filed under: Facebook, Technology , , , ,

Why Should I Care About RNA?

I keep bumping into talks, articles and stuff related to the Bioinformatics of RNAs. It certainly looks appealing. I am not sure though what the fuss is about or whether it is as exciting as the DNA field. The promise of the Next Gen Sequencing and its impact in Science, Policy and Society are expected to be significant in the near future. Why should I care about RNAs though?

Here is a list of questions that I throw into the wild hoping to attract the attention of any experts in the field.

  1. Does any recent RNA breakthrough have the same potential to transform Society as DNA research is doing?
  2. Is RNA Bioinformatics an area that is appealing because not a lot of people are working on it? Or has it been fuelled by a new technique that is revolutionising it?
  3. Is the rate of accumulation of RNA data as steep as its DNA cousin?
  4. What are RNA research trends to watch out for?

The debate is open.

Post a comment if you want to say anything.

Filed under: Bioinformatics, Biology , , , ,

A Script to Calculate GC content

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

sub calcgc {
 my $seq = $_[0];
 my @seqarray = split('',$seq);
 my $count = 0;
 foreach my $base (@seqarray) {
   $count++ if $base =~ /[G|C]/i;
 }
 my $len = $#seqarray+1;
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time:

sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
   my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

Now I do not run out of memory and the script promptly calculates the GC content for every sequence I have tried it with. I have not tried it with a whole genome though!

Happy coding!

Filed under: Bioinformatics, Tutorials , , , ,

A Simple Script to Remove Duplicate Emails

Basic Perl Programming

Suppose you have a list of emails compiled from different sources and want to get rid of duplicates. You google this and to your amazement (and mine), there is no free service available to paste your emails to return a list of unique addresses.

Here is a small simple script to eliminate email duplicates, in case this saves a few minutes (or hours!) to someone else.

This script is written in Perl and takes a list of email addresses (one per line), prunes it and returns a list with unique ones. You will need to have installed Perl in your computer (you can download it for free in Windows using ActivePerl); any flavors of UNIX or Mac OS should have Perl installed by default.

Once you have Perl installed, copy the following script to a file and save it as duplicate_email_remover.pl in a suitable directory:

#!perl
use strict;
my %hash;
while(my $line = <>) {
  chomp($line);
  $hash{$line}++;
}
map {print $_ ."\n"} keys %hash;

Also make sure that you have your email list (one email per line) saved in a file named emails.txt in the same directory where you saved duplicate_email_remover.pl.

Run the following command in your console (or command prompt if you use Windows):

perl duplicate_email_remover.pl < emails.txt

You should get printed to your console a list of emails with unique addresses.

Enjoy!

Filed under: Tutorials , , , , ,

Array CGH for Dummies

Array-CGH (Comparative Genomic Hybridasation) is becoming a common method used for analysis of patients’ genomes. Array-CGH works by taking a reference genome covering the whole human genome sequence, cutting it into thousands of pieces and orderly attach them to a chip. These pieces are called probes and are usually on the range of 500-2000 DNA bases long. A saliva or blood sample is then taken from the patient and its DNA is also chopped into thousands of pieces in suspension with a solvent. The array is then washed with the suspension containing the patient’s DNA.

DNA is a double chain of nucleotide bases where one chain complements the other. Knowing one chain of the DNA, it is possible to know the other chain. In its natural state, a single DNA chain will tend to bind to its complementary chain. Thus, by washing the patient’s suspension with the array probes will make the patient’s DNA pieces bind to its complementary DNA in the array.

Picture 4Array-CGH can be used to detect whether a patient has a region of the genome missing or duplicated. Probes attached to the chip emit a different color depending on their state of binding. Once the array is washed, most of probe spots will appear yellow, that is, all different probes of the reference genome are bound to the patient’s DNA. If a DNA region is missing in the patient, the complementary spots in the array appear in red. These changes appear in sequential order mapping to the reference genome missing in the patient. Depending on the genes that overlap to the deletion, different symptoms may appear in the patient.

The same happens if the array shows a series of green spots, indicating that a duplicated region of the genome has been found in the DNA of the patient. Because the gene content will be altered in the duplicated region, this may cause disease as a consequence of the over-expression of genes included in the duplicated regions.

Thus, using array techniques, we are now able to find deletions or duplications in the genome of a patient beyond the microscopic level, i.e. changes not directly observable. We are all familiar with the features of a patient with Down’s Syndrome. This syndrome is caused because there is an extra copy of chromosome 21 in the affected patient, due to a duplication of one of the two usual copies (Trisomy 21).

Most of the chromosomal deletions and duplications occur at the molecular level [1], not identifiable with microscopic techniques, as in the case of Down’s Syndrome. Up until recently most of the patients suspected of suffering from genomic diseases, i.e. diseases caused by pathogenic deletions or duplications, went undiagnosed because techniques did not allow detection beyond big chromosomal changes (like whole chromosomes). Techniques such as array-CGH now allow detection of chromosomal changes a thousand times smaller in length.

For a price of about £100 per array one can have one’s genome screened for chromosomal changes. In fact, it seems that most of the genetic changes between any two people (in terms of number of DNA bases) is dependent on the level of micro- deletions and duplications (called Copy Number Variations) [2], just the level we are now starting to handle with current analysis techniques. Next generation sequencing technologies are fast arriving that will allow the base-by-base complete sequencing of the DNA of people at price of $1000 in a short period of time [3].

[1] H.V. Firth, S.M. Richards, A.P. Bevan, S. Clayton, M. Corpas, D. Rajan, S. Van Vooren, Y. Moreau, R.M. Pettett, N.P. Carter (2009). DECIPHER: DatabasE of Chromosomal Imbalance and Phenotype using Ensembl Resources. The American Journal of Human Genetics.

[2] J. R. Lupski (2009). Genomic disorders ten years on.  Genome Medicine

[3] Mardis E.R. (2006). Anticipating the 1,000 dollar genome. Genome Biology


Filed under: Bioinformatics, Tutorials , , , , , , ,

Validating Chromosome Entered is Correct

When you have a web form and one of the fields to be entered is the Chromosome Number, you’d be wise to check that the user does not enter the wrong thing (e.g. ‘0′ , ‘X1′, ‘21.13′). Thus a validation check may save you some headaches.

I set out to find the appropriate regular expression in Javascript that returns an error when the chromosome entered is not 1-22 or X or Y. It turns out that for me it wasn’t that easy to solve this little problem. I searched in Google to see if anyone had posted the solution to the problem and found nothing meaningful.

Therefore I paste below the solution I wrote. I must admit that it is rather ugly the fact that I have to use two if expressions. I would appreciate if any  reader posted a more elegant solution. But for now, this code seems to work fine.

function notValidChr(field)
{
  // Field may start with a number or optionally two
  if ( field.value.match(/^\d\d?$/)
      // Is it an integer greated than 0?
      && (0 < parseInt(field.value)
            // Is it an integer smaller than 23?
            && parseInt(field.value) <23
          )
      )
  {
    // If all conditions above are met, notValidChr is not true
    return false;
  }
  // Check if it is a valid X or Y chromosome
  if (field.value.match(/^[XY]$/)) {
    // If so notValidChr is not true
    return false;
  }
  // Yes! the text entered is not a valid chromosome
  return true;
}

Filed under: Bioinformatics, Tutorials

Tips for Remapping from NCBI36 to NCBI37 Genome Assembly

It might seem for some people straight forward but I had to spend quite some time trying to understand how to remap my array probes from ncbi36 to ncbi37. If you use the Ensembl genome browser, you might have noticed that from July 2009 the ncbi37 assembly is now in use. For DECIPHER (the database I help develop), this is a little bit of a headache, because it means that all of the probes from array CGH that we used have to be remapped to the new assembly. If this does not interest you I recommend that you stop reading here.

First I learned that there is a program called liftOver by UCSC that is able to do this remapping. Since the amount of probes I have to map (around 6 million) is a number that I would not wish to through to anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary for me to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using debian and it sounds like a conservative option.

Once I downloaded the program, I needed to make it executable:

chmod u+x liftOver

OK, so I was in a position to run it:

./liftOver

In the usage information it appears that I need several arguments and files to be able to run this program correctly:

liftOver oldFile map.chain newFile unMapped

Now I learned that I need also to get a file called the map.chain. I was not sure what it meant. I learned that this map.chain file has parameters that are used by liftOver and that there are map.chain files depending on the remapping one wants to do. In my case, I want to remap from ncbi36 to ncbi37 in human. However, when I look at the different remappings, I do not see ncbi formats anywhere. I learned here that what I am looking for is map chain file that is called this:

hg18toHg19.over.chain

Apparently hg18 refers to ncbi36 and hg19 to ncbi37. Doing a google search I could find that file here.

Now I get quite a few options and learn that I need to have my probes in bed format to run liftOver. Apparently there are quite a few formats I can use according to UCSC FAQs formats. Here an example of what my bed file looks like (chromosome-tab-start_position-tab-end_position):

chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587

Now I am in a position to run liftOver. I notice now that in the usage one has the following description:

liftOver oldFile map.chain newFile unMapped

‘newFile’ and ‘unMapped’ are the names of the files where the output goes into and therefore are empty. This can be confusing as the user might think that these are some other kind of files one has to get hold of.

OK, so now I am ready to transform our old array probe mapping ncbi36 to the new ncbi37 one:

./liftOver probes.ncbi36 hg18toHg19.over.chain probes.ncbi37 unmapped-to-ncbi37

I got the following output to console:

Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453

…which is a bit worrying.

I’ve gone through my probes and found that some of them (just 44757!) had start point coordinates greater than their ends. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and re run liftOver again.

This time it worked.

Filed under: Bioinformatics, Tutorials , , , , , , ,

Of Hackers and Painters

Paul Graham’s Hackers and Painters: Big Ideas from the Computer Age is a book that tells a lot about the life of the author himself. According to Graham, nerds do not spend as much time thinking about how to be popular. He calls himself a nerd and tells how he was in the ‘D’ popularity group at school, ‘E’ being the least popular group of all. This in part implies that hacker traits are implicit from young age. The author has the very interesting background of having studied painting in Florence after finishing his PhD in Computer Science at Harvard. He explains his experience creating his own startup company, called Viaweb, which ended up being bought by Yahoo, becoming a very successful online store tool.

The author thinks that the reason why Viaweb managed to become the leading company in their sector was due to providing their services via the web itself and using the Lisp computer language. According him, being a web software company has great advantages. Web companies are able to add features incrementally or fixing bugs without having to ship the application back to the user every time: such small changes make releases more manageable and bugs can be reproduced exactly without having to guess the user’s environment and configuration. Web applications are much easier to run, you just need a web browser. Traditional standalone applications require the user to know system administration (getting the right configuration) but in web applications the system administration burden is transferred to the developers instead.

Coding in Lisp apparently was a definitive advantage to keep abreast with competition. He argues strongly in favor of Lisp as the most powerful language due to its ability to provide greater capability for abstraction than any other language. Lisp was initially developed as a mathematical formalism and then made into language. Other computer languages tend to be based on assembler specifications, approaching concepts from the hardware point of view rather than mathematics. This argument made me think seriously about learning at least the fundamentals of Lisp to discover new ways of coding abstractions. I found in Paul Graham’s own website an introductory tutorial for Lisp. I look forward to my next trip to this new mindset.

Filed under: Technology , , , , ,

Computational Bioethics: Leveraging Personalized Medicine Ethically

Computational Bioethics relates to the appropriateness of use, management of access and discovery of biological insights applied to patient health. A simple search in Google shows that this concept has only been used in the past at a workshop session in the Rocky Bioinformatics conference in 2008. In the resulting publication of this meeting, Computational Bioethics is said to encompass ethical issues that are unique to computational scientists compared with bench and clinical scientists. In addition, it is argued that as medical informatics, computational biology and bioinformatics become more widely used in medical research, these issues will become even more relevant in the future.

And the future is now become a present reality. The rise of next generation sequencing technology, together with the adoption of microarray techniques in normal clinical practice is revolutionizing the way patient’s data is handled. Similarly, the amount of data created is surpassing most software tools available. Projects like the 1000 Genomes are generating an enormous pool of variability of genome data and derived sources of information. This makes necessary the exchange of data between research centers to be able to establish what constitutes normal and pathogenic variation in human beings.

Personalized medicine, a widely used term since the near completion of the Human Genome Project, implies the application and use of personal genome data for the determining genomic changes leading to disease, risk factors and patient’s susceptibilities. The problem with this sort of data is that it may yield information that could severely change the life style of the person, affect relatives and, in many cases, with little ability to do anything to mitigate the effects of findings. As our current knowledge is so fragmented and rudimentary, it is expected that collected information with no current value may be relevant in the future, influencing analyses and results.

Some ethical practices currently implemented in Computational Bioethics include the creation of consent forms that allow putting anonymized patient information in genome databases. This information may be accessed by the scientific community and compared with other previously consented patient data for the characterization of new syndromes, the effect of knockout genes in the general population and the understanding of genome variation in normal individuals.

Filed under: Computational Bioethics, Technology

How to be a Biohacker

Biohackers embrace fully the philosophy of hackers: love for freedom, veneration of competence and utter curiosity for how things work. How does one become a biohacker? Usually biohackers cannot tell if they are really one of them until someone else says so. However, it is not enough to be competent in the mastery of programming or being a computer wiz. You need IT skills that suit computational biology research and familiarity with the biology itself, which in the end is the problem one has to solve.

A big attitude to the biohacker philosophy is that you do not only need love to solve technical problems for their own sake; you need to think of living organisms as an extension of the information systems you work with. Biological concepts may be then abstracted into objects whose hierarchical organization reflect the different levels of order in living things. Computer languages thus become the perfect analogy for understanding the complex information flows in living systems.

True to hackerdom culture, Unix, Perl and MySQL are programming skills that you need to master (I can think of people who would also say Java, Javascript, CSS, etc.). The best way to master the art of programming is to spend as much time as possible reading and writing source code. Some people think Perl is doomed. This is not true in the biohackers world. In part due to legacy and in part to the flexibility it provides, Perl is still the language of choice for many biohackers. Perl is used to construct 1) the back end of web applications, 2) pipelines and workflows and 3) quick and dirty scripts for parsing and calling other programs.

You will also need to be familiar with projects like R and Bioconductor, since a lot of the work will involve providing the computational infrastructure for analyzing data. In addition, you’ll need to know about data formats (fasta, sbml, mmcif…), software toolkits and libraries (Paup, Phylip, EMBOSS, BioPerl…), databases (Ensembl, InterPro, PDB, KEGG…), webservers and portals (Pubmed, ISCB).

Finally keep in mind best practices. Some of them I wrote about in a previous post (like refraining from reinventing the wheel), but above all, give yourself the time to enjoy the learning process. Getting to the top usually takes longer than staying at the top; so what’s the point if you haven’t enjoyed the trip?

Filed under: Bioinformatics , , , , ,

About this Blog

Written and maintained by Manuel Corpas, a Computational Biologist at the Wellcome Trust Sanger Institute (Cambridge, UK), a leading international center for genome research. Any views expressed here are the author's alone and do not necessarily form part of the official positions of his employer.
Subscribe to me on FriendFeed