Manuel Corpas' Blog

Genomes, Internet, Bioethics and More…

A Script to Calculate GC content

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

sub calcgc {
 my $seq = $_[0];
 my @seqarray = split('',$seq);
 my $count = 0;
 foreach my $base (@seqarray) {
   $count++ if $base =~ /[G|C]/i;
 }
 my $len = $#seqarray+1;
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time:

sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
   my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

Now I do not run out of memory and the script promptly calculates the GC content for every sequence I have tried it with. I have not tried it with a whole genome though!

Happy coding!

Filed under: Bioinformatics, Tutorials , , , ,

A Simple Script to Remove Duplicate Emails

Basic Perl Programming

Suppose you have a list of emails compiled from different sources and want to get rid of duplicates. You google this and to your amazement (and mine), there is no free service available to paste your emails to return a list of unique addresses.

Here is a small simple script to eliminate email duplicates, in case this saves a few minutes (or hours!) to someone else.

This script is written in Perl and takes a list of email addresses (one per line), prunes it and returns a list with unique ones. You will need to have installed Perl in your computer (you can download it for free in Windows using ActivePerl); any flavors of UNIX or Mac OS should have Perl installed by default.

Once you have Perl installed, copy the following script to a file and save it as duplicate_email_remover.pl in a suitable directory:

#!perl
use strict;
my %hash;
while(my $line = <>) {
  chomp($line);
  $hash{$line}++;
}
map {print $_ ."\n"} keys %hash;

Also make sure that you have your email list (one email per line) saved in a file named emails.txt in the same directory where you saved duplicate_email_remover.pl.

Run the following command in your console (or command prompt if you use Windows):

perl duplicate_email_remover.pl < emails.txt

You should get printed to your console a list of emails with unique addresses.

Enjoy!

Filed under: Tutorials , , , , ,

Array CGH for Dummies

Array-CGH (Comparative Genomic Hybridasation) is becoming a common method used for analysis of patients’ genomes. Array-CGH works by taking a reference genome covering the whole human genome sequence, cutting it into thousands of pieces and orderly attach them to a chip. These pieces are called probes and are usually on the range of 500-2000 DNA bases long. A saliva or blood sample is then taken from the patient and its DNA is also chopped into thousands of pieces in suspension with a solvent. The array is then washed with the suspension containing the patient’s DNA.

DNA is a double chain of nucleotide bases where one chain complements the other. Knowing one chain of the DNA, it is possible to know the other chain. In its natural state, a single DNA chain will tend to bind to its complementary chain. Thus, by washing the patient’s suspension with the array probes will make the patient’s DNA pieces bind to its complementary DNA in the array.

Picture 4Array-CGH can be used to detect whether a patient has a region of the genome missing or duplicated. Probes attached to the chip emit a different color depending on their state of binding. Once the array is washed, most of probe spots will appear yellow, that is, all different probes of the reference genome are bound to the patient’s DNA. If a DNA region is missing in the patient, the complementary spots in the array appear in red. These changes appear in sequential order mapping to the reference genome missing in the patient. Depending on the genes that overlap to the deletion, different symptoms may appear in the patient.

The same happens if the array shows a series of green spots, indicating that a duplicated region of the genome has been found in the DNA of the patient. Because the gene content will be altered in the duplicated region, this may cause disease as a consequence of the over-expression of genes included in the duplicated regions.

Thus, using array techniques, we are now able to find deletions or duplications in the genome of a patient beyond the microscopic level, i.e. changes not directly observable. We are all familiar with the features of a patient with Down’s Syndrome. This syndrome is caused because there is an extra copy of chromosome 21 in the affected patient, due to a duplication of one of the two usual copies (Trisomy 21).

Most of the chromosomal deletions and duplications occur at the molecular level [1], not identifiable with microscopic techniques, as in the case of Down’s Syndrome. Up until recently most of the patients suspected of suffering from genomic diseases, i.e. diseases caused by pathogenic deletions or duplications, went undiagnosed because techniques did not allow detection beyond big chromosomal changes (like whole chromosomes). Techniques such as array-CGH now allow detection of chromosomal changes a thousand times smaller in length.

For a price of about £100 per array one can have one’s genome screened for chromosomal changes. In fact, it seems that most of the genetic changes between any two people (in terms of number of DNA bases) is dependent on the level of micro- deletions and duplications (called Copy Number Variations) [2], just the level we are now starting to handle with current analysis techniques. Next generation sequencing technologies are fast arriving that will allow the base-by-base complete sequencing of the DNA of people at price of $1000 in a short period of time [3].

[1] H.V. Firth, S.M. Richards, A.P. Bevan, S. Clayton, M. Corpas, D. Rajan, S. Van Vooren, Y. Moreau, R.M. Pettett, N.P. Carter (2009). DECIPHER: DatabasE of Chromosomal Imbalance and Phenotype using Ensembl Resources. The American Journal of Human Genetics.

[2] J. R. Lupski (2009). Genomic disorders ten years on.  Genome Medicine

[3] Mardis E.R. (2006). Anticipating the 1,000 dollar genome. Genome Biology


Filed under: Bioinformatics, Tutorials , , , , , , ,

Validating Chromosome Entered is Correct

When you have a web form and one of the fields to be entered is the Chromosome Number, you’d be wise to check that the user does not enter the wrong thing (e.g. ‘0′ , ‘X1′, ‘21.13′). Thus a validation check may save you some headaches.

I set out to find the appropriate regular expression in Javascript that returns an error when the chromosome entered is not 1-22 or X or Y. It turns out that for me it wasn’t that easy to solve this little problem. I searched in Google to see if anyone had posted the solution to the problem and found nothing meaningful.

Therefore I paste below the solution I wrote. I must admit that it is rather ugly the fact that I have to use two if expressions. I would appreciate if any  reader posted a more elegant solution. But for now, this code seems to work fine.

function notValidChr(field)
{
  // Field may start with a number or optionally two
  if ( field.value.match(/^\d\d?$/)
      // Is it an integer greated than 0?
      && (0 < parseInt(field.value)
            // Is it an integer smaller than 23?
            && parseInt(field.value) <23
          )
      )
  {
    // If all conditions above are met, notValidChr is not true
    return false;
  }
  // Check if it is a valid X or Y chromosome
  if (field.value.match(/^[XY]$/)) {
    // If so notValidChr is not true
    return false;
  }
  // Yes! the text entered is not a valid chromosome
  return true;
}

Filed under: Bioinformatics, Tutorials

Tips for Remapping from NCBI36 to NCBI37 Genome Assembly

It might seem for some people straight forward but I had to spend quite some time trying to understand how to remap my array probes from ncbi36 to ncbi37. If you use the Ensembl genome browser, you might have noticed that from July 2009 the ncbi37 assembly is now in use. For DECIPHER (the database I help develop), this is a little bit of a headache, because it means that all of the probes from array CGH that we used have to be remapped to the new assembly. If this does not interest you I recommend that you stop reading here.

First I learned that there is a program called liftOver by UCSC that is able to do this remapping. Since the amount of probes I have to map (around 6 million) is a number that I would not wish to through to anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary for me to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using debian and it sounds like a conservative option.

Once I downloaded the program, I needed to make it executable:

chmod u+x liftOver

OK, so I was in a position to run it:

./liftOver

In the usage information it appears that I need several arguments and files to be able to run this program correctly:

liftOver oldFile map.chain newFile unMapped

Now I learned that I need also to get a file called the map.chain. I was not sure what it meant. I learned that this map.chain file has parameters that are used by liftOver and that there are map.chain files depending on the remapping one wants to do. In my case, I want to remap from ncbi36 to ncbi37 in human. However, when I look at the different remappings, I do not see ncbi formats anywhere. I learned here that what I am looking for is map chain file that is called this:

hg18toHg19.over.chain

Apparently hg18 refers to ncbi36 and hg19 to ncbi37. Doing a google search I could find that file here.

Now I get quite a few options and learn that I need to have my probes in bed format to run liftOver. Apparently there are quite a few formats I can use according to UCSC FAQs formats. Here an example of what my bed file looks like (chromosome-tab-start_position-tab-end_position):

chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587

Now I am in a position to run liftOver. I notice now that in the usage one has the following description:

liftOver oldFile map.chain newFile unMapped

‘newFile’ and ‘unMapped’ are the names of the files where the output goes into and therefore are empty. This can be confusing as the user might think that these are some other kind of files one has to get hold of.

OK, so now I am ready to transform our old array probe mapping ncbi36 to the new ncbi37 one:

./liftOver probes.ncbi36 hg18toHg19.over.chain probes.ncbi37 unmapped-to-ncbi37

I got the following output to console:

Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453

…which is a bit worrying.

I’ve gone through my probes and found that some of them (just 44757!) had start point coordinates greater than their ends. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and re run liftOver again.

This time it worked.

Filed under: Bioinformatics, Tutorials , , , , , , ,

About this Blog

Written and maintained by Manuel Corpas, a Computational Biologist at the Wellcome Trust Sanger Institute (Cambridge, UK), a leading international center for genome research. Any views expressed here are the author's alone and do not necessarily form part of the official positions of his employer.
Subscribe to me on FriendFeed