A Script to Calculate GC content

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

#WRONG METHOD 1
sub calcgc {
 my $seq = $_[0];
 my @seqarray = split('',$seq);
 my $count = 0;
 foreach my $base (@seqarray) {
   $count++ if $base =~ /[G|C]/i;
 }
 my $len = $#seqarray+1;
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinsonit contains some bugs:

#WRONG METHOD 2
sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
   my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using \S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use \S{0,6}.”

I addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:

#METHOD 3
$count++ while $seq =~ /[CG]/gi;

He carried out a test simulation of #METHOD 3 for human chromosome 1 (247 million characters), which took 12 seconds with the same memory footprint as #METHOD 2, which took 111 seconds. Here is the source code for Andy’s simulation:

use strict;

sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
 my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

sub calcgc2 {
 my $seq = $_[0];
 my $count = 0;
 $count++ while ($seq =~ m/[GC]/gi);
 my $num = $count / length($seq);;
 my ($dec) = $num =~ /(\S{0,6})/;
 return $dec;
}

my $seq = "CBADEFGHIJ"x (100*1000*247);
my $time1 = time();
my $gc1 = calcgc($seq);
my $time2 = time();
print "Old method: '$gc1' in ".($time2-$time1)." seconds\n";
my $gc2 = calcgc2($seq);
my $time3 = time();
print "New method: '$gc2' in ".($time3-$time2)." seconds\n";

I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.

Happy coding!

Follow Manuel Corpas on Twitter

12 comments

  1. Glenn Proctor

    The Ensembl Slice.pm module has a method for calculating %GC content (and some other stuff) which uses tr// – given that this is a built-in Perl function I’m pretty sure it’s going to be faster and more memory efficient. It also uses Keith’s method of counting A,C,G and T and using that as the divisor rather than the length of the sequence.

  2. Keith Bradnam

    Dear Anon,

    I haven’t shown you my own GC calculating code, but you might be placated in knowing that it does indeed count lower case as well as upper case characters. The sequence data that our lab deals with does often include unknown (N) characters and sometimes other nucleotide ambiguity codes (R for purine etc). In these situations, using the length of the sequence in the GC calculation leads to an incorrect value.

    Regards,

    Keith

    1. manuelcorpas

      @Mahsa

      your line means match the first non-Space six-characters in variable $num and store them in $dec.

  3. Busybody

    the PERL function substr starts counting at zero

    so shouldn’t your script read

    for (my $i = 0;$i<$len-1; $i++) {
    my $base = substr $seq, $i, 1;
    $count++ if $base =~ /[G|C]/i;
    }

    instead

  4. Martijn van Iersel

    I think so. tr goes through the string in a single pass.

    With the substr method you make a copy of every character in the string and operate on that. Making copies is a little extra work for the processor.

    But other effects could come into play. You can’t now for sure unless you try it out and measure it.

  5. Martijn van Iersel

    You can do it even shorter using the tr/// operator. Tr’s main goal is to replace one character with another, but as a side effect it returns the number of replacements made.

    sub calcgc {

    my $seq = $_[0];
    my $count = 0;
    my $len = length($seq);
    $count = (tr/GC/GC/ =~ $seq);
    my $num=$count / $len;
    my ($dec)=$num =~ /(\S{6})/;
    return $dec;

    }

    1. Keith Bradnam

      From personal experience I would say that using transliteration is much faster than using substitution, but you might not notice the benefits if your sequences are short.

      A possible flaw with your code is that you don’t take account that there may be unspecified (‘N’) characters in the sequence. I.e. if a 100 nt sequence contains 10 Ns and 50 G+C nucleotides, then your code would calculate the GC content as 50% which is incorrect (though of course it might be true if you knew what those Ns are).

      It is safer to count A, C, G, T separately and then total up their counts to make the effective length for calculating GC%. In my GC subroutine I also count Ns and then ‘Other’ in order to capture any other use of nucleotide ambiguity codes.

    2. Anon

      @Keith

      It is unfortunate that your script doesn’t also count up the number of lower case a, c, g, and t, or even accented As/Cs/Ts & Gs (ÀÁÂÃÄÅĀĄĂÇĆČĈĊŤŢŦȚĜĞĠĢ) as this would also be “safer”.

      Perhaps you could implement some sort of near-key substitution, for example, if ‘q’,’w’,’s’,’x’ or ‘z’ appeared, you could assume they meant ‘a’ as it’s next to it on the keyboard.

    3. Chris Fields

      Transliteration tends to be faster, primarily b/c the transliteration table is built at compile time. However this limits it (no variable interp, no char sets).

Leave a Reply to Keith BradnamCancel reply

%d bloggers like this: