A Script to Calculate GC content

Feb 3, 2010

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

#WRONG METHOD 1
sub calcgc {
 my $seq = $_[0];
 my @seqarray = split('',$seq);
 my $count = 0;
 foreach my $base (@seqarray) {
   $count++ if $base =~ /[G|C]/i;
 }
 my $len = $#seqarray+1;
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinson, it contains some bugs:

#WRONG METHOD 2
sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
   my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using \S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use \S{0,6}.”

I addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:

#METHOD 3
$count++ while $seq =~ /[CG]/gi;

He carried out a test simulation of #METHOD 3 for human chromosome 1 (247 million characters), which took 12 seconds with the same memory footprint as #METHOD 2, which took 111 seconds. Here is the source code for Andy’s simulation:

use strict;

sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
 my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

sub calcgc2 {
 my $seq = $_[0];
 my $count = 0;
 $count++ while ($seq =~ m/[GC]/gi);
 my $num = $count / length($seq);;
 my ($dec) = $num =~ /(\S{0,6})/;
 return $dec;
}

my $seq = "CBADEFGHIJ"x (100*1000*247);
my $time1 = time();
my $gc1 = calcgc($seq);
my $time2 = time();
print "Old method: '$gc1' in ".($time2-$time1)." seconds\n";
my $gc2 = calcgc2($seq);
my $time3 = time();
print "New method: '$gc2' in ".($time3-$time2)." seconds\n";

I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.

Happy coding!

Follow Manuel Corpas on Twitter

Podcast also available on PocketCasts, SoundCloud, Spotify, Google Podcasts, Apple Podcasts, and RSS.

12 responses

Glenn Proctor

April 9, 2010 at 8:43 am

The Ensembl Slice.pm module has a method for calculating %GC content (and some other stuff) which uses tr// – given that this is a built-in Perl function I’m pretty sure it’s going to be faster and more memory efficient. It also uses Keith’s method of counting A,C,G and T and using that as the divisor rather than the length of the sequence.

LikeLike

Reply
Nona

April 8, 2010 at 8:12 pm

@Anon

That’s awesome, I didn’t know some many accented characters even existed. Thanks for the tip!

LikeLike

Reply
Keith Bradnam

April 8, 2010 at 5:32 pm

Dear Anon,

I haven’t shown you my own GC calculating code, but you might be placated in knowing that it does indeed count lower case as well as upper case characters. The sequence data that our lab deals with does often include unknown (N) characters and sometimes other nucleotide ambiguity codes (R for purine etc). In these situations, using the length of the sequence in the GC calculation leads to an incorrect value.

Regards,

Keith

LikeLike

Reply
Mahsa

March 6, 2010 at 12:09 pm

guys, what does this line mean?
my ($dec)=$num =~ /(\S{6})/;

thanks in advance for your help.

LikeLike

Reply
1. manuelcorpas
  
  March 8, 2010 at 10:53 am
  
  @Mahsa
  
  your line means match the first non-Space six-characters in variable $num and store them in $dec.
  
  LikeLike
Busybody

February 25, 2010 at 6:53 am

the PERL function substr starts counting at zero

so shouldn’t your script read

for (my $i = 0;$i<$len-1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}

instead

LikeLike

Reply
Martijn van Iersel

February 7, 2010 at 10:10 am

I think so. tr goes through the string in a single pass.

With the substr method you make a copy of every character in the string and operate on that. Making copies is a little extra work for the processor.

But other effects could come into play. You can’t now for sure unless you try it out and measure it.

LikeLike

Reply
Martijn van Iersel

February 5, 2010 at 1:27 pm

You can do it even shorter using the tr/// operator. Tr’s main goal is to replace one character with another, but as a side effect it returns the number of replacements made.

sub calcgc {

my $seq = $_[0];
my $count = 0;
my $len = length($seq);
$count = (tr/GC/GC/ =~ $seq);
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;

}

LikeLike

Reply
1. manuelcorpas
  
  February 5, 2010 at 1:38 pm
  
  Thanks Martijn,
  
  is this a faster method in your opinion?
  
  Manuel
  
  LikeLike
2. Keith Bradnam
  
  March 17, 2010 at 8:38 pm
  
  From personal experience I would say that using transliteration is much faster than using substitution, but you might not notice the benefits if your sequences are short.
  
  A possible flaw with your code is that you don’t take account that there may be unspecified (‘N’) characters in the sequence. I.e. if a 100 nt sequence contains 10 Ns and 50 G+C nucleotides, then your code would calculate the GC content as 50% which is incorrect (though of course it might be true if you knew what those Ns are).
  
  It is safer to count A, C, G, T separately and then total up their counts to make the effective length for calculating GC%. In my GC subroutine I also count Ns and then ‘Other’ in order to capture any other use of nucleotide ambiguity codes.
  
  LikeLike
3. Anon
  
  April 7, 2010 at 7:25 pm
  
  @Keith
  
  It is unfortunate that your script doesn’t also count up the number of lower case a, c, g, and t, or even accented As/Cs/Ts & Gs (ÀÁÂÃÄÅĀĄĂÇĆČĈĊŤŢŦȚĜĞĠĢ) as this would also be “safer”.
  
  Perhaps you could implement some sort of near-key substitution, for example, if ‘q’,’w’,’s’,’x’ or ‘z’ appeared, you could assume they meant ‘a’ as it’s next to it on the keyboard.
  
  LikeLike
4. Chris Fields
  
  April 8, 2010 at 6:01 pm
  
  Transliteration tends to be faster, primarily b/c the transliteration table is built at compile time. However this limits it (no variable interp, no char sets).
  
  LikeLike