Intermediate Perl
GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.
The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:
#WRONG METHOD 1
sub calcgc {
my $seq = $_[0];
my @seqarray = split('',$seq);
my $count = 0;
foreach my $base (@seqarray) {
$count++ if $base =~ /[G|C]/i;
}
my $len = $#seqarray+1;
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.
I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinson, it contains some bugs:
#WRONG METHOD 2
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
for (my $i = 1;$i<$len+1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using \S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use \S{0,6}.”
I addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:
#METHOD 3 $count++ while $seq =~ /[CG]/gi;
He carried out a test simulation of #METHOD 3 for human chromosome 1 (247 million characters), which took 12 seconds with the same memory footprint as #METHOD 2, which took 111 seconds. Here is the source code for Andy’s simulation:
use strict;
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
for (my $i = 1;$i<$len+1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
sub calcgc2 {
my $seq = $_[0];
my $count = 0;
$count++ while ($seq =~ m/[GC]/gi);
my $num = $count / length($seq);;
my ($dec) = $num =~ /(\S{0,6})/;
return $dec;
}
my $seq = "CBADEFGHIJ"x (100*1000*247);
my $time1 = time();
my $gc1 = calcgc($seq);
my $time2 = time();
print "Old method: '$gc1' in ".($time2-$time1)." seconds\n";
my $gc2 = calcgc2($seq);
my $time3 = time();
print "New method: '$gc2' in ".($time3-$time2)." seconds\n";
I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.
Happy coding!


















































Leave a comment