Intermediate Perl
GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.
The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:
sub calcgc {
my $seq = $_[0];
my @seqarray = split('',$seq);
my $count = 0;
foreach my $base (@seqarray) {
$count++ if $base =~ /[G|C]/i;
}
my $len = $#seqarray+1;
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.
I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time:
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
for (my $i = 1;$i<$len+1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
Now I do not run out of memory and the script promptly calculates the GC content for every sequence I have tried it with. I have not tried it with a whole genome though!
Happy coding!
Filed under: Bioinformatics, Tutorials , array, GC content, perl, substr
guys, what does this line mean?
my ($dec)=$num =~ /(\S{6})/;
thanks in advance for your help.
@Mahsa
your line means match the first non-Space six-characters in variable $num and store them in $dec.
the PERL function substr starts counting at zero
so shouldn’t your script read
for (my $i = 0;$i<$len-1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
instead
I think so. tr goes through the string in a single pass.
With the substr method you make a copy of every character in the string and operate on that. Making copies is a little extra work for the processor.
But other effects could come into play. You can’t now for sure unless you try it out and measure it.
You can do it even shorter using the tr/// operator. Tr’s main goal is to replace one character with another, but as a side effect it returns the number of replacements made.
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
$count = (tr/GC/GC/ =~ $seq);
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
Thanks Martijn,
is this a faster method in your opinion?
Manuel
From personal experience I would say that using transliteration is much faster than using substitution, but you might not notice the benefits if your sequences are short.
A possible flaw with your code is that you don’t take account that there may be unspecified (‘N’) characters in the sequence. I.e. if a 100 nt sequence contains 10 Ns and 50 G+C nucleotides, then your code would calculate the GC content as 50% which is incorrect (though of course it might be true if you knew what those Ns are).
It is safer to count A, C, G, T separately and then total up their counts to make the effective length for calculating GC%. In my GC subroutine I also count Ns and then ‘Other’ in order to capture any other use of nucleotide ambiguity codes.