Converting FASTQ to FASTA
May 21st, 2012 § 2 Comments
A little Perl one liner I borrowed from The Edwards Lab that converts FASTQ to FASTA. Please note I had to truncate the line to make it show properly in this blog entry.
$ cat file_to_covert.fq | perl -e \
'$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' \
> output.fasta
Thanks Edwards Lab!
Converting Genes and Genomic Features From NCBI36 to GRCh37
January 10th, 2012 § 2 Comments
The Human Genome is a like map where features and genes are mapped to. As techniques improve, our fine-grained resolution for that map increases and new versions are released every few years. When a new coordinate reference map (or assembly) for the Human Genome is released, it produces lots of headaches for those who work in the field as it means that the locations of genes, chromosomal bands and other features like Single Nucleotide Polymorphisms (SNPs) or Copy Number Variation (CNVs) change.
In order to have the most up-to-date version for the Human Genome set of genes and features sometimes it is necessary to convert from one assembly to another. In the past I have written a tutorial on how to remap from NCBI36 to GRCh37 human assemblies using liftOver. In this tutorial I present a simple step-by-step guide for feature remapping using NCBI’s remapping tool.
Important:
Please make sure you know in advance the assembly to which your aberration data is currently mapped to. If by mistake you remap an aberration already in GRCh37 to GRCh37 you will get new coordinates for the region mapped to the wrong coordinates.
The NCBI provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their genome remapping service do the following:
- Make sure that your data is in BED format, e.g. “chr3 100000 999990 myId0000123” -> CNV aberration in NCBI36/hg18
- Please note that each field is separated by a tab and each line by a character return. Please follow this strictly or the remapping tool may throw an error.
- Add as many lines as aberrations you would like to remap
- Go to the NCBI Remap page:
- Select “Organism for source data” Homo Sapiens, “Source Assembly” NCBI36 (hg18) and “Target Assembly” GRCh37 (hg19)
- Please leave all “Remapping Options” (Minimum ratio of bases that must remap, etc) with default values
- Select for “Input format” BED, “Output format” Same as input
- Paste your aberration in the input box where it says “Paste data here” and hit submit at the bottom of the page
- Wait until results are returned
- To retrieve results download “Mapping Report”, which is in excel format or alternatively Mapping report Sample in the results page
Please note that your aberration may remap to more than one location. I recommend that you manually check the coordinates and select the most appropriate of the doubly remapped aberration in the new assembly. Please also note that your aberration may not remap because the region is partially or entirely deleted in the new assembly or split in GRCh37. In this case I recommend that you use another start or end point position, maybe use the start/end of alternative probes until you find a region where it maps.
Another possibility could be to look at the genes for the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions requires careful deliberation and may not be applicable to your particular case (e.g. genes in different chromosomes would not allow remapping based on genes).
Beware of Gene Names in Excel
November 5th, 2011 § 4 Comments
For the past few days I have been trying to compile the list of gene names that is the most complete possible. To start with, I was given an initial list of genes in an excel file that was taken from the HUGO Gene Nomenclature Committee (HGNC). Unfortunately, the gene names were pasted from the original source (HGNC) to an Excel spreadsheet without modifying the expected format of the column cells. This led to Excel trying to “help” with the formatting of the value inserted, changing those gene names that are similar to dates to an actual date. In the bioinformatics field, misnaming a gene can lead to disastrous consequences such as misdiagnosis of a causal gene in a clinical setting. Thus:
Beware of pasting gene names in an Excel spreadsheet with a default format, as these may be changed into dates.
From my current list of 19,026 genes that I have compiled as of now, here are the names of the genes that have been automatically changed by Excel into dates. In the table below, the first column denotes the date the gene name is changed to, the middle column the Ensembl ID of the gene and the right column the actual name that was changed by Excel into a date.
Sep-01 ENSG00000180096 SEPT1 Sep-02 ENSG00000168385 SEPT2 Sep-03 ENSG00000100167 SEPT3 Sep-04 ENSG00000108387 SEPT4 Sep-05 ENSG00000184702 SEPT5 Sep-06 ENSG00000125354 SEPT6 Sep-07 ENSG00000122545 SEPT7 Sep-08 ENSG00000164402 SEPT8 Sep-09 ENSG00000184640 SEPT9 Sep-10 ENSG00000186522 SEPT10 Sep-11 ENSG00000138758 SEPT11 Sep-12 ENSG00000140623 SEPT12 Sep-14 ENSG00000154997 SEPT14 Mar-01 ENSG00000145416 MARCH1 Mar-02 ENSG00000099785 MARCH2 Mar-03 ENSG00000173926 MARCH3 Mar-04 ENSG00000144583 MARCH4 Mar-05 ENSG00000198060 MARCH5 Mar-06 ENSG00000145495 MARCH6 Mar-07 ENSG00000136536 MARCH7 Mar-08 ENSG00000165406 MARCH8 Mar-09 ENSG00000139266 MARCH9 Mar-10 ENSG00000173838 MARCH10 Mar-11 ENSG00000183654 MARCH11 Dec-01 ENSG00000173077 DEC1
Remapping from NCBI36/hg18 to GRCh37/hg19
February 2nd, 2011 § 2 Comments
Given the huge response I have at work about remapping features into another assembly, I present here an adapted version for how to remap a feature from NCBI36/hg18 to GRCh37/hg19 using UCSC’s liftOver tool.
Important:
Please make sure you know in advance the assembly to which your aberration data is currently mapped to. If by mistake you remap an aberration already in GRCh37 to GRCh37 you will get new coordinates for the region mapped to the wrong coordinates.
UCSC’s Genome Browser provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their liftOver tool do the following:
- Make sure that your data is in BED format, e.g. “chr3 100000 999990 myPatientId0000123” –> aberration in NCBI36/hg18
- Note that each field is separated by a tab and each line by a character return. Please follow this strictly or the remapping tool may throw an error.
- Add as many lines as aberrations you would like to remap.
- Go to the liftOver page
- Select “Original Assembly” Mar. 2006 (NCBI36/hg18) and “New Assembly” Feb. 2009 (GRCh37/hg19)
- Leave all other parameters (Minimum ratio of bases that must remap, etc) with default values
- Paste your aberration in the input box where it says “Paste in data” and hit submit
- To get results, scroll down the page and click on the “View Conversions” link.
- Here is the result I get:
chr3 125000 1024990 myPatientId0000123
Please note that your feature may not remap because the region is partially or entirely deleted in the new assembly or split in GRCh37. In this case I recommend that you use another start or end point position, maybe use the start/end of alternative probes until you find a region where it maps. Another possibility would be to look at the genes for the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions require careful deliberation and may not be applicable to your particular case (e.g. genes in different chromosomes would not allow remapping based on genes).
I hope this is helpful.
Latest Sanger’s Public Engagement Video
January 26th, 2011 § Leave a Comment
Here is a very entertaining video that shows in images some of the fascinating science happening within the Wellcome Trust Sanger Institute‘s walls. In the story a group of high school students come to visit and learn science via its Public Engagement Programme.
I consider myself very fortunate of being a contributor to these efforts, whose main objective is to inspire the younger generations and bring them closer to the biomedical sciences. There is no better way of maintaining one’s own motivation to do science than to inspire others, specially the would-be scientists.
The following quote describing the role of the Public Engagement Program at Sanger is taken from its own website. I think it explains very well their mission and importance as a way of raising awareness and securing future funding for this kind of research.
The role of the Wellcome Trust Sanger Institute’s Communication and Public Engagement programme is to promote understanding of the nature, discoveries and wonder of science and its implications for individuals and society.
Sending Sensitive Data Encrypted
July 8th, 2010 § Leave a Comment
The other day I was asked to find a way to send sensitive clinical data to another institute. How to make sure that the data is protected and only acessible to the right people? There are two aspects of protecting data, reflecting the different risks which the data may be exposed to:
- data in transit (email “in flight”, web or FTP downloads, data sets on USB disks shipped by FedEx, etc)
- data at rest (email arrived in recipient’s inbox, data copied to collaborator’s working disk, etc)
Here we will only explore the requirements for encrypting data in transit. The security of the data at rest is assumed to be taken care of by the collaborator or their IT staff, since it is outside one’s control.
There are various possible file transfer methods:
- email – suitable for small files (typically up to 5MB although different sites impose different limits); no automatic encryption in transit
- FTP or non-SSL password-protected web site – suitable for large files (in the GB range); no automatic encryption in transit
- scp – suitable for large files; intrinsic encryption in transit; likely to encounter firewall issues
- password-protected SSL web site – suitable for large files; intrinsic encryption in transit
- USB disk – suitable for very large data sets (TB range); no automatic encryption in transit
When encryption is mandated (e.g. by a data access agreement) and the file transfer method does not provide encryption intrinsically, it is necessary to encrypt the data separately and transfer the encrypted file by the chosen method.
For ad-hoc or one-off data encryption, it is appropriate to encrypt a data set with a password (“symmetric encryption”, because the same password is used to encrypt and decrypt) which will be sent to the recipient by a separate means to the actual data. For example, if the data is shipped on a USB disk, the password could be sent by email, or given over the phone. Sending the password with the encrypted data defeats the object of encrypting it!
For regular or scheduled data transfers, public-key encryption may be suitable – and removes the need to send a password – but that will not be explored here due to the extra work in creating and managing keys.
A suitable encryption tool on Linux systems is gpg (the GNU Privacy Guard). The simplest usage is to prepare a single file containing the data in question using tar or zip, and then to encrypt that:
$ gpg -c bigfile.tar gpg: gpg-agent is not available in this session Enter passphrase: Repeat passphrase:$ ls bigfile.tar* bigfile.tar bigfile.tar.gpgAt this point, "bigfile.tar.gpg" is the encrypted file which is safe to transfer by email, FTP, or any other non-encrypted method. Note that the passphrase is not displayed while it is being entered; and that the encrypted file is typically smaller than the original due to compression in the encryption process. However it is necessary to have enough disk space to contain both the original and the encrypted data simultaneously, which may make this approach unsuitable for very large (TB) datasets.
The passphrase should be chosen with the same care as a computer login password. The Linux utility "pwgen" produces a selection of random passwords which may be useful in selecting a suitable passphrase.
The recipient will decrypt the file in a similar way:
$ gpg bigfile.tar.gpg gpg: CAST5 encrypted data gpg: gpg-agent is not available in this session Enter passphrase: gpg: encrypted with 1 passphrase gpg: WARNING: message was not integrity protectedNote that if the passphrase is lost then it is vanishingly unlikely that the encrypted data can be recovered. Unless the passphrase is easily guessable, the encryption is sufficiently strong as to defeat most attempts to break it.
Written by Dr David Holland (WTSI), adapted by Manuel Corpas. Posted with Dr Holland's permission.
Biomedical Community-Wide Annotation Using Wikipedia
June 3rd, 2010 § 9 Comments
The pace of data generation is leaving far behind our ability to convert this data into usable knowledge. Even well funded biomedical databases find it increasingly difficult to keep up to speed. In order to tackle this problem, some databases have opted for increasing automation in the way data is deposited, reducing the time needed for interpreting results. The problem with this approach is that generated knowledge as a result is less accurate than manually annotated entries and of lower quality. Another potential solution has been to engage leading experts, creating a sort of consortium where they give some of their time to curate data entries that match their specialties. Unfortunately, engaging world experts in curating biomedical resources has not had a lot of success, with a few contributing a lot and many hardly ever dedicating any time to curation no matter how much they were fetched.
A new revolutionary idea has come from Alex Bateman‘s group to engage not just the community of experts but the whole of the Internet, using Wikipedia. One of his group’s databases, Rfam, which characterises RNA families, is now providing all of its annotation via Wikipedia. Wikipedia is already the leader reference resource for all kinds of information. It possesses the know-how and capability to mediate the curation of database entries as well as managing to have extremely resounding success in terms of gathering reasonably high quality knowledge.
After having a persuasive discussion with Alex, I decided to give it a try myself and add my very first entry to Wikipedia, which I thought it could potentially help the database I develop outsource its public/non-sensitive data annotation part.
I copied, edited and formatted parts of a non-sensitive entry (a Syndrome description) to Wikipedia. I learnt –contrary to what I expected- that as long as one has an account and no entry exists on the topic, a page can be added on the fly. So I added a page and started editing, copying and pasting.
It took me a bit of time to get used to some of the conventions and formatting tags used by Wikipedia but very early on I had help from Wikipedia ‘agents’. It really surprised me how quickly these agents picked up my entry and immediately made me know the criteria for making sure this Wikipedia entry achieves a high standard.
I learnt about important concepts in the Wikipedia context such as Notability and Conflicts of Interests. Apparently one cannot write about oneself for example, and personal opinions or articles are not accepted. So far this was OK for me although problems came when one of this agents pointed at some copywriting issues: I was trying to copy an entry of a website/database.
Blatant copy of public content from another website is considered a copyright violation unless a correct license is put in place and one ‘owns’ the data. In our case, the Creative Commons License, which is the one we hold, was not OK because although it lets public use of the information, it does not allow alteration. This means that people would not be able to edit my Wikipedia entry.
I must admit I felt intimidated at this point. Despite that, I was extremely impressed with the efficacy with which agents acted as well as how quickly they responded to my queries. I can understand why they have to be so tough so that they prevent abuse.
Overall I feel quite satisfied with what I have learnt in the process and I am extremely eager to keep exploring the use of Wikipedia for database curation. Of course this is just a try and our adopted solution for keeping up with current annotation may be something different in the end. However, it is worth a try.
A Script to Calculate GC content
February 3rd, 2010 § 12 Comments
Intermediate Perl
GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.
The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:
#WRONG METHOD 1
sub calcgc {
my $seq = $_[0];
my @seqarray = split('',$seq);
my $count = 0;
foreach my $base (@seqarray) {
$count++ if $base =~ /[G|C]/i;
}
my $len = $#seqarray+1;
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.
I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time. However, according to a colleague, Andy Jenkinson, it contains some bugs:
#WRONG METHOD 2
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
for (my $i = 1;$i<$len+1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
The reasons for being wrong, Andy argues, are that “it ignores the first character of the sequence because the substr function is zero-index based. The rounding at the end using \S{6} also only works where there are >=6 characters in the resulting fraction – so a string like “ATCG” has a GC content of 0.5, but will appear to your application as zero. If you need to do this, you should use \S{0,6}.”
I addition to this, he adds that whilst it solves the memory issue, [one] might also consider a much more CPU-friendly and simpler implementation:
#METHOD 3 $count++ while $seq =~ /[CG]/gi;
He carried out a test simulation of #METHOD 3 for human chromosome 1 (247 million characters), which took 12 seconds with the same memory footprint as #METHOD 2, which took 111 seconds. Here is the source code for Andy’s simulation:
use strict;
sub calcgc {
my $seq = $_[0];
my $count = 0;
my $len = length($seq);
for (my $i = 1;$i<$len+1; $i++) {
my $base = substr $seq, $i, 1;
$count++ if $base =~ /[G|C]/i;
}
my $num=$count / $len;
my ($dec)=$num =~ /(\S{6})/;
return $dec;
}
sub calcgc2 {
my $seq = $_[0];
my $count = 0;
$count++ while ($seq =~ m/[GC]/gi);
my $num = $count / length($seq);;
my ($dec) = $num =~ /(\S{0,6})/;
return $dec;
}
my $seq = "CBADEFGHIJ"x (100*1000*247);
my $time1 = time();
my $gc1 = calcgc($seq);
my $time2 = time();
print "Old method: '$gc1' in ".($time2-$time1)." seconds\n";
my $gc2 = calcgc2($seq);
my $time3 = time();
print "New method: '$gc2' in ".($time3-$time2)." seconds\n";
I have not had time to test #METHOD 3 yet, but I hope this last addition helps people.
Happy coding!

