Manuel Corpas

Genomes, Internet, Bioethics and More…

A Script to Calculate GC content

Intermediate Perl

GC content is a very interesting property of DNA sequences because it is correlated to repeats and gene deserts. A simple way to calculate GC content is to divide the sum of G and C letters by the total number of nucleotides in the sequence. Let’s assume that you start with a string $sequence.

The WRONG way in which I initially did this was to convert the string to an array of letters, as shown here:

sub calcgc {
 my $seq = $_[0];
 my @seqarray = split('',$seq);
 my $count = 0;
 foreach my $base (@seqarray) {
   $count++ if $base =~ /[G|C]/i;
 }
 my $len = $#seqarray+1;
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

This is a very inefficient way of calculating the GC content, because arrays in Perl are quite expensive in terms of memory. The result of this was that I run out of memory quite quickly.

I found a more efficient approach by using the substr function, looping through the whole sequence, taking one base at a time:

sub calcgc {
 my $seq = $_[0];
 my $count = 0;
 my $len   = length($seq);
 for (my $i = 1;$i<$len+1; $i++) {
   my $base = substr $seq, $i, 1;
 $count++ if $base =~ /[G|C]/i;
 }
 my $num=$count / $len;
 my ($dec)=$num =~ /(\S{6})/;
 return $dec;
}

Now I do not run out of memory and the script promptly calculates the GC content for every sequence I have tried it with. I have not tried it with a whole genome though!

Happy coding!

Filed under: Bioinformatics, Tutorials , , , ,

A Simple Script to Remove Duplicate Emails

Basic Perl Programming

Suppose you have a list of emails compiled from different sources and want to get rid of duplicates. You google this and to your amazement (and mine), there is no free service available to paste your emails to return a list of unique addresses.

Here is a small simple script to eliminate email duplicates, in case this saves a few minutes (or hours!) to someone else.

This script is written in Perl and takes a list of email addresses (one per line), prunes it and returns a list with unique ones. You will need to have installed Perl in your computer (you can download it for free in Windows using ActivePerl); any flavors of UNIX or Mac OS should have Perl installed by default.

Once you have Perl installed, copy the following script to a file and save it as duplicate_email_remover.pl in a suitable directory:

#!perl
use strict;
my %hash;
while(my $line = <>) {
  chomp($line);
  $hash{$line}++;
}
map {print $_ ."\n"} keys %hash;

Also make sure that you have your email list (one email per line) saved in a file named emails.txt in the same directory where you saved duplicate_email_remover.pl.

Run the following command in your console (or command prompt if you use Windows):

perl duplicate_email_remover.pl < emails.txt

You should get printed to your console a list of emails with unique addresses.

Enjoy!

Filed under: Tutorials , , , , ,

Tips on How to Manage Databases with Medical Data

Collaborative tools are being developed to facilitate the interpretation of microarray and sequencing data. These are the result of the shifting paradigm in current clinical practice towards a more personalized approach. This trend of genomics becoming personal is expected to augment dramatically, catalyzed by the next generation of sequencing technologies as well as the standardization of microarrays in clinical genetics tests.

Genomic databases provide access to different levels of data as well as bioinformatics tool suites that facilitate its integration and visualization. Such databases may contain sensitive data from patients, including pictures, medical history, sex, age, etc. In the case of clinical diagnosis for developmental disorders, having pictures and phenotypic information may aid diagnosis of new syndromes. Morphological information can be especially useful to discriminate normal variation from structural changes in the DNA that may cause disease. Usually, this data, on its own, is not a threat to the privacy of patients. However, it has been shown by Homer et al [1] that the considerable push to make experimental data publicly available does not completely conceal identity, since it is straightforward to assess the probability that a person or relative participated in a study.

Many approaches and solutions have been suggested to ensure data protection for patients while enabling information sharing and access of scientific data. Here we summarize a set of measures that are being successfully implemented in current databases with sensitive data. This is the result of experiences and procedures that have been accumulated by MC, involving both developers and users. We do not expect these guidelines will eliminate every risk but hope that will reduce them dramatically.

1. Use password protection for access of data. This exerts control on who access data. Grant different privileges to users and depending on their privileges grant different levels of access to specific pieces of information.

2. Restrict access to sensitive data to qualified users. Allow only accredited centers/institutions for access to sensitive patient data. Accredited users may be those who prove that they work for a reputable institution, such as a hospital or university, have recent publications in the field and belong to a relevant professional society.

3. Use Encryption. The information stored in the database itself may be encrypted. Protocols for transfer of information may also be encrypted (https). Organizations may also run their own certificate authority, particularly if they are responsible for setting up browsers to access their own sites (for example, sites on a company intranet, or major universities). They can easily add copies of their own signing certificate to the trusted certificates distributed with the browser.

4. Use Data Access Agreements. Data access agreements may be useful for specifying allowed uses of the data and the ways in which research findings may be shared with the community. This may ensure that only users who have signed the agreement have access to the data.

5. Handle images in a special manner. Images may be some of the most sensitive pieces of information. Watermarking images with recognizable stamps may prevent attempts of sharing this information among unauthorized users. Access to image data may be only allowed to reputable institutions.

6. Consent. Publish only sensitive data that has been previously consented by the patient or person involved. A consent form should specify the uses of the data in an intelligible way for non experts.

7. Do never carry copies of your database in laptops or any portable devices. Database developers and administrators should never store backups or development databases outside their institution. Portable devices can be easily lost or stolen.

8. Do not make accessible the contact details of whoever entered or owns the data. Doctors or assistants will normally be the people who enter the data in their database. Their affiliations may be crucial information for determining the origin of the data deposited. Only allow accredited users to have access to this information.

9. Keep patient notified with any changes of use of data. Databases are likely to be in a continuous process of improvement, adding new features to them, allowing new uses and even moving locations. It is very important to notify your patients of the whereabouts of their data in a regular basis. It is not legitimate to provide new uses which where not initially anticipated in the consent form.

10. Have access to an Ethics Committee panel of experts. If your project is not big enough to have the resources to hire people for these purposes, at least it is recommended that any uses documents that regulate the use, access and publishing of data is reviewed by a panel of experts. Such a panel would ideally be composed by lawyers and scientists with experience in ethical handling of data.

We have presented here a set of rules that describe a starting point on handling and management of sensitive data. As personalized medicine starts delivering health care to patients, it is expected that databases containing potentially sensitive information will proliferate. A set of guiding principles have been exposed here that may help keep patient data from abuse. It is of paramount importance that database developers and administrators seek expert advice in any step of the development process.

  1. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167

Filed under: Bioinformatics, Computational Bioethics, Technology, Tutorials , , , , , ,

Array CGH for Dummies

Array-CGH (Comparative Genomic Hybridasation) is becoming a common method used for analysis of patients’ genomes. Array-CGH works by taking a reference genome covering the whole human genome sequence, cutting it into thousands of pieces and orderly attach them to a chip. These pieces are called probes and are usually on the range of 500-2000 DNA bases long. A saliva or blood sample is then taken from the patient and its DNA is also chopped into thousands of pieces in suspension with a solvent. The array is then washed with the suspension containing the patient’s DNA.

DNA is a double chain of nucleotide bases where one chain complements the other. Knowing one chain of the DNA, it is possible to know the other chain. In its natural state, a single DNA chain will tend to bind to its complementary chain. Thus, by washing the patient’s suspension with the array probes will make the patient’s DNA pieces bind to its complementary DNA in the array.

Picture 4Array-CGH can be used to detect whether a patient has a region of the genome missing or duplicated. Probes attached to the chip emit a different color depending on their state of binding. Once the array is washed, most of probe spots will appear yellow, that is, all different probes of the reference genome are bound to the patient’s DNA. If a DNA region is missing in the patient, the complementary spots in the array appear in red. These changes appear in sequential order mapping to the reference genome missing in the patient. Depending on the genes that overlap to the deletion, different symptoms may appear in the patient.

The same happens if the array shows a series of green spots, indicating that a duplicated region of the genome has been found in the DNA of the patient. Because the gene content will be altered in the duplicated region, this may cause disease as a consequence of the over-expression of genes included in the duplicated regions.

Thus, using array techniques, we are now able to find deletions or duplications in the genome of a patient beyond the microscopic level, i.e. changes not directly observable. We are all familiar with the features of a patient with Down’s Syndrome. This syndrome is caused because there is an extra copy of chromosome 21 in the affected patient, due to a duplication of one of the two usual copies (Trisomy 21).

Most of the chromosomal deletions and duplications occur at the molecular level [1], not identifiable with microscopic techniques, as in the case of Down’s Syndrome. Up until recently most of the patients suspected of suffering from genomic diseases, i.e. diseases caused by pathogenic deletions or duplications, went undiagnosed because techniques did not allow detection beyond big chromosomal changes (like whole chromosomes). Techniques such as array-CGH now allow detection of chromosomal changes a thousand times smaller in length.

For a price of about £100 per array one can have one’s genome screened for chromosomal changes. In fact, it seems that most of the genetic changes between any two people (in terms of number of DNA bases) is dependent on the level of micro- deletions and duplications (called Copy Number Variations) [2], just the level we are now starting to handle with current analysis techniques. Next generation sequencing technologies are fast arriving that will allow the base-by-base complete sequencing of the DNA of people at price of $1000 in a short period of time [3].

[1] H.V. Firth, S.M. Richards, A.P. Bevan, S. Clayton, M. Corpas, D. Rajan, S. Van Vooren, Y. Moreau, R.M. Pettett, N.P. Carter (2009). DECIPHER: DatabasE of Chromosomal Imbalance and Phenotype using Ensembl Resources. The American Journal of Human Genetics.

[2] J. R. Lupski (2009). Genomic disorders ten years on.  Genome Medicine

[3] Mardis E.R. (2006). Anticipating the 1,000 dollar genome. Genome Biology


Filed under: Bioinformatics, Tutorials , , , , , , ,

Validating Chromosome Entered is Correct

When you have a web form and one of the fields to be entered is the Chromosome Number, you’d be wise to check that the user does not enter the wrong thing (e.g. ‘0′ , ‘X1′, ‘21.13′). Thus a validation check may save you some headaches.

I set out to find the appropriate regular expression in Javascript that returns an error when the chromosome entered is not 1-22 or X or Y. It turns out that for me it wasn’t that easy to solve this little problem. I searched in Google to see if anyone had posted the solution to the problem and found nothing meaningful.

Therefore I paste below the solution I wrote. I must admit that it is rather ugly the fact that I have to use two if expressions. I would appreciate if any  reader posted a more elegant solution. But for now, this code seems to work fine.

function notValidChr(field)
{
  // Field may start with a number or optionally two
  if ( field.value.match(/^\d\d?$/)
      // Is it an integer greated than 0?
      && (0 < parseInt(field.value)
            // Is it an integer smaller than 23?
            && parseInt(field.value) <23
          )
      )
  {
    // If all conditions above are met, notValidChr is not true
    return false;
  }
  // Check if it is a valid X or Y chromosome
  if (field.value.match(/^[XY]$/)) {
    // If so notValidChr is not true
    return false;
  }
  // Yes! the text entered is not a valid chromosome
  return true;
}

Filed under: Bioinformatics, Tutorials

Tips for Remapping from NCBI36 to NCBI37 Genome Assembly

It might seem for some people straight forward but I had to spend quite some time trying to understand how to remap my array probes from ncbi36 to ncbi37. If you use the Ensembl genome browser, you might have noticed that from July 2009 the ncbi37 assembly is now in use. For DECIPHER (the database I help develop), this is a little bit of a headache, because it means that all of the probes from array CGH that we used have to be remapped to the new assembly. If this does not interest you I recommend that you stop reading here.

First I learned that there is a program called liftOver by UCSC that is able to do this remapping. Since the amount of probes I have to map (around 6 million) is a number that I would not wish to through to anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary for me to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using debian and it sounds like a conservative option.

Once I downloaded the program, I needed to make it executable:

chmod u+x liftOver

OK, so I was in a position to run it:

./liftOver

In the usage information it appears that I need several arguments and files to be able to run this program correctly:

liftOver oldFile map.chain newFile unMapped

Now I learned that I need also to get a file called the map.chain. I was not sure what it meant. I learned that this map.chain file has parameters that are used by liftOver and that there are map.chain files depending on the remapping one wants to do. In my case, I want to remap from ncbi36 to ncbi37 in human. However, when I look at the different remappings, I do not see ncbi formats anywhere. I learned here that what I am looking for is map chain file that is called this:

hg18toHg19.over.chain

Apparently hg18 refers to ncbi36 and hg19 to ncbi37. Doing a google search I could find that file here.

Now I get quite a few options and learn that I need to have my probes in bed format to run liftOver. Apparently there are quite a few formats I can use according to UCSC FAQs formats. Here an example of what my bed file looks like (chromosome-tab-start_position-tab-end_position):

chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587

Now I am in a position to run liftOver. I notice now that in the usage one has the following description:

liftOver oldFile map.chain newFile unMapped

‘newFile’ and ‘unMapped’ are the names of the files where the output goes into and therefore are empty. This can be confusing as the user might think that these are some other kind of files one has to get hold of.

OK, so now I am ready to transform our old array probe mapping ncbi36 to the new ncbi37 one:

./liftOver probes.ncbi36 hg18toHg19.over.chain probes.ncbi37 unmapped-to-ncbi37

I got the following output to console:

Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453

…which is a bit worrying.

I’ve gone through my probes and found that some of them (just 44757!) had start point coordinates greater than their ends. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and re run liftOver again.

This time it worked.

Filed under: Bioinformatics, Tutorials , , , , , , ,

Of Hackers and Painters

Paul Graham’s Hackers and Painters: Big Ideas from the Computer Age is a book that tells a lot about the life of the author himself. According to Graham, nerds do not spend as much time thinking about how to be popular. He calls himself a nerd and tells how he was in the ‘D’ popularity group at school, ‘E’ being the least popular group of all. This in part implies that hacker traits are implicit from young age. The author has the very interesting background of having studied painting in Florence after finishing his PhD in Computer Science at Harvard. He explains his experience creating his own startup company, called Viaweb, which ended up being bought by Yahoo, becoming a very successful online store tool.

The author thinks that the reason why Viaweb managed to become the leading company in their sector was due to providing their services via the web itself and using the Lisp computer language. According him, being a web software company has great advantages. Web companies are able to add features incrementally or fixing bugs without having to ship the application back to the user every time: such small changes make releases more manageable and bugs can be reproduced exactly without having to guess the user’s environment and configuration. Web applications are much easier to run, you just need a web browser. Traditional standalone applications require the user to know system administration (getting the right configuration) but in web applications the system administration burden is transferred to the developers instead.

Coding in Lisp apparently was a definitive advantage to keep abreast with competition. He argues strongly in favor of Lisp as the most powerful language due to its ability to provide greater capability for abstraction than any other language. Lisp was initially developed as a mathematical formalism and then made into language. Other computer languages tend to be based on assembler specifications, approaching concepts from the hardware point of view rather than mathematics. This argument made me think seriously about learning at least the fundamentals of Lisp to discover new ways of coding abstractions. I found in Paul Graham’s own website an introductory tutorial for Lisp. I look forward to my next trip to this new mindset.

Filed under: Technology , , , , ,

Computational Bioethics: Leveraging Personalized Medicine Ethically

Computational Bioethics relates to the appropriateness of use, management of access and discovery of biological insights applied to patient health. A simple search in Google shows that this concept has only been used in the past at a workshop session in the Rocky Bioinformatics conference in 2008. In the resulting publication of this meeting, Computational Bioethics is said to encompass ethical issues that are unique to computational scientists compared with bench and clinical scientists. In addition, it is argued that as medical informatics, computational biology and bioinformatics become more widely used in medical research, these issues will become even more relevant in the future.

And the future is now become a present reality. The rise of next generation sequencing technology, together with the adoption of microarray techniques in normal clinical practice is revolutionizing the way patient’s data is handled. Similarly, the amount of data created is surpassing most software tools available. Projects like the 1000 Genomes are generating an enormous pool of variability of genome data and derived sources of information. This makes necessary the exchange of data between research centers to be able to establish what constitutes normal and pathogenic variation in human beings.

Personalized medicine, a widely used term since the near completion of the Human Genome Project, implies the application and use of personal genome data for the determining genomic changes leading to disease, risk factors and patient’s susceptibilities. The problem with this sort of data is that it may yield information that could severely change the life style of the person, affect relatives and, in many cases, with little ability to do anything to mitigate the effects of findings. As our current knowledge is so fragmented and rudimentary, it is expected that collected information with no current value may be relevant in the future, influencing analyses and results.

Some ethical practices currently implemented in Computational Bioethics include the creation of consent forms that allow putting anonymized patient information in genome databases. This information may be accessed by the scientific community and compared with other previously consented patient data for the characterization of new syndromes, the effect of knockout genes in the general population and the understanding of genome variation in normal individuals.

Filed under: Computational Bioethics, Technology

How to be a Biohacker

Biohackers embrace fully the philosophy of hackers: love for freedom, veneration of competence and utter curiosity for how things work. How does one become a biohacker? Usually biohackers cannot tell if they are really one of them until someone else says so. However, it is not enough to be competent in the mastery of programming or being a computer wiz. You need IT skills that suit computational biology research and familiarity with the biology itself, which in the end is the problem one has to solve.

A big attitude to the biohacker philosophy is that you do not only need love to solve technical problems for their own sake; you need to think of living organisms as an extension of the information systems you work with. Biological concepts may be then abstracted into objects whose hierarchical organization reflect the different levels of order in living things. Computer languages thus become the perfect analogy for understanding the complex information flows in living systems.

True to hackerdom culture, Unix, Perl and MySQL are programming skills that you need to master (I can think of people who would also say Java, Javascript, CSS, etc.). The best way to master the art of programming is to spend as much time as possible reading and writing source code. Some people think Perl is doomed. This is not true in the biohackers world. In part due to legacy and in part to the flexibility it provides, Perl is still the language of choice for many biohackers. Perl is used to construct 1) the back end of web applications, 2) pipelines and workflows and 3) quick and dirty scripts for parsing and calling other programs.

You will also need to be familiar with projects like R and Bioconductor, since a lot of the work will involve providing the computational infrastructure for analyzing data. In addition, you’ll need to know about data formats (fasta, sbml, mmcif…), software toolkits and libraries (Paup, Phylip, EMBOSS, BioPerl…), databases (Ensembl, InterPro, PDB, KEGG…), webservers and portals (Pubmed, ISCB).

Finally keep in mind best practices. Some of them I wrote about in a previous post (like refraining from reinventing the wheel), but above all, give yourself the time to enjoy the learning process. Getting to the top usually takes longer than staying at the top; so what’s the point if you haven’t enjoyed the trip?

Filed under: Bioinformatics , , , , ,

EMBnet comes to Major Bioinformatics African Conference

Arun Gupta, Nicola Mulder, Manuel Corpas

Bioinformatics is a relatively affordable scientific discipline to establish as it requires intellectual capacity but not expensive laboratory facilities or equipment. This makes it a very accessible discipline to scientists in poorly resourced countries in Africa. The International Society for Computational Biology (ISCB) and the African Society for Bioinformatics and Computational Biology (ASBCB) have teamed up to organize a major meeting in Africa in 2009 focused on the theme “Bioinformatics of Infectious Diseases: Pathogens, Hosts and Vectors”. This meeting, a new venture between ISCB and ASBCB and a follow on from a previous successful meeting held in Nairobi by the ASBCB, will be held this year in Bamako (Mali), hosted by the prestigious Malaria Research and Training Center, an important facility for malaria research in Africa. Although it will have a particular African focus, the meeting is intended to be a fully-fledged international event, encompassing scientists and students from leading institutions in the US, Latin America, Europe and Africa. By holding this event in Africa, we intend to stimulate local efforts for cooperation and dissemination of leading research techniques to combat major African diseases.

Program

The meeting will consist of a 4-day conference followed by 2 days of practical workshops. The first 3 days of the meeting will include keynote presentations by 6 invited speakers from around the world. The last day of the conference will be a dedicated KAUST (King Abdullah University of Science and Technology) day focused on the topic “Systems view of biological organisms”. KAUST has secured 20 full fellowships (valued at up to $1,700 each) to cover travel expenses, registration and accommodation for Africans attending the ISCB Africa ASBCB conference. Highly accomplished researchers will present the 2 days of post-conference tutorial workshops. Erik Bongcam-Rudloff, chairman of EMBnet, will give a keynote presentation and a tutorial during a workshop day.

Participation

We expect that most participants will come from Africa, and that the majority will be on the level of junior faculty, PhD students and post-doctoral researchers. Travel fellowships will be awarded to African researchers and students to cover travel and local expenses, with priority given to those selected for oral presentations through peer review of submitted research paper submissions. Several other travel fellowships will be secured for non-African participants. Fellowships will be entirely dependent on the funds available.  Hence, prospective participants are encouraged to seek their own sources of funding.

Submissions

The conference will consist of a single track with oral and poster presentations. There will be a call for submission of abstracts (papers of up to 1000 words will be required for oral presentations); authors will indicate whether they aspire to give an oral or poster presentation. Graduate students, young investigators and all African researchers involved in the field will be strongly encouraged to submit an abstract describing their work. Submissions from all other parts of the globe are invited as well. The Scientific Committee will evaluate submissions, and the (few) selected abstracts from the oral presentation track will be invited for oral presentation, while others will be invited for poster presentation. Using this model we have previously established a powerful cadre of networking scientists – who have demonstrated that by meeting, they can establish networks of collaboration between Africa and the US and EU, as well as between African countries.

Proceedings

The first ASCBCB conference was published in a special issue of Infection, Genetics and Evolution (see: http://dx.doi.org/10.1016/j.meegid.2008.09.002 and http://dx.doi.org/10.1016/j.meegid.2008.09.003). The organizers are in the process of negotiating and securing a publication for the proceedings of this 2009 conference.

Website for further details:

https://www.iscb.org/iscb-africa

About EMBnet

EMBnet is a bioinformatics-based group of collaborating nodes throughout Europe and around the world. Its combined expertise provides services to the molecular biology community.

http://www.embnet.org/

Filed under: Bioinformatics, Conferences

About this Blog

Written and maintained by Manuel Corpas, a Computational Biologist at the Wellcome Trust Sanger Institute (Cambridge, UK), a leading international center for genome research. Any views expressed here are the author's alone and do not necessarily form part of the official positions of his employer.
Subscribe to me on FriendFeed