From Life in the Server to Life in the Cluster

March 12, 2012

Life in early 2011:

  • Work around a server, one process, gigabyte datasets.

Life in early 2012:

  • Work around a cluster, many processes, terabyte datasets.

I remember the old days, when I had to pipette to run an experiment. Today I do not have to pipette; I run a command or pipeline in a computer terminal connected remotely to a cluster of a few thousand nodes. Sometimes it might be quicker to run a PCR than to run my workflow script.

I consider it a privilege to be “drowning” in data. Why? Because this is the future. More data brings more hypotheses and more hypotheses bring more knowledge. One either learns to surf the waves or gets caught by the tsunami soon enough.

How does it feel from the inside? It feels exciting, overwhelmingly exhilarating! It feels like wanting to surf in a sea of data while being happy just to stay afloat: this is the inevitable fate of genome bioinformaticians dealing with next-generation sequencing data.

What is next on my to-do list?

Cloud computing. I am counting the days until my experiments will be run in the cloud, not the cluster.

Image by Sam Johnston (CC BY-SA 3.0 license)

I look forward to welcoming you to the data feast. Will you join?

Some 10 Current Interesting Challenges in (Computational) Biology

June 5, 2010

This is not an exhaustive list, but rather a compendium of current problems that I encounter on a regular basis. This post might be especially useful for students who want to find a challenging problem for their research, or simply for anyone interested in knowing some of the science that goes on at the Wellcome Trust Genome Campus and beyond.

  1. To understand genome variation. How to explain variation within and between species? What are the mechanisms that produced those changes? How can those changes explain different susceptibilities to diseases and traits?
  2. To predict a genotype given a phenotype. How to correlate phenotypic terms to specific mutations? How to encode phenotypes in a computationally friendly format?
  3. To understand genetic heritability of complex diseases like Alzheimer’s, Parkinson’s or stroke. GWAS have shown that the contribution of any one gene to specific complex diseases is meager or marginal in most cases. What models are needed to describe how mutations lead to disease? What pieces of the puzzle are missing?
  4. To optimally manage the data resulting from large scale experiments. How to store this data and make it accessible? Where to store it? Locally? In the cloud? How to make sure that no important data is lost?
  5. To optimally integrate data from disparate sources for analysis. Should we use federated systems? How to combine the ever-growing number of formats? What software to use to make such analyses possible? How to visualize this data more intuitively?
  6. Data privacy and accessibility. As more and more sensitive data is produced for analysis of patients’ genomic disorders, how not to hamper reproducibility of experiments? At the same time, how can we protect the privacy of patients? How to secure systems where sensitive data is stored?
  7. Understanding the effects of epigenetics in molecular regulation and disease. What mechanisms are available for molecular regulation? How do they affect gene expression? What molecular agents are involved in epigenetic regulation?
  8. Understanding the role of RNAs as enzymes and regulatory entities. How many different kinds of RNA are there? What is their function? How did they evolve?
  9. How do transmembrane proteins fold? Given a protein sequence, can we predict its final 3D functional state? How does the cellular membrane affect the folding process? What helper molecules are involved to make sure that the protein folds correctly?
  10. Automatic extraction and text mining. Given the current mass of scientific literature, how can we automatically extract this knowledge from text? How close can we get to having computers “understand” human language? How to structure scientific literature to make it more machine-readable?

I am surely missing many other important topics, and I apologize for those I have left out. Feel free to add your own if you wish.

How to be a Biohacker

July 14, 2009

Biohackers fully embrace the philosophy of hackers: love for freedom, veneration of competence and utter curiosity for how things work. How does one become a biohacker? Usually biohackers cannot tell if they are really one of them until someone else says so. However, it is not enough to be a competent programmer or a computer whiz. You need IT skills that suit computational biology research and familiarity with the biology itself, which in the end is the problem one has to solve.

A big part of the biohacker attitude is that you need more than a love of solving technical problems for their own sake; you need to think of living organisms as an extension of the information systems you work with. Biological concepts may then be abstracted into objects whose hierarchical organization reflects the different levels of order in living things. Computer languages thus become the perfect analogy for understanding the complex information flows in living systems.

True to hackerdom culture, Unix, Perl and MySQL are skills that you need to master (I can think of people who would also say Java, JavaScript, CSS, etc.). The best way to master the art of programming is to spend as much time as possible reading and writing source code. Some people think Perl is doomed. This is not true in the biohacker’s world. In part due to legacy and in part due to the flexibility it provides, Perl is still the language of choice for many biohackers. Perl is used to construct 1) the back end of web applications, 2) pipelines and workflows and 3) quick and dirty scripts for parsing and calling other programs.

You will also need to be familiar with projects like R and Bioconductor, since a lot of the work will involve providing the computational infrastructure for analyzing data. In addition, you’ll need to know about data formats (FASTA, SBML, mmCIF…), software toolkits and libraries (PAUP, PHYLIP, EMBOSS, BioPerl…), databases (Ensembl, InterPro, PDB, KEGG…), web servers and portals (PubMed, ISCB).
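To give a flavour of the "quick and dirty" parsing scripts mentioned above, here is a minimal sketch of a FASTA reader. It is written in Python purely for illustration (in practice many biohackers would reach for Perl and BioPerl), and the function names `parse_fasta` and `gc_content`, as well as the example records, are hypothetical. It assumes the simplest form of the format: a header line starting with `>` followed by one or more sequence lines.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its concatenated sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read
            records[current_id] += line
    return records


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# Hypothetical example records, not real data
example = """>seq1 hypothetical example record
ATGC
GGCC
>seq2
ATAT
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), gc_content(seq))
```

For real work, an established toolkit is the safer choice, in keeping with the advice about not reinventing the wheel; the sketch is only meant to show how little code a first parsing script needs.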

Finally, keep in mind best practices. Some of them I have written about elsewhere (like refraining from reinventing the wheel), but above all, give yourself the time to enjoy the learning process. Getting to the top usually takes longer than staying at the top; so what’s the point if you haven’t enjoyed the trip?

Where Am I?

You are currently browsing entries tagged with computational biology at Manuel Corpas' Blog.
