This is not an exhaustive list, but rather a compendium of current problems that I encounter on a regular basis. This post might be especially useful for students who want to find a challenging problem for their research, or simply for anyone interested in some of the science that goes on at the Wellcome Trust Genome Campus and beyond.
- To understand genome variation. How to explain variation within and between species? What are the mechanisms that produced those changes? How can those changes explain different susceptibilities to diseases and traits?
- To predict a genotype given a phenotype. How to correlate phenotypic terms to specific mutations? How to encode phenotypes in a computationally friendly format?
- To understand the genetic heritability of complex diseases like Alzheimer’s, Parkinson’s or stroke. Genome-wide association studies (GWAS) have shown that the contribution of any one gene to a specific complex disease is marginal in most cases. What models are needed to describe how mutations lead to disease? What pieces of the puzzle are missing?
- To optimally manage the data resulting from large scale experiments. How to store this data and make it accessible? Where to store it? Locally? In the cloud? How to make sure that no important data is lost?
- To optimally integrate data from disparate sources for analysis. Should we use federated systems? How to combine the ever-growing number of formats? What software to use to make such analyses possible? How to visualize this data more intuitively?
- Data privacy and accessibility. As more and more sensitive data is produced for the analysis of patients’ genomic disorders, how can we avoid hampering the reproducibility of experiments? At the same time, how can we protect the privacy of patients? How to secure systems where sensitive data is stored?
- Understanding the effects of epigenetics in molecular regulation and disease. What mechanisms are available for molecular regulation? How do they affect gene expression? What molecular agents are involved in epigenetic regulation?
- Understanding the role of RNAs as enzymes and regulatory entities. How many different kinds of RNA are there? What is their function? How did they evolve?
- How do transmembrane proteins fold? Given a protein sequence, can we predict its final 3D functional state? How does the cellular membrane affect the folding process? What helper molecules are involved to make sure that the protein folds correctly?
- Automatic extraction and text mining. Given the current mass of scientific literature, how can we automatically extract this knowledge from text? How close can we get to computers “understanding” human language? How to structure scientific literature to make it more machine-readable?
I am sure I am missing many other important topics, and I do apologize for those that I missed. Feel free to add your own if you wish.
I recall us chatting a while ago on your current research, and I stumbled across your blog … thought I’d add my 2p worth from my speciality.
“5. To optimally integrate data from disparate sources for analysis. Should we use federated systems? How to combine the ever-growing number of formats?”
One fairly plausible method is to use one of the data warehouse models, like the Kimball (http://www.ralphkimball.com/) star schema: one or more central fact tables with a number of dimensions hanging off the fact tables.
I realise your total database is multi-TB in size, but with modern database compression techniques and optimised designs, a star schema data warehouse design becomes feasible.
On the formats question, I’d suggest some data format that is fairly ubiquitous, like SQL in one of its many flavours. Possibly OLAP cubing might present some possibilities, if the dimension tables that support gene data have some strong hierarchies, which naturally lead into aggregations.
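As a rough sketch of the star schema idea, here is a toy example using Python's built-in sqlite3 module. The table names (dim_gene, dim_sample, fact_variant_count), the columns and all of the rows are made up purely for illustration and are not taken from any real genomics database. It shows one fact table with two dimension tables hanging off it, and an aggregation that rolls gene-level facts up the gene-to-chromosome hierarchy, which is the kind of roll-up an OLAP cube would precompute.

```python
import sqlite3

# Hypothetical star schema: one central fact table and two dimension tables.
# All names and values below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_gene (
    gene_id    INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL,
    chromosome TEXT NOT NULL      -- hierarchy level above the gene
);
CREATE TABLE dim_sample (
    sample_id  INTEGER PRIMARY KEY,
    label      TEXT NOT NULL,
    population TEXT NOT NULL
);
CREATE TABLE fact_variant_count (
    gene_id       INTEGER NOT NULL REFERENCES dim_gene(gene_id),
    sample_id     INTEGER NOT NULL REFERENCES dim_sample(sample_id),
    variant_count INTEGER NOT NULL
);
""")

conn.executemany(
    "INSERT INTO dim_gene VALUES (?, ?, ?)",
    [(1, "GENE_A", "chr1"), (2, "GENE_B", "chr1"), (3, "GENE_C", "chr2")],
)
conn.executemany(
    "INSERT INTO dim_sample VALUES (?, ?, ?)",
    [(1, "S1", "POP_X"), (2, "S2", "POP_Y")],
)
conn.executemany(
    "INSERT INTO fact_variant_count VALUES (?, ?, ?)",
    [(1, 1, 4), (2, 1, 2), (3, 1, 7), (1, 2, 1), (3, 2, 5)],
)


def variants_by_chromosome(connection):
    """Roll the fact table up from gene level to chromosome level."""
    rows = connection.execute("""
        SELECT g.chromosome, SUM(f.variant_count)
        FROM fact_variant_count AS f
        JOIN dim_gene AS g ON g.gene_id = f.gene_id
        GROUP BY g.chromosome
        ORDER BY g.chromosome
    """).fetchall()
    return dict(rows)


totals = variants_by_chromosome(conn)
print(totals)
```

Swapping the join to dim_sample and grouping by population would give the same kind of roll-up along the other dimension, without touching the fact table.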
Look forward to catching up with you guys sometime soon!
Nice to hear from you! Your suggestion actually reminds me of a solution which seems to have had some success in the bioinformatics world, called BioMart: http://www.biomart.org/. The website describes it as a query-oriented data management system. I have heard good and bad things about it, but it does use a star schema, a step in the direction you point to in your previous comment.
In a nutshell, it is analyzing evolutionary problems (and all problems in biology are to some extent related to evolution) as if one were reverse engineering an object-oriented software program. Fauceirs are units of evolution that comprise both information and control functions, like objects in object-oriented software.
Well some, not all, of these problems can be better tackled, not solved, by taking the fauceir approach.
What’s the fauceir approach? Please explain.