It might seem straightforward to some people, but I had to spend quite some time working out how to remap my array probes from NCBI36 to GRCh37. If you use the Ensembl genome browser, you might have noticed that since July 2009 the GRCh37 assembly has been in use. For DECIPHER (the database I help develop), this is a bit of a headache, because it means that all of the array CGH probes we use have to be remapped to the new assembly. If this does not interest you, I recommend that you stop reading here.
First I learned that there is a program called liftOver, from UCSC, that can do this remapping. Since the number of probes I have to map (around 6 million) is not something I would wish to throw at anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary for me to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using Debian and it sounded like the conservative option.
Once I downloaded the program, I needed to make it executable:
chmod u+x liftOver
OK, so I was in a position to run it:
./liftOver
The usage information shows that I need several arguments and files to run this program correctly:
liftOver oldFile map.chain newFile unMapped
Now I learned that I also need to get a file referred to as map.chain. I was not sure what it meant. I learned that this map.chain file holds the parameters that liftOver uses, and that there are different map.chain files depending on the remapping one wants to do. In my case, I want to remap from NCBI36 to GRCh37 in human. However, when I looked at the different remappings, I did not see NCBI names anywhere. I learned here that what I am looking for is a chain file called:
hg18toHg19.over.chain
Apparently hg18 refers to NCBI36 and hg19 to GRCh37. With a Google search I found that file here.
Now I get quite a few options and learn that I need to have my probes in BED format to run liftOver. Apparently there are quite a few formats I can use, according to the UCSC FAQ on formats. Here is an example of what my BED file looks like (chromosome, tab, start position, tab, end position):
chrY	12308579	12468100
chrY	12468101	12581699
chrY	12581700	12759636
chrY	12759637	12838587
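For anyone who wants to script this step, here is a minimal sketch in Python of how probe coordinates could be written out as three-column, tab-separated BED lines. The function name `to_bed_line` and the sample `probes` list are hypothetical, and adding a "chr" prefix is an assumption based on the chrY naming in the example above.

```python
def to_bed_line(chrom, start, end):
    """Format one probe as a three-column, tab-separated BED line."""
    # Assumption: chromosome names should carry the UCSC-style "chr" prefix,
    # as in the chrY example. Names that already have it are left alone.
    name = chrom if chrom.startswith("chr") else "chr" + chrom
    return f"{name}\t{int(start)}\t{int(end)}"


# Hypothetical input: (chromosome, start, end) tuples taken from the example.
probes = [("Y", 12308579, 12468100), ("chrY", 12468101, 12581699)]
bed_lines = [to_bed_line(*probe) for probe in probes]

for line in bed_lines:
    print(line)
```

Writing `bed_lines` to a file, one per line, should give something liftOver can read as its oldFile argument.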
Now I am in a position to run liftOver. I notice that the usage gives the following description:
liftOver oldFile map.chain newFile unMapped
‘newFile’ and ‘unMapped’ are the names of the files that the output is written to, so they do not need to exist beforehand. This can be confusing, as the user might think that these are some other kind of input files one has to get hold of.
OK, so now I am ready to convert our old NCBI36 array probe mapping to the new GRCh37 one:
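Once a run has finished, the two output files can be used to check how many probes made it across. The following is a rough sketch in Python, not part of liftOver itself. The helper names `count_records` and `summarize` are hypothetical, and the sketch assumes that annotation lines in the unmapped file start with "#" and should not be counted as probes.

```python
def count_records(lines):
    """Count BED records, skipping blank lines and '#' annotation lines."""
    # Assumption: liftOver marks its explanatory lines in the unmapped
    # file with a leading '#'; everything else is treated as one probe.
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def summarize(new_path, unmapped_path):
    """Return (mapped, unmapped) record counts for the two output files."""
    with open(new_path) as new_file, open(unmapped_path) as unmapped_file:
        return count_records(new_file), count_records(unmapped_file)
```

With the file names used in this post, the call would look like `summarize("probes.grch37", "unmapped-to-grch37")`.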
./liftOver probes.ncbi36 hg18toHg19.over.chain probes.grch37 unmapped-to-grch37
I got the following output to console:
Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453
…which is a bit worrying.
I went through my probes and found that some of them (just 44757!) had start coordinates greater than their end coordinates. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and ran liftOver again.
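Taking those probes out can be scripted. Here is a minimal sketch in Python, assuming whitespace-separated BED lines with the start in the second column and the end in the third. The function name `split_bed_lines` is hypothetical, and this is one possible way to do the filtering rather than the exact procedure used for DECIPHER.

```python
def split_bed_lines(lines):
    """Separate BED lines into (valid, inverted) by comparing start and end."""
    valid, inverted = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or incomplete lines are dropped in this sketch
        start, end = int(fields[1]), int(fields[2])
        # liftOver rejects records where chromStart > chromEnd,
        # so those are set aside instead of being passed on.
        (inverted if start > end else valid).append(line)
    return valid, inverted


# The second record is the one reported in the error message above.
example = ["chrY\t12308579\t12468100", "4\t2515512\t2515453"]
valid, inverted = split_bed_lines(example)
print(len(valid), "kept,", len(inverted), "set aside")
```

Keeping the inverted records in a separate list, rather than discarding them, leaves room to inspect them later and decide whether they should be swapped or dropped.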
This time it worked.
Alex
Great thanks, but shouldn’t NCBI37 read “GRCh37”?
manuelcorpas
yes, you are absolutely right!
Manuel
jeff
whoops wrong post… should have been under Validating Chromosome Entered is Correct
I lose.
jeff
Would the following regex help? Haven’t tried it in JS, but should just work….
^([1-9]|1[0-9]|2[0-2])(X|Y)$
Felix
Thanks for this, Manuel! Saved a lot of time.