Sunday, January 15, 2012

A Basic Guide to the UCSC Liftover Command Line Utility


If, like me, you have spent time working with data files from various projects on the human genome, you have probably needed to compare files that are based on different builds of the human reference genome.


If you are only working with a small number of sites, it is very convenient to use the online UCSC Batch Coordinate Conversion (Liftover) tool. Lately, however, I have needed to translate some larger genome files from an older build of the human genome (hg18) to the newest version (hg19).


As an amateur programmer, I had a lot of trouble finding instructions on how to properly execute a liftover on my local machine.


After some trial and error I was finally successful in converting between genome builds on my local machine using the UCSC liftOver command line utility. So, I thought I would share some very basic instructions on how I accomplished this. (Note that I work on a Mac, so instructions for other machines might vary!)


DOWNLOADING AND CONFIGURING THE LIFTOVER EXECUTABLE
To begin, the data file you want to convert has to be in the Browser Extensible Data (BED) format. Basically, this is a tab-delimited file in which the first three columns are chromosome, start position, and end position.
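
Before converting, it can help to confirm that a file really has this three-column layout. The function below is a minimal sketch of such a check using awk. The name check_bed is made up for illustration, and it assumes a file with no header or track lines in which the start position is an integer strictly smaller than the end position.

```shell
# check_bed FILE
# Prints every line that does not look like a minimal BED record and
# returns non-zero if any such line was found.
# Assumptions: tab-delimited, no header/track lines, integer start < end.
check_bed() {
  awk -F'\t' '
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
      printf "line %d: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

For example, check_bed myData.bed (where myData.bed is a placeholder file name) prints nothing and succeeds if every line passes, and otherwise lists the offending lines.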


Next, a few files need to be downloaded from the UCSC Genome Bioinformatics site and configured. The first of these is the actual liftover executable file. To download and configure this:

  1. Go to http://hgdownload.cse.ucsc.edu/admin/exe/ and download the appropriate version of the liftOver utility for your system. Running a MacBook Pro with an Intel Core i5 processor, I chose the macOSX.i386/ version.
  2. Click on macOSX.i386/, then choose liftOver from the directory (I had to right click to download the file).
  3. Once downloaded, you must give the file permission to run as an executable by running at the terminal prompt:

    • $ chmod +x liftOver  (make sure your present directory is the one where liftOver resides)
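
The same download-and-chmod steps can also be done entirely from the terminal. This is only a sketch: fetch_liftover is a hypothetical helper name, it assumes curl is installed, and the caller supplies the URL of the binary that matches their platform.

```shell
# fetch_liftover URL DEST_DIR
# Downloads the liftOver binary from URL into DEST_DIR and marks it
# executable. The URL is not hard-coded; pass the one for your platform.
fetch_liftover() {
  url=$1
  dest_dir=$2
  mkdir -p "$dest_dir" || return 1
  # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
  curl -fsSL -o "$dest_dir/liftOver" "$url" || return 1
  chmod +x "$dest_dir/liftOver"
}
```

With the directory and file name from steps 1 and 2, the call would presumably look like fetch_liftover http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/liftOver "$HOME/bin", where $HOME/bin is just an example destination.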

THE BASIC LIFTOVER COMMAND
The liftOver executable is now ready to go. Run $ liftOver at the terminal prompt with no arguments to see its usage message. Note that if liftOver is not in a directory listed in your $PATH, you either have to move it to a directory in $PATH or give the full location of the file, e.g. /path/to/liftOver
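
A third option is to add the directory that holds liftOver to $PATH for the current terminal session. The helper below is a small sketch of that idea; add_to_path is a made-up name, and the change lasts only until the terminal is closed unless it is also placed in a shell startup file.

```shell
# add_to_path DIR
# Prepends DIR to PATH for the current shell session unless DIR is
# already listed there.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                       # already on PATH, nothing to do
    *) PATH="$1:$PATH"; export PATH ;;
  esac
}
```

After running add_to_path /path/to/directory, the command command -v liftOver should print the full location of the executable if the shell can now find it.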


The basic command of the liftOver utility takes this form:


$ liftOver oldFile map.chain newFile unMapped


Where:

  • liftOver is the initial command (again use full file location if executable is not in $PATH).
  • oldFile is the location of the file you want to convert to a new build (again, must be in BED format).
  • map.chain is the UCSC chain file that holds the instructions for conversion (instructions for download follow).
  • newFile is the location and name of the file that will hold the successfully remapped records.
  • unMapped is the location and name of the file that will hold the records that could not be mapped.
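
After a run it is useful to know how many records ended up in each output file. The wrapper below is a sketch under a few assumptions: lift_and_count and count_records are hypothetical names, the LIFTOVER variable is an invented convenience for pointing at the executable when it is not in $PATH, and lines starting with # in the unMapped file are treated as comments rather than records.

```shell
# count_records FILE
# Prints the number of non-empty lines in FILE that do not start with '#'.
count_records() {
  awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# lift_and_count OLDFILE MAP.CHAIN NEWFILE UNMAPPED
# Runs liftOver with the four arguments described above, then prints a
# one-line summary of record counts.
# Set LIFTOVER to the full path of the executable if it is not in $PATH.
lift_and_count() {
  "${LIFTOVER:-liftOver}" "$1" "$2" "$3" "$4" || return 1
  echo "input=$(count_records "$1") mapped=$(count_records "$3") unmapped=$(count_records "$4")"
}
```

If the mapped and unmapped counts do not add up to the input count, that is a hint that the input contains lines this simple counter does not treat as records (for example header lines), and the files are worth inspecting by hand.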



DOWNLOADING THE CHAIN FILE
As mentioned above, the liftover process requires an applicable .chain file from the UCSC site:

  1. Go to the UCSC Genome Browser Downloads Page.
  2. Find the download section for the species you are working with, then find the build of the genome that you wish to convert from.  Ex. If converting the human genome from hg18 to hg19, go to the downloads section for hg18.
  3. Find the link entitled LiftOver files and click.
  4. Find the appropriate conversion file and click to download (or right click and select Download Linked File).
  5. Once downloaded, use this file when you run the liftOver utility at the terminal prompt. 
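
For scripted downloads, the chain file location can be built from the two build names. This sketch assumes the UCSC naming pattern in which the hg18 to hg19 file is called hg18ToHg19.over.chain.gz and sits under goldenPath/hg18/liftOver/ on the download server. The function name chain_url is made up, and the pattern should be checked against the LiftOver files page for other builds or species.

```shell
# chain_url FROM TO
# Prints the expected download URL of the FROM -> TO chain file, assuming
# the pattern goldenPath/<from>/liftOver/<from>To<To>.over.chain.gz
# (the first letter of the target build is capitalized in the file name).
chain_url() {
  first=$(printf '%s' "$2" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$2" | cut -c2-)
  printf 'http://hgdownload.cse.ucsc.edu/goldenPath/%s/liftOver/%sTo%s%s.over.chain.gz\n' \
    "$1" "$1" "$first" "$rest"
}
```

The file could then be fetched with curl -O "$(chain_url hg18 hg19)" and unpacked with gunzip hg18ToHg19.over.chain.gz, which leaves hg18ToHg19.over.chain ready to pass as the map.chain argument.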

I have successfully used these basic instructions to convert a few relatively large (100 to 200 MB) genome position files. If no large sections end up unmapped, this tool should work without any additional options. If you run into problems, I found threads on the UCSC mailing list helpful.


I'm going to be trying to use the conversion tool on some larger whole-genome data files and I will update this post if I encounter any major issues.


I hope that some other amateur computational biologists out there find these brief instructions useful!