Beware of Gene Names in Excel

For the past few days I have been trying to compile the most complete list of gene names possible. To start with, I was given an initial list of genes in an Excel file, taken from the HUGO Gene Nomenclature Committee (HGNC). Unfortunately, the gene names had been pasted from the original source into the spreadsheet without first changing the format of the column cells from the default. As a result, Excel tried to “help” by formatting the inserted values, converting gene names that resemble dates into actual dates. In bioinformatics, misnaming a gene can have disastrous consequences, such as misidentifying a causal gene in a clinical setting. Thus:

Beware of pasting gene names in an Excel spreadsheet with a default format, as these may be changed into dates.
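One way to act on this warning is to screen a gene list before it goes anywhere near a spreadsheet. The sketch below is only an illustration: the function name `at_risk_symbols` is hypothetical, and the pattern covers just the three families that turned up in my list (SEPT, MARCH and DEC). Other month-like prefixes may well be affected too, but they are not covered here.

```python
import re

# Prefixes limited to the gene families reported in this post (assumption:
# other month-like prefixes are deliberately left out of this sketch).
AT_RISK = re.compile(r"^(SEPT|MARCH|DEC)(\d{1,2})$")


def at_risk_symbols(symbols):
    """Return the symbols that look like a month name followed by a day number."""
    return [s for s in symbols if AT_RISK.match(s.strip().upper())]


if __name__ == "__main__":
    genes = ["SEPT1", "TP53", "MARCH11", "DEC1", "BRCA1"]
    for symbol in at_risk_symbols(genes):
        print(f"{symbol} may be converted into a date by a default-format cell")
```

Any symbol flagged this way is a candidate for pasting into a column whose format has been set to text beforehand.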

Of the 19,026 genes I have compiled so far, the following were automatically converted by Excel into dates. In the table below, the first column shows the date the gene name was changed to, the middle column the Ensembl ID of the gene, and the right column the original gene name.

Date      Ensembl ID             Gene name
Sep-01    ENSG00000180096        SEPT1
Sep-02    ENSG00000168385        SEPT2
Sep-03    ENSG00000100167        SEPT3
Sep-04    ENSG00000108387        SEPT4
Sep-05    ENSG00000184702        SEPT5
Sep-06    ENSG00000125354        SEPT6
Sep-07    ENSG00000122545        SEPT7
Sep-08    ENSG00000164402        SEPT8
Sep-09    ENSG00000184640        SEPT9
Sep-10    ENSG00000186522        SEPT10
Sep-11    ENSG00000138758        SEPT11
Sep-12    ENSG00000140623        SEPT12
Sep-14    ENSG00000154997        SEPT14

Mar-01    ENSG00000145416        MARCH1
Mar-02    ENSG00000099785        MARCH2
Mar-03    ENSG00000173926        MARCH3
Mar-04    ENSG00000144583        MARCH4
Mar-05    ENSG00000198060        MARCH5
Mar-06    ENSG00000145495        MARCH6
Mar-07    ENSG00000136536        MARCH7
Mar-08    ENSG00000165406        MARCH8
Mar-09    ENSG00000139266        MARCH9
Mar-10    ENSG00000173838        MARCH10
Mar-11    ENSG00000183654        MARCH11

Dec-01    ENSG00000173077        DEC1
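If a column has already been damaged, the date-like values in the table above can be mapped back to their gene names. The following is a minimal sketch under stated assumptions: the helper `restore_symbol` is hypothetical, the values are assumed to appear exactly in the `Mon-DD` form shown in the table, and only the three families listed above are handled. Where an Ensembl ID is available, checking against it is the safer route, since a date alone does not prove what the original name was.

```python
import re

# Mapping derived from the table above: month abbreviation -> gene family prefix.
MONTH_TO_PREFIX = {"Sep": "SEPT", "Mar": "MARCH", "Dec": "DEC"}

# Assumption: damaged values look like "Sep-01"; other date layouts are not handled.
DATE_LIKE = re.compile(r"^(Sep|Mar|Dec)-(\d{2})$")


def restore_symbol(value):
    """Turn a value such as 'Sep-01' back into 'SEPT1'; leave anything else as is."""
    match = DATE_LIKE.match(value.strip())
    if match is None:
        return value
    month, day = match.groups()
    # int() drops the leading zero that the date format added.
    return f"{MONTH_TO_PREFIX[month]}{int(day)}"


if __name__ == "__main__":
    for value in ["Sep-01", "Mar-11", "Dec-01", "TP53"]:
        print(value, "->", restore_symbol(value))
```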

 

5 comments

  1. Andres Muñiz (@Andresinmp)

    I guess the same would have happened with other spreadsheet programs like LibreOffice Calc and Gnumeric?
    Would it be better to just paste it into a plain text file?

    What about limits on length? At work I have problems with Excel because of the low number of columns it lets you have. I only realized later that I was losing data. Gnumeric allows a lot more columns and I trust its statistical calculations a lot more (linked to the R-project).

    1. admin

      Hi Andres!

      Today I have been working on an Excel spreadsheet with approximately 60,000 rows and, surprisingly, it did not crash. I say surprisingly because a few years ago, when I was doing my PhD, it would crash if it had more than 20,000 rows.

      Thanks for your contributions/comments.

      Manuel

  2. admin

    Thanks for the link. I have found 15 gene names that were not reported in the publication above [http://www.biomedcentral.com/1471-2105/5/80]. Conversely, 34 gene names reported there do not appear in my list.

    My gene list is the most up-to-date compilation of protein-coding genes that have Ensembl IDs and coordinate mappings to GRCh37. The remaining 34 that I do not have could be:

    1) Old genes that have been dropped
    2) Genes that are not protein coding
    3) Genes that do not have coordinates in GRCh37
