For the past few days I have been trying to compile the most complete list of gene names possible. To start with, I was given an initial list of genes in an Excel file that was taken from the HUGO Gene Nomenclature Committee (HGNC). Unfortunately, the gene names were pasted from the original source (HGNC) into an Excel spreadsheet without first changing the format of the column cells. As a result, Excel tried to “help” by reformatting the inserted values, converting gene names that resemble dates into actual dates. In bioinformatics, misnaming a gene can have serious consequences, such as misidentifying a causal gene in a clinical setting. Thus:
Beware of pasting gene names in an Excel spreadsheet with a default format, as these may be changed into dates.
Of the 19,026 genes I have compiled so far, these are the ones whose names Excel automatically changed into dates. In the table below, the first column shows the date the gene name was changed to, the middle column the Ensembl ID of the gene, and the right column the original name that Excel changed into a date.
Excel date   Ensembl ID        Gene name
Sep-01       ENSG00000180096   SEPT1
Sep-02       ENSG00000168385   SEPT2
Sep-03       ENSG00000100167   SEPT3
Sep-04       ENSG00000108387   SEPT4
Sep-05       ENSG00000184702   SEPT5
Sep-06       ENSG00000125354   SEPT6
Sep-07       ENSG00000122545   SEPT7
Sep-08       ENSG00000164402   SEPT8
Sep-09       ENSG00000184640   SEPT9
Sep-10       ENSG00000186522   SEPT10
Sep-11       ENSG00000138758   SEPT11
Sep-12       ENSG00000140623   SEPT12
Sep-14       ENSG00000154997   SEPT14
Mar-01       ENSG00000145416   MARCH1
Mar-02       ENSG00000099785   MARCH2
Mar-03       ENSG00000173926   MARCH3
Mar-04       ENSG00000144583   MARCH4
Mar-05       ENSG00000198060   MARCH5
Mar-06       ENSG00000145495   MARCH6
Mar-07       ENSG00000136536   MARCH7
Mar-08       ENSG00000165406   MARCH8
Mar-09       ENSG00000139266   MARCH9
Mar-10       ENSG00000173838   MARCH10
Mar-11       ENSG00000183654   MARCH11
Dec-01       ENSG00000173077   DEC1
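One way to catch these names before they ever reach a spreadsheet is to screen the list for symbols that look like a month followed by a day. The snippet below is only a rough sketch of that idea: the function name `looks_like_excel_date` is made up for this post, and the matching rule (a prefix of an English month name, or "SEPT", followed by a number from 1 to 31) is my assumption based on the table above, not Excel's actual parsing logic.

```python
import re

# English month names; a symbol is flagged when its letters are a prefix
# (3 or more characters) of one of these. This is an assumed heuristic.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Letters followed by a one- or two-digit number, e.g. SEPT1 or MARCH11.
SYMBOL_PATTERN = re.compile(r"([A-Za-z]{3,9})-?(\d{1,2})")


def looks_like_excel_date(symbol):
    """Return True if a gene symbol resembles a month plus a day number."""
    match = SYMBOL_PATTERN.fullmatch(symbol.strip())
    if match is None:
        return False
    word = match.group(1).lower()
    day = int(match.group(2))
    if not 1 <= day <= 31:
        return False
    # "sept" is added by hand because SEPT1 was converted in the table above.
    return word == "sept" or any(month.startswith(word) for month in MONTHS)


# Gene symbols taken from the table above, plus two ordinary symbols.
symbols = ["SEPT1", "MARCH11", "DEC1", "TP53", "BRCA1"]
at_risk = [s for s in symbols if looks_like_excel_date(s)]
print(at_risk)
```

Any symbol this flags is worth checking by hand after a round trip through a spreadsheet. The heuristic is deliberately generous, so it may flag names that Excel would in fact leave alone.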
Rajesh
I just found out that the 1-Mar entry can actually correspond to two genes:
MARC1(chr1) or MARCH1(chr4)
Andres Muñiz (@Andresinmp)
I guess the same would have happened with other spreadsheet programs like LibreOffice Calc and Gnumeric?
Would it be better to just paste it into a plain txt file?
What about limits in length? At work I have problems with Excel because of the low number of columns it lets you have. I only realized later that I was losing data. Gnumeric allows a lot more columns, and I trust its statistical calculations a lot more (linked to the R-project).
admin
Hi Andres!
Today I have been working on an Excel spreadsheet that had approximately 60,000 rows and, surprisingly, it did not crash. I say surprisingly because a few years ago, when I was doing my PhD, it would crash if it had more than 20,000 rows.
Thanks for your contributions/comments.
Manuel
admin
Thanks for the link. I have found 15 gene names in my list that were not reported in the publication above [http://www.biomedcentral.com/1471-2105/5/80]. Conversely, the publication reports 34 gene names that do not appear in my list.
My gene list is the most up-to-date compilation of protein coding genes that have Ensembl IDs and coordinate mappings to GRCh37. The remaining 34 that I do not have could be:
1) Old genes that have been dropped
2) Genes that are not protein coding
3) Genes that do not have coordinates in GRCh37
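For anyone who wants to repeat this kind of comparison, it boils down to two set differences. Here is a minimal sketch; the function name `compare_gene_lists` and the placeholder symbols are hypothetical, chosen only for illustration, and are not my real data or the list from the publication.

```python
def compare_gene_lists(mine, published):
    """Return (only_in_mine, only_in_published) as sorted lists."""
    mine_set = set(mine)
    published_set = set(published)
    only_in_mine = sorted(mine_set - published_set)
    only_in_published = sorted(published_set - mine_set)
    return only_in_mine, only_in_published


# Placeholder symbols standing in for the two real lists.
my_list = ["GENE_A", "GENE_B", "GENE_C"]
published_list = ["GENE_B", "GENE_C", "GENE_D"]

only_mine, only_published = compare_gene_lists(my_list, published_list)
print(only_mine)       # names in my list but not in the publication
print(only_published)  # names in the publication but not in my list
```

The names that appear only in the published list are the ones to check against the three explanations above.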
Anonymous
“Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”
http://www.biomedcentral.com/1471-2105/5/80