Why is finding data from papers still so hard?

May 25, 2017

Last night on Twitter I saw a mention of an article in The Guardian entitled “Scientists identify 40 genes that shed new light on biology of intelligence”.

Cool, this sounds like my kind of thing. Luckily the article itself did contain a link to the original source, a Nature Genetics article entitled “Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence”. So I went to the actual article to check it out. The abstract mentions that from these 78k+ individuals they came up with 336 SNPs implicating 22 genes, of which 11 are new findings. OK, the result looks credible a priori, given the huge number of individuals included in the study. In addition, the named genes appear to be mainly expressed in the brain.

As I begin to read the introduction, it says that they combined genome-wide association study (GWAS) data from 13 cohorts. It then refers me to ‘Online Methods’, presumably where I would find out where the data behind the results is available.

The problem is that I am not subscribed to Nature Genetics and I am not able to check the online methods. Online methods are behind the paywall. Nature Genetics does kindly allow reading the paper via its application ReadCube, but access to the online methods remains elusive. Other than the reference to the online methods, there is no clear indication of where the data is. I scrolled up and down the open access part of the article, and the only clue about the data they are using appears in Supplementary Table 1, an Excel spreadsheet where cohorts are indicated as codes. I find no links to public repositories from which the data for these cohort codes could be retrieved.

Next, I decide to search Google. I take the name of the first cohort in the spreadsheet, ‘CHIC:ALSPAC’, and perform a search. Google points back to the article. However, the exact text that Google matches is still behind the paywall. I cannot access it.

I then cut out the bit of text that Google provides in its search results. Although it is now clearer that this cohort name is relevant for finding the original datasets, I am unable to see whether there is any indication of a repository related to this dataset.

Next, I search Repositive. ‘CHIC:ALSPAC’ retrieves nothing. Searching for CHIC retrieves 11 results, all related to ‘Capture Hi-C experiments’, which does not seem relevant.

Just getting to this point has taken me almost an hour. I have another meeting and I have to stop searching. I do not question that the authors have deposited somewhere the datasets from which the study is derived. By reading the supplementary text and figures I find potential data sources, e.g., ‘Manchester and Newcastle Longitudinal Studies of Cognitive Ageing Cohorts’, ‘Twins Early Development Study’, etc. Despite this, there are no clear links showing how to even get access to the data.

In my opinion, this is *not* how data should be made available for any given published research. I would argue that articles need to clearly show links to where the data is deposited, even if access has to be requested and granted.

An example of good data accessibility? Here is one:

Screen Shot 2017-05-24 at 15.16.36

This is a study describing induced pluripotent stem (iPS) cells. It comes from an initiative called HipSci. I search Google using ‘HipSci’ as the query. The top link in the Google results is the project’s website, which links directly to a page describing the data.

Not only is it easy to find out about the data; there are also instructions for how to find, access and download it.

I then decide to search Repositive for ‘HipSci’ and this is what I get:

Screen Shot 2017-05-24 at 15.21.42

I am not sure whether this search result matches all of the actual data provided by the study, but in any case, it looks very comprehensive to me. I am impressed and pleased.

This is a good old #DATAEUREKA moment! Yay!

Did this interest you? Follow me on Twitter or my Personal Genomics Zone blog to stay connected.