Anonymizing Patient Records for Next Generation Sequencing Studies

The advent of Next Generation Sequencing (NGS) is proving valuable for Genome-Wide Association Studies (GWAS), allowing the identification of specific genomic variants associated with disease. Releasing genomic data together with clinical features is of paramount value for validating biological associations. These data, however, describe real individuals, whose privacy should be protected. Protecting an individual’s privacy while making research findings available to the community for validation is not an easy task. There is a trade-off between how much personal data can be made public and how useful the released data remain for further studies.

This topic was recently addressed by Loukides et al. in a PNAS paper, where they present a method that automatically extracts linkable clinical features and then modifies them so that they can no longer be used to link a record to a small number of patients.

The clinical features are modified in a way that preserves the associations between genomic sequences and the specific sets of clinical features relevant to GWAS-related diseases. The user specifies the sets of diagnosis codes that are linkable to specific disorders, and the algorithm modifies those linkable codes so that they cannot be attributed to a small number of individuals. The modification keeps clinical associations open to validation: sets of clinical features are replaced with semantically related codes in a way that preserves the distribution of GWAS-related diseases.
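To make the idea of replacing codes with semantically related ones concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes, purely for illustration, that diagnosis codes look like ICD-9 strings such as "296.80" and that "semantically related" can be approximated by collapsing a code to its three-character category. The function name `generalize` and the sample codes are hypothetical.

```python
def generalize(records, rare_codes):
    """Replace each code listed in rare_codes by its category prefix.

    records    : list of sets of diagnosis-code strings (one set per patient)
    rare_codes : set of codes considered too identifying to release as-is
    Assumption: the text before the "." is a meaningful parent category.
    """
    result = []
    for record in records:
        result.append(frozenset(
            code.split(".")[0] if code in rare_codes else code
            for code in record
        ))
    return result


records = [
    frozenset({"296.80", "401.9"}),
    frozenset({"296.89"}),
]
released = generalize(records, rare_codes={"296.80", "296.89"})
# Both patients now share the coarser code "296"; "401.9" is left untouched.
```

After generalization, the two rare codes can no longer be told apart, so a patient carrying either one blends into a larger group, while codes outside `rare_codes` keep their full detail.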

Thus, an attacker who knows the set of clinical features diagnosed during a single visit would not be able to uniquely identify an individual, because each record links to no fewer than, say, 5 individuals (k = 5).
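The k = 5 guarantee can be checked with a few lines of code. The sketch below assumes each patient record is a set of diagnosis codes and that the user supplies the linkable code sets. The helper names `support` and `violations` and the sample codes are hypothetical and do not come from the paper.

```python
def support(records, codes):
    """Count the records that contain every code in `codes`."""
    codes = frozenset(codes)
    return sum(1 for record in records if codes <= record)


def violations(records, linkable_sets, k=5):
    """Return (code set, count) pairs matched by at least one record
    but by fewer than k records, i.e. sets that could single out patients."""
    found = []
    for codes in linkable_sets:
        n = support(records, codes)
        if 0 < n < k:
            found.append((frozenset(codes), n))
    return found


# Five patients share one diagnosis pair; a sixth has a unique code.
records = [frozenset({"250.00", "401.9"}) for _ in range(5)]
records.append(frozenset({"296.80"}))

linkable_sets = [{"250.00", "401.9"}, {"296.80"}]
risky = violations(records, linkable_sets, k=5)
# risky holds only the "296.80" set, which matches a single record.
```

A release would be considered protected at level k, under these assumptions, only when `violations` returns an empty list.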

The paper also discusses some limitations of this approach. Records may still be linked if the attacker holds additional sources of data about the individual. Letting data owners decide the appropriate level of protection (by setting k) may hamper the reproducibility of findings if the attacker’s knowledge is overestimated, and may compromise the individual’s privacy if it is underestimated. The method also does not guarantee that information loss is minimized for the chosen level of protection.
