Over the coming weeks I will be posting the lectures I am currently delivering for Biobanking for Data Science, a module I lead on the MSc AI Digital Health course at the University of Westminster, a course I also lead. Each post will include a summary and the recording of the lecture.
The research presented here is available via this preprint, which is currently in revision.
Biobanks have become one of the most powerful infrastructures in modern biomedical research. By systematically collecting, storing, and cataloguing biological materials alongside rich clinical and lifestyle data, they enable discoveries that shape the way we understand disease, predict risk, and develop personalized treatments.
Over the past two decades, biobanking has evolved from static repositories into dynamic digital ecosystems. What began in the early 2000s with standardization efforts has now entered an era of high-throughput data generation, ethical regulation, and, most recently, artificial intelligence (AI) integration. Today, biobanks are not just about sample storage; they are about data-driven health intelligence.
Why Biobanking Matters
Biobanks underpin critical advances in both research and healthcare:
- Personalized medicine: Genetic and molecular data from biobanks enable treatments tailored to an individual’s profile.
- Diagnostics: Biomarkers identified through biobank analyses improve diagnostic accuracy.
- Preventive medicine: Predictive models trained on biobank data help forecast disease risk before symptoms arise.
From the UK Biobank’s half a million participants to the U.S. All of Us initiative and the Estonian Biobank, these large-scale resources are shaping how science approaches both common and rare diseases.
The Digital Turn: AI and Biobanking
The 2020s have brought a decisive shift: the convergence of biobanking with AI and digital health. Machine learning tools now allow us to sift through immense, heterogeneous datasets (genomes, imaging, electronic health records) to generate insights that were previously inaccessible.
This is where Large Language Models (LLMs) enter the stage. Trained on massive text corpora, LLMs like GPT and Claude are proving their value in mining biobank-related literature, summarising biomedical findings, and even benchmarking patterns across thousands of studies.
Benchmarking LLMs on Biobank Research
In recent work, I explored how LLMs perform when applied to UK Biobank research outputs. The findings are illuminating:
- Top topics: Genome-wide association studies (GWAS), cardiovascular disease, type 2 diabetes, and Mendelian randomisation dominate the field.
- Data diversity: While UK Biobank is a rich resource, its representativeness of the UK population remains debated.
- Model performance: Some LLMs (e.g., Gemini 2.0 Flash) captured the breadth and depth of biomedical concepts more effectively than others, showing promise for automated synthesis of biobank knowledge.
Yet challenges remain: coverage does not guarantee accuracy, and LLMs still struggle with clinical reasoning, multimodal integration, and hypothesis generation.
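To make the idea of "coverage" concrete, here is a minimal Python sketch of one way it could be scored: checking which reference topics are mentioned in a model's output. This is an illustration only, not the method used in the preprint. The function names (topic_coverage, coverage_score), the keyword variants, and the example model output are all hypothetical; only the four topic names come from the list above.

```python
import re


def topic_coverage(text: str, reference_topics: dict[str, list[str]]) -> dict[str, bool]:
    """For each reference topic, report whether any keyword variant appears in the text."""
    lowered = text.lower()
    return {
        topic: any(
            re.search(r"\b" + re.escape(variant.lower()) + r"\b", lowered) is not None
            for variant in variants
        )
        for topic, variants in reference_topics.items()
    }


def coverage_score(found: dict[str, bool]) -> float:
    """Fraction of reference topics that were mentioned at least once."""
    return sum(found.values()) / len(found) if found else 0.0


# Topic names taken from the post; the keyword variants are assumptions.
REFERENCE_TOPICS = {
    "GWAS": ["genome-wide association", "GWAS"],
    "cardiovascular disease": ["cardiovascular"],
    "type 2 diabetes": ["type 2 diabetes"],
    "Mendelian randomisation": ["mendelian randomisation", "mendelian randomization"],
}

# Hypothetical model output, invented purely for illustration.
example_output = (
    "UK Biobank studies frequently use GWAS and Mendelian randomization "
    "to study cardiovascular outcomes."
)

found = topic_coverage(example_output, REFERENCE_TOPICS)
score = coverage_score(found)
print(found)
print(f"coverage: {score:.2f}")
```

A score like this only says that a topic was mentioned, not that what was said about it is correct, which is exactly why coverage does not guarantee accuracy.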
Challenges and Opportunities Ahead
The promise of AI-enabled biobanking is vast, but so are the challenges:
- Bias and representativeness: Ensuring that biobanks reflect diverse populations remains a pressing equity issue.
- Data access and interoperability: Formats, standards, and siloed governance often impede global collaboration.
- Trust and ethics: Protecting sensitive data while enabling meaningful use is a balancing act that requires robust regulation and community trust.
If addressed properly, the integration of biobanking with LLM-driven analytics could revolutionise our ability to link genetics, lifestyle, environment, and health outcomes.
Looking Forward
The future of biobanking lies in its digital transformation. AI and LLMs are not replacing the scientific process but augmenting it, helping us navigate complexity, accelerate discovery, and bring equitable precision medicine closer to reality.
The question now is not whether we will use these tools, but how responsibly and effectively we can embed them into global health research infrastructures.

















































