Summary Statistics

Here we present a range of summary statistics on the data that powers the dashboard (and some that does not), refreshed whenever the Catalog is updated. Note that these summary statistics and the figures themselves only use raw (wrangled) data from the Catalog. Currently:

  • There are a total of 6680 studies in the Catalog.
  • The earliest study in the Catalog is PubMedID 15761122 published on 2005-03-10 by Klein RJ et al.
  • The most recent study in the Catalog is PubMedID 38036781 published on 2023-11-30 by He Y et al.
  • The accession with the biggest sample is presently PubMedID 37794016 (N=4690421) by Budu-Aggrey A et al.
  • There are presently a total of 89981 unique study accessions.
  • There are presently a total of 67171 unique diseases and traits studied.
  • There are presently a total of 16832 unique EBI "Mapped Traits".
  • The total number of associations found is presently 566798.
  • The average number of associations found per study accession is presently 6.3.
  • The mean P-Value for the strongest SNP risk allele is presently: 6.301e-07.
  • The number of associations reaching the 5e-8 significance threshold is presently: 554686.
  • The journal to feature the most GWAS studies is presently: Nat Genet.
  • The total number of different journals publishing GWAS is presently: 837.
  • The most frequently studied (Non-European) disease or trait is presently: Type 2 diabetes.
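
As a quick sanity check on how the figures in this list relate to one another, the sketch below recomputes two derived quantities from the totals quoted above. It assumes (this is not stated explicitly on the page) that the reported average is the total number of associations divided by the number of unique study accessions; the variable names are illustrative only.

```python
# Totals copied from the summary list above.
total_associations = 566_798
unique_accessions = 89_981
reaching_threshold = 554_686  # associations reaching the 5e-8 threshold

# Assumption: the reported average is associations per unique study accession.
mean_per_accession = total_associations / unique_accessions

# Share of all recorded associations that reach genome-wide significance.
share_reaching_threshold = reaching_threshold / total_associations

print(f"Mean associations per accession: {mean_per_accession:.1f}")  # 6.3
print(f"Share reaching 5e-8: {share_reaching_threshold:.1%}")        # 97.9%
```

Under that assumption the recomputed average matches the 6.3 reported in the list.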

Funding, Contact and Acknowledgements

This work is currently maintained and hosted by the Leverhulme Centre for Demographic Science, where an earlier prototype was generously supported by the European Research Council (grants 615603 and 835079) and The British Academy. This dashboard has been redesigned in consultation with Global Initiative, with special thanks there to Quentin Brunier, Alex Malowany, Jamie May, Lea Misseri, Gareth Nixon, Veatriki Ntova and Chris Sinclair. In addition, we are grateful for comments on the source code and dashboard more generally from Ian Knowles, Yi Liu, Molly Przeworski, Ben Domingue, Sam Trejo and Oxford's SOCIOGENOME group. We look forward to updating the GWAS Diversity Monitor with your suggestions via email (contact@gwasdiversitymonitor.com), Twitter (@OxfordDemSci or @melindacmills) or GitHub.

About

This interactive dashboard monitors the diversity of participants across all published Genome Wide Association Studies (GWAS), the primary technique used for genetic discovery. The objective of a GWAS is to identify statistical associations between a set of genetic variants (Single Nucleotide Polymorphisms, or 'SNPs') across different individuals and specific traits of interest. This monitor is an extension of our earlier project, 'The Scientometrics of Genome Wide Association Studies', published in Communications Biology in January 2019. As there, we leverage the magnificent dataset curated by the NHGRI-EBI Catalog (subject to their licensing information, which can be found here).

The dashboard itself is a combination of Python (Flask) and JavaScript (D3), designed to be used in modern web browsers for presentation. An earlier prototype appears here, with the full code base available for replication on GitHub. We are actively encouraging community-based suggestions and contributions. The backend code checks daily for updates to the NHGRI-EBI Catalog, writing to logs and refreshing the dataset which powers the dashboard as appropriate.

The dashboard is under perpetual development and review, but is currently composed of two global widgets: METRIC (which toggles whether we are evaluating by number of studies or by number of participants), and STAGE (which determines whether we are considering the discovery or replication phase of research). Local widgets also toggle the 'EFO Parent Term', the 'Broader' ancestry category, and 'Year' (related to year of study). Due to the relatively small size of the dataset, not all widgets apply to each figure, and we describe our design choices below:

  • Summary Breakdown: Total GWAS participants diversity. Shown in the upper left panel, this displays summary statistics without any filtering (over time, ancestry, traits, or otherwise).
  • Bubble Plot: Ancestry over time by parent term. This graphic in the upper middle panel provides a granular overview of all GWAS in the Catalog, mapping onto EFO Parent terms. In addition to being affected by the STAGE global widget, it also allows a finer search based on individual EFO Traits (and combinations thereof). Clicking on an individual bubble provides detailed study information (including unique identifiers) and links out to the relevant PubMed page.
  • Time Series Plot: Participants across all parent terms. This figure in the upper right panel displays how 'Broader Ancestry' varies over time across the two global widgets. Note that we do not divide by the 'Parent Term' widget here because the least studied ancestry categories are largely absent at this level of disaggregation. The tickbox for 'Include not recorded' provides a robustness check on how we map our 'Broader Ancestry' field.
  • Heatmap: Parent term by 'broader' ancestry. This figure in the bottom left panel maps studies across both STAGE and METRIC, broken down by each individual 'Broader Ancestry' and 'Parent Term' across each of the years in the dataset. Hovering over the figure reveals numbers by each category.
  • Choropleth Map. Due to the sparsity of countries recruited from across all EFO Parent terms, this figure draws from the two global widgets, adjustable by Year only. Polygon data are available from here. Hovering over the figure provides the name of the country and number of participants.
  • Doughnut Chart. This figure displays the percent of 'Broader Ancestry' by 'Parent Term' across both the global widgets. A 'Show'/'Hide' toggle reveals an inset graph of the breakdown of associations discovered across all EFO Parent terms at the discovery stage alone.
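
To make the role of the METRIC widget concrete, here is a minimal sketch of how ancestry shares within one parent term might be computed either by number of studies or by number of participants. It is not the dashboard's actual code: the records, the category labels, and the function name `ancestry_shares` are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical records for illustration only; not real Catalog data.
# Each record is (parent_term, broader_ancestry, n_participants).
RECORDS = [
    ("Cancer", "European", 1000),
    ("Cancer", "Asian", 250),
    ("Cancer", "European", 750),
    ("Metabolic disorder", "African", 500),
    ("Metabolic disorder", "European", 1500),
]


def ancestry_shares(records, parent_term, metric="participants"):
    """Return the percent share of each ancestry within one parent term.

    metric="studies" counts every record once; metric="participants"
    weights every record by its sample size.
    """
    if metric not in ("studies", "participants"):
        raise ValueError(f"unknown metric: {metric!r}")
    totals = defaultdict(float)
    for term, ancestry, n_participants in records:
        if term != parent_term:
            continue
        totals[ancestry] += 1 if metric == "studies" else n_participants
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no records for this parent term
    return {a: 100 * v / grand_total for a, v in totals.items()}


by_studies = ancestry_shares(RECORDS, "Cancer", metric="studies")
by_participants = ancestry_shares(RECORDS, "Cancer", metric="participants")
print(by_studies)
print(by_participants)
```

With these made-up records, the two metrics give different pictures of the same parent term: the European share is about two thirds when counting studies but 87.5% when counting participants, which is the kind of contrast the METRIC toggle is meant to expose.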