Questions and Answers
1. What is a Genome Wide Association Study (GWAS)?
We have known for some time that the majority of genetic differences between individuals come from
variations in DNA sequences. These differences are measured by genetic variants (Single-Nucleotide
Polymorphisms, or 'SNPs'). Humans are otherwise 99.9% identical to each other and it is this 0.1% by which
we differ that makes us all unique. To study these differences, pioneering researchers developed a
methodological approach known as the Genome Wide Association Study (GWAS, pronounced 'gee-woz'), which is a
hypothesis-free method that searches the entire genome to discover how SNP variations are associated with a
specific trait of interest. It is rarely the case that just one genetic variant is associated with a trait;
usually there are many. Only in 'Mendelian diseases' such as Huntington's is a single gene responsible. The
majority of traits in which we are interested are complex and are the result of multiple genetic loci
(polygenic).
2. Why did we develop the GWAS Diversity Monitor?
We have known for some time (e.g. Need and Goldstein 2009) that GWAS and genetics research uses data from
highly selective groups of people that do not equally represent the population of their own country or of
the world. However, until now, there has been no systematic and accessible way for stakeholders to analyze
this question. Although everyone agrees it is important, there was no way to get the most recent
estimates without some degree of expertise in data analysis. For this reason, we developed the GWAS
Diversity Monitor to track where current research is lacking in terms of geography, disease, and ancestry
groups, and how this varies over time. The GWAS Diversity Monitor dynamically outlines which diseases and
traits have (and have not) been studied across which ancestry groups over time, and provides easy links to
each individual study, searchable by trait.
3. Where does the data driving the GWAS Diversity Monitor come from?
The GWAS Diversity Monitor is driven by data taken from the NHGRI-EBI GWAS Catalog, with detailed information
on the eligibility criteria available elsewhere and in the NHGRI-EBI Catalog itself. For inclusion in the
Catalog, studies are identified via an automated PubMed literature search and then manually checked against
the inclusion criteria. Trained GWAS Catalog curators then enter the information into the Catalog using a
pre-defined protocol.
4. Is there any replication material?
Yes! We are advocates for Open Science and encourage you to download all of the graphs and underlying data
you need, with the hope that you will be able to easily integrate it into your own research. With respect to
the data, we give full attribution to the NHGRI-EBI Catalog and point to their EMBL-EBI terms of use. With
regard to the code, we have made this available under an MIT License (a permissive free software license)
and it can be found on the Leverhulme Centre for Demographic Science (LCDS) GitHub page. Please let us know
if and how this has been integrated into your work!
5. What does the category ‘Parent term’ mean?
Our online dashboard divides the traits that are studied across GWAS by the standard Experimental Factor
Ontology (EFO), with the most aggregated or highest-level categories often referred to as 'parent terms'.
This provides a systematic description of many experimental variables available in EBI databases, and for
projects such as the NHGRI-EBI GWAS Catalog. Please see the EFO page on the EMBL-EBI Ontology Lookup Service
GitHub for more information.
6. How are the groups divided by 'Ancestry'?
The GWAS Diversity Monitor is divided into 6 broad ancestry groups, based on a framework developed by Morales
et al. (2018) which creates a standard for ancestry classifications in GWAS research. Although we
recognize that it is suboptimal genetically to categorize groups into such broad categories, this type of
assignment reduces complexity and enables visualizations and categorisation. Furthermore, since most
non-European ancestry groups remain underrepresented, disaggregating to a more detailed level is often not
useful. Our hope is that as diversity increases, disaggregated groups will be more frequently represented in
the future.
7. Are race and ancestry the same thing?
As described in Mills et al. (2020) and Peterson et al. (2019), one common misconception is that the term
'ancestry' used in population genetics equates
to racial or ethnic differences. Genetic variation in populations (ancestry) is different from the social,
cultural, and political meanings ascribed to different human groups. Race is not a biological category since
genetic variation is traced to geographical locations and does not map onto the perpetually changing and
socially and politically defined racial or ethnic groups. Populations are the product of repeated mixtures
over tens of thousands of years. The concentration of particular alleles in some groups is thus related to
where they have descended from. For example, the ancestral category of 'African-American' is highly diverse
and can be traced in relation to migration and assortative mating patterns.
8. Why is there a difference in the doughnut charts of ‘participants by ancestry’ and ‘count of all
associations discovered’? Why do ancestry groups have different allele frequencies?
Most evidence places the origins of Homo sapiens in Africa. Sub-Saharan Africa is where patterns of DNA
sequence variation are the greatest, and populations with the greatest genetic variation are assumed to be
the oldest. Current knowledge places the cradle of the human species in what is now modern Namibia and
Angola in Southern Africa. The greater variation found within Africa is attributed to the fact that when
humans migrated to new regions, they took progressively smaller amounts of the gene pool's genetic variation
with them. Each new population is younger than its original source and has thus had less time to accumulate
new mutations, with European ancestry groups having the least variation. Sequencing of Khoi-San bushmen
showed that even two people
from adjacent villages were as different from one another as any two European or non-African ancestry
individuals.
9. Why can’t the results from GWAS derived from European ancestry populations simply be applied to
other ancestry groups?
Polygenic scores derived from one population are not reliable when applied to others, as shown previously by
Alicia Martin et al. (2017). For example, the results for BMI (body mass index) derived from a GWAS of a
European population explain only 3% of the variance among individuals of African ancestry, compared to 13%
among those with European ancestry.
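To make 'variance explained' concrete: one common reading is the squared correlation (R-squared) between a
polygenic score and the measured trait. The sketch below assumes that reading. Published analyses typically
report an incremental R-squared after adjusting for covariates such as age, sex and principal components,
which this sketch omits. The function name and the toy values are illustrative only and are not taken from
any study.

```python
def variance_explained(scores, phenotypes):
    """Squared Pearson correlation between polygenic scores and a trait.

    Hypothetical helper: a simplified stand-in for the incremental
    R-squared that published analyses usually report.
    """
    n = len(scores)
    if n != len(phenotypes) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")

    mean_s = sum(scores) / n
    mean_p = sum(phenotypes) / n

    # Sums of cross-products and squared deviations from the means.
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(scores, phenotypes))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_p = sum((p - mean_p) ** 2 for p in phenotypes)
    if var_s == 0 or var_p == 0:
        raise ValueError("scores and phenotypes must both vary")

    return cov ** 2 / (var_s * var_p)


# Toy values only: a score that tracks the trait perfectly, and one that does not.
r2_perfect = variance_explained([1, 2, 3, 4], [2, 4, 6, 8])
r2_none = variance_explained([1, 2, 3, 4], [1, -1, -1, 1])
```

Computing this quantity separately within each ancestry group is what allows a comparison such as the 3%
versus 13% figures quoted above.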
10. Why is it important to look at diversity beyond ancestry and towards historical, geographical,
demographic differences?
Although ancestry has often been the focus of the diversity discussion in genetics, research has shown that
genetic findings also differ substantially by historical period, birth cohort, geographical location,
demographics and the socioeconomic context of individuals.
11. Why is it important to get more diverse populations in genetics and genomics research?
GWA studies that draw from data from diverse populations will provide more accurately targeted therapeutic
treatments to more of the world’s population, extend insights into the genetic architecture of traits and
very likely make new genetic discoveries. Diagnoses, treatments and interventions derived from European
ancestry populations cannot be easily applied to other groups and if they are, they could even cause damage.
For instance, a recent study published in Cell contained data on people from rural
Uganda and identified 10 new links between diseases and genes which had been previously undetected. They
found that 22 percent of people in the sample have a gene that causes the blood disorder thalassemia. This
supposed ‘disorder’ has in fact become common in the African population since it is protective against
severe malaria. The genetic variant is linked to glycated haemoglobin, which is currently used (based on
discoveries from European populations) to diagnose diabetes. Relying on this measure could therefore lead to
incorrect diagnoses of diabetes in this African population.
12. What's the landscape looking like going forward?
Given the full release of the UK Biobank and increased reliance on large direct-to-consumer data, we predict
that diversity in GWAS ancestry may decrease even further given that 94.23% of the 488,377 genotyped UK
Biobank participants are in the white ethnic group and 23andMe has a sample with 77% European ancestry. This
acknowledgement can give further impetus to large scale data collection that is currently taking place in
Africa, large projects such as NIH’s All of Us study or new initiatives such as the UK’s plans for a new
5 million person cohort.
13. What do you mean by quasi-real time monitoring? Why do some of the numbers mentioned in academic
articles differ from the online dashboard?
Note that given the ‘live’ and dynamic nature of the GWAS Diversity Monitor, the figures update on a
regular basis and may not match exactly those in the published article. Quasi real-time refers to the
fact that we check the EMBL-EBI each morning and automatically update the monitor when the Catalog itself is
updated (approximately every three weeks).
14. Why do some of the numbers of total GWAS diversity differ across the different figures?
The ‘Total GWAS participants diversity’ in the upper left corner represents total, unfiltered estimates of
GWAS participant diversity (dropping all rows of the Catalog in which at least one ancestry is not
reported). It is cumulative and contains all GWAS up until a specific point in time (excluding those which
have unreported ancestries) and across both stages (discovery and replication).
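The row-dropping rule described above can be sketched as follows. This is a minimal illustration and not the
Monitor's actual code: the row layout (a mapping from ancestry label to participant count), the "NR" marker
and the function name are assumptions made for the example, and the counts are invented.

```python
from collections import Counter

NOT_REPORTED = "NR"  # hypothetical marker for an ancestry that was not reported


def ancestry_shares(rows):
    """Percentage of participants per ancestry, ignoring incomplete rows.

    Each row is assumed to map ancestry labels to participant counts.
    A row is dropped entirely if any of its ancestries is not reported.
    """
    totals = Counter()
    for row in rows:
        if NOT_REPORTED in row:
            continue  # drop the whole row, not only the unreported part
        totals.update(row)  # adds this row's counts to the running totals

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {ancestry: 100 * n / grand_total for ancestry, n in totals.items()}


# Invented counts, for illustration only.
rows = [
    {"European": 800},
    {"European": 100, "African": 100},
    {"Asian": 500, "NR": 50},  # dropped: one ancestry is not reported
]
shares = ancestry_shares(rows)
```

Note that under this rule the reported participants in a partly unreported row are excluded as well, which is
one reason headline totals can differ from sums built with other filters.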
The other two figures which we might expect to have identical numbers but do not are the 'All parent terms'
version of the doughnut chart, and the 2019 time series estimate. These are different to the headline
summary figures because they are time contingent, year-on-year estimates, and are subject to the STAGE
global widget. Furthermore, these two figures actually differ from each other because of the lack of some
specific EFO terms for a fraction of the data which prevents mapping onto parent terms in the construction
of the doughnut chart.