Questions and Answers

1. What is a Genome Wide Association Study (GWAS)?

We have known for some time that the majority of genetic differences between individuals come from variations in DNA sequences. These differences are measured using genetic variants (Single-Nucleotide Polymorphisms, or 'SNPs'). Humans are otherwise 99.9% identical to each other, and it is the 0.1% by which we differ that makes us all unique. To study these differences, researchers developed an approach known as the Genome Wide Association Study (GWAS, pronounced 'gee-woz'): a hypothesis-free method that searches the entire genome to discover how SNP variations are associated with a specific trait of interest. A trait is rarely associated with just one genetic variant; usually there are many. Only in 'Mendelian diseases' such as Huntington's is a single gene responsible. Most traits of interest are complex and result from multiple genetic loci (they are polygenic).

2. Why did we develop the GWAS Diversity Monitor?

We have known for some time (e.g. Need and Goldstein 2009) that GWAS and genetics research use data from highly selective groups of people who do not equally represent the population of their own country or of the world. Until now, however, there has been no systematic, accessible way for stakeholders to examine this issue. Although everyone agrees it is important, there was no way to get the most recent estimates without some degree of expertise in data analysis. For this reason, we developed the GWAS Diversity Monitor to track where current research is lacking in terms of geography, disease, and ancestry groups, and how this varies over time. The GWAS Diversity Monitor dynamically outlines which diseases and traits have (and have not) been studied across which ancestry groups over time, and provides easy links to each individual study, searchable by trait.

3. Where does the data driving the GWAS Diversity Monitor come from?

The GWAS Diversity Monitor is driven by data taken from the NHGRI-EBI GWAS Catalog, with detailed information on the eligibility criteria available elsewhere and in the NHGRI-EBI Catalog itself. For inclusion in the Catalog, studies are identified via an automated PubMed literature search and then manually checked against the inclusion criteria. Trained GWAS Catalog curators then enter the information into the Catalog using a pre-defined protocol.

4. Is there any replication material?

Yes! We are advocates for Open Science and encourage you to download all of the graphs and underlying data you need, in the hope that you will be able to integrate them easily into your own research. With respect to the data, we give full attribution to the NHGRI-EBI Catalog and point to their EMBL-EBI terms of use. With regard to the code, we have made it available under an MIT License (a permissive free software license), and it can be found on the Leverhulme Centre for Demographic Science (LCDS) GitHub page. Please let us know if and how this has been integrated into your work!

5. What does the category ‘Parent term’ mean?

Our online dashboard divides the traits studied across GWAS using the standard Experimental Factor Ontology (EFO), whose most aggregated, highest-level categories are often referred to as 'parent terms'. The EFO provides a systematic description of many experimental variables available in EBI databases and in projects such as the NHGRI-EBI GWAS Catalog. Please see the EFO page on the EMBL-EBI Ontology Lookup Service GitHub for more information.

6. How are the groups divided by 'Ancestry'?

The GWAS Diversity Monitor is divided into 6 broad ancestry groups, based on a framework developed by Morales et al. (2018) which creates a standard for ancestry classifications in GWAS research. Although we recognize that it is genetically suboptimal to place groups into such broad categories, this type of assignment reduces complexity and makes visualization and categorization possible. Furthermore, since most non-European ancestry groups remain underrepresented, disaggregating to a more detailed level is often not useful. Our hope is that as diversity increases, disaggregated groups will be represented more frequently in the future.

7. Are race and ancestry the same thing?

As described in Mills et al. (2020) and Peterson et al. (2019), one common misconception is that the term 'ancestry' used in population genetics equates to racial or ethnic differences. Genetic variation in populations (ancestry) is different from the social, cultural, and political meanings ascribed to different human groups. Race is not a biological category, since genetic variation is traced to geographical locations and does not map onto the perpetually changing, socially and politically defined racial or ethnic groups. Populations are the product of repeated mixtures over tens of thousands of years. The concentration of genetic alleles in some groups is thus related to where they have descended from. For example, the ancestral category of 'African-American' is highly diverse and can be traced in relation to migration and assortative mating patterns.

8. Why is there a difference in the doughnut charts of ‘participants by ancestry’ and ‘count of all associations discovered’? Why do ancestry groups have different allele frequencies?

Most evidence places the origins of Homo sapiens in Africa. Patterns of DNA sequence variation are greatest in sub-Saharan Africa, and populations with the greatest genetic variation are assumed to be the oldest; current knowledge places the cradle of the human species in what is now Namibia and Angola in Southern Africa. The difference in variation is attributed to the fact that when humans migrated to new regions, they took progressively smaller amounts of the gene pool's variation with them. Each new population is younger than its source and has thus had less time to accumulate new mutations, with European ancestry groups having the least variation. Sequencing of Khoi-San bushmen showed that even two people from adjacent villages were as different from one another as any two European or non-African ancestry individuals. One consequence is that an ancestry group's share of discovered associations need not match its share of participants.

9. Why can’t the results from GWAS derived from European ancestry populations simply be applied to other ancestry groups?

Polygenic scores derived from one population are not reliable when applied to others, as shown previously by Alicia Martin et al. (2017). For example, results for BMI (body mass index) derived from a GWAS of a European population explain only 3% of the variance among individuals of African ancestry, compared to 13% among those of European ancestry.

10. Why is it important to look at diversity beyond ancestry and towards historical, geographical, demographic differences?

Although ancestry has often been the focus of the diversity discussion in genetics, research has shown that genetic findings also differ considerably by historical period, birth cohort, geographical location, demographics and the socioeconomic context of individuals.

11. Why is it important to get more diverse populations in genetics and genomics research?

GWA studies that draw on data from diverse populations will provide more accurately targeted therapeutic treatments to more of the world’s population, extend insights into the genetic architecture of traits and very likely lead to new genetic discoveries. Diagnoses, treatments and interventions derived from European ancestry populations cannot be easily applied to other groups, and if they are, they could even cause harm. For instance, a recent study published in Cell used data on people from rural Uganda and identified 10 previously undetected links between diseases and genes. It found that 22 percent of people in the sample have a gene that causes the blood disorder thalassemia. This supposed ‘disorder’ has in fact become common in the African population because it is protective against severe malaria. The genetic variant is linked to glycated haemoglobin, which is currently used (based on discoveries from European populations) to diagnose diabetes; this in turn would lead to incorrect diagnoses of diabetes in this African population.

12. What's the landscape looking like going forward?

Given the full release of the UK Biobank and the increased reliance on large direct-to-consumer datasets, we predict that ancestral diversity in GWAS may decrease even further: 94.23% of the 488,377 genotyped UK Biobank participants are in the white ethnic group, and 77% of the 23andMe sample is of European ancestry. Acknowledging this can give further impetus to the large-scale data collection currently taking place in Africa, to large projects such as NIH’s All of Us study, and to new initiatives such as the UK’s plans for a new 5 million person cohort.

13. What do you mean by quasi-real time monitoring? Why do some of the numbers mentioned in academic articles differ from the online dashboard?

Quasi-real time refers to the fact that we check the Catalog at EMBL-EBI each morning and automatically update the Monitor whenever the Catalog itself has been updated (approximately every three weeks). Given this ‘live’, dynamic nature of the GWAS Diversity Monitor, the figures change on a regular basis and may not match exactly those in the published article.
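The daily check can be pictured with a short Python sketch. The function and parameter names below are hypothetical, and how the Catalog's latest release date is actually obtained is deliberately left out; this is an illustration of the idea, not the Monitor's own code.

```python
from datetime import date, timedelta


def needs_refresh(built_from: date, latest_release: date) -> bool:
    """True when the Catalog has a release newer than the one the dashboard uses."""
    return latest_release > built_from


def simulate_daily_checks(first_day, n_days, release_dates, built_from):
    """Run one check per day and return the days on which a rebuild happened.

    release_dates is a made-up list of Catalog release dates; built_from is
    the release the dashboard was last built from.
    """
    rebuilt_on = []
    for offset in range(n_days):
        today = first_day + timedelta(days=offset)
        # Only releases that already exist on this day are visible to the check.
        published = [d for d in release_dates if d <= today]
        if published and needs_refresh(built_from, max(published)):
            built_from = max(published)
            rebuilt_on.append(today)
    return rebuilt_on
```

With releases three weeks apart, most morning checks find nothing new and leave the figures unchanged, which is why the monitoring is 'quasi' rather than truly real time.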

14. Why do some of the numbers of total GWAS diversity differ across the different figures?

The ‘Total GWAS participants diversity’ figure in the upper left corner shows total, unfiltered estimates of GWAS participant diversity, after dropping all rows of the Catalog in which at least one ancestry is not reported. It is cumulative: it contains all GWAS up to a specific point in time (excluding those with 'not-recorded' ancestries) across both stages (discovery and replication). The two other figures that might be expected to show identical numbers, but do not, are the 'All parent terms' version of the doughnut chart and the 2019 time series estimate. These differ from the headline summary figures because they are time-contingent, year-on-year estimates and are subject to the STAGE global widget. They also differ from each other because a fraction of the data lacks specific EFO terms, which prevents mapping onto parent terms when the doughnut chart is constructed.
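To illustrate why the three figures can disagree, here is a minimal Python sketch using a handful of made-up rows. The field names (`year`, `stage`, `ancestry`, `n`, `parent_term`), the 'NR' marker and all numbers are hypothetical; they are not the Catalog's actual columns or the Monitor's code, and each toy row carries a single ancestry for simplicity. The sketch only mirrors the filtering logic described above.

```python
# Toy rows; every field name and value here is hypothetical.
ROWS = [
    {"year": 2018, "stage": "discovery", "ancestry": "European", "n": 1000, "parent_term": "Cancer"},
    {"year": 2019, "stage": "discovery", "ancestry": "European", "n": 800, "parent_term": "Cancer"},
    {"year": 2019, "stage": "replication", "ancestry": "African", "n": 200, "parent_term": "Cancer"},
    # This row has no EFO term, so it cannot be mapped onto a parent term.
    {"year": 2019, "stage": "discovery", "ancestry": "Asian", "n": 300, "parent_term": None},
    # This row has an unreported ancestry and is dropped from every figure.
    {"year": 2019, "stage": "discovery", "ancestry": "NR", "n": 500, "parent_term": "Cancer"},
]


def reported(rows):
    """Drop rows whose ancestry is not reported."""
    return [r for r in rows if r["ancestry"] != "NR"]


def headline_total(rows):
    """Cumulative total: all years, both stages."""
    return sum(r["n"] for r in reported(rows))


def time_series_total(rows, year, stage):
    """Single-year estimate, restricted to the stage picked in the widget."""
    return sum(r["n"] for r in reported(rows)
               if r["year"] == year and r["stage"] == stage)


def doughnut_total(rows, year, stage):
    """As the time series, but only rows that map onto a parent term."""
    return sum(r["n"] for r in reported(rows)
               if r["year"] == year and r["stage"] == stage
               and r["parent_term"] is not None)
```

On these toy rows the headline total is 2300 participants, the 2019 discovery-stage time series value is 1100, and the matching doughnut chart total is 800, because the row without an EFO term drops out only at the last step.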