Introduction
What, or who, gets researched is incredibly important to consider. We cannot make life better for people if we do not understand the nuanced experiences of different people. We cannot understand the world if we only focus on one aspect of it. One way research becomes biased is through the selection of data we collect and analyse. This is relevant for both primary studies (using data we collect ourselves) and secondary studies (using data collected by others that we analyse). Any research study involves a certain amount of data exclusion, such as deciding not to interview participants with certain characteristics or removing data points outside a certain range. Whilst some data curation is usually necessary, it is important to think carefully about how these choices affect how representative our results are. As argued in the article ‘Why Data is Never Raw’ by Nick Barrowman, all data reflects, to some extent, the pre-existing theories, assumptions, interests, and limitations of the researcher who collected it.


Data can directly represent human experience or aspects of bodily functions (such as diary excerpts or blood pressure measurements), but it can also indirectly reflect our motivations and interests (from spending habits to observations of the universe). Wherever data is collected from, it is important to consider who it represents and who it benefits, whether directly or indirectly.
Historically, research has overwhelmingly represented the identities and interests of a global minority: white cis-heteronormative men in the global north. Despite the research environment becoming more diverse in recent years, current practice and institutional structures largely continue this legacy. Whether it is the histories that get told, the theories that get taught, or the data that is collected, research and our understanding of the world have been disproportionately shaped by this global minority identity. Certain human characteristics, geopolitical regions, species, and ways of knowing are under-recognised or excluded in research.
One of the issues with data representation is that the characteristics of participants or samples are often not appropriately recorded. The reasons for this can be manifold, though often it is simply the result of oversight (potentially due to unconscious bias) or a belief that certain characteristics are not relevant to the research. As argued in ‘Four ways natural history museums skew reality’ by Jack Ashby, the absence of data about something does not mean that it does not exist, or that it is not important. Furthermore, the Guardian interview with Catherine D’Ignazio, ‘Data is never a raw, truthful input – and it is never neutral’, discusses the harm these gaps can produce when data is used by algorithms and AI.
Our page on Co-creation and Collaboration introduces some ideas for how to engage with communities in a way that improves representation. However, caution must be taken not to overuse certain participant groups simply because they are easy to reach, a practice known as ‘convenience sampling’. Not only does this fall back into the trap of over-representing a particular population, but it can also induce involvement fatigue within this group. Given that such groups may already be marginalised and experiencing exploitation within society, researchers must take care not to compound such problems. Instead, it is important that we challenge ourselves to find new representative groups and therefore broaden our understanding rather than narrow it.
Representative data is crucial for research that seeks to accurately represent the world and yield positive outcomes. Global power dynamics act to exclude certain groups of people, places, concepts, and phenomena from data collection, which limits the validity and usefulness of research. Researchers must critically assess their data choices, seek broader representation, and engage with communities ethically, without relying on convenience sampling or reinforcing existing inequalities.
Case Studies
Practical Steps and Tools
Think about the data you will collect or use in your own research and answer the following questions:
- What (or who) does my data represent?
- Are there any groups or identities which are not represented in my source data?
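For tabular data, the two questions above can be given a rough quantitative first pass by comparing the share of each group in your sample with its share in a reference population. The sketch below is only an illustration under stated assumptions: the function name `representation_gaps`, the `tolerance` threshold, the group labels, and all of the numbers are hypothetical, and a real study would need an appropriate reference source (for example, census figures) and categories agreed with the communities concerned.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Flag groups that are absent from, or under-represented in, a sample.

    sample_groups: iterable of group labels, one per participant or record.
    reference_shares: dict mapping group label -> expected share (0 to 1).
    tolerance: hypothetical threshold; a group is flagged when its sample
        share falls short of the reference share by more than this amount.
    Returns a dict mapping flagged group -> (sample_share, reference_share).
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    gaps = {}
    for group, ref_share in reference_shares.items():
        sample_share = counts.get(group, 0) / total if total else 0.0
        absent = counts.get(group, 0) == 0 and ref_share > 0
        if absent or (ref_share - sample_share) > tolerance:
            gaps[group] = (sample_share, ref_share)
    return gaps


# Invented example data: labels and proportions are placeholders only.
sample = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
reference = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}

gaps = representation_gaps(sample, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of sample vs {expected:.0%} of reference")
```

A check like this only shows where gaps exist in the characteristics that were recorded. It cannot reveal groups that were never asked about, and it does not explain why a gap exists.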
Understand the cause of this under-representation:
- Who does your research question serve? Are particular groups of people more (or less) likely to engage with the work?
- Is the under-representation due to a cognitive bias, resulting in particular data sources being overlooked or excluded?
- Is the under-representation due to an external factor—perhaps you use data collected by someone else, or rely on a company for samples, or particular resources are behind a paywall?
- Is the under-representation an issue of accessibility?
Think about what action you can take to make your data more representative:
- Identify established specialist groups which may be able to advise on improving the representation in the research. For example, Patient and Public Involvement (PPI) groups, Lived Experience groups and community outreach groups. These should be a starting point; do not over-rely on existing groups.
- Trial Forge have a range of frameworks to improve diversity in research participation. Frameworks cover topics like socioeconomic disadvantage, older age, and ethnicity.
Where appropriate and possible, collect comprehensive demographic information for participants, research contributors or primary source data. Explain clearly why these measures are being collected and what they will be used for. Under-reporting of certain characteristics such as ethnicity and socioeconomic status is common in research, despite the importance of these characteristics in shaping experience. See ‘Reporting of data on participant ethnicity and socioeconomic status in high-impact medical journals: a targeted literature review’ by Buttery et al. for more on this. We recommend using the EDIS Group DAISY Guidance to help select which measures to collect and how to phrase the questions.
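One simple way to see how completely demographic characteristics have been recorded is to calculate, for each field, the share of records with no usable value. The following is a minimal sketch, assuming records are held as Python dictionaries. The function name `missing_rates`, the field names, and the example records are all hypothetical and are not taken from the DAISY Guidance.

```python
def missing_rates(records, fields):
    """Return, for each field, the share of records with no usable value.

    A value counts as missing if the key is absent, None, or an empty string.
    An explicit answer such as "Prefer not to say" is treated as recorded,
    because it is a choice made by the participant rather than a gap.
    """
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


# Invented example records: field names and values are placeholders only.
participants = [
    {"age_band": "25-34", "ethnicity": "", "socioeconomic_status": None},
    {"age_band": "35-44", "ethnicity": "Group X", "socioeconomic_status": None},
    {"age_band": "", "ethnicity": "Group Y"},
    {"age_band": "45-54", "ethnicity": None, "socioeconomic_status": "Band 2"},
]

rates = missing_rates(participants, ["age_band", "ethnicity", "socioeconomic_status"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Reporting these rates alongside your results makes gaps in the recorded characteristics visible to readers rather than leaving them implicit.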
Be vocal about the issue of representation in your research. Discuss it with your colleagues and collaborators. If you rely on samples from a company, contact them and bring the issue to their attention.
References and Further Resources
Arts
- In Digital Cultural Colonialism: measuring bias in aggregated digitized content held in Google Arts & Culture by Kizhner et al., the authors study representation of institutions, countries, and artefact-origins within the Google Arts & Culture corpus.
Environmental sciences
- Uncovering Big Data Bias in Sustainability Science by Record & Vera gives an overview of how capitalism and the monetary value of species influence which species are researched. Certain species (e.g. lobster) are over-represented in research datasets.
- Geographical Bias in Physiological Data Limits Predictions of Global Climate Change Impacts by White et al. shows how spatial bias in multi-species physiological data limits the capacity to forecast responses to climate change. The causes of spatial bias, including inhospitable climates and geopolitics, are discussed.
Life sciences
- Addressing Racial and Phenotypic Bias in Human Neuroscience Methods by Webb et al. explores how cognitive biases detrimentally impact neuroscience methods and data, leading to the exclusion of people racialized as Black from neuroimaging and physiological studies.
- Queer Data by Kevin Guyan discusses the importance of collecting representative data about, and for, LGBTQIA+ identities.
General
- Invisible Women by Criado Perez describes the adverse effects on women caused by gender bias in big data collection.