
Global Language Data Review
A resource for planners

Investment in quality language data is needed to leave no one behind
The Sustainable Development Goals’ central promise of leaving no one behind depends on reaching and hearing from the billions of people who speak marginalized languages, especially since language barriers compound disadvantages linked to gender, age, ethnicity, and disability.
Yet, as 2030 approaches, planners and service providers often lack language data to fulfill that promise.

Language data for planning should be:
- representative,
- recent,
- reliable,
- readily accessible, and
- at a relevant level of detail for program design.
Good language data helps develop services that include speakers of marginalized languages. Where this data is unavailable or unused, services risk reinforcing existing exclusion – leaving disadvantaged groups behind.

Existing data is often protected by restrictive copyright or locked behind paywalls.
Languages coexist and overlap in the real world, but are often visualized as uniform blocks or specific points on a map. In other cases, quality data exists, but people planning and delivering services may not know where to find it or how reliable it is.

This Global Language Data Review sets out what language data you need for planning, where you can find it, and how you can help build the body of data to support inclusive services.
The need for language data is linked to diversity, risk, and disadvantage
Language data is most needed where people are most at risk of being marginalized because of their language. We assessed this based on:
- the overall levels of risk and disadvantage of the population, using the INFORM Severity Index and the World Bank income group classification, and
- the diversity of languages spoken, using UNESCO’s Linguistic Diversity Index.
We identified 88 countries for which language data is most critical to ensuring no one is left behind.
We identified six criteria to assess the quality and accessibility of language data for planning
Our experience curating and mapping language data shows that the value of language data for planning services depends on a range of factors. Our six criteria for assessing language data are:
1. Representativeness
From full-scale censuses to small studies, accuracy is determined by how representative the data is of a given population.
2. Timeliness
Language use changes over time, as do demographics. More recent data better reflects the population’s language profile.
3. Location data
Language distribution can change greatly between two geographically close areas. Data that is collected in small-scale geographic units, like towns or districts, is more useful than national-level data.
4. Quality of language questions
Language can be included in a dataset in many different ways, based on many different questions. Questions that generate specific data points on people’s main and other languages, their comprehension of written, oral, or signed information, and their communication preferences paint a more precise picture than a single question asking whether someone speaks any national language.
5. Dissemination and availability
Even the best data is not useful if it is not available. Data that sits behind paywalls or is otherwise inaccessible is not usable for operational purposes by most planners.
6. Format
If data cannot be accessed in a format that those who need it can use easily, resources are wasted on processing it. Under time or resource pressure, such data is likely to go unused.
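One way to picture how the six criteria could be applied to a single data source is the sketch below. It is illustrative only: the 0 to 2 rating scale, the field names, the equal weighting, and the cut-offs for each category are assumptions made for this example, not the scoring method used in this review.

```python
from dataclasses import dataclass, fields


@dataclass
class SourceRating:
    """Hypothetical 0-2 rating of one data source on each of the six criteria."""
    representativeness: int
    timeliness: int
    location_data: int
    question_quality: int
    availability: int
    data_format: int

    def total(self) -> int:
        # Equal weighting of the criteria is an assumption made for this sketch.
        return sum(getattr(self, f.name) for f in fields(self))


def overall_category(rating: SourceRating) -> str:
    """Map a rating to an illustrative category using assumed cut-offs."""
    for f in fields(rating):
        value = getattr(rating, f.name)
        if value not in (0, 1, 2):
            raise ValueError(f"{f.name} must be 0, 1 or 2, got {value}")
    total = rating.total()  # ranges from 0 to 12
    if total >= 10:
        return "good"
    if total >= 5:
        return "partial"
    return "poor"


# Invented example sources, not real assessments.
recent_open_census = SourceRating(2, 2, 2, 1, 2, 2)
old_paywalled_survey = SourceRating(1, 0, 1, 1, 0, 0)

print(overall_category(recent_open_census))
print(overall_category(old_paywalled_survey))
```

A simple sum treats all criteria as interchangeable. In practice, a source that scores zero on availability may be unusable however well it scores elsewhere, so a real scheme might treat some criteria as gating conditions rather than adding them up.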
An assessment of the state of language data shows how much is missing.
The map here shows the result of our assessment of the state of language data across priority countries. Ratings of data quality are current to March 2022.
How do we know which countries have good language data?
Language data, or the lack of it, varies from country to country across all six criteria. The sources of language data also vary. We logged and evaluated these sources to score the 88 priority countries on the operational quality of their language data.
For some, lack of data is the main issue. For others, data is available, but effectively unusable.
24 of the 88 countries assessed (26%) have good operational language data.
Countries in the highest category, such as Benin, Cambodia, and Senegal, have recent, high-quality data accessible on integrated platforms, like IPUMS. This means that data can be effectively analyzed and used in a variety of ways, including through the curation and mapping we have done for these countries on our language data portal.
24 of the 88 countries (26%) were found to have poor operational language data.
At the other end of the scale are countries where little or no usable language data is available. These include Chad and Afghanistan, where data is paywalled, outdated, inconsistent or difficult to verify and use.
Both countries are known to be highly linguistically diverse, with high levels of humanitarian need: good operational data on the languages used could help improve the reach, effectiveness, and accountability of aid.
But at present, humanitarian and development practitioners are effectively working blind or relying on poor-quality, outdated information or proxy measures. Speakers of marginalized languages are paying the price.
The dashboard below shows our full assessment of all priority countries.
Language data makes a difference
Below are examples of how languages are distributed in three countries we've worked with, and why this information is critical.
Bolivia: language data can correct assumptions and identify overlooked communities.
People often assume that Bolivia is a monolingual Spanish-speaking country. It is not.
There are 37 languages spoken in Bolivia. In some areas, as little as 4% of the population use Spanish as their main language.
Bolivia has one of the lowest incomes in the region, and is one of the largest recipients of development assistance from a range of donors.
Making that assistance effective depends on communicating with communities who don't speak Spanish. This is only possible if quality operational language data is available.
Philippines: quality language data supports disaster preparedness
When disaster strikes, people need access to information in a language they fully understand.
The Philippines is a highly linguistically diverse country, with over 120 languages spoken. That could have made it difficult for responders to communicate with communities affected by Typhoon Rai (Odette) in December 2021.
But responders had access to detailed, high-quality data about the main languages spoken in the affected areas to inform their messaging and programming. Language data preparedness can enable responders to get lifesaving information to communities quickly, in the languages they understand.
Nigeria: service providers can collect language data to fill the gaps.
Where good language data is lacking, planners can supplement national language data to improve the effectiveness of their services for marginalized people.
Humanitarian organizations supporting communities affected by conflict in northeast Nigeria long relied on communicating with them in Hausa.
But when the 2019 Multi-Sector Needs Analysis included language questions for the first time, responders finally had an evidence base for their communication and community engagement.
The data showed that just 31% of people surveyed spoke Hausa as their first language, and 41% couldn't read it well or at all. Clearly, communicating in Hausa alone is not enough.
With data highlighting language differences between locations, organizations can see which languages to use to reach communities in their area.
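As a rough illustration of how survey responses with a language question might be turned into location-level guidance, the sketch below counts main languages per location and picks the smallest set of languages that reaches a target share of respondents. The function names, the 80% target, and the sample records are assumptions made for this example. The records are invented and are not MSNA data.

```python
from collections import Counter, defaultdict


def count_main_languages(responses):
    """Count respondents per main language within each location.

    `responses` is an iterable of (location, main_language) pairs.
    """
    counts = defaultdict(Counter)
    for location, language in responses:
        counts[location][language] += 1
    return counts


def languages_to_cover(language_counts, target_percent=80):
    """Return the most common languages needed to reach the target share.

    Integer arithmetic is used so the comparison is exact.
    """
    total = sum(language_counts.values())
    chosen, covered = [], 0
    for language, n in language_counts.most_common():
        if covered * 100 >= target_percent * total:
            break
        chosen.append(language)
        covered += n
    return chosen


# Invented sample records for two hypothetical locations.
responses = (
    [("Area A", "Kanuri")] * 3
    + [("Area A", "Hausa")] * 2
    + [("Area B", "Hausa")] * 4
    + [("Area B", "Kanuri")] * 1
)

counts = count_main_languages(responses)
for location, language_counts in counts.items():
    print(location, languages_to_cover(language_counts))
```

In this invented sample, one language is enough to reach the target in one location but not in the other, which is the kind of difference that national-level figures would hide.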
Help us build the evidence base for inclusive services through operational language data.
If quality language data is available where you operate, you can use it to inform planning and design decisions, either directly or by requiring it of contractors and funding partners.
As of March 2022, CLEAR Global has collated data for 31 countries.
If quality language data isn't available, you can help collect it or call for others with more capacity to collect it. Collecting quality language data ahead of an emergency, as in the Philippines, can enable a rapid, tailored response.
Adding language questions to surveys, as in the 2019 Nigeria MSNA, can support more evidence-based programming.
Together, we can leverage language data to leave no one behind.
For language data support, contact maps@clearglobal.org