Data Classification in Mapping

Do you really need anything other than natural breaks?

Michael Camponovo

April 15, 2021

As a GIS person I am fully aware of how we often spend so much time finding and massaging data and running analysis that the visualization of our data is rushed as we get closer and closer to our deadlines. As a result, we often accept default map software settings without much thought. I'm going to review some basic concepts related to data classification in this story map. Since we use data classification most often with choropleths, I'll also include a few resources at the bottom that I've found especially useful.

So what is data classification and why do we need to know about it? Let's start with a hypothetical. As a GIS Analyst, you've been asked to map data related to drunk driving fatalities in the United States to share with stakeholders at an upcoming meeting. In our hypothetical data set we have the number of alcohol-related fatalities for each state and the population for each state. We decide our best option is to create a choropleth representing the rate of fatalities per 100,000 people. Our data set has 51 records, one for each state plus Washington DC. We could simply ask our GIS software to assign each of the 51 records a unique shade of red and be done. However, most people can't really distinguish between more than about seven shades of a single color. The chance that a map reader will be able to make sense of the map, beyond a basic understanding that some states are darker than others, is pretty slim. Enter data classification.
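The rate in this hypothetical is simple arithmetic, and a minimal Python sketch makes it concrete. The function name and the two sample records are made up for illustration; they are not values from the actual data set.

```python
def rate_per_100k(fatalities, population):
    """Return the number of fatalities per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return fatalities * 100_000 / population


# Hypothetical records (NOT real data): name -> (fatalities, population)
records = {
    "State A": (250, 5_000_000),
    "State B": (90, 1_200_000),
}

# Normalize each raw count by population so large and small states are comparable.
rates = {name: rate_per_100k(f, p) for name, (f, p) in records.items()}
print(rates)
```

Note that State B has far fewer fatalities but a higher rate, which is exactly why we map the rate instead of the raw count.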

With data classification we try to group similar data values together into clusters and maximize the difference between adjacent clusters of values. By grouping similar values we can use fewer shades of the same color while increasing the legibility of our map. On the surface this sounds simple enough: just look for clusters and be done. Unfortunately, like everything in GIS, it isn't quite that simple. The next four sections each explore a specific data classification technique using our drunk driving fatality data.

Let's go over some quick vocabulary here at the beginning:

Values/Data Values = This is the data we are trying to map. In a spreadsheet this is synonymous with a column of data.
Classes/Data Ranges/Class Breaks = Classes (or data ranges) are the groups we sort our data values into. Class breaks are the values that serve as the cutoffs between classes. These are the values you see in the map legend.
Number Line = An imaginary horizontal line that starts at our minimum data value and ends at our maximum data value.
Histogram = A chart to help us understand the distribution of the data values. A histogram looks like a bar chart with the data values arranged in order from smallest (minimum) to largest (maximum). Each bar represents a certain range of data values. In the example below, each bar represents the values that fall between two numbers. The height of the bar represents the number of data values in that range. In the example below, there are no data values between 0 and 1 or between 1 and 2 but there is one data value between 2 and 3. There is a histogram in each section of this story map below the map for reference. The bars are color coded to match the map. The vertical blue lines represent the class breaks (rounded to the nearest whole number).

Natural Breaks Classification Method

One of the easiest and most logical techniques for classifying data is the natural breaks technique. Imagine you sort your data values from largest to smallest in a spreadsheet program like Excel. As you scroll through the data you make note of where there are large gaps between adjacent data values. You can simply use those large gaps in the data as the breaks for your different classes (the actual process is more complex than that).
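To make the "look for the large gaps" idea concrete, here is a minimal Python sketch of that simplified description. It is not the algorithm ArcGIS Online actually runs for natural breaks, which is more complex. The function name, the choice to put each break at the midpoint of a gap, and the sample values are all my own assumptions for illustration.

```python
def gap_breaks(values, n_classes):
    """Place class breaks in the n_classes - 1 largest gaps between sorted values.

    Simplified illustration only; real natural breaks methods are more involved.
    """
    ordered = sorted(values)
    if n_classes < 2 or n_classes > len(ordered):
        raise ValueError("n_classes must be between 2 and the number of values")
    # Pair each gap size with the index of the value on its lower side.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    largest = sorted(gaps, reverse=True)[: n_classes - 1]
    # Assumption: put each break at the midpoint of its gap.
    return sorted((ordered[i] + ordered[i + 1]) / 2 for _, i in largest)


# Hypothetical values with two obvious gaps (not the fatality data).
sample = [3.0, 3.5, 4.0, 6.0, 6.5, 7.0, 10.0, 10.5]
print(gap_breaks(sample, 3))
```

With three classes the sketch finds the gap between 4.0 and 6.0 and the gap between 7.0 and 10.0, which is what your eye would pick out scrolling through the spreadsheet.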

Speaking of different classes, how many classes should we use? Well...that depends. A good starting place is to use between 3 and 7 classes. More than that and the map viewer will have a hard time distinguishing between the classes. Are there reasons you might need more than 7 classes? Sure there are, just use them with care.

Let's explore the map below using the natural breaks data classification technique built into ArcGIS Online. Immediately we can see where some states have a dark shade of red while other states have lighter shades. And with only five classes, we can fairly easily associate each shade with a specific value in the legend.

If the natural breaks method is so easy, why wouldn't we just use this for all of our maps? Because the class breaks are unique to each data set, it makes it difficult to compare data sets over time or across different geographic units. For instance, the data below is from 2004. If we want to compare data from any other year, it is highly unlikely that the class breaks will remain the same.

You'll find a histogram for this data set directly below the map. For simplicity, we rounded the class breaks to the nearest whole number. You'll find the class breaks represented as vertical blue lines. We've also built some interactive histograms for each map and placed them at the bottom of this story map. You can use those to quickly compare the class breaks for each data classification technique as well as how many values fall within each data class.

Drunk Driving Fatalities Using the Natural Breaks Classification Method

Natural Breaks Histogram for the Data Values - Class breaks have been rounded

Equal Interval Classification Method

Another option for data classification is the equal interval classification method. In this technique we take all of our data values and line them up on an imaginary number line from the smallest value to the largest value. We take the difference between the max and min values (known as the range), and divide that by the number of classes we want.

For our data set the smallest value is 2.92 and the largest value is 11.67. If we subtract those numbers we are left with 8.75. We are using five classes for all of our examples so we divide 8.75 by 5 resulting in class sizes of 1.75. Now we start at the lowest value, 2.92, and just add 1.75 to it for a class break of 4.67. Next we add 1.75 to 4.67 and our next class break is 6.42. Repeat this and our next class breaks are 8.17, 9.92, and 11.67.

How to find class breaks using equal intervals
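The arithmetic above can be sketched in a few lines of Python. The function name is hypothetical, and I round to two decimals only to match the precision of the numbers in the text.

```python
def equal_interval_breaks(minimum, maximum, n_classes):
    """Return the upper bound of each class for an equal interval classification."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    # Class width = range of the data divided by the number of classes.
    width = (maximum - minimum) / n_classes
    # Step up from the minimum one class width at a time.
    return [round(minimum + width * k, 2) for k in range(1, n_classes + 1)]


# Minimum and maximum rates from the text, with five classes.
print(equal_interval_breaks(2.92, 11.67, 5))
```

The last value returned is the maximum itself, which is the upper bound of the top class.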

Like the natural breaks method above, the number of data values in each class can vary quite a bit. It is also possible to have classes without any values in them. Typically it is best to avoid the equal intervals technique if your data set is heavily skewed or has outliers.

Drunk Driving Fatalities Using the Equal Interval Classification Method

Equal Intervals Histogram for the Data Values - Class breaks have been rounded

Quantiles Classification Method

With the quantiles method we sort our data from largest to smallest and count how many data values are in our data set. For the map below we have 51 records, one for each state plus Washington DC. Then we divide that total number of values by the number of classes. In our case we divide 51 by 5...which gives us 10.2, and we can't put a fraction of a record in a class. Usually we make most of the classes the same size and simply adjust a single class up or down a smidgen to account for the remainder. In the example below, we have 4 classes with 10 records each and one with 11.
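Here is a minimal Python sketch of that counting step. The function names are hypothetical, and giving the extra record to the last class is my assumption; GIS software may place the remainder differently and may also handle tied values in its own way. The sketch sorts from smallest to largest, which produces the same groups in the opposite order.

```python
def quantile_class_sizes(n_records, n_classes):
    """Split n_records into n_classes groups that are as equal in size as possible."""
    if n_classes < 1 or n_classes > n_records:
        raise ValueError("n_classes must be between 1 and n_records")
    base, remainder = divmod(n_records, n_classes)
    # Assumption: the last `remainder` classes each take one extra record.
    return [base + (1 if i >= n_classes - remainder else 0) for i in range(n_classes)]


def quantile_classes(values, n_classes):
    """Group sorted values into classes using the sizes computed above."""
    ordered = sorted(values)
    groups, start = [], 0
    for size in quantile_class_sizes(len(ordered), n_classes):
        groups.append(ordered[start:start + size])
        start += size
    return groups


# 51 records in 5 classes, as in the text.
print(quantile_class_sizes(51, 5))
```

The class breaks you see in the legend are then just the largest (or smallest) value in each group.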

A few limitations to be aware of when considering quantiles as a classification method:

The data will typically appear highly variable, even if the data actually isn't
Unlike the equal intervals approach, the class breaks can seem random

Drunk Driving Fatalities Using the Quantiles Classification Method

Quantiles Histogram for the Data Values - Class breaks have been rounded

Use the interactive maps below to compare the Equal Intervals and Quantiles data classification techniques. You can also turn on the map legends for both maps at the same time for easy comparison.

Interactive maps for both quantiles (left) and equal intervals (right)

Standard Deviation Classification Method

What if we don't care as much about the values as we do about identifying the outliers in our data? In that case we could break the data out using the standard deviation technique. Find the mean, then use your GIS software or some other statistics or spreadsheet software to find the standard deviation. Finally, add and subtract the standard deviation from the mean, one step at a time, until you reach the maximum and minimum values.
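As a rough sketch of those steps in Python, using only the standard library: the function name is hypothetical, the sample values are made up, and I use the population standard deviation (statistics.pstdev) as an assumption. Your GIS software may use the sample standard deviation, or steps of half a standard deviation, so its breaks can differ.

```python
import math
import statistics


def std_dev_breaks(values):
    """Return breaks at mean + k * sd for whole numbers k, covering min through max."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; swap in statistics.stdev for a sample.
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, so there is nothing to classify")
    # Number of whole steps needed below and above the mean to reach min and max.
    steps_below = math.ceil((mean - min(values)) / sd)
    steps_above = math.ceil((max(values) - mean) / sd)
    return [mean + k * sd for k in range(-steps_below, steps_above + 1)]


# Hypothetical values (not the fatality data): mean is 5, standard deviation is 2.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(std_dev_breaks(sample))
```

The outermost breaks can fall beyond the actual minimum and maximum, because each step is a full standard deviation.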

Before choosing this classification technique, consider the underlying shape of the data. Is it normally distributed, skewed, or bimodal? Does it make sense to use this technique? Also, be careful with your map legend. Those values are likely the number of standard deviations away from the mean, not the actual data values.

In terms of choosing colors for a choropleth map using standard deviation, the typical approach is to use a diverging color scheme to emphasize those areas that are further away from the mean. I provide an example below with a traditional diverging color scheme as well as a high-to-low sequential color scheme.

A comparison of diverging color scheme (left) and sequential color scheme (right) for standard deviation classification

Manual Classification Methods

What if none of the above options really work for your data? You could just put the class breaks wherever you want them. For instance, what if you're mapping county-level median income? Do you need to make sure one of your class breaks is the federal poverty level? Feel free to explore these different techniques and find what works best for your data and your purpose.
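If you do set breaks by hand, assigning each value to a class is a simple lookup. Below is a minimal Python sketch. The break values and legend labels are placeholders I made up for illustration (they are not the federal poverty level or any real threshold), and the rule that a value equal to a break falls in the lower class is also my assumption.

```python
from bisect import bisect_left

# Hypothetical manual breaks for median income, in dollars (placeholders only).
MANUAL_BREAKS = [30_000, 45_000, 60_000]
LABELS = ["30,000 or less", "30,001 to 45,000", "45,001 to 60,000", "More than 60,000"]


def legend_label(value, breaks=MANUAL_BREAKS, labels=LABELS):
    """Return the legend label for a value, given sorted upper-bound breaks."""
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly one more label than breaks")
    # bisect_left puts a value that equals a break into the lower class.
    return labels[bisect_left(breaks, value)]


for income in (30_000, 30_001, 75_000):
    print(income, "->", legend_label(income))
```

Swapping in a meaningful threshold is then just a matter of editing the list of breaks.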

Histograms

The histograms below indicate how many data values fall in each class for each classification technique. For instance, compare how consistent the number of data values in each class is for the quantiles histogram versus the equal interval and standard deviation histograms.
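If you want to reproduce those per-class counts outside of a dashboard, here is a minimal Python sketch. The function name and the sample values are hypothetical. The breaks are the rounded equal interval breaks worked out earlier, with the maximum left off because it is the top of the last class.

```python
from bisect import bisect_left
from collections import Counter


def class_counts(values, interior_breaks):
    """Count how many values fall in each class defined by sorted interior breaks."""
    # A value equal to a break is counted in the lower class (an assumption).
    counts = Counter(bisect_left(interior_breaks, v) for v in values)
    return [counts.get(i, 0) for i in range(len(interior_breaks) + 1)]


# Equal interval breaks from the text, minus the maximum value.
breaks = [4.67, 6.42, 8.17, 9.92]
# Hypothetical rates (not the real data set).
sample = [3.0, 3.5, 4.0, 5.0, 9.0, 11.0]
print(class_counts(sample, breaks))
```

With these made-up values the middle class ends up empty, which is the kind of thing the equal interval histogram makes easy to spot.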

The histograms below are fully interactive if you want to explore them more fully.

Click inside the dashboard to interact with it
Hover your mouse over a column to see that data value
Click the small circle with the four outward pointing arrows to enlarge any single chart. Click the same circle to reduce the size of the chart
Click the cursor arrow with lock icon to deactivate the dashboard
Clicking the box with the out-pointing arrow will launch the dashboard in a new window or tab

Interactive Histograms of 4 Different Classification Methods

Supplemental Resources

Thanks for taking a few minutes to explore how data classification can impact how your audience reads and interprets your data. There are many more detailed resources online if you want to learn more or see more examples.

This project was originally inspired by John Nelson's Mapping the Truth poster
The team at Axis Maps has an excellent cartography guide with a section dedicated to data classification
The team at Penn State has a section on data classification as part of their online program
The open source textbook Essentials of GIS has a section on data classification
GIS Geography and Esri both have resources too
You should read Cynthia Brewer's article Basic Mapping Principles for Visualizing Cancer Data Using GIS

And since we've been talking a lot about choropleth maps, here are a few suggestions to make better choropleths

Start by reviewing the Axis Maps guide to choropleths
Use an equal area projection
Only map normalized data, not raw numbers
Decide whether your data values go from low to high or if they diverge, then choose an appropriate color scheme from ColorBrewer
For sequential data, pick a single hue (ROY G BIV) and vary the shade by making it darker or lighter.
For diverging data, you usually pick two different hues and then adjust the shade of each hue. Don't sweat the details, just pick an appropriate option from ColorBrewer.
Don't use the rainbow color scheme. Read why here, here, and here for examples.