Final demonstration census data
Spatial analysis
The privacy-loss budget (PLB) for the final demonstration data analyzed here is 19.61, up from 12.2 for the previous April demonstration data. A larger budget trades stronger privacy protection for improved accuracy.
Total population by census tract
Compare differential privacy estimates to published census values
This map shows 2010 census tract population counts. Click a tract to see the published value (SF1), the differential privacy estimated value (DP), and the population density.
Map legend
What are we looking for here? In the previous version of the demonstration data (April, PLB=12.2), the overestimates and underestimates (DP-SF1) ranged from -211 to +143. The final demonstration data (PLB=19.61) reflects a clear increase in accuracy: the over- and underestimates for total population have a narrower range, -37 to +31.
Since mapping population counts by census tract can present a misleading view of where people live, here's a map showing people per square mile. Click the map to see tract data.
Map legend
What are we looking for here? Clicking the map to examine the data shows how much distortion differential privacy adds to the published population counts in particular tracts.
Clicking a tract in Ada County, Idaho, for example (tract 2.02), shows there were 5,435 people in 2010 (SF1 value), and that differential privacy added 5 people (the DP value is 5,440). The population density for that tract is 414.6 people per square mile.
Map pop-up
Evaluate global relationships
With this final demonstration data, the scatterplot of total population (SF1) against injected differential privacy noise (DP-SF1) shows R2=0.0. In other words, there is no linear relationship between these two variables. This is ideal.
Notice the two cases highlighted on the chart. A tract with 59 people gets 31 additional people with differential privacy. Another tract with 3,191 people loses 37 people.
What are we looking for here? Ideally, the noise added by differential privacy will be random. The number of people in a tract should not influence whether DP noise is positive or negative. The R2 of 0 indicates there is no relationship between population size and DP offset (this is excellent).
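To make the global check concrete, here is a minimal sketch of how the R2 between tract population and the DP offset could be computed. The data below is synthetic and the column values are illustrative, not the actual demonstration files; when the noise really is independent of population size, the R2 lands near zero.

```python
import numpy as np

# Illustrative sketch with synthetic data (not the demonstration files):
# test whether tract population (SF1) predicts the injected noise (DP - SF1).
rng = np.random.default_rng(42)
sf1 = rng.integers(0, 8000, size=1000)    # synthetic published tract populations
noise = rng.integers(-40, 41, size=1000)  # synthetic DP - SF1 offsets
dp = sf1 + noise

offset = dp - sf1
r = np.corrcoef(sf1, offset)[0, 1]  # Pearson correlation
r_squared = r ** 2                  # near 0 when noise is independent of size
print(round(r_squared, 4))
```

Because the synthetic noise is drawn independently of population, the printed R2 is close to zero, mirroring the ideal result described above.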
In contrast, the previous April demonstration data exhibited a negative relationship with an R2 of 0.14, indicating that as a tract's population increased, the DP noise tended to decrease. Consequently, the most populated tracts tended to report fewer people after DP was applied.
The scatterplot for the April demonstration data in which R2=0.14.
The scatterplot for the final demonstration data, shown on the right, is a big improvement over the April data.
Evaluate spatial clustering of over and underestimates
This hot spot map may look empty, but it actually tells us a lot (all good). Except for a very small area south of Detroit where overestimates (DP-SF1) cluster spatially (significant at the 90 percent confidence level), there is no statistically significant clustering of over- and underestimates anywhere else. This is ideal.
There is only one small area where overestimates cluster spatially for the final demonstration data.
Map legend
Hot spot analysis tool parameters
What are we looking for here? Ideally, the spatial placement of positive or negative noise will exhibit a random spatial pattern in which overestimates balance out underestimates. We do see that here, and this reflects a big improvement over the April demonstration data.
April hot spot analysis result. Regions with clustering of negative noise (around Miami, for example) could have translated to underfunding or less representation for programs based on population counts.
Evaluate absolute differences spatially
This map shows mean absolute differences (|DP-SF1|) between the estimated and published total population values summarized for each tract and its nearest 50 neighbors. The largest mean differences are all fewer than 4 people.
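The neighborhood-smoothing step described above can be sketched as follows. This is an illustrative stand-in for the ArcGIS workflow, with synthetic tract centroids and counts; for each tract it averages |DP-SF1| over the tract and its k nearest neighbors.

```python
import numpy as np

# Illustrative sketch (not the ArcGIS tool used in the article): average
# |DP - SF1| over each tract and its k nearest neighbors, using synthetic
# tract centroids and offsets.
rng = np.random.default_rng(0)
n, k = 200, 50
xy = rng.uniform(0, 100, size=(n, 2))             # synthetic tract centroids
abs_diff = np.abs(rng.integers(-10, 11, size=n))  # synthetic |DP - SF1| values

# pairwise distances (brute force is fine at this scale)
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
# each tract plus its k nearest neighbors (self sorts first at distance 0)
nearest = np.argsort(d, axis=1)[:, : k + 1]
mean_abs_diff = abs_diff[nearest].mean(axis=1)
print(mean_abs_diff.shape)  # one smoothed value per tract
```

Averaging over a neighborhood like this suppresses tract-level spikes, which is why the map highlights only areas where distortion is consistently elevated.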
Map legend
What are we looking for here? Ideally, differential privacy will impact locations similarly. That's the case here. Large mean absolute differences (as were seen with the previous April demonstration data) would mean some areas have higher data distortion than others, which wouldn't be right.
Examine the spatial distribution of percent absolute error
Large mean absolute differences for tracts with thousands of people will be less impactful than for tracts with only a handful of people. To see the tracts that are impacted most, those with few people and large absolute errors, compute the percent absolute error:
(|DP-SF1| / SF1) * 100
To avoid division by zero, the percent absolute error is set to zero for tracts where both SF1 and DP are zero (202 tracts fall into this category). For tracts where SF1 is zero but DP is larger than zero (12 tracts), the following formula was used to compute percent error:
(|DP-SF1| / 0.49) * 100
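The two formulas and the zero-handling rules above can be combined into one small function. This is a sketch of the stated rules, not official Census code; the function name is my own.

```python
def percent_absolute_error(sf1: int, dp: int) -> float:
    """Percent absolute error with the zero-population handling
    described above (a sketch of the stated rules, not official code)."""
    if sf1 == 0 and dp == 0:
        return 0.0                          # 202 tracts fall here
    if sf1 == 0:
        return abs(dp - sf1) / 0.49 * 100   # 12 tracts: divide by 0.49
    return abs(dp - sf1) / sf1 * 100

# Ada County example from above: SF1=5,435 and DP=5,440
print(percent_absolute_error(5435, 5440))  # about 0.09 percent
```

Dividing by 0.49 rather than zero keeps the zero-population tracts on the map while still flagging them as large-error cases.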
The map shows the tracts with the largest percent absolute errors.
Map legend
Some findings: There are 27 tracts where the percent absolute error is more than 100 percent (this is down from 185 tracts for the previous demonstration data). The largest percent absolute error is 2,244.9 percent (down from 10,000 percent). All of the tracts with percent absolute errors larger than 100 percent are overestimates.
What are we looking for here? Errors larger than 100 percent would seem to border on data fabrication, and could result in misappropriated funds or unjustified representation for programs that base allocation on population totals. Here, however, only 27 tracts have percent absolute errors above 100 percent and none of them involve adding more than 12 people, so these results are probably reasonable.
Data relationships
For programs that base representation or funding on population counts, DP has the potential to create winners and losers. Regions that lose population could possibly lose some funding or representation. Ideally, the positive and negative noise will cancel rather than concentrate spatially. The hot spot analysis map above indicates that the over- and underestimates do not exhibit statistically significant clustering (this is excellent). Charting the global relationship between population size and the injected DP noise (shown above) indicates no relationship (again, this is optimal). In the next analysis, we'll see whether there are statistically significant local relationships, specifically focusing on increases or decreases in population for communities of color.
Examine local relationships
While the global relationship between the DP injected noise (DP-SF1) and BIPOC (Black, Indigenous, and other people of color) for the 2010 data is 0 (which is perfect), here we'll also make sure there aren't any statistically significant local relationships.
The chart indicates the overall relationship between BIPOC and the DP offset is zero.
Again, while the map looks empty, it is telling us something important: there are no statistically significant local relationships between the differential privacy offsets (DP-SF1) and BIPOC. This is great.
Map legend
ArcGIS Pro tool parameters to create the map results shown here
What are we looking for here? We are seeing the best possible result here: no local relationships between communities of color and the DP offsets (DP-SF1). This is a big improvement over the April demonstration data in which a large number of tracts were associated with negative local relationships.
April Local Bivariate Relationship map showing many regions with statistically significant local negative relationships. In these areas, tracts with larger numbers of BIPOC would be most likely to artificially lose people after DP was applied.
If differential privacy were to artificially decrease population counts primarily in communities of color, it could inadvertently increase racial inequities.
As a final analysis, compare the metrics for all tracts to tracts with predominantly American Indian and Alaska Native, BIPOC, and non-Hispanic White populations.
What are we looking for here? This analysis determines whether the metrics are similar across tracts with different demographic characteristics. It doesn't appear there are any serious differences.
Some concluding thoughts
The analyses presented here confirm that the final demonstration data is an improvement over the April demonstration data, finding no issues with spatial clustering or unexpected relationships. A thorough analysis would repeat this evaluation for other variables and other geographies.
An important note: These analyses focused on identifying bias and did not assess how well the disclosure avoidance system (DAS) protects privacy. Researchers at Harvard University, however, found they could re-identify race attributes equally well using both the published and the DAS-protected data. This is certainly a concern.