What is Differential Privacy? How will it affect our work?

A synthesis on Census 2020 differential privacy and its implications.

The decennial count is constitutionally mandated, privacy is protected by law, and accuracy is enshrined in the Census Bureau's mission to be the leading provider of demographic data for the nation. Privacy and accuracy are inherently in tension, so the Bureau has to strike a balance.

People have good reason to be scared of their data being misused!

  • Will you list everyone who lives in your apartment even if there are people who are not on the lease?
  • Will you include your child who attends a school in another school district?
  • Will you include your mother who lives with you who overstayed her visa?
  • Will you include people living here who differ from what's filed with your SNAP benefits?

Hasn't the Census Bureau always protected privacy? Isn't this mandated under Title 13?

The Census Bureau has long applied privacy protections in the form of top-coding, swapping, and suppressing data. But these protections have been minimal. Given the advances in computing power and the amount of external data now available, those techniques are no longer sufficient to protect privacy: database reconstruction has become easy.

Imagine solving a very large Sudoku puzzle: with modern computing methods, reconstructing the underlying records from the published tables is entirely feasible.

The expanded race and relationship categories have given us a much better picture of the nation's demographics. However, they have also made it far easier to re-identify individual people.

I might be the only 35-year-old married white female renter on my block. While I might not care that the government knows that, the far scarier part of this is the potential for linking to other datasets:

  • credit cards
  • medical records
  • employment and earnings records
  • criminal records

Where you live matters for lots and lots of things. Unfortunately, geography is also one of the most useful pieces of information for re-identifying survey participants. -David Weir, University of Michigan

Census is modernizing and strengthening how they protect privacy in the statistics released starting with the 2020 Census.


Enter: Differential Privacy

Poll taken during the Association of Public Data Users (APDU) 2020 virtual conference.

Two major components of Differential Privacy as it relates to Census

Noise Injection

Think of an impressionist painting.

You can move the pixels around in such a way that the portrait as a whole stays the same, even though at the pixel level everything has shifted.

At its core, this is what differential privacy is: injecting statistical noise into a dataset that keeps the whole picture of the population intact.

The formal computer-science definition of differential privacy is actually the easy part. The noise is drawn from a distribution centered on zero, and the amount of noise is set by a "privacy loss budget" called epsilon.

In some ways, the decennial census is an ideal dataset to do differential privacy with because it's a very large dataset with very few questions.
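To make the noise-injection idea concrete, here is a minimal sketch (assuming NumPy) of the textbook Laplace mechanism: for a counting query with sensitivity 1, adding zero-centered Laplace noise with scale 1/epsilon satisfies epsilon-differential privacy. The function name and the example block counts are invented for illustration, and the Bureau's actual TopDown Algorithm uses discrete noise distributions inside a far more elaborate pipeline; this is only the core idea.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Return a noisy count: true_count plus Laplace noise.

    The noise is centered on zero with scale sensitivity/epsilon,
    so a smaller epsilon (privacy-loss budget) means more noise
    and stronger privacy protection.
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

# Hypothetical true counts for five census blocks
block_counts = np.array([12, 0, 87, 3, 41])
noisy_counts = [laplace_mechanism(c, epsilon=0.5) for c in block_counts]
```

Note that the noisy outputs can be negative or fractional, which is exactly what the post-processing step described next has to clean up.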

Post-Processing

The much, much harder part is the post-processing. Negative counts and fractional counts are unacceptable to data users, but we also don't want a spike at zero. The most challenging requirement, however, is that the counts must sum up consistently throughout the Census geography hierarchy.

First, a mini Census vocabulary lesson: Spine geography vs. Off-spine geography

Spine Geography levels: Nation, Regions, Divisions, States, Counties, Tracts, Block Groups, Blocks. Each lower geography is perfectly nested within the higher-level geography, a bit like Russian nesting dolls. For example, each census tract is in one and only one county, and each county is in one and only one state. The more pointy-headed term for this is that spine geographies are MECE (Mutually Exclusive, Collectively Exhaustive).

Off-Spine Geography levels: School districts cross county lines. Metro areas cross state lines. County subdivisions do not always cover the entirety of a county.

In general, the post-processing will advantage the spine geography levels while disadvantaging the off-spine levels. This presents a huge problem, as congressional districts (an off-spine geography) are the first priority of the Census. Those districts are not yet known today, as they will be drawn using the 2020 counts.
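As a toy illustration of why post-processing is hard, here is a sketch (in Python with NumPy; the function, names, and numbers are all hypothetical) that clips noisy child-geography counts to be non-negative and then forces them to sum to an already-finalized parent total using largest-remainder rounding. The Bureau's actual TopDown Algorithm solves constrained optimization problems far more sophisticated than this:

```python
import numpy as np

def postprocess(noisy_children, parent_total):
    """Toy post-processing: clip negatives, then integerize the
    child counts so they sum exactly to the parent total."""
    x = np.maximum(np.asarray(noisy_children, dtype=float), 0.0)
    if x.sum() == 0:
        x = np.ones_like(x)  # avoid divide-by-zero; spread evenly
    # Rescale to the parent total, then round via largest remainders
    scaled = x * parent_total / x.sum()
    floors = np.floor(scaled).astype(int)
    shortfall = parent_total - floors.sum()
    # Hand the leftover units to the cells with the largest remainders
    order = np.argsort(scaled - floors)[::-1]
    floors[order[:shortfall]] += 1
    return floors

noisy_tracts = [10.7, -2.3, 31.9, 4.1]  # noisy tract counts in one county
county_total = 45                        # county count fixed at a higher level
tract_counts = postprocess(noisy_tracts, county_total)
```

Notice that every adjustment made to satisfy one constraint (non-negativity, integer counts, consistent sums) shifts error into other cells; this is exactly why accuracy at off-spine and small geographies suffers.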


Will the data be fit for use?

There are thousands of use-cases of decennial census data: apportionment, redistricting, allocation of funds, tribal governance, public health, disaster preparedness/emergency response, infrastructure planning, school planning, housing policy, family and household studies, LGBTQ issues, sampling frame for other surveys, just to name a few.

Development cycle for implementing all this

Using an Agile approach, they are going table by table: producing a demonstration file from the 2010 data, evaluating it, and reassessing. The evaluation draws on ongoing data-user engagement through steering committees spanning multiple stakeholders, such as the Dept. of Justice, tribal governments, and more.

Census took the 2010 Census data and produced a version of it using the differential privacy methodology, to compare against the publicly released 2010 Census data. Different participants evaluated these two datasets based on their own use-cases. There was an October draft that was widely considered "not fit for use." The March draft was better, but still not acceptable for some use-cases.

What they've heard:

  • Old and new files nearly identical at high levels of geography: state, large metros
  • Accurate for broad age groups, and large demographic groups (NH White and Hispanic)
  • Large discrepancies at the block-group level, in some census tracts, and for smaller demographic groups such as American Indian / Alaska Native.
  • Very poor results for "off-spine" geographies such as school districts.
  • Problems with temporal consistency at small geographies (i.e., too much year-to-year variation). Consistency is important for calculating population denominators for disease and mortality rates, and it may be difficult to calculate change and trends in general.

Here are two deep-dives into specific use-cases:

Mortality rate ratios for counties (overall, NH White, NH Black, and Hispanic) were calculated from the demonstration file. The October demonstration data was "not fit for use" for counties where small NH Black and Hispanic populations serve as denominators for mortality rates. March's was much better, but the gains came with trade-offs elsewhere: for example, bias in the age structure worsened (the 65+ group showed much larger discrepancies).

Housing vacancy rates in California were completely unusable in the October run and much improved in the March run, but still showed some impossible or implausible results, such as 100% vacancy rates in small areas.

As we get closer in the process, it gets harder to define the threshold for what's acceptable. (You can tell when it's really wrong, but when is it good enough?)

Increased accuracy for one use-case comes at the expense of another; it's a bit like Whack-a-Mole. -Beth Jarosz, Population Reference Bureau

Again, there has to be some error somewhere! Otherwise they don't meet the disclosure avoidance requirements.

Some questions on people's minds:

  • Will the Census Bureau be able to defend their published counts in court for redistricting cases?
  • Will there be something like MOE that will allow analysis of best case/worst case scenarios? A school district wants to know how many kindergarten teachers to recruit next year. The census provides the number of 4-year-olds in my district, but will I have a way to know how reliable this is?
  • What are the implications for longitudinal analysis? Can we compare 2020 data to 2010 data to understand trends?
  • The data for people associated with small geographies or small numbers (specific racial groups) will be less accurate than data for larger geographies and larger groups. This has implications for health statistics, and larger social equity issues. Is it possible for differential privacy to accurately count these populations while maintaining individual anonymity?

Implications and Considerations

Differential privacy is not new, but its application at this scale is new. There are considerations about setting a precedent; considerations on other Census products that are benchmarked to decennial, or use decennial as a sampling frame; considerations about user concerns/lack of trust in the process.

What's the global privacy loss budget (epsilon)? What allocation does each use-case get?

These are like dials that are infinitely tunable. Who's tuning them? Who gets a seat at the table? Right now it's the Data Stewardship and Policy Committee at Census, made up of career Census professionals and executives who represent many different Census branches. Should BLS, CDC, and other agencies have a seat? Data users?

How much potential to identify people is acceptable?

It's largely a judgment call that Census is going to have to make, weighing the priority use-cases against the Bureau's responsibility to protect privacy under Title 13. There's no formula for this.

Data privacy laws treat data as binary: "identifiable" vs. "de-identified." In reality, it's a spectrum. The accuracy thresholds requested by data users have to be realistic in order to protect privacy. Everyone wants 100% accurate data, but Census *cannot* publish accurate data for every use-case while also ensuring privacy protection.

Any techniques for dealing with privacy are going to have an impact on the data's fitness for use. This was true for the older more minimal methods as well.

There's no established playbook for building a participatory process around a privacy algorithm. Census should be applauded for all the outreach, the gathering of user feedback, and for making this such an iterative process! One group that has not had a seat at the table are those who have the most to lose from privacy breaches:

  • victims of domestic violence
  • tenants who will suffer consequences if their census responses don't match their lease data
  • immigrant communities

The communications challenge

Guidance/education on this process, and on how to deal with "consistency" issues, error measurements, etc. will be its own project!

When the ACS first came out, explaining rolling 5-year estimates was hard. Teaching people that in order to get the small geography levels, we need to pool several years' worth of data together to get a "reliable" estimate is still not easy. The reliability is public knowledge in the form of Margins of Error (MOEs), and we all know that ACS data is used by many without any consideration for MOEs.

"We need an equivalent to the ACS Data Users Guides." was a sentiment that many people expressed during one of the APDU sessions. Census does plan to put out a Handbook similar to the ACS Handbook.


How can I get involved?

Use the Public Demonstration Files. Look over the FAQs. See what this data looks like for your use-case. Send feedback to 2020DAS@census.gov. While they've considered dozens and dozens of use-cases, you don't want to find out in the spring of next year that yours was missed!

Additional Resources

Various webinars, presentations, and session recordings are available.

The post-mortem after all of this will be very enlightening!

Note: This is quite different from Margins of Error in the ACS. Margin of Error arises whenever we use a sample to estimate values for the whole population. In the decennial Census, they have the whole population (either from self-response or from imputation), but publishing even the most basic tables down to the block level does not protect privacy. Census is actually injecting noise into the data here, something that has not been done for the ACS, which still relies on the same minimal suppression and swapping techniques used in 2010.
