Don’t leave the door to your data clean room open!

Blog post
Damien Desfontaines

Using data clean room technology to join data between two parties does not necessarily mean that the data is fully protected — the output of the process can sometimes leak individual information. In this blog post, we explain how and when that can happen, and what you can do to prevent it.

InsuranceCo, a health insurance company, is trying to evaluate whether they should recommend a new cancer treatment to their customers. Their goal is to run a trial in partnership with another company, HealthCo, and get information about the efficacy of this treatment. One complication is that they cannot directly share data with each other: InsuranceCo does not want to know who among their customers is participating in the trial, or what their diagnosis is. HealthCo, meanwhile, does not want to collect or access any demographic information about the insured people.

They need to find a way to combine their data and derive insights not available to either party by itself. This collaboration problem occurs in a variety of other scenarios:

  • Two different government agencies want to combine their data to publish statistics empowering academic research and data-driven policy-making.
  • An online publisher is looking to evaluate the performance of a marketing campaign, and to do so, they need to join their data with the data collected by an advertising platform.
  • A transit agency wants to evaluate the economic impact of new public transportation infrastructure by combining train ride data with point-of-sale provider data.

In these domains, neither party can share their data with the other party, be it because of commitments they made in their privacy policy, data protection regulation, or competitiveness concerns.

Diagram: a typical data clean room workflow. Two input arrows, one labeled “InsuranceCo” (“Demographic data about InsuranceCo customers”) and one labeled “HealthCo” (“Diagnosis data of people who participated in the clinical trial”), lead to two tables that are joined into a third table. An output arrow from the joined table reads “Insights about treatment efficacy by demographics”. A red box above the join reads “Risk: computation reveals sources”.

How to resolve this tension?

Data clean rooms to the rescue!

A possible approach to unlock such collaborations is to use a data clean room. A data clean room allows two or more parties, each in possession of a sensitive dataset, to combine their data and derive new insights not available to either party by itself. In our original example, InsuranceCo and HealthCo would both contribute their data to the clean room, then jointly compute the insights of interest. These insights could then be shared with either or both parties, or other data consumers.

How do data clean rooms mitigate the risk that uploading data for joint computation exposes each party’s sensitive data to the other? A variety of techniques can be used:

  • A third party, trusted by both participants, can run the clean room and use traditional access control mechanisms to prevent data exposure.
  • Trusted hardware modules can keep data encrypted while in use; combined with remote attestation techniques, they allow each party to verify that only audited code runs on the data.
  • Technology like secure multiparty computation can allow both participants to encrypt their data prior to upload, using cryptographic methods to ensure that only the output of the computation can then be decrypted.

Regardless of the method for safeguarding the sources during the computation, the results of the computation end up shared with others. This is, of course, the value of bringing these data sources together in the first place. But could this last step reveal more information about the original data than intended?

Diagram: the same workflow as above, but the join between the tables is now surrounded by a box labeled “Data clean room”. The red box from the previous diagram is covered by a green box that reads “Mitigated”, and a new red box under the output arrow reads “Risk 2: outputs reveal sources”.

As we will see, this can happen surprisingly easily.

Output controls: the weak point of most data clean room architectures

In most data clean room use cases, there needs to be some level of control over what data can leave the clean room. Otherwise, one of the parties could simply join both datasets and return the results without any privacy protection! To avoid this output risk, most data clean rooms allow their users to enforce restrictions on the queries that can be performed. Such restrictions come in multiple forms:

  • Data clean rooms may remove sensitive fields in the source data, or prohibit them from appearing in queries.
  • Data clean rooms may include query logging features, so that a record of queries executed over the sources can be inspected by the parties.
  • Data clean rooms may implement an approval flow, authorizing queries to run only when all parties have had a chance to review and approve each query.
  • Data clean rooms may restrict allowed queries to aggregations only. Often, an additional safeguard commonly known as thresholding will automatically suppress small values in the aggregates returned by the queries (see the sketch below).
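
To make thresholding concrete, here is a minimal sketch, in Python with pandas, of what such an output control could look like. The table, the column names, and the threshold of 5 are all illustrative assumptions, not the behavior of any particular clean room product.

```python
import pandas as pd

# Hypothetical aggregated output of the joined query, one row per
# (gender, diagnosis) slice. All numbers are made up for illustration.
counts = pd.DataFrame({
    "gender":    ["female", "female", "male", "male", "nonbinary", "nonbinary"],
    "diagnosis": ["negative", "positive", "negative", "positive", "negative", "positive"],
    "count":     [104, 82, 97, 88, 0, 1],
})

THRESHOLD = 5  # assumed suppression threshold

# Thresholding as an output control: any cell smaller than the threshold
# is blanked out before the result leaves the clean room.
released = counts.copy()
released["count"] = released["count"].where(released["count"] >= THRESHOLD)
print(released)  # the two nonbinary cells come back empty (NaN)
```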

Unfortunately, these restrictions can easily be defeated, even unintentionally! Let’s go back to our example to see how that can happen. InsuranceCo and HealthCo want to get aggregated insights about the efficacy of the treatment, sliced by demographic information.

They use a data clean room that implements all the restrictions above to compute the statistics of interest, and obtain the table below as output.

Table: counts of negative and positive diagnoses broken down by gender, with totals; the low-count cells for nonbinary participants are redacted.

Here, we can observe the effect of thresholding: some cells have been redacted because they covered too few individuals. However, you might have noticed that it is easy to recover the redacted information! Summing the gender breakdowns and comparing them to the totals indicates that the actual count of negative diagnoses for nonbinary people is 0, while the count of positive diagnoses is 1. If InsuranceCo knows that there is a single nonbinary person in the source dataset, they have just learned that person’s positive diagnosis — exactly what we were trying to avoid!
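
To spell out the arithmetic behind this differencing attack, here is a short continuation of the illustrative sketch above. It assumes that per-diagnosis totals were released alongside the breakdown, as in the example; all numbers remain made up.

```python
# Totals per diagnosis included in the release (illustrative values,
# consistent with the sketch above: 104 + 97 + 0 and 82 + 88 + 1).
released_totals = {"negative": 201, "positive": 171}

# The non-redacted cells, as read directly from the released table.
visible = {
    "negative": 104 + 97,  # female + male
    "positive": 82 + 88,   # female + male
}

# Differencing attack: each suppressed nonbinary cell is simply the gap
# between the released total and the visible cells.
for diagnosis, total in released_totals.items():
    recovered = total - visible[diagnosis]
    print(f"Recovered nonbinary count for {diagnosis} diagnoses: {recovered}")
# Prints 0 for negative and 1 for positive: exactly the values that
# thresholding was supposed to hide.
```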

This toy example shows that even with simple aggregate releases, ad hoc protections like thresholding may not be enough to effectively mitigate risks of singling out individuals and learning sensitive data about specific people. This may create legal risk if the compliance of the data sharing project relies on the output being fully anonymized.

Such inadvertent data leakage is particularly likely with multiple data releases about the same dataset. When sharing or publishing many statistics, the consequences can be even more dramatic: a malicious party may be able to reconstruct the entire input dataset! The most prominent example of such a reconstruction attack is the one carried out by the U.S. Census Bureau on its own published statistics. As reconstruction attacks improve over time, it becomes extremely difficult to estimate how much risk a statistical data release carries.
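
To give a flavor of how a reconstruction attack works, the toy sketch below (which is in no way the Census Bureau’s actual methodology) brute-forces every possible dataset of three records that is consistent with a handful of published aggregates, and finds that only one candidate remains: the published numbers alone pin down every individual record.

```python
from itertools import product

# Published aggregates about a tiny dataset of 3 people, each described
# by (age_group, diagnosis). All values are made up for illustration.
published = {"over_50": 2, "positive": 2, "over_50_and_positive": 1}

ages = ["under 50", "50+"]
diagnoses = ["negative", "positive"]

# Enumerate every possible dataset of 3 records and keep those that
# reproduce all of the published statistics.
candidates = []
for records in product(product(ages, diagnoses), repeat=3):
    stats = {
        "over_50": sum(a == "50+" for a, _ in records),
        "positive": sum(d == "positive" for _, d in records),
        "over_50_and_positive": sum(a == "50+" and d == "positive" for a, d in records),
    }
    if stats == published:
        candidates.append(records)

# Collapse orderings: two datasets with the same records are the same.
distinct = {tuple(sorted(c)) for c in candidates}
print(len(distinct), distinct)
# A single consistent dataset remains: three aggregates were enough to
# reconstruct every record exactly. Real attacks scale this idea up with
# optimization solvers instead of brute force.
```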

How to solve this problem, and make sure that the output data cannot be used to learn sensitive information about individuals in the input datasets?

Differential privacy: a reliable output control for data clean rooms

If you read our previous blog post about anonymization techniques, the answer will not surprise you: to obtain robust output controls in data clean rooms, there is no better method than differential privacy. This is the only approach that offers a rigorous guarantee that individuals cannot be singled out based on the results of queries executed in the clean room.

Diagram: the same workflow as above, but the box around the join is now labeled “Data clean room with differential privacy”, and the second red box is also covered by a green box labeled “Mitigated”.

Differential privacy provides strong guarantees by adding carefully calibrated noise to aggregate query answers. This prevents inadvertent leakage and provably defeats reconstruction attacks, while retaining the value of the output insights. The privacy protection of the sources can be precisely quantified, explained to stakeholders, and shown to meet the most stringent regulatory requirements. As an added bonus, suppressing small counts is no longer necessary — in many cases, using differential privacy can make it possible to publish data at a finer granularity than with thresholding, providing more business value.
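
To illustrate the core mechanism, here is a minimal sketch of the Laplace mechanism applied to counting queries. A full differential privacy platform does much more (bounding each person’s contribution, tracking a privacy budget across queries, handling joins and group-by keys safely), so treat this purely as a toy illustration with made-up counts and an arbitrarily chosen privacy parameter.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with the Laplace mechanism.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so noise drawn from Laplace(1 / epsilon) makes the
    released value epsilon-differentially private.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative counts from the earlier example (made-up numbers).
true_counts = {
    ("female", "negative"): 104, ("female", "positive"): 82,
    ("male", "negative"): 97,    ("male", "positive"): 88,
    ("nonbinary", "negative"): 0, ("nonbinary", "positive"): 1,
}

epsilon = 1.0  # assumed privacy parameter for this release
noisy = {cell: round(laplace_count(count, epsilon)) for cell, count in true_counts.items()}
print(noisy)  # every cell is released, but each one is "blurry" by design
```

Because the noise scale depends only on the privacy parameter and the query’s sensitivity, not on the data itself, the guarantee holds regardless of what other information an adversary might have.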

In our earlier example, answering the query with differential privacy could return results like these.

Table: the same breakdown of diagnosis counts by gender, with a small amount of noise added to each count and no redacted cells.

As you can see, the numbers aren’t exactly the same as when using thresholding: a small amount of statistical noise was added to the results. This prevents an adversary from recovering individual information while preserving the overall trends and the ability for analysts to derive useful insights. If you’d like to learn more about how to unlock new data sharing use cases using the combined guarantees of a data clean room provider and Tumult Labs’ best-in-class differential privacy platform, don’t hesitate to reach out! We would love to hear more about your use case and help you make the most out of your data.

