The fundamental trilemma of synthetic data generation

In this blog post, we outline the three key desiderata of synthetic data solutions — flexibility, accuracy, and privacy — and explain the fundamental trade-off between them.

Damien Desfontaines

You have a dataset full of sensitive personal data, and you want to share it with a third party. Maybe they’re trying to perform scientific research, unlock a business opportunity, or support evidence-based policy efforts. Whatever the reason, the request is reasonable, but because of privacy or compliance considerations, you can’t share the data directly. What are your options?

Synthetic data generation seems like an attractive solution. The idea is simple: instead of sharing the real data directly, you share fake data that “looks like” your real data but only contains made-up records. A common approach is to train a model to learn the distribution of the data, then use it to generate fake records according to this distribution. If the real distribution is captured well, the synthetic data can be used to learn insights about the original data. And since there is no obvious one-to-one correspondence between original and synthetic records, this approach seems to offer privacy guarantees by design.
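
To make the idea concrete, here is a deliberately minimal sketch in Python. It fits the simplest possible model, a multivariate Gaussian, to a toy two-column dataset and samples fake records from it. The columns and parameters are invented for illustration; real generators use much richer models, and this naive version provides no privacy guarantee on its own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy "real" dataset: 1,000 records with two numeric attributes
# (say, an age-like and an income-like column). In practice this
# would be the real, sensitive data.
real = np.column_stack([
    rng.normal(45, 12, size=1_000),
    rng.lognormal(10.5, 0.6, size=1_000),
])

# "Train a model": here, the simplest possible one. We estimate the
# mean and covariance of the data and treat it as multivariate Gaussian.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# "Generate fake records according to this distribution": sample new
# points from the fitted model. No synthetic record corresponds to any
# single real record, yet aggregate statistics are roughly preserved.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```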

Like most technologies, synthetic data is not a silver bullet. It comes with inherent trade-offs, chief among them what we call the synthetic data trilemma: three key desiderata that cannot all be perfectly achieved at the same time. These desiderata are privacy, flexibility, and accuracy.

[Diagram: three points connected by double arrows, labeled “Privacy: how well-protected do you need the original data to be?”, “Flexibility: how many distinct use cases do you need to address with the synthetic data?”, and “Accuracy: for each use case, how well do you need the synthetic data to perform, compared to the original data?”.]
  • Privacy, in the context of synthetic data, is about how well the data is protected. Can someone run an attack on the synthetic data you shared and find out information about individual people in the original dataset? Good privacy means that such an attack is infeasible.
  • Flexibility is about how many questions you want to answer — how many problems you want to solve — using the synthetic data. Good flexibility means that the synthetic data can address a wide variety of possible use cases.
  • Accuracy is about how well each individual use case can be solved using the synthetic data. Good accuracy for a given use case means that the synthetic data will perform almost as well as the original data in addressing it.

Why can’t you have all three?

Achieving a perfect score on all three desiderata would contradict a decades-old theorem of data privacy: the fundamental law of information recovery. This result states that if you can obtain accurate enough answers to sufficiently many queries, you can reconstruct most of the records in the original dataset. So with sufficiently high levels of flexibility and accuracy, you automatically lose any meaningful privacy guarantee.

This result applies to synthetic data as well: synthetic data that is flexible and accurate enough implicitly provides accurate answers to many queries about the original data. To learn more, you can consult this policy brief explaining this phenomenon, or this technical blog post outlining one kind of reconstruction attack in more detail.
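
To build intuition, here is a toy version of one such reconstruction attack, in the spirit of the result by Dinur and Nissim. All the numbers, the query interface, and the noise level are illustrative assumptions: each person contributes one secret bit, an interface answers many random subset-count queries with a little noise (the “flexible and accurate” setting), and simple least squares recovers almost every bit.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# One secret bit per person (say, "has condition X").
n = 100
secret = rng.integers(0, 2, size=n)

# The "flexible and accurate" interface: answers to many random
# subset-count queries, each perturbed by only a small amount of noise.
m = 400                                    # many queries = flexibility
queries = rng.integers(0, 2, size=(m, n))  # random subsets of people
noise = rng.uniform(-2, 2, size=m)         # small noise = accuracy
answers = queries @ secret + noise

# The attack: find the bits most consistent with the noisy answers.
# Least squares followed by rounding suffices when the noise is small
# relative to the number of queries.
estimate, *_ = np.linalg.lstsq(queries.astype(float), answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)

print("fraction of secret bits recovered:", (reconstructed == secret).mean())
# Typically prints a value very close to 1.0.
```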

What does this mean in practice?

Like in many privacy-related scenarios, there is an inherent trade-off between the three desiderata. If someone tries to convince you that synthetic data is a silver bullet… they’re probably not being quite honest about what this technology can do. But synthetic data does have a lot of promise — it simply requires finding a balance in the midst of its inherent trade-offs, and relying on robust technology.

Different use cases will end up in different places in the trade-off space. Here are a few examples.

  • Good privacy and accuracy, limited flexibility. Sometimes, you know in advance which questions you want to answer with the synthetic data. For example, you might want to use it to build a dashboard that reports statistical information to third parties. In this case, you can choose a strategy that maximizes the accuracy of the dashboard statistics while achieving high privacy guarantees; a minimal sketch of this approach follows this list.
  • Good privacy and flexibility, limited accuracy. In other cases, preserving the statistical properties of the data is not critical, but preserving schema and domain information is. For example, you might want to generate synthetic data to build and test infrastructure without giving engineers access to production data. This, too, can be achieved with strong privacy guarantees; a second sketch below illustrates the idea.
  • Everything in between. The trade-off is not binary: sometimes, you only have a general idea of which questions you want to answer, and accuracy matters but isn’t critical. For example, you can give analysts easy access to synthetic data so they can quickly run queries and test ideas. If they need to run a more precise analysis, they can later request access to the real data, perhaps through a lengthier process that involves additional safeguards.
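
For the first scenario above, here is a minimal sketch of how fixed dashboard statistics can be released with strong privacy guarantees, using the Laplace mechanism from differential privacy. The dp_count helper and the epsilon value are illustrative assumptions; a real deployment would rely on a vetted library rather than hand-rolled noise.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def dp_count(values, predicate, epsilon):
    """Release a count using the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one person
    changes it by at most 1, so Laplace(1/epsilon) noise is enough for
    epsilon-differential privacy.
    """
    true_count = int(predicate(values).sum())
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Toy sensitive attribute, and the statistics the dashboard needs,
# known in advance.
ages = rng.integers(18, 90, size=10_000)
epsilon = 0.5  # privacy budget per statistic; lower = more private

print("people under 30:", round(dp_count(ages, lambda a: a < 30, epsilon)))
print("people 65+:     ", round(dp_count(ages, lambda a: a >= 65, epsilon)))
```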

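For the second scenario, the generator does not need to learn anything from the real data at all: the schema is enough. The sketch below fabricates rows for a hypothetical users table, preserving column names, types, and value domains while carrying no statistical signal from the real records.

```python
import random
import string
from datetime import date, timedelta

random.seed(3)

# Hypothetical schema for a "users" table. Only column names, types,
# and domains are preserved; no statistical property of real data is.
def fake_user(user_id: int) -> dict:
    return {
        "user_id": user_id,
        "email": "".join(random.choices(string.ascii_lowercase, k=8))
                 + "@example.com",
        "country": random.choice(["US", "DE", "FR", "JP", "BR"]),
        "signup_date": (date(2020, 1, 1)
                        + timedelta(days=random.randrange(1500))).isoformat(),
        "is_active": random.random() < 0.5,
    }

# Enough to exercise pipelines, migrations, and UI code against
# realistic shapes, while no record derives from a real person.
for row in (fake_user(i) for i in range(5)):
    print(row)
```
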
All these scenarios can be addressed with robust, future-proof privacy notions like differential privacy. At Tumult Labs, we are world-class experts at this: we have published leading, award-winning research on provably private synthetic data generation. We are deploying differentially private synthetic data solutions for customers, and building a general-purpose platform to make this technology easy for others to use. Get in touch if you’d like to learn more!