A Winning Approach to Generating Synthetic Data

Michael Hay

At Tumult Labs, we are dedicated to helping organizations unlock the value of sensitive data while maintaining strong privacy protections. A scientific paper, co-authored by our CEO Gerome Miklau, introduces a cutting-edge method for generating differentially private synthetic data. This work, which won a national competition, showcases how synthetic data can maintain both privacy and utility, allowing organizations to safely use data without compromising individual privacy.

You can read the full paper here.

What Is Synthetic Data?

Synthetic data is artificial data that mimics the statistical patterns of real-world data without containing any real individual’s records. It’s widely used to analyze trends, train machine learning models, and test systems, which makes it valuable across sectors like healthcare, finance, and government. However, if synthetic data is generated without formal privacy protections, it can still reveal sensitive information.

The Importance of Differential Privacy

Not all synthetic data is created equal. Without proper privacy safeguards, synthetic data may be vulnerable to reverse engineering, where attackers can use patterns in the data to uncover real-world individual information. This is where differential privacy comes in.

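Differential privacy is a rigorous mathematical standard for releasing information derived from sensitive data. It is typically achieved by adding carefully calibrated random noise to the statistics computed from the data, which limits how much any single individual’s record can influence what is released. That limit holds even when an attacker has access to external data, which is what makes it effective at preventing privacy breaches.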
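
For readers who want the formal statement, the standard textbook definition can be written as below. The symbols are not taken from the article and are introduced here only for illustration: M is a randomized mechanism, D and D' are any two datasets that differ in one individual's record, S is any set of possible outputs, and epsilon and delta are the privacy parameters (smaller values mean stronger protection).

```latex
% (epsilon, delta)-differential privacy, standard definition.
% M: randomized mechanism; D, D': neighboring datasets; S: any set of outputs.
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in S \,] + \delta
```

Informally, whatever an observer sees in the output is almost as likely to have been produced without any one person's record as with it.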

The Paper’s Approach: Select, Measure, Generate

In their paper, Miklau and his co-authors outline a three-step process for generating differentially private synthetic data, known as Select, Measure, Generate:

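Select Key Statistics: The process starts by choosing which statistics to compute from the data. These statistics are marginal queries: counts of how many records take each combination of values for a small set of attributes. They capture essential relationships, such as how income relates to education or employment.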

Measure with Privacy-Preserving Noise: To protect privacy, random noise is added to these statistics using the Gaussian mechanism, which draws the noise from a normal distribution. This limits what the noisy statistics, and any synthetic data built from them, can reveal about any individual record.

Generate Synthetic Data: Finally, these noisy statistics are used to generate synthetic data that reflects the underlying patterns of the original dataset without leaking personal details.
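
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method from the paper: the toy dataset, the helper names (marginal, measure, generate), and the noise scale sigma are all hypothetical, and sigma is not calibrated to any privacy budget. The generation step here simply samples from one noisy two-way marginal, which is a simplified stand-in for the more sophisticated generation used by NIST-MST and MST.

```python
import numpy as np


def marginal(data, cols, domain):
    """Select: count records for each combination of values in `cols`."""
    shape = tuple(domain[c] for c in cols)
    counts = np.zeros(shape)
    # Unbuffered add so repeated index combinations are all counted.
    np.add.at(counts, tuple(data[:, c] for c in cols), 1)
    return counts


def measure(counts, sigma, rng):
    """Measure: Gaussian mechanism, adding N(0, sigma^2) noise to each cell."""
    return counts + rng.normal(0.0, sigma, size=counts.shape)


def generate(noisy, n, rng):
    """Generate: sample n records, treating the noisy marginal as a distribution."""
    probs = np.clip(noisy, 0.0, None).ravel()  # noise can make cells negative
    if probs.sum() == 0:
        probs = np.ones_like(probs)  # fall back to uniform if all mass is clipped
    probs = probs / probs.sum()
    flat = rng.choice(probs.size, size=n, p=probs)
    return np.stack(np.unravel_index(flat, noisy.shape), axis=1)


# Hypothetical toy data: column 0 = education level (3 values),
# column 1 = income bracket (4 values), correlated with education.
rng = np.random.default_rng(0)
n = 1000
education = rng.integers(0, 3, size=n)
income = education + rng.integers(0, 2, size=n)
data = np.stack([education, income], axis=1)
domain = {0: 3, 1: 4}

selected = (0, 1)  # the education-by-income marginal
true_counts = marginal(data, selected, domain)
# sigma is an arbitrary illustrative value; a real deployment would derive it
# from the privacy parameters and the sensitivity of the query.
noisy_counts = measure(true_counts, sigma=5.0, rng=rng)
synthetic = generate(noisy_counts, n, rng)
```

Only the noisy counts are used to produce the synthetic records, so the generation step never looks at the original data again. That separation is the design idea the three steps share.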

The researchers introduced two methods in the paper: NIST-MST, the mechanism that won the competition, and MST, a more general version that doesn’t rely on the existence of public data for the selection step. Both methods demonstrate that synthetic data can be generated at scale, with strong privacy guarantees, while preserving the critical value of the data.

Why This Matters

For organizations that rely on sensitive data, this method opens up new opportunities. It allows companies to safely share and analyze data while meeting the highest privacy standards. As privacy regulations tighten and the demand for data-driven insights grows, differentially private synthetic data offers a path forward.

At Tumult Labs, we are proud to lead the way in this field, helping organizations safely unlock the value of their data using the best privacy technology available.
