A Winning Approach to Generating Synthetic Data
At Tumult Labs, we are dedicated to helping organizations unlock the value of sensitive data while maintaining strong privacy protections. A scientific paper, co-authored by our CEO Gerome Miklau, introduces a cutting-edge method for generating differentially private synthetic data. This work, which won a national competition, showcases how synthetic data can maintain both privacy and utility, allowing organizations to safely use data without compromising individual privacy.
You can read the full paper here.
What Is Synthetic Data?
Synthetic data is artificial data that mimics the patterns of real-world data without containing any actual personal information. It’s widely used to analyze trends, train machine learning models, and test systems—making it invaluable across sectors like healthcare, finance, and government. However, if synthetic data is not properly protected, it can still reveal sensitive information.
The Importance of Differential Privacy
Not all synthetic data is created equal. Without proper privacy safeguards, synthetic data may be vulnerable to reverse engineering, where attackers use patterns in the data to uncover information about real individuals. This is where differential privacy comes in.
Differential privacy is a rigorous mathematical framework that limits how much any one person's record can influence the output of an analysis. It is typically achieved by adding carefully calibrated random noise to the statistics computed from the data, so that the resulting synthetic data cannot be reliably traced back to any individual. This protection is essential for preventing privacy breaches, and it holds even when an attacker has access to external data.
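For readers who want the guarantee stated precisely, the standard textbook definition is given below as background. The notation is ours and is not taken from the paper: M is a randomized mechanism, D and D' are any two datasets that differ in one individual's record, S is any set of possible outputs, and ε and δ are the privacy parameters.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta
```

Smaller values of ε and δ mean the output distribution changes less when one person's record is added or removed, which corresponds to stronger privacy.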
The Paper’s Approach: Select, Measure, Generate
In their paper, Miklau and his co-authors outline a three-step process for generating differentially private synthetic data, known as Select, Measure, Generate:
- Select Key Statistics: The process starts by choosing which statistics to compute from the data, called marginal queries. A marginal query counts how many records have each combination of values for a small set of attributes, capturing essential relationships such as how income relates to education or employment.
- Measure with Privacy-Preserving Noise: To protect privacy, random noise is added to these statistics using the Gaussian mechanism. This limits what the statistics, and any synthetic data built from them, can reveal about an individual record.
- Generate Synthetic Data: Finally, these noisy statistics are used to generate synthetic data that reflects the underlying patterns of the original dataset without leaking personal details.
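The three steps can be illustrated with a toy sketch. This is not the MST or NIST-MST algorithm from the paper: it selects a single hand-picked marginal, and it generates records by sampling directly from that one noisy table instead of fitting a model to many marginals. The function names, the toy columns, and the noise scale `sigma` are all hypothetical, and `sigma` is not calibrated to any particular privacy budget.

```python
import numpy as np

def measure_marginal(data, cols, domain, sigma, rng):
    """Measure: count records per value combination, then add Gaussian noise."""
    shape = tuple(domain[c] for c in cols)
    counts = np.zeros(shape)
    # Each record adds 1 to the cell indexed by its values on `cols`.
    np.add.at(counts, tuple(data[:, c] for c in cols), 1)
    return counts + rng.normal(0.0, sigma, size=shape)

def generate(noisy, n_records, rng):
    """Generate: sample synthetic records from the noisy marginal."""
    probs = np.clip(noisy, 0.0, None).ravel()  # noise can push counts below zero
    total = probs.sum()
    if total > 0:
        probs = probs / total
    else:
        probs = np.full(probs.size, 1.0 / probs.size)  # fall back to uniform
    flat = rng.choice(probs.size, size=n_records, p=probs)
    return np.column_stack(np.unravel_index(flat, noisy.shape))

rng = np.random.default_rng(0)

# Toy dataset (hypothetical): column 0 = education level (3 values),
# column 1 = income bracket (4 values).
domain = {0: 3, 1: 4}
data = np.column_stack([rng.integers(0, 3, 1000), rng.integers(0, 4, 1000)])

selected = (0, 1)                                    # Select
noisy = measure_marginal(data, selected, domain,     # Measure
                         sigma=5.0, rng=rng)
synthetic = generate(noisy, n_records=1000, rng=rng) # Generate
```

The synthetic records are drawn only from the noisy counts, never from the original rows, which is the property the three-step structure is designed to provide. A real system would also need to choose `sigma` from the desired privacy parameters and combine many marginals consistently.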
The researchers introduced two methods in the paper: NIST-MST, the mechanism that won the competition, and MST, a more general version that does not require public data for the selection step. Both methods demonstrate that synthetic data can be generated at scale, with strong privacy guarantees, while preserving the critical value of the data.
Why This Matters
For organizations that rely on sensitive data, this method opens up new opportunities. It allows companies to safely share and analyze data while meeting the highest privacy standards. As privacy regulations tighten and the demand for data-driven insights grows, differentially private synthetic data offers a path forward.
At Tumult Labs, we are proud to lead the way in this field, helping organizations safely unlock the value of their data using the best privacy technology available.