Empirical privacy metrics: The bad, the ugly… and the good, maybe?
In a talk for PEPR '24, Damien Desfontaines lists major issues with empirical privacy metrics for synthetic data generation, and explains how we could fix them.
Synthetic data generation makes for a convincing pitch: create fake data that follows the same statistical distribution as your real data, so you can analyze it, share it, and sell it. Supposedly, privacy and compliance are achieved because this synthetic data is anonymous.
How do synthetic data vendors justify such privacy claims? Their answer often boils down to empirical privacy metrics. Vendors recommend that users run measurements on their synthetic data and empirically determine whether it's safe enough to release. But how do these metrics work? How useful are they? And how much should you rely on them?
In a talk delivered at PEPR '24, Damien Desfontaines takes a critical look at the space of synthetic data generation and empirical privacy metrics, dispels some marketing-fueled myths that are a little too good to be true, and explains what is needed for these tools to be a valuable part of a larger privacy posture.
You can watch the recording of the presentation below, or read its transcript directly.
At Tumult Labs, we’re building synthetic data generation solutions that provide robust privacy guarantees using the proven science of differential privacy. We’re also designing ways to perform empirical privacy evaluation that avoid the pitfalls described in this presentation. If you’d like to learn more or schedule a demo, let us know! We’d love to hear from you.