Benchmarking differentially private synthetic data generation algorithms


Which synthetic data generation algorithms for tabular datasets offer the best privacy/utility trade-offs? Tumult Labs did the research. Read the results below.

Research
Michael Hay

Summary:

A synthetic dataset is a collection of records generated to match the properties of an original data source. It is an appealing option for a range of data sharing settings, and differential privacy is the best way to guarantee that synthetically generated data protects the privacy of the original source data.

In this paper, we benchmark twelve published methods for generating differentially private tabular synthetic data to see which most accurately preserve the properties of the source data. We present a systematic benchmark in which the utility of the synthetic data is evaluated by measuring whether it preserves the distributions of individual attributes and of pairs of attributes, whether it preserves pairwise correlations, and by the accuracy of an ML classification model. In a comprehensive empirical evaluation, we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
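
To make the marginal-based utility measures concrete, here is a minimal sketch of how one might score a synthetic dataset against its source on individual attributes and on pairs of attributes. It is an illustration only, not the benchmark's implementation: the choice of total variation distance as the error measure, the helper names (`marginal`, `total_variation`, `marginal_errors`), and the toy records are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def marginal(records, attrs):
    """Empirical distribution of the given attribute tuple over the records."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    total = len(records)
    return {key: c / total for key, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def marginal_errors(source, synthetic, attrs, k):
    """Error (assumed here to be TVD) for every k-way marginal."""
    return {
        combo: total_variation(marginal(source, combo), marginal(synthetic, combo))
        for combo in combinations(attrs, k)
    }


# Hypothetical toy data: age and income are perfectly associated in the source.
source = [
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
    {"age": "old", "income": "high"},
]
# The synthetic records keep each attribute's distribution but lose the association.
synthetic = [
    {"age": "young", "income": "high"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "low"},
    {"age": "old", "income": "high"},
]

attrs = ["age", "income"]
one_way = marginal_errors(source, synthetic, attrs, 1)  # individual attributes
two_way = marginal_errors(source, synthetic, attrs, 2)  # pairs of attributes
```

In this toy case both one-way errors are 0, while the two-way error for `("age", "income")` is 0.5. This is why a benchmark that looks only at individual attributes can miss a synthetic dataset that has lost the relationships between them.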
