A framework to evaluate the robustness of anonymization solutions

In this blog post, we introduce a conceptual framework to help prospective buyers of anonymization technology evaluate claims about the trustworthiness and security of potential solutions.

Say you’re in the market for anonymization software: tooling that unlocks the value of sensitive data by enabling you to share or use it in a fully anonymized way. If you don’t have in-house experts in privacy-enhancing technologies, it can be hard to navigate the landscape and understand which solutions you should trust. What should you look for? What questions should you ask vendors of anonymization technology?

Robustness is a critical requirement for our customers, who rely on the guarantees offered by the software we build to share or publish insights from extremely sensitive data. So we spent a lot of time and effort thinking about what it means for an anonymization tool to be secure, and how to constantly improve the security of our products.

In this article, we offer a framework to help you evaluate the robustness of anonymization tools. This framework is modeled as a pyramid: lower levels are the most critical, and must be in place in order to fully benefit from the additional assurances that upper levels provide.

[Figure: a pyramid with four levels. From bottom to top: “Relies on mathematically-proven, future-proof guarantees”, “Transparent about methods and implementation”, “Designed for safety and auditability”, and “Audited by third parties”.]

Let’s take a look at each of these four levels, starting with the bottom one.

Relies on mathematically-proven, future-proof guarantees

From a security perspective, anonymization tools have one main task: protect your data. At a very fundamental level, how do they claim to do that? What privacy guarantees do they offer?

Different products have very different approaches to answering this question.

  • Some tools do not enforce a clear, quantifiable privacy guarantee, often because they use ad hoc techniques such as record-level de-identification.
  • Some tools rely on empirical privacy metrics, which claim to quantify privacy in the output data after generating it. This is particularly frequent for synthetic data generation tools.
  • Some tools rely on definitions of privacy with a clear, mathematically-formulated attacker model. These provide a guarantee that — barring implementation issues — cannot be broken by future attacks.
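
The best-known definition in this last category is differential privacy: a randomized mechanism M satisfies ε-differential privacy if, for any two datasets D and D′ that differ in a single record, and for every set S of possible outputs:

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

Because this inequality must hold for every pair of neighboring datasets and every possible output, it limits what any attacker can learn about a single record, regardless of their background knowledge or computational power.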

We believe that products falling in the first two categories are not robust enough to rely on in privacy-critical scenarios. Ad hoc approaches to anonymization have been repeatedly broken, and techniques that protect against specific attacks do not stand the test of time. Empirical privacy metrics in use today have fundamental issues that make them largely meaningless; even if these issues were fixed, empirical metrics could only ever provide a lower bound on privacy risk, never a future-proof guarantee. And securing anonymization deployments against future attacks is crucial: published data often cannot be “unpublished” later on.

By contrast, some solutions rely on robust mathematical foundations, often differential privacy. This notion can be proven to protect the anonymized data against all possible attacks, providing future-proof guarantees for the published data. This is the approach we focus on at Tumult Labs: our platform, Tumult Analytics, enforces differential privacy on its outputs.
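
To make this concrete, here is a minimal sketch of computing a differentially private count with Tumult Analytics, adapted from the library’s public tutorials; the toy dataset is invented for illustration, and exact import paths and signatures may vary across versions.

```python
from pyspark.sql import SparkSession
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
# Toy dataset standing in for sensitive records.
members = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["name"])

# The Session owns the data and enforces an overall privacy budget.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1),
    source_id="members",
    dataframe=members,
    protected_change=AddOneRow(),  # protect adding or removing one row
)

# Evaluate a differentially private count; noise is calibrated
# automatically so that the result satisfies the stated guarantee.
count = session.evaluate(
    QueryBuilder("members").count(),
    privacy_budget=PureDPBudget(epsilon=1),
)
count.show()
```

The Session tracks the budget consumed by each query and refuses to run queries that would exceed the overall budget, so the guarantee holds over the entire interaction, not just a single query.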

Note that this distinction might not always be clear when reading the marketing materials of these products. Many tools promise to produce “fully anonymous data” and guarantee GDPR compliance, even though these claims are not backed by robust notions of privacy.

Transparent about methods and implementation

Any approach to anonymization, even if it relies on a robust mathematical foundation, is only as secure as its implementation. And while it’s easy for vendors to claim that a product is safe, it’s much harder for buyers of privacy technology to independently validate these claims. But one thing that can be evaluated more easily is the level of transparency: Are the methods and code publicly available?

Anonymization is similar to cryptography in this respect: it is easy to claim that a system provides some guarantee, and much harder to implement it correctly. One should therefore be cautious with tools that publish neither the methods they rely on nor their implementation code. Even when vendors are willing to let their customers audit their code, not allowing the research community to independently evaluate their security claims is a bad sign. After all, robust software should not rely on security through obscurity.

At Tumult, this is a major reason why we decided to make the privacy-critical code of our platform, Tumult Analytics, entirely open source, and to publish a whitepaper describing its architecture. We believe that anyone should be able to understand how our technology protects sensitive data, and to verify our claims.

Designed for safety and auditability

Implementing anonymization software correctly is notoriously difficult and error-prone. Subtle issues can easily lead to severe vulnerabilities. Worse, for complex data publication mechanisms, keeping track of all the operations done on the data and their privacy-relevant properties can become very difficult. Because of such challenges, when evaluating anonymization solutions, it can be valuable to ask: What approach does the tool take to software safety and auditability?

This can be evaluated using a number of different signals. Some are common to all software projects: How well-tested and well-documented is the software library? How often is it updated? How responsive is the development team to questions and security reports? Others are more specific to anonymization software: How are floating-point vulnerabilities mitigated? What is the tool’s approach to providing end-to-end guarantees for complex programs? Good answers to these questions are a positive sign.
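
To illustrate the floating-point question: naive implementations add Laplace noise in standard double-precision arithmetic, and the gaps in the resulting floating-point values can leak information about the true value (the vulnerability described by Mironov in 2012). One class of mitigations rounds the noisy output to a coarse grid. The sketch below is a simplified illustration of that idea, with invented function names; it is not a production-ready mechanism, and not necessarily how any given product implements it.

```python
import math
import secrets


def laplace_sample(scale: float) -> float:
    """Draw a Laplace(0, scale) sample via inverse-CDF sampling."""
    # secrets provides a cryptographically secure source of randomness.
    u = (secrets.randbits(53) + 0.5) / (1 << 53) - 0.5  # uniform in (-0.5, 0.5)
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)


def rounded_noisy_value(true_value: float, scale: float, granularity: float) -> float:
    """Add Laplace noise, then round the result to a coarse grid.

    Rounding removes the fine-grained floating-point artifacts that
    attacks exploit. A complete mitigation (e.g., the snapping mechanism)
    also clamps the output and slightly inflates the noise scale; those
    steps are omitted here for brevity.
    """
    noisy = true_value + laplace_sample(scale)
    return granularity * round(noisy / granularity)


print(rounded_noisy_value(1000.0, scale=10.0, granularity=1.0))
```

A vendor does not need to use this exact approach, but they should be able to explain which mitigation they use and why it is sound.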

At Tumult Labs, we take a multi-pronged approach to these questions. First, we follow good software development practices, like systematic testing, code review, and monitoring our dependencies for vulnerabilities. Second, we invest in research to secure key components like noise addition, developing provable approaches to fixing floating-point vulnerabilities. Third, we built safety into the design of Tumult Analytics: all the privacy-critical logic relies on Tumult Core, a framework in which simple, easy-to-audit components can be composed in complex ways, and the end-to-end privacy guarantees are automatically deduced from the properties of individual components (the sketch below illustrates this idea). You can read more in our technical whitepaper.
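
As a toy illustration of that last point (this is not Tumult Core’s actual API, just a minimal sketch of the design principle, with invented names): each component carries its own privacy cost, and the framework derives the end-to-end guarantee mechanically, rather than relying on a human to track it across the whole pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Component:
    """A primitive step with a small, locally-auditable privacy cost."""
    name: str
    epsilon: float  # privacy cost of running this step alone
    run: Callable


def compose(components: List[Component]) -> Component:
    """Sequential composition: the total cost is the sum of the parts.

    This mirrors the basic composition theorem of differential privacy:
    the end-to-end guarantee is deduced from component properties, not
    re-derived by hand for every new pipeline.
    """
    def run_all(data):
        for component in components:
            data = component.run(data)
        return data

    return Component(
        name=" -> ".join(c.name for c in components),
        epsilon=sum(c.epsilon for c in components),
        run=run_all,
    )


# Hypothetical pipeline: clamping is a data-independent transformation,
# so only the noisy sum consumes privacy budget.
pipeline = compose([
    Component("clamp", epsilon=0.0, run=lambda xs: [min(x, 100) for x in xs]),
    # Placeholder: a real component would add noise calibrated to epsilon.
    Component("noisy_sum", epsilon=0.5, run=sum),
])
print(f"{pipeline.name} satisfies {pipeline.epsilon}-DP")
```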

Audited by third parties

Tools that provide clear privacy guarantees, are transparent, and can demonstrate how security was built into their design already provide a high level of robustness. For extremely sensitive use cases, it might make sense to ask prospective vendors one more question: Was the tool systematically audited by independent experts?

Transparency and auditability allow third parties to look for vulnerabilities and verify claims. But this kind of “community-based” audit is hard to predict, and will likely only cover a fraction of the overall codebase. It can therefore make sense to contract with independent experts to perform a systematic audit of the entire platform. Having an entirely separate team look through a large project can be insightful, and increases confidence that the software actually provides its advertised privacy guarantees.

Tumult Analytics’ entire codebase was reviewed by third-party experts as part of an assessment before it was deployed in production by a U.S. government agency. No major security issues were found as part of this audit. 

Conclusion

This “pyramid of robustness” sets a high bar for anonymization tooling. Some of the upper levels might not be hard requirements for all use cases: your mileage may vary depending on compliance requirements, the sensitivity of your data, and the context of your anonymization use case. But we hope that this conceptual framework gives you a better idea of what to ask potential vendors, and helps you evaluate their claims.

If you’d like to get a demo of Tumult Analytics, or discuss our approach to building robust software for privacy-critical use cases, don’t hesitate to get in touch!
