Publishing Wikipedia usage data with strong privacy guarantees
Tumult Labs helped engineers at the Wikimedia Foundation design, implement, and deploy a differentially private solution to publish Wikipedia usage metrics with provable privacy guarantees.
This case study presents the original problem, the process that led to this publication, the challenges overcome along the way, and the lessons learned from this collaboration.
For nearly every hour of the past 20 years, the nonprofit that hosts Wikipedia has shared data publicly about the use of the online encyclopedia.
For example, the Pageview API provides information about how many times any given page on Wikipedia is visited each day. This data is vital for volunteer Wikipedia editors seeking to make data-driven decisions about which articles they edit next, academic researchers examining the information landscape, technologists assessing user behavior, and more. Open data is part of the Wikimedia Foundation’s ethos, with a belief that transparency can help improve collective understanding of and engagement with the internet.
The published information is, however, quite coarse. In particular, page visits from all around the world are aggregated into a single statistic, without any geographic detail. Indeed, location information can be extremely sensitive – Wikipedia editors or readers might not want their location to be revealed, even at the country level, due to sensitive editing or reading behavior. More generally, people browsing Wikipedia rightly expect their behavior on the website to stay private, and privacy is a central tenet of the nonprofit’s values. Since it is well known that simply aggregating data is not enough to prevent re-identification risk, publishing data at a finer geographic granularity warrants an approach with rock-solid privacy guarantees for people browsing Wikipedia.
At the same time, more granular data would be valuable to the community: for example, it would help volunteer editors estimate which articles, article categories, and languages deserve the most attention.
So, the Foundation was faced with the question of how to make more data about Wikipedia usage patterns available, while still providing strong privacy guarantees to people browsing the website. This kind of challenge seemed particularly well-suited to the use of differential privacy. To evaluate whether differential privacy could, in fact, successfully tackle this challenge, Wikimedia Foundation engineers considered the use case of a frequently requested feature of the Pageview API: presenting the page view data, sliced by country of origin.
After an in-depth comparison of available open-source tools, the Wikimedia Foundation decided to use Tumult Analytics and started a collaboration with Tumult Labs to help them along the way. The pipeline is now deployed, and the published data provides useful insights to anyone interested in better understanding Wikipedia usage. This case study explains how the process worked from inception to deployment, and it highlights a few particularly notable points of this use case.
Tumult’s workflow for differential privacy deployments
Our work helping organizations launch a differentially private data product follows a standard workflow, with three main stages: Build, Tune, and Deploy. The entire process is outlined in the diagram below.
The three main stages are as follows.
- First, in the Build step, the goal is to gain a good understanding of the problem and its requirements, and implement a first-cut algorithm. Properly defining the problem and building a prototype expose the levers inherent to the project: which choices did we make while building the prototype, and which of these choices can later be modified to pick different trade-offs between utility and privacy?
- Second, in the Tune step, we use these levers to experiment with different settings and optimize the algorithm. This is an iterative process: we evaluate our approach against the quality metrics defined in the first step, and make changes until the mechanism produces data that is fit for use.
- Finally, in the Deploy stage, the algorithm is finalized, documented, and deployed in production.
In the rest of this case study, we zoom in on three different aspects of this process.
- First, we outline the problem statement of this pilot data release, and the quality metrics that were chosen for this project.
- Second, we give a high-level description of the differentially private algorithm designed for this use case.
- Third, we give more detail about a specific technical challenge encountered in the course of this project: bounding user contributions without user identifiers.
Problem statement and success metrics
The goal of the first stage, Build, is to produce a first prototype that meets the high-level requirements for the data product. Of course, the first step is to outline these requirements. What is the problem we are trying to solve, and how do we quantify success? Both questions are answered in a problem statement document, co-authored by both the Wikimedia Foundation and Tumult Labs.
In this pilot, the input data looked like the following sample table.
Each line corresponds to a visit to a certain page, coming from a certain country (based on IP geocoding), on a certain date. For example, the first line corresponds to a visit to a Wikipedia article from Switzerland on August 15th, 2022. The high-level goal is to compute a histogram of visits, grouped by page, country, and date. The output data is expected to look like the following table.
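To make this concrete, here is a small fictional illustration in Python. The column names are our assumption, based on the description above, and not the actual Wikimedia schema.

```python
# Fictional input rows: each row is one page visit (see footnote 1).
input_rows = [
    {"page": "Article_A", "country": "CH", "date": "2022-08-15"},
    {"page": "Article_A", "country": "CH", "date": "2022-08-15"},
    {"page": "Article_B", "country": "FR", "date": "2022-08-15"},
]

# Expected (non-private) output: the number of visits per page, country, and date.
expected_output = [
    {"page": "Article_A", "country": "CH", "date": "2022-08-15", "count": 2},
    {"page": "Article_B", "country": "FR", "date": "2022-08-15", "count": 1},
]
```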
Finally, the problem statement also describes desired privacy guarantees, and answers the question “What do we want to protect?”, in addition to “What do we want to achieve?” In this project, the high-level goal is to protect all the contributions of a single person during a single day.
Differential privacy requires randomization, or noise, to provide strong privacy guarantees to the people in the input data. This means that the differentially private output will not be exactly the same as the result of a simple group-by + count operation on the input data. The goal, then, is for the noisy output to be as close as possible to the true output. This notion of closeness is defined by quality metrics, which capture the statistical properties of the data that we expect to be most useful, and that we would like to preserve during the differential privacy computation.
In this project, we measured quality along three distinct dimensions (a short code sketch showing how they can be computed follows the list).
- The relative error distribution captures how noisy the released counts are. For each noisy count that we publish, how close is it, in relative error, to the real count? More precisely, we looked at the percentage of released counts having a relative error smaller than 10%, 25%, and 50%.
- The drop rate captures the percentage of counts that exist in the true histogram, but have been suppressed by our algorithm, and do not appear in the differentially private output. An additional metric looks at this percentage restricted to pages with at least a minimum number of visits – the reasoning being that suppressing a popular page is worse than suppressing a page with very few views.
- The spurious rate captures the percentage of counts that appear in the differentially private output, but did not appear in the true histogram.
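As a reference, here is a minimal sketch (our own illustrative code, with hypothetical names and thresholds, not the project's evaluation code) of how these three metrics could be computed from a true histogram and a released, noisy histogram:

```python
def quality_metrics(true_counts, released_counts, error_threshold=0.5, popular_min=150):
    """Compute the three metrics above from two dicts mapping (page, country)
    keys to counts. Function name, signature, and thresholds are illustrative."""
    true_keys, released_keys = set(true_counts), set(released_counts)

    # Relative error distribution: share of released counts that also exist in
    # the true histogram and are within error_threshold of the true value.
    common = true_keys & released_keys
    close = sum(
        1 for k in common
        if abs(released_counts[k] - true_counts[k]) <= error_threshold * true_counts[k]
    )
    relative_error_share = close / len(common) if common else 0.0

    # Drop rate: share of true counts missing from the release, also computed
    # restricted to "popular" keys with at least popular_min true visits.
    drop_rate = len(true_keys - released_keys) / len(true_keys) if true_keys else 0.0
    popular = {k for k in true_keys if true_counts[k] >= popular_min}
    popular_drop_rate = len(popular - released_keys) / len(popular) if popular else 0.0

    # Spurious rate: share of released counts absent from the true histogram.
    spurious_rate = len(released_keys - true_keys) / len(released_keys) if released_keys else 0.0

    return relative_error_share, drop_rate, popular_drop_rate, spurious_rate
```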
For more information about the input/output data and these quality metrics, you can consult the original problem statement of this pilot.
High-level approach
Once the problem statement and the utility metrics have been defined, the next step is to design and implement an algorithm that addresses the requirements. In this project, the first-cut algorithm followed the same overall approach as the algorithm that ended up being deployed. The following diagram describes this high-level approach.
This algorithm takes two distinct inputs.
- Sensitive data, which is the input data capturing user interactions described above.
- Public data, which in this case is the data already published in the Pageview API.
Since this public data has already been released, relying on it does not create new privacy risk; the goal is to give a provable, formal bound on the additional privacy risk that could occur due to this release. This progressive approach is typical of organizations piloting the use of differential privacy, and it is a good first step towards measuring the privacy risk of data releases in a more holistic way.
The first operation performed on the sensitive data is contribution bounding. Here, differential privacy must hide all the contributions of a single Wikipedia user during a single day. This is impossible to achieve if a user can have an arbitrarily large contribution to the dataset during this time period, so we must truncate the input data to ensure that each user does not contribute more than a certain amount of data. This step is necessary for most differentially private data releases, but it caused unique challenges for this pilot – we come back to these in the next section.
After contribution bounding, the data is grouped by page and country, using the public data to generate the list of possible groups (the keyset). Generating these groups from public data is an important step to achieve differential privacy: if the list of groups were derived from the sensitive data itself, the mere presence or absence of a group in the output could leak information and break the privacy guarantees.
After computing the count for each group, the algorithm adds noise to each of them. After this step, the results are differentially private and safe to publish. However, the previous step might have added a large volume of spurious data: some of the page/country pairs generated from the public data might not exist in the actual sensitive data; for these groups, noise was added to a true count of 0. To prevent too many of these from ending up in the released data, we add a post-processing step that removes the counts below a certain threshold.
This last step captures a trade-off between the quality metrics defined initially: a higher threshold removes more counts, which increases the drop rate, but reduces the spurious rate and improves the relative error distribution.
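To make the last three steps concrete, here is a rough plain-Python sketch of keyset generation, noise addition, and thresholding for a single day. The actual pipeline is built with Tumult Analytics; the noise distribution, parameter names, and values below are illustrative assumptions, not the production implementation.

```python
import random

def noisy_histogram(bounded_visits, public_keys, noise_scale, threshold):
    """Illustrative sketch of the release steps for a single day.

    bounded_visits: list of (page, country) pairs after contribution bounding.
    public_keys: set of (page, country) pairs derived from the public data
      (e.g. pages already listed in the Pageview API), used as the keyset.
    noise_scale and threshold are placeholders, not the production parameters.
    """
    # Count visits only for the groups listed in the public keyset; groups
    # absent from the sensitive data keep a true count of 0.
    counts = {key: 0 for key in public_keys}
    for key in bounded_visits:
        if key in counts:
            counts[key] += 1

    released = {}
    for key, true_count in counts.items():
        # Gaussian noise is the usual choice under zero-concentrated DP.
        noisy = true_count + random.gauss(0.0, noise_scale)
        # Post-processing: suppress small noisy counts to limit spurious results.
        if noisy >= threshold:
            released[key] = round(noisy)
    return released
```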
One technical challenge: bounding user contributions
As mentioned above, differential privacy requires bounding the contributions that come from each person in the data. Often, this step can be achieved in a conceptually simple way: each person is associated with a unique identifier (e.g. stored in account information systems, or as cookie identifiers), so the system can simply count the number of times each identifier occurs in the data, and keep only a fixed number of contributions.
The parameters chosen for this pilot led us to consider a different kind of contribution bounding, somewhat more involved than limiting the total number of rows associated with each user. Instead, we want each user to contribute at most once to each count, but be able to contribute to at most 10 distinct counts.
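For intuition, if each record did carry a user identifier, this kind of bounding could be a simple pre-processing step. The sketch below uses hypothetical field names and is for illustration only.

```python
import random
from collections import defaultdict

def bound_contributions(rows, max_groups=10):
    """Hypothetical sketch: rows are dicts with 'user_id', 'page', and 'country'
    fields for a single day. Keep at most one row per (user, page, country),
    and at most max_groups distinct (page, country) groups per user."""
    groups_per_user = defaultdict(set)
    for row in rows:
        # The set automatically deduplicates repeated visits to the same group.
        groups_per_user[row["user_id"]].add((row["page"], row["country"]))

    bounded = []
    for user, groups in groups_per_user.items():
        # If a user touched more than max_groups groups, keep a random subset
        # so that the contribution bound holds for every user.
        kept = random.sample(sorted(groups), min(len(groups), max_groups))
        bounded.extend({"user_id": user, "page": p, "country": c} for p, c in kept)
    return bounded
```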
However, this contribution bounding step turned out to be more challenging to implement than originally expected. The technical infrastructure underlying Wikipedia implements strict data minimization practices. In particular, when a user reads Wikipedia, the logging infrastructure will not associate a unique identifier to the log records corresponding to this user. There is no such thing as a cookie ID that allows Wikimedia's systems to recognize whether two records came from the same user.
In this context, how could contribution bounding be performed at all? We tackled this challenge in two steps.
- Engineers at the Wikimedia Foundation implemented a novel system for client-side filtering. The key idea is to perform the contribution bounding step locally, on the end-user device. Each device counts how many contributions have been logged in a single day; once the contribution threshold is met, it attaches a flag indicating that any further contributions should be discarded before the differentially private computation (see the sketch after this list).
- Tumult Labs engineers designed and implemented an additional feature in the Tumult Analytics framework to support this atypical privacy notion: rather than protecting a single row in the dataset, or all the rows with a given identifier, this new feature protects any group of up to 10 rows, each contributing to a different count in the output. This was made possible thanks to the extensibility of the underlying privacy framework, Tumult Core, whose design makes it easy to add features such as this one.
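To illustrate the client-side idea, here is a rough Python sketch. It is not the actual Wikimedia implementation, which runs on end-user devices as part of the logging infrastructure; the class and field names are ours.

```python
import datetime

class ClientSideContributionFilter:
    """Illustrative sketch of per-device contribution bounding. Keeps a daily
    counter of logged pageviews and flags any pageview beyond max_contributions
    so the server-side pipeline can discard it before the DP computation."""

    def __init__(self, max_contributions=10):
        self.max_contributions = max_contributions
        self.current_day = None
        self.count = 0

    def annotate(self, pageview_event):
        today = datetime.date.today()
        if today != self.current_day:
            # Reset the counter at the start of each new day.
            self.current_day = today
            self.count = 0
        self.count += 1
        # The flag tells the server to drop this event prior to aggregation.
        pageview_event["exceeds_daily_limit"] = self.count > self.max_contributions
        return pageview_event
```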
Results and next steps
On the sample data used for evaluation and tuning, we found that using zero-concentrated differential privacy with a daily privacy budget of ρ=0.15 (which corresponds, for example, to ε=1 for δ=10⁻⁷) led to a publication of acceptable quality. More precisely:
- more than 95% of counts had a relative error below 50%;
- the drop rate among counts larger than 150 was below 0.1% (1 in 1,000 pages);
- and the spurious rate was below 0.01% overall (1 in 10,000 pages), and below 3% for all but 3 countries.
The fact that we can precisely quantify the “privacy cost” of this release has an important benefit for future data releases: going forward, the Wikimedia Foundation will be able to use the tooling of differential privacy accounting to assess the cumulative privacy risk of multiple releases.
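This accounting is particularly simple under zero-concentrated differential privacy, where the privacy costs of successive releases on the same data add up. The sketch below is our own illustrative code, with made-up budget values, not the Foundation's tooling.

```python
class ZCdpAccountant:
    """Minimal sketch of privacy accounting under zero-concentrated DP:
    the cumulative privacy cost of several releases on the same data is
    the sum of their individual rho budgets."""

    def __init__(self):
        self.releases = []

    def record_release(self, name, rho):
        self.releases.append((name, rho))

    def total_rho(self):
        return sum(rho for _, rho in self.releases)

# Hypothetical usage: two data products drawing on the same underlying data.
accountant = ZCdpAccountant()
accountant.record_release("daily pageviews by country", 0.01)
accountant.record_release("some future release", 0.02)
print(accountant.total_rho())  # cumulative daily privacy cost of both releases
```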
The data is now published online, along with the source code of the client-side filtering and the differentially private pipeline. This was made possible thanks to two important properties of differential privacy.
- Its guarantee holds even if an attacker has access to auxiliary data, so the data can safely be published online, without having to restrict access to trusted or semi-trusted data users.
- It does not rely on “privacy by obscurity”, so all the methods and parameters of the underlying algorithm can be published without additional risk.
We can’t wait to see how Wikipedia editors and others can make use of this open data!
In addition to this data publication milestone, this collaboration was also full of learning opportunities for both parties.
- Engineers and data scientists at the Wikimedia Foundation increased their level of expertise in differential privacy. They led the implementation and deployment work, attended trainings led by Tumult experts, and developed client-side filtering infrastructure that might be reused for further use cases.
- At Tumult Labs, we made good use of the valuable feedback from engineers and data scientists at the Wikimedia Foundation to continuously improve the platform. These improvements, ranging from interface usability to additional features and documentation, are now included in our open-source repository; several more are being worked on.
This project was also a great opportunity to validate the initial hypothesis that differential privacy can be used in practice to publish more data at the Wikimedia Foundation, and quantify privacy risks in a more systematic way. Building on this work, we continued to work with the Foundation's engineers, and we are helping them with additional data releases. Stay tuned for more updates!
Acknowledgements: This work resulted from a collaboration of Cléo Lemoisson, Gabriele Modena, Hal Triedman, Isaac Johnson, and Temilola Adeleye from the Wikimedia Foundation, and Damien Desfontaines, Michael Hay, Skye Berghel, and Tom Magerlein from Tumult Labs. The differential privacy library used in this work is open-source and available on our GitLab repository.
---------------------------------------------
1 This is, of course, a simplification of the actual data schema, and uses fictional examples.
2 The data is also grouped by date, though this does not appear explicitly in the algorithm: instead, the input data is split by date, and the mechanism is run on all the data for each single day.