Social web

Revealing Wikipedia usage data while protecting privacy


Wikipedia’s volunteers want a systematic way to prioritize where to focus their work. Which entries are being read most? By which readers, and where?
Differential privacy (DP) was the technology that solved for the twin, and potentially contradictory, goals of privacy preservation and actionable insight.


summary

Empowering Wikipedia editors to make decisions on what to edit next

Regularly updated usage data is vital for Wikipedia editors seeking to make data-driven decisions about which articles to edit next.
Daily usage data, broken down by subject and geography, also serves researchers studying internet behavior and the information landscape.


Open data is part of the Wikimedia Foundation’s ethos. However, transparency, particularly about the location and behavior of readers, can put individuals’ privacy at risk. The Foundation sought DP as the technology that could surface insight while preserving reader privacy.

key outcomes

- Increase in the number of data points released per day
- Increase in the number of page views released per day
- Spurious rate below 0.1%
- Drop rate below 1%

goals

Publish more data

When data benefits a wide range of stakeholders, making more of it available adds value for many communities.

Publish data more frequently

Sharing data more often can enhance its utility.

Meet the Open Data mandate, safely

Ensure data is shared widely while maintaining strict standards of privacy and security.

our process

A collaborative, calibrated process to ensure utility while maintaining privacy

Proven with industry leaders in the private and public sectors, our process delivers on your specific goals.

Define the problem
Write a problem statement that lays out the rationale for the data release, the plan for releasing the data, the privacy unit, candidate error metrics, and a pseudocode first draft of the algorithm to be used.
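For illustration, a first-draft sketch for a daily pageview release might look like the following. The privacy unit (a single pageview), the Laplace noise, and the threshold value are assumptions chosen for the example, not the finalized Wikimedia algorithm.

```python
import numpy as np

def draft_release(pageviews, epsilon=1.0, threshold=90):
    """First-draft DP aggregation: per-(country, page) daily view counts.

    Assumed privacy unit: one pageview. Each view contributes to exactly
    one group, so adding Laplace(1/epsilon) noise to each count gives
    epsilon-DP per count, composing in parallel across disjoint groups.
    """
    counts = {}
    for country, page in pageviews:  # one record per pageview
        counts[(country, page)] = counts.get((country, page), 0) + 1

    released = {}
    for key, true_count in counts.items():
        # Noise calibrated to sensitivity 1 (one view changes a count by 1).
        noisy = true_count + np.random.laplace(scale=1.0 / epsilon)
        # Suppress small, unreliable counts below the output threshold.
        if noisy >= threshold:
            released[key] = round(noisy)
    return released
```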
Confirm the viability of using DP
Using default hyperparameters, confirm that a differentially private aggregation of the data is possible at all.
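Tumult Analytics, Tumult Labs’ open-source library, makes this check short. A minimal sketch, assuming a Spark DataFrame of raw pageview rows with country and page columns; the input path, budget, and keyset values are placeholders:

```python
from pyspark.sql import SparkSession
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
pageviews = spark.read.parquet("pageviews.parquet")  # hypothetical input

# One Session per private source; the budget caps total privacy loss.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.0),  # default starting point
    source_id="pageviews",
    dataframe=pageviews,
    protected_change=AddOneRow(),  # privacy unit: one pageview row
)

# Public keyset: the group-by combinations allowed to appear in the output.
keys = KeySet.from_dict({
    "country": ["FR", "NG", "US"],           # placeholder values
    "page": ["Earth", "Moon", "Wikipedia"],  # placeholder values
})

query = QueryBuilder("pageviews").groupby(keys).count()
noisy_counts = session.evaluate(query, privacy_budget=PureDPBudget(epsilon=1.0))
noisy_counts.show()
```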
Decide on error metrics to optimize for
Create a set of internal error metrics against which to evaluate the data release. Every differentially private dataset has some noise added, but if the noise needed to provide the privacy guarantee is too great, the dataset may no longer be useful. Two such metrics are the drop rate (the fraction of true rows suppressed by the output threshold) and the spurious rate (the fraction of released rows that exist only because of added noise).
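During tuning, these metrics can be computed by comparing the DP output against the true aggregates, which never leave the trusted environment. A sketch, assuming both are plain dictionaries of per-group counts:

```python
def error_metrics(true_counts, released_counts):
    """Compare a DP release against ground truth (internal use only)."""
    true_keys = set(true_counts)
    released_keys = set(released_counts)

    dropped = true_keys - released_keys    # true groups missing from release
    spurious = released_keys - true_keys   # released groups absent from truth
    common = true_keys & released_keys

    rel_errors = sorted(
        abs(released_counts[k] - true_counts[k]) / true_counts[k]
        for k in common if true_counts[k] > 0
    )
    median_rel_error = rel_errors[len(rel_errors) // 2] if rel_errors else None

    return {
        "drop_rate": len(dropped) / max(len(true_keys), 1),
        "spurious_rate": len(spurious) / max(len(released_keys), 1),
        "median_relative_error": median_rel_error,
    }
```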
Experiment with a wide variety of hyperparameters
Conduct a grid search over hyperparameters (output threshold, noise type and scale, keyset, etc.) until a configuration is found that optimizes the error metrics defined above.
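A sketch of such a search, reusing the hypothetical draft_release and error_metrics helpers from the sketches above; the grids and utility bars are placeholders:

```python
from itertools import product

# Candidate hyperparameter values (placeholders, not the production grid).
epsilons = [0.5, 1.0, 2.0]
thresholds = [10, 50, 100, 500]

best = None
for epsilon, threshold in product(epsilons, thresholds):
    released = draft_release(raw_pageviews, epsilon=epsilon, threshold=threshold)
    metrics = error_metrics(true_counts, released)  # true_counts precomputed

    # Keep only configurations that clear the utility bar, then prefer
    # the smallest epsilon (strongest privacy guarantee) among them.
    if metrics["drop_rate"] < 0.01 and metrics["spurious_rate"] < 0.001:
        if best is None or epsilon < best[0]:
            best = (epsilon, threshold, metrics)

print("selected hyperparameters:", best)
```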
Productionize the pipeline
Turn the finalized aggregation into a production script, integrating error calculation and privacy-loss logging, and automate the job to run on a regular schedule.
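One hedged illustration of the logging half of this step: the automated job appends each run’s privacy loss to an audit ledger. The file name and fields here are assumptions, not the production format.

```python
import json
from datetime import date

def run_daily_release(raw_pageviews, epsilon, threshold,
                      ledger_path="privacy_ledger.jsonl"):
    """Run the finalized DP aggregation and log its privacy loss."""
    released = draft_release(raw_pageviews, epsilon=epsilon, threshold=threshold)

    # Append-only ledger: one JSON line per automated run, so cumulative
    # privacy loss over time can be audited.
    with open(ledger_path, "a") as ledger:
        ledger.write(json.dumps({
            "date": date.today().isoformat(),
            "epsilon": epsilon,
            "threshold": threshold,
            "rows_released": len(released),
        }) + "\n")
    return released
```

A scheduler such as cron or Airflow can then run the job daily.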

the results

“With Tumult Labs' open source software and expertise in technical implementation, the Wikimedia Foundation team is now able to release more granular, equitable, and safe data about how readers are using our platforms.”
- Hal Triedman, Senior Privacy Engineer, Wikimedia Foundation

How do Wikipedia editors decide which parts of the online encyclopedia to improve? As in most modern organizations, an essential part of the decision-making process relies on data. Wikipedia’s editor community seeks to understand which parts of the site are most engaged with, and by whom. This information helps them prioritize where to add content and make other improvements. But it can also put user privacy at risk.


The Wikimedia Foundation sought Tumult’s help in applying DP in a way that would maintain privacy while providing actionable insights.

More data released

Wikimedia is better able to fulfill its Open Data mission

More frequent data releases

Editors gain clearer insight into where to put their attention

Safer data releases

Data serves researchers without risking privacy

Unleash the power and value of your data.