Social web

Revealing Wikipedia usage data while protecting privacy


Wikipedia’s volunteers want a systematic way to prioritize where to focus their work. Which entries are being read most? By which readers, and where?
Differential privacy (DP) was the technology that solved for the twin, and potentially contradictory, goals of privacy preservation and actionable insight.


summary

Empowering Wikipedia editors to make decisions on what to edit next

Regularly updated usage data is vital for Wikipedia editors seeking to make data-driven decisions about which articles to edit next.
Daily usage data, broken down by subject and geography, also serves researchers studying internet behavior and the information landscape.


Open data is part of the Wikimedia Foundation’s ethos. However, transparency, particularly about the location and behavior of readers, can put individuals’ privacy at risk. The Foundation sought DP as the technology that could surface insight while preserving reader privacy.

key outcomes

- Increase in the number of data points released per day
- Increase in the number of page views released per day
- Spurious rate below 0.1%
- Drop rate below 1%

goals

Publish more data

When data benefits a wide range of stakeholders, making more of it available adds value for many communities.

Publish data more frequently

Sharing data more often can enhance its utility.

Meet the Open Data mandate, safely

Ensure data is shared widely while maintaining strict standards of privacy and security.

our process

A collaborative, calibrated process to ensure utility while maintaining privacy

Proven with industry leaders in the private and public sectors, our process delivers on your specific goals.

Define the problem
Write a problem statement that lays out the rationale for the data release, the plan for releasing the data, the privacy unit, candidate error metrics, and a pseudocode first draft of the algorithm to be used.
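For illustration, a first-draft sketch for a daily pageview release might look like the following. The privacy unit (a single pageview), the Laplace noise, and the threshold value are assumptions chosen for the example, not the finalized Wikimedia algorithm.

```python
import numpy as np

def draft_release(pageviews, epsilon=1.0, threshold=90):
    """First-draft DP aggregation: per-(country, page) daily view counts.

    Assumed privacy unit: one pageview. Each view contributes to exactly
    one group, so adding Laplace(1/epsilon) noise to each count gives
    epsilon-DP per count, composing in parallel across disjoint groups.
    """
    counts = {}
    for country, page in pageviews:  # one record per pageview
        counts[(country, page)] = counts.get((country, page), 0) + 1

    released = {}
    for key, true_count in counts.items():
        # Noise calibrated to sensitivity 1 (one view changes a count by 1).
        noisy = true_count + np.random.laplace(scale=1.0 / epsilon)
        # Suppress small, unreliable counts below the output threshold.
        if noisy >= threshold:
            released[key] = round(noisy)
    return released
```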
Confirm the viability of using DP
Using default hyperparameters, confirm that a differentially private aggregation of the data is possible at all.
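Tumult Analytics, Tumult Labs’ open-source library, makes this check short. A minimal sketch, assuming a Spark DataFrame of raw pageview rows with country and page columns; the input path, budget, and keyset values are placeholders:

```python
from pyspark.sql import SparkSession
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
pageviews = spark.read.parquet("pageviews.parquet")  # hypothetical input

# One Session per private source; the budget caps total privacy loss.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.0),  # default starting point
    source_id="pageviews",
    dataframe=pageviews,
    protected_change=AddOneRow(),  # privacy unit: one pageview row
)

# Public keyset: the group-by combinations allowed to appear in the output.
keys = KeySet.from_dict({
    "country": ["FR", "NG", "US"],           # placeholder values
    "page": ["Earth", "Moon", "Wikipedia"],  # placeholder values
})

query = QueryBuilder("pageviews").groupby(keys).count()
noisy_counts = session.evaluate(query, privacy_budget=PureDPBudget(epsilon=1.0))
noisy_counts.show()
```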
Decide on error metrics to optimize for
Create a set of internal error metrics against which to evaluate the data release. Every differentially private dataset has some noise added, but if the noise needed to provide the privacy guarantee is too great, the dataset may no longer be useful. Two such metrics are the drop rate (the fraction of true rows suppressed by the output threshold) and the spurious rate (the fraction of released rows that exist only because of added noise).
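During tuning, these metrics can be computed by comparing the DP output against the true aggregates, which never leave the trusted environment. A sketch, assuming both are plain dictionaries of per-group counts:

```python
def error_metrics(true_counts, released_counts):
    """Compare a DP release against ground truth (internal use only)."""
    true_keys = set(true_counts)
    released_keys = set(released_counts)

    dropped = true_keys - released_keys    # true groups missing from release
    spurious = released_keys - true_keys   # released groups absent from truth
    common = true_keys & released_keys

    rel_errors = sorted(
        abs(released_counts[k] - true_counts[k]) / true_counts[k]
        for k in common if true_counts[k] > 0
    )
    median_rel_error = rel_errors[len(rel_errors) // 2] if rel_errors else None

    return {
        "drop_rate": len(dropped) / max(len(true_keys), 1),
        "spurious_rate": len(spurious) / max(len(released_keys), 1),
        "median_relative_error": median_rel_error,
    }
```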
Experiment with a wide variety of hyperparameters
Conduct a grid search over hyperparameters (output threshold, noise type and scale, keyset, etc.) until a configuration is found that optimizes the error metrics defined above.
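A sketch of such a search, reusing the hypothetical draft_release and error_metrics helpers from the sketches above; the grids and utility bars are placeholders:

```python
from itertools import product

# Candidate hyperparameter values (placeholders, not the production grid).
epsilons = [0.5, 1.0, 2.0]
thresholds = [10, 50, 100, 500]

best = None
for epsilon, threshold in product(epsilons, thresholds):
    released = draft_release(raw_pageviews, epsilon=epsilon, threshold=threshold)
    metrics = error_metrics(true_counts, released)  # true_counts precomputed

    # Keep only configurations that clear the utility bar, then prefer
    # the smallest epsilon (strongest privacy guarantee) among them.
    if metrics["drop_rate"] < 0.01 and metrics["spurious_rate"] < 0.001:
        if best is None or epsilon < best[0]:
            best = (epsilon, threshold, metrics)

print("selected hyperparameters:", best)
```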
Productionize the pipeline
Turn the finalized aggregation into a production script, integrating error calculation and privacy-loss logging, and automate the job to run on a regular schedule.
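One hedged illustration of the logging half of this step: the automated job appends each run’s privacy loss to an audit ledger. The file name and fields here are assumptions, not the production format.

```python
import json
from datetime import date

def run_daily_release(raw_pageviews, epsilon, threshold,
                      ledger_path="privacy_ledger.jsonl"):
    """Run the finalized DP aggregation and log its privacy loss."""
    released = draft_release(raw_pageviews, epsilon=epsilon, threshold=threshold)

    # Append-only ledger: one JSON line per automated run, so cumulative
    # privacy loss over time can be audited.
    with open(ledger_path, "a") as ledger:
        ledger.write(json.dumps({
            "date": date.today().isoformat(),
            "epsilon": epsilon,
            "threshold": threshold,
            "rows_released": len(released),
        }) + "\n")
    return released
```

A scheduler such as cron or Airflow can then run the job daily.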

the results

“With Tumult Labs' open source software and expertise in technical implementation, the Wikimedia Foundation team is now able to release more granular, equitable, and safe data about how readers are using our platforms.”
- Hal Triedman, Senior Privacy Engineer, Wikimedia Foundation

How do Wikipedia editors decide which parts of the online encyclopedia to improve? As in most modern organizations, an essential part of the decision-making process relies on data. Wikipedia’s editor community seeks to understand which parts of the site are most engaged with, and by whom. This information helps them prioritize where to add content and make other improvements. But it can also put user privacy at risk.


The Wikimedia Foundation sought Tumult’s help in applying DP in a way that would maintain privacy while providing actionable insights.

More data released

Wikimedia is better able to fulfill its Open Data mission

More frequent data releases

Editors gain clearer insight into where to put their attention

Safer data releases

Data serves researchers without risking privacy

Unleash the power and value of your data.