The BPLIM workflow for anonymizing confidential data for research

Asjad Naqvi
The Stata Gallery
Apr 23, 2024

This article has been jointly written with Paulo Guimarães and Gustavo Iglésias from the BPLIM team at the Banco de Portugal.

Disclaimer: The article presents the views of the authors and not necessarily those of their respective institutions.

Data fuels the engine of research

As research becomes more and more data intensive, access to useful public datasets is becoming increasingly difficult. This is especially true for countries with strong data privacy laws (for example, in the EU) and/or where an unwillingness to share data, for various reasons, is not uncommon.

To surmount this obstacle, some research data centers are exploring methods to provide access to confidential datasets for research. This is the case of the Banco de Portugal Microdata Research Laboratory (BPLIM), created in 2016 with the primary goal of giving external researchers from all over the world access to the micro datasets collected and maintained by the Portuguese central bank.

Initially, analysis of confidential datasets by external researchers involved remote access to a perturbed version of the original datasets on the servers of the bank. This access allowed researchers to prepare their scripts, which were then run on the original data by BPLIM staff (see this Harvard Data Science Review (HDSR) article for a brief history of BPLIM along with a description of the general workflow for data access).

More recently, BPLIM has developed tools that allow external researchers to prepare their scripts for confidential datasets without the need to access the bank's servers. The process involves the creation of representative, anonymized pseudo datasets based on the original proprietary and confidential micro data. These pseudo datasets respect the meta data structure of the original data. What BPLIM shares with researchers is not the actual pseudo datasets but the code (dofiles) that produces them. This makes the process transparent and lets researchers customize the pseudo data according to their needs. The pseudo data is used to develop the scripts, which are then sent to BPLIM and run on the original data by BPLIM staff. After standard output control, the results are shared with the external researchers.

These procedures benefit both the data provider(s), since the identity of respondents is never disclosed, and the researcher(s), who can develop their analysis on a representative pseudo dataset and still obtain answers to pertinent policy-relevant questions from the original confidential dataset.

At the time of writing this article, the procedures described below represent the forefront of how public institutions can more easily give access to confidential data in exchange for cutting-edge research analysis.

Workflow

Through an integrated and iterative process, BPLIM works closely with researchers to produce and analyze confidential datasets. The process is summarized below:

The broad steps are defined as follows:

  • Step 1: The researcher visits the BPLIM website to check the catalog of microdata sets, including variable names and other metadata.
  • Step 2: The researcher(s) send a request to BPLIM to access one or more micro datasets, stating their broad research question.
  • Step 3: BPLIM processes the data to generate a package that produces the pseudo representative datasets.
  • Step 4: The researcher creates the pseudo representative datasets and conducts the analysis on them, storing the whole process in one or more scripts.
  • Step 5: The code is handed over to BPLIM, which runs the script(s) on the actual data.
  • Step 6: After output control, the results are shared with the researcher.

An Example

In this section we present a simple example of how BPLIM creates the pseudo data, based on the publicly available Stata panel dataset nlswork.dta.

At BPLIM, the process of creating pseudo data is usually more complex, as researchers often work across multiple datasets with different time spans and distinct linking identifiers. When preparing the pseudo data, BPLIM staff must take those linkages into account and ensure the representativeness of units (people, companies, etc.) over time. In this example, we show how to generate a pseudo dataset for a single panel, ignoring the difficulties that arise when dealing with multiple datasets.

In our data pipeline, there are three main steps for creating pseudo data:

  1. Produce a meta data file with statistics.
  2. Create a sample of unit identifiers (there may be more than one identifying variable).
  3. Generate the dofile that creates the pseudo data.

We will demonstrate how to complete each of these steps. If you want to follow along, please make sure you install the BPLIM Stata commands metaxl, dummyfi, and sampleid on your system:

** metaxl
net install metaxl, from("https://github.com/BPLIM/Tools/raw/master/ados/General/metaxl/")

** dummyfi and sampleid
net install dummyfi, from("https://github.com/BPLIM/Tools/raw/master/ados/General/dummyfi/")

The programs can be downloaded from the official BPLIM GitHub repository. There you will also find user guides for the metaxl and dummyfi commands.

Now let’s go step by step through how to use these programs to generate a pseudo dataset:

Step 1: The original confidential data

First, let’s import the data we want to work with into Stata:

webuse nlswork, clear

The imported data contains several variables, as we can see in the image below:

Here, the most relevant variables for the pipeline are idcode and year, the panel id and time variables respectively, which allow us to replicate the time structure of the original data.
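
If you want to confirm the panel structure of the imported data before building the pseudo data, a couple of standard Stata commands (not part of the BPLIM pipeline) do the job:

** optional sanity check on the panel structure
describe
xtset idcode year
xtdescribe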

Step 2: Create a meta data file

The first step in the data pipeline is to extract the meta data with summary statistics:

metaxl stats, save(meta, replace) panel(idcode) time(year)

The above command saves the file meta.xlsx in the current working directory. The file contains multiple worksheets, each with information about the data. Below is a capture of the variables worksheet:

The variables worksheet contains information about the variable labels and value labels (for each language defined), the type, the format, and the number of characteristics and notes. The additional information produced with metaxl stats is the set of statistics for each variable: by default, the mean, standard deviation, 5th, 50th, and 95th percentiles, and the shares of zeros, negative values, and missing values are exported.

Since we provided a panel variable and a time variable, the worksheet also includes the share of time invariant observations and the minimum and maximum dates per variable.
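
If you prefer to check the exported statistics without leaving Stata, you can read the worksheet back in with import excel. The worksheet name "variables" below is an assumption based on the capture above; adjust it if your version of metaxl names the sheet differently:

** read the variables worksheet of the meta data file back into Stata
preserve
import excel using "meta.xlsx", sheet("variables") firstrow clear
list in 1/5, abbreviate(12)
restore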

The image shows that the variable race has a value label, racelbl, defined. Each value label has its own worksheet, where we find every value defined and the corresponding label. The name of the worksheet is the name of the value label prepended with vl_, in this case vl_racelbl:

Since we used metaxl stats, the frequencies for each level of the categorical variable are also exported and saved under freq_race. The variable name is part of the column name because a value label may be ascribed to more than one variable, so there can be multiple frequency columns.
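
These frequencies can be cross-checked against the original data with a simple tabulation; the shares exported by metaxl stats should mirror this distribution:

** frequencies of race in the original data
tabulate race
tabulate race, nolabel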

All the information about variables that we have seen above will be used to create the pseudo data set.

Step 3: Sampling IDs

The next step is to sample the ids (unique unit identifiers). The units may be identified by a single variable or a combination of variables. In the case at hand, the variable idcode identifies individual units, from which we generate our sample.

Since we are working with longitudinal data, we must specify a time variable, which is year in our case. Setting a time variable allows sampled ids to maintain their time-specific characteristics, i.e., selected units will keep their temporal profile in the pseudo dataset.

Let’s generate a 10% sample:

sampleid idcode, sample(10) time(year) save(nlswork_ID) replace

The command above saves the sampled units in nlswork_ID.dta:

where we see that the time structure for each sampled unit id is also preserved in the saved file.
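
As a quick optional check, you can load the sampled id file and verify that roughly 10% of the units were kept and that each unit retains its original years:

** inspect the sampled ids (optional)
preserve
use nlswork_ID, clear
codebook idcode, compact
list in 1/10
restore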

Step 4: Generate a dofile to create pseudo data

Having extracted the metadata and sampled the ids, we have everything we need to create a do-file that will generate the pseudo data. The do-file is created with the command dummyfi:

dummyfi idcode, meta(meta)   ///
    masterid(nlswork_ID)     ///
    time(year)               ///
    do(gen_dummy)            ///
    name(nlswork_dummy)      ///
    replace

The identifying variable(s), idcode in this case, is the first argument of the command. The meta data file and the sampled ids file are provided in the options meta and masterid, respectively. The option time sets the time variable (year), and the option do sets the name of the dofile that will be created. In the option name we specify the file name under which the pseudo data will be saved when we run gen_dummy.do.

The following image displays the content of gen_dummy.do:

If you look closely at the contents of the dofile and at the images of the meta data file, you will find that the statistics, frequencies, shares of zeros, negative values, and missing values, share of time invariant observations, and minimum and maximum dates dictate how the data is going to be created. For example, the categories of variable race will have frequencies similar to those reported in the meta data file (see lines 32, 33, and 34). According to the meta data file, variable birth_yr is time invariant, so it will also be generated as such in the pseudo data (line 19). Finally, the minimum date for variable union is 70, and this will also be the case in the pseudo data (line 87).

Moreover, the last command of the do-file uses metaxl to apply the meta data that was saved in meta.xlsx to the pseudo data set to ensure that it shares the same meta data as the original file.
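
With everything in place, the pseudo data is produced by simply running the generated dofile. The sketch below assumes that gen_dummy.do saves the pseudo data as nlswork_dummy.dta (the name passed to the name() option) in the current working directory:

** run the generated dofile and take a look at the pseudo data
do gen_dummy.do
use nlswork_dummy, clear
describe, short
summarize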

The fact that the output of the command dummyfi is a dofile that generates the pseudo data gives the user a lot of flexibility. Researchers can modify the dofile to tailor the pseudo data to their needs, for example to merge it with other datasets, generate derived variables, or introduce correlations between variables.
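
As a purely hypothetical illustration of that flexibility, a researcher could append a few lines to gen_dummy.do (or run them after it) to add derived variables needed for the analysis; the variable names follow nlswork:

** illustrative additions by the researcher, run after the pseudo data is created
gen age2 = age^2
label variable age2 "age squared"
gen exp_tenure = ttl_exp * tenure
label variable exp_tenure "total work experience interacted with job tenure"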

Conclusions

The methods described in this guide can be used and adapted by a range of institutions, organizations, and ministries that are concerned about data privacy or generally unwilling to share raw data. They can also be used by authors to construct replication packages when the original data cannot be shared, allowing the code to run on a dataset with a similar structure.

By adopting the streamlined workflow described in this guide, both data-providing institutions and researchers can benefit from the research and results generated without compromising stringent data sharing protocols.

About the Authors

Asjad Naqvi is an economist based in Vienna, Austria. He is the editor of the Stata Guide and the Stata Gallery on Medium, and has written several Stata packages and guides.

Paulo Guimarães is head of the Banco de Portugal Microdata Research Laboratory (BPLIM). He is an economist and a long-time Stata user.

Gustavo Iglésias is a Data Analyst at Banco de Portugal Microdata Research Laboratory (BPLIM) and the creator of several Stata packages.
