Is there a kink in the link? Exploring corrections and errors in multiply linked data sets

Friday 20th November, 10:00 - 11:30

How unequal is Croatia? Results from combined survey and administrative tax data

Dr Marko Ledic (Faculty of Economics Zagreb) - Presenting Author
Dr Ivica Rubli (Institute of Economics Zagreb)
Dr Ivica Urban (Institute for Public Finance Zagreb)
Mr Slavko Bezeredi (Institute for Public Finance Zagreb)

Over the last two decades, the income inequality literature that builds its findings on survey data has faced significant criticism. The rationale for this criticism lies in i) the unrepresentative distribution of income in survey data and ii) the frequent tendency of respondents to under-report their incomes. As a consequence, estimates of income inequality that rely on survey data alone are very likely to be biased. To avoid this problem, an increasing number of researchers focus on another type of data source, tax data: administrative data on taxpayers' income, collected by the state authorities to establish tax liability. Tax data make it possible to analyze individuals at the very top of the income distribution in a more precise and comprehensive way.

In this paper we focus on obtaining more credible income distribution data for Croatia, which we use to analyze income inequality. We combine the household microdata from the EU-SILC (European Union Statistics on Income and Living Conditions) with administrative tax data, which contain information on taxpayers, who represent a part of the total population. The corrected survey data will serve as the input data for the Croatian microsimulation model (miCROmod), which will enable us to simulate disposable income, taxes and social benefits in an impartial manner. Since the corrected survey data will retain the original structure of the survey data, i.e., preserve representativeness with respect to all other characteristics of individuals and households, we can use these characteristics in addition to income in the analysis of taxes and social benefits using miCROmod.

Although miCROmod is a very important tool for the simulation of direct taxes and social benefits, the distributional effects of the entire tax-benefit system cannot be evaluated because the model does not include indirect taxes. To improve the analysis of the Croatian tax-benefit system, we introduce indirect taxes into miCROmod in two steps. In the first step, we estimate households’ consumption based on microdata on household consumption (APK). Using machine learning methods, we estimate Engel curves for the different categories of consumption available in the APK data, and these Engel curves are used to impute consumption into the EU-SILC. The second step consists of coding the rules for indirect taxation in miCROmod, which can then be run on data containing information on both income and consumption, allowing us to simulate the different types of indirect taxes that households pay. With this unified data set and microsimulation model at hand, we will carry out a microsimulation of a joint direct and indirect tax reform in Croatia. This research also makes a more general contribution to the international literature: the overwhelming majority of research has been carried out at the macroeconomic level, and only a few studies have used a comprehensive system of direct and indirect taxes integrated into a microsimulation model.
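
As an illustration of the first step only, the sketch below fits an Engel curve for one consumption category on consumption-survey data and uses it to impute spending for households in an income survey. It is a minimal sketch under stated assumptions, not the authors' method: the abstract mentions machine learning methods, whereas this example swaps in an ordinary least squares fit of the Working-Leser form (budget share = a + b * ln(income)). The function names and all numbers are hypothetical.

```python
import math


def fit_engel_curve(incomes, shares):
    """Fit share = a + b * ln(income) by ordinary least squares.

    incomes: household incomes from the consumption survey
    shares:  budget share spent on one consumption category (0..1)
    Assumption: a simple Working-Leser form stands in for the
    machine learning estimator mentioned in the abstract.
    """
    xs = [math.log(y) for y in incomes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_w = sum(shares) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxw = sum((x - mean_x) * (w - mean_w) for x, w in zip(xs, shares))
    b = sxw / sxx
    a = mean_w - b * mean_x
    return a, b


def impute_spending(income, a, b):
    """Predict category spending for a household in the income survey."""
    share = a + b * math.log(income)
    share = min(max(share, 0.0), 1.0)  # keep the predicted share in [0, 1]
    return share * income


# Hypothetical consumption-survey records: (income, food budget share)
apk_incomes = [4000.0, 6000.0, 9000.0, 14000.0]
apk_food_shares = [0.42, 0.37, 0.33, 0.28]

a, b = fit_engel_curve(apk_incomes, apk_food_shares)
# Impute food spending for a hypothetical EU-SILC household
imputed_food = impute_spending(7500.0, a, b)
```

In practice one curve would be fitted per consumption category, and additional household characteristics would enter the model; the imputed amounts are then the base to which indirect tax rules can be applied.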

Does administrative health register data provide reliable information for health indicators in Finland?

Dr Hanna Tolonen (Finnish Institute for Health and Welfare) - Presenting Author
Dr Jaakko Reinikainen (Finnish Institute for Health and Welfare)
Professor Tiina Laatikainen (Finnish Institute for Health and Welfare)
Dr Päivikki Koponen (Finnish Institute for Health and Welfare)

Information about the health status of the population can be obtained through surveys and through registers such as hospital discharge or primary care registers. Surveys are expensive to conduct while data from hospitalizations can be generated automatically from e-health records without any additional data collection costs. But do these two different data sources cover the same target population and how good is their comparability for different health indicators?

In Finland, population-based health examination surveys have been conducted regularly since the early 1970s, the latest being the FinHealth survey in 2017. It covered the adult population aged 18 and over living in mainland Finland. A random sample of 10,000 persons was selected from the national population register. 58% of invitees participated in the survey, and their survey data were linked to the administrative health records by personal identification code, based on written informed consent. The combined data allowed us to compare the prevalence of several health indicators for the same individuals, derived separately from survey data and from register data.

The comparability of the two data sources varied substantially between the studied health indicators. Good comparability was observed for diabetes (survey 9% vs. registers 9%, Cohen's kappa = 0.83), coronary heart disease (survey 10% vs. registers 9%, kappa = 0.75), and asthma (survey 9% vs. registers 10%, kappa = 0.72), while poor comparability was observed for obesity (survey 26% vs. registers 3%, kappa = 0.11), hypertension (survey 46% vs. registers 17%, kappa = 0.33), and depression (survey 7% vs. registers 4%, kappa = 0.45).
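
To make the agreement measure concrete, the following sketch computes Cohen's kappa for one binary health indicator classified twice for the same persons, once from the survey and once from the registers. The counts are hypothetical and chosen only for illustration; they are not the FinHealth data.

```python
def cohens_kappa(both, survey_only, register_only, neither):
    """Cohen's kappa for two binary classifications of the same persons.

    both:          condition present in survey and in register
    survey_only:   present in survey, absent in register
    register_only: absent in survey, present in register
    neither:       absent in both sources
    """
    n = both + survey_only + register_only + neither
    observed = (both + neither) / n            # observed agreement
    p_survey = (both + survey_only) / n        # prevalence by survey
    p_register = (both + register_only) / n    # prevalence by register
    # Agreement expected by chance if the two sources were independent
    expected = p_survey * p_register + (1 - p_survey) * (1 - p_register)
    return (observed - expected) / (1 - expected)


# Hypothetical 2x2 table for 1,000 linked participants
kappa = cohens_kappa(both=80, survey_only=20, register_only=10, neither=890)
```

Two sources can report nearly identical prevalences and still have a kappa well below 1, because kappa measures agreement person by person rather than agreement of the totals.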

Administrative health registers provide the most reliable data for chronic health conditions such as diabetes, coronary heart disease and asthma, which require regular medication and monitoring by health care professionals. Regarding obesity, height and weight are currently not regularly measured and recorded in health services, and information can be obtained only from ICD codes for obesity-related diagnoses. For hypertension, survey results may be slightly overestimated because they rely on a single measurement occasion with three consecutive measurements, which is not sufficient for a clinical diagnosis of hypertension. On the other hand, it is difficult to identify hypertensive persons from registers because measurements are often not recorded and because the medications used to treat hypertension also have various other indications.

It should also be remembered that administrative health registers cover only those seeking medical care and not the entire general population, as surveys do. Survey results may, of course, also be biased due to selective non-response, and their validity depends on how well this can be adjusted for in the estimation.

Linking surveys and digital trace data: Experiences from two pilot studies on factors influencing informed consent

Dr Henning Silber (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Dr Johannes Breuer (GESIS - Leibniz Institute for the Social Sciences)
Mr Christoph Beuthner (GESIS - Leibniz Institute for the Social Sciences)
Dr Pascal Siegers (GESIS - Leibniz Institute for the Social Sciences)
Dr Bernd Weiß (GESIS - Leibniz Institute for the Social Sciences)
Dr Sebastian Stier (GESIS - Leibniz Institute for the Social Sciences)
Professor Florian Keusch (University of Mannheim)
Dr Tobias Gummer (GESIS - Leibniz Institute for the Social Sciences)

In recent years, survey researchers and substantive researchers have begun to collect and combine data from multiple sources. The combination of survey data with digital trace data (e.g., from social media, smartphone apps, or web browsing activities) appears especially promising. Since every data domain has its specific strengths and weaknesses, the combination of multiple data sources holds the promise that one data source can help to minimize the measurement error of another data source and vice versa. However, research on both the collection and the combination of surveys and digital trace data is still at an early stage, so studies that inform researchers on how these combined data can be collected are highly relevant. One key issue is obtaining informed consent from the participants for linking the data.
We will report results from two studies conducted in Germany (2018 and 2019) that both used web survey methodology to collect multiple additional data sources. In Study 1, a non-probability online survey (N ~ 3,000 respondents) was used to collect Twitter and health app data. The study included an experimental design in which a random half of the respondents with a Twitter account (about one-third of the sample) was asked for their Twitter handle, while the other half was asked to download their Twitter data and upload it via an online data-sharing platform (a new methodological approach that has been labeled data donation). Data donation was also used to ask the remaining respondents (about two-thirds of the sample) to share their smartphone (iPhone or Samsung) health app data. In Study 2, a non-probability sample (N ~ 2,000 respondents) was likewise used to collect Twitter, Facebook, Spotify, and web tracking data (including mobile browsing and app usage). Both studies included incentive experiments and additional survey questions on attitudes and values regarding privacy and data sharing.
The first results of Study 1 show that respondents were more likely to provide their Twitter handle than to donate their Twitter data. Regarding health data, respondents were more willing to donate their data when using an iPhone than when using a Samsung mobile device. Both findings suggest that respondents’ effort is one of the key variables affecting their data sharing behavior. The first results from Study 2 show that respondents were more willing to share their Twitter and Spotify data than their Facebook data. Female respondents were less likely to share their social media data, and privacy concerns influenced the willingness to share Facebook data. The number of usable social media data sets was also likely affected by the effort required to share the data (which was higher for Facebook).
Our presentation will discuss best practice advice for informed consent as well as lessons learned from these studies on nonresponse and on factors driving dropout at various stages of the data collection process. We will also illustrate the high research potential of combining surveys and digital trace data by presenting teaser results from the areas of political communication and health research.

Simulation as a tool for examining data fusion

Dr Benjamin Williams (University of Denver) - Presenting Author
Dr Lynne Stokes (Southern Methodist University)

Combining multiple sources of data has become increasingly important in the survey science realm. This is particularly important when attempting to harness big data and/or non-probability surveys. Combining survey data with a big data source can strengthen the inferences made, but implementing data blending can be a difficult task.

One standard method for merging datasets, usually used when the records refer to people, is record linkage. Record linkage combines two or more datasets which lack a unique identifier. It employs a probabilistic linking algorithm that relies on information shared between the datasets which should agree if two records refer to the same entity. There is robust research on record linkage techniques and methodology, but little has been done to understand the errors which propagate through the algorithm due to its underlying randomness.

Often, the datasets refer to people, but when they do not, it can be difficult to clerically review potential links to determine false-positive and false-negative matching rates. This makes determining the appropriate threshold values, used to determine which potential links should be matches and which should be non-matches, quite difficult.
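
The role of the threshold values described above can be sketched with a Fellegi-Sunter style scoring rule, a common formulation of probabilistic record linkage. This is an illustrative assumption, not the algorithm the authors created: the comparison fields, the m- and u-probabilities, and the thresholds below are all hypothetical.

```python
import math

# Hypothetical agreement probabilities per comparison field:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELDS = {
    "trip_date": (0.95, 0.05),
    "landing_site": (0.90, 0.10),
    "catch_count": (0.80, 0.20),
}


def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over all comparison fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total


def classify(weight, lower, upper):
    """Turn a weight into a link decision using two thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"  # would normally go to clerical review


# Two hypothetical trip reports that agree on the date only
report = {"trip_date": "2019-06-01", "landing_site": "A", "catch_count": 4}
intercept = {"trip_date": "2019-06-01", "landing_site": "B", "catch_count": 6}
decision = classify(match_weight(report, intercept), lower=-2.0, upper=5.0)
```

Raising the upper threshold trades false-positive links for false-negative ones. Without clerical review of the middle band, the consequences of that trade-off are hard to observe directly, which is the motivation for studying them by simulation.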

In this paper, we simulate a record-linkage scenario to mimic a data blending operation studied on behalf of the US National Oceanic and Atmospheric Administration (NOAA). NOAA is investigating a capture-recapture program to estimate the fish harvested by recreational anglers. The capture sample is a non-probability sample gathered from self-reporting on electronic devices, and the recapture sample is an in-person survey of anglers completing a fishing trip.
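
For readers unfamiliar with capture-recapture, the textbook estimators below show why linkage quality matters in such a design. They are a generic sketch, not the estimator used in the NOAA program, which the abstract does not specify; the counts are hypothetical.

```python
def lincoln_petersen(n_capture, n_recapture, n_matched):
    """Classic capture-recapture estimate of a population total.

    n_capture:   units in the first (capture) sample
    n_recapture: units in the second (recapture) sample
    n_matched:   units found in both samples after linkage
    """
    if n_matched <= 0:
        raise ValueError("at least one matched unit is required")
    return n_capture * n_recapture / n_matched


def chapman(n_capture, n_recapture, n_matched):
    """Chapman's variant, which is less biased in small samples."""
    return (n_capture + 1) * (n_recapture + 1) / (n_matched + 1) - 1


# Hypothetical counts: 200 self-reported trips, 150 intercepted trips,
# 30 trips linked between the two samples
estimate = lincoln_petersen(200, 150, 30)

# If linkage misses 5 true matches (false negatives), the matched count
# drops and the estimated total is inflated
estimate_with_missed_links = lincoln_petersen(200, 150, 25)
```

Because the matched count sits in the denominator, false-negative links push the estimate up and false-positive links push it down, so linkage errors translate directly into bias in the estimated total.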

In our paper, we first simulate the fish population and the two samples, and then employ a record linkage algorithm we created to blend the data sources. Because this is a simulation, we are able to examine the effects that data quality and record linkage threshold values have on the bias and variance of estimators of the total. The simulation is large-scale and took nearly 4 days on a university supercomputer. We believe this simulation has value for others attempting to perform matching with no clear method for clerical review. It gives a way to forecast the effects of a wide variety of non-sampling errors that may arise in any data blending setting. We believe our methods are an important application of statistical simulation and offer a way to harness computational power in the survey science community.

Combining multiple data sources with synthetic populations. Applications to predictions and alleviation of privacy concerns.

Mr James Rineer (RTI International) - Presenting Author
Mr Georgiy Bobashev (RTI International)
Mr Sam Adams (RTI International)

Many modern research questions require knowledge acquired from multiple datasets. This need arises when obtaining a single dataset is either difficult or impossible, or when the data already exist in multiple data sources. Examples include: predicting the effect of new cancer screening on the US population, identifying obesity hotspots (i.e. geographic areas with an unusually high prevalence of high-BMI individuals), predicting the effect of interventions aimed at reducing opioid-related deaths, etc. (cite colorectal, breast and cervical cancer models, obesity, stunting, opioid). We present an approach for creating an individual-level analysis dataset by probabilistically linking multiple datasets of different kinds, such as administrative data, surveys, clinical data, medical records, etc. The key component of our approach is a synthetic population developed at RTI. Synthetic populations are statistically and spatially accurate representations of persons, their families, and their related social structure. Researchers can map variables from different datasets onto a synthetic population, resulting in a dataset that contains information from a variety of sources. This individual-level dataset is sufficient to produce reliable statistical inference with quantifiable uncertainty while still adhering to data privacy restrictions. However, the choice of method for mapping the variables can considerably affect the accuracy of the predictions. We describe three methods for linking datasets with synthetic data: resampling, modeling predictors independently, and modeling predictors sequentially. The resulting datasets can then be used for estimation (e.g. of geographic hotspots) or for prediction of future outcomes with microsimulations or agent-based models. We provide examples of such linkage from cancer, obesity, and substance use research.
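
The abstract does not detail the three mapping methods. As one possible reading of the resampling approach, the sketch below assigns each synthetic person a value drawn at random from survey respondents in the same demographic cell. The function name, the cell variables, and the records are hypothetical, and this should not be taken as the RTI implementation.

```python
import random
from collections import defaultdict


def resample_onto_synthetic(synthetic, survey, keys, variable, seed=0):
    """Copy `variable` onto synthetic persons by within-cell resampling.

    synthetic: list of dicts describing synthetic persons
    survey:    list of dicts with the same `keys` plus `variable`
    keys:      characteristics that define a matching cell
    Returns new dicts; persons in cells without donors get None.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    donors = defaultdict(list)
    for row in survey:
        donors[tuple(row[k] for k in keys)].append(row[variable])

    result = []
    for person in synthetic:
        pool = donors.get(tuple(person[k] for k in keys))
        value = rng.choice(pool) if pool else None
        result.append({**person, variable: value})
    return result


# Hypothetical synthetic persons and survey respondents
synthetic_people = [
    {"id": 1, "age_group": "18-39", "sex": "F", "tract": "T01"},
    {"id": 2, "age_group": "40-64", "sex": "M", "tract": "T02"},
]
survey_rows = [
    {"age_group": "18-39", "sex": "F", "bmi": 22.0},
    {"age_group": "18-39", "sex": "F", "bmi": 27.5},
    {"age_group": "40-64", "sex": "M", "bmi": 31.0},
]
linked = resample_onto_synthetic(
    synthetic_people, survey_rows, keys=["age_group", "sex"], variable="bmi"
)
```

Repeating the draw with different seeds gives multiple plausible datasets, which is one way to quantify the uncertainty introduced by the mapping. No real respondent is attached to a real location, since the geography comes from the synthetic population.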