Program


Friday 6th November Friday 13th November Friday 20th November Friday 27th November
Friday 4th December


Poster session 1

* Marked poster recordings will not be available beyond Nov. 20
Friday 20th November, 10:00 - 11:30

Pre-testing data collection apps. Lessons learned from three app development projects.

Ms Deirdre Giesen (Statistics Netherlands) - Presenting Author
Mr Stefan Theunissen (Statistics Netherlands)

Pre-testing a new or revised questionnaire is generally considered good survey practice. Different pre-test methods can be used, depending on the stage of the questionnaire design process, the goals of the tests and resources available. Typically, pre-tests of questionnaires focus in some way or another on assessing the validity and user friendliness of the measurement instrument.
Compared to designing questionnaires, designing data collection apps is a rather new field. We hold that when designing data collection apps it is at least as important as in the questionnaire design process to collect feedback from potential respondents. This feedback can provide valuable information on if and how the app should be improved in order to help respondents understand and perform the response task, improve the validity of the collected data and the user-friendliness of the app.
At Statistics Netherlands three different data collection apps have been developed so far: a transportation app, a household budget survey app and a time use app. These apps offer a hybrid form of data collection as they combine questionnaire-type questions with data collected via sensors (e.g. location measurement, taking pictures) and via linking of external data sources (e.g. scanner data, banking account data). As part of the development process we have conducted some small-scale pre-tests that were a mix of cognitive and usability testing. We tested both with working apps and with clickable prototypes.
When designing these tests, new issues came up that may be typical for testing data collection apps. For example: should we restrict recruitment to test respondents who have experience with installing apps; should we test on the respondents’ devices; for which variety of devices should we test; should we test outside the lab to test sensor measurements in real-life conditions; are several contacts over time necessary to tap both the experience of the first use of the app and the experiences of extended use; and can we test not-yet-implemented data linkage with mock-up designs?
In this paper we will summarize the designs of each test and discuss some of the main issues in the design, the choices we made and our experiences. With this paper we hope to contribute to sharing and building knowledge on best practices in testing data collection apps.


No hope? More dope: A Study of the relationship between depression and illicit drug use in the United States using NHANES data

Miss Catherine Pollack (Dartmouth College) - Presenting Author
Miss Briana Krewson (Philadelphia College of Osteopathic Medicine)

Context: Understanding of the association between depression and illicit drug use may lead to enhanced proactive behavioral assistance in order to reduce cases of repeated- and new-use of illicit drugs.

Objective: To determine the association between depression and illicit drug use in the general US population (with subgroup analyses in pregnant women and obese individuals).

Design, Setting, & Participants: Cross-sectional data from the National Health and Nutrition Examination Survey (NHANES) over 12 years (2005-2016) were used to assess 18,178 individuals, representing 80,340,447 non-institutionalized US adults over 18 years of age, who were questioned about illicit drug use and took a depression screening test.

Measures: The primary exposure was depression as measured by the PHQ-9 scale, and the main outcome was whether the participant had tried one of four illicit drugs (marijuana, cocaine, heroin, or methamphetamine) at least once.

Analysis: Weighted logistic regression both aggregated and stratified by year for each drug, logistic regression stratified by BMI and pregnancy, and linear regression for frequency of each drug use in last 30 days.

Results: Aggregated by year, each unit increase in depression was associated with higher odds of marijuana, heroin, and methamphetamine use, but lower odds of cocaine use. While the relationship between depression and each drug was relatively consistent between years, each drug had certain years in which individuals were significantly more likely to use drugs for the same depression score. BMI was an overall effect modifier on the relationship between depression and every drug when aggregated by year, but yearly analysis suggested this modification was only significant in certain years.

Limitations: Recall and reporting biases could impact the accuracy of drug use data. Depression was assessed through a questionnaire rather than a clinical diagnosis. The NHANES data may underrepresent specific minority subpopulations. Not every illicit drug user is a drug abuser, so our results cannot be generalized to drug addicts. NHANES lacks direct individual temporal data, and our study lacks geospatial covariates.

Conclusion: Increasing depression scores lead to increased odds of marijuana, heroin, and methamphetamine use among the NHANES representative population between 2005 and 2016. This relationship varies by illicit drug type, annual changes in the data, and BMI.


Measuring exposure to public education campaigns with survey data and passively collected smartphone location data

Dr Kristine Wiant (--None--) - Presenting Author

Custom smartphone applications offer survey researchers opportunities to supplement infrequently collected self-reported survey data with continuously and passively collected measures of the survey participants’ behavior. In this presentation, we evaluate the feasibility of using a custom smartphone application designed to record the frequency and duration of visits to convenience stores as a proxy for exposure to public-education advertisements that are present in these stores.

Data for this presentation come from the first three waves of a longitudinal study that evaluates the effectiveness of a public education media campaign designed to motivate current cigarette smokers aged 25-54 to quit smoking. Print versions of the public-education advertisements are present primarily in and around convenience stores where tobacco products are sold. Baseline survey data were collected as part of in-person interviews in 15 U.S. counties in which the campaign is active and in 15 U.S. control counties in which the campaign is not present. As part of the baseline survey experience, participants were invited to download a custom smartphone app that tracks visits to convenience stores. At each follow-up wave, baseline participants received invitations to complete a 40-minute follow-up survey online, with in-person follow-up to non-responders. At the end of each follow-up survey, participants who had not originally consented to install the smartphone app were again offered this opportunity.

In this presentation, first we will review baseline cooperation rates to the smartphone app and app-refusal-conversion rates at each follow-up wave by mode (web versus in-person). We then compare post-consent drop-off in app-based participation, by wave and by mode; and we review feedback received from survey participants about the app. Finally, we compare participants’ self-reported time spent in convenience stores and self-reported exposure to the public education campaign with frequency and duration of convenience store visits, as passively recorded by the app.


Exploring the quantification and measurement of public procurement performance expectations gap in community roadworks in Uganda: Evidence from pilot survey

Mr Charles Kalinzi (Makerere University) - Presenting Author
Professor Joseph Ntayi (Makerere University Business School)
Professor Moses Muhwezi (Makerere University Business School)
Dr Levi Kabagambe (Makerere University Business School)
Professor John Munene (Makerere University Business School)

Research addressing the performance expectations gap in public procurement is sparse. Studies addressing expectation gaps are predominantly in auditing (see Adams & Evans, 2004; Brennan, 2006; Humphrey, Moizer, & Turley, 1993). Other studies, in the marketing field, have focused mainly on customer value (Ancarani, 2009) and service quality (Bolton & Drew, 1991; Cronin, Taylor, & Taylor, 1992; Parasuraman, Zeithaml, & Berry, 1985; Zeithaml, Berry, & Parasuraman, 1996). In all these studies, stakeholders have not been given a frontline role in championing their own expectations of contract performance; the expectations of technocrats have prevailed instead. In Uganda’s case, it is evident that citizens are never directly involved in community roadworks and have complained bitterly over time about the extent to which roadworks contracts measure up to their expectations. Contentious issues have been raised regarding the level of procurement performance (performance efficiency), how procurement officers perform their activities (the effectiveness of procurement engagements), and their ability and influence to make meaningful procurement decisions, which has widened stakeholder dissatisfaction (performance reasonableness). There is an inherent contradiction between what is expected of procurement practitioners and what is actually delivered in terms of works contract performance. Certain aspects of the expectations gap in the marketing and auditing disciplines resemble what is occurring in public procurement today. This study, through a survey, intends to borrow this concept from the auditing and marketing fields and replicate it in procurement management to investigate the extent to which community engagement, organizational pressures and path dependence contribute to the procurement performance expectations gap and how this gap can be minimized for improved performance.
The central research question being asked is: How can stakeholder perceptions be quantified and measured to minimise their influence on performance expectations in roadworks contracts in Uganda?
Aim: This study investigates the existence and nature of the public procurement performance expectations gap using a new methodological approach that quantifies and measures the performance expectations of stakeholders in comparison with those of the technocrats, combining stakeholder theory, institutional theory and path dependence theory.
Design/methodology/approach: Quantification of stakeholder expectations is a new approach in the social sciences because it is highly subjective. This will be an attempt to quantify the subjective views of road users and technical personnel in the roadworks sector.
Data sources: This paper will use the results of a pilot survey, which are nearly complete for half of the country, covering the two biggest of the four regions.
Practical implications: Identifying these critical variables would enhance the holistic success of future roadworks and would provide the first-ever quantification and assessment measure in the public procurement field.
Originality/value: This paper gives new insights into managing procurement performance expectations to the satisfaction of stakeholders.


Consumers' shopping behaviour: exploring possibilities for more accurate estimates of consumer flows and expenditures

Mr Thijs Lenderink (I&O Research)
Dr Robbert Zandvliet (I&O Research) - Presenting Author

‘Koopstromenonderzoek’ provides insight into flows of consumers between places and purchase locations within those places, as well as into the levels of spending in these locations and how people value them.

This type of research is used by policy makers to monitor the (economic) functioning of services and facilities offered, by retailers' associations to look for improvements for their purchase locations, and by large retailers and other commercial parties to investigate the need for either increasing or decreasing supplies. Often, in ‘koopstromenonderzoek’, the supply of shops in different branches is included, as well as vacancy rates.

In executing ‘koopstromenonderzoek’, surveys are used to ask respondents about the purchase location(s) for daily and non-daily shopping that they visited most recently. In this way, flows between places emerge. These flows are then matched with economic indicators (expenditures per household per year) for certain articles to translate them into expenditures in both physical and online shops.

Using ‘koopstromenonderzoek’, it is possible to give insight into consumer (spatial) flows and how consumers value purchase locations at a level that is quite reliable (although we need to rely on people’s memory). However, estimates of the economic functioning are to some extent less accurate, as the indicators used are not up to date and company expenditures are not included.

For I&O Research, this was a main reason for exploring the possibilities to include big data in ‘koopstromenonderzoek’. In this way, survey-based consumer flows can be validated and more accurate estimates of revenues can be achieved.

Our explorative research will consist of two elements:
- Comparing expenditures at purchase locations with Statistics Netherlands (CBS) revenue figures. These CBS figures are based on tax data of individual shops as well as larger chains. This is done in an explorative way for a number of purchase locations in the Netherlands.
- Comparing self-reported flows (from surveys) with (adapted) telecom data in order to validate flows. This is done in an explorative way for the municipality of Enschede.

Apart from enabling more reliable estimates, the purpose of our study is to investigate more efficient ways to perform ‘koopstromenonderzoek’ in the future.


Comparative analysis of case findings through strategic HIV testing approaches in Kogi state, Nigeria

Mr Moses Luke (AIDS Healthcare Foundation) - Presenting Author
Mr Patrick Adah (Kogi State Agency for the Control of AIDS)

Background: The 2019 NAIIS indicates that Kogi State has an HIV prevalence of 0.9%, with an HIV burden of 43,373 people living with HIV (PLHIV). However, only 23,768 PLHIV (55%) are on treatment, leaving a gap of 19,605 PLHIV yet to be diagnosed and linked to care. The first of the global 90:90:90 targets demands that HTS coverage must increase within the geographic areas and populations where the HIV burden is highest, and in previously under-served areas and populations at risk for HIV.
Targeted and innovative HIV testing approaches are required, alongside robust M&E systems that evolve to provide the information needed to enhance and focus testing efforts in order to meet this challenge.
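
The treatment coverage and gap quoted in the background follow from the two counts by simple arithmetic. The short Python sketch below recomputes them; the variable names are illustrative and are not taken from the study.

```python
# Counts quoted in the background (2019 NAIIS figures for Kogi State).
plhiv_total = 43_373         # estimated people living with HIV
plhiv_on_treatment = 23_768  # people living with HIV who are on treatment

# Share of PLHIV on treatment, and the number not yet on treatment.
coverage = plhiv_on_treatment / plhiv_total
gap = plhiv_total - plhiv_on_treatment

print(f"Treatment coverage: {coverage:.0%}")
print(f"Gap: {gap:,}")
```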

Methodology: The study used a rank-biserial approach, comparing the correlation coefficients derived from the yield of case findings for 2018 and 2019 HTS at six AHF health care facilities (HCFs). Targeted PITC, integrated service delivery (one-stop model), home-based testing, health fairs and multi-disease screening campaigns/events, index-client testing and partner notification services provided the approaches for the identification of case findings. Data were collected by HCFs on client intake forms, the HTS register and the serology worksheet for the routine diagnostic tests for HIV 1 and 2.

Results: To ensure quality, the data were validated before the rank correlation coefficients for 2018 and 2019 were compared; the derived trend lines were Y = 21.1 + 0.002X and Y = 58.5 + 0.001X (95% CI). The analysis indicates that RDTs for HIV at AHF HCFs yielded r² values of 0.53 and 0.10 for 2018 and 2019 respectively, for persons aged 18 months to above 50 years, with a 43% drop in the usage of rapid test kits (RTKs) and an increased yield of case findings in 2019 compared to 2018.

Conclusion: The findings indicate that the targeted strategic drive has shifted HTS towards greater efficiency. One case finding in 2018 depended on 53 RDTs, while one case finding in 2019 depended on 10 RDTs, across six AIDS Healthcare Foundation HCFs supported by CSOs. These findings emphasize strategic and targeted use of RDTs to improve yield while reducing RTK utilization by 43%.


Logistic regression analysis of non-retention among adolescents and young adults receiving antiretroviral therapy in Kogi state, Nigeria.

Mr Moses Luke (Ladoke Akintola University of Technology) - Presenting Author

Background: HIV is one of the world’s most serious public health challenges, causing the deaths of millions of young adults and devastating and impoverishing families. It has turned millions of children into orphans. Among infected individuals, including adolescents and young adults, retention in HIV care after ART initiation is a concern; retention is imperative not only to reduce individual HIV-related mortality and morbidity but also as a means to deliver positive prevention interventions that reduce ongoing transmission. The objectives of the study were to investigate factors associated with non-retention of HIV-infected adolescents and young adults on antiretroviral treatment and their socio-demographic characteristics in Kogi State.
Methods: A descriptive, cross-sectional study using a multistage sampling technique was used to select 307 respondents living with HIV and receiving antiretroviral treatment in Kogi State.
Result: The results showed that over half (52.1%) of the HIV patients were adolescents, and that the majority were female (58.6%) and single (85.7%). Among these patients, approximately one-fifth (19.9%) had no formal education. There was a significant association between patients’ lack of interest in ARVs and their retention in care (β=-3.507, Odds Ratio [OR]=0.030, p<0.05). There was also a significant association between stigmatization and patients’ retention in care (β=-3.404, Odds Ratio [OR]=0.033, p<0.05).
Conclusion: Stigmatization is a challenge to HIV treatment; it results in a lackadaisical attitude among patients and in misgivings about how seriously HIV patients take their treatment.
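
In a logistic regression, the odds ratio for a predictor is the exponential of its coefficient, OR = exp(β). The following Python sketch applies that identity to the two coefficients reported in the results as a consistency check; the function and dictionary names are illustrative and do not come from the study.

```python
import math

# Coefficients (beta) and odds ratios as reported in the results.
reported = {
    "lack of interest": (-3.507, 0.030),
    "stigmatization": (-3.404, 0.033),
}


def odds_ratio(beta: float) -> float:
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(beta)


for factor, (beta, reported_or) in reported.items():
    computed = odds_ratio(beta)
    print(f"{factor}: beta={beta}, OR={computed:.3f} (reported {reported_or})")
```

An odds ratio this far below 1 means the factor is associated with much lower odds of retention in care, which matches the negative sign of both coefficients.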


Enhancing the contribution of higher education in the fourth industrial revolution

Mr Ndirangu Ngunjiri (University of Nairobi) - Presenting Author

Global society is changing because of shifts in technological capacity, and higher education must change with it. This paper explores the contribution of higher education in the fourth industrial revolution. The societal changes from the fourth industrial revolution will require higher education to develop greater capacity for ethical and intercultural understanding, placing a premium on liberal arts-type education with modifications to adapt to the particular issues raised by fourth industrial revolution technologies and their disruptions to society. Higher education institutions need to adjust rapidly by expanding their capacity to accommodate the acquisition of new knowledge by researchers. Social and educational transformations from the first three industrial revolutions can provide a starting point for considering the potential transformations in higher education arising from the Fourth Industrial Revolution (4IR). The literature and analysis presented show a new approach to enhancing the contribution of higher education in the fourth industrial revolution and can help universities consider changes to their structure in delivering the fourth industrial revolution agenda. The literature analysed shows that higher education institutions have a complex, dialectical and exciting opportunity that can potentially transform society for the better. The fourth industrial revolution is powered by artificial intelligence, and it will transform the workplace from task-based characteristics to human-centred characteristics. Therefore, improving the quality of service in higher education can bring about a significant change in society. The study used data from 35 respondents at higher education institutions. The study collected secondary data, and diagnostic tests, including a test of normality and a reliability test, were run on the study variables.
The test of normality showed that the data were slightly skewed and kurtotic but did not differ significantly from normality. Based on the results of the analysis, the study recommends that more studies be done on the topic to establish the unknown factors that enhance higher education in the fourth industrial revolution. The study found that all the independent variables have a positive correlation with the dependent variable. The study recommends adoption and implementation of higher education in the fourth industrial revolution as a continuous process of creating, acquiring and transferring knowledge, as one or two practices may not yield the desired results. The study also recommends that higher education should embrace the fourth industrial revolution so as to enhance efficiency and economic growth.


Linking PIAAC data to individual administrative data: Insights from a German pilot

Dr Jessica Daikeler (GESIS -Leibniz Institute for the Social Sciences) - Presenting Author
Mrs Britta Gauly (GESIS -Leibniz Institute for the Social Sciences)
Mr Matthias Rosenthal (University of Stuttgart)

Linking survey data to administrative data offers researchers many opportunities. In particular, it enables them to enrich survey data with additional information without increasing the burden on respondents. German PIAAC data on individual skills, for example, can be combined with administrative data on individual employment histories. However, as the linkage of survey data with administrative data records requires the consent of respondents, there may be a bias in the linked dataset if only a subsample of respondents (for example, highly educated individuals) give their consent. This presentation provides an overview of the pilot project linking the German PIAAC data with individual administrative data. In the first step, we provide a literature overview of the factors that are crucial for the data linkage decision, and in a second step we illustrate characteristics of the linkable datasets and describe the linkage process and its methodological challenges. Finally, we provide an illustrative example of the use of the linked data, and investigate how the skills (numeracy, literacy and computer skills) assessed in PIAAC are associated with the linkage decision.


Predicting presidential election trends by web data

Mr Kaan Ketenci (Survey Research Center at the University of Michigan - Ann Arbor)
Miss Deji Suolang (Survey Research Center at the University of Michigan - Ann Arbor) - Presenting Author

This paper intends to use web-based organic data to supplement, and provide preliminary estimates for, more costly and time-consuming nationally representative surveys. We explored the Democratic Party primaries for the 2020 United States presidential election and focused on the CNN debate of Democratic candidates held on 15th of October 2019. The aim is to use real-time web data to predict changes in voter support as indicated by benchmark surveys. Three main sources of web data are employed: Twitter API, Oddschecker web-scraping data, and Google Trends. We expected that the counts and sentiment scores of tweets, the implied probabilities indicated by the Oddschecker website, and the scaled frequency of Google searches would correlate positively with voter support for a candidate. Predictions from each of these sources are combined in a final linear regression model estimating changes in voting intentions in reference surveys. The results from the three data sources share a common finding: Bernie Sanders, Pete Buttigieg and Amy Klobuchar had increased chances of winning the election after the debate. To inform the computations and features of this kind of big data, we also evaluated how well each of the three approaches performs at predicting these voting intentions before any opinion poll data is released. Finally, we conclude with suggestions for future research and practice.


A comparison of classification and regression tree methodologies when modeling survey nonresponse

Ms Tien-Huan Lin (Westat) - Presenting Author
Ms Jennifer Kali (Westat)
Mr Michael Jones (Westat)
Mr William Cecere (Westat)

When computing survey weights for use during analysis of complex sample survey data, an adjustment for nonresponse is often performed to reduce bias in estimates. Many algorithms and methodologies are available to analysts for modeling survey nonresponse. Lohr et al. (2015) discussed possible benefits of using regression trees for estimating response propensities in surveys and how these methods might be used to reduce nonresponse bias. In this paper we extend their findings and recommendations. Using expanded simulations we evaluate the effect of the methods on the reduction of nonresponse bias and further investigate the sensitivity of the methods when using survey weights. We discuss the practicality and benefits of using these methods for estimating response propensities in surveys.
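
As a rough illustration of how estimated response propensities feed into a nonresponse adjustment, the Python sketch below applies a weighting-class adjustment in which each cell stands in for a terminal node (leaf) of a regression tree. It does not fit a tree: the cell labels, base weights and response indicators are made-up values, and the function name is hypothetical rather than taken from the paper or from Lohr et al. (2015).

```python
from collections import defaultdict

# Hypothetical sample: (cell label, base weight, responded?).
# Each cell plays the role of one leaf of an already fitted tree.
sample = [
    ("leaf_A", 10.0, True),
    ("leaf_A", 10.0, False),
    ("leaf_A", 20.0, True),
    ("leaf_B", 15.0, True),
    ("leaf_B", 15.0, False),
    ("leaf_B", 30.0, False),
]


def adjust_for_nonresponse(units):
    """Return weighted response propensities per cell and adjusted weights.

    Assumes every cell contains at least one respondent; otherwise the
    propensity would be zero and the adjustment undefined.
    """
    total = defaultdict(float)      # sum of base weights, all sampled units
    responded = defaultdict(float)  # sum of base weights, respondents only
    for cell, weight, is_respondent in units:
        total[cell] += weight
        if is_respondent:
            responded[cell] += weight

    # Weighted response rate in each cell = estimated response propensity.
    propensity = {cell: responded[cell] / total[cell] for cell in total}

    # Respondents carry the weight of the nonrespondents in their cell.
    adjusted = [
        (cell, weight / propensity[cell])
        for cell, weight, is_respondent in units
        if is_respondent
    ]
    return propensity, adjusted


propensity, adjusted = adjust_for_nonresponse(sample)
print(propensity)
print(adjusted)
```

Dividing by the weighted response rate keeps the sum of the adjusted respondent weights in each cell equal to the sum of the base weights of all sampled units in that cell, so the adjustment removes bias only to the extent that response is unrelated to the survey variables within cells.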


Which is your favorite music genre? A validity comparison of Facebook data and survey data

Dr Zoltán Kmetty (Eötvös Lorand University, Faculty of Social Sciences) - Presenting Author
Dr Renáta Németh (Eötvös Lorand University, Faculty of Social Sciences)

Our study aimed to approach validity issues that arise in both surveys and Facebook data. We use a novel joint data source of combined Facebook and survey data. As Facebook has restricted the use of its API for accessing large numbers of Facebook users, social scientists have to develop new methods for accessing FB data for social science research. The new generation of studies collects the data through the users, not the tech company. Our study is one of the first attempts to use this new track in computational social sciences. One hundred fifty respondents took part in our study. After informed consent was obtained, respondents were asked to log in to FB on the interviewers' notebook and to download their FB profile archive. The data cover a wide range of Facebook activities: posts, comments, likes and reactions, pages, friends, profile, and ads data. Besides sharing their Facebook data, participants had to fill out an online questionnaire that covers various topics.
Our chosen topic of this paper is music interest, a key indicator of cultural sociology. To our knowledge, there are no previous studies on the rate at which any population is categorized by Facebook to advertisers as interested in music genres, or on the relationship between self-reported interest, digitally expressed interest and ad-interest categories. We primarily focused on operationalization-related questions and validity issues. We created different music preference variables. Besides survey data, we calculated measures based on page likes and ads interest categorization by Facebook.
The results overall showed that the different measures (survey or Facebook data based) have only moderate correlations with each other. Some genres were measured similarly, but there were significant differences too. A good example of the latter is world music. It was the second most preferred genre in the survey, but based on FB data, it was at the lower end of the preference scale. Self-reported music taste correlated more strongly with page likes than with ads interest. An important validity problem arose in the case of ads interest data. We did not find any ad interest category that fits a distinctively Hungarian music genre called 'mulatós'. The category selection of Facebook limits the measurability of certain interest groups. Digital data opens the possibility to examine topics we could not examine earlier or to re-examine topics with new approaches. However, all the data sources have their own validity problems. In the case of big data, social scientists often raise the problem of generalizability even if we have data from millions of users. We did not deal with generalizability, but rather with a less studied topic, the validity of this data. Our paper adds a new contribution to this topic. Platform affordance, algorithmic classification, and different types of platform usage all influence the expressed (observable) interest of users. We need to consider all these effects when we are relying on social media data.


Policy supporting dashboards for policy evaluation

Mr Jorre Vannieuwenhuyze (imec-SMIT-VUB) - Presenting Author

Over the last decades, there has been an ever-growing demand for empirically based and data-driven policy making, which requires all policy decisions to be underpinned by objective empirical data and research. One tool that is increasingly used for such decision making is the electronic dashboard. Nonetheless, the use of dashboards within policy formation processes remains relatively unexplored. For that reason, based on the theory of the policy cycle and policy evaluation research, we distinguish four types of policy supporting dashboards: the context, input, process and product evaluation dashboards.
This typology will be illustrated by the dashboards designed within the CUTLER project. This project aims to provide methods for big data analysis and communication through city dashboards in order to measure economic activities, assess environmental impact and evaluate social consequences of policy programs. The evidence shown through these dashboards is used to support decision making processes by informing, advising, monitoring, evaluating and revising decisions made by urban planners and policy makers. Based on the results of end-user experience surveys, we argue that dashboard design should start from a clear mapping of the policy questions at stake onto their corresponding evaluation goals. These goals, in turn, define how a dashboard should look, what kind of information should be shown, how the information should be visualized, and how the user should be able to interact with the dashboard.


Pilot study - building a probabilistic panel in France, Germany and Greece

Mrs Tanja Kimova (Kantar) - Presenting Author
Mr Jamie Burnett (Kantar)
Mr Yves Fradier (Kantar)
Mrs Elke Himmelsbach (Kantar)

Recently, conducting surveys online has become an increasingly used and credible alternative to classical survey approaches, such as face-to-face or telephone surveys. Despite all the efforts put into maximising the representativeness of online surveys, there are still concerns over their ability to support inferences to the population. Traditional survey modes have been facing challenges of their own: declining response rates, which in turn increase the costs of implementation, make it more difficult to secure funds to employ traditional probabilistic modes for survey research.
For this reason, underpinned by a culture of innovative thinking, Kantar has been exploring ways to build online probabilistic panels in Europe as a way to bridge the gap between online surveys and probabilistic sampling, offering a platform where the advantages of both can be utilised.
The purpose of this paper is to describe a small-scale pilot for building a new probabilistic panel in France, Germany and Greece using offline recruitment modes. In France we test the suitability of a push-to-web recruitment methodology by sending out 3,750 postal invites to a random sample of individuals inviting them to complete an online questionnaire, using a frame that is widely available, rather than one that can be difficult to obtain due to restrictions on use. In Germany and Greece we assess a phone-to-web design using a dual frame RDD recruitment design. For Greece we look at a bespoke recruitment survey design, where we interview 2,000 respondents, inviting them to complete an online questionnaire towards the end of the survey, whilst in Germany we make use of our dual frame RDD omnibus telephone survey as a piggyback for recruitment to the panel. Similarly to Greece, we interview 2,000 respondents and at the end of the interview we will propose that they take part in an online survey.
We are testing a number of different sample design criteria that are likely to impact on response rates to the initial recruitment and subsequent online panel surveys. By way of example we will look at whether different incentive amounts as well as reminder strategies have a positive impact on joining the panel, and whether this impact is consistent across countries. The fieldwork is planned for January and early February of 2020.
This paper will add to the evidence base for what works when building survey panels with a probabilistic sample base. In particular, the use of different recruitment strategies is a novel feature.


Use of big data and machine learning at Statistics Canada

Ms Christie Sambell (Statistics Canada) - Presenting Author
Mr Stan Hatko (Statistics Canada)

As part of Statistics Canada’s modernization efforts, the agency has been exploring ways to expand the use of alternative data sources to supplement or even replace entire surveys. Here are three examples of projects at Statistics Canada that use machine learning to incorporate big data sources in traditional surveys.

Several years ago, Statistics Canada began receiving scanner data—point of sales data from a major Canadian retailer. The product descriptions from each of the 13 million records per week needed to be classified to match the product classifications on the Retail Commodity survey. After a first attempt using traditional matching methods was not successful due to the lengthy processing time (over a week), a data scientist used MapReduce to chunk the file up and XGBoost to train a model that could be run in just a few hours. This model has now been used in Production for over a year and that retailer is no longer being sent the questionnaires. A Quality Assurance Framework was developed to monitor the quality of the machine learning model in Production.

The Freight Trucking Statistics Program has been receiving shipping manifests instead of questionnaires from five major shipping companies for several years. Like the scanner data project, the product descriptions needed to be classified to a standard classification but proved to be much more challenging as there were 582 classes and the product descriptions were not clean or systematic. A model using XGBoost was developed for the first company and now through active learning the model is being scaled up to the other four companies. Through this project the generalized system for classification at Statistics Canada (G-Code) was updated to include its first machine learning algorithms (XGBoost and FastText).
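
The abstract does not say which active learning strategy is used. As a minimal sketch, assuming margin-based uncertainty sampling, the records sent for manual coding could be chosen as those where the model's top two class probabilities are closest. The function name, record identifiers and probabilities below are hypothetical and purely illustrative.

```python
def select_for_labelling(predictions, budget):
    """Return the `budget` record ids the model is least sure about.

    `predictions` maps a record id to a dict of class -> probability.
    Uncertainty is measured by the margin between the two most likely
    classes: a small margin means the model can barely tell them apart.
    """
    def margin(probs):
        top = sorted(probs.values(), reverse=True)
        return top[0] - top[1]

    ranked = sorted(predictions, key=lambda rec: margin(predictions[rec]))
    return ranked[:budget]


# Hypothetical model output for three product descriptions.
predictions = {
    "rec-1": {"food": 0.90, "machinery": 0.05, "textiles": 0.05},
    "rec-2": {"food": 0.40, "machinery": 0.35, "textiles": 0.25},
    "rec-3": {"food": 0.34, "machinery": 0.33, "textiles": 0.33},
}
to_label = select_for_labelling(predictions, budget=2)
```

In such a loop the selected records would be coded by hand, added to the training data, and the model retrained before the next round.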

One area at Statistics Canada that really saw the potential of alternative data was the Agriculture Statistics Program. One of their projects was an experiment to see if greenhouses could be identified from satellite images in order to remove questions from the Census of Agriculture and to improve the frame for the annual Greenhouse survey. This had promising results using a convolutional neural network. It was also determined that the same model could be retrained to detect solar panels for another survey. A second experiment is planned with higher resolution aerial images to see if further information could be determined, such as cover type and vegetation type, in order to further reduce the questions on the Greenhouse survey.


Application of data mining methods using demographic survey data: Analyses of attitudes towards gender roles and domestic violence in Turkey

Dr Ayse Abbasoglu Ozgoren (Hacettepe University Institute of Population Studies) - Presenting Author
Dr Anil Boz Semerci (Hacettepe University)
Dr Duygu Icen (Hacettepe University)

This study explores the opportunities that data mining methods offer the social sciences by applying one such method, decision trees, to a conventional demographic data source, the demographic and health survey. We plan to analyze attitudes of women towards gender roles and domestic violence in Turkey by employing both decision trees and classical logistic regression, using the most recent Turkey Demographic and Health Survey (2018 TDHS), and to compare the findings. The reason for choosing the context of Turkey is twofold. First, Turkey is a unique context in terms of recent changes in norms within its demographic realm. In general, basic cultural norms change slowly and there has been a transformation from “traditional pro-fertility norms” to “individual-choice norms” in advanced industrial societies. According to TDHS data, the recent development in Turkey runs contrary to this pattern: fertility ideals are moving towards a pro-fertility norm, but gender norms are becoming more egalitarian and intolerance of domestic violence among women is on the rise. Hence, Turkey presents an interesting case within its cultural context. Second, although there is a tendency to use attitude variables as explanatory variables in population studies in Turkey, little has been done to analyze them as dependent variables. Yet understanding and studying transitions in gender roles is important, because traditional commitment to gender roles leads to the (re)production of gender inequalities over the life course, whereas egalitarian commitments may help to dismantle these inequalities. This paper aims to contribute to this perspective by treating ideals and attitudes as dependent, or outcome, variables in population studies related to Turkey.
This study is motivated by the need to examine these recent developments in norms in Turkey using data mining and data visualization techniques, namely decision trees. In other words, we aim to find out the profiles and attributes of women whose (i) attitudes towards gender roles are more egalitarian, and (ii) attitudes towards domestic violence are more intolerant, using 2018 TDHS data. The 2018 TDHS is the most recent TDHS, with completed interviews with 7,346 women aged 15-49 from 11,056 households. This study employs decision trees, which are among the most popular classification algorithms used in data mining and machine learning problems. Decision trees allow developing classification systems that predict or classify future observations based on a set of decision rules. Several algorithms can be used to generate trees, such as C5.0, Classification and Regression Tree (C&RT), Quick, Unbiased, Efficient Statistical Tree (QUEST) and Chi-squared Automatic Interaction Detection (CHAID). Our covariates are age group, region, urban/rural place of residence, wealth level, children ever born, marital status, mother literacy, father literacy, mother tongue, employment status and educational level. We also compare our findings with the results of classical logistic regression analyses. Finally, based on our profiling analyses, we conclude and speculate on possible scenarios for the post-transitional stage of the demographic transition in Turkey.
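
To illustrate how a tree algorithm derives a decision rule, the following minimal sketch picks the single one-versus-rest split on a categorical covariate that minimises the weighted Gini impurity, the criterion commonly used by C&RT-type trees. It is not the authors' implementation; the function names and the six toy records are hypothetical and are not TDHS data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(rows, target, covariates):
    """Return (impurity, covariate, category) for the one-vs-rest split
    with the lowest weighted Gini impurity, or None if no split exists."""
    best = None
    for cov in covariates:
        for cat in sorted({r[cov] for r in rows}):
            left = [r[target] for r in rows if r[cov] == cat]
            right = [r[target] for r in rows if r[cov] != cat]
            if not left or not right:
                continue  # a split must send records to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, cov, cat)
    return best


# Hypothetical toy records; "egalitarian" is the outcome to be profiled.
rows = [
    {"education": "secondary+", "residence": "urban", "egalitarian": 1},
    {"education": "secondary+", "residence": "rural", "egalitarian": 1},
    {"education": "secondary+", "residence": "urban", "egalitarian": 1},
    {"education": "primary", "residence": "urban", "egalitarian": 0},
    {"education": "primary", "residence": "rural", "egalitarian": 0},
    {"education": "primary", "residence": "rural", "egalitarian": 1},
]
split = best_binary_split(rows, "egalitarian", ["education", "residence"])
```

On these toy records the education split separates the outcome better than the residence split. A full tree repeats this search recursively within each branch, and algorithms such as CHAID replace the Gini criterion with a chi-squared test.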


How to combine survey data with administrative data successfully? Lessons from the German study on “Life courses and pension provisions” (LeA)

Mr Ulrich Brandt (Deutsche Rentenversicherung Bund) - Presenting Author
Dr Dina Frommert (Deutsche Rentenversicherung Bund)
Dr Thorsten Heien (Kantar)
Mr Marvin Krämer (Kantar)

For the German study “Life courses and pension provisions (LeA)”, extensive survey data on life courses and pension provisions are linked with administrative data from the respondents’ individual pension accounts. Because of the strict data protection rules in Germany, every respondent’s explicit consent is required to extract and link the administrative data. In practice, this means the respondents need to sign a consent form during the interview, which is asking a lot from respondents who have already consented to a lengthy survey. In addition to this new form of nonresponse error, other sources of error may be introduced into the data collection process, such as bias inherent to the alternative data source or bias introduced by the technical process of data linkage.
To reduce the burden of signing and sending off the record linkage form, LeA provided the respondents with an electronic consent form in addition to the “traditional” paper form. Compared to other record linkage projects in Germany, LeA shows especially high consent rates, which we attribute, among other determinants, to the electronic consent form. The proposed paper will report the consent rates in general and for specific sub-groups of the population, followed by an analysis of the extent to which the electronic form was used. We will also estimate the amount of selectivity introduced into the sample by specifying a logit model. Because of the high consent rates, we do not expect noteworthy additional selectivity. Nevertheless, the results demonstrate that the level and the mode (electronic vs. paper) of consent vary between different groups, and that reducing the effort needed to consent can play an important role in reducing selectivity.


Equipping social science researchers with the skills to use and analyse big data: The UK Data Service training programme

Dr Vanessa Higgins (UK Data Service, University of Manchester) - Presenting Author

The UK Data Service holds a large collection of social science data that are made available for re-use by researchers, including large cross-sectional government survey data and longitudinal/cohort survey data.

One function of the UK Data Service is an annual training programme to help data users to find, access and use data in research and teaching. More recently the training programme has been extended to cover big data/new forms of data. It is vitally important that the research community are equipped to deal with the new forms of data that are increasingly becoming available for research.

This paper will describe the UK Data Service training programme. It will include the aims of the programme, a description of the traditional survey data training that is delivered and the more recent training being delivered on the topic of big data/new forms of data. We will discuss how the new training programme has been developed, how it has been received by users and the insights that we have discovered from user discussions.


What if the labor force survey does not exist? A case study on the non-employed population exploring concepts, measurements and combinations of several (big) data sources.

Dr Vivian Meertens (Survey Methodologist) - Presenting Author
Dr Sabine Krieg (Survey Methodologist)
Dr Lianne Ippel (Maastricht University)

The design of official statistics is largely determined by history and international agreements. This is related to the desire to produce comparable figures over time and for different countries. This makes it more difficult to be innovative. For example, the Labor Force Survey has been conducted as a survey for many years and the questions are more or less fixed (with some redesigns in recent years). Yet the need for information may have changed, and the information can nowadays probably be obtained much better from other sources (registers, big data).
That is why we are conducting a thought experiment: what if the Labor Force Survey did not exist, and we now had to set up a new set of labor market statistics? What would this statistic look like? What kind of labor market indicators would be worthwhile to attain? Would we still use a survey, or would other sources be a better alternative? What result is possible if a survey is combined with other sources?
In this project we focus on a specific concept of labor market indicators: the non-employed population in the Netherlands. The aim was to specify them and profile or divide them into several groups in order to find out what is going on with these people, for example whether they are looking for work, or why they are not.
The project starts with the mapping of the information needs, reflection on the underlying concepts, and mapping of registers and big data sources that can be useful in this respect. We look at resources that are already available, as well as resources that may become available in the future. We also look at how the information can be collected through survey questions. Finally, we explore how we can combine this information without (significant) measurement errors and reflect on applying the total survey error paradigm to new forms of data for producing labor market statistics. This framework will be needed to encompass the new quality issues associated with combining and exploring several resources such as registers and big data.

We describe the different stages of this ‘thought’ experiment in a theoretical way, elaborating on concepts, several data sources and the opportunities of using and combining them. The quality of the various data sources has not actually been analyzed yet; instead we start from assumptions or from applications in other big data research. In the future, various variants of how the statistics could be designed will be worked out. One possible approach is to develop one variant based on new sources, another based on a survey and a third based on a combination of sources. Eventually, the variants will be compared. The extent to which the need for information is covered, what the costs are, but also how stable the source is, i.e. to what extent the figures will be comparable over time, will be taken into account.
This project is a collaboration with Marc Ponsen, Piet Daas, Martijn Souren and Mark Ramaekers, all Statistics


Cash use and financial literacy: insights from combining payments surveys with big data

Dr Kim Huynh (Bank of Canada) - Presenting Author
Mr Gradon Nicholls (Bank of Canada)
Miss Julia Zhu (Bank of Canada)

The Bank of Canada, as the sole issuer of banknotes in Canada, is interested in understanding the determinants of cash demand. One factor that has been found to be correlated with payment behaviour is financial literacy. For example, those with lower literacy have been found to use cash and debit cards more often at the point of sale rather than credit cards. They also hold more cash in their wallet, purse, or pocket on average, and are more likely to carry high-denomination banknotes.

The goal of the present research is to examine more closely the avenues through which financial literacy may impact cash demand. We take note of the various demand- and supply-side factors that are at play and show how financial literacy is associated with each. Previous literature has shown, for example, that low financial literacy is associated with high-cost credit card behaviour such as revolving on one’s credit balance and incurring fees for exceeding one’s credit limit. One possible story, then, is that those with low literacy use cash because they are credit-constrained. On the other hand, those with low financial literacy may prefer cash as a more intuitive way to manage finances, or may find payment innovation more difficult to understand.

Lusardi & Mitchell (2014) define financial literacy as the “ability to process economic information and make informed decisions about financial planning, wealth accumulation, debt, and pensions.” It has been measured using the “Big Three” skill-testing questions developed by Lusardi & Mitchell (2011) on many surveys around the world. These three questions, which test one’s knowledge of compounding interest, inflation, and risk diversification, were added to the Bank of Canada’s flagship survey on payments, the Methods-of-Payment (MOP) Survey (Henry et al., 2018) which we use for our analysis.

We combine our survey with two other sources of supply-side data using probabilistic matching. First is a survey conducted by Mintel Comperemedia tracking the number of credit card advertisements received by consumers by direct mail or email; second is data from TransUnion, which tracks the credit history of millions of Canadians. We use a variety of techniques, including a k-nearest neighbours algorithm and geographic proximity, to impute values for unmatched respondents. Credit card advertising and credit scores are variables that affect credit card use, but should only affect cash use indirectly. We therefore experiment with using these matched variables as exclusion restrictions in a model of cash and credit usage.
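
As a minimal sketch of the k-nearest neighbours idea mentioned above, the code below imputes a value for an unmatched respondent as the mean over the k most similar matched respondents. The abstract does not describe the actual implementation: the function name, variables, choice of Euclidean distance and all numbers here are hypothetical.

```python
import math


def knn_impute(donors, recipient, features, target, k=3):
    """Impute `target` for `recipient` as the mean over its k nearest donors.

    Distance is Euclidean over the listed `features`; in practice the
    features would usually be standardised first so that no single
    variable dominates the distance.
    """
    def dist(a, b):
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    nearest = sorted(donors, key=lambda d: dist(d, recipient))[:k]
    return sum(d[target] for d in nearest) / len(nearest)


# Hypothetical matched respondents (donors) with an observed credit score.
donors = [
    {"age": 30, "income": 40, "credit_score": 650},
    {"age": 32, "income": 42, "credit_score": 670},
    {"age": 31, "income": 38, "credit_score": 660},
    {"age": 60, "income": 90, "credit_score": 800},
]
# Hypothetical unmatched respondent with no credit score available.
unmatched = {"age": 31, "income": 40}
imputed = knn_impute(donors, unmatched, ["age", "income"], "credit_score", k=3)
```

Here the three closest donors are the younger, lower-income respondents, so the distant fourth donor does not influence the imputed value.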