The integration of machine learning into official statistics
|Friday 27th November, 11:45 - 13:15|
The integration of machine learning into official statistics
Dr Wesley Yung (Statistics Canada) - Presenting Author
Professor Hugh Chipman (Acadia University)
Dr Siu-Ming Tam (Australian Bureau of Statistics)
Interest in the use of Machine Learning (ML) for Official Statistics is growing rapidly, for reasons such as the need to process secondary data sources like Big Data, the need to remain relevant in a landscape where many organizations publish statistics, and the potential to improve the timeliness of Official Statistics products. In response to this interest, the High-Level Group for the Modernisation of Official Statistics of the United Nations Economic Commission for Europe has established the Machine Learning Project, which is investigating how best to integrate ML into Official Statistics. The objective of the project is to produce better official statistics through the responsible integration of demonstrated ML solutions. To demonstrate the added value of ML, the project is conducting several pilot studies in coding, in editing and imputation, and in the use of imagery. To support the sound use of ML, the project is proposing a framework for assuring and evaluating the quality of ML solutions in the context of Official Statistics. To integrate ML effectively and efficiently into the process flow, the project is gathering and summarizing good practices across statistical organisations. The knowledge, results, ML code and good practices gained by the project will be shared with the statistical community.
This presentation will give an overview of the project and the progress attained to date, focusing particular attention on the quality framework.
Case studies on machine learning for editing and imputation
Miss Anneleen Goyens (Flemish Institute for Technological Research (VITO)) - Presenting Author
Dr Bart Buelens (Flemish Institute for Technological Research (VITO))
Miss Fabiana Rocci (Italian National Institute of Statistics)
Miss Roberta Varriale (Italian National Institute of Statistics)
Mr Florian Dumpert (Federal Statistical Office of Germany (Destatis))
We report on case studies conducted in the context of the UNECE Machine Learning Project, specifically on the use of machine learning methods for editing and imputation. In this project, editing refers only to the identification of suspicious values, and imputation to the filling in of missing values or the correction of errors. Case studies were conducted to assess the usability, applicability and quality of machine learning methods for these purposes. They were carried out by six organizations in six countries and covered both the editing and the imputation themes. Under the editing theme, the following topics were covered: living cost and food (UK) and the statistical register (ISTAT). Under the imputation theme: tourism expenditures (Poland), attained level of education (Italy), energy statistics (Belgium), household and person data (Australia), and expenses for research and development (Germany). We present challenges and results, and discuss the machine learning methods that were considered, including neural networks, random forests, penalized regression methods, support vector machines and Bayesian networks. Two case studies are explored in greater detail: energy statistics and attained level of education. We outline the setup and design of these studies, the data required for training machine learning models, approaches to model selection and validation, and the assessment of the results. In addition to measures of predictive accuracy, the machine learning methods are compared with current best practices in the field. We present results of all case studies and draw overall conclusions leading to lessons learned. We conclude with recommendations and outline how the results from the case studies will contribute to further initiatives in the wider Machine Learning Project, including the deployment of methods and the adoption of machine learning in statistical production processes.
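As a flavour of the imputation setting these case studies address, the sketch below is a minimal, purely illustrative example (toy data, not any institute's production code) of the approach several of them share: train a model, here a random forest, on records where a categorical target is observed, then predict it for the records where it is missing.

```python
# Illustrative sketch only: imputing a missing categorical variable with a
# random forest trained on the complete records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: two numeric auxiliary variables and a 3-class target code
# (think of an attained-level-of-education code); ~20% of codes are missing.
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 1] > 1).astype(int)
missing = rng.random(500) < 0.2

# Train on complete records, then fill in the missing codes.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[~missing], y[~missing])
y_imputed = y.copy()
y_imputed[missing] = model.predict(X[missing])
```

The model selection and validation work described in the case studies goes well beyond such a baseline, for instance by comparing predictive accuracy against the current best-practice imputation method for the same variable.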
Algorithmic choices for sentiment coding of Flemish tweets
Dr Michael Reusens (Statistics Flanders) - Presenting Author
Dr Marc Callens (Statistics Flanders)
Dr Ann Carton (Statistics Flanders)
Dr Dries Verlet (Statistics Flanders)
The availability of public social media data and rapid advances in natural language processing algorithms to automatically interpret text have allowed for the analysis of a continuous stream of signals being sent out by people. Analysing these signals, such as Facebook posts, Twitter tweets and Instagram photos, using natural language processing techniques could be a low-cost and high-frequency complement or alternative to survey analysis for measuring perceived quality of life.
In this paper we present the results of a pilot project that combines natural language processing (NLP) and machine learning (ML) techniques to extract general sentiment from Flemish Twitter data. This pilot study is part of UNECE's Modernisation project on Machine Learning (HLG MOS ML project 2019-2020) in the field of official statistics (coding, editing and imputation, imagery).
In this paper we will report on the objectives, techniques, results, conclusions and lessons learned of the Flemish tweet coding project. More specifically, we will look at the sensitivity of the results to different NLP and ML choices.
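To give a flavour of the kind of pipeline whose design choices such a sensitivity analysis compares, here is a minimal bag-of-words baseline on hypothetical Dutch-language tweets. The data, labels and model choice are illustrative assumptions, not the Statistics Flanders pipeline.

```python
# Illustrative sketch only: a TF-IDF + logistic regression baseline for
# binary sentiment coding of (hypothetical) Dutch-language tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["wat een prachtige dag", "dit is verschrikkelijk slecht",
          "heerlijk weekend gehad", "slechtste service ooit",
          "fantastisch nieuws vandaag", "ik ben zo boos en teleurgesteld"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Unigrams and bigrams; each of these (vectorizer settings, classifier,
# preprocessing) is a choice whose effect a sensitivity study can measure.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
clf.predict(["wat een heerlijk nieuws"])  # label a new tweet
```

Swapping the vectorizer for word embeddings, or the classifier for a neural network, changes exactly the kind of NLP/ML choice whose impact on the results the paper examines.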
ABS use of machine learning to classify addresses on the Address Register
Mr Daniel Merkas (Australian Bureau of Statistics) - Presenting Author
Mr James Farnell (Australian Bureau of Statistics)
Miss Debbie Goodwin (Australian Bureau of Statistics)
The Australian Bureau of Statistics maintains an Address Register containing over 10 million residential addresses to provide a mail-out frame for the Population Census and for its household survey program. Addresses are classified by land use as residential, commercial, under construction or vacant. Regular maintenance of the register involves the review of existing classifications and the quarterly addition of 100,000 new addresses. Approximately 68% of new addresses can be resolved with the use of administrative data, including postal, electoral, construction and sales data. This leaves a large number for manual review using aerial imagery and other online tools. With trained analysts able to resolve only around 200 addresses per day, this is a very resource-intensive process. To automate it, we have developed an aerial image classification model called Automated Image Recognition (AIR).
In this presentation I will explain the methods used to develop this model, including a description of which classes of addresses were classified most successfully, and measures of accuracy.
By implementing this model, the ABS is able to classify addresses where no administrative data are available and to further strengthen classifications where they are. The AIR model complements the forward-looking predictions of the administrative data model with an observation of the address at a recent point in time. Using the AIR model we can detect whether the dwelling is habitable or is still either vacant land or under construction, which has significant benefits for household survey and Population Census operations. Implementation of the AIR model in conjunction with administrative data has led to more efficient use of resources in compiling the Address Register, and to improved quality through better accuracy and timeliness.
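The abstract does not disclose the AIR model's architecture; production systems for aerial imagery typically use deep convolutional networks. As a purely illustrative stand-in, the sketch below trains a simple classifier on synthetic grayscale "patches" in which built-up patches are brighter than vacant ones, showing the overall shape of the task (patch in, land-use class out) rather than the ABS method.

```python
# Illustrative sketch only (not the ABS AIR model): classifying small
# aerial-image patches by land use. Synthetic 16x16 grayscale patches:
# "built" patches are brighter on average than "vacant land" patches.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_patches(n, brightness):
    """Generate n fake 16x16 patches, flattened, around a mean brightness."""
    return rng.normal(brightness, 0.1, size=(n, 16 * 16))

X = np.vstack([make_patches(200, 0.7), make_patches(200, 0.3)])
y = np.array([1] * 200 + [0] * 200)  # 1 = built, 0 = vacant land

# A linear model suffices for this toy separation; real aerial imagery
# needs convolutional features to pick up rooftops, roads and texture.
clf = LogisticRegression(max_iter=1000).fit(X, y)
```

In practice the interesting cases are the ambiguous ones, such as a dwelling under construction, which is why measures of per-class accuracy matter for a model like AIR.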
Elaborating a new survey strategy for Statistics Flanders: answers in the pocket
Dr Dries Verlet (Statistics Flanders) - Presenting Author
Dr Tina van der Molen (Statistics Flanders)
Dr Ann Carton (Statistics Flanders)
Survey research faces some big challenges: a growing need for more detailed, up-to-date and instant information, the rapid development of new techniques, declining response rates in traditional surveys, and budgetary constraints. Timeliness is also an issue in traditional surveys: there is a tension between the care with which data must be collected and the need to release them "in real time" to stay relevant in a dynamically changing world. Mixed-mode methods are a partial solution for declining response rates, although much remains to be said about possible mode, selection and measurement effects. Despite these challenges, the survey is still the most common type of data collection, and a reliable method if applied properly.
Nevertheless, the environment in which surveys are deployed is changing, and traditional surveys need to be redesigned accordingly. This was our main motivation to start a new survey within Statistics Flanders, a new actor in the landscape of statistical authorities in Europe (created in 2016 and recognized as an ONA in 2020). This new context also gave us the opportunity to rework our previously used survey strategy.
As a result of a broad consultation of the literature and of survey research practice, we developed a new survey strategy for Flanders. Some ingredients of the redesigned survey strategy are: address-based, multimethod, respondent-centred, push to web and mobile first.
Based on the availability of a sample with address data (cf. the national register), respondents receive an invitation to participate in an online survey (address-based online surveying). However, we decided on a multimethod design: the online version is combined with a traditional paper version, offered in a later phase of the fieldwork (cf. the "web-plus-mail" design). After all, we had to recognize that different population groups are better served by different modes (cf. "tailored design"). The design of the survey should also be respondent-centred rather than data-user-centred. Push to web is more than just "going online": in our strategy we opt for a so-called "mobile first" design of the questionnaire. That was the biggest hurdle to overcome. Compared to the survey forms we were used to, we had to redesign the survey instrument itself. This is not only a technical challenge; foremost, it is a substantive challenge, involving the redesign of questions to be short, to the point, fitted to the smallest screen and easy to complete. To some extent, this means going "back to basics" in designing our survey.
In this paper we discuss the development of our new survey strategy, to be deployed from April 2020 onwards. By the time of the conference, we will therefore be able to reflect both on the journey we took to develop the new survey strategy and on the experiences gained in practice.