Speed talk session 3

High-Quality Imputations With Machine Learning Techniques

3 June 2026
13:15 – 14:00
ŠIBENIK IV

Presentation title
Flexible machine-learning-based imputation with options for sequential imputation and predictive mean matching.
The new function vimpute() from the widely used R package VIM offers a robust and flexible approach to handling missing data in complex datasets.

It was designed with the principles of statistical quality and reproducibility in mind. The function uses the mlr3 ecosystem to access state-of-the-art machine learning methods, including random forest and XGBoost. Each variable with missing values can be imputed iteratively, leveraging all available information and dynamically optimizing model parameters to ensure high predictive accuracy.

A key innovation is the integration of automated hyperparameter tuning, which tailors the imputation model to each variable's characteristics. For numerical variables, predictive mean matching can preserve the original data distribution, while categorical variables benefit from stochastic imputation based on class probabilities, yielding realistic and statistically sound results.
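The two imputation ideas named above can be sketched in a few lines. This is a minimal illustration only, not the vimpute() implementation: it is written in Python rather than R, a one-predictor least-squares fit stands in for the mlr3 learners, and the function names `fit_line`, `pmm_impute` and `draw_category` as well as the donor count `k` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def pmm_impute(xs, ys, k=3, rng=None):
    """Fill None entries of ys by predictive mean matching on predictor xs."""
    rng = rng or random.Random(0)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Each donor keeps its model prediction and its actually observed value.
    donors = [(a + b * x, y) for x, y in observed]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        pred = a + b * x
        nearest = sorted(donors, key=lambda d: abs(d[0] - pred))[:k]
        # Borrow an observed value, so imputations stay within real data.
        filled.append(rng.choice(nearest)[1])
    return filled

def draw_category(probs, rng=None):
    """Draw one class label at random according to predicted probabilities."""
    rng = rng or random.Random(0)
    labels = list(probs)
    return rng.choices(labels, weights=[probs[l] for l in labels], k=1)[0]
```

Because predictive mean matching only ever returns values that were observed, the imputed variable cannot take impossible values, which is how the original distribution tends to be preserved. Drawing a category from the class probabilities, instead of always taking the most likely class, avoids over-representing the majority class.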

The effectiveness of vimpute() is demonstrated through a simulation study and real-world applications, highlighting its potential to improve data quality in imputed data sets.

Main author / Presenter
Alexander Kowarik
Statistics Austria

Dr Alexander Kowarik is head of the methods unit at Statistics Austria with more than 15 years of experience working at an NSI. He is an active contributor to the R open-source community with a focus on official statistics applications and participates in several international projects related to the use of new data sources for the production of official statistics.


CO-AUTHORS:

Eileen Vattheuer, Statistics Austria
Johannes Gussenbauer, Statistics Austria
Nina Niederhametner, Statistics Austria

Presentation title
Assessing Error in Machine Learning-Based Mass Imputation for Small Area Estimation
With the growing demand for high-resolution and high-quality statistics, the use of Small Area Estimates (SAE) has become increasingly important, particularly when high-quality administrative data can be linked to small or medium-sized surveys.

The SAE literature offers a wide range of models, such as the well-known Fay-Herriot model, which aim to produce unbiased estimates. However, these models are typically linear and rely on strong assumptions about error distributions.

In this work, we illustrate the application of machine learning techniques for mass imputation in the production of SAE. Unlike traditional models, machine learning approaches impose fewer assumptions on the data and often operate as “black boxes,” making error estimation for SAE particularly challenging. To address this, we propose the use of calibrated bootstrap replicate weights as a practical solution. We demonstrate the effectiveness of this approach through two case studies: estimating household income for school students by combining EU-SILC sample data with administrative registers, and producing tourism acceptance statistics for small areas across Austria. This work is co-funded by the European Commission Project “AIML4OS” – 101146355.
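As a rough illustration of error estimation with calibrated bootstrap replicate weights, the sketch below resamples units with replacement, rescales each set of replicate weights to a known population total, and takes the spread of the estimate across replicates as its standard error. It is not the authors' procedure: the simple ratio calibration to a single total, the function names and the default number of replicates are all assumptions. In a mass-imputation setting the imputation model would normally be refitted within every replicate as well, which is omitted here.

```python
import random
import statistics

def calibrated_replicate_weights(weights, pop_total, n_rep=200, seed=1):
    """Bootstrap replicate weights, each rescaled to sum to pop_total."""
    rng = random.Random(seed)
    n = len(weights)
    replicates = []
    for _ in range(n_rep):
        counts = [0] * n
        # Resample units with replacement and count how often each is drawn.
        for i in rng.choices(range(n), k=n):
            counts[i] += 1
        raw = [w * c for w, c in zip(weights, counts)]
        # Assumed calibration: one ratio adjustment to a single known total.
        factor = pop_total / sum(raw)
        replicates.append([r * factor for r in raw])
    return replicates

def weighted_mean(values, weights):
    """Weighted mean of values under the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bootstrap_se(values, replicates, estimator=weighted_mean):
    """Standard error as the spread of the estimator across replicates."""
    estimates = [estimator(values, rw) for rw in replicates]
    return statistics.stdev(estimates)
```

The appeal of this design is that the estimator is treated as a black box: any statistic that can be computed from values and weights, including one based on machine-learning predictions, can be passed as `estimator`.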

Main author / Presenter
Johannes Gussenbauer
Statistics Austria

Johannes Gussenbauer is a Methodologist at Statistics Austria, working in the Center for Methods. He has an academic background in Mathematics and Statistics. His expertise includes imputation, calibration, and error estimation for sample surveys. In addition, Johannes applies machine learning techniques to tasks such as text classification, mass imputation, and the analysis of web-scraped data.


CO-AUTHORS:

Nina Niederhametner, Statistics Austria
Alexander Kowarik, Statistics Austria

Presentation title
Standardization of feature engineering implementation using XML files in the imputation process of economic early estimates
In a fast-changing, globalized world, early estimates of monthly economic indicators are increasingly necessary for politicians, companies, and even citizens to make accurate and quick decisions.

Machine learning techniques can be used in the imputation phase to obtain early estimates with accurate results.

Standardization in the production of official statistics offers numerous advantages, primarily in enhancing quality, as standardized procedures and methodologies help avoid inconsistencies and errors.

An XML file is both human-readable and machine-readable, and the labels that describe the data are user-defined, so it can be adapted to different needs. It is focused on describing data and, as a text-based format, can be shared and processed across different systems, platforms, and programming languages without compatibility issues.

In any process using machine learning techniques, feature engineering is the step that remains largely semi-manual, since it is where subject-matter expert knowledge is implemented in the form of regressors. Regressors play a very important role in some machine learning techniques. Because of the role official statistics play in society, these regressors should be easily understandable by users and easily explainable by producers.

At Statistics Spain we have developed a production system based on XML files and R scripts that can help us in the process of defining regressors for ML methods. It has been designed with standard scripts to be used for different economic domains and data sources.

In the XML file we can change some parameters of the statistical operation (for example, its identification code) as well as the classifications used and their breakdowns, and the R scripts can be written to calculate the defined regressors with the required breakdowns.
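To make the idea concrete, the sketch below reads a small XML configuration and expands each regressor over every classification breakdown. It is purely illustrative: the production system described here uses R scripts, whereas this example uses Python's standard library, and the element names (`operation`, `classification`, `breakdown`, `regressor`), the attribute names and the sample values are all hypothetical rather than the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real file layout is not described in the abstract.
CONFIG = """
<operation code="OP001">
  <classification name="activity">
    <breakdown>section</breakdown>
    <breakdown>division</breakdown>
  </classification>
  <regressor name="turnover_lag1" variable="turnover" lag="1"/>
  <regressor name="turnover_lag12" variable="turnover" lag="12"/>
</operation>
"""

def load_regressor_specs(xml_text):
    """Expand each regressor over every classification breakdown in the file."""
    root = ET.fromstring(xml_text)
    specs = []
    for cls in root.findall("classification"):
        for brk in cls.findall("breakdown"):
            for reg in root.findall("regressor"):
                specs.append({
                    "operation": root.get("code"),
                    "name": f"{reg.get('name')}_by_{brk.text}",
                    "variable": reg.get("variable"),
                    "lag": int(reg.get("lag")),
                    "group_by": (cls.get("name"), brk.text),
                })
    return specs
```

Keeping the definitions in the XML file means that a new domain or breakdown is added by editing configuration, while the script that computes the regressors stays unchanged.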

Main author / Presenter
Elena Rosa-Perez
Statistics Spain

Elena Rosa-Perez started working at Statistics Spain in 2008, where she has been involved in the production of science and technology and tourism statistics. She currently works in the production of transport, construction and industry statistics, carrying out different production processes such as editing and imputation, seasonal adjustment, quality improvement and dissemination tasks. She is also an associate professor in the EMOS master's programme at the Complutense University of Madrid.


CO-AUTHORS:

Sandra Barragán, Statistics Spain
José Manuel Martín del Moral, Statistics Spain
Beatriz Acereda, Statistics Spain
David Salgado, Complutense University of Madrid

Presentation title
Learning to Fill the Gaps
Each year, Statistics Portugal conducts the Survey on Construction Enterprises to collect key information on the structure and performance of the construction sector.

Survey responses are used to define stratification variables that underpin estimation procedures extending beyond the sampled units. The effectiveness of this methodology critically depends on the completeness of all variables used in the stratum definition. However, non-response and partial non-response persist for a subset of these variables, posing challenges to data quality, accuracy, and reliability.

To address these limitations, this study leverages an administrative data source of high statistical quality: the Simplified Business Information (IES), collected by the Tax Administration. IES is a census-based system subject to extensive validation through the Integrated Business Account System (SCIE) of Statistics Portugal, which consolidates enterprise-level information and supports coherent analysis of economic activity and sectoral dynamics. The dataset provides detailed business characteristics, including geographical classification according to the Nomenclature of Territorial Units for Statistics (NUTS), economic activity based on NACE, and key financial indicators such as turnover and employment. Additionally, it includes construction-specific financial variables related to material costs, construction expenditures, and infrastructure investment.

Within the framework of AIML4OS Work Package 9, this paper investigates the use of machine learning (ML) techniques to estimate missing stratification variables, with the objective of improving data completeness, accuracy, and overall statistical reliability. Through comprehensive exploratory analysis, the study derives enhanced profiling variables capturing interactions between economic activity (NACE) and turnover, as well as structural differences between two survey instruments designed for large enterprises (Type A) and small enterprises (Type B), defined by employment size.

Model performance is assessed not only by comparing ML-based estimates with observed survey responses, but also through temporal validation against historical responses from previous reference periods (n−1 and n−2). This longitudinal perspective introduces an additional quality control dimension, exploiting enterprise-level response stability as an implicit validation mechanism. By integrating survey and administrative data, the proposed approach reduces response burden, strengthens coherence across data sources, and enhances the robustness of estimation procedures.
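The temporal validation described above can be pictured as an agreement rate computed against each reference period. The sketch below is an assumption about how such a check might look, not the study's evaluation code: the function names, the dictionary layout keyed by enterprise identifier and the period labels are hypothetical.

```python
def agreement_rate(predicted, reference):
    """Share of enterprises whose predicted class equals the reference class.

    Enterprises with no reference value (missing or None) are skipped.
    Returns None when no enterprise can be compared.
    """
    pairs = [(predicted[key], reference[key]) for key in predicted
             if reference.get(key) is not None]
    if not pairs:
        return None
    return sum(p == r for p, r in pairs) / len(pairs)

def temporal_validation(predicted, history):
    """Agreement against each reference period, e.g. 'n', 'n-1' and 'n-2'."""
    return {period: agreement_rate(predicted, reference)
            for period, reference in history.items()}
```

A usage example with made-up values: `temporal_validation({"e1": "A", "e2": "B"}, {"n": {"e1": "A", "e2": "B"}, "n-1": {"e1": "A", "e2": "A"}})` compares the same predictions with the current and the previous period. A high agreement with earlier periods is only indirect evidence, since enterprises can genuinely change stratum over time.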

Beyond improving imputation accuracy, the study contributes to methodological development in official statistics, supports higher-quality enterprise profiling, and deepens analytical insight into the economic structure of the construction sector.

Main author / Presenter
Pedro Sousa
Instituto Nacional de Estatística

Pedro holds a degree in Mathematics and Computer Science. He worked for several years in network and systems management. Since 2015, he has been working at Statistics Portugal (INE), in the Methodology Department.


CO-AUTHORS:

Sónia Quaresma, Instituto Nacional de Estatística
Vasco Cordeiro, Instituto Nacional de Estatística
Pedro Campos, Instituto Nacional de Estatística
