3 June 2026
16:30 – 18:00
ŠIBENIK IV
Presentation title
Using AI for Automatic Data Extraction and Classification in Official Statistics
Various processes concerning data collection at Statistics Austria still rely on the manual extraction and classification of data.
To improve the quality of these processes and reduce the manual workload, we implemented an AI solution for four use cases over the last year, with three additional use cases currently under development. This talk focuses on how we leverage AI to (1) classify image and text inputs of clothing items needed for the consumer price index by fine-tuning a text model and an image model, (2) correct code classifications of business balance sheets with a zero-shot approach, and (3) automatically extract values from annual business reports with pre-trained models in order to calculate various key figures. In addition, we are developing several Optical Character Recognition (OCR) models to extract information from non-machine-readable files such as housing invoices and death certificates.
Nina Niederhametner
Statistics Austria
Nina Niederhametner has been working at Statistics Austria since 2023, where her main focus is the integration of machine learning and AI solutions into the statistical production process. This includes imputation, automatic data extraction, and classification.
Presentation title
A Machine Learning Approach for Improving the Flash Labour Cost Index Using Survey and Auxiliary Data
The Labour Cost Index (LCI) is a key indicator for monitoring labour cost dynamics in the EU.
Its production relies on the Quarterly Labour Cost Survey, which collects detailed information from around 28,000 establishments. However, the introduction of an early estimate of the index for quarter t at t+45 days means that its compilation is affected by the number of questionnaires still missing at that date. This forces the use of partial imputations based on traditional methods and limits the accuracy of the provisional estimates. In addition, the current imputation procedure only allows questionnaires to be completed using information from the same quarter of the previous year, leaving a significant number of cases unestimated. This challenge has gained importance with the introduction of the flash LCI, which demands more timely and reliable preliminary results.
The aim of this study is to present a new methodology designed to improve the flash LCI using advanced machine learning techniques. This methodology seeks, on the one hand, to increase the percentage of imputed questionnaires until full coverage is achieved, and on the other, to reduce the gap between the provisional and final index.
The proposed process is structured into four main stages: (1) reducing the number of target variables in the questionnaire; (2) selecting predictors using a random forest model to identify the most relevant variables among more than 500 candidates; (3) training and validating different predictive models —random forest, naive Bayes and XGBoost— using data from the most recent available quarters and some auxiliary files with administrative data; and (4) producing the final prediction with constraints that ensure the statistical coherence of the results.
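As an illustration of stage (4), the sketch below shows one way a coherence constraint could be applied to model predictions. The abstract does not say which constraints are used, so the rule here is an assumption: predicted cost components must be non-negative and must add up to the predicted total. The function name `reconcile`, the component names, and the numbers are all hypothetical.

```python
def reconcile(components, total):
    """Clip negative predictions to zero, then rescale so components sum to total.

    Hypothetical coherence rule, not the constraint set used in the study.
    """
    clipped = {name: max(value, 0.0) for name, value in components.items()}
    clipped_sum = sum(clipped.values())
    if clipped_sum == 0:
        # No information on the split: spread the total evenly.
        share = total / len(clipped)
        return {name: share for name in clipped}
    factor = total / clipped_sum
    return {name: value * factor for name, value in clipped.items()}


# Illustrative numbers only, not survey data.
raw = {"wages": 820.0, "social_contributions": 260.0, "other_costs": -5.0}
adjusted = reconcile(raw, total=1000.0)
```

Proportional rescaling preserves the relative shares predicted by the model; other adjustment rules would be equally plausible.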
The results obtained point to an improvement over the method currently used: the average error between the flash index and the final index is below 1%, and all incomplete questionnaires can be imputed. This approach increases accuracy and makes it possible to provide a more robust early indicator aligned with European standards. This work demonstrates the potential of machine learning to strengthen official labour market statistics in contexts where timeliness and reliability are essential.
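The gap between the flash and final index can be summarised in several ways. The sketch below assumes a mean absolute percentage error across quarters; the abstract does not state the exact error definition, and the function name and index values are hypothetical.

```python
def mean_abs_pct_error(flash, final):
    """Average of |flash - final| / |final| across quarters, in percent."""
    if len(flash) != len(final) or not flash:
        raise ValueError("flash and final must be non-empty and of equal length")
    relative_errors = [abs(f - g) / abs(g) for f, g in zip(flash, final)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Illustrative index values only, not published figures.
flash_index = [101.0, 102.5, 99.0]
final_index = [100.0, 102.5, 100.0]
error_pct = mean_abs_pct_error(flash_index, final_index)
```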
Raúl Fernández González
Spanish Statistical Office (INE)
From the beginning of my career, I have been dedicated to the fields of statistics, data analytics and methodological innovation. I hold a degree in Physics and Mathematics from the University of Oviedo, as well as a Master’s in Big Data & Business Analytics and the European Master in Official Statistics from the Complutense University of Madrid, in collaboration with EUROSTAT. I have been working in the Spanish Public Administration since 2023, developing my professional career at the National Statistics Institute of Spain (INE).
I currently serve as Program Director in the Labour Market Division, where I oversee the production and dissemination of official labour statistics, notably the Quarterly Labour Cost Survey. My work focuses on integrating advanced analytics and machine learning into statistical processes and contributing to the modernisation of official statistics. Prior to this role, I was involved in several artificial intelligence and business analytics projects.
Presentation title
Who Works in the Green Transition? A Patent-Based Analysis of Occupational Change Using Large Language Models
The green transition represents one of the most significant structural transformations of modern economies, reshaping production systems, energy infrastructures, and labor markets worldwide.
Understanding how technological change associated with Green Tech and energy transition evolves is therefore crucial for anticipating its economic and employment implications. While existing studies have analyzed green innovation using macroeconomic indicators, policy analysis, or firm-level surveys, patent data offer a unique and forward-looking perspective on technological trajectories and their potential impact on the world of work.
This paper provides a comprehensive patent-based analysis of Green Tech and energy transition technologies over the last decade. Using a curated set of patents identified through a combination of keyword-based searches and standardized technological classifications related to climate change mitigation and sustainable energy, we examine the evolution, composition, and intensity of green innovation.
To assess the implications of these technological developments for labor markets, the paper introduces a novel methodology that links patent technologies to occupational sectors. This association is performed using large language models (LLMs) and sentence similarity techniques, which allow for semantic matching between patent texts and occupational descriptions. By leveraging natural language representations, the approach captures nuanced relationships between technological content and job-relevant skills beyond traditional industry classifications.
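The matching step can be pictured as a nearest-neighbour search under cosine similarity. The sketch below is a simplified stand-in: it uses word-count vectors instead of LLM sentence embeddings so that it runs without external models. The function names, occupation descriptions, and patent text are hypothetical and do not come from the study.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts.

    A real pipeline would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(patent_text, occupations):
    """Return the occupation whose description is most similar to the patent."""
    patent_vec = embed(patent_text)
    scores = {
        title: cosine(patent_vec, embed(description))
        for title, description in occupations.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical texts for illustration.
occupations = {
    "Electrical engineers": "design and test electrical systems for power generation and storage",
    "Agricultural workers": "plant, grow and harvest crops and tend livestock",
}
patent = "A battery system for storage of electrical power from solar generation"
match, scores = best_match(patent, occupations)
```

Word counts only capture exact vocabulary overlap; dense sentence embeddings are what allow the semantic matching described above, where related terms score as similar even without shared words.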
Overall, the study contributes to the literature by connecting green innovation dynamics to labor market implications, offering evidence relevant for policymakers, firms, and workforce planning in the context of the global energy transition.
Giulio Massacci
Istat
I am a Data Scientist and researcher specializing in data manipulation within Big Data contexts, particularly in the field of Natural Language Processing. My work involves developing methodologies for the extraction, transformation, and analysis of large-scale textual data, with the goal of deriving meaningful insights for advanced language-based applications.
Presentation title
High frequency time series imputation with auxiliary data: a deep learning approach
Sensor data, such as that derived from induction loops and traffic cameras, is a valuable and readily available source for estimating inbound road tourism. Unlike ports, airports, and rail, for which detailed records exist, road travel lacks comprehensive passenger register data.
Another advantage of this data is its high temporal resolution. However, intermittent sensor malfunctions frequently introduce significant gaps into the time series, so imputation is needed to fill them.
For this task, we propose a novel deep learning model for multivariate time series imputation. The architecture integrates Long Short-Term Memory (LSTM) layers to capture temporal dependencies and features, alongside autoencoder-like layers to extract and leverage spatial features from co-located loops and cameras. Additional regressors can also be included.
The training process is sequential: the LSTM and autoencoder components are pre-trained independently and then combined using shallow, fully connected layers, which are fine-tuned on the final dataset. The implementation, built with Python/Keras, employs custom layers and metrics that support this multi-stage training approach and handle the time-varying nature of the gaps. Furthermore, this general architecture is readily adaptable to other multivariate high-frequency time series with strong spatial dependencies.
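One thing a custom metric for gappy series typically has to do is ignore positions where no true value was observed. The sketch below illustrates that idea in plain Python rather than Keras. It is an assumption about what such a metric might look like, not the authors' implementation, and the function name and values are hypothetical.

```python
def masked_mae(y_true, y_pred, observed):
    """Mean absolute error over positions where `observed` is True.

    Positions inside a gap carry no ground truth, so they are skipped.
    """
    errors = [abs(t - p) for t, p, m in zip(y_true, y_pred, observed) if m]
    if not errors:
        return float("nan")  # nothing observed, metric undefined
    return sum(errors) / len(errors)


# Illustrative vehicle counts; 0.0 is a placeholder inside a sensor gap.
counts = [120.0, 0.0, 95.0, 110.0]
observed = [True, False, True, True]
predicted = [118.0, 101.0, 99.0, 110.0]
score = masked_mae(counts, predicted, observed)
```

Because the mask changes from one time step to the next, the same function handles gaps of any length or position without retraining assumptions about where they occur.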
Luis Sanguiao
INE (Spanish NSI)
Luis Sanguiao-Sande holds a PhD in Mathematics (2006) and is a statistician specializing in time series analysis, seasonal adjustment, and statistical learning methods for official statistics. His work focuses on the integration of administrative and alternative data sources into statistical production. He has published in several journals, such as Journal of Official Statistics and Sankhyā A, and has presented his work at international conferences, including the World Statistics Congress.