3 June 2026
14:15 – 15:45
ŠIBENIK IV
Presentation title
No one is without knowledge except he who asks no questions.
As part of the AIML4OS ESSnet, and in particular WP 4 dedicated to assessing the AI/ML State of Play, a survey was undertaken to develop a clear, structured and comparable picture of how National Statistical Institutes (NSIs) across the European Statistical System (ESS) are adopting and applying Machine Learning (ML).
This paper shares that knowledge, which is essential for safeguarding the quality of official statistics at a time when new techniques increasingly influence data collection, processing and analysis. Understanding the current state of ML readiness directly supports improvements in all the foundational dimensions of statistical quality.
Results on organisational aspects show an ESS that is active and exploratory, with many teams experimenting with ML in pilot projects or on specific operational problems. While this demonstrates high relevance, as innovation starts from concrete needs, it also reveals that ML is not yet embedded in systematic planning or production flows. Without stronger institutional frameworks, the potential gains in timeliness, coherence and comparability remain partial.
Concerning quality standards, the survey reveals an emerging but uneven landscape. Only a minority of NSIs have developed ML-specific guidelines for validation, documentation or performance monitoring. Most rely on general data protection and ethical frameworks rather than on methodological quality rules tailored to ML. This gap affects accuracy, reliability and transparency, showing the need for shared ESS guidelines that ensure consistent quality assurance as ML use expands.
On human resources, respondents identify clear skills shortages, particularly in programming, ML engineering and MLOps. While NSIs maintain strong statistical competence, the capacity to scale ML solutions or integrate them into production systems remains limited. Strengthening these skills is key to improving timeliness and maintaining high accuracy when automating or accelerating processing tasks. Planned investments in training and cooperation with research institutions indicate that NSIs recognise this necessity.
Finally, findings on technical challenges, such as fragmented data infrastructures, limited access to good training data, insufficient computing resources and concerns over privacy, highlight structural bottlenecks that constrain the maturity of ML initiatives. These issues affect the potential to produce ML-enhanced outputs that are reliable, coherent and timely.
Overall, the survey shows an ESS that is motivated and progressing, but still uneven in ML maturity. The results, which we share in the presentation, provide a valuable evidence base for prioritising capacity building, strengthening governance and developing common quality frameworks to ensure that ML contributes positively to all dimensions of statistical quality.
Sónia Quaresma
Instituto Nacional de Estatística
Sónia has a background in computer science with a major in artificial intelligence. She has worked at Statistics Portugal since 2000 and is the National Coordinator of AIML4OS in Portugal. She is also the project leader (ISO TC 69 / WG 12 - NWI) for the Standard on Curation, Cleansing and Wrangling of Big and Large Datasets.
Presentation title
Parameter Tuning Strategies for Random Forests in Imbalanced Data Settings: A Case Study on School Enrollment
The development of classification models on highly imbalanced datasets represents a significant challenge in machine learning applications, as the strong predominance of the majority class prevents the correct identification of minority classes, which are often associated with rare events.
In such contexts, traditional performance measures such as overall accuracy are inadequate, as they tend to favor models that appear highly accurate while providing limited predictive power for the classes of interest.
This paper investigates the role of hyperparameter tuning in improving the performance of Random Forest (RF) classifiers under imbalanced data conditions. In particular, we analyze the impact of key parameters—including the number of trees, maximum tree depth and the number of variables considered at each split—on the ability of the model to detect minority classes. The study emphasizes the importance of selecting appropriate evaluation metrics suited to imbalanced settings and highlights how parameter tuning can substantially enhance model effectiveness beyond default configurations.
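As a rough illustration of the tuning described here, the sketch below runs a small grid search over the three parameters named in the abstract, scored on the minority class. It assumes scikit-learn is available; the synthetic data, the grid values and the choice of F1 as the selection metric are illustrative assumptions, not the authors' setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): roughly 13% of cases in the minority class (label 1).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.87, 0.13], random_state=0,
)

# Illustrative grid (assumption) over the parameters named in the abstract:
# number of trees, maximum tree depth, variables considered at each split.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "max_features": [2, "sqrt"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 of the positive (minority) class instead of overall accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

Stratified folds keep the minority share similar in every split, which matters when positives are scarce.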
The empirical analysis is based on two datasets derived from administrative information on school enrollment. The binary target variable indicates whether an individual enrolled in a school course in the previous academic year is not enrolled in the subsequent year. Dataset 1 includes individuals attending primary or lower secondary education and is characterized by extreme imbalance, with only 1.2% of cases corresponding to non-enrollment in the following year. Dataset 2 focuses on individuals attending upper secondary education, where the proportion of non-enrollment increases to approximately 13%.
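To make the imbalance in Dataset 1 concrete, the following sketch scores a trivial classifier that predicts "still enrolled" for everyone. The 1,000 individuals are hypothetical; only the 1.2% non-enrollment share comes from the abstract.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical population: 12 of 1,000 individuals (1.2%) are not enrolled (label 1).
y_true = [1] * 12 + [0] * 988
# Trivial classifier: predicts "still enrolled" (label 0) for everyone.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The trivial classifier reaches an accuracy of 0.988 while its recall and F1 on the non-enrolled class are both 0, which is why accuracy alone says little in this setting.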
Machine learning methods are applied to estimate individual probabilities of being enrolled in a school course, conditional on enrollment in the previous school year. These probabilities are then used to predict enrollment outcomes for the subsequent school year. The objective of this study is to accurately replicate the distribution of school enrollment at both macro and micro levels. The micro-level predictive accuracy is carefully evaluated, as it is crucial for assessing the feasibility of incorporating model-based information into statistical registers.
The analysis focuses exclusively on RF models, which offer a flexible framework for handling nonlinear relationships and interactions while allowing for the assessment of feature importance and parameter interpretability. The results demonstrate that careful hyperparameter tuning, combined with appropriate evaluation metrics, is essential for achieving reliable predictions in highly imbalanced educational data and for supporting their potential use in official statistics production.
Fabrizio De Fausti
Istat
Fabrizio De Fausti is a researcher at Istat with over ten years of experience in Official Statistics, specializing in methodological development and the application of machine learning techniques to support statistical processes. His work focuses on the use of supervised learning methods for data editing, control and correction, imputation, and classification, with the aim of improving data quality, accuracy, and production efficiency. He has extensive experience in the integration of administrative data, Big Data, and Earth Observation sources into official statistical workflows. He has coordinated research activities leading to the development of experimental statistics, including the quantification of urban green areas using high-resolution orthophoto imagery and advanced image analysis techniques. His expertise also includes the assessment of classification accuracy, GIS-based spatial analysis, and the development of automated pipelines for land cover mapping. His research contributes to the modernization of statistical production systems through the systematic adoption of AI-based and data-driven
CO-AUTHORS:
Presentation title
Combining Knowledge Graphs and LLM Agents for robust statistical classification: the GRAAL framework
Traditional supervised learning approaches for hierarchical classification in official statistics face significant challenges due to the scarcity of labeled training data and the high cost of manual annotation by domain experts.
While Large Language Models (LLMs) have recently emerged as promising solutions due to their natural language understanding capabilities, their application to complex statistical taxonomies presents critical limitations. These include difficulty managing thousands of categories within constrained context windows, inconsistent predictions across multiple runs and limited explainability for audit requirements. These limitations are particularly problematic for official nomenclatures like NACE or COICOP, where consistency and traceability are essential to ensure the quality of the classification.
To address these challenges, this paper proposes GRAAL (Graph-based Research with Agents for Automatic Labelling), an exploratory graph-centric and agent-based framework designed for automatic classification within large hierarchical nomenclatures used by INSEE. GRAAL represents classification systems as knowledge graphs, where nodes correspond to classification codes and edges encode hierarchical and semantic relationships. Rather than relying solely on text generation, LLM-based agents are equipped with specialized tools to navigate, query, and reason over the graph structure.
The framework employs a multi-agent architecture: (1) a graph-navigation agent that iteratively traverses the classification hierarchy to identify relevant codes, (2) an agentic Retrieval-Augmented Generation (RAG) mechanism that leverages structured graph knowledge for contextual reasoning, and (3) an evaluation agent that assesses semantic relevance and confidence scores. This approach provides explainability through traceable navigation paths, robustness via structured exploration constrained by hierarchical relationships, and consistency by enforcing taxonomic rules during classification. Additionally, GRAAL enables the generation of synthetic labeled data by leveraging graph structure and agent reasoning to produce consistent training examples for downstream tasks.
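The idea of a traceable navigation path can be sketched with a toy example. Everything below is hypothetical: the codes and labels are invented, and a simple word-overlap score stands in for the LLM agent's relevance judgment. The abstract does not describe GRAAL's actual tools or scoring.

```python
# Hypothetical toy nomenclature: code -> (label, child codes).
TAXONOMY = {
    "ROOT": ("all activities", ["A", "B"]),
    "A": ("food manufacturing", ["A1", "A2"]),
    "A1": ("bread and pastry baking", []),
    "A2": ("dairy and cheese production", []),
    "B": ("retail trade", ["B1", "B2"]),
    "B1": ("food retail shops", []),
    "B2": ("clothing retail shops", []),
}

def relevance(words, code):
    """Stand-in for the agent's judgment: number of words shared with the label."""
    label_words = set(TAXONOMY[code][0].split())
    return len(words & label_words)

def navigate(text, start="ROOT"):
    """Walk the hierarchy top-down, keeping the best child at each level.

    Returns the full path of visited codes, which serves as the audit trail.
    """
    words = set(text.lower().split())
    path = [start]
    node = start
    while TAXONOMY[node][1]:  # stop when the node has no children (a leaf)
        node = max(TAXONOMY[node][1], key=lambda code: relevance(words, code))
        path.append(node)
    return path

path = navigate("Manufacturing of bread and pastry")
print(" > ".join(path))
```

Because each step can only move to a child of the current node, the prediction is always a valid code in the hierarchy, and the returned path records how it was reached.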
This exploratory study examines how combining LLMs with knowledge graphs and agent-based reasoning can provide a scalable and reliable solution for automatic labeling in official statistical production, offering new perspectives for addressing data scarcity challenges in statistical classification systems.
Théo Ferry
INSEE
Data Scientist at SSPlab, the innovation unit of INSEE
CO-AUTHOR:
Presentation title
The usefulness of good old ML models in real estate statistics
This paper presents preliminary results from an experimental study assessing whether new sources, methods and techniques can improve the quality of data on apartment rental prices in Poland used to calculate the consumer price index (CPI).
For the purposes of calculating the CPI, data on the rental market are usually collected by interviewers, which makes the research burdensome and costly. As the rental market is growing fast in Poland, especially in large cities with many immigrants and students, finding a more efficient way of collecting data and providing high-quality indicators for policy makers is becoming an important task for Statistics Poland.
Data for this project were collected from four real estate platforms using web scraping and APIs. Earlier studies demonstrate a high degree of convergence between transaction and offer prices, and consequently the usefulness of this data source. Since April 2022, more than 300 thousand rental offers have been collected, which significantly exceeds the volume of data available earlier. However, only offers that meet the definition of the representative item can be used to calculate the CPI. One of the most important steps was therefore to classify apartments as furnished or unfurnished. The classification was based on the offer descriptions and used machine learning methods. Several models were applied within the project: Logistic Regression, Naive Bayes, Decision Tree and Support Vector Machine (SVM). The SVM model achieved the highest overall accuracy and F1-score (0.9 and 0.87 respectively) and was selected for further analyses. The model performed better for the furnished class than for the unfurnished class (F1-scores of 0.93 and 0.81 respectively). Given the highly unbalanced dataset, in which 75% of offers referred to furnished apartments and only 25% to unfurnished ones, this can still be assessed as a good result. Further evaluation of the SVM model, including statistical metrics and human annotation, showed that this binary classifier performed well in separating offers into the two classes based on their descriptions.
The initial results of the study demonstrate that web data may provide useful information on the amenities available in a property and have the potential to augment real estate statistics, while traditional machine learning techniques can still play an important role in data processing, despite the greater current interest in Large Language Models.
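The per-class and overall F1-scores quoted above can be related with a short calculation. The abstract does not say how the overall F1-score was averaged, so the sketch below simply computes two common options from the reported figures, using the 75%/25% class shares as approximate weights.

```python
# Figures reported in the abstract.
f1_by_class = {"furnished": 0.93, "unfurnished": 0.81}
class_share = {"furnished": 0.75, "unfurnished": 0.25}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: every class counts in proportion to its share of offers.
weighted_f1 = sum(f1_by_class[c] * class_share[c] for c in f1_by_class)

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The macro average comes to 0.87, which matches the reported overall F1-score and is consistent with, though not proof of, macro averaging. The share-weighted average comes to 0.90.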
Klaudia Peszat
Statistics Poland
Klaudia Peszat is a Deputy Director at the Data Integration Department, Statistics Poland. She is responsible for initiating, conceptualizing and coordinating research that uses new data sources and integrates data from various sources.
Klaudia was involved in the ESSnet Trusted Smart Statistics – Web Intelligence Network project, one of whose goals was to explore the potential to extend the Web Intelligence Hub with new data sources and use cases.
In 2018 she obtained a PhD in socio-economic geography at the University of Warsaw, where she used to work as a research assistant.