REGULAR

Session 19

Automatic Classification Using Artificial Intelligence

4 June 2026
11:00 – 12:30
ŠIBENIK V

Presentation title
Ensuring Quality: Evaluating the Strategic Use of AI in Official Cause-of-Death Data Production
Each year, about 650,000 death certificates must be coded according to the ICD-10 (International Classification of Diseases 10th Revision) to produce the official cause-of-death statistics of France.

Historically, the French Epidemiology Centre on Medical Causes of Death relied on a combination of automatic rule-based coding and manual human coding. Approximately 38% of certificates required manual coding, which could not be fully completed in time because of limited human resources, posing a risk to data quality and timeliness (France did not deliver its data to Eurostat on time until 2020).

From the 2018 coding campaign onwards (except in 2020), a hybrid production system was implemented, combining three complementary coding methods: rule-based automatic coding, human coding, and AI-based prediction. These methods are strategically designed to work in synergy within each campaign. The automatic rule-based system processes the majority of certificates (over 62%), while the remaining cases are allocated between AI prediction and targeted human coding.

AI predictions are accompanied by confidence scores, which are used to identify certificates with the highest uncertainty and to prioritize them for manual coding. In addition, certificates belonging to specific populations of high public-health interest (such as certificates mentioning AIDS, maternal or infant deaths, or deaths included in research databases) are systematically coded by human experts to prevent potential biases introduced by AI. To ensure the long-term sustainability and quality of AI-based coding, the training dataset needs to be continuously updated using newly annotated certificates. Approximately 50% of manually coded certificates are randomly sampled, ensuring both representativeness of the training data and continued exposure of expert coders to the full diversity of certificates. This also supports ongoing improvements to the rule-based system, while recognizing that the marginal benefit of increasingly targeted sampling diminishes over time.
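The allocation logic described above can be illustrated with a minimal Python sketch. It is not the CépiDc production code: the dictionary keys (`id`, `confidence`, `priority`), the `human_capacity` parameter, and the choice to apply the random share to the capacity left after priority cases are all assumptions made for illustration.

```python
import random


def allocate(certificates, human_capacity, random_share=0.5, seed=0):
    """Split certificates left after rule-based coding between humans and AI.

    Each certificate is a dict with hypothetical keys:
    'id', 'confidence' (AI confidence score) and 'priority' (bool).
    """
    rng = random.Random(seed)

    # 1. Priority populations are always coded by human experts.
    human = [c for c in certificates if c["priority"]]
    pool = [c for c in certificates if not c["priority"]]
    remaining = max(human_capacity - len(human), 0)

    # 2. A share of the remaining capacity is drawn at random, which keeps
    #    the future training data representative (assumed split rule).
    n_random = min(int(remaining * random_share), len(pool))
    picked = set(rng.sample(range(len(pool)), n_random))
    human += [c for i, c in enumerate(pool) if i in picked]
    pool = [c for i, c in enumerate(pool) if i not in picked]

    # 3. The rest of the capacity goes to the least confident predictions.
    pool.sort(key=lambda c: c["confidence"])
    n_targeted = min(remaining - n_random, len(pool))
    human += pool[:n_targeted]

    # 4. Everything else keeps its AI-predicted code.
    return human, pool[n_targeted:]
```

Raising `random_share` favours representativeness of the training data, while lowering it concentrates human effort on the most uncertain predictions.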

To assess the quality of each coding campaign and the balance between AI prediction and human labor, the AI test dataset is used to simulate the allocation strategy of the campaign. Results show that for 2023, if all 228,553 certificates remaining after rule-based coding were processed solely by AI, 83.5% would receive the exact same underlying cause of death as human coding. After the full hybrid campaign, this concordance increases to 92.6%. Differences in coding outcomes are further analyzed by cause-of-death group.
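The concordance figures above are shares of certificates whose final underlying cause matches the human code. A minimal sketch of such a simulation follows, assuming that test-set certificates routed to human coders count as concordant by construction; the function names and the toy ICD-10 codes in the usage lines are illustrative, not taken from the study.

```python
def concordance(final_codes, human_codes):
    """Share of certificates whose final code equals the human reference code."""
    if len(final_codes) != len(human_codes):
        raise ValueError("code lists must have the same length")
    matches = sum(f == h for f, h in zip(final_codes, human_codes))
    return matches / len(human_codes)


def hybrid_concordance(ai_codes, human_codes, sent_to_human):
    """Concordance after a simulated campaign.

    Certificates flagged in `sent_to_human` take the human code;
    all others keep the AI prediction.
    """
    final = [h if s else a for a, h, s in zip(ai_codes, human_codes, sent_to_human)]
    return concordance(final, human_codes)


# Toy example with four certificates (codes are placeholders).
ai = ["I21", "C34", "J18", "I64"]
human = ["I21", "C34", "J44", "I63"]
ai_only = concordance(ai, human)
hybrid = hybrid_concordance(ai, human, [False, False, True, False])
```

Computing the same ratio within each cause-of-death group would give the kind of per-group breakdown mentioned in the abstract.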

This article provides insights into the opportunities and risks associated with AI integration in official statistics by analyzing a production system in which AI is actively deployed, strategically supervised, and continuously evaluated for quality.

Main author / Presenter
Yann Aubineau
Inserm-CépiDc

Yann Aubineau is a data analyst at Inserm’s Centre for Epidemiology on Medical Causes of Death (CépiDc). His work focuses on the production, quality assessment, and dissemination of French mortality statistics, including international reporting to Eurostat and the WHO. He is co-author of recent national publications on causes of death and contributes to methodological work on series consistency, open data anonymization and dissemination, and the evaluation of AI-based coding of death certificates. Previously, he worked at Insee (the French National Institute of Statistics) on administrative data on public sector employment. He holds a Master’s degree in Quantitative Social Sciences from École Normale Supérieure Paris-Ulm / EHESS.


CO-AUTHORS:

Elisa Zambetta, Inserm-CépiDc
Aude Robert, Inserm-CépiDc

Presentation title
How to classify into predetermined OS quality dimensions open answers of the survey "User website satisfaction survey"
Quality is a central pillar of official statistics and is commonly articulated through multidimensional frameworks encompassing accuracy, timeliness, coherence, comparability, relevance, and accessibility.

With the increasing use of digital communication channels by National Statistical Institutes, user feedback collected through institutional websites represents a valuable but largely unstructured source of information for monitoring perceived quality. This paper investigates the feasibility of categorizing open-ended user questions submitted to the ISTAT website into established quality dimensions by means of semantic similarity and topic modeling techniques.

We apply state-of-the-art embedding-based methods, including BERTopic and Top2Vec, to represent textual queries in a latent semantic space and to identify clusters corresponding to quality dimensions. While these approaches are effective in capturing general thematic structures, our empirical results highlight a significant methodological challenge: embeddings associated with conceptually distinct quality dimensions—such as accuracy, timeliness, and time series coherence—exhibit high semantic similarity. This overlap reflects both the inherent interdependence of quality dimensions emphasized in the official statistics literature and the tendency of users to articulate multiple quality concerns within a single query. As a consequence, unsupervised topic models struggle to produce clearly separable categories aligned with the standard quality framework.
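The overlap problem described above can be made concrete with a small sketch that assigns a query embedding to its nearest quality dimension by cosine similarity and reports the margin over the runner-up. This is an illustration only, not the BERTopic or Top2Vec pipeline used in the study; the two-dimensional anchor vectors are toy values, and `nearest_dimension` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_dimension(query_vec, anchors):
    """Return the best-matching dimension and its margin over the runner-up.

    `anchors` maps a quality-dimension name to a reference embedding
    (at least two entries). A small margin signals that the query sits
    between dimensions and cannot be separated cleanly.
    """
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in anchors.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[0][1] - ranked[1][1]


# Toy anchors; real embeddings would have hundreds of dimensions.
anchors = {"accuracy": [1.0, 0.0], "timeliness": [0.0, 1.0]}
label, margin = nearest_dimension([1.0, 0.0], anchors)
```

When anchor embeddings for different dimensions are themselves close, margins shrink for most queries, which is one way to read the separability problem reported in the abstract.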

The findings suggest that purely unsupervised, similarity-based approaches may be insufficient for fine-grained quality classification in this domain. Future work will explore hybrid strategies that combine domain-informed supervision, quality-specific prompts, and hierarchical or contrastive embedding techniques. In addition, incorporating metadata, temporal context, and expert-annotated training sets may help disentangle overlapping semantic signals and improve the interpretability of results. The study contributes to the ongoing discussion on the use of natural language processing for quality management in official statistics and highlights the need for methodological adaptations when applying modern language models to conceptually dense and interrelated quality constructs.


Main author / Presenter
Elena Catanese
Istat

Senior researcher responsible for "Use of NLP techniques for OS"


CO-AUTHOR:

Roberta Roncati, Istat

Presentation title
Harnessing AI - Automatic classification of diseases
The integration of Artificial Intelligence (AI) into the production of official statistics marks a pivotal moment, promising unprecedented opportunities for enhanced efficiency, timeliness, and granularity.

This paper explores the transformative potential of AI, and in particular Machine Learning (ML) and Large Language Models (LLMs), across the production chain, from data collection to dissemination. Concurrently, it critically examines the inherent risks to the core principles of official statistics, including concerns around bias, transparency, methodological soundness, and public trust. Through a detailed analysis, including a practical example of automatic classification of disease cases, this paper outlines strategies and best practices for statistical organizations to responsibly harness AI while safeguarding the quality and integrity of their outputs.

Main author / Presenter
Georgios Lykos
Hellenic Statistical Authority - ELSTAT

Georgios Lykos is an Application Development specialist at the Hellenic Statistical Authority (ELSTAT), where he leads initiatives in integrating Artificial Intelligence into official statistical processes. He holds a degree in Informatics and an MSc in Data Science and Machine Learning. Currently, Georgios focuses on the development of intelligent chatbots and the fine-tuning of Large Language Models (LLMs) to automate the classification of complex natural language descriptions. His work is centered on leveraging generative AI to enhance data quality and operational efficiency within national statistics.


CO-AUTHORS:

Spyridon Dimas, Hellenic Statistical Authority - ELSTAT
Georgia Panagopoulou, Hellenic Statistical Authority - ELSTAT
Michail Vikentios, Hellenic Statistical Authority - ELSTAT

Presentation title
AI-assisted identification and estimation of hate crimes in Swedish police reports
Estimating finite-population totals for rare outcomes poses an important challenge in official statistics, particularly when the target variable requires manual annotation of large text corpora.

At Uppsala University, prediction-powered estimators for finite-population totals in highly imbalanced binary settings were developed by combining predictions from text classifiers with classical survey sampling estimators. The proposed methodology is applied to Swedish hate crime statistics, which are produced every two years by the Swedish National Council for Crime Prevention (SNCCP). First, a transformer-based classifier for identifying hate crime motives was trained using the complete population of police reports from 2007-2022, containing more than 22 million documents. When evaluated against expert annotations, it was shown to significantly outperform the Swedish police classification used in the current production of the statistics. Further, incorporating model predictions as auxiliary information yields unbiased and efficient estimates of hate crime totals, substantially reducing the required annotation effort and revealing under-reporting in police-flagged hate crime cases. The results demonstrate how modern text classification models can be rigorously integrated into finite-population inference, providing a practical framework for improving official statistics derived from large-scale textual data.

SNCCP has produced statistics on police-reported hate crimes since 2006. Since 2020, hate crime statistics have been based on reported crimes that the police classified as hate crimes and in which SNCCP also identified hate as the underlying motive in the police report. The correspondence between the reports the police classified as hate crimes in 2024 and those later assessed by SNCCP to contain a hate crime motive is relatively low, at 59 percent. This raises the question of whether there are police reports that contain a hate crime motive but have not been classified as hate crimes by the police themselves and, if so, whether they differ from the already identified hate crimes in characteristics such as hate crime motive, type of crime, and gender of victim and perpetrator. To answer this, the reports from 2024 that the model predicted as hate crimes but the police did not identify as such will undergo the same rigorous annotation procedure as in the ordinary production of hate crime statistics. The results will be published in December 2026.
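The idea of using classifier predictions as auxiliary information in a survey estimator, as described in this abstract, can be sketched with the classical difference estimator under simple random sampling. This is a textbook form chosen for illustration and is not necessarily the estimator developed by the authors; the function name and the toy data are assumptions.

```python
def difference_estimator(predictions, sample_labels):
    """Estimate a population total from model predictions plus a labelled sample.

    predictions   : list of model outputs (0/1 or probabilities), one per
                    document in the whole population.
    sample_labels : dict mapping the index of each annotated document to its
                    expert label (0/1). The sample is assumed to be a simple
                    random sample without replacement.
    """
    N = len(predictions)
    n = len(sample_labels)
    if n == 0:
        raise ValueError("at least one annotated document is required")

    # Total predicted by the model over the full population.
    predicted_total = sum(predictions)

    # Scaled sum of sample residuals corrects the systematic model error.
    residual_sum = sum(y - predictions[i] for i, y in sample_labels.items())
    return predicted_total + (N / n) * residual_sum


# Toy population of 10 reports; the model flags two, experts annotate five.
preds = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
labels = {0: 1, 1: 1, 4: 0, 5: 0, 9: 0}
estimate = difference_estimator(preds, labels)
```

The better the classifier, the smaller the residuals and the lower the variance of the correction term, which is how accurate predictions can reduce the annotation effort needed for a given precision.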

Main author / Presenter
Hannes Waldetoft
Uppsala University

Hannes Waldetoft is a PhD student in statistics at Uppsala University. His research combines modern machine learning models with well-established survey sampling estimators, in collaboration with the Swedish National Council for Crime Prevention (Brå).

Anna Frenzel is an analyst at the Swedish National Council for Crime Prevention, an agency under the Ministry of Justice in Sweden. She has worked on producing hate crime statistics at the agency since 2009.


CO-AUTHORS:

Anna Frenzel, Swedish National Council for Crime Prevention
Måns Magnusson, Uppsala University
Jakob Torgander, Uppsala University

Presentation title
Assessing the Quality of Automatic Text Classification of Time Use Survey Diaries Using Deep Learning Models
The Time Use Survey conducted by Istat is characterized by a large amount of textual data entered by respondents in their diaries, which are currently classified through semi-automatic procedures.

In this work, we applied modern Artificial Intelligence techniques to automatically label a 20% test set of 1.4 million texts derived from these diaries, while the remaining part was used as the training set. We also developed a web-based application interface capable of querying the Deep Learning backend both to start the training phase and to label a production set, i.e. a collection of texts for which domain experts require automatic labeling. On the test set, 277,881 texts were correctly labeled, while only 49,179 were incorrectly labeled. The best-performing model was a Bidirectional LSTM with a 300-dimensional Word2Vec embedding space pre-trained on an Italian Wikipedia dump. We tested several other models before reaching this result, including a vanilla transformer with Word2Vec, an attention-based LSTM with Word2Vec, a traditional LSTM, a GRU, and others. Furthermore, a novelty of our approach is that we encoded all covariates as a single textual input that can be fed to all the NLP (Natural Language Processing) models. In all cases, we considered only non-foundational models, i.e. neither Large Language Models (LLMs) nor transfer-learning pre-trained models such as BERT. This choice was motivated by the fact that Time Use Survey data are confidential and subject to strict privacy constraints, which prevent the use of models such as LLMs that require fine-tuning or even inference on cloud-based systems. By contrast, the traditional (non-foundational) Deep Learning approach proved to be successful in all cases, as it can be trained quickly on local physical machines (non-cloud) while still achieving competitive, and possibly even superior, accuracy compared to LLM-based approaches. For this reason, we conclude that traditional Deep Learning represents a winning approach for automatic coding applications in Official Statistics, where privacy and data quality are fundamental requirements.
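Two points in this abstract lend themselves to a short sketch: folding covariates into the text input, and deriving accuracy from the reported counts. The `key=value` token format and the covariate names below are hypothetical, since the abstract does not describe the actual encoding; only the two counts come from the text.

```python
def encode_record(diary_text, covariates):
    """Prefix a diary entry with its covariates rendered as text tokens.

    Sorting the keys keeps the token order stable across records, so the
    model always sees covariates in the same positions.
    """
    tokens = [f"{key}={value}" for key, value in sorted(covariates.items())]
    return " ".join(tokens + [diary_text])


def accuracy(correct, incorrect):
    """Share of correctly labelled texts."""
    return correct / (correct + incorrect)


# Hypothetical covariates attached to one diary entry.
example = encode_record("cooking dinner", {"weekday": "sunday", "age_group": "25-34"})

# Counts reported for the test set in the abstract.
test_accuracy = accuracy(277_881, 49_179)
```

With the reported counts, `test_accuracy` comes out at roughly 0.85. Encoding covariates as tokens lets any text model consume them without a separate input branch, at the cost of lengthening every sequence.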

Main author / Presenter
Francesco Pugliese
Istat

Francesco Pugliese is a data scientist and machine-learning researcher at the Italian National Institute of Statistics (Istat) in Rome, where he works on methodological and advanced analytics for official statistics. He specializes in machine learning, deep learning, and artificial intelligence applications to statistical problems, contributing to research on topics such as automated classification of web content for statistical use and improving statistical production with AI techniques. Pugliese has co-authored multiple scientific publications in Istat’s Review of Official Statistics and other venues, exploring innovative methods to enhance data quality and analytical processes within official statistics. His work bridges advanced computational methods and official statistical practice, helping modernize the way large and complex data are processed and utilized at Italy’s national statistical office.


CO-AUTHORS:

Francesco Scalfati, Istat
Tania Cappadozzi, Istat
Manuela Michelini, Istat
Fabrizio De Fausti, Istat

Presentation title
Scaling consistency checks and globalization analysis in Multinational Enterprise Groups data flows through generative AI
The accelerating globalization of production systems poses a major statistical challenge for national statistical offices (NSOs).

Multinational Enterprise Groups (MNEs) increasingly organize their activities through complex cross-border arrangements (fragmented value chains, intra-group trade, intellectual property mobility, and centralized service hubs) that obscure the economic reality that official statistics aim to capture. To address these challenges, many NSOs have established Large Cases Units (LCUs), specialized teams responsible for profiling major multinational groups, integrating microdata from multiple sources, and ensuring the consistency of their statistical records across surveys, registers, and administrative data. However, as globalization intensifies and data volumes expand, traditional manual validation processes face strong scalability limits.

This paper presents a technological and methodological innovation designed to enhance the consistency analysis of MNEs statistical recordings and to strengthen globalization-related assessments within the national accounts framework. Building on an operational workflow used in NSO practice, we develop an integrated system capable of ingesting heterogeneous microdata—business registers, structural business statistics, foreign affiliates, R&D, trade, customs, and FDI—and processing them through an automated pipeline. The system is structured in four stages: preprocessing of new data, integration and transformation in SAS to construct analytical datasets, automated generation of pivot tables and PowerPoint outputs through modular VBA scripts, and final review and validation by statistical experts. This architecture enables reproducible, large-scale analysis of production arrangements and cross-border linkages within multinational groups.

A core contribution of the paper is the introduction of a generative AI component that supports LCUs by enabling massive-scale consistency checks. A custom GPT model, guided by domain-specific prompts, processes publicly available corporate information to identify evidence of global production arrangements aligned with SNA 2025 categories. It assesses the presence and plausibility of operational patterns and automatically produces technical reports summarizing findings, recent news, and consistency issues. By combining automated microdata processing with AI-driven qualitative analysis, the proposed framework significantly improves the capacity of NSOs to detect inconsistencies, understand multinational production structures, and maintain high-quality statistical registers in an increasingly globalized economy.

Main author / Presenter
Juan Cervigón, Jorge Novalbos
National Statistics Institute

Juan Cervigón is Head of Unit in the Large Cases Unit Division at the National Statistics Institute of Spain (INE), where he has led work on data dissemination and globalization statistics. He has extensive experience in official statistics, representing Spain in international task forces and expert groups related to international trade and globalization. Juan holds a Degree in Mathematics from the University of Valencia and a PhD in Economics from Universidad Autónoma de Madrid. Alongside his public-sector role, he brings over 25 years of private-sector experience, including senior executive positions in banking, real estate, and renewable energy, and has served as Vice President of the Spanish Wind Association. He is also an experienced academic lecturer, teaching statistics, machine learning, and sampling methods in English at Universidad Carlos III de Madrid and previously at Instituto de Empresa. His recent research focuses on globalization, multinational enterprises, and intangible assets in economic statistics.

Jorge Novalbos is a Senior Statistician of the State and works in the “Large Cases Unit” Division at the INE. He holds a degree in Electrical Industrial Engineering and a Master’s in Industrial Engineering, both obtained from the Universidad de Castilla-La Mancha. His main responsibilities include the development and implementation of new indicators for measuring global trade, as well as the supervision and monitoring of the statistical register of globalization.


CO-AUTHORS:

Silvia Molina, National Statistics Institute
Sixto Muriel, National Statistics Institute
Jorge Novalbos, National Statistics Institute
