Session 31

Artificial Intelligence and Machine Learning for Text-to-Code in Official Statistics

5 June 2026
09:00 – 10:15
ŠIBENIK II

Presentation title
Landscape of Text Classification with Machine Learning in Official Statistics: Experiences from the AIML4OS Project
Recent advances in natural language processing and machine learning have opened new opportunities to automate text-classification tasks that are central to official statistics, such as coding business activities, occupations, and household expenditures to standard classifications (e.g., NACE, ISCO, COICOP).

Within the European AIML4OS project*, Work Package 10 (WP10)** brings together eight National Statistical Institutes (NSIs) and four observers to foster collaboration on these topics. For many NSIs, applying ML/AI to text classification has become increasingly critical in recent years. WP10 aims to explore methodologies that enhance the accuracy and efficiency of statistical coding processes. To identify the most relevant areas of work, WP10 leveraged the unique opportunity of having many NSIs aligned on the same subject by organizing project presentations from each participating institute. This provided a comprehensive overview of current approaches and highlighted the challenges shared across organizations.

The overview reveals a common set of difficulties related to the availability and quality of labelled training data, noisy and ambiguous input texts, severe class imbalances, and the intrinsic complexity of multi-level statistical nomenclatures. Despite these challenges, a convergence in methodological choices emerges: most production systems rely on lightweight models such as fastText, logistic regression, or tree-based methods combined with word- or character-level n-gram representations and simple preprocessing pipelines. Several NSIs are also experimenting with transformer-based models, although their deployment remains limited due to computational, governance, and confidentiality constraints. A shared evaluation culture has developed across institutions, characterized by systematic cross-validation, class-sensitive performance metrics, and widespread use of confidence scores. Machine-learning components are embedded in operational workflows with varying levels of maturity, ranging from decision-support tools that propose codes to fully integrated auto-coding systems. Human-in-the-loop mechanisms, confidence thresholds, and user-friendly interfaces for coders remain essential for balancing efficiency gains with the accuracy, transparency, and trust required in official statistics.
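
The role of confidence thresholds in a human-in-the-loop workflow can be illustrated with a minimal sketch. It is not taken from any NSI system: the `Prediction` type, the `route` function, the threshold value, and the codes `CODE_A` and `CODE_B` are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    text: str          # free-text description to be coded
    code: str          # code proposed by the model (placeholder values here)
    confidence: float  # model confidence score in [0, 1]


def route(predictions, threshold=0.8):
    """Split predictions into an auto-coded list and a manual-review list."""
    auto, manual = [], []
    for p in predictions:
        # Predictions at or above the threshold are accepted automatically;
        # the rest are sent to a human coder together with the proposed code.
        (auto if p.confidence >= threshold else manual).append(p)
    return auto, manual


# Hypothetical predictions; texts, codes, and scores are illustrative only.
preds = [
    Prediction("bakery selling bread and pastries", "CODE_A", 0.95),
    Prediction("consulting and various services", "CODE_B", 0.41),
]
auto, manual = route(preds, threshold=0.8)
```

Raising the threshold sends more cases to human coders and lowers the automation rate, so in practice the value would presumably be tuned against the accuracy required for each classification.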

Despite differences in national contexts, NSI projects across Europe are converging on a common set of methodological priorities, creating significant opportunities for collaboration and knowledge exchange. This recognition led to the establishment of five thematic clusters within WP10, each addressing a distinct challenge in applying machine learning to text classification in official statistics: generating synthetic data to overcome limited training material, using advanced NLP models to handle linguistic complexity, developing hierarchical classification methods, ensuring robust deployment and quality assurance in production environments, and adapting models to evolving classification standards such as the NACE revision.

*https://cros.ec.europa.eu/dashboard/aiml4os

https://github.com/AIML4OS

**https://aiml4os.github.io/WP10/

Main author / Presenter
Ariane Lestrade
Destatis

Ariane Lestrade is a statistician at Insee, currently seconded to Destatis, where she works as a Senior Data Scientist in the Artificial Intelligence and Imputation Unit. She currently leads Work Package 10 (“Text-to-Code”) on automatic coding for official statistics within the European project Artificial Intelligence and Machine Learning for Official Statistics (AIML4OS). She holds a Master’s degree in Statistics from ENSAI (Rennes, France) with the EMOS certificate (European Master in Official Statistics). Prior to her current position, she worked as a Data Scientist at the Banque de France, focusing on machine learning methods and the exploration of new data sources for central banking topics.


CO-AUTHORS:

Bogdan Levagin, Destatis
Susanne Wegner, Destatis

Presentation title
Automatic coding in official statistics: Enhancing quality assurance of the Machine Learning workflow
Machine Learning (ML) has become a widely adopted tool across industries, offering new opportunities for enhancing data processing, classification, and estimation from large datasets.

These advantages make ML-based methods increasingly attractive for use in official statistics. Within the EU project Artificial Intelligence and Machine Learning for Official Statistics (AIML4OS), various potential use cases for official statistics have been identified. Work Package 10 focuses on text-to-code problems.

However, ML workflows introduce substantial quality‑related challenges. They are highly experimental, sensitive to data changes, and depend on complex pipelines that combine code, data, models, and infrastructure. Ensuring reproducibility, traceability, and transparency—key pillars of quality in official statistics—requires structured approaches that go beyond traditional statistical production methods.

MLOps, an emerging discipline that integrates DevOps practices into machine learning workflows, provides a structured approach to managing these challenges. By enforcing systematic version control of data, models and parameters; automating tests and validations; and documenting all experiment runs, MLOps enables reproducible and auditable ML processes aligned with the rigorous standards of official statistics.
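
One way to picture the versioning and documentation of experiment runs is a small record that fingerprints the training data and the parameters of each run. The sketch below is an assumption-laden illustration using only the Python standard library, not the workflow presented by the authors; the function names `fingerprint` and `make_run_record`, the record fields, and the sample rows are all hypothetical.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable object."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_run_record(training_rows, params, metrics):
    """Bundle what is needed to trace a run back to its data and parameters."""
    return {
        "data_hash": fingerprint(training_rows),
        "params_hash": fingerprint(params),
        "params": params,
        "metrics": metrics,
    }


# Hypothetical training rows (text, label) and parameters.
rows = [["bakery selling bread", "A"], ["road freight transport", "B"]]
params = {"model": "logistic_regression", "ngram_range": [1, 2]}
# The metric value is left empty here; it would be filled in after evaluation.
record = make_run_record(rows, params, {"macro_f1": None})
```

Any change to the data or the parameters changes the corresponding hash, so two runs with identical hashes can be compared knowing they used the same inputs.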

Implementing such workflows requires several dedicated tools and infrastructure components that are often missing from the statistician’s regular toolbox. Since internal data sources usually contain sensitive information, cloud-based solutions can demonstrate a proof-of-concept workflow but can be difficult to integrate into production systems. Such a workflow involves multiple stakeholders, including subject matter teams for data expertise and ownership, data science teams for model development, and IT teams for infrastructure support. MLOps further facilitates a clear separation of concerns among these roles, ensuring that each team can focus on its core responsibilities while maintaining an integrated, reproducible workflow.

The present work introduces an example of an MLOps workflow applied to a text classification problem, relying primarily on the open-source version of GitLab, which can be hosted on-premises. This implementation ensures that ML-driven statistical outputs remain trustworthy, explainable, and compliant with institutional quality principles. Embedding an MLOps workflow into the official statistical production system strengthens governance, enhances reproducibility, and supports the responsible use of ML in producing high-quality official statistics.

Main author / Presenter
Yu-Lin Huang
Université du Luxembourg

Yu-Lin Huang received his PhD degree from the Université d’Artois (France), in 2021. His research interests are in software engineering, focusing on complex system modelling and simulation methods such as multi-agent-based simulation. Yu-Lin joined the Security Design and Validation research group, SERVAL, headed by Prof. Yves Le Traon.


CO-AUTHORS:

Claude Lamboray, STATEC
Lucien May, STATEC
Yves Le Traon, Université du Luxembourg
Maxime Cordy, Université du Luxembourg

Presentation title
Measuring quality of real and synthetic generated data for training statistical classification coders
Building automatic coders for statistical classifications remains an unsolved challenge, despite the advancements in Natural Language Processing techniques and the emergence of Large Language Models (LLMs).

This challenge led to the creation of Work Package 10 (WP10) within the European AIML4OS project. The insights outlined in this abstract were shared in WP10 and refined through feedback from other contributors.

Statistical classifications are usually complex, comprising many categories, each of them containing nuanced content. LLMs appear to be a promising strategy, especially when combined with Retrieval-Augmented Generation (RAG). Yet, this approach still suffers from several of the well-known drawbacks of LLMs: costly hardware requirements, risks of hallucination, limited replicability, and the lack of confidence scores. Consequently, smaller NLP alternatives such as fastText or BERT-based models are still (and will continue to be) widely used.

However, building automatic coders for statistical classifications with these small or medium-sized models requires large, diverse, and high-quality datasets. Generating training samples via LLMs (using ZeroGen or similar techniques) helps to fill data gaps or even build entire training sets. Once these samples are generated, a “tiny task model”, such as fastText or a BERT-based classifier, is trained on the resulting dataset.

ZeroGen is presented in the literature as a technique intended for scenarios where real data is absent and is generally not expected to compete with real data in most real-world problems. However, our empirical experiments with statistical classifications show that they may represent a special use case, where synthetic data can achieve performance similar to real data. This is due to the extreme complexity of some statistical classifications, as well as the extensive materials describing their contents, which support the generation of high-quality and varied synthetic descriptions.

Measuring the quality of the generated synthetic samples is key to identifying the best prompting strategies and LLMs for producing them, and to comparing synthetic data with survey or administrative data. Several quality metrics are being considered, including the mean cosine distance between embeddings of sample pairs, Confident Learning metrics, and the performance achieved by tiny task models trained on these datasets.
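
As an illustration of the first metric, the mean cosine distance over all pairs of sample embeddings can be computed as sketched below. This is a minimal standard-library version, not the authors' implementation; the function names are hypothetical, and the embeddings are assumed to be plain lists of floats produced by some embedding model that is not shown here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("at least two embeddings are required")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

A higher value indicates that the samples are more spread out in embedding space, which is one possible proxy for the diversity of a generated dataset. The number of pairs grows quadratically with the number of samples, so large datasets would likely need pair sampling.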

Preliminary results show that larger LLMs produce more diverse samples (requiring more synthetic examples to reach acceptable performance), yet the final performance asymptote they achieve is higher. In addition, tiny task models trained solely on synthetic data can compete with those trained on real data, while the best results are obtained when combining both sources.

Main author / Presenter
Andrés Jurado Prieto
Data scientist, Statistics Spain

Andrés Jurado Prieto is a data scientist in the Subdirectorate-General for Methodology and Design of Samples of the National Statistics Institute of Spain (INE), where he has been working for over two years. He holds a degree in Biochemistry from the University of Córdoba, with a specialization in computational simulation of biochemical processes, including the modeling of surface plasmon formation in living organisms. His current professional work focuses on the development of machine learning models for natural language text classification based on official statistical classification standards. His activities contribute to methodological innovation, data quality improvement, and the development of synthetic data for training and validating machine learning systems within official statistics.


CO-AUTHORS:

Adrián Pérez Bote, Head of unit, Statistics Spain
Sebastián Gallego Herrera, Data scientist, Statistics Spain
Carlos Sáez Calvo, Head of unit, Statistics Spain

Presentation title
Using ML and LLMs for NACE classification
The goal of this paper is to present initial results of work on the use of Machine Learning (ML) and Large Language Models (LLMs) to classify text data from company websites according to the Polish Classification of Activities (PKD), which is a national adaptation of the international Statistical Classification of Economic Activities (NACE).

It aims to verify whether ML and AI can be applied to improve the quality of the statistical business register. This work is carried out within the international project “One-stop-shop for Artificial Intelligence and Machine Learning for Official Statistics (AIML4OS)”.

The training dataset was created from publicly available company websites, as they contain detailed information about the business activity profile and the products and services offered by companies, and in many cases are more up to date than the information available in business registers. The collected data were highly unstructured, so a text pre-processing step was required. This included data cleaning, normalization, removal of HTML tags, and anonymization using regular expressions and Natural Language Processing (NLP). To reduce the imbalance of PKD classes, the training set was also extended with synthetic descriptions generated by Large Language Models.

Subsequently, the text was converted into a numerical representation using TF-IDF vectorization, which was the basis for training the models. Classical Machine Learning models, such as Logistic Regression, Decision Trees and Random Forest, as well as one modern language model (XLM-RoBERTa), were tested. The training process focused on hyperparameter tuning, conducted with grid search and validated via three-fold cross-validation. This made it possible to identify the best configuration for both the Machine Learning models and the transformer model. The best result was achieved by the Logistic Regression model (with TF-IDF vectorization). The XLM-RoBERTa model achieved the second-best result.
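
For readers unfamiliar with TF-IDF, the weighting can be sketched in a few lines of standard-library Python. This is one common textbook variant (relative term frequency multiplied by log(N / document frequency)), assumed here for illustration; library implementations typically add smoothing and normalization, and the abstract does not say which variant or tokenizer was used. The function name `tfidf` and the whitespace tokenization are hypothetical choices.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document."""
    # Assumption: lowercase whitespace tokenization, for illustration only.
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors
```

Terms that occur in every document receive a weight of zero under this variant, while terms concentrated in a few documents are weighted up, which is what makes the representation useful for separating classes.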

As a next step, it is planned to integrate a Retrieval-Augmented Generation (RAG) architecture into the model. In RAG solutions, the embedding of the company description serves as the query to a vector database (ChromaDB) containing PKD definitions and standardized model descriptions of business activities. The best-fitting documents are provided to the LLM as context. This approach is expected to improve the quality of predictions and reduce model hallucinations.
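
The retrieval step of such an architecture can be sketched as follows. This toy version deliberately replaces both the embedding model and ChromaDB with standard-library stand-ins (bag-of-words counts and an in-memory dictionary), so it only illustrates the idea of ranking definitions by similarity to a query; the function names, the codes `CODE_A` to `CODE_C`, and the definition texts are hypothetical and are not real PKD content.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, documents, k=2):
    """Return the k (code, definition) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical definitions standing in for the vector database contents.
definitions = {
    "CODE_A": "manufacture of bread and fresh pastry",
    "CODE_B": "freight transport by road",
    "CODE_C": "retail sale of bread in specialised stores",
}
context = retrieve("family bakery producing bread and pastry", definitions, k=2)
```

The retrieved pairs would then be inserted into the prompt as context, so that the LLM chooses among a short list of candidate codes rather than the full classification.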

The proposed methodology, based on the RAG architecture and vector representations of text, can also be used to improve the quality of other classifications, such as the Classification of Individual Consumption According to Purpose (COICOP) or the International Standard Classification of Occupations (ISCO).

Main author / Presenter
Ewelina Niewiadomska
Statistics Poland

Ewelina Niewiadomska, PhD, works at Statistics Poland as a Chief Specialist. She focuses on the application of new technologies and alternative data sources to improve statistical data production processes. Her research interests include machine learning and Natural Language Processing (NLP). She has contributed to several national and international projects, such as ESSnet Big Data, the Web Intelligence Network, TranStat, and SDG. Currently, within the AIML4OS project, she collaborates with colleagues on the use of Large Language Models to predict economic activity.


CO-AUTHORS:

Krystyna Piątkowska, Statistics Poland
Marcin Związek, Statistics Poland
