3 June 2026
11:00 – 12:30
ŠIBENIK III
Presentation title
Assessing the Reliability of Web-Derived Data: Methods for Data Validation and Quality Evaluation in Surveys based on Web Data
The primary objective of this paper is to present a comprehensive methodological framework designed to validate datasets used for the production of experimental statistics derived from website content.
This framework aims to ensure that data collected from the web, characterized by its heterogeneity, dynamic nature, and varying degrees of reliability, can be systematically assessed for quality and suitability in statistical analysis. A secondary objective is to demonstrate the practical application of this framework through a detailed case study focusing on the extraction of specific measurements from website content. These measurements include, among others, indicators of social media presence, engagement in e-commerce activities, and other observable features that can be linked to the digital behaviour of website owners.
This study investigates whether it is feasible to construct a generalizable framework capable of evaluating the quality of web-based data across diverse contexts. To address this question, the paper presents an empirical case study that both illustrates the practical implementation of the proposed framework and highlights its methodological constraints. By examining real-world data, the study aims to identify which aspects of the framework can be universally applied and which require adaptation to particular analytical goals or domain-specific conditions.
The empirical analysis draws on data collected through the Web Intelligence Hub developed by Eurostat, which provides a structured environment for accessing, processing, and analysing large-scale web data. The use of this platform enabled the systematic extraction of relevant indicators and facilitated the assessment of data quality dimensions such as completeness, accuracy, and consistency.
The overall conclusion of the study is that while it is indeed possible to define a coherent set of indicators to measure and validate the quality of web-derived data, the degree to which such a framework can be generalized remains limited. The variability of web content and the diversity of analytical needs make it challenging to establish a universally applicable validation methodology. Nevertheless, the findings also demonstrate that numerous quality measures, such as metadata analysis, cross-source verification, and consistency checks, are broadly relevant and can substantially enhance the reliability and interpretability of web-based datasets. These common measures form an important foundation for improving the overall quality of web data used in experimental statistics, even if complete generalization across all use cases is not achievable.
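As an illustration of the kind of completeness and consistency checks mentioned above, the following is a minimal sketch and not the framework presented in the paper. The records, field names (`has_social_media`, `has_ecommerce`, `social_links`) and the consistency rule are all hypothetical assumptions chosen for the example.

```python
# Hypothetical scraped records; all field names and values are illustrative assumptions.
records = [
    {"url": "https://a.example", "has_social_media": True, "has_ecommerce": False, "social_links": 2},
    {"url": "https://b.example", "has_social_media": None, "has_ecommerce": True, "social_links": 0},
    {"url": "https://c.example", "has_social_media": False, "has_ecommerce": None, "social_links": 3},
]

def completeness(rows, field):
    """Share of rows in which the field is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)

def inconsistent(rows):
    """URLs whose social-media flag is False although social links were found."""
    return [
        r["url"]
        for r in rows
        if r.get("has_social_media") is False and r.get("social_links", 0) > 0
    ]

print(completeness(records, "has_social_media"))
print(inconsistent(records))
```

Measures of this kind are cheap to compute per indicator, which is one reason they may carry over between use cases even when the full framework does not.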
Jacek Maślankowski
Statistics Poland
Dr. Jacek Maślankowski is an Assistant Professor at the University of Gdańsk and serves as a consultant at Statistics Poland. His scholarly and professional work focuses on modern data processing technologies and their practical application in economics and official statistics. His primary areas of expertise include the use of Big Data for public statistics and market analysis, advanced web data analysis techniques, and the design and implementation of Business Intelligence solutions.
He is the author of numerous scientific publications addressing Big Data methodologies and data quality issues, and a co-author of multiple analytical and methodological reports in this field. Since 2012, he has been actively involved in the processing and analysis of large-scale datasets within official statistics. He has participated in numerous international and national projects led by UNECE, Eurostat, and the European Commission, aimed at the development and integration of Big Data solutions into statistical production systems.
Presentation title
Assuring quality for statistics based on Mobile Network Operator Data: Lessons from the Multi-MNO Project
The informative potential of new data sources, including privately held data, has attracted growing interest from researchers in official statistics.
As with any source or methodology, understanding the quality-related risks of such data has been a priority. Early international projects have shown that, because big data sources are so varied, the challenges associated with each class of data are so specific that a general quality framework encompassing all those sources would be difficult to build and too generic to be useful.
In this paper a quality framework for statistics derived from mobile network operator (MNO) data will be illustrated. This framework is one of the deliverables of the Multi-MNO project, carried out under a service contract awarded by Eurostat to a consortium of multiple organizations, specifically national statistical offices and industry partners, and also involving mobile network operators. This paper describes the main characteristics of the framework, highlighting similarities and differences from quality management approaches that are commonly employed for the assessment of statistics based on traditional statistical processes.
The framework addresses quality management of MNO data from different perspectives: business architecture, institutional elements, process and output quality, software quality. This work focuses on the central part of the assessment, namely the mitigation of input and throughput quality issues, where the core ideas can be used in a modular way to address new issues or introduce new corrective actions. Finally, we will discuss how the framework could also serve as an inspiration for the quality assessment of other innovative data sources, especially in the domain of privately held data.
Gabriele Ascari
Istat
Gabriele Ascari is a researcher at Istat with a background in statistics. Since 2017 he has worked on the evaluation of statistical processes and product quality. He has been involved in multiple international projects focusing on data quality frameworks, quality improvement and the use of innovative data sources and methodologies in official statistics.
Presentation title
Navigating Uncertainty: A Total Error Framework for MNO-Based Tourism Statistics
The use of Mobile Network Operator (MNO) data for tourism statistics offers National Statistical Institutes (NSIs) new opportunities to improve timeliness, granularity, and analytical scope, but also introduces methodological complexity and uncertainty that can undermine trust if not carefully managed.
To support the reliable integration of MNO data into official statistics, we propose a structured roadmap for applying a Total Error Framework to MNO-based tourism indicators, explicitly aligned with the core quality dimensions of official statistics.
The roadmap is organised around six foundational elements that frame the statistical production process. Target Statistics define the policy and user needs to be addressed (e.g. visitor numbers), anchoring the process in relevance. The Statistical Unit specifies who or what is being measured (persons, SIMs), with direct implications for accuracy, interpretability, and comparability. The Nature of MNO Data clarifies whether MNO data act as a primary or auxiliary source, shaping methodological choices and coherence with existing tourism statistics. Data Configuration characterises the structure and granularity of available data, including spatial and temporal resolution and macro versus micro formats. Methods encompass estimation, modelling, and correction techniques, while Total Error Analysis provides a systematic approach to identifying, assessing, and addressing the most relevant sources of error.
The roadmap unfolds through a sequence of interconnected steps consistent with the Total Error Framework. It begins with a clear definition of the target statistics, the statistical unit and the population of interest, ensuring conceptual clarity and relevance. The role of MNO data within the broader statistical system is then established, supporting coherence and comparability with other sources. Subsequent steps focus on mapping observation units and complementary sources and characterising data configurations, making explicit the trade-offs between timeliness, punctuality, and accuracy.
The final phase addresses error identification and mitigation. Potential sources of error, such as coverage, measurement, and modelling errors, are identified and classified. Where feasible, the most influential components are quantified. Appropriate correction and adjustment methods are then applied, followed by validation and review of assumptions. These steps enhance accuracy and reliability while supporting transparent communication of uncertainty.
By embedding quality considerations directly into methodological decision-making, the roadmap promotes accessibility, clarity, and institutional trust. Rather than treating quality assessment as an ex post exercise, it integrates quality management throughout the production process, providing NSIs with a coherent and trustworthy framework for the use of MNO data in tourism statistics.
Pedro Cunha, Sónia Quaresma
Instituto Nacional de Estatística
Pedro Cunha holds a degree in Systems and Computer Science Engineering from Universidade do Minho. He has been working at Statistics Portugal (INE) since 1996, within the Methodology and Information Systems Department, where he focuses on the integration of data in the Statistical Data Warehouse.
Sónia Quaresma has a background in computer science with a major in artificial intelligence. She has worked at Statistics Portugal since 2000 and is the National Coordinator of AIML4OS in Portugal. She is also the project leader (ISO TC 69 / WG 12 - NWI) for the Standard on Curation, Cleansing and Wrangling of Big and Large Datasets.
Presentation title
Quality by design: adopting best practices for machine learning and AI through the SSP Cloud datalab
Official statistics increasingly integrates new methods relying on machine learning and AI.
Ensuring quality, transparency, and reproducibility in these new statistical processes brings significant challenges, particularly for teams without significant expertise in maintaining such methods in production.
Quality is no longer solely a matter of methodological rigor; it now also depends on a working approach that strictly separates code, data and execution environments, especially as production processes move to cloud infrastructures. This separation is necessary to ensure reproducibility and to support a seamless transition to production.
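One common way to keep code separate from data and from the execution environment is to have the code read its data location and run parameters from the environment instead of hard-coding them. The sketch below is only an illustration under that assumption; the variable names (`DATA_URI`, `MODEL_VERSION`, `SEED`) and the storage path are hypothetical and are not part of SSP Cloud.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    data_uri: str        # where the input data lives (not stored with the code)
    model_version: str   # label of the model or run being executed
    seed: int            # fixed seed so that a run can be reproduced

def load_config(env=None):
    """Build the run configuration from environment variables.

    Passing a plain dict as `env` makes the function easy to test.
    """
    env = os.environ if env is None else env
    return RunConfig(
        data_uri=env["DATA_URI"],  # required: fail early if it is missing
        model_version=env.get("MODEL_VERSION", "dev"),
        seed=int(env.get("SEED", "0")),
    )

# Hypothetical values, supplied here as a dict instead of real environment variables.
cfg = load_config({"DATA_URI": "s3://hypothetical-bucket/input.parquet", "SEED": "42"})
print(cfg)
```

With this layout the same versioned code can run unchanged on a laptop or in a container; only the injected settings differ.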
Accordingly, it is necessary to adopt version control tools such as Git, which support collaborative development and keep full traceability of code. This structure gently introduces DevOps principles (continuous integration, shared standards, automated checks). Automating the workflows improves both robustness and transparency.
As machine-learning methods enter statistical production, the same principles extend into MLOps: tracking experiments, managing models over time and monitoring their behaviour. However, adopting DevOps and MLOps practices is challenging, particularly for teams without significant expertise in maintaining such methods in production.
To address this gap, we developed the SSP Cloud datalab. It provides a unified workspace where quality emerges naturally from its use. SSP Cloud enforces Git usage and provides containerized, reproducible environments. It also offers a suite of services covering the entire lifecycle of modern ML and AI projects, from experiment tracking and model management to supervision of advanced AI systems, semantic search and retrieval capabilities, and data annotation workflows. These are one-click, preconfigured services that guide any data scientist smoothly from prototype to fully operational pipelines without requiring deep technical expertise.
By aligning development, experimentation, and deployment within a single platform, SSP Cloud operationalizes the following principles: transparent versioning, reproducible workflows and continuous monitoring.
SSPCloud’s users progress from Git to DevOps and then to MLOps, enabling their AI/ML projects to become production-ready and aligned with modern technical standards.
SSP Cloud supports the modernization of statistical production by delivering the tools and promoting the technical standards needed to ensure quality-by-design processes.
Ines Hiverlet
Insee
Ines Hiverlet is a data engineer at Insee, where she specializes in cloud-native solutions and Kubernetes orchestration. She currently works on the SSPCloud platform, where she is responsible for managing various catalogs of services to empower users with scalable and efficient solutions. Her role focuses on ensuring seamless integration, high availability, and user-friendly access to cutting-edge tools and services.