5 June 2026
10:30 – 11:45
ŠIBENIK II
Presentation title
The WEB-FOSS-NL Approach to Statistical Scraping and Open Source Collaboration
The integration of web data into official statistics offers immense potential but also presents significant challenges regarding stability, coverage, and data quality.
While previous initiatives like the ESSnet Web Intelligence Network (WIN) demonstrated the potential of web data, further innovation is required to transition from experimental exploration to robust statistical production. This presentation introduces WEB-FOSS-NL, a project by Statistics Netherlands (CBS), which aims to explore the methodology of statistical scraping to enhance data quality. Moreover, it aims to implement generic open source building blocks for the statistical community to jointly experiment with this approach. Unlike "bulk scraping," which often indiscriminately harvests vast amounts of unstructured data, statistical scraping adopts a target-population-driven approach. This methodology prioritizes precision over volume, significantly reducing noise and processing burden—an important factor in maintaining statistical accuracy.
This hybrid approach aims to increase precision and reduce noise by combining advanced URL discovery for Business Register enhancement with focused scraping techniques and by utilizing machine learning and Large Language Models (LLMs) for content interpretation. The project seeks to develop generic open source software to improve the accuracy of linking web sources to statistical units and extracting complex variables, with an initial test focus on Online Job Vacancies. In this context, data from the Eurostat Web Intelligence Hub (WIH) has been used for training purposes.
A core objective of this project is to ensure transparency and reproducibility by developing generic, modular software building blocks released as Free and Open Source Software (FOSS). This comprises generic software for finding URLs from business register units via internet search, for focused scraping aligned with the statistical scraping concept, and a first setup of a generic classifier for texts from the web for official statistics. Another objective of this project is to foster cross-border quality harmonization through the establishment of Statistical Scraping Interest Group (SSIG) meetings, co-organized with Statistics Austria. In this presentation we will discuss the experiences, report on the lessons learned, and motivate how international open-source collaboration can contribute to sustainable, high-quality web data production in the European Statistical System.
This work is co-funded by the European Commission Project “WEB-FOSS-NL” – 101225292
Olav ten Bosch
Statistics Netherlands
Olav ten Bosch is a project manager at the R&D department of Statistics Netherlands, with a background in Theoretical Computer Science and a PhD in Electrical Engineering. He works on international innovation projects in official statistics, new data sources, web data, open source software, metadata standards, and open data. Olav has been a member of various international standardization groups and ESS innovation projects on these topics.
Presentation title
Integration of Company Websites into Business Registers: Identification, Quality Assurance, and Analytical Use of URLs
This presentation demonstrates how National Statistical Institutes (NSIs) can systematically implement, maintain, and quality-assure company URLs within their National Statistical Business Registers (NSBRs).
It provides a detailed account of the methodology developed by Statistics Denmark and is structured around three core components. First, it describes how company URLs are identified through a multi-source approach that combines existing register information in the ABR with data from the national top-level domain register to ensure broad and consistent coverage. These data are supplemented by the use of a locally hosted Llama large language model (LLM), which analyses company names, legal identifiers, and auxiliary register attributes to identify and suggest potential URLs. Second, it outlines the quality assurance framework applied to the identified URLs, including automated checks and procedures for accurately linking URLs to the correct enterprises and legal units. Third, it presents the preliminary development of content-based indicators derived from company websites, which are used to segment URLs according to observable characteristics and business activities. In this context, the locally hosted Llama LLM is also applied to analyse website content and support the segmentation process.
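To make the idea of matching register units against a domain register concrete, the following is a minimal sketch in Python. It is not the method used by Statistics Denmark: it replaces the LLM-based suggestion step with a plain string-similarity baseline from the standard library. The function names, the list of legal-form suffixes, and the 0.8 threshold are illustrative assumptions, and domains are assumed to be simple second-level registrations such as `name.dk`.

```python
import re
from difflib import SequenceMatcher

# Hypothetical legal-form tokens to ignore; a real list would be country-specific.
LEGAL_FORMS = {"a/s", "aps", "ivs", "ltd", "gmbh", "bv"}


def normalize_name(name: str) -> str:
    """Lowercase a company name, drop legal-form tokens, keep only letters and digits."""
    tokens = re.findall(r"[a-z0-9/]+", name.lower())
    kept = [t for t in tokens if t not in LEGAL_FORMS]
    return re.sub(r"[^a-z0-9]", "", "".join(kept))


def domain_label(domain: str) -> str:
    """Return the label left of the top-level domain, e.g. 'nordiskbyg' for 'www.nordiskbyg.dk'.

    Assumes a single-part top-level domain (so 'co.uk'-style suffixes are not handled).
    """
    parts = domain.lower().strip(".").split(".")
    label = parts[-2] if len(parts) >= 2 else parts[0]
    return re.sub(r"[^a-z0-9]", "", label)


def score(company_name: str, domain: str) -> float:
    """Similarity in [0, 1] between the normalized name and the domain label."""
    return SequenceMatcher(None, normalize_name(company_name), domain_label(domain)).ratio()


def suggest_urls(company_name: str, domains: list[str], threshold: float = 0.8):
    """Return (domain, score) pairs at or above the threshold, best match first."""
    scored = [(d, round(score(company_name, d), 2)) for d in domains]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

A call such as `suggest_urls("Nordisk Byg ApS", ["nordiskbyg.dk", "example.dk"])` would keep only the first domain. In practice, candidates produced this way would still need the quality assurance step described above before being linked to an enterprise or legal unit.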
Statistics Denmark participates in the Eurostat-funded OBEC project, where these web-based data sources are used to train a machine-learning model for the classification of companies’ economic activities based on NACE Rev. 2.1. In addition, the approach supports analytical segmentation of enterprises and contributes to reducing respondent burden by exploiting information already available online.
Dennis Pipenbring
Statistics Denmark
Dennis Pipenbring – Senior Adviser
MSc in Mathematics, University of Copenhagen
Senior Adviser at Statistics Denmark with 3 years of experience in profiling, data consistency, and professional development. Extensive background in register and survey data for research purposes (23+ years). Key contributor to Large Cases Unit (LCU) activities, including hosting and participating in international study visits and training programs. Skilled in developing teaching materials and delivering training, with 16 years as a secondary school lecturer in mathematics, chemistry, and computer science, and prior role as National Educational Consultant for Mathematics Education. Strong experience in WHO projects and advising PhD students. Proficient in programming and fluent in English.
CO-AUTHOR:
Presentation title
WEBFOSS-AT: Open-Source tools to find URLs and perform a "focused" scrape
As the demand for timely and detailed insights continues to grow, National Statistical Institutes (NSIs) are increasingly looking to existing data sources, such as publicly available web content, as alternatives to traditional data collection methods.
Financial and operational limitations often make it impractical to collect this information directly from individuals or enterprises. However, integrating web scraping into official statistics poses significant challenges, including technical complexity, scalability, and long-term maintenance. These challenges pose a risk to the quality of outputs based on web data. To help overcome these obstacles, we introduce a new `R` package designed to streamline the creation and management of web scraping workflows, specifically tailored to the practical needs and constraints of NSIs.
The package focuses on simplifying the processes of data discovery and extraction through a user-friendly interface for defining scraping tasks, handling diverse data formats, and scheduling automated data collection. Emphasis is placed on modular architecture, enabling easy extension and customization with minimal coding effort. Built-in features for logging, error handling, and adherence to best practices in data collection support transparency, reproducibility, and operational reliability. Advanced functionalities, such as snapshotting mechanisms that automatically resume interrupted scraping sessions and support for parallel execution, will also be explored to ensure scalability and high quality.
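The snapshotting idea can be illustrated with a short sketch. The example below is written in Python purely for illustration and does not reflect the API of the R package. The function name `scrape_with_snapshot`, the JSON snapshot format, and the `fetch` callback are all assumptions. Results are written to disk after every URL, so a rerun after a crash skips the work already done.

```python
import json
import os
from typing import Callable, Dict, Iterable


def scrape_with_snapshot(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    snapshot_path: str,
) -> Dict[str, str]:
    """Fetch each URL once, persisting progress so an interrupted run can resume."""
    results: Dict[str, str] = {}

    # Resume: load whatever an earlier (possibly interrupted) run already saved.
    if os.path.exists(snapshot_path):
        with open(snapshot_path, encoding="utf-8") as fh:
            results = json.load(fh)

    for url in urls:
        if url in results:
            continue  # finished in an earlier run, do not fetch again

        results[url] = fetch(url)  # may raise; earlier results are already on disk

        # Write to a temporary file first, then swap it in, so a crash during
        # writing never leaves a half-written snapshot behind.
        tmp_path = snapshot_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, snapshot_path)

    return results
```

Saving after every URL keeps the sketch simple; a production workflow would more likely save in batches and combine this with the logging and error handling mentioned above.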
“Focused” scraping is a key feature for statistical web scraping. Rather than extracting an entire website, it targets specific, relevant information. This concept, along with the capabilities of the R package, is demonstrated through examples from price statistics and business statistics. This work is co-funded by the European Commission Project “WEBFOSS-AT” – 101225462 — STATA-WEB-SSIG.
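As a rough illustration of the focused approach, the sketch below selects only those links on a page that look relevant to a target variable, instead of following every link. It is written in Python for illustration and is not the interface of the R package. The class `LinkCollector`, the function `focused_links`, and the keyword list are hypothetical, and the HTML is passed in as a string so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def focused_links(html: str, base_url: str, keywords: list[str]) -> list[str]:
    """Return same-site URLs whose path or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc

    selected = []
    for href, text in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.netloc != site:
            continue  # stay on the enterprise's own website
        haystack = (parsed.path + " " + text).lower()
        if any(k in haystack for k in keywords) and url not in selected:
            selected.append(url)
    return selected
```

For a job vacancy use case, the keywords might be fragments such as `"career"`, `"job"`, or `"vacanc"`; for price statistics they would instead point to product or price pages. Only the selected pages would then be fetched and parsed.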
Johannes Gussenbauer
Statistics Austria
Johannes Gussenbauer is a Methodologist at Statistics Austria, working in the Center for Methods. He has an academic background in Mathematics and Statistics. His expertise includes imputation, calibration, and error estimation for sample surveys. In addition, Johannes applies machine learning techniques to tasks such as text classification, mass imputation, and the analysis of web-scraped data.
CO-AUTHORS:
Presentation title
One year of statistical scraping at Statistics Hesse - insights in systems and lessons learned
The statistical office of Hesse, HSL, has developed and maintains a system for the systematic search of enterprise URLs for units from the statistical business registry, SBR, as well as a system of searching for keywords on enterprise websites, e.g. to assist classification of economic activity or other statistical purposes. These systems have been in productive use since summer 2025 and all statistical offices in Germany can use them.
Additionally, HSL has two other scraping applications in productive use for the alliance of statistical offices in Germany: one for scraping trade registry information for enterprises (e.g. names of managing directors necessary for legally binding letters; verbal description of economic activity) from the SBR, and another one for scraping several hotel booking sites in order to keep sampling frames up to date.
There have been several obstacles in both developing and maintaining such systems. The presentation will give a schematic overview of the systems developed and maintained and will also venture an outlook: which next steps will be taken, and what experience and decisions will be necessary to maintain such systems.
By using such systems, intensive (but scarce) manual labor can be reduced (and used for other labor-intensive tasks) and a new level of standardization of such tasks can be achieved, thus potentially increasing reliability.
Tobias Gramlich
Hessisches Statistisches Landesamt (HSL)
Tobias used to be a social scientist with emphasis on quantitative survey methods. After leaving university (University of Konstanz, University of Duisburg-Essen), he joined official statistics with emphasis on examining new data sources for their use in official statistics. For example, he was engaged in projects investigating aggregated mobile phone data. Together with his colleagues, he investigated the use of data from the web for different purposes in official statistics. As software developers, they develop (mainly using R) and maintain systems for use by official statistics offices in Germany.