2025-2026 Webcast Lecture Series
Internet Scraping
Dr. Richard Haans
Friday, January 23 | 9:00 AM – 10:15 AM ET
Abstract
Websites represent a crucial avenue for organizations to reach customers, attract talent, and disseminate information to stakeholders. Despite their importance, strikingly little work in organization and management research has tapped into this source of longitudinal big data. In this paper, we highlight the unique nature and profound potential of longitudinal website data and present novel open-source codebases and databases that make these data accessible. Specifically, our codebase offers a general-purpose setup, building on four central steps to scrape historical websites using the Wayback Machine. Our open-access CompuCrawl database was built using this four-step approach. It contains websites of North American firms in the Compustat database between 1996 and 2020, covering 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages. We describe the coverage of our database and illustrate its use by applying word-embedding models to reveal how the meaning of the concept of “sustainability” has evolved over time. Finally, we outline several avenues for future research enabled by our step-by-step longitudinal web scraping approach and our CompuCrawl database.

Dr. Richard Haans
Biography
I am an associate professor of Strategic Management and Entrepreneurship at the Rotterdam School of Management, Erasmus University Rotterdam. My research focuses on two related domains: 1) the question of how the two defining criteria of creativity (usefulness and novelty) relate to one another and shape the performance of individuals and organizations, and 2) the question of how much organizations strive to differ from competitors to attain optimal performance (so-called ‘optimal distinctiveness’). This research agenda is positioned at the intersection of institutional theory, entrepreneurship, and strategic management. I supplement these lines of work with a strong methodological background, having published research on state-of-the-art methodological topics such as curvilinear relationships and text analysis using machine learning.



