MSR 2025
Mon 28 - Tue 29 April 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025

Large open-source software (OSS) communities comprise multiple interrelated projects that host numerous repositories and involve thousands of interacting contributors. Socio-technical studies of a community’s collaboration dynamics can benefit from historical logs of the detailed activities performed by the projects’ contributors. This paper provides an automated mapping of raw public events in GitHub repositories to structured activities that more accurately capture the intent of contributors. It also contributes a large dataset containing three years of activities by the 180K+ contributors of NumFocus, a large OSS community supporting scientific research and data science. The dataset covers 58 projects and includes 2.2M+ activities across 2,851 GitHub repositories. This dataset enables advanced studies of the NumFocus community’s collaboration dynamics, and the activity mapping process makes it possible to create and use similar datasets for other OSS communities.
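To give a rough sense of what mapping raw events to structured activities could look like, here is a minimal Python sketch. It is not the paper's mapping: the activity names (`OpenPullRequest`, `CommentIssue`, and so on), the output fields, and the function name `map_event_to_activity` are hypothetical, chosen only for illustration. The sketch assumes events shaped like those returned by GitHub's public events API (`type`, `actor.login`, `repo.name`, `created_at`, `payload.action`). It illustrates one reason raw events may not reflect intent directly: a single `IssueCommentEvent` can be a comment on an issue or on a pull request.

```python
from typing import Any, Optional


def map_event_to_activity(event: dict[str, Any]) -> Optional[dict[str, str]]:
    """Map one raw GitHub event to a structured activity record.

    Illustrative only: activity names and output fields are assumptions,
    not the mapping defined in the paper.
    """
    etype = event.get("type")
    payload = event.get("payload", {})
    action = payload.get("action")

    if etype == "PullRequestEvent":
        if action == "opened":
            name = "OpenPullRequest"
        elif action == "closed":
            # A closed pull request is either merged or rejected.
            merged = payload.get("pull_request", {}).get("merged", False)
            name = "MergePullRequest" if merged else "ClosePullRequest"
        else:
            return None
    elif etype == "IssueCommentEvent" and action == "created":
        # The same raw event type is used for comments on issues and on
        # pull requests; the presence of a "pull_request" key tells them apart.
        on_pr = "pull_request" in payload.get("issue", {})
        name = "CommentPullRequest" if on_pr else "CommentIssue"
    elif etype == "IssuesEvent" and action == "opened":
        name = "OpenIssue"
    else:
        # Event types not covered by this sketch are skipped.
        return None

    return {
        "activity": name,
        "contributor": event["actor"]["login"],
        "repository": event["repo"]["name"],
        "date": event["created_at"],
    }
```

Applied to a stream of events, a function like this would yield one record per recognised activity, which could then be aggregated per contributor, repository, or project.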