Can LLMs Replace Manual Annotation of Software Engineering Artifacts?
Tue 29 Apr 2025, 11:10 - 11:20, at 214. Session: Software ecosystems and humans. Chair(s): Ahmad Abdellatif
Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal to or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
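The abstract relies on two ideas: measuring agreement between annotators (human or LLM) and using model confidence to decide which samples an LLM can label on its own. The sketch below illustrates how such a pipeline could look in Python; it is not the authors' implementation, and the choice of Cohen's kappa as the agreement measure, the confidence threshold, and all function and variable names are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code) of agreement computation and
# confidence-based sample selection for mixed human-LLM annotation.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators.

    `annotations` maps an annotator id (human or LLM) to a list of
    labels, one label per annotated sample, in the same order.
    """
    pairs = list(combinations(annotations, 2))
    kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
    return sum(kappas) / len(kappas)


def select_llm_labelable(samples, llm_confidences, threshold=0.9):
    """Keep only samples where the LLM's self-reported confidence is high;
    the remaining samples would still go to human annotators."""
    return [s for s, c in zip(samples, llm_confidences) if c >= threshold]


# Toy example: two humans and one LLM judging five code summaries.
annotations = {
    "human_1": ["accurate", "inaccurate", "accurate", "accurate", "inaccurate"],
    "human_2": ["accurate", "inaccurate", "accurate", "inaccurate", "inaccurate"],
    "llm":     ["accurate", "inaccurate", "accurate", "accurate", "inaccurate"],
}
print(f"mean pairwise kappa: {mean_pairwise_kappa(annotations):.2f}")
```

In this reading, model-model agreement would be obtained by calling `mean_pairwise_kappa` on the labels of several LLMs only, and comparing it to the human-human value to judge whether a task is suitable for LLM annotation at all.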
Mon 28 Apr (displayed time zone: Eastern Time, US & Canada)
13:00 - 13:30
13:00 30m Talk | Chasing the Clock: How Fast Are Vulnerabilities Fixed in the Maven Ecosystem? (Mining Challenge) | Md Fazle Rabbi (Idaho State University), Arifa Islam Champa (Idaho State University), Rajshakhar Paul (Wayne State University), Minhaz F. Zibran (Idaho State University) | Pre-print
13:00 30m Talk | MaLAware: Automating the Comprehension of Malicious Software Behaviours using Large Language Models (LLMs) (Data and Tool Showcase Track) | BIKASH SAHA (Indian Institute of Technology Kanpur), Nanda Rani (Indian Institute of Technology Kanpur), Sandeep K. Shukla (Indian Institute of Technology Kanpur) | Pre-print
13:00 30m Talk | A Dataset of Contributor Activities in the NumFocus Open-Source Community (Data and Tool Showcase Track) | Youness Hourri (University of Mons), Alexandre Decan (University of Mons; F.R.S.-FNRS), Tom Mens (University of Mons) | Pre-print
13:00 30m Talk | Popularity and Innovation in Maven Central (Mining Challenge) | Nkiru Ede (Victoria University of Wellington), Jens Dietrich (Victoria University of Wellington), Ulrich Zülicke (Victoria University of Wellington) | Pre-print
13:00 30m Talk | TerraDS: A Dataset for Terraform HCL Programs (Data and Tool Showcase Track) | Christoph Buehler (University of St. Gallen), David Spielmann (University of St. Gallen), Roland Meier (armasuisse), Guido Salvaneschi (University of St. Gallen) | Pre-print
13:00 30m Talk | SPRINT: An Assistant for Issue Report Management (Data and Tool Showcase Track) | Pre-print
13:00 30m Talk | Does Functional Package Management Enable Reproducible Builds at Scale? Yes. (Technical Papers) | Julien Malka (LTCI, Télécom Paris, Institut Polytechnique de Paris, France), Stefano Zacchiroli (Télécom Paris, Polytechnic Institute of Paris), Théo Zimmermann (Télécom Paris, Polytechnic Institute of Paris) | Pre-print
13:00 30m Talk | Dependency Update Adoption Patterns in the Maven Software Ecosystem (Mining Challenge) | Baltasar Berretta (College of Wooster), Augustus Thomas (College of Wooster), Heather Guarnera (The College of Wooster)
13:00 30m Talk | A Dataset of Software Bill of Materials for Evaluating SBOM Consumption Tools (Data and Tool Showcase Track) | Rio Kishimoto (Osaka University), Tetsuya Kanda (Notre Dame Seishin University), Yuki Manabe (The University of Fukuchiyama), Katsuro Inoue (Nanzan University), Shi Qiu (Toshiba), Yoshiki Higo (Osaka University) | Pre-print
13:00 30m Talk | Investigating the Understandability of Review Comments on Code Change Requests (Technical Papers) | Md Shamimur Rahman (University of Saskatchewan, Canada), Zadia Codabux (University of Saskatchewan), Chanchal K. Roy (University of Saskatchewan, Canada)
13:00 30m Talk | Refactoring for Dockerfile Quality: A Dive into Developer Practices and Automation Potential (Technical Papers) | Emna Ksontini (University of Michigan - Dearborn), Meriem Mastouri (University of Michigan), Rania Khalsi (University of Michigan - Flint), Wael Kessentini (DePaul University)
13:00 30m Talk | Combining Large Language Models with Static Analyzers for Code Review Generation (Technical Papers) | Imen Jaoua (DIRO, Université de Montréal), Oussama Ben Sghaier (DIRO, Université de Montréal), Houari Sahraoui (DIRO, Université de Montréal) | Pre-print
13:00 30m Talk | Cascading Effects: Analyzing Project Failure Impact in the Maven Central Ecosystem (Mining Challenge) | Mina Shehata (Belmont University), Saidmakhmud Makhkamjonoov (Belmont University), Mahad Syed (Belmont University), Esteban Parra (Belmont University)
13:00 30m Talk | Can LLMs Replace Manual Annotation of Software Engineering Artifacts? (Technical Papers) | Toufique Ahmed (IBM Research), Prem Devanbu (University of California at Davis), Christoph Treude (Singapore Management University), Michael Pradel (University of Stuttgart) | Pre-print
13:00 30m Talk | Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot (Technical Papers) | Daniele Bifolco (University of Sannio), Pietro Cassieri (University of Salerno), Giuseppe Scanniello (University of Salerno), Massimiliano Di Penta (University of Sannio, Italy), Fiorella Zampetti (University of Sannio, Italy) | Pre-print
13:00 30m Talk | Mining a Decade of Contributor Dynamics in Ethereum: A Longitudinal Study (Technical Papers) | Matteo Vaccargiu (University of Cagliari), Sabrina Aufiero (University College London (UCL)), Cheick Ba (Queen Mary University of London), Silvia Bartolucci (University College London), Richard Clegg (Queen Mary University London), Daniel Graziotin (University of Hohenheim), Rumyana Neykova (Brunel University London), Roberto Tonelli (University of Cagliari), Giuseppe Destefanis (Brunel University London) | Pre-print
13:00 30m Talk | SCRUBD: Smart Contracts Reentrancy and Unhandled Exceptions Vulnerability Dataset (Data and Tool Showcase Track) | Chavhan Sujeet Yashavant (Indian Institute of Technology, Kanpur), Mitrajsinh Chavda (Indian Institute of Technology Kanpur, India), Saurabh Kumar (Indian Institute of Technology Hyderabad, India), Amey Karkare (IIT Kanpur), Angshuman Karmakar (Indian Institute of Technology Kanpur, India) | Pre-print
13:00 30m Talk | Out of Sight, Still at Risk: The Lifecycle of Transitive Vulnerabilities in Maven (Mining Challenge) | Piotr Przymus (Nicolaus Copernicus University in Toruń, Poland), Mikołaj Fejzer (Nicolaus Copernicus University in Toruń), Jakub Narębski (Nicolaus Copernicus University in Toruń), Krzysztof Rykaczewski (Nicolaus Copernicus University in Toruń, Poland), Krzysztof Stencel (University of Warsaw) | Pre-print
13:00 30m Talk | HaPy-Bug - Human Annotated Python Bug Resolution Dataset (Data and Tool Showcase Track) | Piotr Przymus (Nicolaus Copernicus University in Toruń, Poland), Mikołaj Fejzer (Nicolaus Copernicus University in Toruń), Jakub Narębski (Nicolaus Copernicus University in Toruń), Radosław Woźniak (Nicolaus Copernicus University in Toruń), Łukasz Halada (University of Wrocław, Poland), Aleksander Kazecki (Nicolaus Copernicus University in Toruń), Mykhailo Molchanov (Igor Sikorsky Kyiv Polytechnic Institute, Ukraine), Krzysztof Stencel (University of Warsaw) | Pre-print