MSR 2025
Mon 28 - Tue 29 April 2025, Ottawa, Ontario, Canada
co-located with ICSE 2025

For software testing research, Defects4J stands out as the primary benchmark dataset, offering a controlled environment to study real bugs from prominent open-source systems. While Defects4J provides a clean and valuable dataset, our goal is to explore how fault localization techniques perform under less-controlled development scenarios. In this paper, we revisit Defects4J to study the changes that developers made to fault-triggering tests after the bugs were reported or fixed. We aim to introduce a new evaluation scenario within Defects4J, focusing on the implications of regression tests and test changes added after the bug was fixed. We analyze when these tests were modified relative to bug report creation and examine spectrum-based fault localization (SBFL) performance in less-controlled settings. Our findings show that 1) 55% of the fault-triggering tests were added to reproduce the bug or to serve as regression tests; 2) 22% of the tests were changed after the bug reports were filed, incorporating information related to the bug; 3) developers often update tests with new assertions or modify them to match source code updates; and 4) SBFL performance differs significantly in less-controlled settings (degrading by up to 415% in Mean First Rank). Our study points out the diverse development scenarios present in the studied bugs, highlighting new settings for future SBFL evaluations and bug benchmarks.
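
For readers unfamiliar with the terms: SBFL ranks program elements by a suspiciousness score computed from test coverage, and Mean First Rank (MFR) reports the rank of the first faulty element in that list, averaged over bugs (lower is better). The Python sketch below is purely illustrative, using the common Ochiai formula; it is not the paper's implementation, and the function names, coverage dictionaries, and toy data are hypothetical.

import math
from statistics import mean

def ochiai(e_f, e_p, total_failed):
    # Ochiai suspiciousness: e_f / sqrt(total_failed * (e_f + e_p)).
    denom = math.sqrt(total_failed * (e_f + e_p))
    return e_f / denom if denom else 0.0

def rank_statements(coverage, failing):
    # coverage: {test_name: set of covered statements}; failing: set of failing test names.
    # Returns statements ordered from most to least suspicious.
    total_failed = len(failing)
    statements = set().union(*coverage.values())
    def score(s):
        e_f = sum(1 for t, cov in coverage.items() if t in failing and s in cov)
        e_p = sum(1 for t, cov in coverage.items() if t not in failing and s in cov)
        return ochiai(e_f, e_p, total_failed)
    return sorted(statements, key=score, reverse=True)

def first_rank(ranking, faulty):
    # 1-based rank of the highest-ranked faulty statement for one bug.
    return min(i + 1 for i, s in enumerate(ranking) if s in faulty)

def mean_first_rank(per_bug):
    # per_bug: list of (ranking, faulty_statements) pairs, one per bug; lower MFR is better.
    return mean(first_rank(ranking, faulty) for ranking, faulty in per_bug)

# Toy example: two tests, one failing, three statements.
coverage = {"testA": {"s1", "s2"}, "testB": {"s2", "s3"}}
failing = {"testA"}
ranking = rank_statements(coverage, failing)  # 's1' ranks first: it is covered only by the failing test
print(mean_first_rank([(ranking, {"s1"})]))   # MFR = 1 for this single hypothetical bug

Changing which tests are included (e.g., adding or removing developer-written regression tests) changes the coverage spectra, the rankings, and therefore the MFR, which is the kind of sensitivity the study examines in less-controlled settings.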