MSR 2025
Mon 28 - Tue 29 April 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025

Software development teams increasingly rely on machine learning models to automate routine tasks, yet current models struggle to suggest effectively which files need modification when addressing issues or implementing new features. We present CodeFix-Bench, a comprehensive benchmark for developing and evaluating code change localization models, built upon the SWE-bench dataset of real-world GitHub issues and pull requests. Our benchmark provides 2,294 high-quality instances; each input consists of an issue description and the complete codebase, and the task is to identify the files that must be modified to resolve the issue.
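For a task like this, the file-level ground truth can be recovered from the merged pull request's diff. As an illustrative sketch (assuming instances carry a unified-diff `patch` field, as the underlying SWE-bench dataset does; the field name here is an assumption about CodeFix-Bench's schema), the labels can be extracted like so:

```python
import re

def gold_files(patch: str) -> list[str]:
    """Extract modified file paths from a unified diff, i.e. the
    ground-truth labels for file-level change localization."""
    # Each changed file appears on a header line: diff --git a/<path> b/<path>
    paths = re.findall(r"^diff --git a/(\S+) b/\1$", patch, flags=re.M)
    # Deduplicate while preserving order of first appearance.
    seen, out = set(), []
    for p in paths:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

patch = (
    "diff --git a/src/auth.py b/src/auth.py\n"
    "--- a/src/auth.py\n"
    "+++ b/src/auth.py\n"
)
print(gold_files(patch))  # → ['src/auth.py']
```

Renamed files (where the `a/` and `b/` paths differ) would need a looser pattern; the backreference above only matches in-place modifications.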

As baselines, we evaluate several traditional information retrieval approaches (BM25, TF-IDF, VSM-COS) on this task. Our experiments reveal that while BM25 achieves promising results (a Top-5 accuracy of 77.78%), significant challenges remain in handling large codebases (MAP drops from 0.63 to 0.59) and in understanding implicit code dependencies. An analysis of failure cases highlights opportunities for more sophisticated models that incorporate code structure, historical patterns, and developer feedback.
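To make the BM25 baseline concrete, the following is a minimal, self-contained Okapi BM25 sketch that ranks candidate files against an issue description. It uses plain whitespace tokenization for clarity; a real pipeline would use code-aware tokenization and a tuned library implementation.

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Rank documents (e.g. file contents) against a query (e.g. an
    issue description) with Okapi BM25; returns indices, best first."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs  # average document length
    # Document frequency of each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    # Standard smoothed BM25 idf.
    idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t in tf:
                s += idf[t] * tf[t] * (k1 + 1) / (
                    tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                )
        scores.append(s)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

A Top-5 prediction for an issue is then simply the first five indices of the returned ranking, which is how the Top-5 accuracy figure above would be computed per instance.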

This paper makes three key contributions: (1) a carefully curated benchmark for evaluating change localization models, (2) strong baseline implementations using traditional IR methods, and (3) detailed analysis of current limitations that can guide future model development. Our benchmark and findings provide a foundation for creating more effective models that can assist developers in navigating and modifying large codebases.