CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories
Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers frequently face the challenge of upgrading their codebases to adapt to new programming language versions, which is a tedious and time-consuming process. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting that they could also help automate code upgrades. Despite this promise, no benchmark exists for evaluating the code upgrade ability of LLMs, because distilling the code changes related to programming language evolution from the commit histories of real-world software repositories is difficult. In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade in Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features, and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build system upgrades. The dataset provides high-quality samples by filtering out irrelevant and noisy changes and verifying that the upgraded code compiles. Moreover, CoUpJava covers a diverse range of code upgrade scenarios, from small, fine-grained refactorings to large-scale repository modifications.
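To make the notion of a fine-grained, method-level upgrade concrete, the following sketch shows one method written in a Java 7 style and again with Java 8 streams and method references. It is an illustrative assumption about what such a refactoring can look like, not a sample taken from CoUpJava; the class name `UpgradeExample` and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of a method-level upgrade; not drawn from the dataset.
public class UpgradeExample {

    // "Before": Java 7 style with an explicit loop and a mutable accumulator.
    public static List<String> upperCaseNonEmptyLegacy(List<String> names) {
        List<String> result = new ArrayList<String>();
        for (String name : names) {
            if (!name.isEmpty()) {
                result.add(name.toUpperCase());
            }
        }
        return result;
    }

    // "After": the same behavior expressed with Java 8 streams,
    // a lambda, and a method reference.
    public static List<String> upperCaseNonEmptyUpgraded(List<String> names) {
        return names.stream()
                .filter(name -> !name.isEmpty())
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ada", "", "linus");
        // Both versions should produce the same list for the same input.
        System.out.println(upperCaseNonEmptyLegacy(input));
        System.out.println(upperCaseNonEmptyUpgraded(input));
    }
}
```

Both methods return the same result for the same input, which is the property a behavior-preserving upgrade should keep; only the language features used differ.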