MSR 2025
Mon 28 - Tue 29 April 2025, Ottawa, Ontario, Canada
co-located with ICSE 2025
Mon 28 Apr 2025 16:10 - 16:20 at 214 - LLMs for Code. Chair(s): Ali Ouni

Large language models (LLMs) are used extensively for AI-assisted programming and code generation. The challenge is to ensure that the generated code is not only functionally correct but also safe, reliable, and trustworthy. To this end, we conduct a comprehensive empirical analysis of AI-generated code to assess whether LLMs can produce code that is both correct and of higher quality than human-written code. We evaluate the quality of 984 code samples generated by GPT-3.5-Turbo and GPT-4 with various prompt types (simple, instructional, and enhanced) against input queries from the HumanEval dataset. We also enhance the HumanEval benchmark by computing code quality metrics for the human-written code it contains. These metrics are calculated with established tools (Radon, Bandit, Pylint, and Complexipy), with the human-written code serving as a baseline for comparison. To quantify performance, we employ the TOPSIS method to rank the models and human code by their proximity to the ideal and anti-ideal code quality metrics. Our results show that GPT-4, when used with advanced prompts, produces code closest to the ideal solution, outperforming human-written code on several key metrics. Our work provides evidence that LLMs, when properly guided, can surpass human developers in generating high-quality code. Our code and datasets are available online.
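For readers unfamiliar with the ranking step, the sketch below shows standard TOPSIS in Python/numpy as the abstract describes it: normalize the decision matrix, weight it, locate the ideal and anti-ideal points per criterion, and score each alternative by its relative closeness to the ideal. The metric columns, weights, and scores here are illustrative placeholders, not the study's actual data or the authors' implementation.

import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives by relative closeness to the ideal solution.

    matrix  -- (alternatives x criteria) raw scores
    weights -- per-criterion weights summing to 1
    benefit -- True where higher is better, False where lower is better
    """
    X = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the weights.
    V = np.asarray(weights) * X / np.linalg.norm(X, axis=0)
    # Ideal point takes the best value per criterion; anti-ideal the worst.
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)  # distance to ideal
    d_neg = np.linalg.norm(V - anti, axis=1)   # distance to anti-ideal
    return d_neg / (d_pos + d_neg)             # closeness score in [0, 1]

# Illustrative only: three code sources scored on a maintainability
# index (higher is better) and cyclomatic complexity (lower is better).
scores = [[72.0, 4.1],   # e.g. GPT-4, enhanced prompt
          [65.0, 5.3],   # e.g. GPT-3.5-Turbo
          [68.0, 4.8]]   # e.g. human-written baseline
print(topsis(scores, weights=[0.5, 0.5], benefit=[True, False]))

An alternative closest to the ideal on every weighted criterion scores near 1; ranking the closeness scores in descending order yields the ordering of models and human code reported in the paper's methodology.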

Mon 28 Apr

Displayed time zone: Eastern Time (US & Canada)

16:00 - 17:30
LLMs for Code (Technical Papers / Data and Tool Showcase Track / Tutorials) at 214
Chair(s): Ali Ouni ETS Montreal, University of Quebec
16:00
10m
Talk
How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning
Technical Papers
Fabio Salerno Delft University of Technology, Ali Al-Kaswan Delft University of Technology, Netherlands, Maliheh Izadi Delft University of Technology
16:10
10m
Talk
Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study
Technical Papers
Mohammad Talal Jamil Lahore University of Management Sciences, Shamsa Abid National University of Computer and Emerging Sciences, Shafay Shamail LUMS, DHA, Lahore
Pre-print
16:20
10m
Talk
Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code
Technical Papers
Jiho Shin York University, Clark Tang, Tahmineh Mohati University of Calgary, Maleknaz Nayebi York University, Song Wang York University, Hadi Hemmati York University
16:30
5m
Talk
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code
Data and Tool Showcase Track
Timur Galimzyanov JetBrains Research, Sergey Titov JetBrains Research, Yaroslav Golubev JetBrains Research, Egor Bogomolov JetBrains Research
Pre-print
16:35
5m
Talk
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code
Data and Tool Showcase Track
Daniel Rodriguez-Cardenas William & Mary, Alejandro Velasco William & Mary, Denys Poshyvanyk William & Mary
Pre-print
16:50
40m
Tutorial
Harmonized Coding with AI: LLMs for Qualitative Analysis in Software Engineering Research
Tutorials
Christoph Treude Singapore Management University, Youmei Fan Nara Institute of Science and Technology, Tao Xiao Kyushu University, Hideaki Hata Shinshu University
File Attached