Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study
Large language models (LLMs) are being used extensively for AI-assisted programming and code generation. The challenge is to ensure that the generated code is not only functionally correct but also safe, reliable, and trustworthy. To this end, we conduct a comprehensive empirical analysis of AI-generated code to assess whether LLMs can produce code that is both correct and of higher quality than human-written code. We evaluate the quality of 984 code samples generated by GPT-3.5-Turbo and GPT-4 using three prompt types (simple, instructional, and enhanced) on input queries from the HumanEval dataset. We also augment the HumanEval benchmark by computing code quality metrics for the human-written code it contains. These metrics are calculated with established tools, namely Radon, Bandit, Pylint, and Complexipy, with the human-written code serving as the baseline for comparison. To quantify overall performance, we apply the TOPSIS method to rank the models and the human baseline by their proximity to the ideal and anti-ideal solutions across the quality metrics. Our results show that GPT-4, when used with enhanced prompts, produces code closest to the ideal solution, outperforming human-written code on several key metrics. Our work provides evidence that LLMs, when properly guided, can surpass human developers in generating high-quality code. Our code and datasets are available online.
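As a minimal sketch of the metric-extraction step, the snippet below computes cyclomatic complexity and the maintainability index for one code sample using Radon's Python API; the sample shown is the human-written solution style of HumanEval's has_close_elements task, and the exact metric set and invocation in the study may differ. Bandit and Pylint are typically run as separate CLI tools (e.g. `bandit file.py`, `pylint file.py`), so only the Radon portion is shown here.

```python
# Illustrative metric extraction with Radon, assuming the code sample
# is available as a string. Not the paper's exact pipeline.
from radon.complexity import cc_visit
from radon.metrics import mi_visit

sample = '''
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
'''

# Cyclomatic complexity reported per function/class block.
for block in cc_visit(sample):
    print(block.name, block.complexity)

# Maintainability index for the whole sample; multi=True treats
# multi-line strings as comments.
print(mi_visit(sample, multi=True))
```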
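To illustrate the ranking step, the following is a minimal TOPSIS sketch: alternatives (the two models and the human baseline) are scored by their closeness to the ideal and anti-ideal solutions over the quality metrics. It assumes vector normalization and equal criterion weights, and the metric values shown are hypothetical placeholders, not the study's measurements.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives by closeness to the ideal solution.

    matrix  : (n_alternatives, n_criteria) array of metric values
    weights : per-criterion weights summing to 1
    benefit : per-criterion flags; True means higher is better
    """
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply weights.
    v = (m / np.linalg.norm(m, axis=0)) * np.asarray(weights)
    # Ideal and anti-ideal points per criterion.
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    # Closeness coefficient in [0, 1]; higher means closer to ideal.
    return d_neg / (d_pos + d_neg)

# Hypothetical values over three criteria:
# [maintainability index (benefit), cyclomatic complexity (cost),
#  Pylint score (benefit)]
scores = topsis(
    matrix=[[72.0, 3.1, 8.9],   # e.g. GPT-4, enhanced prompt
            [65.0, 4.2, 8.1],   # e.g. GPT-3.5-Turbo
            [70.0, 3.8, 8.5]],  # e.g. human-written baseline
    weights=[1 / 3, 1 / 3, 1 / 3],
    benefit=[True, False, True],
)
print(scores)  # larger coefficient = better-ranked alternative
```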