How Effective are LLMs for Data Science Coding? A Controlled Experiment
The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of LLMs in this domain has not been thoroughly evaluated, limiting their reliable use in real-world data science workflows. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants—Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct)—on a diverse set of data science coding challenges sourced from the StrataScratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model’s effectiveness across task types (Analytical, Algorithm, Visualization) and varying difficulty levels. Our findings show that ChatGPT and Claude lead in success rate, with ChatGPT excelling in analytical and algorithm tasks. Hypothesis testing reveals that only ChatGPT and Claude maintain effectiveness above a 60% assertiveness baseline, while no model reaches significance at a 70% threshold, underscoring their strengths in specific areas but also their limitations at higher standards. Efficiency analysis indicates no significant differences in execution times among models for analytical tasks, although Claude and Copilot demonstrate more stable performance, whereas ChatGPT and Perplexity exhibit higher variability. This study provides a rigorous, performance-based evaluation framework for LLMs in data science, equipping practitioners with insights to select models tailored to specific task demands and setting empirical standards for future AI assessments beyond basic performance measures.
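To make the baseline comparison concrete, the sketch below shows one plausible way to test whether a model's success rate significantly exceeds a fixed assertiveness threshold (60% or 70%). The abstract does not specify the exact statistical procedure or the per-model counts, so the one-sided exact binomial test and the counts here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: one-sided test of a model's success rate against a fixed baseline.
# The choice of test (exact binomial) and the counts below are assumptions for illustration.
from scipy.stats import binomtest

def exceeds_baseline(successes: int, trials: int, baseline: float, alpha: float = 0.05) -> bool:
    """H0: success rate <= baseline; H1: success rate > baseline."""
    result = binomtest(successes, trials, p=baseline, alternative="greater")
    return result.pvalue < alpha

# Hypothetical example: a model solving 75 of 100 challenges clears the 60%
# baseline but not the stricter 70% one at alpha = 0.05.
for baseline in (0.60, 0.70):
    print(f"baseline {baseline:.0%}: significant = {exceeds_baseline(75, 100, baseline)}")
```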