Machine learning-based software vulnerability detection requires high-quality datasets, which is essential for training effective models. To address challenges related to data quality, diversity, and comprehensiveness, we constructed ICVul, a dataset emphasizing data quality and enriched with comprehensive metadata, including Vulnerability-Contributing Commits (VCCs). We began by filtering Common Vulnerabilities and Exposures from the NVD, retaining only those linked to GitHub fix commits. Then we extracted functions and files along with relevant metadata from these commits and used the SZZ algorithm to trace VCCs. To further enhance label reliability, we developed the ESC (Eliminate Suspicious Commit) technique, ensuring credible data labels. The dataset is stored in a relational-like database for improved usability and data integrity. Both ICVul and its construction framework are publicly accessible on GitHub, supporting research in related field.
Sofia Reis Instituto Superior Técnico, U. Lisboa & INESC-ID, Rui Abreu Faculty of Engineering of the University of Porto, Portugal, Corina Pasareanu CMU, NASA, KBR
Luis Soeiro LTCI, Télécom Paris, Institut Polytechnique de Paris, Thomas Robert LTCI, Télécom Paris, Institut Polytechnique de Paris, Stefano Zacchiroli Télécom Paris, Polytechnic Institute of Paris
BIKASH SAHA Indian Institute of Technology Kanpur, Nanda Rani Indian Institute of Technology Kanpur, Sandeep K. Shukla Indian Institute of Technology Kanpur