IEEE DataPort : From Miscontextion to Misconceptions: Rethinking LLM-Based Vulnerability Detection through Context-Rich Evaluation - 2026

PY, TXT, JSON by Yue Li

Information

Format: PY, TXT, JSON
Publisher: IEEE DataPort
Publication Date of the Electronic Edition: 02/06/2026
ISBN: 10.21227/nmks-tb34

Description

Large Language Models have become promising tools for automated vulnerability detection, supported by their success in code generation, repair, and integration into developer workflows. Yet a key question remains: Are LLMs truly effective at detecting real-world vulnerabilities?

Current evaluations, typically limited to isolated functions or files, neglect the broader execution and data-flow context essential for understanding real vulnerabilities. This leads to misleading conclusions and flawed rationales, reducing the reliability of prior studies.

Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations.

To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically integrates contextual data into LLM-based vulnerability detection. We build a dataset of 2,000 vulnerable–patched program pairs across 99 CWEs and evaluate 13 LLMs from four model families.

Our framework elicits both binary predictions and natural-language rationales, further validated with LLM-as-a-judge methods. Our findings overturn existing misconceptions. When provided with sufficient context, state-of-the-art LLMs achieve substantially better performance (e.g., 67% accuracy and an F1 score above 70% on key CWEs), with precision nearing 0.8.

We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
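
The accuracy, precision, recall, and F1 figures quoted in the description are standard binary-classification metrics. As a minimal sketch of how such metrics could be computed over vulnerable/patched pairs, the Python below treats each vulnerable program as a positive example and each patched program as a negative one. The `PairResult` structure and the `pair_metrics` function are hypothetical names introduced for illustration; they are not taken from the dataset's scripts, and the authors' actual scoring may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Binary verdicts for one vulnerable/patched pair (hypothetical structure)."""
    flagged_vulnerable_version: bool  # model said "vulnerable" for the vulnerable program
    flagged_patched_version: bool     # model said "vulnerable" for the patched program


def pair_metrics(results: List[PairResult]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 over all programs in the pairs.

    Assumption: vulnerable programs are positives, patched programs are negatives.
    """
    tp = sum(r.flagged_vulnerable_version for r in results)  # vulnerable, flagged
    fn = len(results) - tp                                   # vulnerable, missed
    fp = sum(r.flagged_patched_version for r in results)     # patched, wrongly flagged
    tn = len(results) - fp                                   # patched, correctly cleared

    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy verdicts, invented purely to exercise the function.
    demo = [
        PairResult(True, False),
        PairResult(True, True),
        PairResult(False, False),
        PairResult(True, False),
    ]
    print(pair_metrics(demo))
```

Scoring both halves of each pair keeps the classes balanced, so a model that flags everything as vulnerable gains recall only at a visible cost in precision and accuracy.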
