IEEE DataPort : From Miscontextion to Misconceptions: Rethinking LLM-Based Vulnerability Detection through Context-Rich Evaluation - 2026
by Yue Li
Information
Format: PY, TXT, JSON
Publisher: IEEE DataPort
Publication Date of the Electronic Edition: 02/06/2026
DOI: 10.21227/nmks-tb34
Description
Large Language Models (LLMs) have become promising tools for automated vulnerability detection, supported by their success in code generation, repair, and integration into developer workflows. Yet a key question remains: are LLMs truly effective at detecting real-world vulnerabilities?

Current evaluations, typically limited to isolated functions or files, neglect the broader execution and data-flow context essential for understanding real vulnerabilities. This leads to misleading conclusions and flawed rationales, reducing the reliability of prior studies.

Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations.

To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically integrates contextual data into LLM-based vulnerability detection. We build a dataset of 2,000 vulnerable–patched program pairs across 99 CWEs and evaluate 13 LLMs from four model families.

Our framework elicits both binary predictions and natural-language rationales, further validated with LLM-as-a-judge methods. Our findings overturn existing misconceptions. When provided with sufficient context, state-of-the-art LLMs achieve substantially better performance (e.g., 67% accuracy and an F1 score above 70% on key CWEs), with precision nearing 0.8.

We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
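As a rough illustration of how binary predictions on vulnerable–patched pairs translate into the accuracy, precision, recall, and F1 figures quoted above, the following minimal sketch may help. It is not taken from the dataset's scripts: the `PairResult` record, the `score` function, and the pair-level metric are hypothetical, and it assumes the vulnerable version of each pair is the positive class and the patched version the negative class.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model verdicts for one vulnerable/patched pair (hypothetical layout)."""
    vulnerable_flagged: bool  # did the model flag the vulnerable version?
    patched_flagged: bool     # did the model flag the patched version?


def score(results):
    """Compute standard binary metrics over all programs in the pairs.

    Assumption: vulnerable versions are positives, patched versions negatives.
    """
    tp = sum(r.vulnerable_flagged for r in results)      # vulnerable, flagged
    fn = sum(not r.vulnerable_flagged for r in results)  # vulnerable, missed
    fp = sum(r.patched_flagged for r in results)         # patched, flagged
    tn = sum(not r.patched_flagged for r in results)     # patched, cleared

    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Stricter, pair-level view (an assumption, not a metric named in the
    # description): a pair counts only if the vulnerable version is flagged
    # and the patched version is not, i.e. the model reacts to the patch.
    both_right = sum(r.vulnerable_flagged and not r.patched_flagged
                     for r in results)
    pair_accuracy = both_right / len(results) if results else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pair_accuracy": pair_accuracy,
    }


if __name__ == "__main__":
    demo = [
        PairResult(True, False),   # both versions judged correctly
        PairResult(True, True),    # patch not recognized (false positive)
        PairResult(False, False),  # vulnerability missed (false negative)
        PairResult(True, False),
    ]
    print(score(demo))
```

The pair-level number is included because a model that flags both versions of a pair still earns partial credit under per-program accuracy, even though it has not distinguished the patch from the vulnerability.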