Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?
We ran 300 vulnerability-finding scans to measure how repeatable an agentic LLM security review is on the same code, prompt, and harness. The headline result is not that one scanner "wins" a self-referential leaderboard. It is that LLM security findings are unevenly repeatable: findings that matched the reference set were stable, but the extra findings that models reported beyond the reference varied widely from run to run.
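To make "repeatable" concrete, here is a minimal sketch of one way to score it: for each distinct finding, compute the fraction of repeated runs in which it appears. This is an illustration, not necessarily the metric the benchmark uses. The function name `recurrence_rates`, the finding identifiers, and the four sample runs are all hypothetical.

```python
from collections import Counter


def recurrence_rates(runs):
    """Return, for each finding, the fraction of runs that reported it.

    `runs` is a list of collections of finding identifiers, one per scan of
    the same code with the same prompt and harness (hypothetical format).
    """
    counts = Counter()
    for findings in runs:
        # Count each finding at most once per run.
        counts.update(set(findings))
    total = len(runs)
    return {finding: n / total for finding, n in counts.items()}


# Hypothetical data: four repeated scans of the same target.
runs = [
    {"sqli:routes/login.js", "xss:views/profile.js"},
    {"sqli:routes/login.js"},
    {"sqli:routes/login.js", "ssrf:lib/fetch.js"},
    {"sqli:routes/login.js", "xss:views/profile.js"},
]

rates = recurrence_rates(runs)
for finding, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{finding}: {rate:.2f}")
```

A finding with a rate near 1.0 shows up on almost every run, while a rate near 0 marks a finding that a single scan would usually miss. Comparing the rates of reference-matched findings against those of the remaining findings is one way to see the unevenness described above.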