My experiment harness executes this N times (N=100 for this particular experiment) and saves the results. […]
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives.
…
Combining the code for all of the handlers with the connection setup and teardown code, as well as the command handler dispatch routines, ends up at about 12k LoC (~100k input tokens), and as before I ran the experiment 100 times.
o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about.
A practical demonstration of “even a stopped clock is right twice a day.”
From the researcher’s blog post: (https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/)
A practical demonstration of “even a stopped clock is right twice a day.”
Code analysis is a bit more complex than a clock.