

It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially, he led it down the garden path.
The test with the actual full suite of commands passed in the context found it only 1/100 times, and we didn't get any info on how many false positives he had to wade through to find it.
This is also assuming you can automatically and reliably filter out false positives.
He even says in the blog post that the ratio is too high:
"That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say 'in theory' because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution."
There are a lot of assumptions about LLMs reliably continuing to get better over time laced into that…
But so far they have gotten steadily better, so I suppose there’s enough fuel for optimists to extrapolate that out into a positive outlook.
I’m very pessimistic about these technologies and I feel like we’re at the top of the sigmoid curve for “improvements,” so I don’t see LLM tools getting substantially better than this at analyzing code.
If that’s the case, I don’t feel like having hundreds and hundreds of false security reports creates the mental environment that allows researchers to actually spot the genuine report among all the slop.
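Just to put rough numbers on that triage burden (purely hypothetical, since the post doesn't say how many false positives there were), here's a quick back-of-the-envelope sketch using the ~1/100 hit rate from the full-context run and an assumed five spurious reports per run:

    # Back-of-the-envelope triage estimate. The hit rate comes from the
    # blog post's full-context run; the false-positive rate is an assumption,
    # since the post doesn't report one.
    runs = 100                    # o3 runs over the full command handler set
    true_positive_rate = 1 / 100  # vuln surfaced in roughly 1 of 100 runs
    false_positives_per_run = 5   # hypothetical; not given in the source

    true_reports = runs * true_positive_rate        # ~1 genuine finding
    false_reports = runs * false_positives_per_run  # ~500 bogus findings
    per_real_bug = (true_reports + false_reports) / max(true_reports, 1)

    print(f"Reports to review per real bug found: ~{per_real_bug:.0f}")

With those assumed numbers a reviewer reads on the order of 500 reports to catch the one real bug, which is the kind of ratio I mean.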