

It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially, he led it down the garden path.
The test with the actual full suite of commands passed in the context found it only 1/100 times, and we didn't get any info on how many false positives he had to wade through to find it.
This is also assuming you can automatically and reliably filter out false positives.
He even says in the blog post that the ratio is too high:
"That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say 'in theory' because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution."
There are a lot of assumptions about LLMs reliably continuing to get better over time laced into that…
But so far they have gotten steadily better, so I suppose there’s enough fuel for optimists to extrapolate that out into a positive outlook.
I’m very pessimistic about these technologies and I feel like we’re at the top of the sigmoid curve for “improvements,” so I don’t see LLM tools getting substantially better than this at analyzing code.
If that’s the case, I don’t feel like having hundreds and hundreds of false security reports creates the mental environment that allows researchers to actually spot the genuine report among all the slop.
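Just to put rough numbers on that triage burden (purely hypothetical, since the post doesn't say how many false positives there were), here's a quick back-of-the-envelope sketch using the ~1/100 hit rate from the full-context run and an assumed five spurious reports per run:

    # Back-of-the-envelope triage estimate. The hit rate comes from the
    # blog post's full-context run; the false-positive rate is an assumption,
    # since the post doesn't report one.
    runs = 100                    # o3 runs over the full command handler set
    true_positive_rate = 1 / 100  # vuln surfaced in roughly 1 of 100 runs
    false_positives_per_run = 5   # hypothetical; not given in the source

    true_reports = runs * true_positive_rate        # ~1 genuine finding
    false_reports = runs * false_positives_per_run  # ~500 bogus findings
    per_real_bug = (true_reports + false_reports) / max(true_reports, 1)

    print(f"Reports to review per real bug found: ~{per_real_bug:.0f}")

With those assumed numbers a reviewer reads on the order of 500 reports to catch the one real bug, which is the kind of ratio I mean.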