Model Evaluation and Threat Research is an AI research charity that looks into the threat of AI agents! That sounds a bit like an AI doomsday cult, and they take funding from the AI doomsday cult organisat…
Reading the paper, AI did a lot better than I would have expected. It showed experienced devs working on a familiar code base got 19% slower. It’s telling that they thought they had been more productive, but the result was not that bad tbh.
I wish we had similar research for experienced devs on unfamiliar code bases, or for inexperienced devs, but those would probably be much harder to measure.
Using a tool that lowers your productivity by roughly a fifth, instead of not using it at all, is “not that bad” to you? A tool that costs an awful lot to run and requires heavy security compromises, too? (Rough arithmetic below.)
On the other points… an experienced dev on an unfamiliar code base is a common situation. They just get familiar with the code base over a few weeks, and they can base that familiarity on actual code, not on hallucinated summaries, which would be meaningless.
Inexperienced devs are not going to improve with something else doing the job for them, and an LLM without guidance is not able to do any substantial work without deviating into garbage first and broken garbage second. We already have data on that too.
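
An aside on the numbers, since “1/5” and “19%” are getting conflated: the study’s figure is that tasks took about 19% longer with AI, which is not quite the same thing as a 19% (or 20%) drop in output. A quick back-of-envelope sketch, assuming the 19% means extra wall-clock time per task:

# Back-of-envelope: convert "tasks take 19% longer" into a throughput
# loss. Assumes the 19% figure is extra wall-clock time per task,
# which is how the study's headline result is usually quoted.
slowdown = 0.19                    # tasks take 1.19x as long
throughput = 1 / (1 + slowdown)    # output per unit time vs. baseline
print(f"throughput: {throughput:.1%} of baseline")  # ~84.0%
print(f"productivity loss: {1 - throughput:.1%}")   # ~16.0%, closer to 1/6 than 1/5

Either way the direction is the same: you would have shipped more by not using the tool.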
Even a 1% slowdown would be pretty bad, since you’d still do better just not using it. 19% is huge!
I don’t understand your point. How is it good that the developers thought they were faster? Does that imply anything at all in LLMs’ favour? IMO that makes the situation worse because we’re not only fighting inefficiency, but delusion.
20% slower is substantial. Imagine the effect on the economy if 20% of all output were discarded (or, more accurately, spent burning electricity).
Yes, it suggests lower cognitive load.
I’m not saying it’s good; I’m saying I expected it to be even worse.