Does AI really make coding faster?

For the past couple of years, the companies behind AI’s frontier models have been making a bold promise: that using coding assistants means faster code, fewer bugs and less grunt work for developers. Tools like GitHub Copilot and Cursor, powered by large language models (LLMs) such as Claude or GPT, are designed to automate the tedious parts of programming so human programmers can focus on the harder, more creative problems in their codebase.

At least, that’s been the pitch so far. But METR (short for Model Evaluation and Threat Research and pronounced “meter”), a Berkeley nonprofit that evaluates frontier models’ capabilities, wanted to see if there was real evidence to support that claim. What they found flips the narrative: coding assistants might actually slow developers down.

METR researchers observed the work of 16 experienced developers, each of whom had been contributing to large, open-source repositories for several years. Each developer provided a list of real tasks they’d normally tackle, from bug fixes to new features. The researchers then randomly split the tasks into two groups: one where developers could use AI tools, and one where they couldn’t.

AI in the mix
When AI was allowed, the devs could choose whatever tools they wanted; most chose Cursor Pro paired with Claude 3.5 or 3.7 Sonnet. They recorded their screens as they completed each task, then reported what they thought their total implementation time was. The study’s results were surprising. “When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts,” the paper’s authors wrote.

We asked IBM’s AI Advocacy Lead PJ Hagerty and Distinguished Engineer Chris Hay to take a look at METR’s study and share their impressions.

Hagerty cautioned that the hype around AI assistants may be outpacing their real-world utility. “The promise of AI making people more productive stems from the tech leadership and gen AI companies looking to capitalize on the hype of AI,” he told IBM Think. “In reality, the AI is learning as it goes and likely using the same resources a junior developer might use—Stack Overflow, GitHub and general googling—but with none of the context.”

“I think it's a relevant result,” Hay added. “But I don't think we should be going, ‘Wow, AI is useless. I’m faster doing it myself.’ I think there is a point, though, that for certain tasks, maybe it is quicker just to do it yourself rather than convincing the AI.”

Perception isn’t always reality
The other half of the study’s results are just as interesting: the devs expected AI to speed up their work by 24% before they started. Yet even after they experienced the 19% slowdown, they still believed AI had sped them up by 20%.

So what’s behind this perception gap? We checked in with METR’s Nate Rush, one of the study’s authors. “This is a great question, and one that our work does not fully speak to,” Rush told IBM Think. “Ideally, future work will further explore how developers’ expectations on AI usefulness affect how they use the tools [and] why this perception gap exists.”

Beyond the perception issue, the study raises a number of important questions: Is time savings the only way we should be measuring developer productivity, anyway? How do metrics like code quality and team impact fit into the overall picture?

“Our study only speaks to time savings, which is only a measure of one aspect of productivity,” Rush said. “There is no ‘one right metric,’ but likely a collection of metrics that are informative about the impact of AI tools.” He added that while this study focused on time, his team has found the SPACE framework of developer productivity (SPACE is short for Satisfaction, Performance, Activity, Communication and Efficiency) useful for thinking about future directions.

Another question: could the model versions—in this case, Claude 3.5 and 3.7 Sonnet—have affected completion times? “Here’s the reality,” Hay said. “I think the versions do matter. Claude 4 Sonnet is significantly better. Claude 4 Opus is significantly better. We’re not talking a small amount of better. We’re talking a lot amount of better.”

According to Quentin Anthony, one of the study’s 16 participants, the human element is another important consideration. “We like to say that LLMs are tools, but treat them more like a magic bullet,” he wrote on X. “LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It’s a lot more enjoyable than the grueling alternative, at least to me.” (Anthony added that social media distractions are another easy way to cause delays.)

So, as AI coding assistants evolve and improve, where will they have the most sustainable long-term impact on software development? “Once they become stable, trustable and useful, I think code assistants will best sit at the QA layer—testing, quality assurance, accessibility,” Hagerty said. “Things that are constrained and rules-based are the best application of these tools.”

That’s because, he said, writing code is fundamentally different from checking it. “Coding itself is a creative activity. It’s building something from nothing in a unique ecosystem. AI assistants miss that nuance. But they can likely test using a system of rules that are more general and universal.”

Source: ibm.com