Risks and Caveats of Using LLMs to Write Code: SonarSource Analysis
Mike Vizard dissects SonarSource’s recent report on code written by large language models, breaking down how AI-generated code can introduce security risks and long-term challenges for DevOps teams.
Author: Mike Vizard
Overview
Recent research by SonarSource investigates the real-world impact of using large language models (LLMs) such as GPT-4o, Claude Sonnet 4, and Llama-3.2 to generate code. While these models are capable of producing highly functional and syntactically correct software, the study uncovers several risks and caveats that developers and DevOps teams must consider.
Key Findings from the SonarSource Report
- The analysis covered over 4,400 Java programming assignments, using SonarSource’s proprietary framework.
- Evaluated LLMs: Anthropic’s Claude Sonnet 4 and 3.7, OpenAI’s GPT-4o, Meta’s Llama-3.2-vision:90b, and OpenCoder-8B.
Strengths
- High success rates in generating executable, syntactically correct code: Claude Sonnet 4 achieved a 95.57% HumanEval pass rate.
- LLMs show a strong grasp of common algorithms and data structures, and can help with code translation and framework boilerplate automation.
Security and Maintainability Concerns
- High-Severity Vulnerabilities: All tested models frequently embedded issues such as hard-coded credentials and path traversal bugs (illustrated in the sketch after this list). Llama-3.2-vision:90b produced ‘blocker’-level vulnerabilities in over 70% of cases; GPT-4o and Claude Sonnet 4 also had high rates.
- Technical Debt: More than 90% of the issues identified were ‘code smells’ such as dead code, redundant code, and poor structure, which raise long-term maintainability concerns.
- Risk Trade-Offs: Improvements in functional correctness often led to more serious bugs. For example, Claude Sonnet 4’s benchmark improvement resulted in a 93% rise in high-severity vulnerabilities.
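
To make the vulnerability categories concrete, here is a minimal, hypothetical Java sketch (the class name, connection string, and paths are invented for illustration and are not taken from the report) showing what a hard-coded credential and a path traversal flaw typically look like in generated code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative only: a hypothetical service showing the two issue classes
// the report flags most often in generated Java code.
public class ReportExportService {

    // Hard-coded credential: the secret ships with the source and every build artifact.
    private static final String DB_PASSWORD = "s3cr3t-admin-pw";

    public Connection openConnection() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "admin", DB_PASSWORD);
    }

    // Path traversal: the file name comes straight from the caller, so a value
    // such as "../../etc/passwd" escapes the intended export directory.
    public byte[] readExport(String fileName) throws IOException {
        Path exportDir = Path.of("/var/app/exports");
        return Files.readAllBytes(exportDir.resolve(fileName));
    }
}
```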
Implications for DevOps
- Teams should not trust LLM outputs blindly. Every model has embedded biases and ‘coding personalities’ that affect output quality.
- Large volumes of verbose, messy code make debugging and comprehension more difficult, sometimes requiring additional AI tools to review and understand generated code.
- There is uncertainty about DevOps adoption and trust in AI coding tools; productivity may improve, but long-term risks could offset gains.
Recommendations
- Always review and audit LLM-generated code with both automated and manual processes (the hardened sketch after this list shows the kind of fix such a review should drive).
- Treat AI coding tools as opinionated resources, not infallible solutions.
- Weigh productivity gains against the increased risk and tech debt from AI-generated code.
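
As a hedged illustration of the first recommendation, the earlier hypothetical snippet can be hardened during review: the credential moves out of the source into the runtime environment, and the user-supplied file name is validated before use. The environment variable name and export directory are assumptions made for this example, not details from the report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical hardened version of the earlier sketch, the kind of change
// a manual review or automated analysis pass would typically drive.
public class ReportExportService {

    public Connection openConnection() throws SQLException {
        // Credential is injected at runtime instead of living in the source.
        String password = System.getenv("REPORTS_DB_PASSWORD");
        if (password == null || password.isBlank()) {
            throw new IllegalStateException("REPORTS_DB_PASSWORD is not set");
        }
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "admin", password);
    }

    public byte[] readExport(String fileName) throws IOException {
        Path exportDir = Path.of("/var/app/exports").toAbsolutePath().normalize();
        // Resolve and normalize, then confirm the result is still inside the export directory.
        Path resolved = exportDir.resolve(fileName).normalize();
        if (!resolved.startsWith(exportDir)) {
            throw new IllegalArgumentException("Invalid export file name: " + fileName);
        }
        return Files.readAllBytes(resolved);
    }
}
```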
This post appeared first on the DevOps Blog.