Risks and Caveats of Using LLMs to Write Code: SonarSource Analysis
Mike Vizard dissects SonarSource’s recent report on code written by large language models, breaking down how AI-generated code can introduce security risks and long-term challenges for DevOps teams.
Author: Mike Vizard
Overview
Recent research by SonarSource investigates the real-world impact of using large language models (LLMs) such as GPT-4o, Claude Sonnet 4, and Llama-3.2 to generate code. While these models are capable of producing highly functional and syntactically correct software, the study uncovers several risks and caveats that developers and DevOps teams must consider.
Key Findings from the SonarSource Report
- The analysis covered over 4,400 Java programming assignments, using SonarSource’s proprietary framework.
- Evaluated LLMs: Anthropic’s Claude Sonnet 4 and 3.7, OpenAI’s GPT-4o, Meta’s Llama-3.2-vision:90b, and OpenCoder-8B.
Strengths
- High success rates in generating executable, syntactically correct code: Claude Sonnet 4 achieved a 95.57% HumanEval pass rate.
- LLMs show a strong grasp of common algorithms and data structures, and can help with code translation and framework boilerplate automation.
Security and Maintainability Concerns
- High-Severity Vulnerabilities: All tested models frequently embedded issues such as hard-coded credentials and path traversal bugs (illustrated in the sketch after this list). Llama-3.2-vision:90b produced ‘blocker’-level vulnerabilities in over 70% of cases; GPT-4o and Claude Sonnet 4 also had high rates.
- Technical Debt: More than 90% of the issues identified were ‘code smells’ such as dead code, redundant code, and poor structure, which raise long-term maintainability concerns.
- Risk Trade-Offs: Improvements in functional correctness often led to more serious bugs. For example, Claude Sonnet 4’s benchmark improvement resulted in a 93% rise in high-severity vulnerabilities.
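
To make the vulnerability categories concrete, here is a minimal, hypothetical Java sketch (the class name, connection string, and paths are invented for illustration and are not taken from the report) showing what a hard-coded credential and a path traversal flaw typically look like in generated code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative only: a hypothetical service showing the two issue classes
// the report flags most often in generated Java code.
public class ReportExportService {

    // Hard-coded credential: the secret ships with the source and every build artifact.
    private static final String DB_PASSWORD = "s3cr3t-admin-pw";

    public Connection openConnection() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "admin", DB_PASSWORD);
    }

    // Path traversal: the file name comes straight from the caller, so a value
    // such as "../../etc/passwd" escapes the intended export directory.
    public byte[] readExport(String fileName) throws IOException {
        Path exportDir = Path.of("/var/app/exports");
        return Files.readAllBytes(exportDir.resolve(fileName));
    }
}
```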
Implications for DevOps
- Teams should not trust LLM outputs blindly. Every model has embedded biases and ‘coding personalities’ that affect output quality.
- Large volumes of verbose, messy code make debugging and comprehension more difficult, sometimes requiring additional AI tools to review and understand generated code.
- There is uncertainty about DevOps adoption and trust in AI coding tools; productivity may improve, but long-term risks could offset gains.
Recommendations
- Always review and audit LLM-generated code with both automated and manual processes (the hardened sketch after this list shows the kind of fix such a review should drive).
- Treat AI coding tools as opinionated resources, not infallible solutions.
- Weigh productivity gains against the increased risk and tech debt from AI-generated code.
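
As a hedged illustration of the first recommendation, the earlier hypothetical snippet can be hardened during review: the credential moves out of the source into the runtime environment, and the user-supplied file name is validated before use. The environment variable name and export directory are assumptions made for this example, not details from the report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical hardened version of the earlier sketch, the kind of change
// a manual review or automated analysis pass would typically drive.
public class ReportExportService {

    public Connection openConnection() throws SQLException {
        // Credential is injected at runtime instead of living in the source.
        String password = System.getenv("REPORTS_DB_PASSWORD");
        if (password == null || password.isBlank()) {
            throw new IllegalStateException("REPORTS_DB_PASSWORD is not set");
        }
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "admin", password);
    }

    public byte[] readExport(String fileName) throws IOException {
        Path exportDir = Path.of("/var/app/exports").toAbsolutePath().normalize();
        // Resolve and normalize, then confirm the result is still inside the export directory.
        Path resolved = exportDir.resolve(fileName).normalize();
        if (!resolved.startsWith(exportDir)) {
            throw new IllegalArgumentException("Invalid export file name: " + fileName);
        }
        return Files.readAllBytes(resolved);
    }
}
```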
This post appeared first on the DevOps Blog.