James Maguire summarizes a New Relic survey on the value of full-stack observability and AI-driven monitoring in cutting IT outage costs, with insights into productivity, incident detection, and enterprise tooling strategies.

Full-Stack Observability and AI: Mitigating IT Outage Costs

Author: James Maguire

Overview

A recent global survey by New Relic exposes the high cost of IT outages, with the median price tag at $2 million per hour and a median annual impact of $76 million. These costs arrive as organizations increasingly depend on distributed, software-intensive stacks powered by modern AI, including large language models (LLMs) and agentic services.

Key Survey Findings

The survey sampled 1,700 IT and engineering leaders and practitioners across 23 countries and 11 industries.
High-impact outages are costly and frequent, especially for organizations lacking comprehensive observability.

Full-Stack Observability Defined

Full-stack observability is described as visibility across five technology layers:

Infrastructure
Applications & Services
Security Monitoring
Digital Experience Monitoring (DEM)
Log Management

Benefits identified by the survey:

Organizations with full-stack observability cut outage costs in half (from $2M to $1M/hour).
Only 23% of organizations with full-stack observability report weekly high-impact outages (versus 40% of those without).
Mean Time to Detection (MTTD) improves to 28 minutes, 7 minutes faster than less mature peers.
Common causes of outages include network failures, third-party/cloud service issues, and software changes.
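
To put these figures in perspective, the sketch below works through the arithmetic. It rests on two simplifying assumptions that are not from the survey: cost accrues linearly with outage time, and seven minutes of earlier detection translates directly into seven fewer minutes of outage. The function name `outage_cost` and the one-hour scenario are hypothetical.

```python
def outage_cost(hourly_cost_usd: float, minutes: float) -> float:
    """Cost of an outage lasting `minutes`, assuming cost accrues linearly."""
    return hourly_cost_usd * minutes / 60.0


# Median hourly outage costs reported in the survey (USD per hour).
WITHOUT_FSO = 2_000_000  # without full-stack observability
WITH_FSO = 1_000_000     # with full-stack observability

# Hypothetical one-hour outage: difference in cost between the two groups.
savings_per_hour = outage_cost(WITHOUT_FSO, 60) - outage_cost(WITH_FSO, 60)

# Cost of seven additional minutes of outage at the $2M/hour rate.
seven_minutes = outage_cost(WITHOUT_FSO, 7)

print(f"Savings on a one-hour outage: ${savings_per_hour:,.0f}")
print(f"Cost of 7 minutes at $2M/hour: ${seven_minutes:,.0f}")
```

Under these assumptions, a seven-minute detection advantage is worth roughly $233,000 per incident at the median hourly rate. Real outages rarely cost the same in every minute, so treat this as an order-of-magnitude estimate only.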

Productivity Impact & Tooling Strategies

Engineers report spending about 33% of their time on break-fix work, which cuts into productivity and feature development.
The average number of observability tools per organization has dropped 27% since 2023 as organizations consolidate onto unified platforms.

The Growing Role of AI in Observability

The proliferation of LLM-powered and agentic apps increases the risk of silent failures.
Organizations are increasingly monitoring AI with AI: adoption of AI monitoring tools rose from 42% in 2024 to 54% in 2025.
Only 4% report they are not using or planning to use AI for monitoring.
Top-valued AI capabilities for incident response:
- AI-assisted troubleshooting
- Automatic root-cause analysis
- AI-assisted remediation
- Predictive analytics
- AI-generated post-incident reviews
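
As an illustration of what a silent failure can look like, the sketch below wraps a model call and flags results that would not raise an exception. It takes one possible reading of the term: a call that returns successfully but with empty output or excessive latency. That reading, the thresholds, and the names `check_llm_call` and `fake_model` are all assumptions made for this example, not details from the survey.

```python
import time
from typing import Callable


def check_llm_call(call: Callable[[str], str], prompt: str,
                   max_latency_s: float = 5.0, min_chars: int = 1) -> dict:
    """Run call(prompt) and flag problems that do not raise an exception."""
    start = time.perf_counter()
    output = call(prompt)
    latency = time.perf_counter() - start

    issues = []
    if len(output.strip()) < min_chars:
        issues.append("empty_output")    # call succeeded but returned nothing useful
    if latency > max_latency_s:
        issues.append("slow_response")   # call succeeded but took too long
    return {"latency_s": latency, "issues": issues, "ok": not issues}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model client; returns no content."""
    return ""


result = check_llm_call(fake_model, "Summarize today's incidents")
print(result["ok"], result["issues"])
```

In practice the `issues` list would be emitted as a metric or log event so that it shows up alongside the rest of the system's telemetry.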

Business Case & ROI

68% of organizations with observability report faster incident response.
75% cite positive ROI; 18% claim 3–10x returns.
Benefits for executives: reduced downtime, improved efficiency, and lower security risk.
Practitioners report less alert fatigue, faster troubleshooting, and improved team collaboration.

Implementation Checklist

To achieve full-stack observability:

Instrument all five technology layers
Correlate logs, traces, metrics, and user experience data
Monitor both systems and AI models/agents explicitly
Use AI tools to speed triage and remediation
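
The correlation step in the checklist above can be sketched as follows. This is a minimal illustration, assuming every signal carries a shared `trace_id`; the `Signal` type, its fields, and the sample records are hypothetical and do not reflect any particular vendor's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str       # "log", "trace", "metric", or "ux"
    trace_id: str   # shared correlation key (assumed to exist on every signal)
    detail: str


def correlate(signals):
    """Group signals by trace_id so one request can be viewed across layers."""
    grouped = defaultdict(list)
    for s in signals:
        grouped[s.trace_id].append(s)
    return dict(grouped)


def missing_kinds(group, expected=("log", "trace", "metric", "ux")):
    """Return the signal kinds absent from a group, hinting at an instrumentation gap."""
    present = {s.kind for s in group}
    return [k for k in expected if k not in present]


signals = [
    Signal("log", "req-1", "payment service timeout"),
    Signal("trace", "req-1", "span checkout -> payment took 5.2s"),
    Signal("metric", "req-1", "p99 latency spike"),
    Signal("ux", "req-1", "checkout page error shown"),
    Signal("log", "req-2", "cache miss"),
]

by_request = correlate(signals)
for trace_id, group in by_request.items():
    print(trace_id, [s.kind for s in group], "missing:", missing_kinds(group))
```

Here `req-1` has all four signal kinds, while `req-2` has only a log entry, which is the kind of gap that instrumenting every layer is meant to close.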

Conclusion: Full-stack observability, reinforced by AI-driven monitoring and analytics, offers concrete operational and financial benefits. It enables faster detection, reduces the frequency of high-impact incidents, and ultimately supports better customer experiences and business outcomes.

This post appeared first on “DevOps Blog”.