James Maguire summarizes a New Relic survey on the value of full-stack observability and AI-driven monitoring in cutting IT outage costs, with insights into productivity, incident detection, and enterprise tooling strategies.

Full-Stack Observability and AI: Mitigating IT Outage Costs

Author: James Maguire

Overview

A recent global survey by New Relic exposes the high cost of IT outages, with the median price tag at $2 million per hour and a median annual impact of $76 million. These costs arrive as organizations increasingly depend on distributed, software-intensive stacks powered by modern AI, including large language models (LLMs) and agentic services.

Key Survey Findings

  • The survey sampled 1,700 IT and engineering leaders and practitioners across 23 countries and 11 industries.
  • High-impact outages are costly and frequent, especially for organizations lacking comprehensive observability.

Full-Stack Observability Defined

Full-stack observability is described as visibility across five technology layers:

  • Infrastructure
  • Applications & Services
  • Security Monitoring
  • Digital Experience Monitoring (DEM)
  • Log Management

Benefits identified by the survey:

  • Organizations with full-stack observability cut outage costs in half (from $2M to $1M/hour).
  • Only 23% of full-stack organizations report weekly high-impact outages (versus 40% without).
  • Mean time to detection (MTTD) is 28 minutes for full-stack organizations, 7 minutes faster than less mature peers.

Commonly reported causes of outages include network failures, third-party or cloud service issues, and software changes.
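
To make the cost and detection figures above concrete, here is a minimal Python sketch of the arithmetic. It assumes outage cost accrues linearly with duration, which the survey does not state. The function names, the 90-minute outage, and the incident timestamps are all hypothetical; only the two hourly rates come from the survey.

```python
from datetime import datetime


def outage_cost(hourly_cost_usd: float, duration_minutes: float) -> float:
    """Cost of an outage, assuming cost accrues linearly over time."""
    return hourly_cost_usd * duration_minutes / 60.0


def mean_time_to_detection(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean minutes between the start of an incident and its detection."""
    gaps = [
        (detected - started).total_seconds() / 60.0
        for started, detected in incidents
    ]
    return sum(gaps) / len(gaps)


# Median hourly outage costs reported in the survey.
HOURLY_WITHOUT_FULL_STACK = 2_000_000
HOURLY_WITH_FULL_STACK = 1_000_000

# Hypothetical 90-minute outage priced at each rate.
cost_without = outage_cost(HOURLY_WITHOUT_FULL_STACK, 90)
cost_with = outage_cost(HOURLY_WITH_FULL_STACK, 90)

# Hypothetical incident log: (started, detected) pairs.
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 20)),
    (datetime(2025, 2, 3, 14, 0), datetime(2025, 2, 3, 14, 36)),
]
mttd_minutes = mean_time_to_detection(incidents)
```

With these made-up inputs, the same outage costs $3.0 million at the higher rate and $1.5 million at the lower one, and the two sample incidents average out to a 28-minute MTTD.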

Productivity Impact & Tooling Strategies

  • Engineers report spending about 33% of their time on break-fix work, which cuts into productivity and feature development.
  • The average number of observability tools per organization has dropped 27% since 2023 as organizations consolidate onto unified platforms.

The Growing Role of AI in Observability

  • The proliferation of LLM-powered and agentic apps increases the risk of silent failures.
  • Organizations are increasingly monitoring AI with AI: adoption of AI monitoring tools rose from 42% in 2024 to 54% in 2025.
  • Only 4% report that they neither use nor plan to use AI for monitoring.
  • Top-valued AI capabilities for incident response:
    • AI-assisted troubleshooting
    • Automatic root-cause analysis
    • AI-assisted remediation
    • Predictive analytics
    • AI-generated post-incident reviews
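
The survey summary does not define "silent failure," so the following is only a sketch of one plausible reading: a model call that returns successfully but whose result is empty, truncated, or unusually slow. The `LLMCall` record, its field names, the `"length"` finish reason, and the 5-second latency threshold are all assumptions made for illustration, not part of any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """Hypothetical record of one model call; field names are illustrative."""
    status_code: int
    latency_ms: float
    output_text: str
    finish_reason: str  # assumed values: "stop" or "length"


def silent_failure_signals(call: LLMCall, max_latency_ms: float = 5000.0) -> list[str]:
    """Return reasons a call that looks successful may still have failed."""
    if call.status_code != 200:
        # A loud failure; ordinary error monitoring already covers it.
        return []
    signals = []
    if not call.output_text.strip():
        signals.append("empty_output")
    if call.finish_reason == "length":
        signals.append("truncated_output")
    if call.latency_ms > max_latency_ms:
        signals.append("slow_response")
    return signals


# Example: an HTTP 200 response that is blank, cut off, and slow.
suspect = LLMCall(status_code=200, latency_ms=6000, output_text="  ", finish_reason="length")
flags = silent_failure_signals(suspect)
```

A check like this would typically run on every call and emit its signals as metrics or log fields, so that a rise in "successful" but unusable responses shows up alongside conventional error rates.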

Business Case & ROI

  • 68% of organizations with observability report faster incident response.
  • 75% cite positive ROI; 18% claim 3–10x returns.
  • Benefits for executives: reduced downtime, improved efficiency, and lower security risk.
  • Practitioners report less alert fatigue, faster troubleshooting, and improved team collaboration.

Implementation Checklist

To achieve full-stack observability:

  • Instrument all five technology layers
  • Correlate logs, traces, metrics, and user experience data
  • Monitor both systems and AI models/agents explicitly
  • Use AI tools to speed triage and remediation
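
As a rough illustration of the correlation step in the checklist, the sketch below groups logs, trace spans, and user experience events under a shared trace identifier. It assumes every source tags its records with a `trace_id` field, which is a hypothetical convention here; the record shapes and the `correlate_by_trace` helper are invented for the example. Metrics are left out because they are usually aggregated and would likely need a different join key, such as service name and time window.

```python
from collections import defaultdict


def correlate_by_trace(logs, spans, ux_events):
    """Group records from each telemetry source under their shared trace_id."""
    correlated = defaultdict(lambda: {"logs": [], "spans": [], "ux_events": []})
    sources = (("logs", logs), ("spans", spans), ("ux_events", ux_events))
    for source, records in sources:
        for record in records:
            trace_id = record.get("trace_id")
            if trace_id is None:
                continue  # record cannot be correlated; skipped in this sketch
            correlated[trace_id][source].append(record)
    return dict(correlated)


# Hypothetical telemetry from three layers of one checkout request.
logs = [
    {"trace_id": "t1", "message": "payment timeout"},
    {"message": "log line without a trace id"},
]
spans = [{"trace_id": "t1", "name": "POST /checkout", "duration_ms": 5200}]
ux_events = [
    {"trace_id": "t1", "event": "checkout_error_shown"},
    {"trace_id": "t2", "event": "page_load"},
]

incident_view = correlate_by_trace(logs, spans, ux_events)
```

Here `incident_view["t1"]` holds the log line, the slow span, and the user-facing error for the same request, which is the kind of single view that makes triage faster than searching each tool separately.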

Conclusion

Full-stack observability, reinforced by AI-driven monitoring and analytics, offers concrete operational and financial benefits. It enables faster detection, reduces the frequency of high-impact incidents, and ultimately supports better customer experiences and business outcomes.

This post appeared first on “DevOps Blog”.