Chaos Theory and Observability
https://gigaom.com/2024/03/26/chaos-theory-and-observability/

Can observability deal with the IT chaos facing so many enterprises today? It’s a question worth digging into.

IT Chaos (Monitoring, Observability, and Intelligence)

IT chaos is a function of monitoring, observability, and intelligence. Yes, I added intelligence, but I'm not talking about artificial intelligence (AI)—yet. Just as monitoring has generated more data than humans can consume, observability can produce more observations than anyone can understand. The overload is particularly acute when multiple observability tools come into play.

Machine learning can help, but the questions we want to answer are changing. Once, we wanted to know if services in a public cloud worked and how to merge that data with the on-premises noise. Now, the questions have changed to what to do about the observations. Automation allows restarting poorly performing items and expanding memory or computing power on demand, but you have to store the data somewhere, and storage is not free. Leading observability solutions now include real-time cost comparisons between cloud vendors. The best observability tools have financial operations (FinOps) abilities to find underused, overused, and abandoned resources in clouds (public or private).
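
To make the FinOps point concrete, here is a minimal sketch of how an observability tool might bucket cloud resources as underused or abandoned. It assumes utilization and billing data have already been collected; the resource names, fields, and thresholds are all illustrative rather than any particular vendor's API.

```python
from dataclasses import dataclass

@dataclass
class CloudResource:
    name: str
    monthly_cost: float         # USD, taken from the provider's billing export
    avg_cpu_percent: float      # 30-day average utilization
    days_since_last_access: int

# Illustrative thresholds -- real FinOps tooling tunes these per workload.
UNDERUSED_CPU = 10.0
ABANDONED_DAYS = 60

def classify(resource: CloudResource) -> str:
    """Bucket a resource the way a FinOps-capable observability tool might."""
    if resource.days_since_last_access > ABANDONED_DAYS:
        return "abandoned"
    if resource.avg_cpu_percent < UNDERUSED_CPU:
        return "underused"
    return "in use"

inventory = [
    CloudResource("etl-worker-7", 412.50, 2.1, 5),
    CloudResource("legacy-report-vm", 318.00, 0.4, 92),
    CloudResource("checkout-api", 1020.00, 63.0, 0),
]

for r in inventory:
    print(f"{r.name}: {classify(r)} (${r.monthly_cost:.2f}/month)")
```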

Observability tooling has enough data to predict future states. Unfortunately, chaos theory itself does not help here, because the element-level data it would need does not exist at the observability level. Regression analysis, least-squares fits, and more complicated algorithms make useful short-term predictions possible. The more data available, the more accurate the predictions, but storing that data is costly. Vendors are addressing the issue with consumption-based licensing, lower-cost storage tiers, and other methods to deal with the wave of data needed for observability.
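
As one example of the kind of fit mentioned above, here is a minimal sketch of a least-squares trend line over a metric series. The hourly memory-usage numbers are made up, and the extrapolation is only a near-term estimate; it is exactly the sort of forecast that chaotic behavior makes unreliable over longer horizons.

```python
import numpy as np

# Hypothetical hourly memory-usage samples (percent) for a single service.
usage = np.array([41.0, 42.5, 44.1, 43.8, 46.0, 47.2, 49.0, 50.5])
hours = np.arange(len(usage))

# Ordinary least-squares fit of a straight line: usage ~= slope * hour + intercept.
slope, intercept = np.polyfit(hours, usage, deg=1)

# Extrapolate six hours ahead -- useful for near-term capacity alerts, but
# increasingly untrustworthy the further out the projection goes.
future_hours = np.arange(len(usage), len(usage) + 6)
forecast = slope * future_hours + intercept

for hour, value in zip(future_hours, forecast):
    print(f"hour {hour}: predicted {value:.1f}% memory usage")
```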

IT chaos will never end, but at least we can try to manage it. The new hope is generative AI (GenAI)—maybe.

Chaos, Observability, and Artificial Intelligence

The chaos function spans the steps from monitoring to observability to intelligence, and each step requires new approaches to answering questions. Monitoring tells us the state of individual items, observability creates relationships and provides a meta view of those elements, and intelligent questions become possible with the help of GenAI.

Ask an observability tool when the next outage will occur, and you may get an answer. Ask it to automate a known failure mode, and it performs a perfect dance. Ask an observability tool whether the enterprise is OK, and you get nothing; the question is beyond its capabilities. Observability tools as they exist today focus on IT, including developers in DevOps pipelines, operations management team members working to keep the lights on, and the newly coined (by my more-than-40-year standard) site reliability engineers (SREs). Observability explains the data from monitoring.

Enter GenAI, the big rock in the pond creating its version of chaos. In chaos theory, a single element can tip an entire system over the edge. The math makes this abundantly clear (I’ll get to that in a moment). So, what happens next?

GenAI is already improving IT, from better chatbots to consuming all the data and providing remarkable insights. Yet GenAI is brand new and disruptive. Few observability vendors are using it to significant effect now, and fewer still can predict what its impact will be in 24 to 26 months.

Observability can slow the devolution into chaos, pointing to a calmer IT environment with GenAI somewhere in the future. Actual intelligence for the enterprise comes when GenAI consumes data from every source in the company, enabling previously unthinkable questions and a future in which the tsunami of GenAI-created change does not disrupt the company.

Chaos Theory: What Is It?

I’ve mentioned chaos theory a few times; let’s look at what it is. Chaos theory is a popular trope that lets writers invent seemingly impossible situations the protagonists must overcome, or base an entire story on the consequences of moving a single item. If any large-scale, easily recognized system can be said to embody chaos, it is information technology. Chaos is the normal state of IT, particularly in large enterprises. I’m going to lay out the math for you.

Hold on. Why am I writing about mathematics in an IT blog?

I’m a physicist, and though I’ve been doing IT for over 40 years, I rely on my education for even the most mundane things. Observability and chaos theory are related—the how and why are essential when we look at the entire enterprise. I could have used entropy, but chaos theory is sexier and closer to the reality of an IT ecosystem. Now, to the esoteric math discussion.

Chaos theory has equations that help mathematicians and physicists analyze the systems under study. In 1975, Robert May created a model to demonstrate the chaotic behavior of dynamic systems. I have modified May’s model for incidents:

I_{n+1} = r • I_n • (1 – I_n)

    • I_n
      • The proportion of the system’s capacity affected by incidents at a given time, reflecting the number of incidents, their severity, or their total impact on the system. The value ranges from zero (no impact) to one (full impact, or system-wide failure).
      • In a perfect world, this is always zero, but this is about IT, where the value is never zero. Oh, but we do try hard. NASA has some of the best methods and processes anywhere, yet the first place investigators looked after the Challenger explosion was the range safety code, the code that can destroy the shuttle, and it was deemed perfect after a multimillion-dollar, line-by-line examination.
    • r
      • This represents the rate of incident generation and resolution, influenced by factors such as system complexity, change frequency, and the effectiveness of incident management processes. High values indicate a system where incidents are rapidly generated or poorly resolved, leading to a more chaotic system. Lower values suggest a stable system where incidents are effectively managed or infrequent.
      • In another perfect world, perhaps in the multiverse, this would be equal to or less than one. In that same universe, pigs fly and nothing ever breaks. I’m sure other strange things happen in this utopia to take the shine off the whole perfection thing.
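
May’s model is easy to play with. Below is a minimal sketch that iterates the incident form of the equation for two illustrative values of r; the starting impact and step count are arbitrary, chosen only to show how a small change in r moves the system from a stable incident level to erratic swings.

```python
def incident_series(r: float, i0: float = 0.05, steps: int = 20) -> list[float]:
    """Iterate I_{n+1} = r * I_n * (1 - I_n), the incident form of May's model."""
    series = [i0]
    for _ in range(steps):
        i_prev = series[-1]
        series.append(r * i_prev * (1 - i_prev))
    return series

# r below ~3 settles toward a steady incident level; r near 4 never settles.
for r in (2.5, 3.9):
    tail = ", ".join(f"{i:.2f}" for i in incident_series(r)[-5:])
    print(f"r = {r}: last five values -> {tail}")
```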

In another version of Earth, I can simulate every IT element to identify systems and processes on the precipice of chaos and magically heal them. IT does not create dinosaurs, except in the form of mainframe computers running COBOL.

OK, that isn’t happening, but I can monitor all those elements and gather state information (on or off), metrics (memory usage, CPU performance), and more. Then I can send all that information to a team to determine the system’s chaos level and respond accordingly.

Oops, BAM! We have another data glut (monitoring often accounts for 25% of network traffic in a large enterprise).

Observability strives to infer a system’s internal state from its external outputs. We have scads of data but no idea what it means. Observability tooling, whether specifically for public and private clouds, networks, storage, or applications, is a view into the chaos.

The Intersection of May’s Equation and Observability

May’s equation and observability intersect. Here’s how:

      • Understanding system behavior: Observability and May’s equation both aim to enhance understanding of complex systems. Observability allows for real-time monitoring and knowledge of a system’s state based on its outputs, while May’s equation shows how system behavior can change dramatically with slight parameter shifts.
      • Predictability and stability: May’s equation highlights the limits of predictability in complex systems due to their sensitivity to initial conditions. Observability, in contrast, is a tool for gaining insight into the system. It increases predictability by allowing for early detection of minor issues before they escalate into significant problems. In effect, observability helps keep the value of r (defined above) low enough that the system does not explode into chaos.
      • Adapting to change: The logistic map in May’s equation shows how systems can transition from stable to chaotic regimes with a single parameter change. Observability provides the means to detect and respond to these transitions, offering a method to help manage and mitigate the risks of entering chaotic states.
      • Feedback loops: Observability can act as a feedback mechanism in complex IT systems, identifying when a system is approaching a chaotic regime. This feedback can inform adjustments to system parameters to maintain desired performance and stability levels; a rough sketch of this idea follows the list.
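
Here is that feedback idea as a minimal sketch, assuming the impact proportion I_n is already being computed from monitoring data each interval. Backing r out of consecutive observations, and warning near the logistic map’s onset of chaos (around r ≈ 3.57), are illustrative simplifications, not a description of any vendor’s feature.

```python
def estimate_r(impacts: list[float]) -> float:
    """Back r out of consecutive observations using I_{n+1} = r * I_n * (1 - I_n)."""
    estimates = []
    for i_now, i_next in zip(impacts, impacts[1:]):
        denominator = i_now * (1 - i_now)
        if denominator > 0:
            estimates.append(i_next / denominator)
    return sum(estimates) / len(estimates)

# The logistic map's period-doubling cascade ends near r = 3.57; warn before that.
CHAOS_WARNING = 3.5

# Hypothetical per-interval impact proportions derived from incident data.
observed = [0.05, 0.17, 0.50, 0.89, 0.35, 0.81]

r_hat = estimate_r(observed)
if r_hat >= CHAOS_WARNING:
    print(f"estimated r = {r_hat:.2f}: the system is approaching a chaotic regime")
else:
    print(f"estimated r = {r_hat:.2f}: incident dynamics look stable")
```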

Technology impacts us almost everywhere—doctor visits, the news, social media, refrigerators, and even our cars (including gas-powered vehicles). The change in a single parameter can bring a company to its knees. Ask AT&T about a simple configuration change that brought their entire network down. Look into how British Airways had to cancel hundreds of flights because a software component failed after a simple change.

IT systems are always on the precipice of chaos. Observability tools are one way to examine every IT enterprise’s chaotic state.

Next Steps

To learn more, take a look at GigaOm’s cloud observability Key Criteria and Radar reports. These reports provide a comprehensive overview of the market, outline the criteria you’ll want to consider in a purchase decision, and evaluate how a number of vendors perform against those decision criteria.

If you’re not yet a GigaOm subscriber, you can access the research using a free trial.

GigaOm Radar for Cloud Observability
https://gigaom.com/report/gigaom-radar-for-cloud-observability-3/

Cloud observability is the process of gaining comprehensive insights into the performance, health, and state of cloud-based applications and infrastructure through monitoring, metrics, tracing, logging, and other telemetry data. It enables organizations to proactively detect, understand, and resolve issues to ensure optimal application performance and user experience.

Observability is one step in a larger operational intelligence workflow wherein organizations move from monitoring to observability to intelligence (Figure 1).

Monitoring can determine the states of various hardware or software resources. Observability enables the consolidation of these states to obtain meaning, estimate the impact on critical services, predict future states based on past observations, and automatically remediate known problems. Intelligence synthesizes both technical information and business data.

Figure 1. From Monitoring to Intelligence

Observability tools reduce the data overload and bring insight to the monitored data. These solutions leverage application performance management (APM), service orchestration and automation, Kubernetes management, and cloud provider tooling (for public clouds) and apply machine learning (ML) capabilities and predictive analytics to filter the monitored data. The resulting information is targeted at IT operations and other technical personnel such as developers and systems managers. IT operations staff no longer have to be experts on the software and hardware that run the enterprise. With predictive analytics, IT resources can concentrate on what is failing and what is likely to have problems. Additionally, the use of OpenTelemetry gives enterprises a source for metrics, events, logs, and traces (MELT) that is vendor-agnostic.
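
As a small illustration of that vendor-agnostic MELT point, the sketch below emits a trace span and a metric through the OpenTelemetry Python API. It assumes the opentelemetry-api package is installed; the service name, span name, and counter name are invented for the example, and a real deployment would also configure an SDK exporter for whichever backend it uses.

```python
from opentelemetry import metrics, trace

# With no SDK or exporter configured, these calls fall back to no-op
# implementations, so the sketch only shows the instrumentation surface.
tracer = trace.get_tracer("checkout-service")   # illustrative service name
meter = metrics.get_meter("checkout-service")

orders_counter = meter.create_counter(
    "orders_processed",
    description="Number of orders handled by the checkout service",
)

def process_order(order_id: str) -> None:
    # Each order produces a trace span plus a metric increment, both emitted
    # in OpenTelemetry's vendor-neutral format (two of the MELT signals).
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        orders_counter.add(1)

process_order("A-1001")
```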

Intelligence is the final step in this process—it reflects the operational state of the entire company. Intelligence builds on monitoring and observability and begins to deliver on the promise of artificial intelligence for IT operations (AIOps) by including data from the entire company—marketing, sales, legal, human resources, manufacturing data, and other sources.

The history of IT is filled with products and services deemed smart or intelligent: hardware and software initiatives, application solutions, databases, and others. Often, however, those labels were little more than marketing terms. In 2023, large language models (LLMs) exploded, spearheaded by OpenAI and ChatGPT, creating the beginning of genuinely intelligent software. This new ability is finding its way into IT solutions and blurs the line between observability and intelligence.

The focus of this analysis is observability within cloud environments, including multiple public cloud offerings, private clouds, and any combination of cloud and on-premises operations. LLM-driven capabilities will be considered from a “this is new” perspective, with the understanding that LLM abilities remain inconsistent across vendors.

This is our fourth year evaluating the cloud observability space in the context of our Key Criteria and Radar reports. This report builds on previous analysis and considers how the market has evolved over the last year.

This GigaOm Radar report examines 21 of the top cloud observability solutions and compares offerings against the capabilities (table stakes, key features, and emerging features) and nonfunctional requirements (business criteria) outlined in the companion Key Criteria report. Together, these reports provide an overview of the market, identify leading cloud observability offerings, and help decision-makers evaluate these solutions so they can make a more informed investment decision.

GIGAOM KEY CRITERIA AND RADAR REPORTS

The GigaOm Key Criteria report provides a detailed decision framework for IT and executive leadership assessing enterprise technologies. Each report defines relevant functional and nonfunctional aspects of solutions in a sector. The Key Criteria report informs the GigaOm Radar report, which provides a forward-looking assessment of vendor solutions in the sector.

GigaOm Key Criteria for Evaluating Cloud Observability Solutions
https://gigaom.com/report/gigaom-key-criteria-for-evaluating-cloud-observability-solutions/

Observability aims to monitor and proactively respond to problems in IT infrastructure—or an entire business—by analyzing the massive amounts of data generated by an enterprise. Modern IT organizations instrument and monitor everything, including applications, network access, on-premises and cloud storage, and Kubernetes clusters. The volume of data is far too large for practical human consumption and analysis, so companies increasingly rely on observability solutions to capture, analyze, and derive actionable intelligence from all this data.

Looking specifically at cloud observability, monitoring involves data created by a wide variety of applications and infrastructure that may reside in—and operate across—multiple public cloud, private cloud, and on-premises networks. Enterprises cannot understand the topology of a critical application or service from monitoring data alone. Cloud computing services, microservices, and remote infrastructure are often distributed between public and private clouds. Logging is one of the traditional tools for “measuring” the health of a system, but at modern infrastructure scale and complexity, logs often provide only a glut of contextless data. The instrumentation data returned from systems may arrive in different formats or with different timestamp conventions, with little context and no chance of timely human correlation and deduplication.
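
To make the formats problem concrete, here is a minimal sketch that normalizes two invented log shapes, a JSON record from a cloud service and a plain-text line from an on-premises host, into one structure and drops exact duplicates. Real pipelines handle far more formats and fuzzier matching; the field names and layouts here are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone

def parse_json_log(line: str) -> dict:
    """Cloud-style structured record: {"ts": ..., "level": ..., "msg": ...}."""
    raw = json.loads(line)
    return {
        "timestamp": datetime.fromisoformat(raw["ts"]).astimezone(timezone.utc),
        "level": raw["level"].upper(),
        "message": raw["msg"],
    }

def parse_plaintext_log(line: str) -> dict:
    """On-premises plain-text record: '2024-03-26 17:01:02 ERROR disk full'."""
    date, time_of_day, level, message = line.split(" ", 3)
    timestamp = datetime.fromisoformat(f"{date}T{time_of_day}").replace(tzinfo=timezone.utc)
    return {"timestamp": timestamp, "level": level.upper(), "message": message}

lines = [
    '{"ts": "2024-03-26T17:01:02+00:00", "level": "error", "msg": "disk full"}',
    "2024-03-26 17:01:02 ERROR disk full",
]

seen = set()
for record in (parse_json_log(lines[0]), parse_plaintext_log(lines[1])):
    key = (record["timestamp"], record["level"], record["message"])
    if key not in seen:  # drop the same event reported by two different sources
        seen.add(key)
        print(record)
```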

A cloud observability solution gives modern IT organizations more complete, efficient, and actionable visibility across their diverse cloud-based infrastructure. From the massive amount of data available to IT operations, it enables finding answers to critical questions: what systems exist, where they are, what they’re doing, who is using them, and how they’re operating.

Business Imperative
Public and private clouds have become a critical part of businesses regardless of size. The complexity of cloud-based solutions makes them difficult to understand and control, and simple monitoring solutions leave IT teams exposed to a mounting glut of data to sift through.

Cloud observability allows the company to answer critical oversight questions about these systems that monitoring alone cannot reveal. In Figure 1, you can see that monitoring asks questions concerning the status of a device or individual software component. Observability, regardless of tooling, answers more complicated questions.

Figure 1. From Monitoring to Intelligence

Observability uses monitoring data to understand what is happening, what the data means, and how to fix incidents manually or, better still, how to identify and remediate issues automatically and proactively. Observability consumes the data from monitoring and turns it into useful, actionable information for IT operations or other technical specialists responsible for software development, infrastructure, networking, or security.

Sector Adoption Score
To help executives and decision-makers assess the potential impact and value of a cloud observability solution deployment to the business, this GigaOm Key Criteria report provides a structured assessment of the sector across five factors: benefit, maturity, urgency, impact, and effort. By scoring each factor based on how strongly it compels or deters adoption of a cloud observability solution, we provide an overall Sector Adoption Score (Figure 2) of 4.6 out of 5, with 5 indicating the strongest possible recommendation to adopt. This indicates that a cloud observability solution is a credible candidate for deployment and worthy of thoughtful consideration.

The factors contributing to the Sector Adoption Score for cloud observability are explained in more detail in the Sector Brief section that follows.

Figure 2. Sector Adoption Score for Cloud Observability