David S. Linthicum, Author at Gigaom (https://gigaom.com/author/david57/)

Key Criteria for Evaluating AIOps Solutions (https://gigaom.com/report/key-criteria-for-evaluating-aiops-solutions/, Fri, 03 Sep 2021)

Cloud computing and the accelerating pace of development, particularly in response to effects from the global pandemic, have been key drivers behind the growth and adoption of AIOps concepts and tooling. As cloud and edge computing have driven operational complexity inexorably upward, IT organizations have turned to automation to address the needs of operations teams.

In fact, without AIOps solutions coming online to bear the load, IT budgets would have been overwhelmed by increases in staff spending, as Figure 1 makes very clear.

Figure 1. Projected IT Staff Spending Based on Available Operations Solutions

Until recently, most traditional IT operations tools offered limited automation and autonomy—bolting an AI engine onto an existing ITOps tool and calling it AIOps—regardless of whether the tool leveraged AI systemically or not. However, some AIOps tool startups have released purpose-built solutions that leverage AI from the ground up. Over time, we’ve seen incumbent ITOps solutions combine with focused AIOps tools to provide more comprehensive automation and tooling. Larger cloud providers are also starting to enter this active market.

There’s good reason for the excitement. Intelligent AIOps solutions can go beyond the rote triggers and preprogrammed actions of traditional ITOps tools to enable enlightened, experience-based responses based on current events and historical patterns. The objective: a fully automated system that’s able to spot and correct issues without the knowledge or participation of humans. This can sharply reduce mean time to repair (MTTR) incidents even as it reduces IT headcount.

Every vendor has its own spin on what AIOps is, or should be, but common patterns are emerging. Indeed, as the stakes in this rapidly expanding market rise, a set of repeating capabilities, along with some genuinely unique ones, is taking shape.

AIOps Key Findings

As we carried out our briefings and looked at the array of AIOps tools in detail, we came to a few core conclusions:

First, AIOps tools today are in a state of rapid evolution, with the tool providers having different takes on what AIOps tools should do, how they are applied, and the value they are likely to bring. For example, some are oriented toward data gathering, with excellent connectors and data acquisition systems, but they may be weak at pattern finding and analysis. Others may be excellent at discovering correlations in the data, when revealing causation would be more valuable.

Translating this into real-world IT, it’s like saying the app is slow because the database is slow, when the real reason everything is slow is an oversubscribed network path that’s causing lots of retransmissions. The better tools we looked at would have found the network path issue, while lesser tools might have suggested increasing the size of the server hosting the database.

In addition, most tools today don’t provide native self-healing orchestration engines, instead opting for third-party orchestration providers leveraging their APIs. That’s good for enterprises with their own orchestration teams, as an AIOps tool without automation reduces the risk of configuration sprawl. For companies without centralized orchestration tools, however, an AIOps tool with orchestration ability reduces tool sprawl.

Second, there seem to be two types of AIOps players: those that build upon existing operations tools and those that are new to the AIOps space. Each has tradeoffs. New start-ups have the ability to create an AIOps tool purpose-built for modern system issues, such as hybrid and multi-cloud management. However, these start-ups may be missing connections and analytic capabilities critical to more traditional systems. In contrast, traditional tools often bolt an AI and analytic engine onto their mature offerings, targeting modern systems such as public clouds to stay relevant in the market. These incumbent providers also sometimes purchase companies that better support CloudOps to get into the game—and even then the integration can be awkward.

Both approaches are valid and your requirements will determine which way to go. In some cases, you may need more than a single tool to satisfy all your operational requirements. As time goes on, the functions of all these tools will likely normalize, and the features and values will converge.

Third, this year’s report features more tools with self-healing capabilities—an area that was a point of weakness for many tools in last year’s report. While a few support self-healing functions with native process management (such as rebooting a network device or restarting a database), many leverage open APIs to integrate with third-party process orchestration tools. This means that enterprises will have more flexibility in the brands of process orchestration tools they can leverage, a desirable attribute in the current market. This will likely drive independent software vendors or internal development groups that did not create management APIs for their applications to do so, in order to remain relevant to organizations using AIOps tools. In large enterprises with orchestration tools, the focus is on AIOps tools that can intelligently call the orchestration tool to automate remediation.
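To make that integration pattern concrete, here is a minimal sketch of how an AIOps tool might hand a detected incident off to an external process orchestrator through an open API. The endpoint URL, payload fields, and runbook names are hypothetical illustrations, not any specific vendor's interface:

```python
import json
import urllib.request

# Hypothetical endpoint exposed by a third-party orchestration tool.
ORCHESTRATOR_URL = "https://orchestrator.example.com/api/v1/runbooks/execute"
API_TOKEN = "replace-with-a-real-token"

def trigger_remediation(incident: dict) -> int:
    """Forward an AIOps-detected incident to an external orchestrator.

    The AIOps tool decides what should happen (the runbook name);
    the orchestration tool owns how it happens (the actual steps).
    """
    payload = {
        "runbook": incident["suggested_runbook"],   # e.g., "restart-database"
        "target": incident["affected_resource"],    # e.g., "db-cluster-prod-03"
        "evidence": incident["correlated_events"],  # events that led to the diagnosis
        "priority": incident.get("severity", "medium"),
    }
    request = urllib.request.Request(
        ORCHESTRATOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # e.g., 202 if the runbook was queued
```

Whether remediation runs through a native engine or a third-party tool, the useful architectural property is the same: the AIOps tool supplies the decision and the evidence, while the orchestration layer owns how the change is actually made.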

To address the market for built-in orchestration, AIOps vendors will need to invest in their solution's ability to make changes directly to systems and software that the tool itself did not deploy. Smaller companies will have to maintain not only the original processes that support continuous deployment but also the settings in the AIOps tools that automate remediation. Vendors that offer orchestration tools connected to continuous deployment workflows will increase their value by adding AIOps, but very large organizations may have more than one CI/CD tool. Moreover, the CI/CD tool and the AIOps tool might be purchased by different buyers. In any case, an orchestration-agnostic AIOps tool would have more value.

Evaluating an AIOps Tool

When looking at these tools, there are six capabilities you’ll want to make sure your selection includes:

  • Data gathering and selection
  • Finding patterns in the data
  • Data analysis
  • Support for operations teams’ collaboration
  • ITSM integration, such as opening and closing tickets
  • Automated response generation

Tools integration, which is now promoted as a core component of AIOps, may well mean dealing with operations data at heightened security and governance levels. Thus, when picking tools, the focus should be on how the data is gathered, such as via an agent-based or agentless approach, and, most importantly, how patterns in the data are determined. For example, you might spot a pattern of packet failures based on just 200 data points out of several million collected. The tool should help you determine which correlations have meaning within the selected data, analyze the likely root cause, and enable collaboration with automated and non-automated processes (humans) to resolve the problem.
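As a rough illustration of the pattern-finding step described above, the following sketch (a simplified example of my own, not any vendor's algorithm) scores pairwise correlations between collected metric series and surfaces only the strongest ones as candidates for root-cause analysis:

```python
import numpy as np

def significant_correlations(metrics: dict, threshold: float = 0.8):
    """Return metric pairs whose absolute Pearson correlation exceeds the threshold.

    `metrics` maps a metric name (e.g., "packet_retransmits", "db_latency_ms")
    to an equally sized series of samples taken over the same time window.
    """
    names = list(metrics)
    series = np.array([metrics[name] for name in names])
    corr = np.corrcoef(series)  # pairwise correlation matrix

    findings = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                findings.append((names[i], names[j], round(float(corr[i, j]), 3)))
    # Strongest relationships first; these are candidates for causal analysis,
    # not proof of causation.
    return sorted(findings, key=lambda f: abs(f[2]), reverse=True)

example = {
    "packet_retransmits": [10, 40, 80, 120, 200],
    "db_latency_ms":      [12, 30, 55, 90, 160],
    "cpu_utilization":    [35, 36, 34, 37, 35],
}
print(significant_correlations(example))
```

Real tools layer topology awareness and causal analysis on top of raw correlation, which is exactly what separates the products that would have found the oversubscribed network path in the earlier example from those that would simply have blamed the database.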

Tool providers that have aligned AIOps processes and collaboration with DevSecOps approaches have an advantage, as issues are reported directly back to the developer, who is in the best position to resolve most application-related problems. This process becomes part of the collaboration around issue resolution and could reduce problem diagnosis and resolution to a few hours, instead of the more typical days or weeks.

Deployment models vary from provider to provider, generally according to whether the vendor is a traditional player or a startup. With startups, tool use tends to be on-demand, and the solution may actually be hosted in a public hyperscale cloud, such as AWS, Microsoft Azure, or Google Cloud. These tools are the easiest to acquire and deploy but will have some difficulty reaching back into an enterprise data center if it is part of a hybrid or multi-cloud operational domain.

The more traditional players provide on-premises AIOps tool deployments, with some offering a hybrid approach as well, either on-demand via the cloud or locally hosted in your environment. While you would think that enterprises would gravitate to on-demand, many still prefer running AIOps tools on-premises, as they do with security and governance tools. Because most AIOps tools consume log, metadata, and telemetry data, pushing all of it back to a central cloud or on-premises location becomes a problem. Newer designs allow for decentralized pre-processing of data at optimal locations, forwarding only actionable intelligence to a central system and its DR counterpart. This means that for vendors to service customers with large on-premises data centers, event sources will need to support remote collection and processing.
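Here is a minimal sketch of that decentralized pre-processing idea: a local collector summarizes raw telemetry in place and forwards only events that look actionable. The endpoint, field names, and threshold are illustrative assumptions, not part of any product:

```python
from statistics import mean

CENTRAL_ENDPOINT = "https://aiops-central.example.com/ingest"  # illustrative only
ERROR_RATE_THRESHOLD = 0.05  # forward a summary only when 5% of requests fail

def preprocess_batch(raw_events):
    """Summarize a local batch of telemetry and decide whether to forward it.

    Instead of shipping every log line to the central system, the edge
    collector sends a compact summary only when it looks actionable;
    returns None when nothing needs to leave the site.
    """
    total = len(raw_events)
    errors = [e for e in raw_events if e.get("status", 200) >= 500]
    latencies = [e["latency_ms"] for e in raw_events if "latency_ms" in e]

    summary = {
        "window_events": total,
        "error_rate": len(errors) / total if total else 0.0,
        "avg_latency_ms": mean(latencies) if latencies else None,
        "sample_errors": errors[:5],  # a few raw examples kept for diagnosis
    }
    if summary["error_rate"] >= ERROR_RATE_THRESHOLD:
        return summary  # in practice, POST this to CENTRAL_ENDPOINT and its DR peer
    return None
```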

Integration with other tools improves year over year in the AIOps marketplace, giving operations systems more comprehensive ways to work with security, governance, and DevOps tools. For example, a security tool can be made aware that CPU or I/O saturation is occurring, which could indicate a potential breach attempt. Integration also enables consuming data from various sources, including the network, storage, and hosts (VMs, physical servers, containers, cloud IaaS/PaaS, or on-premises), for uses such as end-user monitoring (observed, synthetic, or both) and deep application metrics for mainframes, JEE, and complex packaged applications like CRM, ERP, BPM, and data warehousing.

The ability to perform automated remediation is critical to the success of an AIOps tool. It is certainly useful if a tool can find issues and even collaborate with humans. However, the ultimate destination is clearly the automated resolution of problems and a move toward a “no-ops” model in which human involvement is minimized, so supporting automated remediation in a scalable and reliable way is essential.

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

Solution Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

GigaOm Radar for AIOps Solutions (https://gigaom.com/report/gigaom-radar-for-aiops-solutions/, Fri, 03 Sep 2021)

The importance of AIOps has increased in response to the rapid adoption of cloud and edge computing and the rising complexity these environments create. Intelligent tools act as a force multiplier for ops teams, helping them adapt to escalating demand even in the absence of budget and staff increases. AIOps also helps address the operational challenges of having cloud-based applications and data that must continue to operate with existing systems, such as mainframes, x86 clusters still crowding data centers, and increasingly complicated networks.

AIOps tools are growing in several directions. Most vendors in the traditional operations tools space have incorporated an AI engine and rebranded their tool with AIOps. Additionally, there is a cohort of startups that have developed purpose-built AIOps tools.

The development of hybrid AIOps tools follows the normalization of the market, leading to vendors combining technologies. Some vendors are buying their way into the AIOps space via acquisitions. In this scenario, vendors integrate traditional operational tools with AI technology, while new upstarts address a niche or add new features to the AIOps landscape. Finally, there are also larger cloud providers dipping their toes into the market. These providers are building tools that manage their native services and cross-cloud tools that manage multiple cloud platforms.

All these tools are data-oriented. They gather data from as many sources as possible, using their connectors and integrations, or even leveraging other instrumentation to connect with systems. Combining software has confused the AIOps market, as some tools focus only on data analysis and not on how it's collected. Others focus on collection and analysis but may not support complete awareness of the state of the enterprise.

If that’s not confusing enough, we’ve also found AIOps tools take different approaches to how AIOps works. Approaches to the remediation of issues, integration with other cloud systems, security, governance, and even cost accountability make vendor selection more complex. The confusion multiplies when the term “AI” is used to describe a rules-based heuristic system with human supervision.

In contrast, others have a core AI module with true neural capabilities. This difference can determine whether a system can ingest a new data set with minimal human intervention or if it requires substantial effort to add new data to the system.

As we close in on the measure of a good AIOps tool, the “it depends” factor becomes important to understand. The complexity of the answer depends on the types of systems you’re looking to monitor and observe, the data storage in place, expectations (including supporting a customer experience), applications employed, and other operational systems such as security and governance. Thus, it’s less about selecting the best AIOps tool, and more about selecting a tool or tools that will meet your overall cloud and non-cloud operational needs in the near- and long-term.

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

Solution Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

What’s Observability, and 5 Ways to Ensure Observability Success (https://gigaom.com/video/whats-observability-and-5-ways-to-ensure-observability-success/, Thu, 15 Jul 2021)

This free 1-hour webinar from GigaOm Research brings together experts in observability, featuring GigaOm analyst David Linthicum and a special guest from Splunk, Patrick Lin, VP of Product Management, Observability.

Monitoring is the passive consumption of information from systems under management, typically by leveraging dashboards to display the state of systems. Monitoring systems are purpose built for static systems or platforms that don’t change much over time.

On the other hand, observability provides the ability to understand systems proactively: to interrogate, analyze, investigate, and ask questions based on a continuous understanding or hypotheses. Observability, as both a concept and a technology, is purpose-built for dynamic systems and platforms with widely changing complexity. It can also analyze “what if” scenarios and trends to predict a possible future failure.

So, what does observability mean to you? What are the core enabling technologies available? More importantly, what are the best practices that, if followed, can ensure success?

In this webinar, we’ll provide pragmatic advice around getting the value out of the concept of observability, as well as picking the right technology to meet your observability objective.

The Essence of Observability (https://gigaom.com/video/the-essence-of-observability/, Thu, 29 Apr 2021)

This free 1-hour webinar from GigaOm features analysts David Linthicum and Andy Thurai and special guests from VMware, Harmen Van der Linde, Office of the CTO, and Clement Pang, Principal Engineer. The discussion focuses on how Observability enables organizations to respond to evolving application management needs.

Monitoring of systems, infrastructure, and applications is familiar territory, but traditional approaches have become inadequate for multiple reasons, not least cloud and microservices-based architectures and practices such as continuous deployment. In response, Observability offers an emerging set of practices, platforms, and tools that enable events and incidents to be acted upon quickly.

In this 1-hour webinar, we’ll provide the findings from GigaOm’s Observability Key Criteria and Radar Report, together with insider commentary from the report’s authors and industry experts.

Above all, you will discover how you can gain a deeper level of visibility and insight with Observability. If you are moving to a forward-facing application architecture, or are already there and grappling with the consequences, this webinar is for you.

Register now to join GigaOm and VMware for this free expert webinar.

Delivering on the Promise of SASE (https://gigaom.com/report/delivering-on-the-promise-of-sase/, Wed, 21 Apr 2021)

“It’s where this world is headed and, frankly, if you just step back and think about security as an industry it is incredibly complex, overly so. This is why so many are being constantly challenged, because they have to focus on two separate disciplines, which I believe is fundamentally the wrong approach. Divorcing the two is foolish. By combining them you end up with efficiencies, operational agility, and give your network and security teams time back in their day to focus their energies on other business critical tasks.” – Mike Spanbauer, Juniper Networks

Just as technology decision-makers have started to wrap their heads around the idea of a Software-Defined Wide Area Network (SD-WAN), a new reality has emerged. While SD-WANs go a long way towards keeping pace with the evolving requirements of today’s digital-first enterprises, they also drive an emerging requirement to combine networking and security.

This basic concept applies equally to the networks that tie our offices to our data centers, to our multiple clouds, and to our remote users and devices. In today’s cloud-based environments, you have very little control: every part of your network – the laptops your team uses, the Wi-Fi they use to connect, the applications they connect to – is outside your jurisdiction.

How, in this situation, do you keep all such elements “safe” and secure? The old, zone-based, inside/trusted versus outside/untrusted perimeter firewall paradigm no longer applies. This creates numerous challenges: the architecture of IT has shifted underneath us all, driving a fundamental change in the way we approach security and act to secure our distributed resources and users.

Increasingly, organizations are looking to deploy a Secure Access Service Edge (SASE) architecture as the solution to this challenge. This architecture combines multiple features to reduce complexity and security-related risk, helping organizations prepare for the security challenges of the next 20 years. While the principles behind SASE are sound, it is not a “one size fits all” solution: each deployment is unique and needs to be considered in a way that addresses the needs and practices of the organization concerned.

In this paper, we consider what SASE is, how to deploy it, and what lessons can be learned from those already on the journey to better security in a digital-first world: the tools, the processes, and, not least in importance, partnering with the right vendor.

GigaOm Radar for Cloud Observability (https://gigaom.com/report/gigaom-radar-for-cloud-observability/, Fri, 26 Feb 2021)

Observability is an emerging set of practices, platforms, and tools that goes beyond monitoring to provide insight into the internal state of systems by analyzing external outputs. It’s a concept with roots in control theory, where observability was formalized in the mid-20th century, and it is rapidly gaining traction in IT today.

Of course, monitoring has been a core function of IT for decades, but old approaches have become inadequate for a variety of reasons—cloud deployments, agile development methodology, continuous deployments, and new DevOps practices among them. These have changed the way systems, infrastructure, and applications need to be observed so events and incidents can be acted upon quickly.

At the heart of the observability concept is a very basic premise: quickly learn what happens within your IT to avoid extended outages. And in the unfortunate event of an outage, you need to ensure that you can get to the root cause of it fast. Outages are measured by Mean Time To Resolution (MTTR) and it is the goal of the observability concept to drive the MTTR value to as close to zero as possible.
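For readers who want to put a number on that goal, MTTR is simply the total time spent resolving incidents divided by the number of incidents in the measurement window. A small sketch with made-up sample data:

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """MTTR = total resolution time / number of incidents.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2021, 2, 1, 9, 0),  datetime(2021, 2, 1, 9, 45)),   # 45 minutes
    (datetime(2021, 2, 3, 14, 0), datetime(2021, 2, 3, 16, 30)),  # 2.5 hours
    (datetime(2021, 2, 7, 22, 0), datetime(2021, 2, 7, 22, 20)),  # 20 minutes
]
print(mean_time_to_resolution(incidents))  # about 1 hour 12 minutes
```

Tracking the metric this way, per service and over time, is what shows whether an observability investment is actually moving MTTR toward zero.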

No surprise, building resilient service delivery systems that are available with high uptime is the ultimate end goal for any business. Achieving this goal requires executing three core concepts:

  • Monitoring: This is about understanding if things are working properly in a service-centric manner.
  • Observability: This is about enabling complete end-to-end visibility into your applications, systems, APIs, microservices, network, infrastructure, and more.
  • AIOps: This is about using comprehensive visibility to derive meaning from the collected data to yield actionable insights and courses of action.

To achieve observability, you need to measure the golden telemetry signals—logs, metrics, and traces. Logs and metrics have been measured by IT professionals for decades, but tracing is a fairly new concept that emerged as modern applications increasingly were built using distributed microservices. A service request is no longer completed by one service but rather by a composition of microservices, and as such there is an imperative to track, or trace, the service request from start to finish. To generate proper telemetry, all the underlying systems must be properly instrumented. This way, enterprises can achieve full visibility into their systems to track service calls, identify outages, and determine whether the impacted systems are on-premises, in the cloud, or somewhere else.
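The tracing idea can be illustrated without any particular toolkit: as long as every service attaches the same trace identifier, plus a span of its own, to the telemetry it emits, the backend can reassemble the full request path later. A simplified, framework-free sketch follows; the service names and fields are invented for illustration, and real deployments would more likely use a standard such as OpenTelemetry:

```python
import time
import uuid

def emit(span):
    # Stand-in for exporting a span to a trace backend.
    print(span)

def traced_call(service, operation, trace_id, parent_span, work):
    """Record one hop of a distributed request as a span tied to a shared trace_id."""
    span_id = uuid.uuid4().hex[:8]
    start = time.time()
    result = work()
    emit({
        "trace_id": trace_id,        # identical for every hop of this request
        "span_id": span_id,
        "parent_span": parent_span,  # lets the backend rebuild the call tree
        "service": service,
        "operation": operation,
        "duration_ms": round((time.time() - start) * 1000, 2),
    })
    return result, span_id

# One user request flowing through three hypothetical microservices.
trace_id = uuid.uuid4().hex
_, root = traced_call("api-gateway", "GET /checkout", trace_id, None, lambda: None)
_, cart = traced_call("cart-service", "load_cart", trace_id, root, lambda: None)
traced_call("payment-service", "authorize", trace_id, cart, lambda: None)
```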

Observability is not always about introducing new tools; it is about consolidating telemetry data, properly instrumenting systems to get the appropriate telemetry, creating actionable insights, and avoiding extended outages. Comprehensive observability is core to future-proofing IT infrastructure.

This report evaluates key vendors in the emerging application/system/infrastructure observability space and aims to equip IT decision makers with the information they need to select providers according to their specific needs. We analyze the vendors on a set of key criteria and evaluation metrics, which are described in depth in the “Key Criteria Report for Cloud Observability.”

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

Vendor Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

What’s New at DRYiCE? How Do They Fit In the Emerging Multi-Cloud Enterprise? (https://gigaom.com/2021/01/18/whats-new-at-dryice-how-do-they-fit-in-the-emerging-multi-cloud-enterprise/, Mon, 18 Jan 2021)

I attended the DRYiCE analyst day to look in on the progress of this relatively young part of HCL. Clearly, no moss is growing on their developers as they push a number of products in the marketplace, including an AIOps offering that is growing in popularity and sales, according to Amit Gupta, EVP and Global Head at DRYiCE.

The company has experienced almost 50% year-over-year growth, with DRYiCE Lucy, MyCloud, iAutomate, and iControl rounding out the best-selling products. To look at just one of those products, Lucy currently has 2.1 million active users and more than 3,000 use cases.

One can explain the success of DRYiCE around the emerging need for operationally focused tools, including those supporting emerging AIOps concepts. This will continue to grow through 2021 and 2022 as applications and data migrate to the cloud—most from organizations lacking a good understanding as to how they will deploy and operate these workloads effectively.

Companies today need to deal with widely distributed systems that are connected via a wide range of ISPs. When 10% of employees were traveling or working remotely from time to time, the task of managing systems was not that difficult. Today, with the vast majority of employees working from home, the complexity of IT operations has grown ten-fold, and that complexity is not going away. IT operations have had to adapt to a new normal where the environments they manage are often not under their direct control.

At the core of the journey is simplifying IT operations and rebuilding them around an AIOps tool. This means we are no longer married to manual processes. The cost center becomes proactive instead of reactive, as the tool can pragmatically predict the future.

When using AIOps, the focus is on the business and the elimination of higher risk delays and inconsistencies to achieve fully automated processes that reduce risk. The journey requires that we move from inefficient processes to automated processes that are consistent. Finally, there’s a move to self-service models, where IT is not dependent on people and functions that are likely to fail.

Also presented during the analyst day was a case study highlighting how a pharmaceutical company leveraged DRYiCE products as force multipliers. Specifically, Clayton Ching introduced a new product called ROAR, a data-oriented path to advanced operational management of distributed data.

This product does a few things that are emerging in the market but not yet in other systems, including:

  • Robust computational engine, meaning it has the ability to scale.
  • Single source of truth, which aggregates data from siloed multi-sourced data to a single reference record.
  • Database for billables, which maps Contractual RU against golden datasets to create a single source of truth here as well.
  • Out-of-the-box reports, enabling the use of standard analytics within the product.

If there was one suggestion I would give DRYiCE, it is to normalize their products a bit. The company seems to have too many (10 in total), and those attempting to figure out which product does what can find it confusing.

If that’s all there is to complain about, that’s not bad. Indeed, HCL’s innovative, operationally focused companies are leading the way to better operations both inside and outside of the cloud. As deployments grow more complex, this kind of capability is no longer an option but table stakes.

Key Criteria for Evaluating Cloud Observability (https://gigaom.com/report/key-criteria-for-evaluating-observability/, Fri, 18 Dec 2020)

The concept of observability has evolved over the years, referring to the ability to monitor internal states of systems using the externally exhibited characteristics of those systems. It provides the ability to predict the future behavior of those systems using data analysis and other technologies.

By monitoring what has already happened, enterprises can reactively fix the issues. Observability, in contrast, helps predict issues before they surface, thus helping to build a proactive enterprise. Ultimately, by automating observability, it’s possible to build hyper-automated, self-healing enterprises that fully understand what’s happening within the systems under management, and to predict and respond to likely outcomes.

Observability is important in the management and monitoring of modern systems, especially because modern applications are built to be quick and agile. The widespread practice of bolting on monitoring and management after applications are deployed is no longer sufficient. Thus, the notion of observability includes the implementation of modern instrumentation designed to help users better understand the properties and behaviors of an application.

It should be noted that monitoring and observability are not the same thing. Monitoring involves the passive consumption of information from systems under management. It uses dashboards to display the state of systems, and is generally purpose-built for static systems or platforms that don’t change much over time.

In contrast, observability takes a proactive approach to revealing system attributes and behavior, asking questions based on a system’s internal data. The technology is purpose-built for dynamic systems and platforms with widely changing complexity. An important feature of observability is the ability to analyze “what if” scenarios and trending applications to predict a possible future failure (See Figure 1).

Figure 1: Observability Is Built on Monitoring, Analysis, and AI-Enabled Learning Systems
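As a concrete, deliberately simple illustration of the trend-prediction idea described above, the sketch below fits a straight line to recent samples of a metric, such as disk usage, and estimates when it will cross a failure threshold. This is a toy example of my own, not a representation of any product's model:

```python
import numpy as np

def hours_until_threshold(samples, threshold, interval_hours=1.0):
    """Fit a linear trend to recent metric samples and project the threshold crossing.

    Returns the estimated number of hours until the metric reaches `threshold`,
    or None if the current trend never reaches it.
    """
    x = np.arange(len(samples)) * interval_hours
    slope, intercept = np.polyfit(x, samples, 1)  # simple linear fit
    if slope <= 0:
        return None  # flat or improving trend: no predicted failure
    crossing = (threshold - intercept) / slope
    return max(crossing - x[-1], 0.0)

# Disk usage (%) sampled hourly; estimate when it will hit 90%.
usage = [62, 64, 67, 69, 72, 74, 77]
print(hours_until_threshold(usage, threshold=90))  # roughly 5 hours at this rate
```

Production observability platforms use far richer models, but the principle is the same: project where the system is heading and act before the failure, rather than after it.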

Why is Cloud Observability So Important Now?

The insight that observability grants is particularly important because the way new applications are built and architected has changed over the years:

  • Cloud-based applications depend on multiple services, often via RESTful APIs and microservices.
  • Serverless and service-based applications have little visibility into underlying infrastructure.
  • Applications are highly distributed, typically based on containers and microservices, with the ability to scale up and down in seconds.
  • Hybrid and multi-cloud architectures that depend on multiple providers, in-house and external, are exceedingly complex and difficult to track.

Most monitoring and operations tools, including AIOps tools, claim some role in observability, but in some respects it’s like slapping a new label on an old nostrum. However, if you consider observability as a concept, not as a set of tools and technologies, you get a better idea of the value. For instance, observability means using analytical views that take raw data, provide a holistic view, find patterns in the data, make calculated predictions, and react to those predictions using automated and non-automated responses.

Observability should provide a full view of the enterprise, including on-premises, multi-cloud and hybrid cloud environments. The successful use of appropriate tools can help eliminate the cloud complexity issues that hinder some cloud migrations and new development.

The Emerging Observability Sector

One of the more important topics covered in this Key Criteria report and in the GigaOm Radar for Cloud Observability report is how to sift through the technologies that claim to have observability nailed. At present, the providers generally take different approaches, and while you can indeed find common patterns, it’s still difficult to compare apples to apples.

More tools are purpose-built for observability today, including most of the emerging AIOps toolsets, as well as other monitoring technologies like application performance monitoring (APM), network performance monitoring and diagnostics (NPMD), digital experience monitoring (DEM), and infrastructure monitoring, which have the ability to deal with complex operational data.

However, while all of these technologies claim to be built for observability, most are repurposed monitoring and management tools whose original purpose was simply to bring operational data into a single view. Therefore, we have a few ways to consider the observability market, including:

Traditional Monitoring Tools: These have been upgraded to provide modern services, such as storage and analysis, plus machine learning to support the notion of observability. These tools may be decades old but, generally speaking, have the advantage of providing better legacy systems integration.

Purpose-Built Observability Tools: These are typically focused on the acquisition and analysis of system data to report and respond to observations. They are generally data-first tools that may or may not have automated responses to analytical findings. Those without automation may expose an API that can be leveraged programmatically to deal with automation needs.

Hybrid Observability Tools: These are a mix of traditional tooling and more modern purpose-built observability tools, typically a combination of two or more technologies, created either through an acquisition or a strategic partnership. The risk with these is that you’re dealing with two or more tools that have their own independent paths and business interests and you may not be able to count on their continued functionality.

The logical features and functions found in most observability tools are depicted in Figure 2. Such tools need a place to store logs and other data for trending, correlation, and analysis. They need an analytics engine to determine patterns and trends. They typically include machine learning and AI systems to provide deeper knowledge management, as well as the ability to get smarter as they find and solve problems. Finally, they typically perform automation through an event processing system, which could be as simple as an API or as complex as a full-blown development environment that supports event orchestration.

Figure 2: Elements of an Observability System

Cloud Observability Findings

This report covers various aspects of observability, including how to plan, how to pick the right technologies, how CloudOps works in multi-cloud and hybrid cloud environments, and what operations and tooling you will need to be successful. Among the conclusions reached in this report are:

Observability is an advanced concept, and cloud observability tools are often not deployed correctly. Moreover, the skills needed for planning, building of the models, gathering system data, and proper deployment are frequently unavailable, all but ensuring failure on the part of many organizations.

Those leveraging observability tools typically don’t employ them to their maximum potential, especially with regard to using machine learning subsystems effectively. This limits the potential benefits an organization can accrue in terms of staying ahead of systems issues.

When leveraged correctly, observability tools potentially can increase systems reliability 100-fold because issues are automatically discovered, analyzed, and corrected—without human intervention. Moreover, the system is able to learn from successes, so system reliability improves over time. Proper observability tools and optimized IT processes can reduce the mean time to resolution (MTTR) metric by 50% to 90%, an important KPI to measure with modern “always on” systems.

The cloud observability market includes dozens of tools that claim to support the concept, but often these tools have very different primary missions, such as network, application performance, logging, or general operational monitoring. The lack of any real feature/functionality standards is causing many enterprises to wait for the market to normalize. The OpenTelemetry initiative from the Cloud Native Computing Foundation (CNCF) is one such effort that is gaining traction; most vendors in our report seem to have adopted it or are in the process of adopting it.

The push to move systems quickly to the cloud increases both complexity and risk, making observability more critical. This will likely persist for the next several years.

Organizations change over time, and their systems need to be able to change as well. Thus, extendable and configurable tools are a huge advantage.

Best Practices in Moving from ITOps to AIOps (https://gigaom.com/report/best-practices-in-moving-from-itops-to-aiops/, Tue, 20 Oct 2020)

In the era of distributed workforces and hybrid and multi-cloud environments, complexity is inevitable. Now an emerging class of Ops tools enabled by machine learning and artificial intelligence—called AIOps—is helping IT orgs streamline, optimize, and automate management. This report explores how organizations can get started now to advance their journey to AIOps, so they are equipped to deal with this complexity before it overwhelms them.

From the beginning Ops tools have been data gathering and analytics tools. Now artificial intelligence (AI) and machine learning (ML) are being applied to enable a new class of Ops tools, called AIOps, that can learn from the data they gather. In some cases, these tools can correct issues using pre-programmed routines, such as restarting a server or blocking an IP address that seems to be attacking one of your servers.
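As a minimal sketch of what one of those pre-programmed routines might look like, the example below blocks an IP address once it exceeds a failed-login threshold. The threshold, event fields, and surrounding pipeline are illustrative; the firewall command assumes a Linux host with iptables and root privileges:

```python
import subprocess
from collections import Counter

FAILED_LOGIN_LIMIT = 50  # per observation window; illustrative value

def block_ip(ip):
    """Drop all traffic from a suspicious source (Linux host, requires root)."""
    subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def remediate_failed_logins(events):
    """Scan auth events gathered by the Ops pipeline and block abusive sources."""
    failures = Counter(e["source_ip"] for e in events if e.get("outcome") == "failure")
    blocked = []
    for ip, count in failures.items():
        if count >= FAILED_LOGIN_LIMIT:
            block_ip(ip)
            blocked.append(ip)
    return blocked  # report back so humans can review what was automated
```

In a real AIOps deployment this kind of routine would be configured or learned rather than hard-coded, but the shape is the same: a detected condition mapped to a safe, repeatable corrective action.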

This capability provides a few advantages:

  • We can remove humans from CloudOps processes for the most part, only alerting when a situation requires human intervention. This means fewer operational personnel and lower costs.
  • We can integrate AIOps with other enterprise tools, such as DevOps or governance and security operations.
  • We can look for trends that allow the operational team to be proactive. For example, the AIOps tool can monitor a networking switch that is about to fail and is putting an increasing number of errors on the network.
  • We can build a knowledge base that at some point should exceed the knowledge of the entire CloudOps team and provide the ability to build proactive and preventive processes.

In this GigaOm report, we look at the process of moving from a traditional ITOps set of processes and skillsets to the use of AIOps and its new processes and skillsets. This is not a conceptual report but rather a report grounded in actionable guidance to help organizations move to AIOps-based operations successfully, including a stepwise process that should guarantee success if followed.

Conclusions reached in this report include:

  • The use of AIOps technology can pay for itself in a very short time, but only if users are able to move from existing traditional ITOps technology and processes successfully.
  • Metrics show that success is based on the ability to work around traditional silos and accept that automation of operations is a superior approach.
  • The rise of the remote workforce has led to a greater need for AIOps and the processes that go with this technology. Distributed workforces require an adaptable set of operational processes that leverage automation to resolve the additional IT complexity posed by remote work.
  • Reskilling the operations team is a critical success factor to a successful journey from traditional ITOps to AIOps. Those who are self-learners seem to provide the most promise for enterprise IT organizations.
  • Leveraging AIOps means renewing focus on the business, from prioritizing systems based on their importance to the enterprise to employing automation to achieve near-zero downtime.
  • The focus on self-service among IT resources and applications leads to a need to provide automation and monitoring, including integration with both security and governance resources.

Rancher (https://gigaom.com/report/rancher/, Fri, 26 Jun 2020)

Rancher Kubernetes Engine (RKE) is a supported distribution of Kubernetes that can run on many different platforms and infrastructures. Rancher is foundational in the container and Kubernetes worlds, focusing on ease of installation and setup, and supporting innovative directions around distributed cluster architectures and Kubernetes for small IoT devices.

Rancher integrates with and manages cloud-hosted Kubernetes services such as those found on Azure and AWS and has licensed its distributions to other cloud providers that are white-labeling Rancher. In our recent report, “GigaOm Radar for Leveraging Federated Kubernetes,” we declared Rancher to be one of the few Kubernetes distribution providers that focuses on federation. Moreover, it moves beyond the current DIY approaches that other Kubernetes providers require. In short, Rancher removes the complexity from the installation and setup of Kubernetes.

Rancher was one of the strongest products listed in the Radar report on federated Kubernetes. This Kubernetes distribution can support platform-native mechanisms, providing native security and native management and monitoring capabilities while eliminating the need to rely on third-party, non-native products. Moreover, the company is focused on Kubernetes distribution as a strategy, meaning that more distributed cluster features will likely arise from its product road map. This tool resides at the top of the Kubernetes food chain.
