AI: Driving reliable, stable IT operations

An article that appeared in early 2020 talked about how AI was expected to be the new catalyst for software development.

The article stated that AI-powered software development tool providers had raised more than $700 million in just 12 months. And this was before Covid-19 compelled enterprises to undertake rapid digital transformation.

Of course, this move forward has been accompanied by an accelerated growth in AI adoption as well.

According to the Digital Acceleration report, AI has catapulted to become one of the biggest drivers of technology investment for global enterprises.

AI’s role in IT

With the increasing adoption of Site Reliability Engineering (SRE) concepts in mainstream enterprises, automation is becoming more pervasive in the world of IT operations.

This article explores the prevalence of AI/ML in application IT support rather than infrastructure support, because application IT support is the more complex problem to solve.

Let’s look at the three types of tickets that typically get created in IT operations, namely service requests, incidents, and alerts, and consider how AI/ML is used to handle each type.

Service requests

Service request handling is the most common use of AI/ML because Standard Operating Procedures (SOPs) are easy to create for such tickets. Once an SOP exists, natural language processing (NLP)-based understanding and classification models, combined with robotic process automation (RPA), can resolve these tickets automatically unless authentication is required.

Where authentication is required, OpsBots (ChatBots) could be an alternative to self-service portals. ChatBots also bring the added advantage of helping visually challenged people.
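The flow from request text to SOP can be sketched roughly as follows. This is a minimal illustration, not a production design: a simple keyword matcher stands in for a real NLP classification model, and the categories, keywords, and SOP identifiers are all hypothetical.

```python
import re

# Hypothetical SOP catalogue: category -> (trigger keywords, SOP identifier).
SOP_CATALOGUE = {
    "password_reset": ({"password", "reset", "locked", "unlock"}, "SOP-PWD-RESET"),
    "access_request": ({"access", "permission", "role", "grant"}, "SOP-ACCESS-GRANT"),
    "disk_cleanup": ({"disk", "space", "cleanup", "storage"}, "SOP-DISK-CLEANUP"),
}


def classify_request(text, min_hits=1):
    """Return (category, sop_id) for a service request, or (None, None).

    Keyword overlap is a stand-in for an NLP classifier (an assumption
    made for illustration only).
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(tokens & kws) for cat, (kws, _) in SOP_CATALOGUE.items()}
    best = max(scores, key=scores.get)
    if scores[best] < min_hits:
        return None, None  # no SOP applies; leave the ticket for a human
    return best, SOP_CATALOGUE[best][1]


if __name__ == "__main__":
    print(classify_request("Please reset my password, my account is locked"))
    print(classify_request("The coffee machine is broken"))
```

In a real deployment the returned SOP identifier would be handed to an RPA workflow, and unmatched requests would fall through to the service desk.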

Incidents

Incident handling can be categorised into three use cases: Recovery, Resolution, and Prevention.

Recovery: In today’s world, where infrastructure as code, service mesh, containerisation and microservices architecture are becoming the norm, automated recovery using AI/ML ensures high availability (HA) in these applications or platforms. This might include, but is not limited to, auto-scaling of applications based on model rules and automated mission control operations such as segmentation, backpressure, and bulkhead creation, with AI/ML applying these remediation techniques automatically.

These are achieved by pairing simple pattern recognition models with the relevant actions, which are then executed automatically.
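One way to picture this pairing of patterns and actions is a small rule table. The sketch below is illustrative only: the metric names, thresholds, and action names are hypothetical, and a fixed threshold check stands in for a learned pattern recognition model.

```python
def sustained_above(samples, threshold, window=3):
    """True if the last `window` samples all exceed `threshold`."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)


# Each rule pairs an action with the pattern that triggers it.
# Metric names, thresholds, and actions are assumptions for illustration.
RULES = [
    ("scale_out", lambda m: sustained_above(m["cpu_pct"], 80)),
    ("apply_backpressure", lambda m: sustained_above(m["queue_depth"], 1000)),
]


def plan_recovery(metrics):
    """Return the actions whose trigger pattern matches the current metrics."""
    return [action for action, matches in RULES if matches(metrics)]


if __name__ == "__main__":
    metrics = {"cpu_pct": [50, 85, 90, 95], "queue_depth": [10, 20, 30, 40]}
    print(plan_recovery(metrics))
```

Requiring several consecutive samples above the threshold is a simple guard against reacting to a single spike.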

Resolution: Incident resolution involves routing, triaging, and remediating the incident.

(a) Routing: In any conventional incident resolution cycle, identifying the right person or resource and routing the ticket to them is a typical form of waste when lean management principles are applied to the IT operations value stream.

AI optimises the ticket allocation process by referencing data from all previous ticket allocations — from the service desk to the various operations teams.

It also takes into consideration existing information on the ticket hops that have taken place previously.

With historical allocation data and the ability to automatically categorise a ticket using NLP and ticket type, allocation to the appropriate team is seamless and fast. In certain cases, tickets are assigned to the exact engineer whose code base was problematic.

This is possible using AI/ML and, more importantly, the ability to trace an error back to the actual engineers through the backward traceability established by mature CI/CD practices.
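A minimal sketch of routing from historical allocations is shown below. It assumes each past ticket is recorded as a category plus the ordered list of teams it hopped through, with the last team being the one that resolved it. The data shape, team names, and the "most frequent resolver wins" rule are illustrative assumptions, not a description of any particular product.

```python
from collections import Counter, defaultdict


def build_routing_table(history):
    """Map each ticket category to the team that most often resolved it.

    history: iterable of (category, hops), where hops is the ordered list
    of teams the ticket passed through (hypothetical data shape).
    """
    counts = defaultdict(Counter)
    for category, hops in history:
        if hops:
            counts[category][hops[-1]] += 1  # last hop = resolving team
    return {cat: c.most_common(1)[0][0] for cat, c in counts.items()}


def route(category, table, default="service_desk"):
    """Send the ticket straight to the likely resolver, skipping earlier hops."""
    return table.get(category, default)


if __name__ == "__main__":
    history = [
        ("db", ["service_desk", "app_team", "dba"]),
        ("db", ["dba"]),
        ("db", ["service_desk", "app_team"]),
        ("network", ["netops"]),
    ]
    table = build_routing_table(history)
    print(route("db", table), route("unknown", table))
```

The category itself would come from the NLP categorisation step; unknown categories fall back to the service desk.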

(b) Triaging: This step in the resolution process takes the most time and effort in IT operations. AI/ML helps operators triage incidents faster in three ways: conversational UI-driven intelligent known error databases (KEDBs) that enable semantic search; advances in observability that provide a 360-degree view of the state of dependent systems or actors during the incident; and suggestions of possible remediations based on semantic patterns.
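As a rough illustration of looking up a KEDB during triage, the sketch below ranks known errors by word overlap with the incident description. Jaccard similarity on tokens is only a crude stand-in for true semantic search, and the entry fields (`symptom`, `remediation`) are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_kedb(query, kedb, top_n=3):
    """Return up to top_n KEDB entries ranked by Jaccard similarity.

    kedb: list of dicts with 'symptom' and 'remediation' keys
    (a hypothetical schema for illustration).
    """
    q = tokens(query)
    scored = []
    for entry in kedb:
        e = tokens(entry["symptom"])
        union = q | e
        score = len(q & e) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_n]]


if __name__ == "__main__":
    kedb = [
        {"symptom": "payment gateway timeout on checkout",
         "remediation": "restart gateway connector"},
        {"symptom": "disk full on log volume",
         "remediation": "rotate logs"},
    ]
    for hit in search_kedb("checkout timeout errors", kedb):
        print(hit["remediation"])
```

A semantic implementation would replace the token sets with text embeddings, but the ranking and suggestion flow would look much the same.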

(c) Remediating: Triaging would most likely lead to suggested remediations, as described above.

In mature cases, prescribed remediations that the operator accepts are also monitored to eventually enable straight-through remediation, or self-healing. This is still quite rare in the application operations space, where SOPs are hard to come by for incidents.

Prevention: So far, we have been exploring how AI and ML can help in the resolution of an incident. But how do we prevent incidents before they can even occur?

Preemptive resolution of possible incidents is perhaps one of the most ambitious applications of AI in IT operations. Achieving something like this depends on learning models that can identify the strongest indicators and causes of an incident risk and the degree of threat. When it comes to preventing incidents, AI and ML can be used to model and predict systems behaviour based on a range of parameters that we can analyse.

Different models are used to predict systems behaviour depending on the level of maturity of the available data. These are:

* Probability distribution which focuses on internal two-dimensional data

* Topological data analysis which focuses on internal multi-dimensional data

* Game theory which focuses on both internal as well as external multi-dimensional data
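To make the simplest of these concrete, the sketch below fits a probability distribution to the history of a single metric and flags readings that would be very unlikely under it. Treating "two-dimensional" data as one metric observed over time, the choice of a normal distribution, and the 1 per cent cut-off are all assumptions made for illustration.

```python
from statistics import NormalDist


def tail_probability(history, value):
    """Probability of a reading at least this high under a normal fit.

    Assumes the metric is roughly normally distributed, which real
    operational metrics often are not.
    """
    dist = NormalDist.from_samples(history)  # needs at least two varying samples
    return 1.0 - dist.cdf(value)


def is_incident_risk(history, value, alpha=0.01):
    """Flag the reading if it is rarer than the (hypothetical) alpha cut-off."""
    return tail_probability(history, value) < alpha


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
    print(is_incident_risk(latency_ms, 110))
    print(is_incident_risk(latency_ms, 101))
```

Topological data analysis and game-theoretic models need considerably more data and machinery, so they are not sketched here.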

Alerts

In contrast to traditional IT systems monitoring, AI and ML can be used to observe a system from a business-down perspective. AI/ML correlates events from various monitoring tools and infers the behaviour of a business capability or sub-process.

Intelligent alert aggregation reduces the number of alert tickets and helps identify the real source of an alert, thereby reducing discovery, triage, and remediation time.
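A simplified sketch of alert aggregation follows. It assumes a known dependency map between services and groups alerts that share the same upstream source within a time window. The service names, the five-minute window, and the rule "walk upstream while the upstream service is also alerting" are hypothetical simplifications of what a correlation engine would learn from data.

```python
def root_of(service, depends_on, alerting):
    """Follow the dependency chain upstream while the upstream is also alerting."""
    seen = {service}
    while True:
        upstream = depends_on.get(service)
        if upstream is None or upstream not in alerting or upstream in seen:
            return service
        seen.add(upstream)
        service = upstream


def aggregate_alerts(alerts, depends_on, window_s=300):
    """Collapse raw alerts into one ticket per root source and time window.

    alerts: list of (timestamp_seconds, service) tuples.
    depends_on: service -> the service it depends on (hypothetical map).
    Simplification: a service counts as alerting if it appears anywhere
    in the batch, regardless of time.
    """
    alerting = {svc for _, svc in alerts}
    tickets = []
    for ts, svc in sorted(alerts):
        root = root_of(svc, depends_on, alerting)
        for ticket in tickets:
            if ticket["root"] == root and ts - ticket["start"] <= window_s:
                ticket["alerts"].append((ts, svc))
                break
        else:
            tickets.append({"root": root, "start": ts, "alerts": [(ts, svc)]})
    return tickets


if __name__ == "__main__":
    deps = {"checkout": "payments", "payments": "db"}
    raw = [(0, "db"), (10, "payments"), (20, "checkout"), (30, "search")]
    for t in aggregate_alerts(raw, deps):
        print(t["root"], len(t["alerts"]))
```

Here four raw alerts collapse into two tickets, and the ticket for the database failure already names the likely source rather than the downstream symptoms.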

Another outcome of this approach is the elimination of unforced errors in ticket prioritisation and allocation. This, in turn, saves costs and allows operations teams to focus on priority tasks.

Conclusion

AI and machine learning (ML) models that analyse data patterns in systems have shown the potential to streamline every phase of not only operations but also development, and hence find their place in most DevOps and SRE implementations. But as has become evident in my experience, a technology is only as valuable as its implementation, and that will continue to be the key differentiator for effective AI and ML adoption in an enterprise.

If the underlying datasets are well understood and appropriate AI/ML models are adopted, we can realise benefits of at least a 55 per cent reduction in tickets, a 45 per cent reduction in operators, and a 70 per cent improvement in NPS for the IT operations team.

The writer is Associate Vice-President, D&A, HCL Technologies
