Why shouldn't you be using a Resolution SLA?

In my feature article "What's Wrong with the Resolution SLA", I concluded that although the suggestion is impractical for many service providers, ideally, the Resolution SLA wouldn't be used until the accuracy of its underlying data can be substantially improved.

Considering it's the standard measure of performance in IT support, this is quite a suggestion, I know, but without statistically accurate data, the SLA is unable to adequately support the Incident Management process, which is its primary purpose. Until it can, perhaps customer satisfaction surveys should be the primary focus.

There are two perspectives on its use. Either it's based on the time taken to reach practical resolution, i.e. to restore normal service, which would be in line with ITIL's definition and purpose for Incident Management; or it's based on the time it takes to "resolve" or "complete" (i.e. pre-close) the service ticket.

I think most would agree that the former makes more sense, but in practice, it has to be the latter.

There's inevitably a big difference between the two time frames, accounted for mainly by "on-hold" periods. The difference is exacerbated because new support work tends to take precedence, so on-hold tickets often languish and age.
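To make the gap between the two time frames concrete, here's a minimal Python sketch. The ticket timestamps, the on-hold interval and the function names are all hypothetical, invented purely for illustration, and it assumes the gap is made up entirely of on-hold time, which is a simplification.

```python
from datetime import datetime, timedelta

def elapsed_time(opened, resolved):
    """Time from ticket creation to 'resolved' status, on-hold periods included."""
    return resolved - opened

def active_time(opened, resolved, holds):
    """Elapsed time minus the on-hold intervals (assumed not to overlap)."""
    on_hold = sum((end - start for start, end in holds), timedelta())
    return elapsed_time(opened, resolved) - on_hold

# Hypothetical ticket: opened Monday 09:00, marked resolved Thursday 09:00,
# on hold (say, awaiting the user) from Monday 11:00 to Thursday 08:00.
opened = datetime(2024, 1, 1, 9, 0)
resolved = datetime(2024, 1, 4, 9, 0)
holds = [(datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 4, 8, 0))]

print(elapsed_time(opened, resolved))        # 3 days, 0:00:00
print(active_time(opened, resolved, holds))  # 3:00:00
```

In this made-up case the ticket shows three days against the SLA even though only three hours of it were spent outside the on-hold period. The sketch only works if every on-hold interval is recorded faithfully, which is exactly the part that is so hard to control in practice.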

Organisations might decide to use their service management tool's ticket timer suspension feature in an attempt to account for on-hold periods and so gauge actual lead time for all service tickets, but doing so is inadvisable because controlling its use is almost impossible. Data will likely be even more inaccurate as a result.

You can't blame organisations for trying to account for on-hold periods though, as futile as it will surely prove to be. This is because metrics must support a process.

Strictly speaking, the time it takes to resolve or complete a ticket does fit with ITIL's outline for service ticket management, but the outline has a substantial shortcoming in not addressing the on-hold period.

Without accurately doing so, it's impossible to know how well a service provider is able to restore normal service, or indeed how quickly service requests are being fulfilled. In other words, it's impossible to measure performance in meeting the purpose of the Incident Management and Request Fulfilment processes. ITIL's outline for service ticket management is therefore not entirely fit for use.

ITIL contributors must know this but ITIL can't fill the gap because any derived solution would be too detailed in its prescription. It's therefore up to organisations to find a solution if they possibly can, contributing to the development of their own Incident Management process.

It is possible to account for on-hold periods quite accurately and to prevent tickets from languishing. While a solution is in no way obvious, it's a good example of why focus should always be on process. Process is the great enabler and we have the tools to do anything these days.

Reminds me of the adage "people, process and technology". "Process" sits in the middle for a reason.

So, this article draws on the perspective that because organisations usually measure the Resolution SLA with data that doesn't reflect the purpose of the Incident Management and Request Fulfilment processes, the data is inappropriate and therefore substantially inaccurate. In my opinion, organisations that haven't found how to fix the problem of the on-hold period should therefore not be using a Resolution SLA.

The difficulty is that managed service providers need to measure the performance of their support function, and internal IT departments quite rightly feel they should, in part to show accountability.

I recently opened a LinkedIn itSMF group conversation on the subject. I asked:

"I'm wondering if anyone has ever seen anything written on the Resolution SLA usually being statistically and practically flawed?" and when asked to elaborate, I wrote:

"Yes, in relation to the Incident Management process. So, for example, if a service provider agrees with the business or client that say 80% of medium priority service tickets, which might include requests as well as incidents, will be resolved/ completed within 3 days.

I'm wondering if others agree with me that producing accurate data for this is almost impossible, even if the service provider has the benefit of an Incident Manager to oversee breaches and the IM process generally."
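For readers less familiar with how such a target is checked, here's a rough Python sketch of the calculation. The lead times are invented, and `sla_compliance` is a hypothetical helper name rather than a feature of any real tool. It assumes each ticket's lead time is already known, which is of course the very thing I'm questioning.

```python
from datetime import timedelta

def sla_compliance(lead_times, target):
    """Fraction of tickets whose lead time is within the target."""
    if not lead_times:
        raise ValueError("no tickets to measure")
    met = sum(1 for t in lead_times if t <= target)
    return met / len(lead_times)

# Hypothetical lead times (in days) for ten medium priority tickets.
lead_times = [timedelta(days=d) for d in (0.5, 1, 1, 2, 2, 2.5, 3, 3, 4, 6)]
target = timedelta(days=3)   # "within 3 days"
required = 0.80              # "80% of medium priority service tickets"

compliance = sla_compliance(lead_times, target)
print(f"{compliance:.0%} within target")                        # 80% within target
print("SLA met" if compliance >= required else "SLA breached")  # SLA met
```

The arithmetic is trivial. The result is only as good as the lead times fed into it.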

The conversation received 35 comments, but most related only to the SLA's practicality, i.e. how definitions, parameters and data are managed and adjusted in partnership with the supported business. I've not personally been involved in such negotiations, but reason tells me these efforts are made through necessity, to make the best of a difficult or indeed impossible situation.

If a new service provider is bedding in, or the scope for support sees a significant upwards shift, or it's agreed that support resource can be taken away for a project, these negotiations are of course necessary.

If SLA performance declines or fails for any other reason and discussions become necessary, the likely cause is inaccurate data.

So, here's the thing. If you have significantly inaccurate supporting data, it unavoidably means the metric it serves is largely invalidated. The metric shouldn't really be used to represent performance. If it is, it won't have a great deal of meaning or purpose. The Resolution SLA is widely used though, because it has to be.

Inaccurate, invalid data causes the SLA's other two problems. Firstly, a Resolution SLA typically needs to allow a high number of service level target breaches (what then happens to the support requirement once it's breached is another subject). Secondly, the lead time (e.g. 3 days) is unrealistically long. It will always be interpreted as being the lead time for service to be restored or for a request to be fulfilled, but in fact it's not.

It's putting out the wrong message and so probably shouldn't be published or otherwise used to manage user expectations.

So, I really can't see any justification for using a Resolution SLA under usual circumstances, other than because it's necessary.

I very much agree it's necessary though. It's a good example of a paradox and one that must be affecting service provision the world over.

Elaborating a little more, if the reality of how support is provided might show that 80% of medium priority support tickets actually reach practical completion within 1 day, a service level target of 3 days doesn't seem appropriate.
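As a rough illustration of that comparison, the Python sketch below finds the lead time within which 80% of tickets actually reached practical completion. The figures and the function name are hypothetical, and I've assumed a simple nearest-rank percentile; other percentile methods would give slightly different answers.

```python
import math

def lead_time_at_percentile(lead_times_days, percent):
    """Smallest observed lead time within which `percent`% of tickets
    completed (nearest-rank method)."""
    if not lead_times_days:
        raise ValueError("no tickets to measure")
    ordered = sorted(lead_times_days)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical practical completion times, in days, for ten medium priority tickets.
practical = [0.1, 0.2, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0, 2.0, 5.0]
target_days = 3

actual = lead_time_at_percentile(practical, 80)
print(actual)                # 1.0
print(actual < target_days)  # True
```

With these invented numbers, 80% of tickets are practically complete within a day, so a 3 day target says very little about the service users actually receive.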

I understand the argument that this longer period allows for the need to actively manage individual tickets that are essentially on-hold rather than let them languish, but it's missing the point. In my experience, IT departments and managed service providers quite rightly don't often find themselves overstaffed, yet without a truly advanced Incident Management process, overstaffing would be necessary to have any hope of proactively managing all open tickets. Hence the need for service review discussions over why SLA performance has suffered, despite the permitted breach percentage and target time frame usually being excessive.

Comments in the group conversation that related to the "on-hold" period agreed with me.

On-hold is seemingly impossible to control and so either isn't used - but it must be - or its problems are ignored - but they mustn't be. This is another paradox.

In conclusion, it seems to me that IT departments and managed service providers, with the supported business, aim to reduce the breach percentage and lead time but rarely fix the fundamental flaw. Really though, it's this that deserves attention first of all.

I hope I've got my thoughts on this tied to reality. I'd welcome any comments you might have. If you agree with me and you find the subject as interesting as I do, I've gone into more detail in my feature article. It's found at:

www.opimise.com/resolutionsla