Learning from IT mistakes

Guest contribution by Astrid Kuhlmey | 18.05.2020

Your computer seems to be broken. The cause is unclear, but it’s getting hot and the processor is already throttling. To avoid overheating, you shut the computer down for safety’s sake, let it cool off and even replace the fan. Once the measured temperature is back within the permitted range, you start it up again and continue working as before.

With such a simple system – at least with a little PC experience, and unless it’s a virus or a damaged fan cable – you have a good chance that this will work. Of course, it’s unpleasant that you have lost the contents of a file, but with a little effort you can make up for it. All good, right?

Troubleshooting in complex IT systems

I come from the world of data centres and application support. With the introduction of computer networks and networked applications, at the latest, troubleshooting was no longer quite so easy. Professional root cause analysis followed the principle of ruling out one possible cause after another, and therefore often took (too) long. Impatient application owners did not make things any easier for the technical staff, even though IT had told them several times in advance to prepare for possible downtime. And it was precisely those who had been particularly thrifty in the run-up to the project who now stood at the front of the line of critics, demanding replacement systems, technical masterpieces and sometimes small miracles.

In times of Corona, these memories come back forcefully, and despite all the differences between biological, economic and technical systems, I have a suspicion about the reason for this when I observe the current situation.

Sometimes an exact cause could not really be found, or it was simply induced from outside and therefore could not have been prevented with the means at hand. A few cables were changed, a cleaning lady (or rather her vacuum cleaner) was banished from the computer room, a long-awaited update of the database software was installed, everything was booted up and off we went. The users said, “You see, it was quite simple”, and happily carried on as before with their austerity measures and their belief in the possibility of an infallible, yet so fragile, IT solution. Not so the IT experts, who continued to shout prophecies of doom and found the “solution” shaky and unprofessional.

Sometimes the strategy even worked. However, this had the fatal effect that other users joined the savings programme and neglected precautionary measures. The next crash was not far away, and the blame usually landed on IT. At some point its management was replaced, or IT was even “outsourced”. After a short breather for everyone involved, the first people realised that nothing had improved. On the contrary: it became more expensive, and from the users’ perspective there was less and less room for influence.

In fact, things often went much better if, with diffuse error causes, a slow and gradual startup was possible and the application owners supported this process. Only parts of the application(s) were made available, access was not possible from everywhere but only via designated paths, and the number of users was temporarily reduced. Every further step of the startup was discussed in advance:

  • which components were affected,
  • which risks existed and how they were assessed,
  • and how long it would take to reach the next step.

If problems occurred in one step, the state before the last change was restored and the affected components were examined more closely.
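In today’s terms, you could think of this as a staged startup with a rollback point after each step. The following Python sketch is purely illustrative – it is not something we used back then, and the stage names and the apply/check/rollback hooks are hypothetical placeholders for the agreed components, success criteria and restore procedures.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Stage:
        name: str                      # e.g. "read-only access for a pilot group" (hypothetical)
        apply: Callable[[], None]      # bring this part of the system online
        healthy: Callable[[], bool]    # check the success criteria agreed for this step
        rollback: Callable[[], None]   # restore the state before this change

    def staged_startup(stages: List[Stage]) -> bool:
        """Bring the system up one agreed step at a time; stop and roll back on failure."""
        for stage in stages:
            print(f"Starting step: {stage.name}")
            stage.apply()
            if not stage.healthy():
                print(f"Problem in step '{stage.name}' – restoring previous state")
                stage.rollback()
                return False  # examine the affected components before trying again
        return True

The point here is not the code but the discipline it encodes: every step has agreed success criteria and a defined way back to the previous state.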

Parameters for a professional approach

Two things were essential in the troubleshooting process:

  1. Working with clear priorities.
  2. Paying attention to the concerns of those who were not high on the priority list.

Why was the second point so important? If even one party tried to push through uncoordinated special solutions for certain user groups, the process would falter in the most harmless case and, in the worse one, descend into confusion. More than once the well-known effect kicked in: “Go back to start, do not pass Go.”

Then as now, the following applies: assigning blame helps neither internally nor externally. On the contrary, it almost always ties up resources that are needed elsewhere. In retrospect, at least some departments learned to improvise. And the company recognised which processes were actually business-critical and which were not. As a consequence, this also put the standing of many executives to the test.

A professional approach requires learning from mistakes

In professionally run areas, weaknesses were assessed both during the process and afterwards, and the questions were of a fundamental nature:

  • Was the system still up-to-date at all?
  • What value did it have for the company?
  • Where was it (too) complicated and therefore difficult to maintain?
  • Where were components prone to errors?
  • What (investment) costs were necessary for appropriate operation?
  • And also: Which business processes are really relevant for the company?

Even when discussing these questions, good cooperation and a willingness to refrain from pushing through partial interests were imperative in order to achieve sustainable results. It was simply a matter of learning from mistakes.

Fair enough. Now what?

What do you take from my description? What can you and your organisation learn from it?

Let’s take a look at companies today, in the age of Corona. In many of them, the infrastructure is practically out of service from one moment to the next. Employees cannot do their jobs, or only to a very limited extent. Now you can rail against the infrastructure, put pressure on those involved and demand that it simply work: “You should have seen this coming.” Or: “Typical, X and Y were asleep again!”

I think it makes much more sense to think about your own future, about outdated processes and the appropriate precautions for risks and crises. For example, you might ask yourself:

  • What role does my product play in times of crisis and what are the consequences? Above all, what importance does it have then?
  • How much do I rely on the availability of fragile logistics or global distribution?
  • How and with what offer can I maintain my business in times of crisis to the extent that I don’t go bankrupt?
  • What reserves do I build up for times of crisis to remain independent?
  • How can I contribute (financially) to receiving support from others in times of crisis or to ensuring that the necessary infrastructure is sufficiently secured to continue working?
  • What do I have to change in my processes?
  • Which earlier savings measures turn out to be ruinous?

Many of these questions can really only be answered adequately with a changed idea of how companies should operate. More common ground between companies instead of a ruinous price war, more cooperation with partners at the local level – these are parameters that companies can manage on their own, without depending on politics.

Of course it is important that the “users” of companies, i.e. the “customers”, are prepared to pay appropriately for services. As an IT manager, I learned that service orientation and expectation management are essential prerequisites for successful action. I recently saw a television report about the Kiel Heroes who, instead of competing with each other, now offer customers a service adapted to the Corona situation.[1] In my view, this is a great example of a different, “better” approach.

A personal remark

Perhaps you are wondering whether IT systems and the situation in and around companies in Corona times are really comparable.

Of course you can argue that the way we do business has worked excellently for many decades and that a “single failure” does not necessarily mean that further crises will follow. But I see it differently. For me, science shows very clearly that complex systems have many things in common, and that unexpected events in complex systems are becoming increasingly important. In fragile systems, such events quickly turn into a crisis. And even if you see it differently: why not simply try to learn from each other? Why not question situations in organisations, in industries and also in societies in order to identify your own weaknesses and risks? I believe that with such a viewpoint everyone wins in the end.

 

Notes:

[1] https://www.kielerhelden.de/

Astrid Kuhlmey has published more articles in the t2informatik Blog, including:

  • Planning subject to reservation
  • Letting go is the new way of planning
  • How can I avoid uncertainty?

Astrid Kuhlmey

Computer scientist Astrid Kuhlmey has more than 30 years of experience in project and line management in pharmaceutical IT. She has been working as a systemic consultant for 7 years and advises companies and individuals on necessary change processes. Sustainability as well as social and economic change and development are close to her heart. Together with a colleague, she has developed an approach that promotes the competence to act and decide in situations of uncertainty and complexity.