Asking the right questions when testing AI systems

Guest contribution by Marco Achtziger | 08.05.2024

Artificial intelligence (AI) is currently on everyone’s lips and will sooner or later find its way into almost every software product. However, this technology also creates new challenges in the traditional disciplines of software development. In this article, I present a catalogue of questions that you can use when AI is used in your context and you are wondering how to test it. It will help you to structure the topic right from the start.

Question 1: Do we still need testers?

There are now various tool vendors showing how test cases can be generated automatically with the help of AI. So why should companies still need testers? Because they need someone with a trained eye to look at the results and uncover possible inconsistencies and systematic errors. It is not without reason that the manufacturers of these tools point out, at least in the small print, that the generated artefacts should still be checked by experienced users.

As a rule, the question of the necessity of testers can therefore be answered in the affirmative and does not have to be posed anew for every system.

Question 2: What can the tester contribute?

Quite a lot, of course. Testers are typically highly qualified specialists in their field and have learnt many methods and approaches for testing complicated and complex systems. Methods such as

  • boundary value analysis,
  • equivalence class partitioning and
  • property-based testing

are particularly relevant here. The output of generative AI is typically not exactly the same every time, but it must fall into certain ranges and categories.
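
To make this concrete, here is a minimal property-style test sketch in Python using the hypothesis library. The classify() function is a hypothetical stand-in for the AI component under test; the point is that the assertion checks a range of acceptable outputs rather than one exact value.

```python
# A minimal property-style test sketch, assuming a hypothetical
# classify() wrapper around the AI component under test.
from hypothesis import given, strategies as st

ALLOWED_LABELS = {"positive", "neutral", "negative"}  # assumed label set

def classify(text: str) -> str:
    # Dummy stand-in so the sketch is self-contained; in a real project
    # this would call the model or service.
    return "neutral" if not text.strip() else "positive"

# Property: whatever the input, the output must fall into the allowed
# category set -- we assert on the range, not on one exact value.
@given(st.text(max_size=200))
def test_output_is_a_known_label(text):
    assert classify(text) in ALLOWED_LABELS
```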

By and large, this question is settled too, analogous to question 1, but I would like to mention it for the sake of completeness.

Question 3: What else should testers potentially learn?

Something you may already have come across as a tester, namely in the area of “testing non-functional requirements”: a sound knowledge of statistics.

You don’t have to become a statistics expert or have a degree in maths, but attending a course like “Statistics 301”, which typically deepens the knowledge of introductory statistics, is recommended. The goal is to have a good understanding of outputs that are subject to certain probabilities.

If you take a closer look at the area of non-functional requirements, you will be confronted with statements such as “The response time of the service must not exceed 200 ms in 90% of cases and 500 ms in 99% of cases”. Statements like these are directly analogous to the probabilistic output of AI-based systems.
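
As a small illustration of how such a statement becomes an executable check, here is a sketch using only Python’s standard library; the response times are simulated and the thresholds are taken from the example above.

```python
import random
import statistics

# Simulated response times in ms; in practice these come from measurements.
random.seed(0)
samples = [random.lognormvariate(4.5, 0.5) for _ in range(10_000)]

# statistics.quantiles with n=100 yields the 1st..99th percentiles.
cuts = statistics.quantiles(samples, n=100)
p90, p99 = cuts[89], cuts[98]

# The non-functional requirement expressed as an executable check.
assert p90 <= 200, f"p90 was {p90:.0f} ms"
assert p99 <= 500, f"p99 was {p99:.0f} ms"
```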

Question 4: What are data scientists already testing?

Clear answer: a whole lot. The data scientist is also interested in generating a high-quality model from the data. The focus may differ from the tester’s, but the quality requirements are the same. It is therefore important that testers and data scientists exchange ideas. Many of the test cases a tester will come up with have probably already been run by the data scientist, such as checking that precision and accuracy meet the specified targets. This can avoid a lot of redundancy in testing.
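
A sketch of the kind of check that is often already part of the data scientist’s evaluation, here with scikit-learn; the labels and the 0.7 target values are made up for illustration.

```python
# y_true/y_pred stand in for a labelled evaluation set and the model's
# predictions; the 0.7 targets are assumed for illustration.
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

assert accuracy_score(y_true, y_pred) >= 0.7
assert precision_score(y_true, y_pred) >= 0.7
```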

Question 5: What else should data scientists learn?

Data scientists are usually very focussed on creating models and analysing the data. What would also be important here is to be aware of how the generated model fits into the overall system, e.g. which requirements it fulfils. Again, this can be done very easily in dialogue with the testers, who usually have exactly this view of the system.

Question 6: What do we actually gain from the use of AI?

As trivial as it may sound, this question should also be asked. Do I need AI here in particular or can I solve the problem just as well with other mechanisms? Even at Google, the first rule in the guideline for creating machine learning (ML) applications is: “Don’t be afraid to launch a product without machine learning”.

So as long as AI or ML does not bring any significant advantage, you should consider whether you want to use these technologies at all, because their use comes at the price of added complexity and high consumption of computing resources when training the models.

Question 7: Which parts of the system use AI?

This is also a very important question. In most cases, only very specific parts of a system use AI, and the proportion of AI in the overall system is rather small. It is not uncommon for large parts of a system to deal with “normal” things such as providing web services, evaluating REST requests and persisting imported data, while only one of several subsystems – e.g. the one that queries the model and returns responses – uses AI. Conversely, this means that for a large part of the system the question of AI does not arise at all, and it can be tested in the classic way.

Question 8: Are we dealing with a deterministic system, or is it (apparently) non-deterministic?

The latest research from December 2023 shows that far fewer AI systems can be forced into a deterministic mode than was assumed a year ago. Both generative AI systems (such as ChatGPT) and neural networks in general exhibit inherent non-determinism, even during inference, i.e. after their training is complete. This non-determinism can be reduced in all systems, but never completely eliminated.

The more deterministic a system is, the more it can be treated as a black box during testing. To put it bluntly: as a user, what do I care whether my query is processed by an AI or by another algorithm? Ideally, I always get the same result.
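
One way to deal with residual non-determinism in a test is to call the component repeatedly and assert on the spread of the results rather than on exact equality. The following sketch assumes a hypothetical infer() function with a small amount of noise.

```python
import random

def infer(x: float) -> float:
    # Hypothetical model call with a small amount of residual noise.
    return x * 2 + random.gauss(0, 0.01)

# Call the component repeatedly and check the spread of the results
# instead of demanding exact equality.
results = [infer(21.0) for _ in range(100)]
spread = max(results) - min(results)

assert abs(results[0] - 42.0) < 0.1      # close to the expected value
assert spread < 0.1, f"outputs vary by {spread:.4f}"
```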

Question 9: When should the results be the same? Under what circumstances should they be different?

This question seems identical or very similar to the previous one. However, it is more about cases that we already had before AI, namely situations known as “eventually consistent”. Particularly with distributed, independent services that communicate with each other, it can happen that correct results are displayed, yet the specific return values for two requests differ. An example would be so-called walls such as Facebook’s, which do not always consistently display the same data, simply to get the data to the user more quickly. Such situations can now increasingly occur with AI systems if they are either not completely deterministic or the output of several systems is combined into one final result.
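
For eventually consistent behaviour, a common testing pattern is to poll until the expected state appears or a deadline passes, rather than asserting immediately. A minimal sketch, with the read logic reduced to a stand-in dictionary:

```python
import time

def eventually(check, timeout=5.0, interval=0.1):
    """Retry a boolean check until it passes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Stand-in for reading the same value from two independent services;
# the assertion only requires that they converge eventually.
state = {"service_a": 42, "service_b": 42}
assert eventually(lambda: state["service_a"] == state["service_b"])
```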

Question 10: Are we dealing with a risky system?

I think the recently adopted EU AI Act [1] is the current benchmark here, even if it remains open to interpretation in its categorisation. Three categories of systems can be derived from it:

  • prohibited applications,
  • high-risk applications and
  • other applications.

I don’t think anyone will be pondering how to test prohibited applications. High-risk applications are applications in a safety-relevant environment (i.e. software that can directly or indirectly endanger people). These typically operate in a regulated environment already, and most test requirements can be derived from the regulations. All the more reason, however, to deal with the other questions.

Question 11: What do users do with the output?

Even in the case of non-risky applications, consideration should be given to how the output of the AI is further utilised. Are there ethical questions that should be asked and scrutinised? I think we’ve all heard of chatbots that have turned into extremist “personalities” relatively quickly, or of software that is supposed to make the application process in a company more efficient, but then excludes a certain group of people across the board because they were not represented strongly enough in the training data.

Question 12: Is there a golden test data set?

This seems to be an interesting question, especially for training. However, it should already be answered by the data scientist. From the tester’s point of view, the question must be asked from the perspective of the model user. And here it may well turn out that you don’t need a stable test data set at all, but can also test the functionality with changing data.

It becomes particularly interesting when I have a system that, for example, updates its models in the field and thus changes its output over time. This results in the need to measure the model against stable data sets during testing to ensure that it also develops in the right direction. I don’t think anyone wants a chatbot that suddenly becomes abusive or malicious based on user input and the processing of that input.
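
A minimal sketch of this idea: each model update is re-evaluated against a frozen golden data set, so a regression shows up as a drop in the score. The evaluate() function and the two toy models are purely illustrative.

```python
# A frozen "golden" set of input/expected pairs; evaluate() and the two
# toy models are purely illustrative.
GOLDEN_SET = [("input A", "expected A"), ("input B", "expected B")]

def evaluate(model, golden):
    hits = sum(1 for x, want in golden if model(x) == want)
    return hits / len(golden)

baseline_model = {"input A": "expected A", "input B": "expected B"}.get
updated_model = {"input A": "expected A", "input B": "something else"}.get

assert evaluate(baseline_model, GOLDEN_SET) == 1.0
score = evaluate(updated_model, GOLDEN_SET)
print(f"updated model scores {score:.2f} on the golden set")  # 0.50
```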

Question 13: What data from the system do we need to collect or monitor?

This is a classic question from the monitoring of systems, but it can also be very interesting here. In addition to the familiar metrics that can be collected for functioning software, model-specific metrics can also emerge. For example, how do the metrics for the model itself behave over time? Do accuracy and precision remain the same or do they deteriorate? Does the runtime or CPU load change over time due to the model?

In today’s cloud systems, this can also affect the operating costs of the system if billing depends on resource consumption.
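
As a sketch of such monitoring, the following compares a model metric collected over time against a baseline and flags degradation; the weekly accuracy values and the tolerance are assumed.

```python
# weekly_accuracy stands in for a model metric collected in production;
# the baseline and tolerance are assumed values.
weekly_accuracy = [0.91, 0.90, 0.91, 0.89, 0.88, 0.84]

baseline = weekly_accuracy[0]
TOLERANCE = 0.05  # acceptable drop before raising an alert

for week, acc in enumerate(weekly_accuracy):
    if baseline - acc > TOLERANCE:
        print(f"ALERT: accuracy dropped to {acc:.2f} in week {week}")
```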

Conclusion

The above questions help you to approach the “problem” of testing an AI application in a structured way and to break it down into its components. Go through the questions step by step and look for your answers. For example, you can determine which test data is required and to what extent, or discuss how to avoid redundancies, and thus unnecessary runs and waiting times during testing, in order to generate feedback for development more quickly.


Notes:

The questionnaire is a joint effort by Marco Achtziger, Alex Schladebeck (from Bredex GmbH), Andrea Huettner, Christian Matt (both from eschbach GmbH) and Dr Gregor Endler (from codemanufaktur GmbH).

You are welcome to use the questions in your projects. Marco Achtziger is happy to receive feedback and suggestions for additional questions that may also be useful when testing AI systems.

[1] EU law on artificial intelligence


Marco Achtziger

Marco Achtziger works for Siemens Healthineers as a test architect and trainer for test architects. He is an iSQI Certified Professional for Software Architecture, iSTQB Certified Tester and iSQI Certified Professional for Project Management. He is also happy to share his knowledge and experience across company boundaries and speaks at conferences such as OOP and Agile Testing Days.