Cognitive Quality Assurance — An Introduction

This article has a slant towards the IBM Watson Developer Cloud Services but the principles and rules of thumb expressed here are applicable to most cognitive/machine learning problems.



Quality assurance is arguably one of the most important parts of the software development lifecycle. In order to release a product that is production ready, it must be put under, and pass, a number of tests. These include unit testing, boundary testing, stress testing and other practices that many software testers are no doubt familiar with. The ways in which traditional software is tested are relatively clear. In a normal system, developers write deterministic functions: if you put an input parameter in then, unless there is a bug, you will always get the same output back. This principle makes it (well, not easy, but) less difficult to write good test scripts and to know that there is a bug or regression in your system if these scripts get a different answer back than usual.

Cognitive systems are not deterministic in nature. This means that you can receive different results from the same input data when training a system. Such systems tend to be randomly initialised and learn in different, nuanced ways every time they are trained. This is similar to how identical twins, despite being biologically identical, still develop their own preferences, memories and skillsets.

Thus, a traditional unit testing approach with tests that pass and fail depending on how the output of the system compares to an expected result is not helpful.

This article is the first in a series on Cognitive Quality Assurance or, in other words, how to test and validate the performance of non-deterministic machine learning systems. In today’s article we look at how to build a good quality ground truth, how to carry out train/test/blind data segmentation, and how you can use your ground truth to verify that a cognitive system is doing its job.

Ground Truth

Let’s take a step back for a moment and make sure we’re ok with the concept of ground truth.

In machine learning/cognitive applications, the ground truth is the dataset which you use to train and test the system. You can think of it like a school textbook that the cognitive system treats as the absolute truth and first point of reference for learning the subject at hand. Its structure and layout can vary depending on the nature of the system you are trying to build but it will always abide by a number of rules. As I like to remember them: R-C-S!

Representative of the problem

Like Da Vinci when he drew the anatomically correct Vitruvian man, strive to represent the data as clearly and accurately as possible — errors make learning harder!
  • The ground truth must accurately reflect the problem you are trying to solve.
  • If you are building a question answering system, how sure are you that the questions in the ground truth are also the questions that end users will be asking?
  • If you are building an image classification system, are the images in your ground truth of a similar size and quality to the images that you will need to tag and classify in production? Do your positive and negative examples truly represent the problem? For example, if you only have black and white images in your positive examples but are learning to find cats, the machine might learn to assume that black and white implies cat.
  • The proportion of each type is an important factor. If you have 10 classes of image or text and one particular class occurs 35% of the time in the field, you should try to reflect this in your ground truth too.
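
One rough way to check the last point is to count how often each class appears in the ground truth and compare that with what you expect in the field. The sketch below is only an illustration: the label names, the expected proportions and the 5% tolerance are all made-up assumptions, not values from any Watson service.

```python
from collections import Counter


def class_proportions(labels):
    """Return each label's share of the ground truth as a fraction of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def proportion_gaps(labels, expected, tolerance=0.05):
    """Return classes whose share differs from the expected share by more than tolerance."""
    actual = class_proportions(labels)
    gaps = {}
    for label, target in expected.items():
        share = actual.get(label, 0.0)
        if abs(share - target) > tolerance:
            gaps[label] = (share, target)
    return gaps


# Hypothetical ground truth labels and hypothetical field estimates.
labels = ["mortgage"] * 50 + ["credit_card"] * 40 + ["car_insurance"] * 10
expected = {"mortgage": 0.50, "credit_card": 0.15, "car_insurance": 0.35}

for label, (share, target) in proportion_gaps(labels, expected).items():
    print(f"{label}: {share:.0%} of ground truth, expected about {target:.0%}")
```

In this made-up data the mortgage class matches its expected share, while the other two classes are flagged as over- and under-represented.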


Consistent

  • The data in your ground truth must follow a logical set of rules, even if these are a bit “fuzzy”. After all, if a human can’t decide on how to classify a set of data consistently, how can we expect a machine to do this?
  • Building a ground truth can often be a very large task requiring a team of people. When working in groups it may be useful to build a set of guidelines that detail which data belongs to which class and list some examples. I will cover this in more detail in my article on working in groups.
  • Humans can be inconsistent by nature so, if at all possible, try to automate some of the classification using dictionaries or pattern matching rules.
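
As a minimal sketch of what rule-based labelling could look like, the snippet below suggests a class only when exactly one class's patterns match and otherwise leaves the example for a human. The class names and patterns are hypothetical; real rules would come from your own labelling guidelines.

```python
import re

# Hypothetical labelling rules: each class has a list of regular expressions.
RULES = {
    "car_insurance": [r"\bcar insurance\b", r"\bno[- ]claims\b"],
    "mortgage": [r"\bmortgage\b", r"\brepayments?\b"],
    "credit_card": [r"\bcredit card\b", r"\bapr\b"],
}

COMPILED = {
    label: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for label, patterns in RULES.items()
}


def suggest_label(text):
    """Return the one class whose rules match, or None if zero or several classes match."""
    matches = [
        label
        for label, patterns in COMPILED.items()
        if any(pattern.search(text) for pattern in patterns)
    ]
    return matches[0] if len(matches) == 1 else None


for question in [
    "How do I renew my car insurance?",
    "Can I pay my mortgage with a credit card?",
    "What are your opening hours?",
]:
    print(question, "->", suggest_label(question))
```

Returning None for ambiguous or unmatched text is a deliberate choice here: the rules only take the easy cases off the team's hands, and anything unclear still goes to a person.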

Important: never use cognitive systems to generate ground truth or you run the risk of introducing compounding learning errors.

Statistically significant

More data points means that the cognitive system has more to work with — don’t skimp on ground truth — it will cost you your accuracy!
  • The ground truth should be as large as is affordable. When you were a child and learned the concept of dog or cat, the chances are you learned that from seeing a large number of these animals and were able to draw up mental rules for what a dog entails (4 legs, furry, barks, wags tail) vs what cat entails (4 legs, sometimes furry, meows, retractable claws). The more diverse examples of these animals you see, the better you are able to refine your mental model for what each animal entails. The same applies with machine learning and cognitive systems.
  • Some of the Watson APIs list minimum ground truth quality requirements and these vary from service to service. You should always aim as high as possible but, as an absolute minimum, aim for at least 25% more than the service requirement so that you have some data left over for blind testing (all will be revealed).

There are some techniques for testing with smaller corpuses that I will cover in a follow-up article.

Training and Testing — Concepts

Once we are happy with our ground truth, we need to decide how best to train and test the system. In a standard software environment, you would want to test every combination of every function and make sure that all combinations work. It may be tempting to jump to this conclusion with cognitive systems too. However, this is not the answer.

Taking a step back again, let’s remember being back at school. Over the course of a year we would learn about a topic and at the end there was an exam. We knew that the exam would test what we had learned during the year but we did not know:

  • The exact questions that we would be tested on. We had some idea of the sorts of questions that might come up, but if we had known the exact questions we could have looked up the answers ahead of time.
  • The exact exam answers that would get us the best results before we went into the exam room and took the test. That’d be cheating, right?

With machine learning, this concept of learning and then blind testing is equally important. If we train the algorithm on all of the ground truth available to us and then test it, we are essentially asking it questions we already told it the answers to. We’re allowing it to cheat.

By splitting the ground truth into two datasets, training on one and then asking questions with the other — we are really demonstrating that the machine has learned the concepts we are trying to teach and not just memorised the answer sheet.

Training and Testing — Best Practices


Generally we split our data set into 80% training data and 20% testing data. This means that we are giving the cognitive system the larger chunk of information to learn from and testing it on a small subset of those concepts (in the same way that your professor gave you 12 weeks of lectures to learn from and then a 2 hour exam at the end of term).

It is important that the test questions are well represented in the train data (it would have been mean of your professors to ask you questions in the exam that were never taught in the lectures). Therefore, you should make sure to sample ground truth pairs from each class or concept that you are trying to teach.

You should not simply take the first 80% of the ground truth file and feed it into the algorithm and use the last 20% of the file to test the algorithm — this is making a huge assumption about how well each class is represented in the data. For example, you might find that all of the questions about car insurance come at the end of your banking FAQ ground truth resulting in:

  • The algorithm never sees what a car insurance question looks like and does not learn this concept.
  • The algorithm fails miserably at the test because most of the questions were on car insurance and it didn’t know much about that.
  • The algorithm has examples of mortgage and credit card questions but is never tested on these — we can’t make any assertions about how well it has learned to classify these concepts.
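
The problem is easy to reproduce. The sketch below builds a made-up ground truth that is sorted by topic (the question texts, class names and counts are all invented for illustration), cuts it at the 80% mark and counts the classes on each side.

```python
from collections import Counter

# Hypothetical ground truth file, sorted by topic with car insurance at the end.
ground_truth = (
    [(f"mortgage question {i}", "mortgage") for i in range(40)]
    + [(f"credit card question {i}", "credit_card") for i in range(40)]
    + [(f"car insurance question {i}", "car_insurance") for i in range(20)]
)

# Naive split: first 80% of the file for training, last 20% for testing.
cut = int(len(ground_truth) * 0.8)
train, test = ground_truth[:cut], ground_truth[cut:]

train_classes = Counter(label for _, label in train)
test_classes = Counter(label for _, label in test)
print("train:", dict(train_classes))
print("test: ", dict(test_classes))
```

With this data the training set contains no car insurance examples at all, and the test set contains nothing else.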

The best way to divide test and training data for the above NLC (Natural Language Classifier) problem is as follows:

  1. Iterate over the ground truth — separating out each example into groups by class/concept
  2. Randomly select 80% of each of the groups to become the training data for that group/class
  3. Take the other 20% of each group and use this as the test data for that group/class
  4. Recombine the subgroups into two groups: test and train
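
The four steps above can be sketched in a few lines of Python. This is an illustration rather than a definitive implementation: it assumes the ground truth is a list of (text, label) pairs, and the function name, the fixed seed and the sample data are my own choices.

```python
import random
from collections import defaultdict


def stratified_split(examples, train_fraction=0.8, seed=42):
    """Split (text, label) pairs into train and test sets, class by class."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced

    # Step 1: group the examples by class.
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append((text, label))

    train, test = [], []
    for group in by_class.values():
        # Step 2: randomly pick the training share of this class.
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        # Step 3: the rest of this class becomes test data.
        # Note that very small classes may end up with no test examples.
        test.extend(group[cut:])

    # Step 4: the per-class pieces are already recombined; shuffle to mix classes.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical ground truth for illustration.
examples = (
    [(f"mortgage question {i}", "mortgage") for i in range(40)]
    + [(f"car insurance question {i}", "car_insurance") for i in range(20)]
)
train, test = stratified_split(examples)
print(len(train), "training examples,", len(test), "test examples")
```

Because each class is split separately, both sets keep roughly the same class proportions as the full ground truth.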

With some of the other Watson cognitive APIs (I’m looking at you, Visual Recognition and Retrieve & Rank) you will need to alter this process a little bit. However the key here is making sure that the test data set is a fair representation (and a fair test) of the information in the train dataset.

Testing the model

Once you have your train set and test set, the next bit is easy. Train a classifier with the train set and then write a script that loads in your test set, asks the question (or shows the classifier the image) and then compares the answer that the classifier gives with the answer in the ground truth. If they match, increment a “correct” counter. If they don’t match, too bad! You can then calculate the accuracy of your classifier: it is the number of answers marked as correct as a percentage of the total number of test questions.
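
A minimal version of such a script might look like the sketch below. It assumes the test set is a list of (text, label) pairs and that your trained model is wrapped in a function that takes a text and returns a label. The `keyword_classifier` here is only a stand-in so that the example runs on its own; it is not a real Watson call.

```python
def accuracy(classify, test_set):
    """Return the fraction of test examples where classify(text) equals the ground truth label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = 0
    for text, expected in test_set:
        if classify(text) == expected:
            correct += 1
    return correct / len(test_set)


def keyword_classifier(text):
    """Stand-in classifier for illustration only; swap in a call to your trained model."""
    return "car_insurance" if "car" in text.lower() else "mortgage"


# Hypothetical test set of (question, ground truth label) pairs.
test_set = [
    ("How do I insure my car?", "car_insurance"),
    ("What is my mortgage rate?", "mortgage"),
    ("Can I overpay my mortgage?", "mortgage"),
    ("Is my windscreen covered?", "car_insurance"),
]

print(f"accuracy: {accuracy(keyword_classifier, test_set):.0%}")
```

The stand-in gets three of the four questions right, so the script reports 75%.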

Blind Testing and Performance Reporting


In a typical workflow you may be training, testing, altering your ground truth to try and improve performance and re-training. This is perfectly normal and it often takes some time to tune and tweak a model in order to get optimal performance.

However, in doing this, you may be inadvertently biasing your model towards the test data — which in itself may change how the model performs in the real world. When you are happy with your test performance, you may wish to benchmark against another third dataset — a blind test set that the machine has not been ‘tweaked’ in order to perform better against. This will give you the most accurate view, with respect to the data available, of how well your classifier is performing in the real world.

In the case of three data sets (test, train, blind) you should use a similar algorithm/workflow as described in the above section. The important thing is that the three sets must not overlap in any way and should all be representative of the problem you are trying to train on.

There are a lot of differing opinions on what proportions to separate the data set into. Some folks advocate 50%, 25%, 25% for train, test, blind respectively, others 70%, 20%, 10%. I personally start at the latter and change these around if they don’t work. Your mileage may vary depending on the type of model you are trying to build and the sort of problem you are trying to model.
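
As a rough sketch, a class-by-class split into three non-overlapping sets could look like this. It assumes (text, label) pairs with no duplicate examples, defaults to 70% train, 20% test and 10% blind, and the function name and sample data are hypothetical.

```python
import random
from collections import defaultdict


def three_way_split(examples, fractions=(0.7, 0.2, 0.1), seed=42):
    """Split (text, label) pairs into train, test and blind sets, class by class."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the split can be reproduced

    by_class = defaultdict(list)
    for example in examples:
        by_class[example[1]].append(example)

    train, test, blind = [], [], []
    for group in by_class.values():
        rng.shuffle(group)
        train_end = round(len(group) * fractions[0])
        test_end = train_end + round(len(group) * fractions[1])
        train.extend(group[:train_end])
        test.extend(group[train_end:test_end])
        blind.extend(group[test_end:])  # whatever remains becomes blind data
    return train, test, blind


# Hypothetical ground truth for illustration.
examples = (
    [(f"mortgage question {i}", "mortgage") for i in range(40)]
    + [(f"car insurance question {i}", "car_insurance") for i in range(20)]
)
train, test, blind = three_way_split(examples)
print(len(train), len(test), len(blind))
```

Slicing each shuffled class into three consecutive pieces guarantees that no example lands in more than one set, provided the ground truth itself has no duplicates.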


Important: once you have done your blind test to get an accurate idea of how well your model performs in the real world, you must not do any more tuning on the model. If you do, your metrics will be meaningless since you are now biasing the new model towards the blind data set. You can, of course, start from scratch and randomly initialise a new set of test, train and blind data sets from your ground truth at any time.


Hopefully, this article has given you some ideas about how best to start assessing the quality of your cognitive application. In the next article, I will be covering some more in-depth measurements that you can do on your model to find out where it is performing well and where it needs tuning beyond a simple accuracy rating. We will also discuss some other methods for segmenting test and train data for smaller corpuses.