In this post we will talk about some more metrics you can calculate for your machine learning system, including Precision, Recall, F-measure and confusion matrices. These metrics give you much deeper insight into how your system is performing and hint at how you could improve its performance too!

Cognitive Quality Assurance: A recap — Accuracy calculation

This is the simplest calculation but perhaps the least interesting. We are just looking at the percentage of times the classifier got it right versus the percentage of times it failed. Simply:

  1. Sum up the number of results (count the rows).
  2. Sum up the number of rows where the predicted label and the actual label match.
  3. Calculate percentage accuracy: correct / total * 100.
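
The three steps above can be sketched in a few lines of Python. This is only an illustration: the `results` list and its labels are made up for the example and are not taken from any real system.

```python
def accuracy(results):
    """Return percentage accuracy for a list of (actual, predicted) pairs."""
    total = len(results)  # step 1: count the rows
    # step 2: count the rows where prediction and actual label match
    correct = sum(1 for actual, predicted in results if actual == predicted)
    return correct / total * 100  # step 3: percentage accuracy

# Hypothetical results, invented purely for illustration.
results = [("cat", "cat"), ("dog", "cat"), ("dog", "dog"), ("cat", "cat")]
print(accuracy(results))  # 75.0
```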

This tells you how good the classifier is in general across all classes. It does not help you understand how that result is made up, i.e. which classes the mistakes fall into.

Going above and beyond accuracy: why is it important?

Imagine that you are a hospital and it is critically important to be able to predict different types of cancer and how urgently they should be treated. Your classifier is 73% accurate overall, but that does not tell you anything about its ability to predict any one type of cancer. What if the 27% of the answers it got wrong were the cancers that need urgent treatment? We wouldn’t know!

This is exactly why we need to use measurements like precision, recall and f-measure as well as confusion matrices in order to understand what is really going on inside the classifier and which particular classes (if any) it is really struggling with.

Precision, Recall and F-measure and confusion matrices (Grandma’s Memory Game)

Precision, Recall and F-measure are incredibly useful for getting a deeper understanding of which classes the classifier is struggling with. They can be a little bit tricky to get your head around, so let’s use a metaphor about Grandma’s memory.

Imagine Grandma has 24 grandchildren. As you can imagine, it is particularly difficult to remember their names. Thankfully, her 6 children, the grandchildren’s parents, all had 4 kids and named them after themselves. Her son Steve has 4 sons: Steve I, Steve II, Steve III and so on.

This makes things much easier for Grandma: she now only has to remember 6 names: Brian, Steve, Eliza, Diana, Nick and Reggie. The children do not like being called the wrong name, so it is vitally important that she correctly classifies each child into the right name group when she sees them at the family reunion every Christmas.

I will now describe Precision, Recall, F-Measure and confusion matrices in terms of Grandma’s predicament.

Some Terminology

Before we get on to precision and recall, I need to introduce the concepts of true positive, false positive, true negative and false negative. Every time Grandma gets an answer wrong or right, we can talk about it in terms of these labels and this will also help us get to grips with precision and recall later.

These labels are defined per class: you have TP, FP, FN and TN for each class. In this case we can have TP, FP, FN and TN with respect to Brian, with respect to Steve, with respect to Eliza and so on.

This table shows how these four labels apply to the class “Brian”. You can create a table like this for each class.

                            Actually Brian    Actually not Brian
Grandma says “Brian”        True Positive     False Positive
Grandma says <not Brian>    False Negative    True Negative

  • If Grandma calls a Brian “Brian”, we have a true positive (with respect to the Brian class). The answer is true in both senses: Brian’s name is indeed Brian AND Grandma said Brian. Go Grandma!
  • If Grandma calls a Brian “Steve”, we have a false negative (with respect to the Brian class). Brian’s name is Brian and Grandma said Steve. This is also a false positive with respect to the Steve class.
  • If Grandma calls a Steve “Brian”, we have a false positive (with respect to the Brian class). Steve’s name is Steve, but Grandma wrongly said Brian (i.e. identified him positively as a Brian).
  • If Grandma calls an Eliza “Eliza”, or “Steve”, or “Diana”, or “Nick”, the result is the same: we have a true negative (with respect to the Brian class). Eliza,Eliza would obviously be a true positive with respect to the Eliza class, but because we are only interested in Brian and what is or isn’t Brian at this point, we are not measuring this.

When you are recording results, it is helpful to store them in terms of each of these labels where applicable. For example:

Steve,Steve (TP Steve, TN everything else)
Brian,Steve (FN Brian, FP Steve)
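
As a rough sketch of how this labelling could be automated, the function below takes one recorded result in the same `actual,predicted` order used above and labels it with respect to a single class. The function name `outcome` and the shortened list of names are assumptions made for the example.

```python
def outcome(actual, predicted, cls):
    """Label one (actual, predicted) pair with respect to a single class."""
    if actual == cls and predicted == cls:
        return "TP"  # it is a Brian and Grandma said Brian
    if actual == cls:
        return "FN"  # it is a Brian but Grandma said something else
    if predicted == cls:
        return "FP"  # it is not a Brian but Grandma said Brian
    return "TN"      # not a Brian, and Grandma did not say Brian

names = ["Brian", "Steve", "Eliza"]  # shortened list, for illustration only
for actual, predicted in [("Steve", "Steve"), ("Brian", "Steve")]:
    labels = {name: outcome(actual, predicted, name) for name in names}
    print(actual, predicted, labels)
```

Every pair gets exactly one label per class, which is why a single mistake shows up as a false negative for one name and a false positive for another.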

Precision and Recall

Grandma is in the kitchen, pouring herself a Christmas sherry, when 3 Brians and 2 Steves come in to top up their eggnogs.

Grandma correctly classifies 2 Brians but slips up and calls one of them Eliza. She only gets 1 of the Steves right and calls the other one Brian.

In terms of TP, FP, TN and FN we can say the following (true negative is the least interesting for us):

         TP   FP   FN
Brian     2    1    1
Eliza     0    1    0
Steve     1    0    1

  • She has correctly identified 2 people who are truly called Brian as Brian (TP)
  • She has falsely named someone Eliza when their name is not Eliza (FP)
  • She has falsely named someone whose name is truly Steve something else (FN)
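
The counts for the kitchen scene can be reproduced with a small tally, sketched below. The `tally` helper and the `kitchen` list are illustrative names; the five pairs simply encode the scene as (actual name, name Grandma said).

```python
from collections import Counter

def tally(pairs):
    """Count TP, FP and FN per class from (actual, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in pairs:
        if actual == predicted:
            tp[actual] += 1
        else:
            fn[actual] += 1     # the real name was missed
            fp[predicted] += 1  # the guessed name was used wrongly
    return tp, fp, fn

# The kitchen scene: 3 Brians and 2 Steves.
kitchen = [
    ("Brian", "Brian"), ("Brian", "Brian"), ("Brian", "Eliza"),
    ("Steve", "Steve"), ("Steve", "Brian"),
]
tp, fp, fn = tally(kitchen)
for name in ("Brian", "Eliza", "Steve"):
    print(name, tp[name], fp[name], fn[name])
# Brian 2 1 1
# Eliza 0 1 0
# Steve 1 0 1
```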

True Positive, False Positive, True Negative and False Negative are crucial to understand before you look at precision and recall, so make sure you have fully understood this section before you move on.


Precision, like our TP/FP labels, is expressed in terms of each class or name. It is the number of true positive guesses for a name divided by the total number of times that name was guessed (true positives + false positives).

Put another way, precision is how many times Grandma correctly guessed Brian versus how many times she called other people (like Steve) Brian.

For Grandma to be precise, she needs to be very good at correctly guessing Brians and also never call anyone else (Elizas and Steves) Brian.

Important: If Grandma came to the conclusion that 70% of her grandchildren were named Brian and decided to just randomly say “Brian” most of the time, she could still achieve a high overall accuracy. However, her precision with respect to Brian would be poor because of all the Steves and Elizas she was mislabelling. This is why precision is important.

         TP   FP   FN   Precision
Brian     2    1    1   66.7%
Eliza     0    1    0   0%
Steve     1    0    1   100%

The results from this case are displayed above. As you can see, Grandma uses Brian to incorrectly label a Steve, so precision for Brian is only 66.7% (2 / 3). Despite only getting one of the Steves correct, Grandma has 100% precision for Steve simply by never using the name incorrectly. Precision for Eliza is 0% because there were no true positive guesses for that name (0 / 1 is zero).
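
A minimal sketch of the precision calculation, using the TP and FP counts from the kitchen scene. The function name and the choice to return `None` for a name that was never guessed are assumptions for the example. Note that under this definition Eliza comes out as 0.0, since 0 / 1 is well defined; only a name Grandma never said at all would give a zero denominator.

```python
def precision(tp, fp):
    """TP / (TP + FP), or None when the name was never guessed at all."""
    if tp + fp == 0:
        return None
    return tp / (tp + fp)

# (TP, FP) per name from the kitchen scene.
counts = {"Brian": (2, 1), "Eliza": (0, 1), "Steve": (1, 0)}
for name, (tp, fp) in counts.items():
    print(name, precision(tp, fp))
```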

So what about false negatives? Surely it’s important to note how often Grandma is inaccurately calling Brian by other names? We’ll look at that now…


Continuing the theme, Recall is also expressed in terms of each class. It is the number of true positive guesses for a name divided by the number of people who really have that name (true positives + false negatives).

Another way to look at it: given a population of Brians, how many does Grandma correctly identify and how many does she give another name (i.e. Eliza or Steve)?

This tells us how “confusing” Brian is as a class. If Recall is high then it’s likely that Brians all have a very distinctive feature that distinguishes them as Brians (maybe they all have the same nose). If Recall is low, maybe Brians are very varied in appearance and perhaps look a lot like Elizas or Steves (this presents a problem of its own, check out confusion matrices below for more on this).

         TP   FP   FN   Recall
Brian     2    1    1   66.7%
Eliza     0    1    0   N/A
Steve     1    0    1   50%

You can see that recall for Brian is the same as his precision: of the 3 real Brians, Grandma only guessed incorrectly for one. Recall for Steve is 50% because Grandma guessed correctly for 1 Steve and incorrectly for the other. Recall for Eliza can’t be calculated because there were no real Elizas in the kitchen, so we end up trying to divide zero by zero.
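
Recall can be sketched the same way, this time from the TP and FN counts. Returning `None` when there are no real examples of a name is an assumption made for this example; it stands in for the "N/A" case where we would otherwise divide zero by zero.

```python
def recall(tp, fn):
    """TP / (TP + FN), or None when there are no real examples of the name."""
    if tp + fn == 0:
        return None  # e.g. no real Elizas were in the kitchen
    return tp / (tp + fn)

# (TP, FN) per name from the kitchen scene.
counts = {"Brian": (2, 1), "Eliza": (0, 0), "Steve": (1, 1)}
for name, (tp, fn) in counts.items():
    print(name, recall(tp, fn))
```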


F-measure is effectively a measurement of how accurate the classifier is per class once you factor in both precision and recall. This gives you a holistic view of your classifier’s performance on a particular class.

In terms of Grandma, F-measure gives us an aggregate metric of how good Grandma is at dealing with Brians in terms of both precision AND recall.

It is very simple to calculate if you already have precision and recall:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

Here are the F-measure results for Brian, Steve and Eliza from above.

         TP   FP   FN   Precision   Recall   F-measure
Brian     2    1    1   66.7%       66.7%    66.7%
Eliza     0    1    0   0%          N/A      N/A
Steve     1    0    1   100%        50%      66.7%

As you can see, the F-measure is a kind of average (the harmonic mean) of the two values. It can give you a good overview of both precision and recall, and it is dragged down sharply when either of the contributing measurements is poor.
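
The formula above translates directly into code. In this sketch the function name and the `None` handling for undefined inputs are assumptions; the last line uses made-up values to show how one poor measurement pulls the harmonic mean far below the plain average.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall; None if it cannot be computed."""
    if p is None or r is None or p + r == 0:
        return None
    return 2 * p * r / (p + r)

print(f_measure(1.0, 0.5))      # Steve: about 0.667
print(f_measure(2 / 3, 2 / 3))  # Brian: about 0.667
print(f_measure(0.0, None))     # recall undefined, so None

# Made-up values: the plain average would be 0.55, the F-measure is about 0.18.
print(f_measure(1.0, 0.1))
```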

Confusion Matrices

When a class has a particularly low Recall or Precision, the next question should be why? Often you can improve a classifier’s performance by modifying the data or (if you have control of the classifier) which features you are training on.

For example, what if we find out that Brians look a lot like Elizas? We could add a new feature (Grandma could start using their voice pitch to determine their gender and their gender to inform her name choice) or we could update the data (maybe we could make all Brians wear a blue jumper and all Elizas wear a green jumper).

Before we go down that road, we need to understand where there is confusion between classes and where Grandma is doing well. This is where a confusion matrix helps.

A confusion matrix allows us to see which classes are being correctly predicted and which classes Grandma is struggling to predict and getting most confused about. Crucially, it also gives us insight into which classes Grandma is confusing with each other, as above. Here is an example of a confusion matrix for Grandma’s family.

Actual \ Predicted   Steve   Brian   Eliza   Diana   Nick   Reggie
Steve                    4       1       0       1      0        0
Brian                    1       3       0       0      1        1
Eliza                    0       0       5       1      0        0
Diana                    0       0       5       1      0        0
Nick                     1       0       0       0      5        0
Reggie                   0       0       0       0      0        6

OK, so let’s have a closer look at the above.

Reading across the rows from left to right gives you the actual examples of each class. In this case there are 6 children with each name, so if you sum over each row you will find that it adds up to 6.

Reading down the columns from top to bottom gives you the predictions, i.e. what Grandma thought each child’s name was. These columns may add up to more or less than 6 because Grandma may over-use one particular name. In this case she seems to think that all her female grandchildren are called Eliza (she predicted that 5/6 Elizas are called Eliza and that 5/6 Dianas are also called Eliza).

Reading along the diagonal (top left to bottom right) gives you the number of correctly predicted examples. In this case Reggie was predicted 100% accurately, with 6/6 children called “Reggie” actually being predicted as “Reggie”. Diana is the poorest performer with only 1/6 children being correctly identified. This can be explained as above, with Grandma over-generalising and calling all female relatives “Eliza”.
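
The three ways of reading the matrix can be checked with a short sketch. The numbers are copied from the confusion matrix above, with rows as actual names and columns as Grandma’s predictions; the variable names are just for illustration.

```python
names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]
matrix = [  # rows: actual name, columns: predicted name
    [4, 1, 0, 1, 0, 0],
    [1, 3, 0, 0, 1, 1],
    [0, 0, 5, 1, 0, 0],
    [0, 0, 5, 1, 0, 0],
    [1, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 6],
]

row_totals = [sum(row) for row in matrix]                # actual children per name
col_totals = [sum(col) for col in zip(*matrix)]          # times each name was said
diagonal = [matrix[i][i] for i in range(len(names))]     # correct predictions

print(row_totals)  # [6, 6, 6, 6, 6, 6]
print(col_totals)  # [6, 4, 10, 3, 6, 7]
print(diagonal)    # [4, 3, 5, 1, 5, 6]
```

The column total of 10 for Eliza is the over-use of that name showing up in the numbers: Grandma said “Eliza” ten times even though there are only six Elizas.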

Steve sings for a Rush tribute band — his Geddy Lee is impeccable.

Grandma seems to have gender nailed except in the case of one of the Steves (who in fairness does have a ponytail and can sing very high). She is best at predicting Reggies and struggles with Brians (perhaps Brians have the most diverse appearance and look a lot like their respective male cousins). She is also pretty good at Nicks and Steves.

Grandma is terrible at her female grandchildren’s names. If this were a machine learning problem we would need to find a way to make it easier to identify the difference between Dianas and Elizas through some kind of further feature extraction or weighting, or through the gathering of additional training data.
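
One way to find the worst confusions automatically is to list the off-diagonal cells of the matrix, largest first. This is a sketch only; `confused_pairs` is a hypothetical helper, and the matrix is the one from the example above (rows are actual names, columns are predictions).

```python
def confused_pairs(names, matrix):
    """List (count, actual, predicted) for non-zero off-diagonal cells, largest first."""
    pairs = [
        (matrix[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(len(names))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)

names = ["Steve", "Brian", "Eliza", "Diana", "Nick", "Reggie"]
matrix = [
    [4, 1, 0, 1, 0, 0],
    [1, 3, 0, 0, 1, 1],
    [0, 0, 5, 1, 0, 0],
    [0, 0, 5, 1, 0, 0],
    [1, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 6],
]

worst = confused_pairs(names, matrix)[0]
print(worst)  # (5, 'Diana', 'Eliza'): five Dianas were called Eliza
```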


Machine learning is definitely no walk in the park. There are a lot of intricacies involved in assessing the effectiveness of a classifier. Accuracy is a great start if until now you’ve been praying to the gods and carrying four-leaf-clovers around with you to improve your cognitive system performance.

However, Precision, Recall, F-Measure and Confusion Matrices really give you the insight you need into which classes your system is struggling with and which classes confuse it the most.

A Note for Document Retrieval (Watson Retrieve & Rank) Users

This example is probably most directly relevant to those building classification systems (e.g. extracting intent from questions or detecting whether an image contains a particular company’s logo). However, all of this works for document retrieval use cases too. Consider a true positive to be when the first document returned from the query is the correct answer, and a false negative to be when the first document returned is the wrong answer.

There are also variants on this that consider the top N retrieved answers (Precision@N). These tell you whether your system can return the correct answer in the top 1, 3, 5 or 10 answers, simply by counting a “True Positive” whenever the correct document turns up in the top N answers returned by the query.
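
The top-N idea described above can be sketched as follows. This does not use any Watson API; the function names, document ids and queries are all hypothetical, and each query is simply a ranked list of returned ids plus the id of the correct answer.

```python
def hit_at_n(ranked_doc_ids, correct_doc_id, n):
    """True if the correct document appears in the top n results."""
    return correct_doc_id in ranked_doc_ids[:n]

def hit_rate_at_n(queries, n):
    """Fraction of queries whose correct document appears in the top n."""
    hits = sum(1 for ranked, correct in queries if hit_at_n(ranked, correct, n))
    return hits / len(queries)

# Hypothetical queries: (ranked ids returned by the system, correct id).
queries = [
    (["d3", "d1", "d7"], "d1"),
    (["d2", "d9", "d4"], "d2"),
    (["d5", "d6", "d8"], "d0"),
]
print(hit_rate_at_n(queries, 1))  # only the second query hits at rank 1
print(hit_rate_at_n(queries, 3))  # the first two queries hit within the top 3
```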


Overall I hope this tutorial has helped you to understand the ins and outs of machine learning evaluation.

Next time we will look at cross-validation techniques and how to assess small corpora where carving out a 30% chunk of the documents would seriously impact the learning. Stay tuned for more!