Our client produces items frequently sold in convenience stores and supermarkets. They wanted to know which individual items sold best and how many items were out of stock.

Currently, people are sent to the stores to report on the types and availability of the items, which creates significant cost. To reduce those costs, our client wanted to know whether it was possible to automatically identify the individual item categories and the empty shelf spaces from camera images.

So the challenge was to:

  • identify where in the image the shelf with the products is
  • detect the items
  • classify them into different categories (more than 40)
  • detect empty slots

Several thousand images taken with smartphones in local shops were supplied. They varied in quality, perspective, illumination and distance from the shelving units. The images came without any human labels or information about which items could be seen in them, where those items were located, or where there were empty shelves.


As in the airport baggage computer vision study, the first part – locating the shelves and items within the image – was tackled with a human-in-the-loop approach and transfer learning based on RetinaNet, a state-of-the-art neural network for object detection.

Detecting and classifying items and empty spaces in images was well suited to an object detection approach. However, with more than 40 different types of item, even with our human-in-the-loop approach it would have taken a very long time to manually label an initial dataset.

Our initial idea was to first detect the shelf, crop it out and then use template matching or feature matching to confirm which items were in the image. Due to variations in perspective, illumination and image quality, as well as the fact that many items looked similar even to the human eye when photographed, we had to dismiss this more straightforward approach.
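To illustrate the idea we dismissed: template matching slides a reference patch over the image and scores each position by normalized cross-correlation. The following is a minimal pure-NumPy sketch (function name and loop-based search are our own; a real system would use an optimised routine such as OpenCV's `matchTemplate`), and it shows why the approach is brittle: the score drops quickly under the perspective and illumination changes we saw in the shop photos.

```python
import numpy as np

def match_template(image, template):
    """Slide the template over a grayscale image and return the (row, col)
    of the best normalized cross-correlation score, plus that score."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            w = window - window.mean()
            denom = np.sqrt((w ** 2).sum()) * t_norm
            if denom == 0:
                continue  # flat window, correlation undefined
            score = (w * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

A perfect copy of the template scores 1.0 at its true position; any distortion of the item's appearance lowers the peak, which is exactly what made this approach unreliable on our data.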

We decided to revisit the object detection approach and, rather than manually annotating images, circumvented the manual training-data issue by using image synthesis.

We cut out samples for each item category until we had a small collection of templates that reflected the intra-category variance in appearance (due to differences in the positioning of the item, for example). We then injected images of the particular items we wanted to detect into images of empty shelves. Since we knew which item was automatically injected into each image and where it was placed, we got the annotations for free. We also applied Gaussian noise, compression artefacts, perspective distortions and illumination changes to the synthesised images to make the data more varied and our detection algorithm more robust.
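The synthesis step can be sketched roughly as follows (the function name, shapes and noise level are illustrative assumptions, and only the Gaussian-noise augmentation is shown; the real pipeline also applied compression artefacts, perspective and illumination changes):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesise(shelf, template, top, left, noise_sigma=5.0):
    """Paste an item template into an empty-shelf image at (top, left).
    Returns the composite image and its bounding-box annotation, which
    comes for free because we chose the paste position ourselves."""
    img = shelf.astype(np.float64).copy()
    th, tw = template.shape
    img[top:top + th, left:left + tw] = template
    # Augmentation: additive Gaussian noise to make training data more varied.
    img += rng.normal(0.0, noise_sigma, img.shape)
    box = (left, top, left + tw, top + th)  # x_min, y_min, x_max, y_max
    return np.clip(img, 0, 255).astype(np.uint8), box
```

Each call yields one labelled training sample; looping over templates, positions and augmentation parameters scales this to an arbitrarily large annotated dataset without any manual labelling.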

We trained another RetinaNet on this synthetic data but tested it on the real data. Our models were able to detect 95% of the items and to classify 98% of them correctly.
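Figures like these are typically computed by matching predicted boxes to ground truth via intersection-over-union (IoU). A small illustrative sketch, assuming a plain greedy matching and a 0.5 IoU threshold (helper names and details are our assumptions, not the project code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def evaluate(ground_truth, predictions, thresh=0.5):
    """Return (detection rate, classification accuracy among detections).
    Each entry is a (box, class_label) pair."""
    detected = classified = 0
    for g_box, g_cls in ground_truth:
        best = max(predictions, key=lambda p: iou(g_box, p[0]), default=None)
        if best is not None and iou(g_box, best[0]) >= thresh:
            detected += 1
            if best[1] == g_cls:
                classified += 1
    det_rate = detected / len(ground_truth)
    cls_rate = classified / detected if detected else 0.0
    return det_rate, cls_rate
```

Separating the two rates matters: an item can be found in the image (detection) yet assigned the wrong category (classification), which is why the 95% and 98% figures are reported independently.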


The outcome of the project was code written in Python that takes images as input and returns the location and class of items across more than 40 categories, as well as empty shelf positions. We also provided a report documenting our approach, the challenges we faced and recommendations for further development (e.g. how to make the model class-agnostic).
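Conceptually, the delivered output can be pictured as a schema like the following (the names here are entirely hypothetical and do not reflect the actual project API), where empty slots are just one more detectable class alongside the item categories:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Detection:
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str   # one of the 40+ item categories, or "empty" for a free slot
    score: float

def summarise(detections: List[Detection]) -> Dict[str, int]:
    """Aggregate per-image detections into per-category counts,
    the kind of stock report the in-store visits used to produce."""
    counts: Dict[str, int] = {}
    for d in detections:
        counts[d.label] = counts.get(d.label, 0) + 1
    return counts
```

Treating "empty" as a class means one model pass answers both client questions at once: which items are on the shelf, and how many slots need restocking.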

The project was carried out over 4 weeks.