My impressions of the British Computer Vision Summer School

Last month, the British Machine Vision Association organised the Computer Vision Summer School at the University of East Anglia, Norwich. This is an annual 4-day workshop where young computer vision practitioners can listen to leading UK academic experts on the various research aspects of the field. The talks ranged from introductory courses (such as colour and low-level vision) to the latest research trends, this year including active vision, probabilistic generative models and, as it always has to be, deep learning. Here are my top picks.

Trend #1: An attempt to know what we don’t know

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns — the ones we don’t know we don’t know. — Controversial claim by former US Secretary D. Rumsfeld

Computer Vision models can nowadays be trained to categorise images into over 1000 different classes with an impressive error rate of less than 4%, which is deemed better than human performance on the task. However, what happens if we want to generate new images? How can we be sure that we are learning all the features relevant to each class? If we are given a dataset that won’t contain all possible scenarios in the world (think self-driving cars), how can we propagate our prior knowledge through our model?

The answer to most of these questions entails introducing a principled probabilistic analysis, as described by Prof Neill Campbell (University of Bath) during his talk at the BMVA summit (think learning probability distributions of data, given weights, rather than “hard” individual scores). By learning probability distributions of the possible scenarios, one can (i) impose prior knowledge on the system, specified by the user, (ii) make the confidences output by the model more palatable and robust to data contingency (knowing what we don’t know, huh?) and (iii) generate new data. For instance, using suitable priors for the system, one can learn a manifold of likely images that transition smoothly into one another, while assigning a correspondingly low probability to unlikely ones (check out this interactive demo on how to generate new fonts by learning a manifold of “cool” fonts!).

Learning a manifold of cool fonts (image kindly provided by Prof Campbell).

Probabilistic generative models can also go deep (see what I did there). Prof Campbell mentioned the principles of deep stochastic belief networks that are able to propagate not only weights but also uncertainty across a neural network. Bayesian deep learning is indeed a very hot trend that combines the best of deep learning with salient features of generative processes such as smoothness priors, the ability to train on very small datasets and, perhaps more importantly, interpretable confidences that help us assess when we are far from the training set. A whole workshop devoted to the topic will take place at NIPS 2018.
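To make the idea of "interpretable confidences" a little more concrete, here is a minimal sketch that is not taken from the talk: Bayesian linear regression on a toy 1D dataset, using the standard closed-form Gaussian posterior. The data, the prior precision `alpha`, the noise precision `beta` and the function names are all assumptions chosen for illustration. The point is only that the predictive variance grows as we query inputs far from the training set.

```python
import random


def fit_bayesian_line(xs, ys, alpha=1.0, beta=25.0):
    """Gaussian posterior over the weights (w0, w1) of y = w0 + w1 * x.

    alpha: assumed precision of the zero-mean Gaussian prior on the weights.
    beta:  assumed precision (1 / variance) of the observation noise.
    Returns the posterior mean (m0, m1) and the 2x2 posterior covariance.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = alpha * I + beta * Phi^T Phi, with feature rows (1, x).
    a00 = alpha + beta * n
    a01 = beta * sx
    a11 = alpha + beta * sxx
    det = a00 * a11 - a01 * a01
    # Posterior covariance S = A^{-1}, written out for the 2x2 case.
    s00, s01, s11 = a11 / det, -a01 / det, a00 / det
    # Posterior mean m = beta * S * Phi^T y.
    m0 = beta * (s00 * sy + s01 * sxy)
    m1 = beta * (s01 * sy + s11 * sxy)
    return (m0, m1), ((s00, s01), (s01, s11))


def predict(x, mean, cov, beta=25.0):
    """Predictive mean and variance at input x."""
    (m0, m1), ((s00, s01), (_, s11)) = mean, cov
    mu = m0 + m1 * x
    # Noise variance plus phi^T S phi for the feature vector phi = (1, x).
    var = 1.0 / beta + s00 + 2.0 * s01 * x + s11 * x * x
    return mu, var


# Toy training set (made up): 20 noisy points on y = 0.5 + 2x, with x in [-1, 1].
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [0.5 + 2.0 * x + rng.gauss(0.0, 0.2) for x in xs]

mean, cov = fit_bayesian_line(xs, ys)
mu_near, var_near = predict(0.0, mean, cov)   # inside the training range
mu_far, var_far = predict(10.0, mean, cov)    # far outside the training range

print("near the data: mean", mu_near, "variance", var_near)
print("far from data: mean", mu_far, "variance", var_far)
```

A plain least-squares fit would return a single number at `x = 10` with no warning attached. Here the larger predictive variance signals that the model is extrapolating. Bayesian deep learning aims to provide the same kind of signal for much richer models.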

Trend #2: From 2D to 3D (and 4D, and counting…)

Models in two dimensions are relatively well studied and, as mentioned, for some of the tasks (such as image classification), we can achieve reasonable performance with off-the-shelf tools. However, there is much we need to do when it comes to understanding scenes in three dimensions, sequences of actions and, effectively, reacting to the environment. This was the subject of two of the lectures at the BMVA summit (Active Vision, by Dr Nicola Bellotto, University of Lincoln, and 3D Computer Vision, by Will Smith, University of York).

Perhaps the most surprising task where Computer Vision still does not provide a satisfactory solution is grasping. This is a problem where Active Vision has to interact with object recognition, as well as depth estimation (and ultimately 3D shape), choice of stable points for a variety of shapes, textures, weights, and so on. Grasping remains a challenge and a very active area of research, especially when it comes to building versatile five-fingered robots. As one might correctly guess, the prospective applications are numerous, such as automated agriculture and healthcare.

Check out this 2016 (!) video of a robot trying to pick various objects.

…and one caveat

In 2014, a group of researchers showed how to trick neural models by adding tiny perturbations, imperceptible to the human eye, to the images. Shockingly, [Szegedy et al] exhibited a method that forces state-of-the-art image neural networks to misclassify slightly perturbed copies of images, including images from the training data. This contradicted a folkloric claim that such models have a smooth boundary across classes and raised a big red flag about our understanding of neural networks.

Two images whose difference has negligible norm may still be assigned to different classes.
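Here is a toy illustration of why tiny per-pixel changes can matter. It makes several simplifying assumptions. The classifier is a made-up linear model with random weights rather than a deep network. The perturbation uses a gradient-sign step, a simpler technique from later work by Goodfellow and colleagues, and not the box-constrained optimisation used in [Szegedy et al]. Pixel values are also not clipped back to a valid range.

```python
import random

rng = random.Random(0)
dim = 1000  # hypothetical flattened "image" with 1000 pixels

# Toy linear classifier: predict class 1 when w . x + b > 0, else class 0.
w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
b = 0.0


def score(x):
    """Linear score w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


# A random input with pixel values in [0, 1].
x = [rng.uniform(0.0, 1.0) for _ in range(dim)]
s = score(x)
label = 1 if s > 0 else 0

# For a linear model the gradient of the score with respect to x is w itself.
# Shifting every pixel by eps along -sign(w) (or +sign(w)) changes the score
# by eps * ||w||_1, so the smallest per-pixel step that crosses the decision
# boundary is |s| / ||w||_1. A 1% margin is added on top of that.
l1_norm = sum(abs(wi) for wi in w)
eps = 1.01 * abs(s) / l1_norm
direction = -1 if label == 1 else 1
x_adv = [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

adv_label = 1 if score(x_adv) > 0 else 0
max_change = max(abs(a - c) for a, c in zip(x_adv, x))

print("original label:", label, "adversarial label:", adv_label)
print("largest change to any single pixel:", max_change)
```

Because the effect of the per-pixel step is multiplied by the L1 norm of the weights, which grows with the number of pixels, the step needed to flip the label shrinks as the input dimension grows. Deep networks are not linear, so this is only a sketch of the intuition and not an explanation of the original result.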

Four years have passed, and our understanding of the behaviour of such networks has only marginally improved, although we now know that it is possible to fool them universally, and we can generate such examples adversarially as a training strategy. However, incidents involving deep learning systems only point to a growing need for better priors in machine learning models. Perhaps a combination of those with probabilistic generative models (see Trend #1 above) will prove handy in applications where stronger guarantees are needed.

Bonus: During the very last talk of the summit, Prof Adrian Clark (University of Essex) discussed the need for rigorous assessment of Computer Vision and, in fact, of any Machine Learning system. He mentioned the pitfalls of comparing algorithms with one single metric, such as the error rate/accuracy, on one single dataset, as happens in huge competitions such as ImageNet, without any second-order analysis (RIP error bars). Building on that observation, he introduced a couple of methods that should be employed to improve rigour when comparing vision models.

Indeed, Filament AI and Prof Clark are part of a joint InnovateUK project to incorporate this and other insights into Computer Vision software tools.
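To show what bringing back the error bars can look like in practice, here is a minimal sketch of one common approach: a paired bootstrap interval for the difference in accuracy between two models evaluated on the same test images. This is not necessarily one of the methods presented in the talk. The per-image results below are invented, and the function name and parameters are hypothetical.

```python
import random


def bootstrap_accuracy_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """95% percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a and correct_b are 0/1 lists over the same test images. Each
    resample draws image indices with replacement and uses the same indices
    for both models, so the pairing between them is preserved.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Invented results on 500 test images:
#   430 images both models get right, 25 only A gets right,
#   20 only B gets right, 25 both get wrong.
correct_a = [1] * 430 + [1] * 25 + [0] * 20 + [0] * 25
correct_b = [1] * 430 + [0] * 25 + [1] * 20 + [0] * 25

acc_a = sum(correct_a) / len(correct_a)  # 455 / 500
acc_b = sum(correct_b) / len(correct_b)  # 450 / 500
lower, upper = bootstrap_accuracy_diff(correct_a, correct_b)

print("accuracy A:", acc_a, "accuracy B:", acc_b)
print("95% interval for the difference:", lower, upper)
```

On a leaderboard, model A would simply rank above model B. With the interval attached, a reader can check whether the difference is distinguishable from zero on a test set of this size before drawing conclusions.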

And to the winners…

Last but not least (fine, just last…), the main social activity of the program was a typical northeastern pub night out, with a rather (a)typical pub quiz, the likes of which you have never seen before. Who would have guessed that Computer Vision folks are so competitive when it comes to a pub quiz?

An unusual pub quiz

Further reading (references)

[Szegedy et al] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations, 2014.

[Campbell and Kautz] Neill D. F. Campbell and Jan Kautz. Learning a Manifold of Fonts. ACM Transactions on Graphics (SIGGRAPH) 33(4), 2014.