ML One
Lecture 10
A walking tour of AI developments in computer vision
+
example applications: hand pose detection, barcode detection, image foreground instance segmentation
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
The STORY:
- The evolution of AI in computer vision
- Computer vision tasks, datasets and models
The practical:
- Core ML models walkthrough
- Example applications in hand pose detection, barcode detection, image foreground instance segmentation
First of all, don't forget to confirm your attendance on the Seats App!
As usual, a fun AI model, Pix2Pix, to wake us up
Assessment Info
- Multiple-choice test: a summary sheet and mock exam will be handed out on 11th Jan
Assessment Info
- Presentation: pick an AI model and present it (what are its input/output? what can it do? plus a proposed application plan; no training/implementation is required)
- There will be a summary sheet (with tips and a model zoo) handed out next week (7th December)
Recap
🍰Convolutional neural network (CNN) is better at image tasks than MLP: efficient and effective.
🤘CNN is characterised by two new types of layers: conv layer and pooling layer
A conv layer has a set of filters that are convolved with the previous layer's activations. 🔎
Convolution is actually quite easy (see the sketch after this list):
- "filter" in the image is the weights matrix
- element-wise multiplication between input and filter matrices
- sum up the element-wise products to one number, similar to dot product
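To make the arithmetic concrete, here is a minimal Swift sketch of one convolution step. The 3x3 patch and filter values are made up for illustration (the filter shown happens to detect vertical edges):

```swift
// One convolution step: apply a 3x3 filter to a 3x3 input patch.
// Hypothetical values, chosen only to show the arithmetic.
let patch: [[Double]] = [[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9]]
let filter: [[Double]] = [[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]]

// Element-wise multiplication, then sum everything into one number,
// similar to a dot product between the flattened matrices.
var activation = 0.0
for row in 0..<3 {
    for col in 0..<3 {
        activation += patch[row][col] * filter[row][col]
    }
}
print(activation) // -6.0: one number in the output feature map
```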
A pooling layer takes the avg/max value within each sliding window on the previous layer's activations. ◰
Max pooling is just another simple math operation; it halves each dimension in this example (see the sketch below)
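Here is a matching minimal sketch of 2x2 max pooling in Swift, assuming a hypothetical 4x4 input: each output cell keeps only the max of one non-overlapping 2x2 window, so each dimension is halved (4x4 -> 2x2):

```swift
// 2x2 max pooling with stride 2 over a hypothetical 4x4 input.
let input: [[Double]] = [[1, 3, 2, 4],
                         [5, 6, 1, 0],
                         [2, 1, 9, 8],
                         [0, 3, 7, 6]]
var pooled = [[Double]](repeating: [Double](repeating: 0, count: 2), count: 2)
for row in 0..<2 {
    for col in 0..<2 {
        // Keep only the max value within this 2x2 window.
        let window = [input[2 * row][2 * col],     input[2 * row][2 * col + 1],
                      input[2 * row + 1][2 * col], input[2 * row + 1][2 * col + 1]]
        pooled[row][col] = window.max()!
    }
}
print(pooled) // [[6.0, 4.0], [3.0, 9.0]] - each dimension halved
```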
A conv block is a stack of one or more conv layers followed by a pooling layer. 🍔
A typical CNN: input layer -> conv blocks -> fc layer -> output layer 🍔🍔
A state-of-the-art AI model: PoseNet.
- Maybe intimidating at first look
- But looking closely, you will find it is just many good old layers stacked together.
Intuition on how a CNN works: it hierarchically extracts features from an image, from simple edges to complex patterns
That's quite a lot, congrats! 🎉
end of recap 👋
Now that we are equipped with the basics (CNN), let's take a bird's-eye view of computer vision!
What is computer vision?
Computer vision (CV) is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs.
Meaningful information from digital images:
- What is it about?
- What are the objects in it?
- Where are the objects in it?
etc.
🤠Here are several main CV tasks:
- Image classification
- Object detection
- Image segmentation
- Scene reconstruction
etc.
🏋️While they all take digital images as input, we can differentiate these CV tasks by their distinct output types:
- Image classification: outputs a categorical vector
- Object detection: outputs bounding boxes
- Image segmentation: outputs pixel masks (providing finer object localisation than bounding boxes)
- Scene reconstruction: outputs 3D representations
Let's look at some of the Apple models and their corresponding tasks; this page should now feel much more familiar
The evolution of AI models in computer vision (story time):
Once upon a time...
The story started with two scientists and a cat🐈 in the 1950s-1960s.
Hubel and Wiesel Cat Experiment:
They discovered that the early layers of the cat's visual cortex responded to simple shapes, like hard edges or lines.
- This meant that image processing starts with simple shapes like straight edges
(here is a demo of the experiment.)
In the 1960s, AI emerged as an academic field of study, and it also marked the beginning of the AI quest to solve the human vision problem.
1974 saw the introduction of optical character recognition (OCR) technology, which could recognize text printed in any font or typeface.
- (check out Optacon)
In 1982, neuroscientist David Marr established that vision works hierarchically and introduced algorithms for machines to detect edges, corners, curves and similar basic shapes.
Concurrently, computer scientist Kunihiko Fukushima developed a network of cells that could recognize patterns. The network, called the Neocognitron, included convolutional layers in a neural network.
- Neocognitron was applied to Japanese handwritten character recognition and other pattern recognition tasks.
Back to the year 1998
The "hello world" dataset of machine learning - MNIST, handwritten digit images with labels
LeNet: one of the earliest convolutional neural networks, with conv and pooling layers; the prototype of everything that follows
But what next? Where should computer vision go? In the early 2000s, computer vision tasks included face recognition, matching satellite images, image stitching, 3D scene reconstruction ...
Finding the focal point, the right level of difficulty and task abstraction: object recognition
2005 PASCAL Visual Object Classes Challenge
It has an annotated dataset
number of training images: ~1,500
image dimension: RGB, roughly 450x280
four classes: motorbikes, bicycles, people, and cars
type of tasks: let's have a look!
🍜 noodle time!
How would you programme the computer to differentiate a car from a bike? (a car-or-bike binary classifier)
Rule-based explicit programming is hard...
Type of tasks:
classification: outputs categorical vector
object bounding box detection: outputs bounding boxes
object segmentation: outputs pixel mask
These are the milestone tasks for computer vision till today.
While "gathering data" nowadays is just an everyday word, back at 2005 there was not so much mindset about the "data" and the impact of just the scale of data!
Let's move on to the ImageNet era 👽
The visionary ImageNet dataset was introduced in 2009, and its challenge (ILSVRC) ran from 2010.
Let's take a look at its visionary scale:
number of training images: 1,281,167
image dimension: RGB 469x387 on average
1000 classes: based on WordNet
tasks: image classification and object detection
and introducing another scientist behind recent computer vision developments: Fei-Fei Li
another visionary design of ImageNet: 1000 classes from WordNet
WordNet consists of many English-language terms organized into an ontological structure
WordNet example: a lexical database that is ontologically structured
It bridges computer vision with cognitive science, and the 1000-class classification task was far beyond the capability of contemporary models
ImageNet challenge winners from 2010 to 2017
Explanation of "top-5" error on the whiteboard (a small code sketch follows the points below)
- 1. It was really bad in 2010 and 2011
- 2. Everything changed from 2012, from AlexNet onwards
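For reference, here is a minimal Swift sketch of the top-5 check (the scores are hypothetical, not any real model's output): a prediction counts as top-5 correct if the true label appears among the 5 highest-scoring classes, and the top-5 error is the fraction of test images where it does not.

```swift
// Return true if the true label is among the 5 highest-scoring classes.
func isTop5Correct(scores: [Double], trueLabel: Int) -> Bool {
    let top5 = scores.enumerated()
        .sorted { $0.element > $1.element } // highest score first
        .prefix(5)
        .map { $0.offset }                  // keep class indices only
    return top5.contains(trueLabel)
}

// Hypothetical scores over 8 classes: the true class (index 2) only
// ranks 3rd, so this prediction is top-1 wrong but top-5 correct.
let scores = [0.05, 0.30, 0.20, 0.25, 0.02, 0.08, 0.06, 0.04]
print(isTop5Correct(scores: scores, trueLabel: 2)) // true
```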

- From AlexNet in 2012, we saw an explosion of AI models in CV.
- Let's take a look at some of the big names (they are all image classifiers initially trained on the ImageNet dataset).
Side note:
every big-name model is usually characterised by one or a few brilliant architectural designs (e.g. a new set of hyper-parameters, a new layer type, etc.).
just for reference
AlexNet: the first CNN that went "deep", and used GPUs for training
VGG: deeper (aka it has more layers), with smaller filter sizes (3x3) in conv layers
ResNet: residual modules ("new layer type"), with connections jumping over layers
Inception, or GoogLeNet: inception modules ("new layer type"); it goes "wide"
All these models pre-trained on ImageNet can be used as a good starting point for "any" vision task.
- think of these models as "vision task bootcamp" graduates,
they are good visual feature extractors.
We'll see applications of these next semester!
"New" topics of computer vision:
- Image captioning (image to text) e.g. models here
- 2D image generation: GAN (thispersondoesnotexist), Stable Diffusion (next year😈)
- 3D model generation:
-- 2D-to-3D, e.g. avatar generation with RODIN by Microsoft, Avaturn (it has a free plan)
-- text-to-3D, e.g. game asset generation with Masterpiece X
Introducing hand pose detection:
- Input: a digital image
- Output: a set of coordinates corresponding to different key points on a human hand (here are the keypoints used by Apple's hand pose detector)
For the next ~30 mins:
- 😛download the virtual drawing (enabled by hand pose detection) App here
- 🍺unzip and open the Xcode project
- ✏️modify the "Team" and "Bundle Identifier" fields under Signing & Capabilities
- 📱run it on your phone!
- 🤏pinch to draw, double tap to clear the screen
- 🤨question: where is the hand pose detection API called?
- 🫥hint: VNDetectHumanHandPoseRequest() (a minimal sketch of the call follows below)
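For reference, a minimal sketch of how such a call could look on a single CGImage; the `cgImage` variable is assumed to already hold a camera frame or photo, and the App's actual call site may be organised differently:

```swift
import Vision

// Ask Vision to detect hand pose keypoints in one image.
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1

let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
do {
    try handler.perform([request])
    if let hand = request.results?.first {
        // Keypoints come back in normalised (0-1) image coordinates.
        let indexTip = try hand.recognizedPoint(.indexTip)
        let thumbTip = try hand.recognizedPoint(.thumbTip)
        print(indexTip.location, thumbTip.location, indexTip.confidence)
    }
} catch {
    print("Hand pose request failed: \(error)")
}
```

A "pinch" can then be detected by, for example, thresholding the distance between the thumbTip and indexTip locations.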
For the next ~20 mins:
- 😛download the barcode detection playground here
- 🍺unzip and open the playground
- 📱run it and see what it prints
- ⛹️‍♀️find a QR-code generator online, generate a QR code and use this playground to decipher it! (a minimal sketch of the API call follows below)
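For reference, a minimal sketch of the barcode API on a single CGImage; `cgImage` is assumed to already hold a picture of a QR code, and restricting the symbologies is optional:

```swift
import Vision

// Ask Vision to find and decode barcodes in one image.
let request = VNDetectBarcodesRequest()
request.symbologies = [.qr] // only look for QR codes

let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
do {
    try handler.perform([request])
    for barcode in request.results ?? [] {
        // payloadStringValue is the decoded text, e.g. a URL.
        print(barcode.symbology, barcode.payloadStringValue ?? "<no payload>")
    }
} catch {
    print("Barcode request failed: \(error)")
}
```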
For the next ~30 mins:
- 😛download the subject lifting (enabled by an image segmentation model) App here
- 🍺unzip and open the Xcode project
- ✏️modify the "Team" and "Bundle Identifier" fields under Signing & Capabilities
- 📱run it on your phone!
- 🤨question: where is the foreground instance segmentation API called?
- 🫥hint: VNGenerateForegroundInstanceMaskRequest() (a minimal sketch of the call follows below)
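For reference, a minimal sketch of how such a call could look; `cgImage` is again assumed to hold a photo with a clear foreground subject, and this API requires iOS 17/macOS 14 or later:

```swift
import Vision

// Ask Vision to segment the foreground subject(s) of one image.
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
do {
    try handler.perform([request])
    if let observation = request.results?.first {
        // Cut out all detected foreground instances as a pixel buffer.
        let masked = try observation.generateMaskedImage(
            ofInstances: observation.allInstances,
            from: handler,
            croppedToInstancesExtent: true)
        print(masked) // CVPixelBuffer containing just the lifted subject(s)
    }
} catch {
    print("Segmentation request failed: \(error)")
}
```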
That's quite a lot, congrats! 🎉
Today we have looked at:
- What is computer vision about 👁️
- Image classification, object detection, image segmentation 🤓
- Some open datasets in CV research: MNIST, PASCAL, ImageNet 💿
- Some popular AI models in CV:
-- VGG, ResNet, GoogLeNet 🏎️
- Three example Apps integrating models from Apple's Vision framework:
-- hand pose detection, barcode detection, foreground instance segmentation 🧃
*BIG ANNOUNCEMENT*

- Next week is Mick's guest lecture on AI models for languages (or any sequential data)😎
- It will also be our last lecture of this year!😎
- Hope to see you all next Thursday same time and same place!😎