Review — Look, Listen and Learn (Self-Supervised Learning)

Self-Supervised Learning Using L³-Net for Audio-Visual Correspondence Task (AVC)

Sik-Ho Tsang
Geek Culture


Audio-visual correspondence (AVC) task: By seeing and hearing many unlabelled examples, a network should learn to determine whether a (video frame, short audio clip) pair corresponds or not.

In this story, Look, Listen and Learn (L³-Net), by DeepMind and VGG, University of Oxford, is reviewed. In this paper, a question is considered:

What can be learnt by looking at and listening to a large number of unlabelled videos?

  • The Audio-Visual Correspondence (AVC) learning task is introduced to train visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, resulting in good visual and audio representations.

This is a paper in 2017 ICCV with over 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Core Idea
  2. L³-Net: Network Architecture
  3. Training Data Sampling
  4. Audio-Visual Correspondence (AVC) Results
  5. Transfer Learning Results
  6. Qualitative Analysis

1. Core Idea

1.1. Binary Classification Task

  • By seeing and hearing many examples of a person playing a violin and examples of a dog barking, and never, or at least very infrequently, seeing a violin being played while hearing a dog bark and vice versa, it should be possible to conclude what a violin and a dog look and sound like, without ever being explicitly taught what is a violin or a dog.
  • The AVC task is a simple binary classification task: given an example video frame and a short audio clip — decide whether they correspond to each other or not.

1.2. Difficulties

  • The corresponding (positive) pairs are the ones that are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.
  • The network learns visual and audio features and concepts from scratch without ever seeing a single label.
  • Videos can be very noisy: the audio source is not necessarily visible in the video (e.g. camera operator speaking, person narrating the video, sound source out of view or occluded, etc.).
  • The audio and visual content can be completely unrelated (e.g. edited videos with added music, very low volume sound, ambient sound such as wind dominating the audio track despite other audio events being present, etc.).

2. L³-Net: Network Architecture

L³-Net: Network Architecture
  • The network has three distinct parts: the vision and the audio subnetworks which extract visual and audio features, respectively, and the fusion network which takes these features into account to produce the final decision.

2.1. Vision Subnetwork

  • The input to the vision subnetwork is a 224×224 colour image.
  • VGGNet design style is used, with 3×3 convolutional filters, and 2×2 max-pooling layers with stride 2 and no padding.
  • The network can be segmented into four blocks of conv+conv+pool layers such that inside each block the two conv layers have the same number of filters, while consecutive blocks have doubling filter numbers: 64, 128, 256 and 512.
  • At the very end, max-pooling is performed across all spatial locations (i.e. Global Max Pooling) to produce a single 512-D feature vector.
  • Each conv layer is followed by batch normalization and ReLU.
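
For concreteness, here is a minimal sketch of such a vision subnetwork in tf.keras, following the block structure described above; initializers and any hyper-parameters not mentioned in the text are assumptions.

```python
# Minimal sketch of the vision subnetwork described above (tf.keras).
# Filter counts and pooling follow the text; anything else is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # conv + conv + pool; each conv is followed by batch norm and ReLU
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.MaxPool2D(pool_size=2, strides=2)(x)

def vision_subnetwork():
    inp = tf.keras.Input(shape=(224, 224, 3))        # 224x224 colour frame
    x = inp
    for filters in (64, 128, 256, 512):               # four conv+conv+pool blocks
        x = conv_block(x, filters)
    feat = layers.GlobalMaxPool2D()(x)                # single 512-D feature vector
    return models.Model(inp, feat, name='vision_subnet')
```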

2.2. Audio Subnetwork

  • The input to the audio subnetwork is a 1 second sound clip converted into a log-spectrogram, treated as a greyscale 257×199 image.
  • The architecture of the audio subnetwork is identical to the vision one, except that the input is a 1-channel (greyscale) image instead of a 3-channel colour image.
  • The final audio feature is also 512-D.
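
As a rough illustration, a 1-second waveform could be turned into such a log-spectrogram "image" as sketched below; the STFT parameters here are assumptions chosen so that a 512-point FFT gives 257 frequency bins, not the paper's exact settings.

```python
# Sketch: convert 1 second of audio into a log-spectrogram treated as a
# greyscale image. Window and hop values are illustrative assumptions.
import tensorflow as tf

def log_spectrogram(waveform, frame_length=512, frame_step=242, eps=1e-6):
    # waveform: float32 tensor of shape [num_samples] (1 second of audio)
    stft = tf.signal.stft(waveform, frame_length=frame_length,
                          frame_step=frame_step, fft_length=512)
    log_mag = tf.math.log(tf.abs(stft) + eps)         # [time_frames, 257]
    # Transpose to (frequency, time) and add a channel axis so the result can
    # be fed to the audio subnetwork as a roughly 257 x T x 1 greyscale image.
    return tf.transpose(log_mag)[..., tf.newaxis]
```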

2.3. Fusion Network

  • The two 512-D visual and audio features are concatenated into a 1024-D vector.
  • It consists of two fully connected layers, with ReLU in between them, and the intermediate feature size of 128-D, to produce a 2-way classification output, namely, whether the vision and audio correspond or not.
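
A minimal sketch of this fusion head, under the same assumptions as the earlier snippets:

```python
# Sketch of the fusion network: concatenate the 512-D visual and audio
# features and classify correspondence with two fully connected layers.
import tensorflow as tf
from tensorflow.keras import layers, models

def fusion_network():
    vision_feat = tf.keras.Input(shape=(512,))
    audio_feat = tf.keras.Input(shape=(512,))
    x = layers.Concatenate()([vision_feat, audio_feat])   # 1024-D vector
    x = layers.Dense(128, activation='relu')(x)           # intermediate 128-D
    logits = layers.Dense(2)(x)                           # correspond / not
    return models.Model([vision_feat, audio_feat], logits, name='fusion')
```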

3. Training Data Sampling & Datasets

3.1. Training Data Sampling & Other Details

  • A non-corresponding frame-audio pair is compiled by randomly sampling two different videos and picking a random frame from one and a random 1 second audio clip from the other.
  • A corresponding frame-audio pair is created by sampling a random video, picking a random frame in that video, and then picking a random 1 second audio clip that overlaps in time with the sampled frame.
  • Standard data augmentation techniques are used: Each training image is uniformly scaled such that the smallest dimension is equal to 256, followed by random cropping into 224 × 224, random horizontal flipping, and brightness and saturation jittering.
  • Audio is only augmented by changing the volume up to 10% randomly but consistently across the sample.
  • The network was trained on 16 GPUs in parallel with synchronous training implemented in TensorFlow, where each worker processed a 16-element batch, thus making the effective batch size of 256.
  • For a training set of 400k 10 second videos, the network is trained for two days, during which it has seen 60M frame-audio pairs.
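
The pair sampling described above can be sketched as follows; `random_frame()` and `random_audio_clip()` are hypothetical helpers standing in for the actual video and audio decoding.

```python
# Sketch of positive/negative pair sampling for the AVC task.
import random

def sample_pair(videos):
    if random.random() < 0.5:
        # Corresponding (positive) pair: a frame and an overlapping 1 s audio
        # clip taken from the same video.
        video = random.choice(videos)
        frame, t = video.random_frame()                # frame + timestamp (hypothetical helper)
        audio = video.random_audio_clip(overlap=t)     # 1 s clip overlapping t (hypothetical helper)
        label = 1
    else:
        # Non-corresponding (negative) pair: frame and audio taken from two
        # different, randomly sampled videos.
        vid_a, vid_b = random.sample(videos, 2)
        frame, _ = vid_a.random_frame()
        audio = vid_b.random_audio_clip()
        label = 0
    return frame, audio, label
```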

3.2. Datasets

Two video datasets are used for training the networks: Flickr-SoundNet and Kinetics-Sounds.

3.2.1. Flickr-SoundNet

  • This is a large unlabelled dataset of completely unconstrained videos from Flickr.
  • It contains over 2 million videos but for practical reasons only a random subset of 500k videos is used (400k training, 50k validation and 50k test) and only the first 10 seconds of each video are used.
  • This dataset is used for the transfer learning experiments.

3.2.2. Kinetics-Sounds

  • This is a labelled dataset used for quantitative evaluation. It is a subset (much smaller than Flickr-SoundNet) of the Kinetics dataset, which contains YouTube videos manually annotated for human actions and cropped to 10 seconds around the action.
  • The subset contains 19k 10 second video clips (15k training, 1.9k validation, 1.9k test) formed by filtering the Kinetics dataset for 34 human action classes, such as:
  1. playing various instruments (guitar, violin, xylophone, etc.),
  2. using tools (lawn mowing, shovelling snow, etc.), and
  3. performing miscellaneous actions (tap dancing, bowling, laughing, singing, blowing nose, etc.).
  • It still contains considerable noise, e.g. the bowling action is often accompanied by loud music at the bowling alley, human voices (camera operators or video narrators) often mask the sound of interest, and many videos contain added soundtracks.

4. Audio-Visual Correspondence (AVC) Results

Audio-visual correspondence (AVC) results
  • Test-set accuracy on the AVC task is shown for the L³-Net and for the two supervised baselines trained on the labelled Kinetics-Sounds dataset.
  • The number of positives and negatives is the same, so chance gets 50%. All methods are trained on the training set of the respective datasets.
  • The vision network has a feature extraction trunk identical to the vision subnetwork (Section 2.1), on top of which two fully connected layers (sizes 512×128 and 128×34) are attached, since there are 34 Kinetics-Sounds classes.
  • The audio network is built analogously.
  • Supervised direct: The direct combination baseline computes the audio-video correspondence score as the scalar product between the 34-D network softmax outputs, and decides that audio and video are in correspondence if the score is larger than a threshold.
  • Supervised pretraining: It takes the feature extraction trunks from the two trained networks and assembles them into the L³-Net architecture by concatenating the features and adding two fully connected layers. The weights of the feature extractors are frozen and the fully connected layers are trained on the AVC task.
  • The L³-Net achieves 74% and 78% on the two datasets, where chance is 50%. (Even humans find it hard to judge whether an isolated frame and an isolated single second of audio correspond)
  • The supervised baselines do not beat the L³-Net: “supervised pretraining” performs on par with it, while “supervised direct combination” works significantly worse since, unlike “supervised pretraining”, it has not been trained for the AVC task.
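
For clarity, the “supervised direct” combination score can be sketched as below; the threshold value is a placeholder, not the one used in the paper.

```python
# Sketch of the "supervised direct" baseline: score = scalar product of the
# two 34-way softmax outputs, thresholded to decide correspondence.
import numpy as np

def direct_correspondence(vision_softmax, audio_softmax, threshold=0.5):
    # vision_softmax, audio_softmax: arrays of shape [34]
    score = float(np.dot(vision_softmax, audio_softmax))
    return score > threshold                          # placeholder threshold
```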

5. Transfer Learning Results

  • After self-supervised training in the above AVC experiment, the subnetwork should be well pretrained, and can be used as weight initialization for other supervised datasets.

5.1. Audio Features on ESC-50 & DCASE

Sound Classification
  • Environmental sound classification (ESC-50): 2000 audio clips, 5 seconds each, equally balanced between 50 classes.
  • Detection and classification of acoustic scenes and events (DCASE): 10 classes with 10 training and 100 test clips per class, where each clip is 30 seconds long.
  • The audio features are obtained by maxpooling the last convolutional layer of the audio subnetwork (conv4_2), before the ReLU, into a 4×3×512 = 6144 dimensional representation.
  • The features are preprocessed using z-score normalization. A multi-class one-vs-all linear SVM is trained, and at test time the class scores for a recording are computed as the mean over the class scores for its subclips.
  • “Ours random” is an additional baseline which shows the performance of the same network architecture with random weights, i.e. without L³-Net training.
  • On both benchmarks, L³-Net training convincingly beats the previous state of the art, SoundNet [3], by 5.1% and 5% absolute. For ESC-50, it reduces the gap between the previous best result and human performance by 72%, while for DCASE it reduces the error by 42%.
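
A minimal sketch of this linear-SVM evaluation protocol, assuming the pooled conv4_2 features have already been extracted (hyper-parameters are illustrative):

```python
# Sketch of the sound-classification evaluation: z-score normalization,
# one-vs-all linear SVM, and mean class scores over a recording's subclips.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def evaluate_audio_features(train_feats, train_labels, test_recordings):
    # train_feats: [N, 6144] pooled conv4_2 features, one row per training clip
    # test_recordings: list of [n_subclips, 6144] arrays, one per test recording
    scaler = StandardScaler()                         # z-score normalization
    clf = LinearSVC()                                 # one-vs-rest linear SVM
    clf.fit(scaler.fit_transform(train_feats), train_labels)
    predictions = []
    for rec in test_recordings:
        scores = clf.decision_function(scaler.transform(rec))
        predictions.append(np.argmax(scores.mean(axis=0)))  # mean over subclips
    return predictions
```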

The proposed L³-Net training (Ours) sets the new state of the art by a large margin on both benchmarks.

5.2. Video Features on ImageNet

Visual classification on ImageNet
  • conv4_2 features are taken after the ReLU and max-pooled with equal kernel and stride sizes until the feature dimensionality falls below 10k; in this case this results in 4×4×512 = 8192-D features.
  • A single fully connected layer is added to perform linear classification into the 1000 ImageNet classes.
  • All the weights are frozen to their L³-Net-trained values, apart from the final classification layer which is trained with cross-entropy loss on the ImageNet training set.
  • The proposed L³-Net-trained features achieve 32.3% accuracy which is on par with other state-of-the-art self-supervised methods of [7, 8, 22, 36] (Context Prediction [7]), while convincingly beating random initialization, data-dependent initialization [17], and Context Encoders [25].
  • An important fact to consider is that all competing methods actually use ImageNet images when training. Although they do not make use of the labels, the underlying image statistics are the same:
  • In contrast, L³-Net uses a completely separate source of training data in the form of frames from Flickr videos. Furthermore, video frames have vastly different low-level statistics to still images, with strong artefacts such as motion blur.
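
This linear-probing setup can be sketched as follows; `vision_trunk` is a hypothetical Keras model exposing the pooled conv4_2 feature map, and the optimizer choice is an assumption.

```python
# Sketch of the ImageNet evaluation: freeze the L³-Net-trained trunk and train
# only a single fully connected classification layer with cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers, models

def imagenet_linear_probe(vision_trunk):
    vision_trunk.trainable = False                    # freeze pretrained weights
    inp = tf.keras.Input(shape=(224, 224, 3))
    feats = layers.Flatten()(vision_trunk(inp))       # 4*4*512 = 8192-D features
    logits = layers.Dense(1000)(feats)                # only this layer is trained
    model = models.Model(inp, logits)
    model.compile(optimizer='sgd',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model
```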

It is impressive that the proposed visual features L³-Net-trained on Flickr videos perform on par with self-supervised state-of-the-art trained on ImageNet.

6. Qualitative Results

6.1. Visual Features

Learnt visual concepts
  • The above figure shows the images that activate particular units in pool4 the most (i.e. the images ranked highest by that unit's activation magnitude).

The vision subnetwork has automatically learnt, without any explicit supervision, to recognize semantic entities such as guitars, accordions, keyboards, etc.
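
One way such a visualization could be produced is sketched below; `pool4_model` is a hypothetical model returning pool4 activations, and scoring each image by the unit's maximum spatial response is an assumed reading of "magnitude".

```python
# Sketch: rank dataset images by how strongly they activate one pool4 unit.
import numpy as np

def top_activating_images(pool4_model, images, unit, k=5):
    feats = pool4_model.predict(images)               # [N, H, W, C] pool4 activations
    scores = feats[..., unit].max(axis=(1, 2))        # peak response per image
    return np.argsort(-scores)[:k]                    # indices of the top-k images
```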

Visual semantic heatmap
  • The above heatmaps show that objects are successfully detected despite significant clutter and occlusions.

6.2. Audio Features

Learnt audio concepts
  • Since the audio itself cannot be shown, the above figure displays the video frames that correspond to the sounds.

The audio subnetwork, again without any supervision, manages to learn various semantic entities, as well as perform fine-grained classification (“fingerpicking” vs “playing bass guitar”).

Audio semantic heatmaps
  • The above figure shows spectrograms and their semantic heatmaps.
  • For example, it shows clear preference for low frequencies when detecting bass guitars, attention to wide frequency range when detecting lawnmowers, and temporal ‘steps’ when detecting fingerpicking and tap dancing.

Reference

[2017 ICCV] [L³-Net]
Look, Listen and Learn

Self-Supervised Learning

2014 [Exemplar-CNN] 2015 [Context Prediction] 2016 [Context Encoders] 2017 [L³-Net]

My Other Previous Paper Readings
