Discovering micro-events from video data using topic modeling

Summary

This research proposes a method for decomposing events from large-scale video datasets into semantic micro-events by developing a new variational inference method for supervised LDA (sLDA), named fsLDA (Fast Supervised LDA). Class labels contain semantic information that cannot be exploited by unsupervised topic modeling algorithms such as Latent Dirichlet Allocation (LDA). In the case of realistic action videos, LDA models the structure of the local features, which can be irrelevant to the action. On the other hand, sLDA is intractable for large-scale datasets with respect to both time and memory. fsLDA not only overcomes the computational limitations of sLDA and achieves better classification accuracy, but also allows varying the influence of the supervised part on the inference of the topics.

fsLDA

Figure 1 depicts the generative process of both sLDA and fsLDA in plate notation. This process is also outlined below for a single document, omitting document subscripts (a toy simulation sketch follows the list):
  1. Draw topic proportions \(\theta \sim \mathrm{Dir}(\alpha)\)
  2. For each codeword:
    1. Draw topic assignment \(z_n \mid \theta \sim \mathrm{Mult}(\theta)\)
    2. Draw word assignment \(w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})\)
  3. Draw class label \(y \mid z_{1:N} \sim \mathrm{softmax}\left( \bar{z}, \eta \right)\), where \(\bar{z} = \frac{1}{N} \sum_{n=1}^N z_n\) and the softmax function provides the following distribution $$p(y \mid \bar{z}, \eta) = \frac{\exp\left( \eta_y^T \bar{z} \right)} {\sum_{\hat{y}=1}^C \exp\left(\eta_{\hat{y}}^T \bar{z}\right)}$$
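As a concrete illustration (not part of the released LDA++ code), the following Python sketch simulates this generative process for a single toy document; all dimensions, parameter values and variable names are made up for the example.

import numpy as np

# Toy dimensions, chosen only for illustration
K, V, C, N = 5, 100, 3, 50   # topics, codewords, classes, codewords per document

rng = np.random.RandomState(0)
alpha = np.ones(K)                    # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), K)   # per-topic codeword distributions (K x V)
eta = rng.randn(C, K)                 # softmax classification parameters

# 1. Draw topic proportions theta ~ Dir(alpha)
theta = rng.dirichlet(alpha)

# 2. For each codeword draw z_n ~ Mult(theta) and then w_n ~ Mult(beta_{z_n})
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[z_n]) for z_n in z])

# 3. Draw the class label from softmax(z_bar, eta), with z_bar the empirical topic frequencies
z_bar = np.bincount(z, minlength=K) / float(N)
logits = eta.dot(z_bar)
p_y = np.exp(logits - logits.max())
p_y /= p_y.sum()
y = rng.choice(C, p=p_y)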

We use the following mean field variational family, which is also used for unsupervised LDA: \(q(\theta, z_{1:N} \mid \gamma, \phi_{1:N}) = q(\theta \mid \gamma) \prod_{n=1}^N q(z_n \mid \phi_n)\). The Evidence Lower Bound (ELBO) is given by the following equation; the main problem that this research addresses is the intractability of its last term.

$$ \mathcal{L}(\gamma, \phi \mid \alpha, \beta, \eta) = \mathbb{E}_q\left[\log p(\theta \mid \alpha)\right] + \mathbb{E}_q\left[\log p(z \mid \theta)\right] + \mathbb{E}_q\left[\log p(w \mid \beta, z)\right] + H(q) + \mathbb{E}_q\left[\log p(y \mid z, \eta)\right] $$
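To make the difficulty explicit, write \(\bar{z} = \frac{1}{N} \sum_{n=1}^N z_n\) and expand the last term:

$$ \mathbb{E}_q\left[\log p(y \mid z, \eta)\right] = \eta_y^T \, \mathbb{E}_q\left[\bar{z}\right] - \mathbb{E}_q\left[\log \sum_{\hat{y}=1}^C \exp\left(\eta_{\hat{y}}^T \bar{z}\right)\right] $$

The first term is linear in the \(\phi_n\) and trivial to compute, but the expectation of the log-sum-exp does not factorize over the per-word factors \(q(z_n \mid \phi_n)\) and has no closed form; this is the term that makes exact sLDA inference expensive and that fsLDA approximates.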
In [1] we derive closed-form update rules for the variational parameters \(\phi\) and \(\gamma\), with complexity comparable to the corresponding rules of unsupervised LDA. \(\mathcal{C}\) is a hyperparameter that controls the influence of the supervised part on the topics and \(s = \mathrm{softmax}(\mathbb{E}_q[\bar{z}], \eta)\).
$$ \begin{aligned} \phi_n &\propto \beta_{w_n} \exp\left(\Psi(\gamma) + \frac{\mathcal{C}}{\max(\eta)} \left( \eta_y - \sum_{\hat{y}=1}^C s_{\hat{y}} \eta_{\hat{y}} \right)\right) \\ \gamma &= \alpha + \sum_{n=1}^N \phi_n \end{aligned} $$
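The following numpy sketch shows one way to iterate these updates for a single document. It only illustrates the coordinate-ascent structure and is not the LDA++ implementation: fslda_e_step, log_beta, words and C_hyper are hypothetical names, \(\beta_{w_n}\) is read as the column of \(\beta\) for codeword \(w_n\), and \(\max(\eta)\) is taken literally from the rule above.

import numpy as np
from scipy.special import digamma

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fslda_e_step(words, y, alpha, log_beta, eta, C_hyper, n_sweeps=20):
    """Sketch of per-document variational inference with the fsLDA updates.

    words    : length-N array of codeword indices
    y        : observed class label (integer)
    alpha    : length-K Dirichlet prior
    log_beta : K x V matrix of log codeword probabilities per topic
    eta      : C x K softmax parameters
    C_hyper  : hyperparameter controlling the supervised influence
    """
    K = log_beta.shape[0]
    N = len(words)
    gamma = alpha + float(N) / K           # common flat initialization
    phi = np.full((N, K), 1.0 / K)

    for _ in range(n_sweeps):
        # s = softmax(E_q[z_bar], eta) with E_q[z_bar] the mean of the phi_n
        z_bar = phi.mean(axis=0)
        s = softmax(eta.dot(z_bar))

        # Supervised correction term, shared by every word within a sweep
        correction = (C_hyper / np.max(eta)) * (eta[y] - s.dot(eta))

        # phi_n proportional to beta_{w_n} * exp(digamma(gamma) + correction)
        log_phi = log_beta[:, words].T + digamma(gamma) + correction
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        gamma = alpha + phi.sum(axis=0)

    return gamma, phi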

Experiments and Results

Figure 2 shows that the inferred topics can capture semantic information with respect to the action. The second image in Figure 2 shows that Improved Dense Trajectories contain many irrelevant trajectories, several of which do not even refer to the swinging baby. By comparing the topic and word representations in Figure 4, it can easily be seen that the former is far sparser than the latter. This makes intuitive sense if we consider that all the trajectories depicted in the third clip of Figure 2 are generated by a single topic.

Intuitively, we expect a topic representation to encode the motion or visual information of a video with fewer dimensions. To test this intuition, we reduce feature dimensionality either with minimum Redundancy Maximum Relevance feature selection (mRMR) or simply by training sLDA, fsLDA and LDA with various feature sizes (numbers of topics). Figure 3 depicts the classification performance of all the aforementioned representations; we observe that fsLDA clearly surpasses all other methods, including Bag of Visual Words.
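As an illustration of how such a topic representation can be used downstream, the sketch below trains a linear SVM (scikit-learn) on per-video topic proportions; the random arrays merely stand in for the inferred representations and class labels, and this is not necessarily the exact evaluation pipeline behind Figure 3 or Table 1.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins: in practice gamma would hold the per-video variational Dirichlet
# parameters inferred by (f)sLDA and labels the corresponding action classes.
rng = np.random.RandomState(0)
gamma = rng.gamma(shape=2.0, scale=1.0, size=(200, 50))  # 200 videos, 50 topics
labels = rng.randint(0, 11, size=200)                    # e.g. the 11 UCF11 classes

theta_hat = gamma / gamma.sum(axis=1, keepdims=True)     # normalize to topic proportions

clf = LinearSVC(C=1.0)
scores = cross_val_score(clf, theta_hat, labels, cv=3)
print("mean cross-validated accuracy: %.4f" % scores.mean())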

In the following table we compare fsLDA, sLDA and LDA with respect to classification performance.

Dataset Feature # topics fsLDA sLDA LDA
UCF11 idt-hog 600 0.9299 0.9018 0.9118
UCF11 idt-hof 600 0.8530 0.8592 0.8374
UCF11 idt-mbhx 600 0.8449 0.8323 0.8336
UCF11 idt-mbhy 600 0.8580 0.8455 0.8480
UCF11 idt-traj 600 0.7904 0.7748 0.7754
UCF11 dsift 600 0.9280 0.9280 0.9143
UCF101 dcnn_conv5_2 1200 0.6237 Intractable 0.5603
UCF101 idt-hof 1200 0.5607 Intractable 0.5272
Table 1. Classification accuracy of fsLDA, sLDA and LDA in UCF11 and UCF101

Reproducibility

This research is accompanied by a C++ implementation of all the benchmarked algorithms (LDA, sLDA and fsLDA) under the MIT License. In order for this work to be reproducible, we also publish Bag of Visual Words histograms and centroids for all the local features that we have extracted from the UCF11 (YouTube Action) and UCF101 (Action Recognition) datasets. The data are licensed under a Creative Commons Attribution 4.0 International license.

Code

The entire code is organized in a C++ library, named LDA++. In addition, we provide a set of console applications that enable the use of the implemented algorithms without writing additional code. Thorough documentation, examples and tutorials, as well as the link to the GitHub repository, can be found on the library's homepage http://ldaplusplus.com/.

Assuming that LDA++ is already installed on your system, the following terminal session trains an unsupervised LDA model with 10 topics on the MNIST dataset, after it has been converted to the appropriate numpy format. Figure 5 depicts the learned topics.

$ cd /tmp
$ wget "http://ldaplusplus.com/files/mnist.tar.gz"
$ tar -zxf mnist.tar.gz
$ lda train --topics 10 --workers 4 mnist_train.npy model.npy
E-M Iteration 1
100
...
...
$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> fig, axes = plt.subplots(1, 10, figsize=(10, 1))
>>> with open("model.npy") as f:
...     alpha = np.load(f)
...     beta = np.load(f)
...
>>> for i in xrange(10):
...     axes[i].imshow(beta[i].reshape(28, 28), cmap=plt.cm.gray_r, interpolation='nearest')
...     axes[i].set_xticks([])
...     axes[i].set_yticks([])
...
>>> plt.tight_layout()
>>> fig.savefig("mnist.png")

Data

The provided data are in numpy format, ready to be used with the console applications. Each file contains two numpy arrays: the first holds the Bag of Visual Words counts and has shape (n_words, n_videos); the second contains the class label of each video as an integer and has shape (n_videos,). The following code snippet reads the provided data into a Python session.

>>> import numpy as np
>>> with open("path/to/data") as f:
...     X = np.load(f) # transpose if you need it in sklearn compatible format
...     y = np.load(f)

Both UCF11 and UCF101 contain videos with fewer than 15 frames; as a result, it was not feasible to extract Improved Dense Trajectories with the default configuration, and those videos have been removed. Alongside each feature file we provide the filenames of the videos the features were extracted from. We provide 3 random splits; in the case of UCF101 we use the splits given with the dataset. UCF11 is encoded with a vocabulary of 1000 codewords and UCF101 with a vocabulary of 4000 codewords.

Dataset Feature Comment Downloads
UCF11 Improved Dense Trajectories Extracted with this software using default parameters Data | Centroids
UCF11 Dense SIFT Extracted at 4 scales (16px, 24px, 32px, 40px) using a stride of 16px Data | Centroids
UCF101 Improved Dense Trajectories Extracted with this software using default parameters Data | Centroids
UCF101 STIP Extracted with this software using default parameters Data | Centroids
UCF101 Dense SIFT Extracted at 4 scales (16px, 24px, 32px, 40px) using a stride of 16px Data | Centroids
UCF101 VGG 2014 Deep CNN Extracted with this Caffe model. We use conv5_1 and conv5_2 as local features in \(\mathbb{R}^{512}\) Data | Centroids
ALL ALL Data | Centroids

Relevant Publications