|Context||Columbia Neural Network Reading Group/Seminar Series|
For high-dimenaional recognition tasks, there are often a huge number of classes which are subject to a great deal of variability. The goal in these tasks is often to remove any intra-class variability. For image class, this variability often corresponds to a change in point of view or local deformations; for audio it often has to do with frequency/time warping and timbre; in texture recognition the statistical properties are most important. We can attempt to remedy this as finding a representation which transforms data from the original space (which is unstable to geometric and stochastic variabilities) to a new space where, coupled with an appropriate distance metric, the variability is removed.
An ideal representation would have the following characteristics:
After a certain amount of perturbation/deformation, we can change anything to anything else, so we only want to be invariant/stable to a certain amount. Current feature representations focus on these requirements: Mel-Frequency Cepstral Coefficients apply a logarithmic filterbank to the signal and compute the magnitude; SIFT features similarly extract localized features in space and compute local histograms; convolutional networks repeatedly apply small filters and pooling operators.
A wavelet transform is essentially a (complex) bandpass filter which is both localized in frequency and space. A wavelet transform can be constructed by scaling and dilating a “mother” wavelet. There are conditions which state that if the frequencies are covered uniformly the wavelet transform has good stability properties. Wavelet transforms are linear and so are translation covariant; there is no linear operator which is stable to translation except for the average. So, to avoid losing information in the average, we must use a nonlinearity. We'd like a nonlinearity which preserves the stability we are trying to expose; the only operators which satisfy these criteria are point-wise. This matches the practice in neural networks.
A scattering convolutional network repeatedly convolves with a wavelet and applies a pointwise nonlinearity (modulus) and convolves (at each layer) with a different wavelet to obtain coefficients. This transform is stable to additive noise, preserves energy and is stable to geometric deformations. In this representation, the first order coefficients (analogous to MFCCs and SIFT) are often insufficient to discriminate between different classes, but the second order coefficients can. For stationary processes, the notion of stability is statistical - so we want the transform to be stable in expectation. So, instead of convolving with a different wavelet to obtain coefficients, we compute the expectation. As before, if two stationary processes have the same covariance, their first-order coefficients will be the same but the second-order coefficients can be different. In practice, higher-order polynomial moments can have a variance blow-up without a lot of data, so this approach is more stable.
We can formulate the inversion of a scattering representation by trying to find a representation whose scattering transformation is close to the given coefficients. This is nonconvex when cast as an optimization problem. It can be considered as a generalized phase recovery problem where there are many layers of phase which must be recovered. In practice, it can be approximated by minimizing (using gradient descent) the least-square error of the transformed reconstruction against the query coefficients. Using this process, good reconstruction can be obtained from second-order moments. Very localized coefficients can be more effective for inversion.
Using scattering coefficients as a representation, very good results on MNIST were obtained using simple classifiers (e.g. SVM). Most of the stability in this dataset comes from deformation; the scattering coefficients give stability to those deformations so that the SVM kernel can be exactly what it needs to be. In texture classification on the CUREt dataset, scattering coefficients improve over the state of the art. The dataset is small, and the wavelets are contractive so they do not increase the variance of the data. On object recognition in the caltech-101 dataset, scattering can outperform a small convolutional network.
Recently, neural network systems which cascade blocks of linearities followed by a pointwise nonliearity trained end-to-end have achieved state-of-the-art results in many tasks. It was found, however, that state-of-the-art neural networks for image classification are unstable to deformations - small additive deformations can cause the network to change its predicted class. Often some form of regularization is used to try to make the network more stable (weight decay, dropout). In general, adding more parameters can make the network able to more closely estimate functions but by using as few parameters as possible the networks can be more general. It could be argued that convolutional networks are so successful because they use fewer parameters according to the statistics of their input. This allows small filters which are localized in space, so that the number of learned parameters is independent of the input size. This property (number of parameters is independent of input size) is not present in, e.g., graphs.
We can consider signals as functions on graphs, where the similarity between two points is encoded in the weight connecting them in the graph. Networks can be collapsed by repeatedly finding the local neighborhood of the graph. The Fourier basis diagonalizes convolutions, and can be defined as the eigenbasis of the laplacian operator. The Laplacian can be defined on an undirected graph. We can then learn a set of diagonal filter coefficients which are computed in the spectral domain. In harmonic analysis, the decay of a signal is related to the smoothness of its Fourier transform (smoothness and locality are dual). So, rather than learning diagonal frequency multipliers, smooth frequency multipliers are learned instead. This allows us to subsample the frequency grid. A simple way to enforce smoothness is to use a 1-d linear interpolation. Using this structure can improve accuracy with fewer parameters.