An overview of our UAI 2019 paper on Sliced Score Matching. We show how to use random projections to scale up score matching, a classic method for learning unnormalized probabilistic models, to high-dimensional data. Theoretically, sliced score matching produces a consistent and asymptotically normal estimator under some regularity conditions. We apply sliced score matching to training deep energy-based models, learning VAEs with implicit encoders, and training Wasserstein Auto-Encoders (WAEs).
Let \(f_\theta(\mathbf{x})\) be a real-valued function on \(\mathbf{x} \in \mathbb{R}^D\). We can define a probability model from \(f_\theta(\mathbf{x})\) using the following formula:
\[p_\theta(\mathbf{x}) = \dfrac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}.\]Here we assume that \(Z_\theta = \int e^{-f_\theta(\mathbf{x})} \text{d}\mathbf{x}\) exists and call it the partition function. The partition function is typically intractable to evaluate, so \(Z_\theta\) is treated as an unknown quantity. We call \(p_\theta(\mathbf{x})\) an unnormalized probability model, because its normalizing constant \(Z_\theta\) is unknown.
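For concreteness, here is a minimal sketch (not from the paper) of such an unnormalized model in PyTorch, with the energy \(f_\theta\) parameterized by a small MLP, as in the deep energy-based models discussed later; the class name and architecture are hypothetical:

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Hypothetical energy network f_theta: R^D -> R.

    It defines the unnormalized model p_theta(x) ∝ exp(-f_theta(x));
    the partition function Z_theta is never computed.
    """
    def __init__(self, dim):
        super().__init__()
        # Softplus keeps f_theta twice differentiable, which score matching needs.
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.Softplus(),
            nn.Linear(128, 128), nn.Softplus(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        # x: batch of shape (B, D); returns energies of shape (B,)
        return self.net(x).squeeze(-1)
```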
The de facto standard for learning a probability model is maximum likelihood estimation (MLE). However, MLE is not suitable for learning unnormalized probability models, because the log-likelihood
\[\log p_\theta(\mathbf{x}) = -f_\theta(\mathbf{x}) - \log Z_\theta\]depends on the intractable partition function \(Z_\theta\).
Score matching
Note that the score of an unnormalized probability model, \(\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})\), does not depend on the intractable partition function \(Z_\theta\). Instead of the likelihood, score matching is based on the following objective
\[\frac{1}{2}\mathbb{E}_{p_\text{data}}[\|\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) \|_2^2],\]where \(p_\text{data}(\mathbf{x})\) is the data distribution. The above objective is widely known as the Fisher divergence. Since it only involves \(\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\) and does not require \(Z_\theta\), it is ideal for training unnormalized models.
However, Fisher divergence is not directly computable, because the score of the data distribution \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\) is unknown. Score matching eliminates the data score using integration by parts. To simplify our discussion, we consider the Fisher divergence between distributions of 1-D random variables. We have
\[\begin{aligned} &\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2] \\ =& \frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2 \text{d}x\\ =& \underbrace{\frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x))^2 \text{d}x}_{\text{const}} + \frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\theta(x))^2 \text{d} x \\ &- \int p_\text{data}(x) \nabla_x \log p_\theta(x) \nabla_x \log p_\text{data}(x)\text{d}x. \end{aligned}\]By integration by parts, we have
\[\begin{aligned} &- \int p_\text{data}(x) \nabla_x \log p_\theta(x) \nabla_x \log p_\text{data}(x) \text{d}x\\ =& - \int \nabla_x \log p_\theta(x) \nabla_x p_\text{data}(x)\text{d} x\\ =& - p_\text{data}(x) \nabla_x \log p_\theta(x)\bigg|_{-\infty}^{\infty} + \int p_\text{data}(x) \nabla_x^2 \log p_\theta(x) \text{d} x\\ \stackrel{(i)}{=}& \mathbb{E}_{p_\text{data}}[\nabla_x^2 \log p_\theta(x)], \end{aligned}\]where \((i)\) holds if we assume \(p_\text{data}(x) \rightarrow 0\) when \(\vert x \vert \rightarrow \infty\). Now, substituting the results of integration by parts into the 1-D Fisher divergence, we obtain
\[\begin{aligned} &\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2]\\ =& \mathbb{E}_{p_\text{data}}[\nabla_x^2 \log p_\theta(x)] + \frac{1}{2} \mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\theta(x))^2] + \text{const}. \end{aligned}\]Therefore, the equivalent form of 1-D Fisher divergence does not involve \(\nabla_x \log p_\text{data}(x)\).
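As a quick sanity check (not part of the original derivation), consider fitting a 1-D Gaussian model \(p_\theta = \mathcal{N}(\mu, \sigma^2)\), whose score is \(\nabla_x \log p_\theta(x) = -(x - \mu)/\sigma^2\) and whose second derivative is \(\nabla_x^2 \log p_\theta(x) = -1/\sigma^2\). The objective becomes
\[\mathbb{E}_{p_\text{data}}\bigg[-\frac{1}{\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4}\bigg] + \text{const},\]which is minimized at \(\mu = \mathbb{E}_{p_\text{data}}[x]\) and \(\sigma^2 = \mathbb{E}_{p_\text{data}}[(x - \mu)^2]\), i.e., the mean and variance of the data, without ever touching the data score.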
Generalizing the integration by parts argument to multi-dimensional data, we obtain the following objective, equivalent to the Fisher divergence up to a constant (see Hyvärinen (2005) for details):
\[\mathbb{E}_{p_\text{data}}\bigg[\operatorname{tr}(\nabla_\mathbf{x}^2 \log p_\theta(\mathbf{x})) + \frac{1}{2} \|\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|_2^2 \bigg] + \text{const},\]
where \(\nabla_\mathbf{x}^2\) denotes the Hessian with respect to \(\mathbf{x}\). This objective is known as the score matching objective. Since it only involves functions of \(\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\), it does not depend on the intractable partition function and therefore is ideal for learning unnormalized probability models.
So far, we know that score matching can be used to learn unnormalized models. We are particularly interested in learning deep energy-based models, a special kind of unnormalized model where \(f_\theta(\mathbf{x})\) is parameterized by a deep neural network. In order to use score matching for learning deep energy-based models, we have to compute \(\| \nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|_2^2 = \| \nabla_\mathbf{x} f_\theta(\mathbf{x})\|^2_2\) and \(\operatorname{tr}( \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})) = -\operatorname{tr}( \nabla_{\mathbf{x}}^2 f_\theta(\mathbf{x}))\). The former can be computed by one simple backpropagation of \(f_\theta(\mathbf{x})\). The latter, however, requires many more backpropagations: computing \(\operatorname{tr}( \nabla_{\mathbf{x}}^2 f_\theta(\mathbf{x}))\) takes a number of backpropagations proportional to the data dimension \(D\), so score matching is not scalable for learning deep energy-based models on high-dimensional data.
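To see where this cost comes from, here is a minimal PyTorch sketch of the vanilla score matching loss (hypothetical code, assuming an energy network like the `Energy` module above), with the Hessian trace assembled one diagonal entry at a time, one backpropagation per dimension:

```python
import torch

def score_matching_loss(energy, x):
    """Vanilla score matching: E[tr(grad^2 log p_theta) + 0.5 ||grad log p_theta||^2].

    Building the Hessian trace costs one backpropagation per data
    dimension, which is what makes this objective expensive for
    high-dimensional x.
    """
    x = x.requires_grad_(True)
    # model score: grad_x log p_theta(x) = -grad_x f_theta(x)
    score = -torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]
    norm_term = 0.5 * (score ** 2).sum(dim=-1)

    trace_term = torch.zeros_like(norm_term)
    for d in range(x.shape[1]):  # one backprop per dimension
        grad_d = torch.autograd.grad(score[:, d].sum(), x, create_graph=True)[0]
        trace_term = trace_term + grad_d[:, d]

    return (trace_term + norm_term).mean()
```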
Sliced score matching
We propose sliced score matching to greatly scale up the computation of score matching. The motivating idea is that score matching for one-dimensional distributions is much easier than for high-dimensional ones. We therefore project the scores onto random directions, so that the vector fields of scores of the data and model distributions become scalar fields. We then compare these scalar fields to determine how far the model distribution is from the data distribution. It is easy to see that the two vector fields are equivalent if and only if the scalar fields obtained from projections onto all directions are the same.
Mathematically, let \(\mathbf{v}\) denote the random projection direction and \(p_\mathbf{v}\) its distribution. The randomly projected version of the Fisher divergence is
\[\frac{1}{2}\mathbb{E}_{p_\mathbf{v}}\mathbb{E}_{p_\text{data}}[(\mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) )^2],\]which we call the sliced Fisher divergence. Unfortunately, the sliced Fisher divergence has the same problem as the Fisher divergence, due to the unknown data score \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\). We therefore apply the same integration by parts trick as in score matching to obtain the following tractable alternative form
\[\mathbb{E}_{p_\mathbf{v}}\mathbb{E}_{p_\text{data}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}))^2 \bigg] + \text{const},\]which is our sliced score matching objective. Again, \(\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\) can be computed with one backpropagation for deep energy-based models. The first term still involves the Hessian, but only through Hessian-vector products, which can be computed within \(O(1)\) backpropagations. Therefore, the computational cost of sliced score matching does not depend on the data dimension, making it much more scalable for training deep energy-based models on high-dimensional datasets.
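Here is a corresponding minimal PyTorch sketch of the sliced score matching loss (hypothetical code, one projection per sample, taking \(p_\mathbf{v} = \mathcal{N}(0, I)\)); the Hessian-vector product takes a single extra backpropagation regardless of the data dimension:

```python
import torch

def sliced_score_matching_loss(energy, x):
    """Sliced score matching: E[v^T grad^2 log p_theta v + 0.5 (v^T grad log p_theta)^2]."""
    x = x.requires_grad_(True)
    v = torch.randn_like(x)  # one random projection direction per sample, v ~ N(0, I)

    # model score: grad_x log p_theta(x) = -grad_x f_theta(x)
    score = -torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]
    vs = (v * score).sum(dim=-1)  # v^T grad_x log p_theta(x)

    # Hessian-vector product: differentiating v^T score once more gives (grad^2_x log p_theta) v
    hvp = torch.autograd.grad(vs.sum(), x, create_graph=True)[0]
    hvp_term = (v * hvp).sum(dim=-1)  # v^T (grad^2_x log p_theta) v

    return (hvp_term + 0.5 * vs ** 2).mean()
```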
Suppose our data points \(\{ \mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}\) are i.i.d. samples from the data distribution \(p_\text{data}\). For each data point \(\mathbf{x}_i\), we draw \(M\) random projection directions \(\{ \mathbf{v}_{i1}, \mathbf{v}_{i2}, \cdots, \mathbf{v}_{iM}\} \sim p_\mathbf{v}\). The sliced score matching objective can then be estimated with empirical averages, giving rise to the following finite-sample estimator:
\[\frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \bigg[\mathbf{v}_{ij}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x}_i)\mathbf{v}_{ij} + \frac{1}{2} (\mathbf{v}_{ij}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}_i))^2 \bigg].\]Let \(\hat{\theta}_{N,M}\) denote the minimizer of the above empirical estimator, and let \(\theta^*\) denote the true parameter corresponding to the data distribution, such that \(p_{\theta^*} = p_\text{data}\). We can prove that, under some regularity conditions, \(\hat{\theta}_{N,M}\) is consistent and asymptotically normal. More rigorously, for any \(M \in \mathbb{N}^+\), when \(N \rightarrow \infty\), we have
\[\hat{\theta}_{N,M} \stackrel{p}{\rightarrow} \theta^*\]and
\[\sqrt{N}(\hat{\theta}_{N,M} - \theta^*) \stackrel{d}{\rightarrow} \mathcal{N}(0,\Sigma)\]where \(\Sigma\) is some covariance matrix.
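As a usage sketch (hypothetical, reusing the `Energy` network and `sliced_score_matching_loss` defined above), the finite-sample estimator with \(M\) projections per data point can be minimized with a standard optimizer:

```python
import torch

def ssm_loss_m_projections(energy, x, M=4):
    """Finite-sample SSM objective: average over N data points and M directions each."""
    # Repeating each data point M times gives every copy its own direction v_ij.
    x_rep = x.repeat_interleave(M, dim=0)
    return sliced_score_matching_loss(energy, x_rep)

# Sketch of a training loop; `loader` yields minibatches of shape (B, D).
# model = Energy(dim=D)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for x in loader:
#     loss = ssm_loss_m_projections(model, x, M=1)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```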
Sliced score matching has numerous applications. For example, it can be used to train deep energy-based models on high-dimensional data, to learn VAEs with implicit encoders, and to train Wasserstein Auto-Encoders (WAEs).