<h1 id="preparation-on-dirichlet-icml-2021">Preparation on Dirichlet - ICML 2021</h1> <p>Charles Corbière, 2020-11-26</p> <h2 id="table-of-contents">Table of contents</h2> <ol> <li><a href="#context">Context</a></li> <li><a href="#contributions">Contributions</a></li> <li><a href="#gaussian">Thought experiment with Gaussian distributions</a></li> <li><a href="#experiments">Experiments</a></li> </ol> <h2 id="context-">Context <a name="context"></a></h2> <p>Current metrics to measure in-domain and out-of-domain uncertainty include:</p> <ul> <li>maximum class probability $$MCP = \max_c p(y=c \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})$$</li> <li>predictive entropy $$\mathcal{H} \big [ p( y \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) \big ]$$</li> <li>mutual information $$\mathcal{MI} \Big [y, \boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}} \Big ]$$</li> <li>differential entropy $$\mathcal{H} \big [ p( \boldsymbol{\pi} \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) \big ]$$</li> <li>Dirichlet precision $$\alpha_0 = \sum_c \alpha_c$$</li> </ul> <p>Many papers <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a> consider measures derived from the expected predictive categorical distribution $$p(y \vert \boldsymbol{x^*}, \mathcal{D})$$ as measures of <strong>total uncertainty</strong>. 
This includes MCP and predictive entropy.</p> <p>Following <a class="citation" href="#depeweg2018decomposition">(Depeweg et al., 2018)</a>, the predictive entropy can be decomposed into two terms:</p> $\begin{equation} \underbrace{\mathcal{H} \Big [ \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})} \big [ p(y \vert \boldsymbol{\pi}) \big] \Big ]}_{\text{Total Uncertainty}} = \underbrace{\mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})} \Big [ \mathcal{H} \big [ p(y \vert \boldsymbol{\pi}) \big] \Big]}_{\text{Expected Aleatoric Uncertainty}} + \underbrace{\mathcal{MI} \Big [y, \boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}} \Big ]}_{\text{Epistemic Uncertainty}} \end{equation}$ <p>Consequently, mutual information is an epistemic uncertainty measure.</p> <p style="text-align: center;"><img src="/images/desired_behavior.png" alt="desired_behavior" /></p> <p>Finally, Dirichlet precision is also expected to be an epistemic uncertainty measure, as it captures the dispersion of the Dirichlet distribution on the simplex. 
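For a Dirichlet parametrization of $$p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})$$, all the measures above, and this decomposition, have closed forms. A minimal numerical sketch (assuming NumPy/SciPy; the concentration values $\boldsymbol{\alpha}$ are made up for illustration):

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import dirichlet

# Hypothetical Dirichlet concentration parameters for one input.
alpha = np.array([20.0, 3.0, 2.0])
alpha0 = alpha.sum()                       # Dirichlet precision
p_mean = alpha / alpha0                    # expected categorical E[pi]

mcp = p_mean.max()                         # maximum class probability
total = -np.sum(p_mean * np.log(p_mean))   # predictive entropy H[E[pi]]
# Closed form for the expected aleatoric term E[H[Cat(pi)]] under Dir(alpha):
aleatoric = -np.sum(p_mean * (digamma(alpha + 1) - digamma(alpha0 + 1)))
mi = total - aleatoric                     # epistemic part (mutual information)
diff_ent = dirichlet(alpha).entropy()      # differential entropy of Dir(alpha)

# Monte-Carlo sanity check of the closed-form expected entropy.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)
mc_aleatoric = -np.mean(np.sum(samples * np.log(samples), axis=1))
```

By Jensen's inequality (entropy is concave), the aleatoric term never exceeds the total, so the mutual-information term is guaranteed non-negative.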
The higher the epistemic uncertainty, the higher we want this dispersion to be.</p> <h2 id="contributions-">Contributions <a name="contributions"></a></h2> <h3 id="summary">Summary</h3> <ul> <li>A new total uncertainty measure for Dirichlet Networks, $$\textrm{KL}_{\textrm{Pred}}$$, with properties similar to MCP</li> <li>$$\textrm{KL}_{\textrm{Pred}}$$ can be decomposed into two terms which allow distinguishing aleatoric from epistemic uncertainty</li> <li>This new criterion enables an improved version based on the <em>ground-truth</em> class, which improves misclassification detection</li> </ul> <h3 id="a-new-total-uncertainty-measure">A new total uncertainty measure</h3> <p>A neural network trained to minimize cross-entropy equivalently minimizes the KL-divergence between the empirical true distribution and the model distribution:</p> $\begin{equation} \mathcal{L}_{\textrm{XE}}(\boldsymbol{x},y^*, \boldsymbol{\theta}) = \textrm{KL} \Big ( \hat{p}(y \vert \boldsymbol{x}) \vert \vert \textrm{Cat}(y \vert \boldsymbol{x}, \boldsymbol{\theta}) \Big ) = - \log p(y=y^* \vert \boldsymbol{x}, \boldsymbol{\theta}) \end{equation}$ <p>The standard uncertainty measure is MCP, the likelihood $$p(y=\hat{y} \vert \boldsymbol{x^*}, \hat{\boldsymbol{\theta}})$$ estimated by the model at sample $\boldsymbol{x^*}$.</p> <p>With Dirichlet Networks, training is achieved by minimizing the reverse KL-divergence with a sharp target Dirichlet distribution:</p> $\begin{equation} \mathcal{L}_{\textrm{RKL}}(\boldsymbol{x},y^*, \boldsymbol{\theta}) = \textrm{KL} \Big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(y^*)}) \Big ) \end{equation}$ <p>In a similar way as for cross-entropy, we propose an uncertainty criterion, denoted $\textrm{KL}_{\textrm{Pred}}$, which measures the KL-divergence between the network’s output and a sharp Dirichlet distribution with concentration 
parameters $\boldsymbol{\gamma}^{(\hat{y})}$ focused on the <em>predicted</em> class $\hat{y}$:</p> $\begin{equation} \textrm{KL}_{\textrm{Pred}}(\boldsymbol{x}) = \textrm{KL} \Big ( \textrm{Dir} (\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\hat{\theta}} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\gamma}^{(\hat{y})} ) \Big ) \end{equation}$ <p>To estimate the concentration parameters $\boldsymbol{\gamma}^{(\hat{y})}$ accurately, we compute the empirical mean of the exponentiated logits of the predicted class $\hat{y}$ on the training set $\mathcal{D}$:</p> $\begin{equation*} \boldsymbol{\gamma}^{(\hat{y})} = \frac{1}{N^{(\hat{y})}} \sum_{i: y_i=\hat{y}}^N \boldsymbol{\alpha}(\boldsymbol{x_i}, \boldsymbol{\hat{\theta}}), \quad \quad \textrm{with}~~ \boldsymbol{\alpha}(\boldsymbol{x_i}, \boldsymbol{\hat{\theta}}) = \exp (f(\boldsymbol{x}_i,\boldsymbol{\hat{\theta}})) \end{equation*}$ <p>where $N^{(\hat{y})}$ is the number of training samples with label $\hat{y}$.</p> <p><img src="/images/klpred_behavior.png" alt="simplex_behavior" height="700px" width="700px" /></p> <p>The lower $\textrm{KL}_{\textrm{Pred}}$ is, the more certain we are in the prediction. The figure above illustrates that correct predictions have Dirichlet distributions similar to the computed mean distribution for the predicted class, and are thus associated with a low uncertainty measure. Misclassified predictions are expected to have concentration parameters that deviate from the average computed on the training set, resulting in a higher $\textrm{KL}_{\textrm{Pred}}$ measure.</p> <p><strong>For now, this is just another measure: we have no theoretical argument that it evaluates total uncertainty better. 
Without further justification, it might be just as good as MCP or predictive entropy.</strong></p> <h3 id="decomposition-of-textrmkl_textrmpred">Decomposition of $$\textrm{KL}_{\textrm{Pred}}$$</h3> <p>As done for the reverse KL-divergence loss in <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019)</a>, we decompose $$\textrm{KL}_{\textrm{Pred}}$$ into the reverse cross-entropy and the negative differential entropy:</p> $\begin{equation} \textrm{KL}_{\textrm{Pred}}(\boldsymbol{x}) = \underbrace{\mathbb{E}_{p ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\hat{\theta}} )} \Big [- \log \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{\gamma}^{(\hat{y})} \big ) \Big ]}_\text{Reverse Cross-Entropy} - \underbrace{\mathcal{H} \Big [ \textrm{Dir} \big (\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\hat{\theta}} \big ) \Big ]}_\text{Differential Entropy} \end{equation}$ <p>where we can show that the Reverse Cross-Entropy (RCE) corresponds to measuring aleatoric uncertainty, while the Differential Entropy measures the dispersion of the Dirichlet distribution, hence epistemic uncertainty.</p> <p><strong>Still, there is no justification that it is better than the previously proposed decomposition of predictive entropy in <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a>.</strong></p> <h3 id="improving-misclassification-detection-for-dirichlet-networks">Improving misclassification detection for Dirichlet Networks</h3> <p>Similarly to ConfidNet, we can improve misclassification detection by considering the measure with respect to the <em>ground-truth</em> class.</p> $\begin{equation} \textrm{KL}_{\textrm{True}}(\boldsymbol{x}) = \textrm{KL} \Big ( \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\hat{\theta}}) \big ) ~\vert \vert~ \textrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\gamma}^{(y^*)} \big ) \Big ) \end{equation}$ <p>where $$\boldsymbol{\gamma}^{(y^*)}$$ corresponds to a 
Dirichlet distribution whose concentration parameters are the empirical mean for the <em>true</em> class $$y^*$$ of sample $$\boldsymbol{x}$$. When evaluated in experiments, the $$\textrm{KL}_{\textrm{True}}$$ criterion leads to a near-perfect separation of correct and erroneous predictions.</p> <p>However, the true class is obviously not available when estimating confidence on test samples. We propose to learn $$\textrm{KL}_{\textrm{True}}$$ by introducing an auxiliary confidence neural network, termed <em>KLNet</em>, with parameters $$\boldsymbol{\omega}$$, which outputs a confidence prediction $$\hat{c}(\boldsymbol{x}, \boldsymbol{\omega})$$.</p> <p><strong>Now we can justify that <em>KLNet</em> improves total uncertainty estimation by better evaluating aleatoric uncertainty. Nevertheless, we have no guarantees about epistemic uncertainty, and the current KLNet training does not take OOD samples into account.</strong></p> <h2 id="thought-experiment-with-gaussian-distributions-">Thought experiment with Gaussian distributions <a name="gaussian"></a></h2> <p>Let us suppose the random variable over categorical probabilities $\boldsymbol{\pi} =[\pi_1,…,\pi_C]$ is now parametrized as a $C$-dimensional multivariate Gaussian distribution:</p> $\begin{equation} p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D}) = \mathcal{N}(\boldsymbol{\pi} \vert \boldsymbol{\mu_1}, \boldsymbol{\Sigma_1}) \end{equation}$ <p>For instance, a neural network could output $\boldsymbol{\mu_1} = [0.98, 0.01, 0.01]$ and a diagonal covariance $\boldsymbol{\Sigma_1} = \mathrm{diag}(0.05, 0.03, 0.02)$.</p> <blockquote> <p>On a related matter, <a class="citation" href="#kendall2017">(Kendall &amp; Gal, 2017)</a> propose, for classification, to consider a Gaussian over the logits, parametrized by a NN outputting mean logits $f(\boldsymbol{x}, \boldsymbol{w})$ (e.g. $[98.1, 1.2, 1.3]$) and a scale vector $\boldsymbol{\sigma}$ (e.g. $[2.1,0.6,3.3]$). 
Class probabilities then correspond to the softmax of the mean logits corrupted by Gaussian noise:</p> $\begin{equation} \boldsymbol{\hat{\pi}}_t = \mathrm{Softmax}(\boldsymbol{\pi}_t)~~~~\mathrm{with }~ \boldsymbol{\pi}_t = f(\boldsymbol{x}, \boldsymbol{w}) +\boldsymbol{\sigma}\epsilon_t,~~~\epsilon_t \sim \mathcal{N}(0,I) \end{equation}$ </blockquote> <p>On the simplex, the distribution can be represented with $\boldsymbol{\mu_1}$ as its center position and $\boldsymbol{\Sigma_1}$ as its dispersion. Predictions are based on the argmax of the mean parameter, the first moment of the distribution:</p> $\begin{equation} \hat{y} = \operatorname{arg\,max}_{c}~ \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D})}[\pi_c] = \operatorname{arg\,max}_{c}~ \boldsymbol{\mu_1}[c] \end{equation}$ <p>In this case, <strong>the entropy corresponds to computing the entropy of the mean of the distribution</strong>: $\mathcal{H} \big [ \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D})}[\boldsymbol{\pi}] \big ]$.</p> <p>To reflect epistemic uncertainty, we should also consider the second moment of the distribution, $\boldsymbol{\Sigma_1}$. For instance, the higher $\boldsymbol{\Sigma_1}[\hat{y}]$ is, the higher the epistemic uncertainty.</p> <p>In order to capture both aleatoric and epistemic uncertainty, we could try to derive some statistics on $p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D})$. 
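One candidate statistic is the KL-divergence to a target Gaussian, which separates into a term driven by the means and a term driven by the variances. A numerical sketch (assuming NumPy and diagonal covariances; all values are made up for illustration):

```python
import numpy as np

def gaussian_kl_terms(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), returned as
    (variance term, mean-shift term), for diagonal covariances."""
    var_term = 0.5 * (np.sum(np.log(var2) - np.log(var1))
                      - len(mu1) + np.sum(var1 / var2))
    mean_term = 0.5 * np.sum((mu2 - mu1) ** 2 / var2)
    return var_term, mean_term

mu1  = np.array([0.98, 0.01, 0.01])   # predicted mean (first moment)
var1 = np.array([0.05, 0.03, 0.02])   # predicted dispersion (second moment)
mu2  = np.array([1.0, 0.0, 0.0])      # hypothetical sharp target mean
var2 = np.array([0.01, 0.01, 0.01])   # hypothetical target dispersion

var_term, mean_term = gaussian_kl_terms(mu1, var1, mu2, var2)
kl = var_term + mean_term
# A larger predicted variance var1 (epistemic) or a larger mean shift
# (first-moment error) each increases the divergence through its own term.
```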
Given a target distribution $p(\boldsymbol{\pi} \vert \boldsymbol{\mu_2}, \boldsymbol{\Sigma_2}) = \mathcal{N}(\boldsymbol{\pi} \vert \boldsymbol{\mu_2}, \boldsymbol{\Sigma_2})$, also Gaussian, the <strong>KL-divergence</strong> between the two distributions is:</p> \begin{align} \mathrm{KL} \big [ p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D}) \vert \vert p(\boldsymbol{\pi} \vert \boldsymbol{\mu_2}, \boldsymbol{\Sigma_2}) \big ] = \frac{1}{2} \Big (\log \vert\boldsymbol{\Sigma_2}\vert - &amp;\log \vert\boldsymbol{\Sigma_1}\vert - C + tr( \boldsymbol{\Sigma_2}^{-1} \boldsymbol{\Sigma_1}) \\ \nonumber &amp;+ (\boldsymbol{\mu_2} - \boldsymbol{\mu_1})^\top\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{\mu_2}-\boldsymbol{\mu_1}) \Big ) \end{align} <p>Setting aside the terms which do not depend on the input, we can identify two terms: the first relates to the first moment of the distribution, $(\boldsymbol{\mu_2}-\boldsymbol{\mu_1})^\top\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{\mu_2}-\boldsymbol{\mu_1})$, and the second involves only the variance, $- \log \vert\boldsymbol{\Sigma_1}\vert + tr( \boldsymbol{\Sigma_2}^{-1} \boldsymbol{\Sigma_1})$.</p> <h2 id="empirical-experiments-">Empirical Experiments <a name="experiments"></a></h2> <p>Two models were trained:</p> <ul> <li>with standard cross-entropy (<strong>XE</strong>)</li> <li>with contrastive reverse KL-divergence (<strong>Dirichlet</strong>)</li> </ul> <h3 id="cifar-10">CIFAR-10</h3> <ul> <li>In-distribution dataset: CIFAR-10</li> <li>Out-distribution training dataset: CIFAR-100</li> <li>Network Architecture: VGG-16</li> <li>Concentration parameters $\alpha= \exp{f(x, \theta)}$</li> <li>Target concentrations for in-domain: $$\beta_{\text{in}}$$ = 10</li> <li>Training details: Adam, LR 5e-5, 1-cyclic scheduler for 45 epochs</li> </ul> <p>Accuracy is 93.5% for the XE model and 92.9% for the Dirichlet model.</p> <p>Presented results are for <strong>TinyImageNet</strong> as OOD dataset (% AUC), with $\beta=100$ for standard Dirichlet models 
and $\beta=10$ for contrastive Dirichlet models</p> <table> <thead> <tr> <th style="text-align: center">Training</th> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>Standard-XE</strong></td> <td style="text-align: left">MCP<br />ODIN<br />Mahalanobis<br />ConfidNet</td> <td style="text-align: right">$92.6\%$<br />$91.4\%$<br />$88.9\%$<br />$\boldsymbol{92.6\%}$</td> <td style="text-align: right">$87.5\%$<br />$\boldsymbol{89.1\%}$<br />$83.0\%$<br />$87.8\%$</td> <td style="text-align: right">$90.5\%$<br />$\boldsymbol{91.5\%}$<br />$85.7\%$<br />$91.6\%$</td> </tr> <tr> <td style="text-align: center"><strong>Standard-Dirichlet</strong></td> <td style="text-align: left">MCP<br />ODIN<br />Mahalanobis<br />Mutual Information<br />ConfidNet<br />$\textrm{KL}_{\textrm{Pred}}$ (Ours) <br /><strong>KLNet (Ours)</strong></td> <td style="text-align: right">$82.0\%$<br />$81.1\%$<br />$\boldsymbol{91.9\%}$<br />$74.0\%$<br />$.\%$<br />$91.4\%$<br />$91.3\%$</td> <td style="text-align: right">$72.3\%$<br />$71.5\%$<br />$\boldsymbol{86.9\%}$<br />$66.0\%$<br />$.\%$<br />$84.3\%$<br />$84.1\%$</td> <td style="text-align: right">$74.8\%$<br />$73.9\%$<br />$\boldsymbol{89.6\%}$<br />$67.8\%$<br />$.\%$<br />$87.5\%$<br />$87.4\%$</td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align: center">Training</th> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. 
Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>Contrastive-XE</strong></td> <td style="text-align: left">MCP<br />ODIN<br />ConfidNet</td> <td style="text-align: right">$92.5\%$<br />$92.0\%$<br />$92.7\%$</td> <td style="text-align: right">$\boldsymbol{94.0\%}$<br />$94.1\%$<br />$94.1\%$</td> <td style="text-align: right">$\boldsymbol{95.4\%}$<br />$95.4\%$<br />$95.5\%$</td> </tr> <tr> <td style="text-align: center"><strong>Contrastive-Dirichlet</strong></td> <td style="text-align: left">MCP<br />ODIN<br />Mutual Information<br />Mahalanobis<br />ConfidNet<br />$\textrm{KL}_{\textrm{Pred}}$ (Ours) <br /><strong>KLNet (Ours)</strong></td> <td style="text-align: right">$92.1\%$<br />$92.1\%$<br />$91.9\%$<br />$91.4\%$<br />$92.8\%$<br />$92.1\%$<br />$\boldsymbol{93.2\%}$</td> <td style="text-align: right">$93.2\%$<br />$93.2\%$<br />$93.2\%$<br />$83.3\%$<br />$93.6\%$<br />$93.2\%$<br />$\boldsymbol{93.6\%}$</td> <td style="text-align: right">$94.9\%$<br />$94.9\%$<br />$94.8\%$<br />$95.1\%$<br />$95.2\%$<br />$94.9\%$<br />$\boldsymbol{95.3\%}$</td> </tr> </tbody> </table> <h5 id="effect-of-the-computed-mean-gammahaty">Effect of the computed mean $$\gamma^{(\hat{y})}$$</h5> <p>We consider three ways of computing the target distribution in the criterion $\textrm{KL}_{\textrm{Pred}}$:</p> <ul> <li><strong>KL_Original</strong>: take exactly the target distribution used in training:</li> </ul> $\begin{equation*} \gamma^{(0, \hat{y})} = \big [ 1,..., \beta_{\text{in}},...,1 \big ] \end{equation*}$ <ul> <li><strong>KL_Full</strong>: compute the empirical exponential logits mean of the predicted class $\hat{y}$ on the training set $\mathcal{D}$ and use <strong>all values</strong>:</li> </ul> $\begin{equation*} \boldsymbol{\gamma}^{(1, \hat{y})} = \frac{1}{N^{(\hat{y})}} \sum_{i: y_i=\hat{y}}^N \boldsymbol{\alpha}(\boldsymbol{x_i}, 
\boldsymbol{\hat{\theta}}), \quad \quad \textrm{with}~~ \boldsymbol{\alpha}(\boldsymbol{x_i}, \boldsymbol{\hat{\theta}}) = \exp (f(\boldsymbol{x}_i,\boldsymbol{\hat{\theta}})) \end{equation*}$ <p>where $N^{(\hat{y})}$ is the number of training samples with label $\hat{y}$.</p> <ul> <li><strong>KL_Pred</strong>: compute the empirical exponential logits mean of the predicted class $\hat{y}$ on the training set $\mathcal{D}$ and use <strong>only the $\hat{y}$-value</strong>:</li> </ul> $\begin{equation*} \boldsymbol{\gamma}^{(2, \hat{y})} = \big [ 1,..., \boldsymbol{\gamma}^{(1, \hat{y})}[\hat{y}],...,1 \big ] \end{equation*}$ <p>Results using the Dirichlet model trained with $$\beta_{\text{in}}=10$$ are available in the table below:</p> <table> <thead> <tr> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">KL_Original</td> <td style="text-align: right">$92.6\%$</td> <td style="text-align: right">$93.0\%$</td> <td style="text-align: right">$93.7\%$</td> </tr> <tr> <td style="text-align: left">KL_Pred</td> <td style="text-align: right">$92.5\%$</td> <td style="text-align: right">$93.2\%$</td> <td style="text-align: right">$92.8\%$</td> </tr> <tr> <td style="text-align: left">KL_Full</td> <td style="text-align: right">$92.2\%$</td> <td style="text-align: right">$93.2\%$</td> <td style="text-align: right">$93.6\%$</td> </tr> </tbody> </table> <p>In practice, I observe that using more complicated forms of the target distribution does not impact performance.</p> <h5 id="effect-of-beta_textin">Effect of $$\beta_{\text{in}}$$</h5> <p>We vary the chosen value for in-domain target concentrations $$\beta_{\text{in}}$$ from 10 to 10,000. 
Below are the results for the criterion KL_Original:</p> <table> <thead> <tr> <th style="text-align: center">$$\beta_{\text{in}}$$</th> <th style="text-align: right">Accuracy</th> <th style="text-align: right">Mis. Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">10</td> <td style="text-align: right">$92.9\%$</td> <td style="text-align: right">$92.6\%$</td> <td style="text-align: right">$93.0\%$</td> <td style="text-align: right">$93.7\%$</td> </tr> <tr> <td style="text-align: center">100</td> <td style="text-align: right">$93.0\%$</td> <td style="text-align: right">$91.7\%$</td> <td style="text-align: right">$92.2\%$</td> <td style="text-align: right">$92.8\%$</td> </tr> <tr> <td style="text-align: center">1000</td> <td style="text-align: right">$93.5\%$</td> <td style="text-align: right">$90.2\%$</td> <td style="text-align: right">$90.3\%$</td> <td style="text-align: right">$91.1\%$</td> </tr> <tr> <td style="text-align: center">10000</td> <td style="text-align: right">$92.6\%$</td> <td style="text-align: right">$88.5\%$</td> <td style="text-align: right">$88.1\%$</td> <td style="text-align: right">$89.3\%$</td> </tr> </tbody> </table> <p>We also note that the lower $$\beta_{\text{in}}$$ is, the less confident in-domain softmax probabilities will be. For instance, in the case of $$\beta_{\text{in}}=10$$, they range from $0.1$ to $0.57$.</p> <blockquote> <p>When $$\beta_{\text{in}} \geq 100$$, KL_Original becomes worse compared to KL_Pred or KL_Full. This may be because logit variations are large, making the logits deviate from the original target distribution.</p> </blockquote> <h5 id="empirical-evaluation-of-the-decomposed-measures">Empirical evaluation of the decomposed measures</h5> <p>The criterion $\textrm{KL}_{\textrm{Pred}}$ actually corresponds to the reverse KL-divergence used in training. 
Hence, it can be decomposed into the reverse cross-entropy and the negative differential entropy:</p> $\begin{equation} \textrm{KL}_{\textrm{Pred}}(\boldsymbol{x}) = \underbrace{\mathbb{E}_{p ( \mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\hat{\theta}}) )} \Big [- \log \textrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\gamma}^{(\hat{y})} \big ) \Big ]}_\text{Reverse Cross-Entropy} - \underbrace{\mathcal{H} \Big [ \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\hat{\theta}}) \big ) \Big ]}_\text{Differential Entropy} \end{equation}$ <p>In the synthetic experiment, we observe that the differential entropy correlates with the <strong>epistemic uncertainty</strong> and the reverse cross-entropy (RCE) correlates with the <strong>aleatoric uncertainty</strong>.</p> <p>We use these decomposed metrics to evaluate their effectiveness on misclassification detection and OOD detection in the following table (experiment done using the Dirichlet model trained with $$\beta_{\text{in}}=10$$).</p> <table> <thead> <tr> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. Detection</th> <th style="text-align: right">OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">RCE</td> <td style="text-align: right">$91.3\%$</td> <td style="text-align: right">$92.8\%$</td> </tr> <tr> <td style="text-align: left">Diff. Ent.</td> <td style="text-align: right">$91.2\%$</td> <td style="text-align: right">$92.8\%$</td> </tr> <tr> <td style="text-align: left">KL_Original</td> <td style="text-align: right">$92.6\%$</td> <td style="text-align: right">$93.0\%$</td> </tr> </tbody> </table> <p>Neither RCE nor differential entropy seems more inclined to measure one type of uncertainty. 
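As a side check, the closed-form KL between two Dirichlets and its reverse cross-entropy term can be computed and verified by Monte-Carlo. A minimal sketch (assuming NumPy/SciPy; the concentration values are hypothetical, not taken from the experiments):

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import dirichlet

def dirichlet_kl(a, g):
    """Closed-form KL( Dir(a) || Dir(g) )."""
    a0, g0 = a.sum(), g.sum()
    return (gammaln(a0) - gammaln(a).sum()
            - gammaln(g0) + gammaln(g).sum()
            + np.dot(a - g, digamma(a) - digamma(a0)))

alpha = np.array([50.0, 3.0, 2.0])   # hypothetical exp-logits for one input
gamma = np.array([60.0, 2.0, 2.0])   # hypothetical class-mean target

kl = dirichlet_kl(alpha, gamma)
diff_ent = dirichlet(alpha).entropy()    # differential entropy H[Dir(alpha)]
rce = kl + diff_ent                      # since KL = RCE - H[Dir(alpha)]

# Monte-Carlo check: RCE = E_{Dir(alpha)}[ -log Dir(z | gamma) ].
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=100_000)
log_norm = gammaln(gamma.sum()) - gammaln(gamma).sum()
mc_rce = -np.mean(log_norm + ((gamma - 1.0) * np.log(samples)).sum(axis=1))
```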
The only thing we can conclude is that combining them into KL_Original improves performance.</p> <h5 id="ablation-study">Ablation study</h5> <p>As observed in our NeurIPS submission, KLNet mainly improves misclassification detection, with a slight but uncontrolled benefit on OOD detection:</p> <table> <thead> <tr> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">KL_Original</td> <td style="text-align: right">$92.6\%$</td> <td style="text-align: right">$93.0\%$</td> <td style="text-align: right">$93.7\%$</td> </tr> <tr> <td style="text-align: left">KLNet_Classic</td> <td style="text-align: right">$93.0\%$</td> <td style="text-align: right">$93.3\%$</td> <td style="text-align: right">$94.0\%$</td> </tr> <tr> <td style="text-align: left">KLNet_Cloning</td> <td style="text-align: right">$93.7\%$</td> <td style="text-align: right">$93.3\%$</td> <td style="text-align: right">$94.5\%$</td> </tr> </tbody> </table> <h3 id="cifar-100">CIFAR-100</h3> <ul> <li>In-distribution dataset: CIFAR-100</li> <li>Out-distribution training dataset: CIFAR-10</li> <li>Network Architecture: VGG-16</li> <li>Concentration parameters $\alpha= \exp{f(x, \theta)}$</li> <li>Target concentrations for in-domain: $$\beta_{\text{in}}$$ = 10</li> <li>Training details: Adam, LR 5e-5, 1-cyclic scheduler for 45 epochs</li> </ul> <p>Accuracy is 73.2% for the XE model and 71.5% for the Dirichlet model.</p> <p>Presented results are for <strong>TinyImageNet</strong> as OOD dataset (% AUC).</p> <table> <thead> <tr> <th style="text-align: center">Training</th> <th style="text-align: left">Method</th> <th style="text-align: right">Mis. 
Detection</th> <th style="text-align: right">OOD Detection</th> <th style="text-align: right">Mis.+OOD Detection</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>XE</strong></td> <td style="text-align: left">MCP<br />ODIN<br />Mahalanobis<br /></td> <td style="text-align: right">$86.6\%$<br />$84.9\%$<br />$80.9\%$</td> <td style="text-align: right">$75.9\%$<br />$77.4\%$<br />$73.3\%$</td> <td style="text-align: right">$86.0\%$<br />$85.4\%$<br />$81.3\%$</td> </tr> <tr> <td style="text-align: center"><strong>Dirichlet</strong></td> <td style="text-align: left">MCP<br />ODIN<br />Mutual Information<br />Mahalanobis<br /><strong>KL_Original</strong><br /><strong>KLNet</strong><br /><strong>KLNet_Cloning</strong></td> <td style="text-align: right">$83.7\%$<br />$83.7\%$<br />$82.8\%$<br />$83.4\%$<br />$87.3\%$<br />$86.8\%$<br />$\boldsymbol{87.6\%}$</td> <td style="text-align: right">$76.0\%$<br />$76.0\%$<br />$76.0\%$<br />$71.9\%$<br />$77.0\%$<br />$76.9\%$<br />$\boldsymbol{77.8\%}$</td> <td style="text-align: right">$84.3\%$<br />$84.3\%$<br />$83.8\%$<br />$82.9\%$<br />$87.3\%$<br />$86.8\%$<br />$\boldsymbol{87.8\%}$</td> </tr> </tbody> </table> <blockquote> <p>Training details</p> <ul> <li>on CIFAR-10, KLNet is first trained without weight decay; weight decay is then added in the second (cloning) phase and data augmentation is disabled</li> <li>same for ConfidNet on CIFAR-10</li> </ul> </blockquote> <h2 id="references">References</h2> <ol class="bibliography"><li><span id="malinin2018">A. Malinin, &amp; M. Gales. (2018). Predictive Uncertainty Estimation via Prior Networks. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="depeweg2018decomposition">S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, &amp; S. Udluft. (2018). Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. 
In <i>Proceedings of the International Conference on Machine Learning</i>.</span></li> <li><span id="malinin2019">A. Malinin, &amp; M. Gales. (2019). Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="kendall2017">A. Kendall, &amp; Y. Gal. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In <i>Advances in Neural Information Processing Systems</i>.</span></li></ol> <h1 id="a-survey-on-dirichlet-neural-networks">A Survey on Dirichlet Neural Networks</h1> <p>Charles Corbière, 2020-11-12</p> <p>In the past few years, a variety of Dirichlet-based methods for classification have emerged in the machine learning community. Before going into a detailed review of these approaches, we introduce the Dirichlet-multinomial model, a classic problem in machine learning which will help to better understand the intuition behind the method.</p> <h2 id="table-of-contents">Table of contents</h2> <ol> <li><a href="#preliminaries">Preliminaries: the Dirichlet-multinomial Model</a></li> <li><a href="#bayesian-approach">Dirichlet Neural Networks, a Bayesian Approach</a></li> <li><a href="#prior-networks">Prior Networks</a></li> <li><a href="#max-gap">Maximizing the Gap between In- and Out-distribution for Prior Networks</a></li> <li><a href="#evidential-networks">Evidential Networks</a></li> <li><a href="#generative-evidential">Generative Evidential Networks</a></li> <li><a href="#variational-inference">Variational Inference for Dirichlet Networks</a></li> <li><a href="#posterior-networks">Posterior Networks, a Density-based Approach</a></li> </ol> <h2 
id="preliminaries-the-dirichlet-multinomial-model-">Preliminaries: the Dirichlet-multinomial Model <a name="preliminaries"></a></h2> <p>Let us take the problem of inferring the probability that a die with $C$ sides comes up as face $c$.</p> <blockquote> <p>This example is heavily inspired by Section 3.4 of <a class="citation" href="#murphy2013machine">(Murphy, 2013)</a>’s book.</p> </blockquote> <p>Suppose we observe $N$ dice rolls. As we always roll the same die, our training dataset consists of $$\mathcal{D} = \{(x, y^{(i)})\}_{i=1}^N$$ where $$y^{(i)} \in \{1,...,C\}$$. If we assume the data is <em>i.i.d.</em>, the likelihood has the form:</p> $\begin{equation} p(\mathcal{D} \vert \boldsymbol{\pi}) = \mathrm{Cat}(\mathcal{D} \vert \boldsymbol{\pi}) = \prod_{c=1}^C \pi_c^{N_c} \label{eq:likelihood_multinomial} \end{equation}$ <p>where $$N_c = \sum_{i=1}^N \mathbb{I}(y^{(i)} = c)$$ is the number of observations of class $c$ among the $N$ dice rolls (the sufficient statistics).</p> <p>For its conjugacy with categorical distributions, we choose to model the prior as a Dirichlet distribution with concentration parameters $\boldsymbol{\beta}$:</p> $\begin{equation} p(\boldsymbol{\pi}) = \mathrm{Dir} \big (\boldsymbol{\pi} ; \boldsymbol{\beta} \big ) = \frac{\Gamma(\beta_0)}{\prod_{c=1}^C \Gamma(\beta_c)} \prod_{c=1}^C \pi_c^{\beta_c- 1} \end{equation}$ <p>Multiplying the likelihood by the prior, we find that the posterior is also a Dirichlet distribution:</p> \begin{align} p(\boldsymbol{\pi} \vert \mathcal{D}) &amp;\propto p(\mathcal{D} \vert \boldsymbol{\pi}) p(\boldsymbol{\pi}) \\ &amp;\propto \prod_{c=1}^C \pi_c^{N_c} \pi_c^{\beta_c - 1} \\ &amp;= \mathrm{Dir} \big (\boldsymbol{\pi} \vert \beta_1 + N_1,..., \beta_C + N_C \big ) \end{align} <p>Now, what we really care about is computing the posterior predictive distribution for a single multinoulli trial:</p> \begin{align} P(y=c \vert \mathcal{D}) &amp;= \int P(y=c \vert \boldsymbol{\pi}) p(\boldsymbol{\pi} 
\vert \mathcal{D}) d\boldsymbol{\pi} \\ &amp;= \int \pi_c \cdot p(\boldsymbol{\pi} \vert \mathcal{D}) d\boldsymbol{\pi} \\ &amp;= \mathbb{E}_{p(\boldsymbol{\pi} \vert \mathcal{D})} \big [\pi_c \big ] = \frac{\beta_c + N_c}{\beta_0 + N} \end{align} <p>We observe that the prior distribution acts as a <strong>Bayesian smoothing</strong> by adding pseudo-counts $\boldsymbol{\beta}$ to the empirical counts.</p> <blockquote> <p>Consider two random variables $A,B$ with respective realisations $a,b$. In this blog post, we use the abusive notation $$\mathbb{E}_{p(a \vert b)} \big [ f(a) \big ] = \int f(a) p(A=a \vert B=b) da~$$ instead of $$~\mathbb{E} \big [f(a) \vert B=b \big ]$$ for conciseness.</p> </blockquote> <h2 id="dirichlet-neural-networks-a-bayesian-approach--">Dirichlet Neural Networks, a Bayesian Approach <a name="bayesian-approach"></a></h2> <p>We extend the Bayesian treatment of a single categorical distribution to classification. In classification tasks, we predict the class label $y_i$ from a different categorical distribution for each input $\boldsymbol{x}_i$. The dataset consists of a collection of <em>i.i.d.</em> training samples $$\mathcal{D} = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^N \in (\mathcal{X} \times \mathcal{Y})^N$$ where $\mathcal{X}$ represents the input space and $$\mathcal{Y}=\{1,\ldots,C\}$$ is the set of class labels. Samples drawn from $\mathcal{D}$ follow an unknown joint probability distribution $p(\boldsymbol{x}, y)$ where $\boldsymbol{x}$ and $y$ are random variables over input space and label space respectively.</p> <p>Let us denote by $\boldsymbol{\pi} =[\pi_1,…,\pi_C]$ the random variable over categorical probabilities. 
The likelihood given a sample $\boldsymbol{x}$ has the form:</p> $\begin{equation} y \vert \boldsymbol{\pi}, \boldsymbol{x} \sim \mathrm{Cat}(y \vert \boldsymbol{\pi}, \boldsymbol{x}) = \prod_{c=1}^C \pi_c^{\tilde{N}_c(\boldsymbol{x})} \end{equation}$ <p>The difference with Eq.(\ref{eq:likelihood_multinomial}) is that $$\tilde{N}_c(\boldsymbol{x})$$ now represents a label frequency count at point $$\boldsymbol{x}$$.</p> <blockquote> <p>Most of the time, the estimator $$\tilde{N}_c(\boldsymbol{x})$$ relies on a single or very few samples, since most inputs are unique or very rare in the training set.</p> </blockquote> <p>Obviously, for an unknown test sample $$\boldsymbol{x}^*$$, we don’t have access to its label frequency count at inference time. Consequently, we are not able to estimate the posterior predictive distribution:</p> \begin{align} P(y =c \vert \boldsymbol{x^*}, \mathcal{D}) &amp;= \int P(y=c \vert \boldsymbol{\pi}, \boldsymbol{x^*}) p(\boldsymbol{\pi} \vert \boldsymbol{x}^*, \mathcal{D}) d\boldsymbol{\pi} \\ &amp;= \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x}^*, \mathcal{D})} \big [ \pi_c \big ] \end{align} <p>Approaches like ensembles <a class="citation" href="#deepensembles2017">(Lakshminarayanan et al., 2017)</a> and dropout <a class="citation" href="#mcdropout2016">(Gal &amp; Ghahramani, 2016)</a> implicitly model the posterior distribution over categorical probabilities by marginalizing over the network parameters $\boldsymbol{\theta}$ and estimate the predictive distribution thanks to Monte Carlo sampling:</p> $\begin{equation*} P(y=c \vert \boldsymbol{x}^*, \mathcal{D}) = \int P(y=c \vert \boldsymbol{x}^*, \boldsymbol{\theta}) p(\boldsymbol{\theta} \vert \mathcal{D}) d\boldsymbol{\theta} \approx \frac{1}{S} \sum_{s=1}^S P(y=c \vert \boldsymbol{x}^*, \boldsymbol{\theta}_s) \end{equation*}$ <p><strong>With Dirichlet models, we now explicitly parametrize the distribution over the predictive categorical $p(\boldsymbol{\pi} \vert \boldsymbol{x}^*,
\mathcal{D})$ with a Dirichlet distribution.</strong> This effectively emulates an ensemble without any sampling approximation, thanks to a closed-form solution which also requires only a single forward pass.</p> <p>In particular, the Dirichlet parametrization makes it possible to account separately for the aleatoric uncertainty and the epistemic uncertainty of a prediction. The <em>aleatoric uncertainty</em> is irreducible from the data due to class overlap or noise, <em>e.g.</em> a fair coin has a 50/50 chance of heads. The <em>epistemic uncertainty</em> is due to the lack of knowledge about unseen data, <em>e.g.</em> an image of an unknown object or an outlier in the data.</p> <p>Epistemic uncertainty relates to the spread of the categorical distribution $$p(\boldsymbol{\pi} \vert \boldsymbol{x}^*, \mathcal{D})$$ on the simplex, which corresponds to $$\alpha_0 = \sum_{c=1}^C \alpha_c$$ for a Dirichlet distribution. Aleatoric uncertainty is linked to the position of the mean on the simplex. Equipped with this configuration, we would like the Dirichlet network to yield the desired behaviors shown in the figure below:</p> <p style="text-align: center;"><img src="/images/desired_behavior.png" alt="desired_behavior" /></p> <p>When the model is confident in its prediction, it should yield a sharp distribution centered on one of the corners of the simplex (<em>Fig a.</em>). For an input in a region with high degrees of noise or class overlap (aleatoric uncertainty), it should yield a sharp distribution focused on the center of the simplex, which corresponds to being confident in predicting a flat categorical distribution over class labels (<em>Fig b.</em>).
Finally, for out-of-distribution inputs, the Dirichlet network should yield a flat distribution over the simplex, indicating large epistemic uncertainty (<em>Fig c.</em>).</p> <p>Now the remaining questions are:</p> <ul> <li><strong>How to induce such desired behavior when training Dirichlet networks?</strong></li> <li><strong>What measure should we use for each type of uncertainty?</strong></li> </ul> <p>In the recent literature, <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a> and <a class="citation" href="#sensoy2018">(Sensoy et al., 2018)</a> simultaneously proposed to learn a Dirichlet neural network to better represent uncertainty in classification. Subsequent papers <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019; Chen et al., 2019; Sensoy et al., 2020; Joo et al., 2020; Nandy et al., 2020; Charpentier et al., 2020)</a> build on this framework by improving the training procedure. In the rest of this post, we will review these approaches and their benefits.</p> <h2 id="prior-networks-">Prior Networks <a name="prior-networks"></a></h2> <p><a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a> propose to model the distribution over probabilities as $p(\boldsymbol{\pi} \vert \boldsymbol{x}^*; \boldsymbol{\hat{\theta}}) = \mathrm{Dir} \big (\boldsymbol{\pi} \vert \boldsymbol{\alpha} \big )$, where the concentration parameters $\boldsymbol{\alpha}$ are given by the output of a neural network $f$.</p> <p>By marginalizing network parameters $\boldsymbol{\theta}$, the distribution over probabilities writes as follows:</p> $\begin{equation*} p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D}) = \int p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \boldsymbol{\theta}) p(\boldsymbol{\theta} \vert \mathcal{D}) d\boldsymbol{\theta} \end{equation*}$ <p>For simplicity, the authors assume a Dirac-delta approximation of the parameters:</p> $\begin{equation*}
p(\boldsymbol{\theta} \vert \mathcal{D}) = \delta(\boldsymbol{\theta} - \boldsymbol{\hat{\theta}}) \Rightarrow p(\boldsymbol{\pi} \vert \boldsymbol{x^*}, \mathcal{D}) \approx p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \boldsymbol{\hat{\theta}}) \end{equation*}$ <p>The posterior over class labels will be given by the mean of the Dirichlet:</p> $\begin{equation} P(y=c \vert \boldsymbol{x}^*, \mathcal{D}) = \int P(y=c \vert \boldsymbol{\pi}, \boldsymbol{x^*}) p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \boldsymbol{\hat{\theta}}) d\boldsymbol{\pi} = \frac{\alpha_c}{\alpha_0} \end{equation}$ <p>If an exponential output function is used, i.e. $\alpha_c = e^{f_c(\boldsymbol{x^*}, \boldsymbol{\hat{\theta}})}$, then the expected posterior probability of a label $c$ is given by the output of the softmax:</p> $\begin{equation} P(y=c \vert \boldsymbol{x^*}, \mathcal{D}) = \frac{e^{f_c(\boldsymbol{x^*}, \boldsymbol{\hat{\theta}})}}{\sum_{k=1}^C e^{f_k(\boldsymbol{x^*}; \boldsymbol{\hat{\theta}})}} \end{equation}$ <p>The representation is similar to standard neural networks for classification, with the difference that the output now describes the concentration parameters of a Dirichlet distribution over a simplex.</p> <h3 id="learning">Learning</h3> <p>Training with a cross-entropy loss alone only affects the value of the concentration parameter $\alpha_y$ associated with the true class. It does not allow controlling the <em>spread</em> parameter $\alpha_0$ of the Dirichlet distribution over categorical probabilities.</p> <p>For clarity, we introduce the existence of out-of-domain training data $$\mathcal{D}_{\textrm{out}}$$ and now denote the training dataset by $$\mathcal{D}_{\textrm{in}}$$.
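</p>

<p>The softmax identity above can be checked in a couple of lines (hypothetical logits):</p>

```python
import numpy as np

# With alpha_c = exp(f_c), the Dirichlet mean alpha_c / alpha_0 coincides
# with the softmax of the logits f (hypothetical values below).
logits = np.array([2.0, -1.0, 0.5])
alpha = np.exp(logits)
dirichlet_mean = alpha / alpha.sum()
softmax = np.exp(logits) / np.exp(logits).sum()
assert np.allclose(dirichlet_mean, softmax)
```

<p>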
The associated random variable, specifying whether a sample belongs to the in-distribution or the out-distribution, takes values in $$\{i,o\}$$.</p> <p>To enforce the desired behavior for out-of-distribution (OOD) samples, <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019)</a> propose a <strong>reverse KL-divergence-based contrastive loss between in-distribution and out-distribution samples</strong>. It consists of minimizing the reverse KL divergence between the neural network’s output and a sharp Dirichlet distribution focused on the appropriate class for in-distribution data, and between the output and a flat Dirichlet distribution for out-of-distribution data:</p> \begin{align} \mathcal{L}_{\textrm{RKL-PN}}(\boldsymbol{\theta}) &amp;= \mathbb{E}_{p(\boldsymbol{x})} \Big [ \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \bar{\boldsymbol{\beta}}) \big ) \Big ] \\ &amp;=\mathbb{E}_{p(\boldsymbol{x})} \Big [\sum_{c=1}^C P(y=c \vert \boldsymbol{x}) \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(c)}) \big ) \Big ] \\ &amp;=\mathbb{E}_{p(\boldsymbol{x} \vert i)} \Big [\sum_{c=1}^C P(y=c \vert \boldsymbol{x}) \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(c)}_{\textrm{in}}) \big ) \Big ] P(i) \\ &amp;~~~~~~+ \mathbb{E}_{p(\boldsymbol{x} \vert o)} \Big [\sum_{c=1}^C P(y=c \vert \boldsymbol{x}) \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(c)}_{\textrm{out}}) \big ) \Big ] P(o) \nonumber \end{align} <p>where $$\bar{\boldsymbol{\beta}} = \sum_{c=1}^C P(y=c \vert \boldsymbol{x}) \cdot
\boldsymbol{\beta}^{(c)}$$ is an arithmetic mixture of the target concentration parameters for each class.</p> <p>We approximate with the empirical distributions $$\hat{p}(X \vert i)$$ on $$\mathcal{D}_{\textrm{in}}$$ and $$\hat{p}(X \vert o)$$ on $$\mathcal{D}_{\textrm{out}}$$, which boils down to minimizing the following loss:</p> \begin{align} \hat{\mathcal{L}}_{\textrm{RKL-PN}}(\boldsymbol{\theta}, \mathcal{D}_{\textrm{in}}, \mathcal{D}_{\textrm{out}})= \sum_{i=1}^{N_i} &amp;~\textrm{KL} \Big ( \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{x}^{(i)}, \boldsymbol{\theta} \big ) ~\vert \vert~ \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{\beta}_{\textrm{in}}^{(i)} \big ) \Big ) \\ &amp;+ \gamma \sum_{j=1}^{N_o} \mathrm{KL} \Big ( \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{x}^{(j)}, \boldsymbol{\theta} \big ) ~\vert \vert~ \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{\beta}_{\textrm{out}}^{(j)} \big ) \Big ) \nonumber \end{align} <p>where the in-domain target concentration parameters $$\boldsymbol{\beta}_{\textrm{in}}^{(i)}$$ are such that $$\forall c\neq y^{(i)}, \boldsymbol{\beta}^{(i)}_{c, in}=1$$ and $$\boldsymbol{\beta}^{(i)}_{y, in} = 1 + B$$, with $B$ a hyperparameter. A flat uniform Dirichlet is chosen for the out-domain target distribution: $$\forall c, \boldsymbol{\beta}_{c,\textrm{out}}^{(j)}=1$$. The hyperparameter $\gamma = \hat{P}(o) / \hat{P}(i)$ helps balance the out-of-distribution loss weight in training.</p> <blockquote> <p>The authors chose $B=100$ to aim for in-domain distributions with high target concentration parameters.</p> </blockquote> <h3 id="measuring-uncertainty">Measuring uncertainty</h3> <p><a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a> consider measures from the expected predictive categorical distribution $$p(y \vert \boldsymbol{x^*}, \mathcal{D})$$ as a measure of <strong>total uncertainty</strong>.
This includes the Maximum Class Probability (MCP) and the predictive entropy $$\mathcal{H} \big [ y \vert \boldsymbol{x^*}, \mathcal{D} \big ]$$.</p> <p>Based on <a class="citation" href="#depeweg2018decomposition">(Depeweg et al., 2018)</a>, they decompose the predictive entropy into two terms:</p> $\begin{equation} \underbrace{\mathcal{H} \Big [ \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})} \big [ p(y \vert \boldsymbol{\pi}) \big] \Big ]}_{\text{Total Uncertainty}} = \underbrace{\mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})} \Big [ \mathcal{H} \big [ p(y \vert \boldsymbol{\pi}) \big] \Big]}_{\text{Expected Aleatoric Uncertainty}} + \underbrace{\mathcal{MI} \Big [y, \boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}} \Big ]}_{\text{Epistemic Uncertainty}} \end{equation}$ <ul> <li>$$\mathcal{MI} \Big [y, \boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}} \Big ]$$ is the <em>mutual information</em> between class label and categorical probabilities from the posterior. It corresponds to a measure of the spread of the posterior distribution $$p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})$$</li> <li>$$\mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x^*}; \hat{\boldsymbol{\theta}})} \Big [ \mathcal{H} \big [ p(y \vert \boldsymbol{\pi}) \big] \Big]$$ is the expected entropy of the categorical distribution.</li> </ul> <blockquote> <p>When dealing with ensemble methods, the expected entropy of the categorical distribution is computed with the Monte Carlo estimates.
With Dirichlet networks, there is a closed form.</p> </blockquote> <p>In their experiments, they use the mutual information $$\mathcal{MI}$$ to detect out-of-distribution samples and MCP for misclassification detection.</p> <h2 id="maximizing-the-gap-between-in--and-out-distribution-for-prior-networks-">Maximizing the Gap between In- and Out-distribution for Prior Networks <a name="max-gap"></a></h2> <p><a class="citation" href="#maximize-representation-gap2020">(Nandy et al., 2020)</a> motivate their work by showing that using the reverse KL-divergence tends to produce flatter Dirichlet distributions for in-domain misclassified examples. As a consequence, this effect may make it harder to distinguish in-domain from out-of-domain samples.</p> <p>To demonstrate this behavior, they decompose the KL-divergence into the <em>reverse cross-entropy</em> and the <em>differential entropy</em>:</p> \begin{align} \mathcal{L}_{\textrm{RKL-PN}}(\boldsymbol{\theta}) &amp;= \mathbb{E}_{p(\boldsymbol{x})} \Big [ \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\bar{\beta}}) \big ) \Big ] \\ &amp;= \mathbb{E}_{p(\boldsymbol{x})} \Big [ \underbrace{\mathbb{E}_{p( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta})} \big [- \log \textrm{Dir} \big ( \boldsymbol{\pi} \vert \boldsymbol{\bar{\beta}}) \big ]}_\text{Reverse Cross-Entropy} - \underbrace{\mathcal{H} \big [ \textrm{Dir} \big (\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta}) \big ] \Big ]}_\text{Differential Entropy} \label{eq:reverse-kl} \end{align} <p>Since the differential entropy enters the loss with a negative sign, this term always pushes towards flatter distributions. Hence, $$\mathcal{L}_{\textrm{RKL-PN}}$$ relies only on the reverse cross-entropy to produce sharper distributions.
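</p>

<p>This decomposition can be verified numerically. The sketch below (hypothetical concentration parameters, SciPy for the special functions) checks that the closed-form KL divergence between two Dirichlets equals the reverse cross-entropy minus the differential entropy:</p>

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import dirichlet

# Hypothetical parameters: network output alpha, sharp target with B = 100.
alpha = np.array([5.0, 2.0, 1.5])
beta = np.array([101.0, 1.0, 1.0])

def log_beta_fn(a):
    """log of the multivariate Beta function B(a)."""
    return gammaln(a).sum() - gammaln(a.sum())

# E_{Dir(alpha)}[log pi_c] = digamma(alpha_c) - digamma(alpha_0)
e_log_pi = digamma(alpha) - digamma(alpha.sum())

# Reverse cross-entropy: E_{Dir(alpha)}[-log Dir(pi | beta)]
reverse_ce = log_beta_fn(beta) - ((beta - 1.0) * e_log_pi).sum()
# Differential entropy of Dir(alpha)
diff_entropy = dirichlet.entropy(alpha)
# Closed-form KL(Dir(alpha) || Dir(beta))
kl_closed = log_beta_fn(beta) - log_beta_fn(alpha) + ((alpha - beta) * e_log_pi).sum()

assert np.isclose(kl_closed, reverse_ce - diff_entropy)
```

<p>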
This latter term can be derived as:</p> \begin{align} \mathbb{E}_{p( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta})} \big [- \log \textrm{Dir} \big ( \boldsymbol{\pi} \vert \bar{\boldsymbol{\beta}}) \big ] &amp;= \sum_{c=1}^C P(y=c \vert \boldsymbol{x}) (1 + B - 1)(\psi(\alpha_0) - \psi(\alpha_c)) \\ &amp;= B\cdot \psi(\alpha_0) - \sum_{c=1}^C B \cdot P(y=c \vert \boldsymbol{x})\psi(\alpha_c) \label{eq:rce} \end{align} <p>We can see in Eq.(\ref{eq:rce}) that the reverse cross-entropy term maximizes $\psi(\alpha_c)$ for each class $c$ with the factor $B \cdot P(y=c \vert \boldsymbol{x})$, while minimizing $\psi(\alpha_0)$ with the factor $B$. For a sample with high aleatoric uncertainty, the reverse cross-entropy will thus lead to smaller concentration parameters $\boldsymbol{\alpha}$ than for a confident sample.</p> <p>The proposed solution is to <strong>force OOD samples to have even lower concentration parameters</strong>, ideally $$\boldsymbol{\alpha}_{\textrm{out}} \rightarrow 0$$.</p> <h3 id="learning-1">Learning</h3> <p>Looking at the reverse cross-entropy term in Eq.(\ref{eq:rce}), a first solution could be to modify the target $$\beta_{\text{out}}$$ within the reverse KL-divergence. Currently, $$\beta_{\text{out}}=1$$ for OOD samples. Setting $$\beta_{\text{out}}&gt;1$$ would minimize $\alpha_0$ but also maximize individual concentration parameters $\alpha_c$. Conversely, $\beta_{\text{out}} \in [0,1)$ maximizes $\alpha_0$ while minimizing the $\alpha_c$’s.
Either choice of $\beta_{\text{out}}$ may lead to uncontrolled values.</p> <p>Instead, <a class="citation" href="#maximize-representation-gap2020">(Nandy et al., 2020)</a> propose a new loss for Dirichlet Networks based on the usual cross-entropy loss and an <strong>explicit regularization on the concentration parameters</strong>:</p> \begin{align} \mathcal{L}_{\textrm{Max-PN}}(\boldsymbol{\theta}) = &amp;\mathbb{E}_{p(\boldsymbol{x}, y \vert i)} \Big [ - \log p(y \vert \boldsymbol{x}, \boldsymbol{\theta}) - \frac{\lambda_{\textrm{i}}}{C} \sum_{c=1}^C \sigma(\alpha_c) \Big ] \\ &amp;+ \gamma \cdot \mathbb{E}_{p(\boldsymbol{x},y \vert o)} \Big [ - \frac{1}{C} \sum_{c=1}^C \log P(y=c \vert \boldsymbol{x}, \boldsymbol{\theta}) - \frac{\lambda_{\textrm{o}}}{C} \sum_{c=1}^C \sigma(\alpha_c) \Big ] \nonumber \end{align} <p>where $$\sigma$$ is the sigmoid function, $$\lambda_{\textrm{i}}$$ and $$\lambda_{\textrm{o}}$$ are hyperparameters controlling the precision of the respective output distributions, and $$\gamma = p(o)/p(i)$$ balances the loss values between the in- and out-domain distributions. Note that the term $$- \frac{1}{C} \sum_{c=1}^C \log P(y=c \vert \boldsymbol{x}, \boldsymbol{\theta})$$ corresponds to minimizing the cross-entropy with a uniform distribution.</p> <p>To enforce negative values for OOD concentration parameters, the authors chose $$\lambda_{\textrm{o}} = 1/C - 1/2 &lt;0$$. This moves the probability density towards the edges of the simplex, producing extremely sharp multi-modal distributions.
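</p>

<p>A per-sample sketch of $$\mathcal{L}_{\textrm{Max-PN}}$$ in NumPy. This is a toy version under the assumption that the sigmoid regularizer is applied directly to the raw network outputs; all names and values are hypothetical:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_pn_terms(logits, y, lam_in=0.5, lam_out=None):
    """Per-sample in-domain and out-domain terms of the Max-PN loss (a sketch).
    `logits` are the raw network outputs f_c(x, theta)."""
    C = logits.size
    if lam_out is None:
        lam_out = 1.0 / C - 0.5                    # negative for C >= 3
    log_p = logits - np.log(np.exp(logits).sum())  # log softmax
    reg = sigmoid(logits).sum() / C                # (1/C) sum_c sigma(.)
    in_term = -log_p[y] - lam_in * reg             # cross-entropy, reg rewarded
    out_term = -log_p.mean() - lam_out * reg       # uniform CE, reg penalized
    return in_term, out_term
```

<p>With $$\lambda_{\textrm{o}} &lt; 0$$, the out-domain term penalizes large outputs, which is exactly the mechanism pushing OOD concentration parameters down.</p>

<p>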
They set $$\lambda_{\textrm{i}} = 1/2$$ for in-domain examples.</p> <blockquote> <p>Note that choosing $$\lambda_{\textrm{i}} = \lambda_{\textrm{o}} = 0$$ leads to the same loss as the non-Bayesian outlier exposure framework proposed by <a class="citation" href="#hendrycks2019oe">(Hendrycks et al., 2019)</a>.</p> </blockquote> <h3 id="measuring-uncertainty-1">Measuring uncertainty</h3> <p>As they explicitly regularize the logits, <a class="citation" href="#maximize-representation-gap2020">(Nandy et al., 2020)</a> use $$\alpha_0 = \sum_c \alpha_c$$, the sum of the concentration parameters of the posterior distribution. In parallel, they also mention the <em>differential entropy</em> $$\mathcal{H} \big [ p( \boldsymbol{\pi} \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) \big ]$$, which is the entropy of the posterior distribution over probabilities:</p> $\begin{equation} \mathcal{H} \big [ p( \boldsymbol{\pi} \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) \big ] = - \int p( \boldsymbol{\pi} \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) \log p( \boldsymbol{\pi} \vert \boldsymbol{x}, \hat{\boldsymbol{\theta}}) d\boldsymbol{\pi} \end{equation}$ <p>In particular, the differential entropy is equivalent (up to sign and an additive constant) to the KL-divergence between the posterior distribution and a uniform Dirichlet $$\mathrm{Dir}(\boldsymbol{\pi} \vert \mathbf{u})$$ where $$\forall c, u_c = 1$$.</p> <p>They measure OOD detection with mutual information or $\alpha_0$ and misclassification detection with MCP.</p> <h2 id="evidential-networks-">Evidential Networks <a name="evidential-networks"></a></h2> <p>From now on, we will assume there is <strong>no OOD data available in training</strong> for the rest of the approaches presented below.</p> <p>In a concurrent work to <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a>, <a class="citation" href="#sensoy2018">(Sensoy et al., 2018)</a> also develop a Dirichlet-based model for neural networks based on the <strong>subjective
logic</strong> <a class="citation" href="#josan2016sublogic">(Josang, 2016)</a> framework. Subjective logic formalizes the Dempster-Shafer theory’s notion of belief assignments over a frame of discernment as a Dirichlet distribution.</p> <blockquote> <p>The Dempster-Shafer theory of evidence <a class="citation" href="#dempster2008">(Dempster, 2008)</a> is a generalization of the Bayesian theory to subjective probabilities. It assigns belief masses to subsets of a frame of discernment, which denotes the set of exclusive possible states, <em>e.g.</em> possible class labels for a sample.</p> </blockquote> <p>In practice, it boils down to accounting for an overall uncertainty mass $u$ added to the class beliefs $$b_c$$:</p> $\begin{equation} u + \sum_{c=1}^C b_c = 1 \end{equation}$ <p>where $u \geq 0$ and $$\forall c \in \mathcal{Y}, b_c \geq 0$$. A belief mass $$b_c$$ for a singleton $c$ is computed using the evidence for the singleton. Let $$e_c \geq 0$$ be the evidence derived for the $c^{th}$ singleton, then the belief $$b_c$$ and the uncertainty $u$ are computed as:</p> $\begin{equation} b_c = \frac{e_c}{S} \quad \textrm{and} \quad u = \frac{C}{S} \end{equation}$ <p>where $$S = \sum_{c=1}^C (e_c +1)$$. Note that the uncertainty is inversely proportional to the total evidence.</p> <p>The link to the Dirichlet distribution is as follows: the concentration parameters of the distribution $$p(\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta})$$ correspond to the evidence via $$\alpha_c = e_c + 1$$.
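</p>

<p>These bookkeeping identities are straightforward to check (hypothetical evidence values):</p>

```python
import numpy as np

# Subjective-logic quantities from per-class evidence (hypothetical values).
evidence = np.array([8.0, 1.0, 0.0])      # e_c >= 0 for C = 3 classes
C = evidence.size
S = (evidence + 1.0).sum()                # S = sum_c (e_c + 1) = alpha_0
belief = evidence / S                     # b_c = e_c / S
u = C / S                                 # uncertainty mass
assert np.isclose(u + belief.sum(), 1.0)  # u + sum_c b_c = 1
```

<p>With a total of $S=12$ here, a quarter of the mass remains as uncertainty $u$; gathering more evidence shrinks $u$ towards zero.</p>

<p>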
Then, we obtain that $$S = \alpha_0$$, which represents the spread of the distribution.</p> <p><a class="citation" href="#sensoy2018">(Sensoy et al., 2018)</a> propose to model the concentration parameters by a neural network output, hence:</p> $\begin{equation} \boldsymbol{\alpha} = \textrm{ReLU} \big ( f(\boldsymbol{x}, \boldsymbol{\theta}) \big ) + 1 \end{equation}$ <p>In contrast to previous work, their model replaces the softmax layer with a ReLU activation to ensure non-negative outputs.</p> <h3 id="learning-2">Learning</h3> <p>The authors propose to train their Evidential Neural Network by <strong>minimizing the Bayes risk of the MSE loss with respect to the ‘class predictor’</strong>:</p> \begin{align} \mathcal{L}_{\textrm{ENN}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [\int \vert \vert \boldsymbol{y} - \boldsymbol{\pi} \vert \vert^2 \cdot p(\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta}) d\boldsymbol{\pi} + \lambda_t \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \tilde{\boldsymbol{\alpha}} ) ~\vert \vert~ \textrm{Dir} (\boldsymbol{\pi} \vert \mathbf{u}) \big ) \Big ] \end{align} <p>Added to the Bayes risk, they also incorporate a KL-divergence regularization term which penalizes evidence assigned to incorrect classes. $\boldsymbol{y}$ denotes the one-hot representation of $y$, $$\textrm{Dir} (\boldsymbol{\pi} \vert \mathbf{u})$$ is the uniform Dirichlet distribution and $$\tilde{\boldsymbol{\alpha}} = \boldsymbol{y} + (1- \boldsymbol{y}) \odot \boldsymbol{\alpha}$$ are the Dirichlet parameters after removal of the non-misleading evidence from predicted parameters $\boldsymbol{\alpha}$.
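</p>

<p>The Bayes risk of the MSE loss under $$\textrm{Dir}(\boldsymbol{\alpha})$$ has a simple closed form, sketched here for a single sample (function name and values hypothetical):</p>

```python
import numpy as np

def evidential_mse(alpha, y):
    """Closed-form Bayes risk of the MSE loss under Dir(alpha)
    for one sample with true class index y (a sketch)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    onehot = np.eye(alpha.size)[y]
    err = (onehot - alpha / a0) ** 2                  # squared bias term
    var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))   # Var[pi_c] under Dir(alpha)
    return (err + var).sum()
```

<p>A quick Monte Carlo check against samples from $$\textrm{Dir}(\boldsymbol{\alpha})$$ confirms the closed form.</p>

<p>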
In the paper, $$\lambda_t = \min (1, t/10) \in [0, 1]$$ is an annealing coefficient, where $t$ is the index of the current training epoch.</p> <p>In particular, they also provide some interesting theoretical properties of the Bayes risk minimization with the MSE loss, thanks to the variance identity:</p> \begin{align} \int \vert \vert \boldsymbol{y} - \boldsymbol{\pi} \vert \vert^2 \cdot p(\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta}) d\boldsymbol{\pi} &amp;= \mathbb{E}_{p( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta})} \Big [(\boldsymbol{y} - \boldsymbol{\pi})^2 \Big ] \\ &amp;= \mathbb{E}_{p( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta})} \Big [\boldsymbol{y} - \boldsymbol{\pi} \Big ]^2 + \textrm{Var} \Big [ \boldsymbol{y} - \boldsymbol{\pi} \Big ] \\ &amp;= \sum_{c=1}^C \Big ( (y_c - \alpha_c / \alpha_0)^2 + \frac{\alpha_c(\alpha_0 - \alpha_c)}{\alpha_0^2(\alpha_0 + 1)} \Big ) \end{align} <p>The derivation decomposes the loss into first- and second-moment terms, highlighting that it aims to achieve the joint goals of <strong>minimizing the prediction error and the variance of the Dirichlet distribution output by the neural network</strong>.</p> <h3 id="measuring-uncertainty-2">Measuring uncertainty</h3> <p>Without further justification, <a class="citation" href="#sensoy2018">(Sensoy et al., 2018)</a> use the predictive entropy $$\mathcal{H} \big [ y \vert \boldsymbol{x^*}; \boldsymbol{\theta} \big ]$$ as an uncertainty measure, to be consistent with other approaches used in the literature. Experiments include OOD detection and adversarial robustness.</p> <h2 id="generative-evidential--networks-">Generative Evidential Networks <a name="generative-evidential"></a></h2> <p>While evidential networks account for aleatoric and epistemic uncertainty separately, they fail to induce the desired behavior for out-of-distribution samples.
There is no constraint on the out-distribution domain, and models could still derive large amounts of evidence and become overconfident in their predictions for OOD samples.</p> <p>To alleviate this issue, <a class="citation" href="#sensoy2020">(Sensoy et al., 2020)</a> extend their previous work and propose to <strong>synthesize out-of-distribution samples close to training samples</strong> thanks to a generative model. Then, they learn a classifier using both types of samples, based on <strong><em>implicit</em> density modeling</strong> to account for out-of-distribution samples during training.</p> <h3 id="implicit-density-modeling">Implicit density modeling</h3> <p>A convenient way to describe the density of samples from a class $c$ is to describe it relative to the density of some other reference data. By using the same reference data for all classes in the training set, one obtains comparable density estimates across classes. We reformulate the ratio between the densities $$p(\boldsymbol{x} \vert y=c)$$ and $$p(\boldsymbol{x} \vert o)$$ as:</p> $\begin{equation} \frac{p(\boldsymbol{x} \vert y=c)}{p(\boldsymbol{x} \vert o)} = \frac{P(y=c \vert \boldsymbol{x})}{P(o \vert \boldsymbol{x})} \frac{P(o)}{P(y=c)} \label{eq:density-ratio} \end{equation}$ <p>where $$\frac{P(o)}{P(y=c)}$$ can be approximated from the empirical counts of class-$c$ samples and of out-of-distribution training samples.</p> <p>As shown in Eq.(\ref{eq:density-ratio}), one can approximate the log density ratio $$\log \frac{p(\boldsymbol{x} \vert y=c)}{p(\boldsymbol{x} \vert o)}$$ as the logit output of a binary classifier <a class="citation" href="#implicit2017">(Lakshminarayanan &amp; Mohamed, 2017)</a>, which is trained to discriminate between samples from $$p(\boldsymbol{x} \vert y=c)$$ and $$p(\boldsymbol{x} \vert o)$$.
Hence, along with the Dirichlet framework, each output $$f_c(\boldsymbol{x}, \boldsymbol{\theta})$$ of a neural network classifier now also aims to approximate the log density ratio.</p> <p>Concentration parameters are defined as $$\boldsymbol{\alpha} = e^{f(\boldsymbol{x}, \boldsymbol{\theta})} + 1$$. If a sample $\boldsymbol{x}$ comes from the out-domain, then the density ratio should be close to zero, meaning almost zero evidence generated by the network for that sample.</p> <h3 id="learning-3">Learning</h3> <p>The authors use the <strong>Bernoulli loss</strong> to train their network with a <strong>regularization term for misclassified samples</strong>:</p> \begin{align} \mathcal{L}_{\textrm{GEN}}(\boldsymbol{\theta}) = - \sum_{c=1}^C \Big (\mathbb{E}_{p(\boldsymbol{x} \vert c, i)} \big [&amp;\log \sigma ( f_c(\boldsymbol{x}, \boldsymbol{\theta})) \big ] + \mathbb{E}_{p(\boldsymbol{x} \vert o)} \big [\log \big ( 1 - \sigma ( f_c(\boldsymbol{x}, \boldsymbol{\theta})) \big ) \big ] \Big ) \\ &amp;+ \lambda \cdot \mathbb{E}_{p(\boldsymbol{x},y \vert i)} \Big [ \textrm{KL} \big ( \textrm{Dir} (\boldsymbol{\pi}_{-y} \vert \boldsymbol{\alpha}_{-y} )~\vert \vert~ \textrm{Dir} (\boldsymbol{\pi}_{-y} \vert \mathbf{u}) \big ) \Big ] \nonumber \end{align} <p>where $\sigma$ is the sigmoid function, and $$\boldsymbol{\pi}_{-y}$$ and $$\boldsymbol{\alpha}_{-y}$$ refer to the vectors of probabilities $$\pi_c$$ and concentration parameters $$\alpha_c$$ restricted to classes $$c \neq y$$. The regularization term aims to push the concentration parameters of all classes $c \neq y$ towards uniform by minimizing a KL-divergence with the uniform Dirichlet distribution $\mathbf{u}$.</p> <p>Finally, $\lambda$ is a sample-level hyperparameter controlling the weight of the regularization term. The authors define $\lambda = 1 - \alpha_y / \alpha_0$, which is the expected probability of misclassification.
Hence, when used as a weight of the KL-divergence term, it enables a <em>learned loss attenuation</em>: as the aleatoric uncertainty decreases, the weight of the regularization term decreases as well.</p> <blockquote> <p>Note that this contrastive training is inspired by noise-contrastive estimation <a class="citation" href="#gutmann2012noise">(Gutmann &amp; Hyvärinen, 2012)</a>, with the difference here that noisy data are generalized to out-of-distribution samples.</p> </blockquote> <h3 id="generating-out-of-distribution-samples">Generating out-of-distribution samples</h3> <p>To synthesize out-of-distribution samples, <a class="citation" href="#sensoy2020">(Sensoy et al., 2020)</a> rely on the <strong>latent space of a variational autoencoder</strong> (VAE), where they perturb in-domain sample representations with a multivariate Gaussian distribution. More precisely, for each $\boldsymbol{x}$ in the training dataset, they sample a latent point $\boldsymbol{z}$ from the latent space distribution learned by a VAE and perturb it by $$\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, G(\boldsymbol{z}))$$ where $G(\cdot)$ is a generative adversarial network (GAN) with non-negative output, adversarially trained against two discriminators (one that acts in the latent space, the other one in the input space). The VAE, generator and discriminators are iteratively trained until convergence.</p> <h3 id="measuring-uncertainty-3">Measuring uncertainty</h3> <p>As with Evidential Networks, <a class="citation" href="#sensoy2020">(Sensoy et al., 2020)</a> use the predictive entropy $$\mathcal{H} \big [ y \vert \boldsymbol{x^*}; \boldsymbol{\theta} \big ]$$ as an uncertainty measure.
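</p>

<p>A minimal sketch of this measure, computed from the Dirichlet mean (hypothetical parameters):</p>

```python
import numpy as np

def predictive_entropy(alpha):
    """Entropy of the expected categorical E[pi] = alpha / alpha_0 (a sketch)."""
    p = np.asarray(alpha, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p)).sum()

# Flat Dirichlet (OOD-like) vs sharp corner (confident): hypothetical parameters.
h_flat = predictive_entropy([1.0, 1.0, 1.0])     # = log 3, maximal
h_sharp = predictive_entropy([50.0, 1.0, 1.0])
assert h_flat > h_sharp
```

<p>Note that this measure is maximal both for a flat Dirichlet (<em>Fig c.</em>) and for a sharp distribution centered on the simplex (<em>Fig b.</em>), which is precisely why the decomposition into aleatoric and epistemic terms discussed earlier is useful.</p>

<p>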
Experiments include misclassification detection, OOD detection and adversarial robustness.</p> <h2 id="variational-inference-for-dirichlet-networks-">Variational inference for Dirichlet Networks <a name="variational-inference"></a></h2> <p>Using the <strong>Bayesian principle</strong>, the true posterior distribution over categorical probabilities for a sample $(\boldsymbol{x}, y)$ can be obtained by:</p> $\begin{equation} p(\boldsymbol{\pi} \vert \boldsymbol{x}, y) \propto p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) p(\boldsymbol{\pi} \vert \boldsymbol{x}) \end{equation}$ <p>Just as in the preliminaries for a Dirichlet-multinomial model, we define the prior $$p(\boldsymbol{\pi} \vert \boldsymbol{x})$$ as a Dirichlet distribution with concentration parameters $\boldsymbol{\beta}$, which is conjugate to the categorical likelihood. Hence, we have the following posterior given the dataset $\mathcal{D}$:</p> $\begin{equation} p(\boldsymbol{\pi} \vert \boldsymbol{x},y) = \mathrm{Dir} \Big (\boldsymbol{\pi} \vert \beta_1 + \tilde{N}_1(\boldsymbol{x}),..., \beta_C + \tilde{N}_C(\boldsymbol{x}) \Big ) \label{eq:posterior_distribution} \end{equation}$ <p>where $$\tilde{N}_c(\boldsymbol{x})$$ now represents the empirical label frequency count at point $$\boldsymbol{x}$$. Again, we see that the target posterior mean is explicitly <em>smoothed</em> by the prior belief. However, when predicting for a new sample $\boldsymbol{x}^*$, we obviously don’t know its label frequency.</p> <p><a class="citation" href="#beingbayesian2020">(Joo et al., 2020)</a> propose to approximate the posterior distribution with a variational distribution $q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})$ modeled by the neural network’s output.
They choose $$q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})$$ to be a Dirichlet distribution whose concentration parameters are $$\boldsymbol{\alpha} = e^{f(\boldsymbol{x}, \boldsymbol{\theta})}$$.</p> <h3 id="learning-4">Learning</h3> <p>In variational inference, the goal is to bring the variational distribution $$q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})$$ closer to the true posterior distribution $$p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)$$. In a standard way, <a class="citation" href="#beingbayesian2020">(Joo et al., 2020)</a>, as well as <a class="citation" href="#vardir2019">(Chen et al., 2019)</a>, minimize the KL-divergence between the two distributions:</p> \begin{align} \mathrm{KL} \Big ( q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}, y) \Big ) &amp;= \int q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) \log \frac{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})}{p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)} d\boldsymbol{\pi} \nonumber \\ &amp;= \int q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) \log \frac{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) p(y \vert \boldsymbol{x})}{p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) p(\boldsymbol{\pi} \vert \boldsymbol{x})} d\boldsymbol{\pi} \nonumber \\ &amp;= \int - q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) d\boldsymbol{\pi} + \int q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) \log \frac{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})}{p(\boldsymbol{\pi} \vert \boldsymbol{x})} d\boldsymbol{\pi} + \log p(y \vert \boldsymbol{x}) \nonumber \\ &amp;= \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \Big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \Big ] + \mathrm{KL} \Big (q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \Big
) + \log p(y \vert \boldsymbol{x}) \label{eq:elbo} \end{align} <p>From Eq.(\ref{eq:elbo}), we can observe that minimizing the KL-divergence here is equivalent to maximizing the evidence lower bound (ELBO). The resulting loss function writes as:</p> $\begin{equation} \mathcal{L}_{\textrm{VI}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [ \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \big ] + \mathrm{KL} \big (q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ) \Big] \label{eq:variational_inference} \end{equation}$ <p>The first term can be further derived as $$\mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \big ] = \psi(\alpha_0) - \psi(\alpha_y)$$. Regarding the KL-divergence with the prior, the authors chose to set the concentration parameters to $\boldsymbol{\beta} = 1$.</p> <p>Now what’s interesting is that <strong>$$\mathcal{L}_{\textrm{VI}}(\boldsymbol{\theta})$$ actually corresponds to the reverse KL-divergence loss Eq.(\ref{eq:reverse-kl}) for in-domain samples!</strong></p> <blockquote> <p>We can show that given uniform concentration parameters, the KL-divergence between the variational distribution and the uniform prior distribution over probabilities is equal, up to an additive constant, to the negative differential entropy of the variational distribution:</p> $\begin{equation} \mathrm{KL} \big (q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ) = - \mathcal{H} \big [ q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ] - \log \Gamma(K) \end{equation}$ </blockquote> <p>While <a class="citation" href="#beingbayesian2020">(Joo et al., 2020)</a> actually propose the same loss as <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019)</a>, they do 
not act on pushing OOD samples to yield a flat distribution over the simplex. Hence, this work mostly boils down to a <strong>Bayesian smoothing</strong> of in-distribution samples, much like label smoothing.</p> <h3 id="measuring-uncertainty-4">Measuring uncertainty</h3> <p><a class="citation" href="#beingbayesian2020">(Joo et al., 2020)</a> do not really focus on choosing a good uncertainty measure and rely on softmax probabilities for the evaluation of confidence calibration and on predictive entropy for OOD detection.</p> <h2 id="posterior-networks-a-density-based-approach-">Posterior Networks, a Density-based Approach <a name="posterior-networks"></a></h2> <p>To ensure a high epistemic uncertainty on out-domain samples, <a class="citation" href="#postnetworks2020">(Charpentier et al., 2020)</a> use <strong>normalizing flows</strong> to learn the posterior distribution $$p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)$$ over Dirichlet parameters on a <strong>latent space</strong>. The idea is to assign high density to regions with many training examples, which forces low density elsewhere to fulfill the integration constraint.</p> <p>They map an input $\boldsymbol{x}$ onto a low-dimensional latent vector $$\boldsymbol{z} = f(\boldsymbol{x}, \boldsymbol{\theta}) \in \mathbb{R}^H$$ thanks to an encoder neural network. Then, they learn a normalized probability density $p(\boldsymbol{z} \vert c, \boldsymbol{\phi})$ per class on the latent space with a density estimator parametrized by $$\boldsymbol{\phi}$$ such as radial flows <a class="citation" href="#rezende15">(Rezende &amp; Mohamed, 2015)</a>. Finally, they compute the pseudo-observations of class $c$ at $\boldsymbol{z}$ as:</p> $\begin{equation} \tilde{N}_c (\boldsymbol{x}) = N_c \cdot p(\boldsymbol{z} \vert c, \boldsymbol{\phi}) = N \cdot p(\boldsymbol{z} \vert c, \boldsymbol{\phi}) \cdot P(y=c) \end{equation}$ <p>where $N_c$ is the number of ground-truth observations for class $c$ in the training dataset $\mathcal{D}$. 
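</p>

<p>A minimal numerical sketch of this pseudo-count construction (all densities and counts below are made-up, illustrative values; in the actual model the per-class densities come from a normalizing flow evaluated at the latent vector):</p>

```python
import numpy as np

# Hypothetical per-class latent densities p(z | c, phi) for one input,
# as would be evaluated by a normalizing flow at the encoded vector z
class_densities = np.array([0.02, 1.35, 0.01])  # illustrative values
N_c = np.array([5000, 4800, 5200])              # per-class training counts
beta = np.ones(3)                               # uniform Dirichlet prior

# Pseudo-observations N_c * p(z | c, phi) and posterior concentrations
pseudo_counts = N_c * class_densities
alpha = beta + pseudo_counts

# Posterior mean over class probabilities
posterior_mean = alpha / alpha.sum()
```

<p>Far from the training data, all class densities vanish, the pseudo-counts collapse, and the posterior mean falls back to the uniform prior.</p>

<p>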
Given this learned pseudo-count, we can compute the posterior $$p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)$$ with Eq.(\ref{eq:posterior_distribution}).</p> <blockquote> <p>Note that here $f$ is different from earlier as it denotes a feature extractor and not the entire neural network classifier.</p> </blockquote> <h3 id="analysis">Analysis</h3> <p>Let us look at the mean of the Dirichlet posterior for class probability $$\boldsymbol{\pi}_c$$:</p> $\begin{equation} \mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)} \big [\boldsymbol{\pi}_c \big ] = \frac{\beta_c + N \cdot P(y=c \vert \boldsymbol{z}, \boldsymbol{\phi}) \cdot p(\boldsymbol{z}, \boldsymbol{\phi})}{\sum_{c=1}^C \beta_c + N \cdot p(\boldsymbol{z}, \boldsymbol{\phi})} \label{eq:mean_posterior} \end{equation}$ <ul> <li>For in-distribution data, $$p(\boldsymbol{z}, \boldsymbol{\phi}) \rightarrow \infty$$, then $$\mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)} \big [\boldsymbol{\pi}_c \big ]$$ converges to the true categorical distribution $$P(y=c \vert \boldsymbol{z}, \boldsymbol{\phi})$$</li> <li>For out-of-distribution data, $$p(\boldsymbol{z}, \boldsymbol{\phi}) \rightarrow 0$$, then $$\mathbb{E}_{p(\boldsymbol{\pi} \vert \boldsymbol{x}, y)} \big [\boldsymbol{\pi}_c \big ]$$ converges to the flat prior distribution $[1/C,…,1/C]$ if we define a uniform prior $\boldsymbol{\beta} = \boldsymbol{1}$</li> </ul> <p>The figure below provides an illustration for three different inputs: $\boldsymbol{x^{(1)}}$ a correct prediction, $\boldsymbol{x^{(2)}}$ an ambiguous in-domain sample, and $\boldsymbol{x^{(3)}}$ an out-of-distribution sample.</p> <p style="text-align: center;"><img src="/images/posterior_networks.png" alt="posterior_networks" /></p> <p>From <a class="citation" href="#postnetworks2020">(Charpentier et al., 2020)</a>:</p> <blockquote> <p>Each input $\boldsymbol{x^{(i)}}$ is mapped to its latent vector $\boldsymbol{z^{(i)}}$. 
The normalizing flow component learns flexible density functions $$P(\boldsymbol{z} \vert y=c, \boldsymbol{\phi})$$, for which we evaluate their densities at the positions of the latent vectors $\boldsymbol{z^{(i)}}$. These densities are used to parameterize a Dirichlet distribution for each data point, as seen on the right hand side. Higher densities correspond to higher confidence in the Dirichlet distributions.</p> <p>We can observe that the out-of-distribution sample $\boldsymbol{x^{(3)}}$ is mapped to a point with (almost) no density, and hence its predicted Dirichlet distribution has very high epistemic uncertainty. On the other hand, $\boldsymbol{x^{(2)}}$ is an ambiguous example that could depict either the digit 0 or 6. This is reflected in its corresponding Dirichlet distribution, which has high aleatoric uncertainty (as the sample is ambiguous), but low epistemic uncertainty (since it is from the distribution of hand-drawn digits). The unambiguous sample $\boldsymbol{x^{(1)}}$ has low overall uncertainty.</p> </blockquote> <h3 id="learning-5">Learning</h3> <p>The training loss is similar to the ELBO loss of Eq.(\ref{eq:variational_inference}) used in variational inference, optimizing jointly over the neural network parameters $\boldsymbol{\theta}$ and the normalizing flow parameters $\boldsymbol{\phi}$:</p> $\begin{equation} \mathcal{L}_{\textrm{PostNet}}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [ \mathbb{E}_{q_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \big ] + \mathrm{KL} \big (q_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ) \Big] \end{equation}$ <h3 id="measuring-uncertainty-5">Measuring uncertainty</h3> <p><a class="citation" href="#postnetworks2020">(Charpentier et al., 2020)</a> evaluate many uncertainty tasks such as 
misclassification detection, confidence calibration, OOD detection and robustness to dataset shift. In each task, they use two different measures: one for aleatoric uncertainty and one for epistemic uncertainty:</p> <ul> <li><em>Misclassification detection</em>: MCP as aleatoric measure; $$\max_c \alpha_c$$ as epistemic measure, which corresponds to the logit of the predicted class,</li> <li><em>Confidence calibration</em>: they simply use the Brier score with the output of the neural network,</li> <li><em>OOD detection</em>: MCP as aleatoric measure; $\alpha_0$ as epistemic measure,</li> <li><em>Robustness to dataset shifts</em>: naturally, the accuracy on the shifted dataset.</li> </ul> <h2 id="summary-and-discussion-">Summary and Discussion <a name="summary"></a></h2> <p>In light of this thorough analysis of each approach, we can see that most of them actually define their loss in a similar way.</p> <table> <thead> <tr> <th>Method</th> <th style="text-align: left">Loss</th> <th style="text-align: center">$$\alpha$$-parametrization</th> <th style="text-align: center">OOD training data</th> </tr> </thead> <tbody> <tr> <td><strong>Prior Networks</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{RKL-PN}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x} \vert i)} \Big [\sum_{c=1}^C p(y=c \vert \boldsymbol{x}) \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(c)}_{\textrm{in}}) \big ) \Big ]$$ <br /> $$~~~\quad\quad\quad\quad\quad + \gamma \cdot \mathbb{E}_{p(\boldsymbol{x} \vert o)} \Big [\sum_{c=1}^C p(y=c \vert \boldsymbol{x}) \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta} ) ~\vert \vert~ \textrm{Dir} ( \boldsymbol{\pi} \vert \boldsymbol{\beta}^{(c)}_{\textrm{out}}) \big ) \Big ]$$</td> <td style="text-align: center">$$\alpha_c = e^{f_c(\boldsymbol{x}, \boldsymbol{\theta})}$$</td> 
<td style="text-align: center">Yes</td> </tr> <tr> <td><strong>Max-Gap Prior Networks</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{Max-PN}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x}, y \vert i)} \Big [ - \log p(y \vert \boldsymbol{x}, \boldsymbol{\theta}) - \frac{\lambda_{\textrm{i}}}{C} \sum_{c=1}^C \sigma(\alpha_c) \Big ]$$ <br /> $$\quad\quad\quad\quad\quad\quad\quad + \gamma \cdot \mathbb{E}_{p(\boldsymbol{x}, y \vert o)} \Big [ \mathcal{H} \big [p(y \vert \boldsymbol{x}, \boldsymbol{\theta}) \big ] - \frac{\lambda_{\textrm{o}}}{C} \sum_{c=1}^C \sigma(\alpha_c) \Big ]$$</td> <td style="text-align: center">$$\alpha_c = e^{f_c(\boldsymbol{x}, \boldsymbol{\theta})}$$</td> <td style="text-align: center">Yes</td> </tr> <tr> <td><strong>Evidential Networks</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{ENN}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [\int \vert \vert \boldsymbol{y} - \boldsymbol{\pi} \vert \vert^2 \cdot p(\boldsymbol{\pi} \vert \boldsymbol{x}, \boldsymbol{\theta}) d\boldsymbol{\pi} + \lambda_t \cdot \textrm{KL} \big ( \textrm{Dir} ( \boldsymbol{\pi} \vert \tilde{\boldsymbol{\alpha}} ) ~\vert \vert~ \textrm{Dir} (\boldsymbol{\pi} \vert \mathbf{u}) \big ) \Big ]$$</td> <td style="text-align: center">$$\alpha_c = \textrm{ReLU} \big ( f_c(\boldsymbol{x}, \boldsymbol{\theta}) \big ) + 1$$</td> <td style="text-align: center">No</td> </tr> <tr> <td><strong>Generative Evidential Networks</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{GEN}}(\boldsymbol{\theta}) = - \sum_{c=1}^C \Big (\mathbb{E}_{p(\boldsymbol{x} \vert c, i)} \big [\log \sigma ( f_c(\boldsymbol{x}, \boldsymbol{\theta})) \big ] + \mathbb{E}_{p(\boldsymbol{x} \vert o)} \big [\log \big ( 1 - \sigma ( f_c(\boldsymbol{x}, \boldsymbol{\theta})) \big ) \big ] \Big )$$ <br /> $$\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad+ \lambda \cdot \mathbb{E}_{p(\boldsymbol{x},y \vert i)} \Big [ \textrm{KL} \big ( 
\textrm{Dir} (\boldsymbol{\pi}_{-y} \vert \boldsymbol{\alpha}_{-y} )~\vert \vert~ \textrm{Dir} (\boldsymbol{\pi}_{-y} \vert \mathbf{u}) \big ) \Big ]$$</td> <td style="text-align: center">$$\boldsymbol{\alpha} = e^{f(\boldsymbol{x}, \boldsymbol{\theta})} + 1$$</td> <td style="text-align: center">No</td> </tr> <tr> <td><strong>Variational Inference</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{VI}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [ \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \big ] + \mathrm{KL} \big (q_{\boldsymbol{\theta}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ) \Big]$$</td> <td style="text-align: center">$$\alpha_c = e^{f_c(\boldsymbol{x}, \boldsymbol{\theta})}$$</td> <td style="text-align: center">No</td> </tr> <tr> <td><strong>Posterior Networks</strong></td> <td style="text-align: left">$$\mathcal{L}_{\textrm{PostNet}}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \mathbb{E}_{p(\boldsymbol{x}, y)} \Big [ \mathbb{E}_{q_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{\pi} \vert \boldsymbol{x})} \big [ - \log p(y \vert \boldsymbol{\pi}, \boldsymbol{x}) \big ] + \mathrm{KL} \big (q_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\boldsymbol{\pi} \vert \boldsymbol{x})~\vert \vert~p(\boldsymbol{\pi} \vert \boldsymbol{x}) \big ) \Big]$$</td> <td style="text-align: center">$$\alpha_c = \textrm{ReLU} \big ( f_c(\boldsymbol{x}, \boldsymbol{\theta}) \big ) + 1$$</td> <td style="text-align: center">No</td> </tr> </tbody> </table> <p>As already pointed out in <a href="#variational-inference">Section 7</a>, the ELBO loss $$\mathcal{L}_{\textrm{VI}}$$ and the in-domain loss term in $$\mathcal{L}_{\textrm{RKL-PN}}$$ are actually similar. 
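</p>

<p>As a rough single-sample sketch of this shared objective (<code>elbo_style_loss</code> is a hypothetical helper; it uses the Dirichlet identity $\mathbb{E}_{\textrm{Dir}(\boldsymbol{\alpha})}[-\log \pi_y] = \psi(\alpha_0) - \psi(\alpha_y)$ and the closed-form KL-divergence to a uniform Dirichlet prior):</p>

```python
import numpy as np
from scipy.special import digamma, gammaln

def elbo_style_loss(logits, y):
    """Single-sample sketch of the shared VI / PostNet objective:
    expected NLL under Dir(alpha) plus KL(Dir(alpha) || Dir(1))."""
    alpha = np.exp(logits)               # alpha_c = exp(f_c(x, theta))
    alpha0, K = alpha.sum(), len(alpha)
    # E_{Dir(alpha)}[-log pi_y] = psi(alpha_0) - psi(alpha_y)
    expected_nll = digamma(alpha0) - digamma(alpha[y])
    # Closed-form KL(Dir(alpha) || Dir(1)) against the uniform prior
    kl = (gammaln(alpha0) - gammaln(alpha).sum() - gammaln(K)
          + ((alpha - 1.0) * (digamma(alpha) - digamma(alpha0))).sum())
    return expected_nll + kl
```

<p>The KL term vanishes for flat logits and grows as the predicted Dirichlet sharpens, which is exactly the Bayesian-smoothing effect discussed above.</p>

<p>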
Posterior networks also use a KL-divergence minimization loss $$\mathcal{L}_{\textrm{PostNet}}$$, with the specificity of also optimizing the normalizing flow parameters.</p> <h3 id="rightarrow-how-to-induce-such-desired-behavior-when-training-dirichlet-networks">$$\Rightarrow$$ How to induce such desired behavior when training Dirichlet networks?</h3> <p>Going back to our first motivational question, we can distinguish two types of approaches:</p> <ol> <li><strong>explicitly force low concentration parameters for out-of-distribution samples</strong>, either by specifying a uniform target <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019)</a> or by adding an explicit regularization on logits in the training loss <a class="citation" href="#maximize-representation-gap2020">(Nandy et al., 2020)</a>;</li> <li><strong>incorporate density modeling to account for out-of-distribution samples during training</strong>, either implicitly, as in noise contrastive estimation <a class="citation" href="#sensoy2020">(Sensoy et al., 2020)</a>, or explicitly with a density estimator based on normalizing flows <a class="citation" href="#postnetworks2020">(Charpentier et al., 2020)</a>.</li> </ol> <blockquote> <p>Neither <a class="citation" href="#sensoy2018">(Sensoy et al., 2018)</a> nor <a class="citation" href="#beingbayesian2020">(Joo et al., 2020)</a> accounts for the existence of out-domain samples in their framework. 
It is unlikely their method will accurately predict a high epistemic uncertainty for such inputs.</p> </blockquote> <h3 id="rightarrow-what-measure-should-we-use-for-each-type-of-uncertainty">$$\Rightarrow$$ What measure should we use for each type of uncertainty?</h3> <p>Apart from the decomposition proposed by <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a> of uncertainty into total, aleatoric, and epistemic components, the latter obtainable via mutual information, there has not been much work on defining a proper uncertainty measure. Many approaches rely simply on MCP or the predictive entropy, as their training losses often induce out-of-distribution inputs to have either small logits or high entropy.</p> <p>However, in light of the previous decomposition, it theoretically does not seem appropriate to use MCP or the predictive entropy to measure OOD detection or misclassification detection. Indeed, these metrics correspond to measuring the total uncertainty, while the evaluated tasks involve either aleatoric or epistemic uncertainty.</p> <p>Nevertheless, the correct results reported in papers may be explained as follows. We expect misclassifications to have high aleatoric uncertainty. As correct predictions and misclassifications are both in-domain samples, we may also expect them to have negligible epistemic uncertainty. This implies that a measure of total uncertainty will mostly be influenced by aleatoric uncertainty. Hence, as we do not include OOD samples when evaluating misclassification detection, a measure of total uncertainty will be able to perform correctly. Regarding OOD detection, the number of in-domain misclassifications is often very low due to high predictive performance. Hence, confusing misclassifications with OOD samples will have a nearly negligible impact on average scores.</p> <h2 id="references">References</h2> <ol class="bibliography"><li><span id="murphy2013machine">K. P. Murphy. (2013). 
<i>Machine learning : A Probabilistic Perspective</i>. MIT Press.</span></li> <li><span id="deepensembles2017">B. Lakshminarayanan, A. Pritzel, &amp; C. Blundell. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="mcdropout2016">Y. Gal, &amp; Z. Ghahramani. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In <i>Proceedings of the International Conference on Machine Learning</i>.</span></li> <li><span id="malinin2018">A. Malinin, &amp; M. Gales. (2018). Predictive Uncertainty Estimation via Prior Networks. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="sensoy2018">M. Sensoy, L. Kaplan, &amp; M. Kandemir. (2018). Evidential Deep Learning to Quantify Classification Uncertainty. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="malinin2019">A. Malinin, &amp; M. Gales. (2019). Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="vardir2019">W. Chen, Y. Shen, W. Wang, &amp; H. Jin. (2019). A Variational Dirichlet Framework for Out-of-Distribution Detection. <i>ArXiv-Preprint</i>.</span></li> <li><span id="sensoy2020">M. Sensoy, L. Kaplan, F. Cerutti, &amp; M. Saleki. (2020). Uncertainty-Aware Deep Classifiers using Generative Models. In <i>Proceedings of the AAAI Conference on Artificial Intelligence</i>.</span></li> <li><span id="beingbayesian2020">T. Joo, U. Chung, &amp; M.-G. Seo. (2020). Being Bayesian about Categorical Probability. In <i>Proceedings of the International Conference on Machine Learning</i>.</span></li> <li><span id="maximize-representation-gap2020">J. Nandy, W. Hsu, &amp; M.-L. Lee. (2020). Towards Maximizing the Representation Gap between In-Domain and Out-of-Distribution Examples. 
In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="postnetworks2020">B. Charpentier, D. Zügner, &amp; S. Günnemann. (2020). Posterior Network: Uncertainty Estimation without OOD Samples via Density-Based Pseudo-Counts. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="depeweg2018decomposition">S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, &amp; S. Udluft. (2018). Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. In <i>Proceedings of the International Conference on Machine Learning</i>.</span></li> <li><span id="hendrycks2019oe">D. Hendrycks, M. Mazeika, &amp; T. Dietterich. (2019). Deep Anomaly Detection with Outlier Exposure. In <i>Proceedings of the International Conference on Learning Representations</i>.</span></li> <li><span id="josan2016sublogic">A. Josang. (2016). <i>Subjective Logic: A Formalism for Reasoning Under Uncertainty</i>. Springer.</span></li> <li><span id="dempster2008">A. P. Dempster. (2008). <i>A Generalization of Bayesian Inference</i>. Springer.</span></li> <li><span id="implicit2017">B. Lakshminarayanan, &amp; S. Mohamed. (2017). Learning in Implicit Generative Models. <i>ArXiv-Preprint</i>.</span></li> <li><span id="gutmann2012noise">M. U. Gutmann, &amp; A. Hyvärinen. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. In <i>Journal of Machine Learning Research</i>.</span></li> <li><span id="rezende15">D. Rezende, &amp; S. Mohamed. (2015). Variational Inference with Normalizing Flows. In <i>Proceedings of the International Conference on Machine Learning</i>.</span></li></ol>Charles Corbièrecharles.corbiere[at]valeo.comIn the past few years, a variety of Dirichlet-based methods for classification have emerged in the machine learning community. 
Before going into a detailed review of these approaches, we introduce the Dirichlet-multinomial model, a classic problem in machine learning which will help to better understand the intuition behind the method.Decomposition of KL_Pred2020-10-26T00:00:00-07:002020-10-26T00:00:00-07:00https://chcorbi.github.io/posts/2020/10/kl-decomposition<h2 id="notations">Notations</h2> <p>Let us consider a training dataset $\mathcal{D}$ consisting of $N$ <em>i.i.d.</em> samples,</p> $\begin{equation*} \mathcal{D}= \{ (\boldsymbol{x}_i, y^*_i) \}_{i=1}^N \in (\mathcal{X} \times \mathcal{Y})^N \end{equation*}$ <p>where $\mathcal{X}$ represents the input space and $\mathcal{Y}=\{1,\ldots,K\}$ is the set of labels. <br /> Samples drawn from $\mathcal{D}$ follow an unknown conditional probability distribution $p(\mathbf{y} \vert \mathbf{x})$ where $\mathbf{x}$ and $\mathbf{y}$ are random variables over the input space and the label space respectively.</p> <p>Let $f^{\boldsymbol{\theta}}: \mathcal{X} \rightarrow \mathcal{X'}$ be a neural network (NN) parametrized by $\boldsymbol{\theta}$ where $\mathcal{X'} = \mathbb{R}^K$ is the logit space. We consider categorical probabilities over labels as a random variable $\mathbf{z}$.</p> <p>Following <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a>, a NN explicitly parametrizes a distribution $p_{\theta}(\mathbf{z} \vert \mathbf{x})$ over categorical probabilities on a simplex. 
For its conjugate properties with categorical distributions, we chose to model $p_{\theta}(\mathbf{z} \vert \mathbf{x})$ as a Dirichlet distribution whose concentration parameters $\boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\theta}) = \exp (f^{\boldsymbol{\theta}}(\boldsymbol{x}))$ are given by the output of the NN: \begin{equation} p_{\theta}(\mathbf{z} \vert \mathbf{x} = \boldsymbol{x}) = \mathrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\theta}) \big ) = \frac{\Gamma(\alpha_0 (\boldsymbol{x}, \boldsymbol{\theta}))}{\prod_c \Gamma(\alpha_c(\boldsymbol{x}, \boldsymbol{\theta}))} \prod_{c=1}^K z_c^{\alpha_c(\boldsymbol{x}, \boldsymbol{\theta}) - 1} \end{equation} where $\Gamma$ is the Gamma function, $\forall c \in \mathcal{Y}, \alpha_c &gt; 0$, $\alpha_0 = \sum_c \alpha_c$ and $\sum_c z_c = 1$ such that $\mathbf{z}$ lives in the $(K-1)$-dimensional unit simplex $\triangle^{K-1}$.</p> <h2 id="textrmkl_textrmpred-criterion">$\textrm{KL}_{\textrm{Pred}}$ criterion</h2> <p>We propose an uncertainty criterion, denoted $\textrm{KL}_{\textrm{Pred}}$, which measures the KL-divergence between the NN’s output and a sharp Dirichlet distribution with concentration parameters $\boldsymbol{\gamma}^{\hat{y}}$ focused on the <em>predicted</em> class $\hat{y}$:</p> $\begin{equation} \textrm{KL}_{\textrm{Pred}}(\boldsymbol{x}) = \textrm{KL} \Big ( \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\hat{\theta}}) \big ) ~\vert \vert~ \textrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\gamma}^{\hat{y}} \big ) \Big ) \end{equation}$ <p>To ensure an accurate estimation of the concentration parameters $\boldsymbol{\gamma}^{\hat{y}}$, we compute the empirical mean of the exponentiated logits for the predicted class $\hat{y}$ on the training set $\mathcal{D}$:</p> $\begin{equation*} \boldsymbol{\gamma}^{\hat{y}} = \frac{1}{N^{\hat{y}}} \sum_{i: y_i=\hat{y}}^N \boldsymbol{\alpha}(\boldsymbol{x_i}, \boldsymbol{\hat{\theta}}), \quad \quad 
\textrm{with}~~ \boldsymbol{\alpha}(\boldsymbol{x_i}, \boldsymbol{\hat{\theta}}) = \exp (f^{\boldsymbol{\hat{\theta}}}(\boldsymbol{x}_i)) \end{equation*}$ <p>where $N^{\hat{y}}$ is the number of training samples with label $\hat{y}$.</p> <p><img src="/images/klpred_behavior.png" alt="simplex_behavior" /></p> <p>The lower $\textrm{KL}_{\textrm{Pred}}$ is, the more certain we are in the prediction. The figure above illustrates that correct predictions will have Dirichlet distributions similar to the computed mean distribution for the predicted class, and are thus associated with a low uncertainty measure. Misclassified predictions are expected to present concentration parameters that differ from the average computed on the training set, resulting in a higher $\textrm{KL}_{\textrm{Pred}}$ measure. In comparison, the <em>differential entropy</em> is not adequate when it comes to detecting misclassifications <a class="citation" href="#malinin2018">(Malinin &amp; Gales, 2018)</a>, as it corresponds to measuring the KL-divergence between the model’s output and the maximum-entropy distribution, which is the uniform distribution on the simplex.</p> <h2 id="decomposition-into-aleatoric-and-epistemic-uncertainty">Decomposition into aleatoric and epistemic uncertainty</h2> <p>We note that $\textrm{KL}_{\textrm{Pred}}$ corresponds to the definition of the reverse KL-divergence loss <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019)</a>. 
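</p>

<p>A minimal sketch of evaluating $\textrm{KL}_{\textrm{Pred}}$ numerically, using the closed-form KL-divergence between two Dirichlet distributions (the concentration values below are purely illustrative):</p>

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha, gamma):
    """Closed-form KL( Dir(alpha) || Dir(gamma) )."""
    alpha0, gamma0 = alpha.sum(), gamma.sum()
    return (gammaln(alpha0) - gammaln(gamma0)
            + (gammaln(gamma) - gammaln(alpha)).sum()
            + ((alpha - gamma) * (digamma(alpha) - digamma(alpha0))).sum())

# alpha: exponentiated logits for one input; gamma: a hypothetical
# per-class mean of exponentiated logits computed on the training set
alpha = np.exp(np.array([3.0, 0.2, 0.1]))
gamma = np.array([18.0, 1.1, 1.2])
kl_pred = dirichlet_kl(alpha, gamma)
```

<p>A prediction whose concentration parameters match the class mean yields a measure close to zero, while diverging parameters inflate it, matching the behavior described above.</p>

<p>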
It can be decomposed into the reverse cross-entropy and the negative differential entropy:</p> $\begin{equation} \textrm{KL}_{\textrm{Pred}}(\boldsymbol{x}) = \underbrace{\mathbb{E}_{p \big ( \mathbf{z} \vert \boldsymbol{\alpha}(\boldsymbol{x}, \boldsymbol{\hat{\theta}}) \big )} \Big [- \log \textrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\gamma}^{\hat{y}} \big ) \Big ]}_\text{Reverse Cross-Entropy} - \underbrace{\mathcal{H} \Big [ \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha} \big ) \Big ]}_\text{Differential Entropy} \end{equation}$ <p>(Note that we omit the dependence on $\boldsymbol{x}$ and $\boldsymbol{\hat{\theta}}$ in $\boldsymbol{\alpha}(\boldsymbol{x},\boldsymbol{\hat{\theta}})$ for clarity.)</p> <p>First, the <em>differential entropy</em> can be written as the negative KL-divergence between the NN’s output and the uniform Dirichlet distribution $\mathcal{U}(1)$:</p> $\begin{equation} \mathcal{H} \Big [ \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha} \big ) \Big ] = - \textrm{KL} \Big ( \textrm{Dir} \big (\mathbf{z} \vert \boldsymbol{\alpha} \big ) ~\vert \vert~ \textrm{Dir} \big ( \mathbf{z} \vert \mathcal{U}(1) \big ) \Big ) \end{equation}$ <p>As stated in <a class="citation" href="#malinin2019">(Malinin &amp; Gales, 2019; Nandy et al., 2020)</a>, the differential entropy measures the <strong>epistemic uncertainty</strong>.</p> <p>When considering the reverse cross-entropy (RCE) term:</p> \begin{align*} \textrm{RCE} &amp;= \mathbb{E}_{p \big ( \mathbf{z} \vert \boldsymbol{\alpha} \big )} \Big [- \log \textrm{Dir} \big ( \mathbf{z} \vert \boldsymbol{\gamma}^{\hat{y}} \big ) \Big ] \\ &amp;= - \log \Gamma(\boldsymbol{\gamma}^{\hat{y}}_0) + \sum_{c=1}^K \log \Gamma(\boldsymbol{\gamma}^{\hat{y}}_c) - \sum_{c=1}^K (\boldsymbol{\gamma}^{\hat{y}}_c - 1) \mathbb{E}_{p \big ( \mathbf{z} \vert \boldsymbol{\alpha} \big )} \big [\log(\boldsymbol{z}_c) \big ] \\ &amp;= \big ( \log \Gamma(K) - \log \Gamma(\boldsymbol{\gamma}^{\hat{y}}_0) + \sum_{c=1}^K \log 
\Gamma(\boldsymbol{\gamma}^{\hat{y}}_c) \big ) + \sum_{c=1}^K (\boldsymbol{\gamma}^{\hat{y}}_c - 1)(\psi(\boldsymbol{\alpha}_0) - \psi(\boldsymbol{\alpha}_c)) \end{align*} <p>The first term depends only on the fixed target distribution $\boldsymbol{\gamma}^{\hat{y}}$ while the second term also considers the logit values through $\boldsymbol{\alpha}$.</p> <h3 id="case-1-textrmkl_textrmpred">Case 1: $\textrm{KL}_{\textrm{Pred}}$</h3> <p>We compute only the empirical exponential logit mean of the predicted class $\hat{y}$:</p> $\begin{equation*} \boldsymbol{\gamma}^{(1)} = [\boldsymbol{\gamma}_1^{(1)},...,\boldsymbol{\gamma}_K^{(1)}] \quad \quad \textrm{with}~~ \boldsymbol{\gamma}_c^{(1)}= \begin{cases} \frac{1}{N^{\hat{y}}} \sum_{i: y_i=\hat{y}}^N \boldsymbol{\alpha}_c, &amp;\text{if}\ c=\hat{y} \\ 1, &amp;\ \text{else.} \end{cases} \end{equation*}$ <p>Hence, we can simplify RCE as:</p> $\begin{equation} \textrm{RCE}^{(1)} = \log \Gamma(K) - \log \Gamma(\boldsymbol{\gamma}_{\hat{y}}^{(1)}+K-1) + \log \Gamma(\boldsymbol{\gamma}_{\hat{y}}^{(1)}) + (\boldsymbol{\gamma}_{\hat{y}}^{(1)} - 1)(\psi(\boldsymbol{\alpha}_0) - \psi(\boldsymbol{\alpha}_{\hat{y}})) \end{equation}$ <p>We obtain the following figures for the decomposition and the $\textrm{KL}_{\textrm{Pred}}$ criterion:</p> <p><img src="/images/visu_toy_klpred.png" alt="visu_toy_klpred" /></p> <h3 id="case-2-textrmkl_textrmpredfull">Case 2: $\textrm{KL}_{\textrm{PredFull}}$</h3> <p>We compute the full empirical vector as defined in the introductory part:</p> $\begin{equation*} \boldsymbol{\gamma}^{(2)} = \frac{1}{N^{\hat{y}}} \sum_{i: y_i=\hat{y}}^N \boldsymbol{\alpha} \end{equation*}$ <p>(Note that $\boldsymbol{\gamma}_{\hat{y}}^{(2)} = \boldsymbol{\gamma}_{\hat{y}}^{(1)}$.)</p> <p>In this case, there is no simplification as done previously. 
However, we can further decompose:</p> $\begin{equation} \textrm{RCE}^{(2)} = \textrm{RCE}^{(1)} + \Big (- \log \Gamma \big ( \sum_{c \neq \hat{y}} (\boldsymbol{\gamma}_c^{(2)}-1) \big ) + \sum_{c \neq \hat{y}} \big ( \log \Gamma(\boldsymbol{\gamma}_c^{(2)}) + (\boldsymbol{\gamma}_c^{(2)} - 1)(\psi(\boldsymbol{\alpha}_0) - \psi(\boldsymbol{\alpha}_c)) \big ) \Big ) \end{equation}$ <p><img src="/images/visu_toy_klpredfull.png" alt="visu_toy_klpredfull" /></p> <h2 id="references">References</h2> <ol class="bibliography"><li><span id="malinin2018">A. Malinin, &amp; M. Gales. (2018). Predictive Uncertainty Estimation via Prior Networks. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="malinin2019">A. Malinin, &amp; M. Gales. (2019). Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness. In <i>Advances in Neural Information Processing Systems</i>.</span></li> <li><span id="maximize-representation-gap2020">J. Nandy, W. Hsu, &amp; M.-L. Lee. (2020). Towards Maximizing the Representation Gap between In-Domain and Out-of-Distribution Examples. In <i>Advances in Neural Information Processing Systems</i>.</span></li></ol>