
F-divergence

Article Id: WHEBN0015224289

Title: F-divergence
Author: World Heritage Encyclopedia
Language: English
Collection: F-Divergences
Publisher: World Heritage Encyclopedia


In probability theory, an ƒ-divergence is a function D_f(P ∥ Q) that measures the difference between two probability distributions P and Q. Intuitively, the divergence is an average, taken with respect to Q, of the function f applied to the likelihood ratio dP/dQ.

These divergences were introduced and studied independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ƒ-divergences, Csiszár-Morimoto divergences or Ali-Silvey distances.

Contents

• Definition
• Instances of f-divergences
• Properties
• References

Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of Q from P is defined as

D_f(P\parallel Q) \equiv \int_{\Omega} f\left(\frac{dP}{dQ}\right)\,dQ.

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

D_f(P\parallel Q) = \int_{\Omega} f\left(\frac{p(x)}{q(x)}\right)q(x)\,d\mu(x).
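For a finite outcome space with counting measure as μ, the integral above reduces to a sum over outcomes. The following is a minimal sketch of that discrete case, not a reference implementation; the names `f_divergence` and `kl_generator` are illustrative, and the example distributions are made up for the demonstration.

```python
import math


def f_divergence(p, q, f):
    """Discrete f-divergence: sum over x of q(x) * f(p(x) / q(x)).

    Assumes p and q list probabilities over the same finite outcomes and
    that p(x) = 0 wherever q(x) = 0 (absolute continuity of P w.r.t. Q).
    """
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px != 0.0:
                raise ValueError("P is not absolutely continuous w.r.t. Q")
            continue  # outcomes with q(x) = 0 carry no Q-mass
        total += qx * f(px / qx)
    return total


def kl_generator(t):
    # f(t) = t ln t, with the usual convention 0 ln 0 = 0
    return t * math.log(t) if t > 0 else 0.0


# Illustrative distributions on two outcomes
p = [0.5, 0.5]
q = [0.25, 0.75]
d = f_divergence(p, q, kl_generator)
```

With f(t) = t ln t this sum is the Kullback-Leibler divergence of P from Q, which for the two distributions above equals (1/2) ln(4/3).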

When f is smooth enough to admit a Taylor expansion around t = 1, the f-divergence can be rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).
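As a formal sketch of how such an expansion arises, assume that f is analytic at t = 1 and that the series may be integrated term by term (both are assumptions made here for illustration). Expanding f around 1 gives

f(t) = \sum_{k \geq 0} \frac{f^{(k)}(1)}{k!}\,(t-1)^k.

Substituting t = p(x)/q(x) into the density form of the definition, the k = 0 term vanishes because f(1) = 0, and the k = 1 term vanishes because \int_{\Omega} (p(x) - q(x))\,d\mu(x) = 0. What remains is

D_f(P\parallel Q) = \sum_{k \geq 2} \frac{f^{(k)}(1)}{k!}\,\chi^k(P\parallel Q), \qquad \chi^k(P\parallel Q) = \int_{\Omega} \frac{(p(x)-q(x))^k}{q(x)^{k-1}}\,d\mu(x).

The symbol \chi^k is notation chosen for this sketch; for k = 2 it is the \chi^2-divergence generated by f(t) = (t - 1)^2.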

Instances of f-divergences

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond (cf. Liese & Vajda (2006)).

Divergence | Corresponding f(t)
KL-divergence | t \ln t (the choice -\ln t gives the reverse KL-divergence, with the roles of P and Q exchanged)
Hellinger distance (squared) | (\sqrt{t} - 1)^2,\,2(1-\sqrt{t})
Total variation distance | |t - 1| \,
\chi^2-divergence | (t - 1)^2,\,t^2 -1
α-divergence | \begin{cases} \frac{4}{1-\alpha^2}\big(1 - t^{(1+\alpha)/2}\big), & \text{if}\ \alpha\neq\pm1, \\ t \ln t, & \text{if}\ \alpha=1, \\ - \ln t, & \text{if}\ \alpha=-1 \end{cases}
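The correspondence in the table can be checked numerically on a small discrete example. The sketch below compares each generator with the familiar closed form of the divergence; the helper `f_divergence` and the two distributions are illustrative assumptions, and every probability is taken to be strictly positive so that no boundary conventions are needed.

```python
import math


def f_divergence(p, q, f):
    # Discrete form: sum of q(x) * f(p(x) / q(x)); assumes every q(x) > 0.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


# Illustrative distributions on three outcomes, all entries positive
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

# Values obtained from the generators f(t)
kl = f_divergence(p, q, lambda t: t * math.log(t))
reverse_kl = f_divergence(p, q, lambda t: -math.log(t))
hellinger_a = f_divergence(p, q, lambda t: (math.sqrt(t) - 1) ** 2)
hellinger_b = f_divergence(p, q, lambda t: 2 * (1 - math.sqrt(t)))
tv = f_divergence(p, q, lambda t: abs(t - 1))
chi2_a = f_divergence(p, q, lambda t: (t - 1) ** 2)
chi2_b = f_divergence(p, q, lambda t: t ** 2 - 1)

# The same quantities written directly in terms of p and q
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))
reverse_kl_direct = sum(qx * math.log(qx / px) for px, qx in zip(p, q))
hellinger_direct = sum((math.sqrt(px) - math.sqrt(qx)) ** 2 for px, qx in zip(p, q))
l1_direct = sum(abs(px - qx) for px, qx in zip(p, q))
chi2_direct = sum((px - qx) ** 2 / qx for px, qx in zip(p, q))
```

Two points are worth noting. Generators that differ by a multiple of (t - 1), such as (\sqrt{t} - 1)^2 and 2(1-\sqrt{t}), or (t - 1)^2 and t^2 - 1, yield the same divergence, because \int (p - q)\,d\mu = 0. The generator |t - 1| yields the L1 distance \sum |p(x) - q(x)|, which is twice the total variation distance when the latter is defined as the largest difference in probability over events.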

Properties

• Non-negativity: the ƒ-divergence is always non-negative, and it is zero whenever the measures P and Q coincide; if f is strictly convex at 1, it is zero only in that case. Non-negativity follows immediately from Jensen’s inequality:
D_f(P\!\parallel\!Q) = \int \!f\bigg(\frac{dP}{dQ}\bigg)dQ \geq f\bigg( \int\frac{dP}{dQ}dQ\bigg) = f(1) = 0.
• Monotonicity: if κ is an arbitrary transition probability that transforms measures P and Q into P_κ and Q_κ respectively, then
D_f(P\!\parallel\!Q) \geq D_f(P_\kappa\!\parallel\!Q_\kappa).
The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
• Joint Convexity: for any 0 ≤ λ ≤ 1
D_f\Big(\lambda P_1 + (1-\lambda)P_2 \parallel \lambda Q_1 + (1-\lambda)Q_2\Big) \leq \lambda D_f(P_1\!\parallel\!Q_1) + (1-\lambda)D_f(P_2\!\parallel\!Q_2).
This follows from the convexity of the mapping (p,q) \mapsto q f(p/q) on \mathbb{R}_+^2.
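The three properties above can be illustrated numerically in the discrete case. The sketch below uses the generator f(t) = t ln t and a made-up row-stochastic matrix as the transition probability κ; the helper names (`f_divergence`, `push_forward`, `mix`) and all numbers are assumptions chosen for the illustration, and a numerical check on one example is of course not a proof.

```python
import math


def f_divergence(p, q, f):
    # Discrete form: sum of q(x) * f(p(x) / q(x)); assumes every q(x) > 0.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def kl_generator(t):
    # f(t) = t ln t; only evaluated at t > 0 in this example
    return t * math.log(t)


def push_forward(dist, kernel):
    # kernel[i][j] is the probability of moving from outcome i to outcome j,
    # so each row of kernel sums to 1.
    n_out = len(kernel[0])
    return [sum(dist[i] * kernel[i][j] for i in range(len(dist)))
            for j in range(n_out)]


def mix(a, b, lam):
    # Convex combination lam * a + (1 - lam) * b of two distributions
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]


p1, q1 = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
p2, q2 = [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]
kernel = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # three outcomes mapped to two

# Non-negativity
d = f_divergence(p1, q1, kl_generator)

# Monotonicity: divergence after applying the transition probability
d_coarse = f_divergence(push_forward(p1, kernel),
                        push_forward(q1, kernel), kl_generator)

# Joint convexity with lambda = 0.3
lam = 0.3
d_mix = f_divergence(mix(p1, p2, lam), mix(q1, q2, lam), kl_generator)
bound = (lam * f_divergence(p1, q1, kl_generator)
         + (1 - lam) * f_divergence(p2, q2, kl_generator))
```

Here `d` is non-negative, `d_coarse` does not exceed `d`, and `d_mix` does not exceed `bound`, in line with the three inequalities.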