Equivalence of entropies of discrete and continuous distributions
I give a simple derivation that shows the equivalence of entropies of
discrete and continuous distributions. The distributions can be
understood in terms of shape and domain size, with corresponding terms in
the entropy for each.
[Figure: P(x), a piecewise-constant density, and M(i), a discrete probability mass function.]
Shown above are two probability distributions with the same shape.
The first, P(x), is a peculiar-looking, piecewise-constant density
defined on the real interval x: [0, n] for some integer n. The second,
M(i), is a discrete distribution defined on the integers
i: [0, n-1]. They have the same shape in the sense that
P is constant on each unit interval, with P(x) = M(\lfloor x \rfloor),
so in particular P(i) = M(i)\, \forall i \in dom(M). We then have:
\begin{aligned}
- H(P) &\equiv \int_0^n{dx\, P(x) \log P(x)} \\
&= \sum_{i=0}^{n-1}{\int_i^{i+1}{dx\, P(x) \log P(x)}} \\
&= \sum_{i=0}^{n-1}{P(i) \log P(i)} \\
&= \sum_{i=0}^{n-1}{M(i) \log M(i)} \\
&= -H(M)
\end{aligned}
Breaking the integral into unit-sized intervals, over each of which
P(x) is constant, and evaluating each piece, we see that the two entropies
are equal. Putting this aside for now: what happens when we compare P(x) to a
distribution of the same shape, but defined on the size-1 domain [0,
1]?
\begin{aligned}
Q(t) &\equiv n P(nt) \\
-H(Q) &\equiv \int_0^1{dt\, Q(t) \log Q(t)} \\
&= \int_0^1{ dt\, n P(nt) \log [n P(nt)]} \\
&= \int_0^n{ \frac{dx}{n}\, n P(x) [\log n + \log P(x)]} \\
&= \int_0^n{ dx\, P(x) \log n} + \int_0^n{dx\, P(x) \log P(x)} \\
&= \log n - H(P) \\
\\
H(Q) &= H(P) - \log n
\end{aligned}
We see that the two entropies differ by \log n, the logarithm of the
domain size of P. Also, \log n happens to
be the entropy of the uniform distribution on [0, n] (or on any
continuous domain of size n), and likewise the entropy of the discrete
uniform distribution over n categories.
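A minimal NumPy sketch of both results (the particular mass function m, the number of categories n, and the grid resolution k below are arbitrary choices for illustration, not from the derivation above):

import numpy as np

rng = np.random.default_rng(0)
n = 8
m = rng.random(n)
m /= m.sum()                             # discrete mass function M(i), i = 0..n-1

H_M = -np.sum(m * np.log(m))             # discrete entropy H(M)

# P(x) = M(floor(x)) on [0, n]: midpoint Riemann sum of -P log P,
# which is exact here because P is constant on each unit interval.
k = 1000                                 # samples per unit interval
dx = 1.0 / k
x = (np.arange(n * k) + 0.5) * dx
P = m[np.floor(x).astype(int)]
H_P = -np.sum(P * np.log(P)) * dx

# Q(t) = n * P(n t) on [0, 1]: the same shape squeezed onto a unit domain.
t = (np.arange(n * k) + 0.5) * (dx / n)
Q = n * m[np.floor(n * t).astype(int)]
H_Q = -np.sum(Q * np.log(Q)) * (dx / n)

print(np.isclose(H_P, H_M))              # True: H(P) = H(M)
print(np.isclose(H_Q, H_P - np.log(n)))  # True: H(Q) = H(P) - log n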
This is a nice result. It says that we can interpret discrete
distributions over n categories as piecewise-constant
continuous distributions over a domain of size n, and the
entropy will be the same. Secondly, it shows that the entropy of a
distribution can be decomposed into two parts: a shape component
and a domain-size component.
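Stated as a general decomposition (writing L for the domain size and \tilde{P} for the rescaled, unit-domain shape of P; this notation is introduced here, and the derivation is the same as above with L in place of n):
\begin{aligned}
\tilde{P}(t) &\equiv L\, P(Lt), \quad t \in [0, 1] \\
H(P) &= \underbrace{H(\tilde{P})}_{\text{shape}} + \underbrace{\log L}_{\text{domain size}}
\end{aligned}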
Finally, this implies that if we have two distributions defined on the
same domain, the difference of their entropies only reflects the difference
in shape of the two distributions. Domain size does not play a role.
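Concretely, for two densities P_1 and P_2 on the same domain [0, n] (notation introduced here), the \log n terms cancel in the difference:
\begin{aligned}
H(P_1) - H(P_2) &= [H(\tilde{P}_1) + \log n] - [H(\tilde{P}_2) + \log n] \\
&= H(\tilde{P}_1) - H(\tilde{P}_2)
\end{aligned}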
The shape + size idea also applies to joint entropies H(P,
Q) and conditional entropies H(P|Q). As a consequence,
KL divergence is domain-size invariant as well, since it is a difference
of entropies over the same domain.
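One way to see this directly is to apply the rescaling to both densities at once. Writing \tilde{P}_1(t) \equiv n P_1(nt) and \tilde{P}_2(t) \equiv n P_2(nt) as above, the factor of n cancels inside the logarithm:
\begin{aligned}
D_{KL}(\tilde{P}_1 \| \tilde{P}_2) &\equiv \int_0^1{dt\, \tilde{P}_1(t) \log \frac{\tilde{P}_1(t)}{\tilde{P}_2(t)}} \\
&= \int_0^1{dt\, n P_1(nt) \log \frac{n P_1(nt)}{n P_2(nt)}} \\
&= \int_0^n{dx\, P_1(x) \log \frac{P_1(x)}{P_2(x)}} \\
&= D_{KL}(P_1 \| P_2)
\end{aligned}
No \log n term survives, so the divergence is completely unchanged by rescaling the domain.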
Another example is mutual information, defined on a two-variable joint
distribution:
\begin{aligned}
I(A; B) &\equiv D_{KL}(P(A, B) || P(A)P(B)) \\
&= H(A) - H(A|B) \\
&= H(B) - H(B|A)
\end{aligned}
The first difference above is a difference of entropies over the domain
of A, and the second over the domain of B.
Even though the two domain sizes may differ, both differences are
domain-size invariant, and they turn out to be equal, which makes mutual
information between two variables a symmetric measure.
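To close, a minimal NumPy sketch checking this numerically; the 3-by-4 joint distribution below is arbitrary:

import numpy as np

rng = np.random.default_rng(1)
p_ab = rng.random((3, 4))
p_ab /= p_ab.sum()                       # joint distribution P(A, B)

p_a = p_ab.sum(axis=1)                   # marginal P(A)
p_b = p_ab.sum(axis=0)                   # marginal P(B)

def H(p):
    # discrete entropy of a probability vector (zero entries contribute nothing)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_ab = H(p_ab.ravel())                   # joint entropy H(A, B)
H_a_given_b = H_ab - H(p_b)              # H(A|B) = H(A, B) - H(B)
H_b_given_a = H_ab - H(p_a)              # H(B|A) = H(A, B) - H(A)

# D_KL(P(A, B) || P(A) P(B))
mi_kl = np.sum(p_ab * np.log(p_ab / np.outer(p_a, p_b)))

print(np.isclose(H(p_a) - H_a_given_b, mi_kl))   # True: H(A) - H(A|B)
print(np.isclose(H(p_b) - H_b_given_a, mi_kl))   # True: H(B) - H(B|A)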