Equivalence of entropies of discrete and continuous distributions
I give a simple derivation that shows the equivalence of entropies of
discrete and continuous distributions. The distributions can be
understood in terms of shape and domain size, with corresponding terms in
the entropy for each.
[Figure: P(x), a piecewise-constant density, and M(i), a discrete probability mass function.]
Shown above are two probability distributions with the same shape.
The first, P(x), is a peculiar-looking, piecewise-constant density
defined on the real interval x: [0, n] for some integer n. The second,
M(i), is a discrete distribution defined on the integers
i: [0, n-1]. They have the same shape in the sense that
P is constant on each unit interval, with P(x) = M(\lfloor x \rfloor),
so in particular P(i) = M(i)\, \forall i \in dom(M). We then have:
\begin{aligned}
- H(P) &\equiv \int_0^n{dx\, P(x) \log P(x)} \\
&= \sum_{i=0}^{n-1}{\int_i^{i+1}{dx\, P(x) \log P(x)}} \\
&= \sum_{i=0}^{n-1}{P(i) \log P(i)} \\
&= \sum_{i=0}^{n-1}{M(i) \log M(i)} \\
&= -H(M)
\end{aligned}
Breaking the integral into unit-sized intervals, over each of which
P(x) is constant, and evaluating each piece, we see that the two entropies
are equal. Putting this aside for now: what happens when we compare P(x) to a
distribution of the same shape, but defined on the size-1 domain [0,
1]?
\begin{aligned}
Q(t) &\equiv n P(nt) \\
-H(Q) &\equiv \int_0^1{dt\, Q(t) \log Q(t)} \\
&= \int_0^1{ dt\, n P(nt) \log [n P(nt)]} \\
&= \int_0^n{ \frac{dx}{n}\, n P(x) [\log n + \log P(x)]} \\
&= \int_0^n{ dx\, P(x) \log n} + \int_0^n{dx\, P(x) \log P(x)} \\
&= \log n - H(P) \\
\\
H(Q) &= H(P) - \log n
\end{aligned}
We see that the two entropies differ by \log n, the logarithm of the
domain size of P. Also, \log n happens to
be the entropy of the uniform distribution on [0, n] (or on any
continuous domain of size n), and likewise the entropy of the discrete
uniform distribution over n categories.
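A minimal NumPy sketch of both results (the particular mass function m, the number of categories n, and the grid resolution k below are arbitrary choices for illustration, not from the derivation above):

import numpy as np

rng = np.random.default_rng(0)
n = 8
m = rng.random(n)
m /= m.sum()                             # discrete mass function M(i), i = 0..n-1

H_M = -np.sum(m * np.log(m))             # discrete entropy H(M)

# P(x) = M(floor(x)) on [0, n]: midpoint Riemann sum of -P log P,
# which is exact here because P is constant on each unit interval.
k = 1000                                 # samples per unit interval
dx = 1.0 / k
x = (np.arange(n * k) + 0.5) * dx
P = m[np.floor(x).astype(int)]
H_P = -np.sum(P * np.log(P)) * dx

# Q(t) = n * P(n t) on [0, 1]: the same shape squeezed onto a unit domain.
t = (np.arange(n * k) + 0.5) * (dx / n)
Q = n * m[np.floor(n * t).astype(int)]
H_Q = -np.sum(Q * np.log(Q)) * (dx / n)

print(np.isclose(H_P, H_M))              # True: H(P) = H(M)
print(np.isclose(H_Q, H_P - np.log(n)))  # True: H(Q) = H(P) - log n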
This is a nice result. It says that we can interpret discrete
distributions over n categories as piecewise-constant
continuous distributions over a domain of size n, and the
entropy will be the same. Secondly, it shows that the entropy of a
distribution can be decomposed into two parts: a shape component
and a domain-size component.
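Stated as a general decomposition (writing L for the domain size and \tilde{P} for the rescaled, unit-domain shape of P; this notation is introduced here, and the derivation is the same as above with L in place of n):
\begin{aligned}
\tilde{P}(t) &\equiv L\, P(Lt), \quad t \in [0, 1] \\
H(P) &= \underbrace{H(\tilde{P})}_{\text{shape}} + \underbrace{\log L}_{\text{domain size}}
\end{aligned}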
Finally, this implies that if we have two distributions defined on the
same domain, the difference of their entropies only reflects the difference
in shape of the two distributions. Domain size does not play a role.
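Concretely, for two densities P_1 and P_2 on the same domain [0, n] (notation introduced here), the \log n terms cancel in the difference:
\begin{aligned}
H(P_1) - H(P_2) &= [H(\tilde{P}_1) + \log n] - [H(\tilde{P}_2) + \log n] \\
&= H(\tilde{P}_1) - H(\tilde{P}_2)
\end{aligned}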
The shape + size idea also applies to joint entropies H(P,
Q) and conditional entropies H(P|Q). As a consequence,
KL divergence is domain-size invariant as well, since it is a difference
of entropies over the same domain.
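One way to see this directly is to apply the rescaling to both densities at once. Writing \tilde{P}_1(t) \equiv n P_1(nt) and \tilde{P}_2(t) \equiv n P_2(nt) as above, the factor of n cancels inside the logarithm:
\begin{aligned}
D_{KL}(\tilde{P}_1 \| \tilde{P}_2) &\equiv \int_0^1{dt\, \tilde{P}_1(t) \log \frac{\tilde{P}_1(t)}{\tilde{P}_2(t)}} \\
&= \int_0^1{dt\, n P_1(nt) \log \frac{n P_1(nt)}{n P_2(nt)}} \\
&= \int_0^n{dx\, P_1(x) \log \frac{P_1(x)}{P_2(x)}} \\
&= D_{KL}(P_1 \| P_2)
\end{aligned}
No \log n term survives, so the divergence is completely unchanged by rescaling the domain.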
Another example is mutual information, defined on a two-variable joint
distribution:
\begin{aligned}
I(A; B) &\equiv D_{KL}(P(A, B) || P(A)P(B)) \\
&= H(A) - H(A|B) \\
&= H(B) - H(B|A)
\end{aligned}
The first difference above is a difference of entropies over the domain
of A, and the second over the domain of B.
Even though the two domain sizes may differ, both differences are
domain-size invariant, and they turn out to be equal, which makes mutual
information between two variables a symmetric measure.
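To close, a minimal NumPy sketch checking this numerically; the 3-by-4 joint distribution below is arbitrary:

import numpy as np

rng = np.random.default_rng(1)
p_ab = rng.random((3, 4))
p_ab /= p_ab.sum()                       # joint distribution P(A, B)

p_a = p_ab.sum(axis=1)                   # marginal P(A)
p_b = p_ab.sum(axis=0)                   # marginal P(B)

def H(p):
    # discrete entropy of a probability vector (zero entries contribute nothing)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_ab = H(p_ab.ravel())                   # joint entropy H(A, B)
H_a_given_b = H_ab - H(p_b)              # H(A|B) = H(A, B) - H(B)
H_b_given_a = H_ab - H(p_a)              # H(B|A) = H(A, B) - H(A)

# D_KL(P(A, B) || P(A) P(B))
mi_kl = np.sum(p_ab * np.log(p_ab / np.outer(p_a, p_b)))

print(np.isclose(H(p_a) - H_a_given_b, mi_kl))   # True: H(A) - H(A|B)
print(np.isclose(H(p_b) - H_b_given_a, mi_kl))   # True: H(B) - H(B|A)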