Translated by ChatGPT.

    After feeling self-satisfied for a while, I realized that this article is basically a translation of the English Wikipedia entry Quantities of Information. However, the corresponding Chinese Wikipedia entry doesn’t cover as much content as the English version, so it’s not entirely a waste of effort.

    Information and Probability

    Information is expressed through a proposition (which may itself combine several propositions via logical operations). The proposition answers a specific question. Before definite information is received, many other propositions could potentially answer that question, which is what makes probability theory applicable. Taken together, all the possible propositions form the space of a random variable.

    For example, in a multiple-choice question with four options, A, B, C, and D:

    • If it’s a single-choice question, the random variable space for the answer is {A, B, C, D}.
    • If it’s multiple-choice, it is {A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD}, that is, every non-empty combination of the options (a small enumeration sketch follows this list).
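
    As a quick illustration (my own sketch, not part of the original article), the two answer spaces above can be enumerated in Python; the option labels are just the strings A through D:

        import itertools

        options = ["A", "B", "C", "D"]

        # Single-choice: each option on its own is a possible answer.
        single_choice_space = list(options)

        # Multiple-choice: every non-empty combination of the options.
        multiple_choice_space = [
            "".join(combo)
            for r in range(1, len(options) + 1)
            for combo in itertools.combinations(options, r)
        ]

        print(single_choice_space)                                # ['A', 'B', 'C', 'D']
        print(len(multiple_choice_space), multiple_choice_space)  # 15 outcomes, 'A' ... 'ABCD'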

    Quantifying Information in a Probability Distribution

    Self-Information

    Self-information is a property of a random event or a specific possible value of a random variable. It is expressed as:

    I(m) = -\log_n\left(p(M=m)\right)

    This is a dimensionless quantity, but the logarithm base can vary:

    • When n=2, the unit is called a bit or shannon. A shannon represents the maximum information a binary digit can convey.
    • When n=e, the unit is the nat, with \ln = \log_e being the natural logarithm.
    • When n=10, the unit is the hartley.

    The formula is significant in information theory as it represents the minimum number of single-choice questions with n options required to find the answer.
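
    As a minimal sketch of my own (not from the article), self-information in the three bases above can be computed directly from the formula:

        import math

        def self_information(p: float, base: float = 2.0) -> float:
            """Self-information -log_n(p) of an outcome with probability p."""
            return -math.log(p, base)

        # A fair four-option single-choice question: each answer has p = 1/4.
        print(self_information(0.25))          # 2.0 bits (shannons)
        print(self_information(0.25, math.e))  # ~1.386 nats
        print(self_information(0.25, 10))      # ~0.602 hartleys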

    Entropy

    Entropy is a property of a random variable’s probability distribution as a whole. The formula is straightforward: it is the expected value of self-information:

    S(p(M)) = \mathbb{E}_p[-\log_n p(M)] = -\sum_{m\in M} p(m)\log_n p(m)

    For example, the binary entropy of a Bernoulli random variable with success probability p is:

    S_{binary} = -(1-p)\log(1-p) - p\log p

    In information theory, entropy signifies the minimum expected number of questions required to convey the information contained in the whole set of possible propositions.
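
    A small Python sketch of these two formulas (the helper names are my own):

        import math

        def entropy(probs, base: float = 2.0) -> float:
            """Shannon entropy -sum p log_n p of a discrete distribution."""
            return -sum(p * math.log(p, base) for p in probs if p > 0)

        def binary_entropy(p: float, base: float = 2.0) -> float:
            """Entropy of a Bernoulli(p) random variable."""
            return entropy([p, 1 - p], base)

        print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: a fair four-way choice
        print(binary_entropy(0.5))                # 1.0 bit: a fair coin
        print(binary_entropy(0.99))               # ~0.081 bits: a nearly certain outcome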

    Comparing Information Content in Two Probability Distributions

    Kullback–Leibler (K-L) Divergence

    Also known as relative entropy or I-divergence, K-L divergence is defined for two probability distributions of the same random variable space:

    D_{KL}(p(X)\,\|\,q(X)) = \sum_{x\in X} p(x)\log\frac{p(x)}{q(x)}

    This measures the expected number of extra single-choice questions required when using the questions that are optimal for q(X) while the true distribution is p(X).
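
    A sketch in plain Python (my own, assuming both distributions are given as aligned lists of probabilities and that q(x) > 0 wherever p(x) > 0):

        import math

        def kl_divergence(p, q, base: float = 2.0) -> float:
            """D_KL(p || q) = sum p(x) log(p(x)/q(x)) over aligned lists p and q."""
            return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

        p = [0.7, 0.2, 0.1]    # the true distribution
        q = [1/3, 1/3, 1/3]    # a uniform guess
        print(kl_divergence(p, q))  # ~0.43 bits of "extra questions"
        print(kl_divergence(p, p))  # 0.0: no penalty when the guess is exact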

    Cross Entropy

    Cross entropy describes the expected total number of single-choice questions needed when the questions optimal for q(X) are used while the true distribution is p(X):

    CE(p(X), q(X)) = -\sum_{x\in X} p(x)\log q(x)

    Binary cross entropy between p and q can be expressed as:

    BCE(p, q) = -p\log q - (1-p)\log(1-q)
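
    A companion sketch in the same style (helper names are mine; p and q are aligned lists of probabilities). As the definitions imply, CE(p, q) = S(p) + D_KL(p || q), so minimizing cross entropy over q with p fixed is the same as minimizing the K-L divergence:

        import math

        def cross_entropy(p, q, base: float = 2.0) -> float:
            """CE(p, q) = -sum p(x) log q(x) over aligned probability lists."""
            return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

        def binary_cross_entropy(p: float, q: float, base: float = 2.0) -> float:
            """BCE(p, q) = -p log q - (1-p) log(1-q): cross entropy of two Bernoulli distributions."""
            return cross_entropy([p, 1 - p], [q, 1 - q], base)

        p = [0.7, 0.2, 0.1]
        q = [1/3, 1/3, 1/3]                    # uniform guess
        print(cross_entropy(p, q))             # log2(3) ~ 1.585 bits: entropy of p plus the K-L penalty
        print(binary_cross_entropy(1.0, 0.9))  # ~0.152 bits: true label 1, predicted probability 0.9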

    Mutual Information

    Mutual information, defined for two different random variable spaces, X and Y, is:

    MI(X, Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}

    This measures the difference between the joint distribution of X and Y and the product of their marginals (what the joint would be if X and Y were independent), with MI = 0 exactly when they are independent.
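
    A sketch computing MI from a joint probability table (my own example, with the joint given as a nested dict):

        import math

        def mutual_information(joint, base: float = 2.0) -> float:
            """MI of a joint distribution given as {x: {y: p(x, y)}}."""
            # Marginals p(x) and p(y) from the joint table.
            p_x = {x: sum(row.values()) for x, row in joint.items()}
            p_y = {}
            for row in joint.values():
                for y, p in row.items():
                    p_y[y] = p_y.get(y, 0.0) + p
            return sum(
                p * math.log(p / (p_x[x] * p_y[y]), base)
                for x, row in joint.items()
                for y, p in row.items()
                if p > 0
            )

        # X and Y perfectly correlated: knowing X tells you Y (1 bit of shared information).
        correlated = {0: {0: 0.5, 1: 0.0}, 1: {0: 0.0, 1: 0.5}}
        # X and Y independent fair coins: no shared information.
        independent = {0: {0: 0.25, 1: 0.25}, 1: {0: 0.25, 1: 0.25}}
        print(mutual_information(correlated))   # 1.0
        print(mutual_information(independent))  # 0.0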

    Loss Functions in PyTorch

    Some relevant loss functions in PyTorch are listed below; a brief usage sketch follows the list:

    • torch.nn.KLDivLoss
    • torch.nn.CrossEntropyLoss
    • torch.nn.BCELoss
    • torch.nn.BCEWithLogitsLoss
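
    A minimal usage sketch of my own, assuming a recent PyTorch version. The input conventions are easy to trip over: KLDivLoss expects log-probabilities as input and probabilities as target, CrossEntropyLoss expects raw logits and class indices, BCELoss expects probabilities in (0, 1), and BCEWithLogitsLoss applies the sigmoid itself:

        import torch
        import torch.nn.functional as F

        logits = torch.randn(4, 3)                           # a batch of 4 samples, 3 classes
        class_targets = torch.tensor([0, 2, 1, 0])           # class indices for CrossEntropyLoss
        prob_targets = F.softmax(torch.randn(4, 3), dim=1)   # a target distribution for KLDivLoss

        # CrossEntropyLoss: raw logits in, class indices as targets (softmax + log are internal).
        ce = torch.nn.CrossEntropyLoss()(logits, class_targets)

        # KLDivLoss: log-probabilities in, probabilities as targets.
        kl = torch.nn.KLDivLoss(reduction="batchmean")(F.log_softmax(logits, dim=1), prob_targets)

        # Binary case: BCEWithLogitsLoss applies the sigmoid internally; BCELoss wants probabilities.
        binary_logits = torch.randn(4)
        binary_targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
        bce_with_logits = torch.nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
        bce = torch.nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

        print(ce.item(), kl.item(), bce_with_logits.item(), bce.item())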

    Boltzmann’s Epitaph

    S = k\log W

    Here, S is the thermodynamic entropy, k is Boltzmann’s constant k_B, and W is the number of microstates. In statistical mechanics, when all microstates are equally probable, information entropy (measured in nats) and thermodynamic entropy differ only by the factor k_B.
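
    To spell this out (a short derivation of my own, using the definitions above): with W equally probable microstates, each has probability p(m) = 1/W, so the information entropy in nats is

    S_{info} = -\sum_{m=1}^{W} \frac{1}{W}\ln\frac{1}{W} = \ln W

    and Boltzmann’s formula becomes S = k_B\ln W = k_B\, S_{info}.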

    Fortune Telling and Low Information Content

    In 451 AD, Attila invaded Roman territory and fought a coalition led by Aetius that included Theodoric, King of the Visigoths. Before the battle, Attila consulted a fortune teller, who predicted that a king would die and a state would collapse. Attila assumed Theodoric would be the one to fall; Theodoric did indeed die, but the coalition defeated the Huns and Attila’s ambitions were thwarted.

    This story illustrates how fortune tellers employ low-information predictions that lead people to infer higher-information outcomes, playing on the difference in entropy between statements.
