Translated by ChatGPT.

    After feeling self-satisfied for a while, I realized that this article is basically a translation of the English Wikipedia entry Quantities of Information. However, the corresponding Chinese Wikipedia entry doesn’t cover as much content as the English version, so it’s not entirely a waste of effort.

    Information and Probability

    Information is expressed through a proposition (which may itself combine several propositions via logical operations). The proposition answers a specific question. Before definite information is received, many other propositions could potentially answer that question, which is what makes probability theory applicable. Taken together, all the possible propositions form the space of a random variable.

    For example, in a multiple-choice question with four options, A, B, C, and D:

    • If it’s a single-choice question, the random variable space for the answer is {A, B, C, D}.
    • If it’s multiple-choice, it is {A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD}, that is, every non-empty combination of the options (a small enumeration sketch follows this list).
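
    As a quick illustration (my own sketch, not part of the original article), the two answer spaces above can be enumerated in Python; the option labels are just the strings A through D:

        import itertools

        options = ["A", "B", "C", "D"]

        # Single-choice: each option on its own is a possible answer.
        single_choice_space = list(options)

        # Multiple-choice: every non-empty combination of the options.
        multiple_choice_space = [
            "".join(combo)
            for r in range(1, len(options) + 1)
            for combo in itertools.combinations(options, r)
        ]

        print(single_choice_space)                                # ['A', 'B', 'C', 'D']
        print(len(multiple_choice_space), multiple_choice_space)  # 15 outcomes, 'A' ... 'ABCD'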

    Quantifying Information in a Probability Distribution

    Self-Information

    Self-information is a property of a random event or a specific possible value of a random variable. It is expressed as:

    I(m) = -\log_n\left(p(M=m)\right)

    This is a dimensionless quantity, but the logarithm base can vary:

    • When n=2, the unit is called a bit or shannon. A shannon represents the maximum information a binary digit can convey.
    • When n=e, the unit is the nat, with \ln = \log_e being the natural logarithm.
    • When n=10, the unit is the hartley.

    The formula is significant in information theory as it represents the minimum number of single-choice questions with n options required to find the answer.
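
    As a minimal sketch of my own (not from the article), self-information in the three bases above can be computed directly from the formula:

        import math

        def self_information(p: float, base: float = 2.0) -> float:
            """Self-information -log_n(p) of an outcome with probability p."""
            return -math.log(p, base)

        # A fair four-option single-choice question: each answer has p = 1/4.
        print(self_information(0.25))          # 2.0 bits (shannons)
        print(self_information(0.25, math.e))  # ~1.386 nats
        print(self_information(0.25, 10))      # ~0.602 hartleys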

    Entropy

    Entropy is a property of a random variable’s probability distribution as a whole. The formula is straightforward: it is the expected value of self-information:

    S(p(M)) = \mathbb{E}_p[-\log_n p(M)] = -\sum_{m\in M} p(m)\log_n p(m)

    For example, the binary entropy of a Bernoulli random variable with success probability p is:

    S_{binary} = -(1-p)\log(1-p) - p\log p

    In information theory, entropy signifies the minimum expected number of questions required to convey the information contained in the whole set of possible propositions.
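
    A small Python sketch of these two formulas (the helper names are my own):

        import math

        def entropy(probs, base: float = 2.0) -> float:
            """Shannon entropy -sum p log_n p of a discrete distribution."""
            return -sum(p * math.log(p, base) for p in probs if p > 0)

        def binary_entropy(p: float, base: float = 2.0) -> float:
            """Entropy of a Bernoulli(p) random variable."""
            return entropy([p, 1 - p], base)

        print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: a fair four-way choice
        print(binary_entropy(0.5))                # 1.0 bit: a fair coin
        print(binary_entropy(0.99))               # ~0.081 bits: a nearly certain outcome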

    Comparing Information Content in Two Probability Distributions

    Kullback–Leibler (K-L) Divergence

    Also known as relative entropy or I-divergence, K-L divergence is defined for two probability distributions of the same random variable space:

    D_{KL}(p(X)\,\|\,q(X)) = \sum_{x\in X} p(x)\log\frac{p(x)}{q(x)}

    This measures the expected number of extra single-choice questions required when using the questions that are optimal for q(X) while the true distribution is p(X).
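
    A sketch in plain Python (my own, assuming both distributions are given as aligned lists of probabilities and that q(x) > 0 wherever p(x) > 0):

        import math

        def kl_divergence(p, q, base: float = 2.0) -> float:
            """D_KL(p || q) = sum p(x) log(p(x)/q(x)) over aligned lists p and q."""
            return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

        p = [0.7, 0.2, 0.1]    # the true distribution
        q = [1/3, 1/3, 1/3]    # a uniform guess
        print(kl_divergence(p, q))  # ~0.43 bits of "extra questions"
        print(kl_divergence(p, p))  # 0.0: no penalty when the guess is exact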

    Cross Entropy

    Cross entropy describes the expected total number of single-choice questions needed when the questions optimal for q(X) are used while the true distribution is p(X):

    CE(p(X), q(X)) = -\sum_{x\in X} p(x)\log q(x)

    Binary cross entropy between p and q can be expressed as:

    BCE(p, q) = -p\log q - (1-p)\log(1-q)
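
    A companion sketch in the same style (helper names are mine; p and q are aligned lists of probabilities). As the definitions imply, CE(p, q) = S(p) + D_KL(p || q), so minimizing cross entropy over q with p fixed is the same as minimizing the K-L divergence:

        import math

        def cross_entropy(p, q, base: float = 2.0) -> float:
            """CE(p, q) = -sum p(x) log q(x) over aligned probability lists."""
            return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

        def binary_cross_entropy(p: float, q: float, base: float = 2.0) -> float:
            """BCE(p, q) = -p log q - (1-p) log(1-q): cross entropy of two Bernoulli distributions."""
            return cross_entropy([p, 1 - p], [q, 1 - q], base)

        p = [0.7, 0.2, 0.1]
        q = [1/3, 1/3, 1/3]                    # uniform guess
        print(cross_entropy(p, q))             # log2(3) ~ 1.585 bits: entropy of p plus the K-L penalty
        print(binary_cross_entropy(1.0, 0.9))  # ~0.152 bits: true label 1, predicted probability 0.9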

    Mutual Information

    Mutual information, defined for two different random variable spaces, X and Y, is:

    MI(X, Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}

    This measures the difference between the joint distribution of X and Y and the product of their marginals (what the joint would be if X and Y were independent), with MI = 0 exactly when they are independent.
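
    A sketch computing MI from a joint probability table (my own example, with the joint given as a nested dict):

        import math

        def mutual_information(joint, base: float = 2.0) -> float:
            """MI of a joint distribution given as {x: {y: p(x, y)}}."""
            # Marginals p(x) and p(y) from the joint table.
            p_x = {x: sum(row.values()) for x, row in joint.items()}
            p_y = {}
            for row in joint.values():
                for y, p in row.items():
                    p_y[y] = p_y.get(y, 0.0) + p
            return sum(
                p * math.log(p / (p_x[x] * p_y[y]), base)
                for x, row in joint.items()
                for y, p in row.items()
                if p > 0
            )

        # X and Y perfectly correlated: knowing X tells you Y (1 bit of shared information).
        correlated = {0: {0: 0.5, 1: 0.0}, 1: {0: 0.0, 1: 0.5}}
        # X and Y independent fair coins: no shared information.
        independent = {0: {0: 0.25, 1: 0.25}, 1: {0: 0.25, 1: 0.25}}
        print(mutual_information(correlated))   # 1.0
        print(mutual_information(independent))  # 0.0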

    Loss Functions in PyTorch

    Some relevant loss functions in PyTorch are listed below; a brief usage sketch follows the list:

    • torch.nn.KLDivLoss
    • torch.nn.CrossEntropyLoss
    • torch.nn.BCELoss
    • torch.nn.BCEWithLogitsLoss
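
    A minimal usage sketch of my own, assuming a recent PyTorch version. The input conventions are easy to trip over: KLDivLoss expects log-probabilities as input and probabilities as target, CrossEntropyLoss expects raw logits and class indices, BCELoss expects probabilities in (0, 1), and BCEWithLogitsLoss applies the sigmoid itself:

        import torch
        import torch.nn.functional as F

        logits = torch.randn(4, 3)                           # a batch of 4 samples, 3 classes
        class_targets = torch.tensor([0, 2, 1, 0])           # class indices for CrossEntropyLoss
        prob_targets = F.softmax(torch.randn(4, 3), dim=1)   # a target distribution for KLDivLoss

        # CrossEntropyLoss: raw logits in, class indices as targets (softmax + log are internal).
        ce = torch.nn.CrossEntropyLoss()(logits, class_targets)

        # KLDivLoss: log-probabilities in, probabilities as targets.
        kl = torch.nn.KLDivLoss(reduction="batchmean")(F.log_softmax(logits, dim=1), prob_targets)

        # Binary case: BCEWithLogitsLoss applies the sigmoid internally; BCELoss wants probabilities.
        binary_logits = torch.randn(4)
        binary_targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
        bce_with_logits = torch.nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
        bce = torch.nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

        print(ce.item(), kl.item(), bce_with_logits.item(), bce.item())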

    Boltzmann’s Epitaph

    S = k\log W

    Here, S is the thermodynamic entropy, k is Boltzmann’s constant k_B, and W is the number of microstates. In statistical mechanics, when all microstates are equally probable, information entropy (measured in nats) and thermodynamic entropy differ only by the factor k_B.
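
    To spell this out (a short derivation of my own, using the definitions above): with W equally probable microstates, each has probability p(m) = 1/W, so the information entropy in nats is

    S_{info} = -\sum_{m=1}^{W} \frac{1}{W}\ln\frac{1}{W} = \ln W

    and Boltzmann’s formula becomes S = k_B\ln W = k_B\, S_{info}.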

    Fortune Telling and Low Information Content

    In 451 AD, Attila invaded Roman territory and fought a coalition led by Aetius that included Theodoric, King of the Visigoths. Before the battle, Attila consulted a fortune teller, who predicted that a king would die and a state would collapse. Attila assumed Theodoric would be the one to fall; Theodoric did indeed die, but the coalition defeated the Huns and Attila’s ambitions were thwarted.

    This story illustrates how fortune tellers employ low-information predictions that lead people to infer higher-information outcomes, playing on the difference in entropy between statements.
