
The Physics of Information Lecture Note 3
Lecture 3
Information theory
- data compression (lossless transmission) : What is the best encoding scheme?
- Noisy channel
  - How many auxiliary bits do we need to send besides the original data?
  - How many messages can be reliably transmitted through a noisy channel?
Shannon entropy
Information is used to characterize how surprised you are when you learn something. Mathematically, Shannon defined a quantity called self-information. Suppose we have an event A, which is a set of outcomes of some random experiment. If P(A) is the probability that the event A will occur, then the self-information associated with A is given by
$$i(A) = -\log_2 P(A).$$
Intuitively, this makes sense: a highly probable event gives little information.
If we have a set of independent events $A_i$, which are sets of outcomes of some experiment, $S=\{A_1, A_2, \ldots, A_n\}$, then the average information associated with the random experiment is given by
$$H = -\sum_i P(A_i)\log_2 P(A_i),$$
which is exactly the Gibbs entropy.
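As a quick numerical check, the entropy of any discrete distribution can be evaluated directly from this definition. The following is a minimal Python sketch; the example probabilities are illustrative only.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of uncertainty per toss.
print(shannon_entropy([0.5, 0.5]))    # 1.0
# A nearly deterministic source carries far less information on average.
print(shannon_entropy([0.99, 0.01]))  # ~0.081
```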
Application in data compression
Shannon showed that if the experiment is a source that puts out symbols Ai from a set A, then the entropy is a measure of the average number of binary symbols needed to code the output of the source. The best that a lossless compression scheme can do is to encode the output of a source with an average number of bits equal to the entropy of the source.
Source : $X=\{x_1, x_2, \ldots, x_l\}$ with probabilities $\{P(x_1), P(x_2), \ldots, P(x_l)\}$
Suppose each message has $N$ letters.
typical message : a message consisting of $NP(x_1)$ occurrences of $x_1$, $NP(x_2)$ occurrences of $x_2$, \ldots, $NP(x_l)$ occurrences of $x_l$.
Question : How many typical messages do we have?
$$N_t = \frac{N!}{(NP(x_1))!\,(NP(x_2))!\cdots(NP(x_l))!}, \qquad \log_2 N_t = \log_2 N! - \sum_{i=1}^{l}\log_2\big(NP(x_i)\big)!$$
Using Stirling's approximation $\log_2 N! \approx N\log_2 N - N\log_2 e$,
$$\log_2 N_t = N\log_2 N - N\log_2 e - \sum_i\Big(NP(x_i)\log_2 NP(x_i) - NP(x_i)\log_2 e\Big) = -N\sum_i P(x_i)\log_2 P(x_i) = NH.$$
Thus,
$$N_t = 2^{NH}.$$
This means that $NH$ bits in total are enough to index all of these typical messages, instead of the $N\log_2 l$ bits needed to encode an arbitrary message letter by letter.
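This counting can be checked numerically. Below is a minimal Python sketch that, for an assumed binary source with $P(x_1)=0.25$, $P(x_2)=0.75$ and $N=1000$, compares $\log_2 N_t$ computed from the exact factorials with $NH$; the two agree up to corrections that are sub-leading in $N$.

```python
import math

N = 1000
probs = [0.25, 0.75]                  # assumed binary source, for illustration only
counts = [int(N * p) for p in probs]  # number of each letter in a typical message

# Exact log2 of the multinomial coefficient N! / prod_i (N P(x_i))!
log2_Nt = (math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)

# N H, with H the Shannon entropy of the source
NH = -N * sum(p * math.log2(p) for p in probs)

print(log2_Nt)  # ~806.2
print(NH)       # ~811.3 -- the difference is O(log N), negligible compared to N
```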
$$H = -\sum_i P(x_i)\log_2 P(x_i) \;\longrightarrow\; \text{Shannon entropy}$$
Example : $\{1,2,3,4\}$, each with probability $\tfrac{1}{4}$.
$$H = -\sum_{i=1}^{4}\tfrac{1}{4}\log_2\tfrac{1}{4} = 2.$$
Listing the typical messages and labeling them from $1$ to $2^{NH}$:
$$\begin{array}{ccccc|c}
x_1 & x_1 & x_2 & \cdots & x_l & 1 \\
x_1 & x_2 & x_1 & \cdots & x_{l-1} & 2 \\
\vdots & & & & & \vdots \\
x_2 & x_3 & x_1 & \cdots & x_1 & 2^{NH}
\end{array}$$
Here $NH$ is the number of bits required to store a message, and $H$ is the average number of bits required to store a letter.
Example 1 : Let $A=\{a_1, a_2, \ldots, a_5\}$ with $P(a_1)=P(a_3)=0.2$, $P(a_2)=0.4$, $P(a_4)=P(a_5)=0.1$. Use this source to generate messages.
$$H = -\sum_{i=1}^{5} P(a_i)\log_2 P(a_i) = 2.122,$$
Huffman coding :
symbol | codeword |
---|---|
$a_2$ | 1 |
$a_1$ | 01 |
$a_3$ | 000 |
$a_4$ | 0010 |
$a_5$ | 0011 |
The average number of bits per symbol using the Huffman code is
$$1\times 0.4 + 2\times 0.2 + 3\times 0.2 + 4\times 0.1 + 4\times 0.1 = 2.2,$$
which is larger than 2.122 by 0.078.
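For reference, the Huffman construction for this source can be reproduced with a short Python sketch (a standard greedy merge using a heap; because of ties in the merges, the individual code lengths may differ from the table above, but the average length is still 2.2 bits).

```python
import heapq
import math

def huffman_code(probs):
    """Greedy Huffman construction; returns {symbol: binary codeword}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

P = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(P)
avg_len = sum(P[s] * len(w) for s, w in code.items())
H = -sum(p * math.log2(p) for p in P.values())
print(code)        # a valid prefix code; lengths may differ from the table due to ties
print(avg_len, H)  # 2.2  2.1219...
```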
Example 2 : Consider the following sequence 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10. How can we code this sequence?
Naively, we can code each letter with four binary symbols, e.g. 10 = 1010 and 2 = 0010. To code this sequence, we then require 4 binary symbols per letter (symbol).
Suppose that each number in the sequence occurs independently and that the probability of each number is accurately reflected by its frequency in the sequence. Then $A=\{1,2,3,4,5,6,7,8,9,10\}$ and the associated probabilities are $P(1)=P(6)=P(7)=P(10)=\tfrac{1}{16}$, $P(2)=P(3)=P(4)=P(5)=P(8)=P(9)=\tfrac{2}{16}$. The entropy is calculated as $H=-\sum_{i=1}^{10}P(i)\log_2 P(i)=3.25$.
That means the best lossless scheme we could find for this sequence would code it at 3.25 bits per symbol, which is smaller than the naive 4 bits per symbol.
But this sequence has structure. The differences between neighboring numbers are 1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1 (the first 1 is the initial value, since we also need to code it). In this case $P(1)=\tfrac{13}{16}$, $P(-1)=\tfrac{3}{16}$, and $H=0.7$, which is much smaller than the previous value.
This example indicates that it’s not possible to know the entropy of a physical source that generates a sequence unless we are quite clear about the structure of the sequence.
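Both empirical entropies quoted above can be recomputed from the letter frequencies; the sketch below is a minimal check of the 3.25-bit and 0.7-bit figures.

```python
from collections import Counter
import math

def empirical_entropy(seq):
    """Entropy (bits) of the empirical frequency distribution of a sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
# Differences between neighbours, keeping the initial value as the first entry.
diffs = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

print(empirical_entropy(seq))    # 3.25 bits per symbol for the raw sequence
print(empirical_entropy(diffs))  # ~0.70 bits per symbol once the structure is used
```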
Noisy channels
We may regard $X$ and $Y$ as sets of letters. If the channel is noiseless, the map from $X$ to $Y$ is one-to-one; that is, an output letter $y$ corresponds to an input of the same letter $y$. However, a channel usually contains noise, which modifies the output. How many auxiliary bits do we need to send besides the original data?
Conditional probability $P(y_j\mid x_i)$ : given that input $x_i$ is prepared, the probability that output $y_j$ will be produced.
Noiseless channel : $P(y_j\mid x_i)=\delta_{ij}$
Joint probability $P(x_i\wedge y_j)$ : the probability that the input is $x_i$ and the output is $y_j$.
$$P(x_i\wedge y_j) = P(x_i)P(y_j\mid x_i) = P(y_j)P(x_i\mid y_j).$$
$2^{NH(X)}$ : the number of typical distinct input messages
$2^{NH(Y)}$ : the number of typical distinct output messages
Joint entropy (uncertainty) : $H(X\wedge Y) \equiv -\sum_{ij} P(x_i\wedge y_j)\log_2 P(x_i\wedge y_j)$
- $H(X\wedge Y) \ge H(X)$
- Consider the noiseless channel: $P(x_i\wedge y_j)=\delta_{ij}P(x_i)$, so $H(X\wedge Y)=H(X)$.
Conditional entropy : $H(X\mid Y) \equiv \sum_j P(y_j)\left[-\sum_i P(x_i\mid y_j)\log_2 P(x_i\mid y_j)\right]$
The bracketed part is the uncertainty of the input for a given output $y_j$. Averaged over all $y_j$, $H(X\mid Y)$ gives us the average uncertainty of the input given the outputs.
Noiseless channel : $P(x_i\mid y_j)=\delta_{ij}$, $H(X\wedge Y)=H(X)=H(Y)$, $H(X\mid Y)=0$.
Completely noisy channel : input and output letters are independent (do not have any correlation)
$$P(x_i\mid y_j)=P(x_i), \qquad H(X\mid Y) = -\sum_i P(x_i)\log_2 P(x_i) = H(X).$$
These two limiting cases illustrate the general inequality
$$H(X\mid Y) \le H(X).$$
The uncertainty of X is never increased by knowledge of Y. It will be decreased unless Y and X are independent events, in which case it is not changed.
$$H(X\wedge Y) = H(X) + H(Y\mid X) = H(Y) + H(X\mid Y)$$
Proof :
$$\mathrm{LHS} = -\sum_{ij} P(x_i\wedge y_j)\log_2 P(x_i\wedge y_j) = -\sum_{ij} P(x_i\mid y_j)P(y_j)\big(\log_2 P(x_i\mid y_j) + \log_2 P(y_j)\big) = H(X\mid Y) + H(Y)$$
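The identity can also be verified numerically on any joint distribution. The sketch below uses an arbitrary $2\times 2$ joint distribution (chosen only for illustration) and checks that $H(X\wedge Y) = H(Y) + H(X\mid Y)$.

```python
import math

def H(probs):
    """Shannon entropy (bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution P(x ^ y) over two binary variables (illustration only).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distribution of Y.
Py = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in (0, 1)}

H_joint = H(P.values())
H_Y = H(Py.values())
# H(X|Y) = sum_y P(y) * H(X | Y = y)
H_X_given_Y = sum(Py[y] * H([P[(x, y)] / Py[y] for x in (0, 1)]) for y in (0, 1))

print(H_joint)            # ~1.846
print(H_Y + H_X_given_Y)  # same value: the chain rule H(X^Y) = H(Y) + H(X|Y)
```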

Each sequence of pairs of $X, Y$ values represents a map from $X$ to $Y$. Thus $2^{NH(X\wedge Y)}/2^{NH(Y)}$ represents the average number of distinct sequences of $X$ for a given output. In other words, this number measures the number of possible input sequences that could have given rise to the observed output.
$$\frac{2^{NH(X\wedge Y)}}{2^{NH(Y)}} = 2^{NH(X\mid Y)}$$

When the channel is noiseless, $H(X\mid Y)=0$ and $2^{NH(X\mid Y)}=1$, reflecting the one-to-one correspondence.
Shannon points out that if one is trying to use a noisy channel to send a message, then the conditional entropy specifies the number of bits per letter that would need to be sent over an auxiliary noiseless channel in order to correct all the errors that have crept into the transmitted sequence as a result of the noise.
Mutual information : $H(X:Y) = H(X) - H(X\mid Y)$ measures how much information $X$ and $Y$ have in common.
If the map from $X$ to $Y$ is one-to-one, then $H(X:Y)=H(X)$, telling us that all input information is shared. If the map from $X$ to $Y$ is totally random, that is, $X$ and $Y$ occur independently, then $H(X:Y)=0$ since $H(X\mid Y)=H(X)$.
Shannon's noisy coding theorem : $2^{NH(X:Y)}$ tells us the number of messages that can be reliably transmitted through the noisy channel.
Example : Binary symmetric channel : $X=\{0,1\}$, $Y=\{0,1\}$.
$$P(y=0\mid x=0)=1-f,\quad P(y=0\mid x=1)=f,\quad P(y=1\mid x=0)=f,\quad P(y=1\mid x=1)=1-f$$
where $f$ denotes the probability that a bit is flipped. Let $f=0.15$, $P(x=0)=0.9$ and $P(x=1)=0.1$. Determine $H(X)$, $H(Y)$, $H(Y\mid X)$ and $H(X:Y)$.
Solution :
$$H(X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 = 0.47.$$
$$P(y=0) = P(y=0\mid x=0)P(x=0) + P(y=0\mid x=1)P(x=1) = 0.78, \quad P(y=1)=0.22, \quad H(Y)=0.76.$$
$$H(Y\mid X) = -\sum_i P(x_i)\sum_j P(y_j\mid x_i)\log_2 P(y_j\mid x_i) = -0.9\times(0.85\log_2 0.85 + 0.15\log_2 0.15) - 0.1\times(0.85\log_2 0.85 + 0.15\log_2 0.15) = 0.61.$$
$$H(X:Y) = H(Y) - H(Y\mid X) = 0.15.$$
Suppose the length of a message is $N$. Using this source, we can only send $2^{0.15N}$ typical distinct messages, instead of $2^{0.47N}$ under a noiseless channel.
$$H(X\mid Y) = H(X) - H(X:Y) = 0.32.$$
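These four quantities are easy to recompute. The following minimal sketch reproduces the numbers above for $f=0.15$, $P(x=0)=0.9$, $P(x=1)=0.1$.

```python
import math

def H(probs):
    """Shannon entropy (bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

f = 0.15                # flip probability
Px = {0: 0.9, 1: 0.1}   # input distribution
# Binary symmetric channel: P(y|x), keyed as (y, x).
Pyx = {(y, x): (1 - f if y == x else f) for y in (0, 1) for x in (0, 1)}

# Output distribution P(y) = sum_x P(x) P(y|x).
Py = {y: sum(Px[x] * Pyx[(y, x)] for x in (0, 1)) for y in (0, 1)}

H_X = H(Px.values())
H_Y = H(Py.values())
H_Y_given_X = sum(Px[x] * H([Pyx[(y, x)] for y in (0, 1)]) for x in (0, 1))
I_XY = H_Y - H_Y_given_X          # mutual information H(X:Y)
H_X_given_Y = H_X - I_XY

print(H_X, H_Y)           # ~0.47  ~0.76
print(H_Y_given_X, I_XY)  # ~0.61  ~0.15
print(H_X_given_Y)        # ~0.32
```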
Comparison between Boltzmann entropy and Shannon entropy
Boltzmann | Shannon |
---|---|
$S_B = -k_B N\sum_i p_i\ln p_i$ | $H = -\sum_i p_i\log_2 p_i$ |
$p_i \propto e^{-\beta E_i}$, the probability that a particle occupies energy level $E_i$; $p_i$ is determined by the energy. | $p_i$ is the probability that a letter $x_i$ occurs; $p_i$ is determined by the source. |
$e^{S_B/k_B}$ measures the total number of configurations under such a probability distribution. | $2^{NH}$ measures the total number of distinct typical messages of length $N$. |
Evidently, we can regard Boltzmann entropy as a special case of the Shannon entropy.
Capacity of a channel Q
$$C(Q) = \max_{P_X}\{H(X:Y)\},$$
the maximum of the mutual information over input distributions $P_X$. The distribution $P_X$ that achieves the maximum is called the optimal input distribution.
Example : Consider the binary symmetric channel with $f=0.15$. Above, we considered $P_X=\{P_0=0.9, P_1=0.1\}$ and found $H(X:Y)=0.15$ bits.
$C(Q_{\mathrm{BSC}}) = 0.39$ bits, achieved by the uniform input distribution $P_0=P_1=0.5$.
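This value can be checked either from the known closed form for the binary symmetric channel, $C = 1 - H_2(f)$ with $H_2$ the binary entropy function, or by brute-force maximization of $H(X:Y)$ over the input distribution; the sketch below does both.

```python
import math

f = 0.15

def H2(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info(p0):
    """H(X:Y) = H(Y) - H(Y|X) for the binary symmetric channel with P(x=0) = p0."""
    py0 = p0 * (1 - f) + (1 - p0) * f
    return H2(py0) - H2(f)

# Brute-force scan over the input distribution P_0.
best_p0 = max((k / 1000 for k in range(1001)), key=mutual_info)
print(best_p0, mutual_info(best_p0))  # 0.5  ~0.39
print(1 - H2(f))                      # closed form 1 - H2(0.15) = ~0.39
```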