
Directed information


Directed information is an information theory measure that quantifies the information flow from the random string $X^n = (X_1, X_2, \dots, X_n)$ to the random string $Y^n = (Y_1, Y_2, \dots, Y_n)$. The term directed information was coined by James Massey and is defined as

$$I(X^n \to Y^n) \triangleq \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1})$$

where $I(X^i; Y_i \mid Y^{i-1})$ is the conditional mutual information $I(X_1, X_2, \dots, X_i;\, Y_i \mid Y_1, Y_2, \dots, Y_{i-1})$.
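As an illustration, the definition can be evaluated by brute force for a small joint distribution. The following sketch uses a made-up pmf over binary strings with $n = 2$ (all names and the pmf itself are ad hoc, chosen only for the example):

```python
# Directed information I(X^n -> Y^n) computed by brute force from a small
# joint pmf (binary alphabets, n = 2; the pmf is arbitrary, for illustration).
import itertools, math

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # weights 1..16, normalized

def marg(pmf, keep):
    """Marginal pmf over the coordinate indices in `keep`."""
    out = {}
    for s, v in pmf.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def directed_info(pmf):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1})."""
    total = 0.0
    for i in range(1, n + 1):
        A = list(range(i))               # X^i   (coords 0..n-1 are X)
        B = [n + i - 1]                  # Y_i   (coords n..2n-1 are Y)
        C = list(range(n, n + i - 1))    # Y^{i-1}
        pj, pa, pb, pc = (marg(pmf, A + B + C), marg(pmf, A + C),
                          marg(pmf, B + C), marg(pmf, C))
        for s, v in pj.items():
            a = s[:len(A)] + s[len(A) + len(B):]
            b = s[len(A):]
            c = s[len(A) + len(B):]
            total += v * math.log2(v * pc[c] / (pa[a] * pb[b]))
    return total

print(round(directed_info(p), 4))
```

For an independent pair of strings the sum of conditional mutual informations vanishes, so the directed information is zero, which gives a quick sanity check on the implementation.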

Directed information has applications to problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks, the capacity of networks with in-block memory, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.

Causal conditioning

The essence of directed information is causal conditioning. The probability of $x^n$ causally conditioned on $y^n$ is defined as

$$P(x^n \| y^n) \triangleq \prod_{i=1}^{n} P(x_i \mid x^{i-1}, y^i).$$

This is similar to the chain rule for conventional conditioning, $P(x^n \mid y^n) = \prod_{i=1}^{n} P(x_i \mid x^{i-1}, y^n)$, except that one conditions on the "past" and "present" symbols $y^i$ rather than on all symbols $y^n$. To include only "past" symbols, one can introduce a delay by prepending a constant symbol:

$$P(x^n \| (0, y^{n-1})) \triangleq \prod_{i=1}^{n} P(x_i \mid x^{i-1}, y^{i-1}).$$

It is common to abuse notation by writing $P(x^n \| y^{n-1})$ for this expression, although formally all strings should have the same number of symbols.

One may also condition on multiple strings: $P(x^n \| y^n, z^n) \triangleq \prod_{i=1}^{n} P(x_i \mid x^{i-1}, y^i, z^i)$.
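The causal-conditioning product can be evaluated directly from a joint distribution. The sketch below does so for a made-up binary pmf with $n = 2$ (both the pmf and the helper names are illustrative); a useful property to observe is that $P(x^n \| y^n)$, summed over all $x^n$, equals 1 for every fixed $y^n$:

```python
# Causal conditioning P(x^n || y^n) and its delayed variant, computed by
# brute force on a toy joint pmf (binary, n = 2; pmf arbitrary/illustrative).
import itertools

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # p[(x1,x2,y1,y2)]

def marg(keep):
    """Marginal pmf over the coordinate indices in `keep`."""
    out = {}
    for s, v in p.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def causal(x, y, delay=0):
    """P(x^n || y^n) for delay=0, or P(x^n || (0, y^{n-1})) for delay=1."""
    out = 1.0
    for i in range(1, n + 1):
        num = list(range(i)) + list(range(n, n + i - delay))      # x^i, y^{i-delay}
        den = list(range(i - 1)) + list(range(n, n + i - delay))  # x^{i-1}, y^{i-delay}
        out *= marg(num)[x[:i] + y[:i - delay]] / marg(den)[x[:i - 1] + y[:i - delay]]
    return out

x, y = (0, 1), (1, 0)
print(causal(x, y), causal(x, y, delay=1))
```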

Causally conditioned entropy

The causally conditioned entropy is defined as:

$$H(X^n \| Y^n) = \mathbf{E}\left[-\log P(X^n \| Y^n)\right] = \sum_{i=1}^{n} H(X_i \mid X^{i-1}, Y^i)$$

Similarly, one may causally condition on multiple strings and write $H(X^n \| Y^n, Z^n) = \mathbf{E}\left[-\log P(X^n \| Y^n, Z^n)\right]$.
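The causally conditioned entropy can be computed as an expectation of $-\log P(X^n \| Y^n)$ under the joint distribution. A minimal sketch, again with an arbitrary binary pmf and $n = 2$ (all choices illustrative); since each term $H(X_i \mid X^{i-1}, Y^i)$ is at most $H(X_i \mid X^{i-1})$, the result is bounded above by $H(X^n)$:

```python
# Causally conditioned entropy H(X^n || Y^n) = E[-log P(X^n || Y^n)] on a
# toy binary pmf with n = 2 (the pmf is arbitrary, for illustration only).
import itertools, math

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # p[(x1,x2,y1,y2)]

def marg(keep):
    out = {}
    for s, v in p.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def causal(x, y):
    """P(x^n || y^n) = prod_i P(x_i | x^{i-1}, y^i)."""
    out = 1.0
    for i in range(1, n + 1):
        num = list(range(i)) + list(range(n, n + i))        # (x^i, y^i)
        den = list(range(i - 1)) + list(range(n, n + i))    # (x^{i-1}, y^i)
        out *= marg(num)[x[:i] + y[:i]] / marg(den)[x[:i - 1] + y[:i]]
    return out

# H(X^n || Y^n) as an expectation under the joint distribution.
h_causal = -sum(v * math.log2(causal(s[:n], s[n:])) for s, v in p.items())
print(round(h_causal, 4))
```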

Properties

A decomposition rule for causal conditioning is

$$P(x^n, y^n) = P(x^n \| y^{n-1})\, P(y^n \| x^n).$$

This rule shows that any product of the form $P(x^n \| y^{n-1})\, P(y^n \| x^n)$ gives a joint distribution $P(x^n, y^n)$.

The causal conditioning probability $P(y^n \| x^n) = \prod_{i=1}^{n} P(y_i \mid y^{i-1}, x^i)$ is a probability vector, i.e.,

$$P(y^n \| x^n) \geq 0 \quad\text{and}\quad \sum_{y^n} P(y^n \| x^n) = 1 \quad \text{for all } (x^n, y^n).$$
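The decomposition rule can be verified numerically by computing the two causal factors from the joint distribution and multiplying them back together. A sketch with a made-up binary pmf and $n = 2$ (all names illustrative):

```python
# Check the decomposition P(x^n, y^n) = P(x^n || y^{n-1}) P(y^n || x^n) by
# brute force on a toy joint pmf (binary, n = 2; pmf arbitrary/illustrative).
import itertools

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # p[(x1,x2,y1,y2)]

def marg(keep):
    out = {}
    for s, v in p.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def px_causal_delayed(x, y):
    """P(x^n || y^{n-1}) = prod_i P(x_i | x^{i-1}, y^{i-1})."""
    out = 1.0
    for i in range(1, n + 1):
        num = list(range(i)) + list(range(n, n + i - 1))
        den = list(range(i - 1)) + list(range(n, n + i - 1))
        out *= marg(num)[x[:i] + y[:i - 1]] / marg(den)[x[:i - 1] + y[:i - 1]]
    return out

def py_causal(y, x):
    """P(y^n || x^n) = prod_i P(y_i | y^{i-1}, x^i)."""
    out = 1.0
    for i in range(1, n + 1):
        num = list(range(n, n + i)) + list(range(i))
        den = list(range(n, n + i - 1)) + list(range(i))
        out *= marg(num)[y[:i] + x[:i]] / marg(den)[y[:i - 1] + x[:i]]
    return out

s = (0, 1, 1, 0)
print(px_causal_delayed(s[:n], s[n:]) * py_causal(s[n:], s[:n]), p[s])
```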

Directed information can be written in terms of causal conditioning:

$$I(X^n \to Y^n) = \mathbf{E}\left[\log \frac{P(Y^n \| X^n)}{P(Y^n)}\right] = H(Y^n) - H(Y^n \| X^n).$$

The relation generalizes to three strings: the directed information flowing from $X^n$ to $Y^n$ causally conditioned on $Z^n$ is

$$I(X^n \to Y^n \| Z^n) = H(Y^n \| Z^n) - H(Y^n \| X^n, Z^n).$$
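The entropy-difference form can be checked against the sum-of-conditional-mutual-informations definition on a small example. A sketch with an arbitrary binary pmf and $n = 2$ (helper names and pmf are illustrative only):

```python
# Check I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n) numerically on a toy pmf
# (binary alphabets, n = 2; the pmf is arbitrary, for illustration).
import itertools, math

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # weights 1..16, normalized

def marg(keep):
    out = {}
    for s, v in p.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def cmi(A, B, C):
    """I(A; B | C) for coordinate index lists A, B, C."""
    pj, pa, pb, pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    return sum(v * math.log2(v * pc[s[len(A) + len(B):]] /
                             (pa[s[:len(A)] + s[len(A) + len(B):]] * pb[s[len(A):]]))
               for s, v in pj.items())

def cond_ent(A, C):
    """H(A | C)."""
    pj, pc = marg(A + C), marg(C)
    return -sum(v * math.log2(v / pc[s[len(A):]]) for s, v in pj.items())

# Left-hand side: sum of conditional mutual informations.
di = sum(cmi(list(range(i)), [n + i - 1], list(range(n, n + i - 1)))
         for i in range(1, n + 1))

# Right-hand side: H(Y^n) minus the causally conditioned entropy H(Y^n || X^n).
hy = -sum(v * math.log2(v) for v in marg(list(range(n, 2 * n))).values())
hy_x = sum(cond_ent([n + i - 1], list(range(n, n + i - 1)) + list(range(i)))
           for i in range(1, n + 1))
print(round(di, 6), round(hy - hy_x, 6))
```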

Conservation law of information

This law, established by James Massey and his son Peter Massey, gives intuition by relating directed information and mutual information. The law states that for any $X^n, Y^n$, the following equality holds:

$$I(X^n; Y^n) = I(X^n \to Y^n) + I(Y^{n-1} \to X^n).$$

Two alternative forms of this law are

$$I(X^n; Y^n) = I(X^n \to Y^n) + I(Y^n \to X^n) - I(X^n \leftrightarrow Y^n)$$
$$I(X^n; Y^n) = I(X^{n-1} \to Y^n) + I(Y^{n-1} \to X^n) + I(X^n \leftrightarrow Y^n)$$

where $I(X^n \leftrightarrow Y^n) = \sum_{i=1}^{n} I(X_i; Y_i \mid X^{i-1}, Y^{i-1})$.
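The conservation law holds for every joint distribution, so it can be verified by brute force on a small example. In the sketch below the delayed term $I(Y^{n-1} \to X^n)$ is computed as $\sum_{i=1}^{n} I(Y^{i-1}; X_i \mid X^{i-1})$ (the pmf and names are made up for illustration):

```python
# Numerical check of Massey's conservation law
# I(X^n; Y^n) = I(X^n -> Y^n) + I(Y^{n-1} -> X^n)
# on a toy binary pmf with n = 2 (pmf arbitrary, for illustration).
import itertools, math

n = 2
seqs = list(itertools.product([0, 1], repeat=2 * n))
p = {s: (i + 1) / 136 for i, s in enumerate(seqs)}  # p[(x1,x2,y1,y2)]

def marg(keep):
    out = {}
    for s, v in p.items():
        k = tuple(s[j] for j in keep)
        out[k] = out.get(k, 0.0) + v
    return out

def cmi(A, B, C):
    """I(A; B | C) over coordinate index lists."""
    pj, pa, pb, pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    t = 0.0
    for s, v in pj.items():
        a = s[:len(A)] + s[len(A) + len(B):]
        b = s[len(A):]
        c = s[len(A) + len(B):]
        t += v * math.log2(v * pc[c] / (pa[a] * pb[b]))
    return t

mi = cmi(list(range(n)), list(range(n, 2 * n)), [])
di_xy = sum(cmi(list(range(i)), [n + i - 1], list(range(n, n + i - 1)))
            for i in range(1, n + 1))
# I(Y^{n-1} -> X^n) = sum_i I(Y^{i-1}; X_i | X^{i-1})
di_yx = sum(cmi(list(range(n, n + i - 1)), [i - 1], list(range(i - 1)))
            for i in range(1, n + 1))
print(mi, di_xy + di_yx)
```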

Estimation and optimization

Estimating and optimizing the directed information is challenging because it consists of $n$ terms, where $n$ may be large. In many cases, one is interested in optimizing the limiting average, that is, the normalized directed information as $n$ grows to infinity, termed a multi-letter expression.

Estimation

Estimating directed information from samples is a hard problem, since the directed information expression depends not on the samples themselves but on the joint distribution $\{P(x_i, y_i \mid x^{i-1}, y^{i-1})\}_{i=1}^{n}$, which may be unknown. There are several algorithms based on context tree weighting, on empirical parametric distributions, and on long short-term memory.

Optimization

Maximizing directed information is a fundamental problem in information theory. For example, given the channel distributions $\{P(y_i \mid x^i, y^{i-1})\}_{i=1}^{n}$, the objective might be to optimize $I(X^n \to Y^n)$ over the channel input distributions $\{P(x_i \mid x^{i-1}, y^{i-1})\}_{i=1}^{n}$.

There are algorithms to optimize the directed information based on Blahut-Arimoto, Markov decision processes, recurrent neural networks, reinforcement learning, and graphical methods (the Q-graphs). For the Blahut-Arimoto algorithm, the main idea is to start with the last mutual information term of the directed information expression and go backward. For the Markov decision process approach, the main idea is to transform the optimization into an infinite-horizon average-reward Markov decision process. For a recurrent neural network, the main idea is to model the input distribution using a recurrent neural network and optimize the parameters by gradient descent. For reinforcement learning, the main idea is to solve the Markov decision process formulation of the capacity using reinforcement learning tools, which makes it possible to deal with large or even continuous alphabets.
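To see the flavor of the optimization in the simplest possible case: for $n = 1$ and a memoryless channel, $I(X^1 \to Y^1) = I(X; Y)$, so the problem reduces to the classical capacity computation. The sketch below uses a plain grid search over binary input distributions (this is an illustration only, not one of the algorithms named above; the BSC(0.1) channel is an arbitrary choice):

```python
# Grid-search maximization of I(X; Y) over binary input distributions for a
# binary symmetric channel with crossover 0.1 (illustrative example only).
import math

W = [[0.9, 0.1], [0.1, 0.9]]  # W[x][y] = P(y | x)

def mutual_info(px):
    """I(X; Y) for input pmf px over {0, 1} through the channel W."""
    py = [sum(px[x] * W[x][y] for x in range(2)) for y in range(2)]
    t = 0.0
    for x in range(2):
        for y in range(2):
            if px[x] * W[x][y] > 0:
                t += px[x] * W[x][y] * math.log2(W[x][y] / py[y])
    return t

best = max((mutual_info([a, 1 - a]), a) for a in (k / 1000 for k in range(1001)))
print(best)  # maximum at a = 0.5, value 1 - H2(0.1) ≈ 0.531
```

By symmetry the uniform input is optimal here; the Blahut-Arimoto algorithm would find the same maximizer without exhaustive search.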

Marko's theory of bidirectional communication

Massey's directed information was motivated by Marko's early work (1966) on developing a theory of bidirectional communication. Marko's definition of directed transinformation differs slightly from Massey's in that, at time $n$, one conditions on the past symbols $X^{n-1}, Y^{n-1}$ only and takes limits:

$$T_{12} = \lim_{n \to \infty} \mathbf{E}\left[-\log \frac{P(X_n \mid X^{n-1})}{P(X_n \mid X^{n-1}, Y^{n-1})}\right] \quad\text{and}\quad T_{21} = \lim_{n \to \infty} \mathbf{E}\left[-\log \frac{P(Y_n \mid Y^{n-1})}{P(Y_n \mid Y^{n-1}, X^{n-1})}\right].$$

Marko defined several other quantities, including:

  • Total information: $H_1 = \lim_{n \to \infty} \mathbf{E}\left[-\log P(X_n \mid X^{n-1})\right]$ and $H_2 = \lim_{n \to \infty} \mathbf{E}\left[-\log P(Y_n \mid Y^{n-1})\right]$
  • Free information: $F_1 = \lim_{n \to \infty} \mathbf{E}\left[-\log P(X_n \mid X^{n-1}, Y^{n-1})\right]$ and $F_2 = \lim_{n \to \infty} \mathbf{E}\left[-\log P(Y_n \mid Y^{n-1}, X^{n-1})\right]$
  • Coincidence: $K = \lim_{n \to \infty} \mathbf{E}\left[-\log \frac{P(X_n \mid X^{n-1})\, P(Y_n \mid Y^{n-1})}{P(X_n, Y_n \mid X^{n-1}, Y^{n-1})}\right].$

The total information is usually called an entropy rate. Marko showed the following relations for the problems he was interested in:

  • $K = T_{12} + T_{21}$
  • $H_1 = T_{12} + F_1$ and $H_2 = T_{21} + F_2$
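The second pair of relations follows directly from the definitions by splitting the logarithm inside the expectation:

$$-\log P(X_n \mid X^{n-1}) = -\log \frac{P(X_n \mid X^{n-1})}{P(X_n \mid X^{n-1}, Y^{n-1})} - \log P(X_n \mid X^{n-1}, Y^{n-1}),$$

so taking expectations and letting $n \to \infty$ gives $H_1 = T_{12} + F_1$; the identity $H_2 = T_{21} + F_2$ follows symmetrically.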

He also defined quantities he called residual entropies:

  • $R_1 = H_1 - K = F_1 - T_{21}$
  • $R_2 = H_2 - K = F_2 - T_{12}$

and developed the conservation law $F_1 + F_2 = R_1 + R_2 + K = H_1 + H_2 - K$ and several bounds.

Relation to transfer entropy

Directed information is related to transfer entropy, which is a truncated version of Marko's directed transinformation $T_{21}$.

The transfer entropy at time $i$ and with memory $d$ is

$$T_{X \to Y} = I(X_{i-1}, \dots, X_{i-d};\, Y_i \mid Y_{i-1}, \dots, Y_{i-d}),$$

where one does not include the present symbol $X_i$ or the past symbols $X^{i-d-1}, Y^{i-d-1}$ before time $i-d$.

Transfer entropy usually assumes stationarity, i.e., that $T_{X \to Y}$ does not depend on the time $i$.
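Under stationarity, transfer entropy with memory $d = 1$ can be estimated from a single long realization by plugging in empirical frequencies. The sketch below simulates a made-up coupled binary process in which $Y$ copies $X$ with noise, so information flows from $X$ to $Y$ but not the other way (a plain plug-in estimator for illustration, not one of the published estimation algorithms):

```python
# Plug-in estimate of transfer entropy T_{X->Y} with memory d = 1 from a
# simulated binary time series (all modeling choices are illustrative).
import math, random

random.seed(0)
N = 100_000
x = [random.randint(0, 1) for _ in range(N)]
y = [0] * N
for i in range(1, N):
    # y_i follows x_{i-1} 90% of the time, so information flows X -> Y.
    y[i] = x[i - 1] if random.random() < 0.9 else 1 - x[i - 1]

def transfer_entropy(src, dst):
    """Estimate I(src_{i-1}; dst_i | dst_{i-1}) by empirical frequencies."""
    c3, c2a, c2b, c1 = {}, {}, {}, {}
    m = len(src) - 1
    for i in range(1, len(src)):
        k3 = (src[i - 1], dst[i], dst[i - 1])
        c3[k3] = c3.get(k3, 0) + 1
        ka = (src[i - 1], dst[i - 1])
        c2a[ka] = c2a.get(ka, 0) + 1
        kb = (dst[i], dst[i - 1])
        c2b[kb] = c2b.get(kb, 0) + 1
        c1[dst[i - 1]] = c1.get(dst[i - 1], 0) + 1
    t = 0.0
    for (a, b, c), v in c3.items():
        t += (v / m) * math.log2(v * c1[c] / (c2a[(a, c)] * c2b[(b, c)]))
    return t

print(transfer_entropy(x, y), transfer_entropy(y, x))
```

With this coupling the estimate in the $X \to Y$ direction is close to $1 - H_2(0.1) \approx 0.53$ bits, while the reverse direction is near zero, reflecting the asymmetry of the measure.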


Text submitted to CC-BY-SA license. Source: Directed information by Wikipedia (Historical)