Computing Source Entropy Rates Using Probabilistic Automata
- I. Chattopadhyay and H. Lipson , "Computing Entropy Rate Of Symbol Sources & A Distribution-free Limit Theorem", http://arxiv.org/abs/1401.0711
Try with your own data in your browser!
-
Choose data files with space separated entries
-
Specify the quantization
-
Compute entropy rate
(Also try the example- Load Example )
Entropy rate of sequential data-streams naturally quantifies the complexity of the generative process. Thus entropy rate fluctuations could be used as a tool to recognize dynamical perturbations in signal sources, and could potentially be carried out without explicit background noise characterization. However, state of the art algorithms to estimate the entropy rate have markedly slow convergence; making such entropic approaches non-viable in practice.
The arxiv post linked above is a fundamentally new approach to estimate entropy rates, which is demonstrated to converge significantly faster in terms of input data lengths, and is shown to be effective in diverse applications ranging from the estimation of the entropy rate of English texts to the estimation of complexity of chaotic dynamical systems.
Additionally, the convergence rate of entropy estimates do not follow from any standard limit theorem, and reported algorithms fail to provide any confidence bounds on the computed values. Exploiting a connection to the theory of probabilistic automata, we establish a convergence rate of as a function of the input length, which then yields explicit uncertainty estimates, as well as required data lengths to satisfy pre-specified confidence bounds.
This approach is based on modeling discrete and finite-valued stationary and ergodic sources as probabilistic automata. These automata are distinct from that of Paz, and each model in this case is in fact an encoding of a measure defined on the space of strictly infinite strings over a finite alphabet. While the formalisms are completely different, some aspects of this approach has subtle parallels to that of Rissanen’s context algorithm; his search for contexts which yield similar probabilities of generating future symbols is analogous to our search for a synchronizing string in the input stream - a finite sequence of symbols that, once executed on a probabilistic automaton, leads to a fixed state irrespective of the initial conditions. Of course we do not know anything about the hidden model a priori; but nevertheless we establish that such a string, at least in a well-defined approximate sense, always exists and is identifiable efficiently. Finally, given such an approximate synchronizing string, we can use results from non-parametric statistics to bound the probability of error as a function of the input length.