Universal Similarity & Data Smashing
- I. Chattopadhyay and H. Lipson, "Data Smashing: Uncovering Lurking Order in Data", J. R. Soc. Interface, 2014, vol. 11, no. 101, 20140826
Whenever a data mining algorithm searches beyond simple correlations, a human expert must help define a notion of similarity, either by specifying important distinguishing "features" of the data to compare, or by training learning algorithms on large numbers of examples. The data smashing principle removes the reliance on expert-defined features or examples, and in many cases does so faster and with better accuracy than traditional methods.
The term "data smashing" might conjure up images of erasing information or destroying hard drives. But just as smashing atoms can reveal their composition, "colliding" quantitative data streams can reveal their hidden structure.
We describe here a new principle, where quantitative data streams have corresponding anti-streams, which, in spite of being non-unique, are tied to the stream's unique statistical structure. We then describe "data smashing", a process by which streams and anti-streams can be algorithmically collided to reveal differences that are difficult to detect using conventional techniques.
A toy example is given by the four binary data streams (Fig. 1), each with roughly equal frequencies of 0s and 1s. The fourth stream is clearly dissimilar from the other three, and the feature that distinguishes it is easy to pick out. But can we quantify this notion of similarity *without* picking any a priori features, or specifying the analysis "depth"? Use the "Example A" option to run this dataset in the web-tool below.
A less trivial example is given in Fig. 2, which shows recorded EEG signals: the first and second streams are from the same subject, while the third stream is from a second subject. Can you tell what the appropriate discriminative feature is? You don't have to! Use the "Example B" option to run this dataset in the web-tool below.
Try with your own data in your browser!
Choose data files with space-separated entries
Specify the quantization
Compute the smashing matrix
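The first two steps above (reading space-separated entries and quantizing them) can be sketched offline as follows. This is only an illustration under stated assumptions: the page does not say which quantization scheme the web-tool applies, so the threshold-based binning here, the median cut, and the helper names `parse_stream`, `quantize`, and `median_threshold` are all hypothetical.

```python
from bisect import bisect_right


def parse_stream(text):
    """Parse whitespace-separated numeric entries into a list of floats."""
    return [float(tok) for tok in text.split()]


def quantize(values, thresholds):
    """Map each value to a symbol index.

    The symbol is the number of thresholds that are <= the value, so
    k thresholds yield an alphabet of k + 1 symbols (0 .. k).
    This binning rule is an assumption, not the web-tool's documented scheme.
    """
    cuts = sorted(thresholds)
    return [bisect_right(cuts, v) for v in values]


def median_threshold(values):
    """Return the median, a natural single cut for a binary alphabet."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


# Made-up sample entries, standing in for the contents of a data file.
raw = "0.12 -0.40 0.95 0.03 -1.20 0.55"
values = parse_stream(raw)
symbols = quantize(values, [median_threshold(values)])
print(symbols)  # -> [1, 0, 1, 0, 0, 1]
```

Cutting at the median gives a binary stream with roughly equal frequencies of 0s and 1s, like the streams in the toy example; passing more thresholds produces a larger alphabet.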