You are just a machine...
Just a machine!?
That's like saying you are just an ape.
What can we do with data? At what point, and how, does data become information, and information become knowledge? Can purely algorithmic analysis of data lead to understanding in any human sense, or are machines fundamentally limited in what they can learn? Is it possible for unsupervised machines to duplicate human intuition - can machines go beyond simple predictions and superficial correlations, and distill new scientific insight from data? In other words, can we automate the scientific method?
These are the answers I seek.
Answers to these questions are crucial to the societal role of technology in shaping the future - is the hype about "Big Data" indeed going to pan out? Perhaps more importantly, it is increasingly obvious that our ability - or rather inability - to imbue machines with human-like reasoning will dictate fundamental limits on scientific knowledge. Almost every field of scientific inquiry has seen an explosion of data; astronomers have amassed more photometry and high-resolution images in the last six months than they had in the previous 600-year history of the field. Typical data volumes are now measured in tens of terabytes, with petascale astronomical surveys imminent and even larger surveys under development, e.g., the Large Synoptic Survey Telescope (LSST), with estimated data rates of ∼30 TB per night. Modern physics experiments such as the Large Hadron Collider produce ∼25 petabytes of particle-collision data per year. With the advent of single-cell methodologies and routine high-throughput experiments, biology faces a similar data deluge. Processing, analyzing and exploring such massive data streams, and detecting and characterizing interesting and novel phenomena within them, is highly non-trivial. Automating this process is crucial, since data rates and volumes are already overwhelming human scientists. We can no longer depend on human serendipity to make scientific breakthroughs; future progress will be throttled, and much of the data collection rendered effectively useless, unless we automate reasoning in a meaningful way.
What exactly is new here? The idea of automating computation is what "computer science" is about. Indeed, automation is all around us - there is "an app" for everything. However, such automation is generally viewed as a tool, designed to carry out a calculation, produce an answer, and stop. Distinct from this narrow scope is the idea that we can "close the loop": design machines that examine the results of their actions, decide what to try next based on those results, and potentially cycle forever. Some work in this direction has already been reported: King et al. describe a robotic system for running biological experiments, evaluating their results, and deciding what experiments to carry out next. The importance of King's work lies in showing that a machine can meaningfully contribute to the distillation of scientific knowledge.
However, King's approach did require setting up the system with significant a priori knowledge: in particular, a comprehensive logical model encoding knowledge of S. cerevisiae metabolism, a general bioinformatic database of genes and proteins involved in metabolism, and so on. What can we do without such a priori knowledge? Can we reverse-engineer the data at hand: find the hidden physical laws or dynamical relationships that produce the data, particularly in the absence of human intuition, a priori system knowledge, or domain expertise? My recent paper in the Proceedings of the National Academy of Sciences illustrates a positive result in this direction: I showed that it is possible to learn reaction structures de novo from intermittent and sparse observations in a chemical system. The algorithm outlined in that paper is equally applicable to interacting populations of organisms in an ecosystem, or more broadly to the genetic circuitry governing cellular dynamics. Algorithms for forward simulation of a given model were known before, as were methods for tuning parameters given a data set and a specific model family; what is new here is the idea that we can make do without such supervision or a priori knowledge: we can not only identify model parameters, but arrive at the model structure itself de novo.
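The flavor of de novo structure identification can be conveyed with a deliberately simple sketch. The code below is not the algorithm from the PNAS paper; it is a toy of my own construction in which a one-dimensional dynamical law is treated as unknown, and its structural form is recovered from trajectory data alone by fitting estimated derivatives against a library of candidate terms and pruning the coefficients that vanish. The specific law, the candidate library, and all function names are illustrative assumptions.

```python
# Toy sketch (not the PNAS algorithm): recover the structural form of an
# unknown dynamical law from trajectory data alone.  We simulate
# dx/dt = -0.5*x + 1.0, pretend the law is unknown, least-squares fit
# finite-difference derivatives against the candidate library {1, x, x^2},
# and prune near-zero coefficients to expose the model structure.

def simulate(x0, dt, steps):
    """Euler-integrate the 'hidden' law dx/dt = -0.5*x + 1.0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (-0.5 * xs[-1] + 1.0))
    return xs

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def infer_structure(xs, dt, threshold=1e-3):
    """Fit finite-difference derivatives against the candidate library
    [1, x, x^2] via the normal equations; prune small coefficients."""
    rows = [[1.0, x, x * x] for x in xs[:-1]]
    dxdt = [(b - a) / dt for a, b in zip(xs, xs[1:])]
    G = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(rows, dxdt)) for i in range(3)]
    beta = solve3(G, rhs)
    return [c if abs(c) > threshold else 0.0 for c in beta]

xs = simulate(x0=5.0, dt=0.01, steps=500)
coeffs = infer_structure(xs, dt=0.01)
print(coeffs)  # recovers approximately [1.0, -0.5, 0.0], i.e. dx/dt = 1.0 - 0.5*x
```

The point of the toy is only that the surviving nonzero coefficients constitute a discovered model *structure*, not merely tuned parameters of a model handed to the machine in advance; the actual problem addressed in the paper - intermittent, sparse observations of a reaction network - is far harder than this idealized setting.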
More generally, given a data stream, we can ask whether there exist algorithms that produce succinct descriptions, i.e., models, that are "correct" generative representations of the data. Such a task is clearly impossible without restriction: the more important question, therefore, is characterizing the constraints under which the hidden structure in data is indeed learnable. In a recent paper, "Abductive learning of quantized stochastic processes with probabilistic finite automata", I showed that symbolic streams of sufficient length can indeed be algorithmically compressed to yield a certain class of probabilistic graphical models. The unsupervised learning algorithm (GenESeSS) presented in this paper is shown to infer the causal structure of quantized stochastic processes, under the constraint that such processes satisfy the conditions of ergodicity and stationarity. I believe this is a small step in the direction of the rather grand vision I painted above, in that these learning algorithms work strictly in the absence of human intervention and require no knowledge of how and where the input data is generated. (Note: refer to the full paper for an accurate picture of its contribution beyond the pre-existing state of the art.)
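To make the notion of "compressing a symbolic stream into a model" concrete, here is a toy sketch - emphatically not GenESeSS, whose actual construction is in the paper - of the simplest possible instance: a binary stream is generated by a hidden two-state probabilistic automaton, and the learner, given only the stream, recovers the symbol-conditioned transition probabilities by counting. The generator, the names, and the order-1 restriction are all illustrative assumptions.

```python
import random

# Hidden generator: a two-state probabilistic automaton over {0, 1}.
# State A emits 1 with probability 0.8; state B emits 1 with probability
# 0.2; the emitted symbol determines the next state (1 -> A, 0 -> B),
# so the process is order-1 Markov in its own output.
def generate(n, seed=7):
    random.seed(seed)
    state, out = 'A', []
    for _ in range(n):
        p1 = 0.8 if state == 'A' else 0.2
        s = 1 if random.random() < p1 else 0
        out.append(s)
        state = 'A' if s == 1 else 'B'
    return out

def infer_automaton(stream):
    """Estimate P(next symbol | current symbol) by conditional counts:
    an order-1 approximation of the hidden automaton's structure."""
    counts = {0: [0, 0], 1: [0, 0]}
    for prev, nxt in zip(stream, stream[1:]):
        counts[prev][nxt] += 1
    return {c: [k / sum(n) for k in n] for c, n in counts.items()}

model = infer_automaton(generate(200_000))
print(model)  # P(1 | just saw 1) near 0.8; P(1 | just saw 0) near 0.2
```

The estimated automaton is a succinct generative representation of a 200,000-symbol stream, and nothing about the generator was told to the learner. The real difficulty, which this toy sidesteps entirely, is that the causal states of a general quantized stochastic process are not simply the last observed symbol, and discovering them is where the actual algorithmic content lies.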
However, there is much to do. While my area of work falls squarely within "machine learning", I do not subscribe to the often-touted idea that machines may predict and correlate, but never understand. I believe that investigating the limits of unsupervised learning will lead us not only to creating intelligence indistinguishable from our own, but to the realization that the ability to recreate such intelligence brings with it an understanding of the true nature of conscious thought. Machine learning is therefore as much about machines learning about the world as it is about us learning about ourselves.
- Ishanu Chattopadhyay