27 April 2006

Future shock

We look to advanced technologies to solve our problems, but what happens when the solution we are offered is genuinely cutting edge?

Talk to a researcher these days and they will tell you that the rate-limiting step in research is no longer the gathering of data, but rather its interpretation. Gathering information is highly automated, but analyzing it remains a largely manual process.

This problem is felt acutely in industrial research. I have some experience with two industrial research fields confronting this problem: pharmaceutical research in toxicology and financial risk management.

Both industries claim to be overwhelmed by the volumes of data that they must analyze and interpret. This has triggered interest in quantitative methods for analyzing large, complicated (high-dimensional) datasets.

These fields produce experimental results that are simply too large to hold in your head. Methods such as Self-Organising Maps, principal component analysis (PCA) and supervised machine learning algorithms are ideal for analyzing such data. They have been around for years, but fast, inexpensive computers and user-friendly interfaces have now put them on everyone's desktop.

PCA can be used to create a view of complex (high-dimensional) data based on just a few dimensions. Supervised machine learning can fish out patterns that live in high-dimensional spaces, patterns whose complexity our minds cannot grasp.
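To make that concrete, here is a minimal sketch of a PCA projection in Python. The data are synthetic and the dimensions invented purely for illustration; this is the general technique, not any particular industrial pipeline:

```python
import numpy as np

# Synthetic stand-in for a large experimental dataset:
# 500 samples, each measured on 100 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Center each variable, then take the singular value
# decomposition of the centered data matrix.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the 100-dimensional data onto the first two
# principal components: a view we can actually plot.
scores = Xc @ Vt[:2].T            # shape (500, 2)

# Fraction of the total variance each component captures.
explained = s**2 / np.sum(s**2)
print(scores.shape, explained[:2])
```

The two columns of scores are the kind of low-dimensional view you can plot and inspect, even though the original data had a hundred variables.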

So you might expect that industry would jump at the chance to use these methods. Well, not so fast.

Before I describe my experience of industry's reaction to such methods, it's worth looking at a bit of background on the life of a company toxicologist or financial risk manager.

Toxicologists and financial risk managers don't have it easy. Suppose that a medicine produces a serious, unexpected side effect. Every drug is carefully tested during development to reduce the chance of this happening. The buck stops with the toxicologist who performed those tests.

In finance there is a similar problem. Fund managers invest money using pre-agreed strategies. The strategies have been assessed in terms of their profitability and their risk. Financial risk managers must answer for financial losses caused by events that were not anticipated in their risk estimates.

Within this context, it should come as no surprise that these industries do not exactly leap at the chance to use PCA and machine learning algorithms. I would go so far as to say that there can be a disconnect between the eager mathematical physicist pitching a statistical method and the industrial practitioners who are their target audience.

I suspect that the problem goes deeper than mere suspicion of a new and relatively untested approach. Arguably, it lies with the very nature of the methods themselves.

Consider this. Most data analysis begins with a hunch about what the data will ultimately show. This hunch might be a correlation between two known variables, or perhaps a simple pattern of results across two or three experimental conditions. We can create a picture of these results in our head.
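For contrast, here is the hunch-driven kind of analysis in miniature: two nameable quantities in, one number out. The variable names and figures below are invented for illustration:

```python
import numpy as np

# A hunch you can picture before you compute anything:
# does a higher dose mean higher liver enzyme levels?
dose = np.array([1.0, 2.0, 5.0, 10.0, 20.0])          # hypothetical doses
enzyme_level = np.array([0.8, 1.1, 2.9, 6.2, 11.5])   # hypothetical readouts

# One named relationship, one number as the answer.
r = np.corrcoef(dose, enzyme_level)[0, 1]
print(f"correlation(dose, enzyme level) = {r:.2f}")
```

The result is a story you can tell in a sentence: dose up, enzyme level up.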

PCA and machine learning algorithms don't work this way. They produce projections from high-dimensional spaces onto low-dimensional ones.

It is beyond our minds' capacity to visualize the exact combination of the original variables that is captured by a principal component. The low-dimensional projection is delivered to us without a name. The same goes for patterns derived by machine learning.
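To see why, look at the recipe behind a principal component. Continuing the synthetic sketch from earlier (invented data again), the component is nothing but a long vector of weights over the original variables:

```python
import numpy as np

# Same synthetic setup as the PCA sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# The first principal component is just a weight
# vector over all 100 original variables.
loading = Vt[0]

# Show the five variables with the largest absolute weights.
top = np.argsort(np.abs(loading))[::-1][:5]
for i in top:
    print(f"variable {i:3d}: weight {loading[i]:+.3f}")
# No single variable dominates and no obvious label emerges:
# the component is a blend without a name.
```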

Without names, these results don't tell a story.

Skillful interpretation of a PCA can yield a story of sorts. But the power of these methods is that they create an unbiased view of the data, one that doesn't need to adhere to a pre-existing story.

I don't know about you, but I find this distinction rather deep. And when you compare these methods with more familiar analytical approaches, ones built on hunches, stories and easy intuitions about the pattern of results, it's not surprising that my friends in industry tend to shy away from them.

But analytically, this unbiased view is a big, big plus. These are methods that can reveal unexpected patterns in results, and they scale well with the size of the dataset.

That is what excites the mathematical physicists.

Collaboration between Biopolis and RIKEN

My eyes are on Biopolis, Singapore's biosciences research initiative occupying a futuristic campus next to the National University of Singapore. Championed by Philip Yeo, Chairman of Singapore's Agency for Science, Technology and Research (A*STAR), Biopolis is still at a relatively early stage, but it already boasts a crop of shiny buildings connected by soaring above-ground walkways.

Biopolis has been making impressive connections lately, and actively working its way into the public imagination. Recent news describes a collaborative agreement with Japan's RIKEN (the name translates as "Institute of Physical and Chemical Research", though these days it does plenty of top biological research).

Biopolis goes into the RIKEN collaboration with an interest in expanding its biomedical research focus beyond infectious disease into cancer drug development. The collaboration will focus on the exchange of ideas, as well as training programs that will send Singapore's burgeoning supply of enthusiastic trainee scientists to Japan.

