building a better model

a couple weeks ago, news of some exciting new computational neuroscience research out of UC Berkeley was circulating around the internet — or, at least, credulous reactions to breathless articles cribbed from a press release were.  ("scientists can almost read your mind!", "watch your own dreams on YouTube!", "UC Berkeley invents mind-reading machine!", &c.  no, i am not linking these.  they don't deserve it.)

the research

first, i have to give credit where credit is due: the website of the lab that ran the experiment has an extremely good explanation of how it was performed and exactly what the technology can and cannot do; it's perfectly understandable to a lay audience, and it's longer than the paper itself [university access required].  you really should go read that entire page to gain an appreciation of the project, but here's the tl;dr.

  1. three grad students strapped themselves into an fMRI machine and watched lots of 1-second clips from YouTube videos.
  2. they made a big database of the fMRI output matched to the clips.
  3. they fed new YouTube clips into a ranking algorithm that made several guesses as to which clip in the database was most similar to the new one.
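at heart, steps 2 and 3 amount to a nearest-neighbor lookup.  here's a minimal sketch of the idea; the array shapes and the plain correlation metric are my guesses for illustration, not the lab's actual encoding model:

```python
import numpy as np

def rank_matches(new_response, db_responses):
    """rank stored clips by how well each clip's recorded fMRI response
    correlates with the response evoked by the new clip.  plain Pearson
    correlation here is an illustrative stand-in for whatever similarity
    measure the real model uses."""
    sims = np.array([np.corrcoef(new_response, r)[0, 1] for r in db_responses])
    return np.argsort(sims)[::-1]  # indices of best matches first
```

note that nothing in this lookup knows anything about images; it only compares brain responses.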

that's it.  that's the real breakthrough: novel video can be assigned a similarity score against previous video, not by comparing actual visual content, but by comparing a secondary measure that relies purely on the neurophysiological response of the human brain.  it's a fantastic proof of concept.  but then, to drive the point home, they did the following:

  1. take the best guesses from the ranking algorithm, look up the corresponding videos, overlay them, and output a composite video.

this is the thing that freaked everybody out, because taken on its own — forget what data it was based on — the images are fairly spooky.  good Halloween fodder.  take a look if you haven't seen it already.

i have several directions that i want to take in analyzing this.  i'll start with the spooky result and proceed to the bigger question: should we bother making these models in the first place?

the opaqueness of transparencies

first, consider the reconstruction method used in this experiment: the algorithm makes several similarity guesses based on fMRI data, then (with or without weighting; it's unclear) stacks them on top of each other, does some pixel math, and outputs new video.  we would have no idea what the underlying components of the reconstruction were, except that the Gallant Lab kindly provided another video showing the steps of the reconstruction.  let's look closely at two of the clips: Inspector Clouseau and the elephants.
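as i understand it, that "pixel math" boils down to averaging the top-ranked clips pixel by pixel.  a rough sketch with made-up data, using uniform weights since the actual weighting scheme is unclear:

```python
import numpy as np

# made-up stand-ins: 20 database clips, each reduced to a 4x4 grayscale frame
clips = np.random.default_rng(1).random((20, 4, 4))
best_guesses = [3, 7, 12, 0, 9]  # invented output of the ranking step

# stack the matched clips and average pixel-wise; uniform weights here,
# since it's unclear whether the actual reconstruction weights its guesses
reconstruction = clips[best_guesses].mean(axis=0)
```

the output is only ever a blur of whatever happens to be in the database, which is exactly why the choice of training clips matters so much below.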

in the second clip, which shows Steve Martin portraying Inspector Clouseau, we can see just how far visual similarity falls short of what people actually see.  the algorithm, when using Subject 3's training data, matches this clip to ones depicting Mythbusters' Adam Savage a full 50% of the time.  why?  because he's a dude in a dark shirt standing over there.

Steve Martin ≠ Adam Savage


and this is actually a case of a particularly good match.  the output is completely dependent on the training data, and the sample fortunately contains at least a handful of videos with people in dark clothing standing in the left-hand side of the frame.  how about elephants?  does the training data include any elephants?

any elephants here?


apparently not.  here the best guesses run from fish to airplanes to Tom and Jerry.  the only thing they have in common is an area of contrast in a similar part of the frame.  (i think it's interesting that contrast seems to be the key criterion here; notice how the majority of the matches are light on dark, the opposite of the input.)  this makes the confusion in the Inspector Clouseau case look downright understandable.

clearly the model cannot actually show you what people see, or what exists in "the mind's eye", no matter what pageview-baiting headline writers try to sell you.  people do not look at Steve Martin and mis-see Adam Savage, regardless of whether they can identify them by name.  nobody with any sort of visual acuity mistakes elephants for cartoon mice or vehicles.  this is just not how people's visual systems work.

the visual system is remarkably complex, and understanding it would be a great breakthrough, but does this model get us any closer?  to me, it seems like a flawed approach.  say you wanted to study the technique of master painters, who, given the right time and tools, can paint photorealistic scenes.  you bring them in for a study, show them a photo and say "we want you to recreate this image using your artistic skills".  then, instead of letting them bring their paints, brushes, and canvas, you point them to a file cabinet in the corner of the room that is filled with other photos, printed on plastic transparency sheets.  "stack 'em up, and see how close you can get."  you can repeat this as many times as you like, and you will learn nothing about painting.

the stack hack for language

as a theoretical linguist, i constantly see the fallacy that you can learn things about complex systems this way.  even worse, i sometimes see the claim that as long as you can make decent-looking transparency collages, it's not interesting or relevant to ask about the master artist's methods.  i'm not trying to put down computational models of language as a whole here, nor am i saying that they are completely worthless; that would be taking the converse argument too far.  it's just incredibly frustrating that theory and practice are so at odds with each other.

one of the first things taught when introducing syntax in introductory linguistics is that phrases and sentences have underlying structure, and that strings of words will never suffice.  stringing together words is the equivalent of making transparency collages instead of trying to replicate paintings.  yet so many computational systems that construct sentences (in translation or other applications) do it via n-gram models: taking groups of words that co-occur, evaluating their likelihood, and concatenating them.  the parallel with the spooky movie reconstructions should be obvious.
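to make the string-stitching concrete, here's a toy bigram model that always picks the most frequent next word given only the previous one; the tiny corpus is invented, and a real system would train on millions of sentences, but the principle is the same: concatenation with no notion of phrase structure anywhere.

```python
from collections import Counter, defaultdict

# invented toy corpus; periods are treated as just another token
corpus = "the dog chased the cat . the cat chased the mouse . the mouse ran ."
words = corpus.split()

# count bigram frequencies: how often each word follows each other word
bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev):
    """greedily pick the most frequent follower of a single word --
    pure surface co-occurrence, no syntax."""
    return bigrams[prev].most_common(1)[0][0]

# chain most-likely continuations into a "sentence"
w, out = "the", ["the"]
for _ in range(5):
    w = next_word(w)
    out.append(w)
print(" ".join(out))
```

the output is locally plausible word by word and structurally empty overall, which is the whole complaint.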

of course, just as the neuroscientists can't peer into the exact workings of the brain, neither can linguists.  that is, none of us can run the painter experiment where we set up multiple cameras, record the painters bringing in their tools and arranging their workspace, and thereby log every brushstroke to build a perfect reconstruction of the steps behind the final product.  but we can do the equivalent of sending the painter into the room, saying "take your time, make a good painting, we're not watching", and then submitting the result to careful scrutiny.  ah, they always establish base layers of paint in this order.  ah, for this effect, the brushstrokes always go in this direction.  from that, a theory of painting, if not an exact manual, could be reconstructed.  then, if we wanted to create a computer program to mimic the process, it would build on the conclusions drawn from these observations: it would deal in layers and brushstrokes, not hack together an approximation with photos.  theoretical linguists attempt the equivalent by looking at lots and lots of sentences instead of lots and lots of paintings.

walking the walk, not talking the talk

as i pointed out in my Ignite Ithaca talk on writing, humans excel over their primate cousins most markedly in two areas: the ability to talk and the ability to walk on two feet.  the disparity between the prevailing philosophies on how best to model these two skills couldn't be greater.  consider walking: people love walking robots.  Honda's ASIMO was hailed as a massive achievement, not just for being adorable, but for being the first robot able to walk on two feet like a human.  projects stemming from it have continued to improve on its abilities, making it more humanlike at each step: able to walk up stairs, able to walk backwards, able to bear more weight.  at Cornell, a robotics team set an unofficial world record for the longest unaided walk by a robot, around the school's indoor track.  (i would go to run laps and see the crew with their remote controls, plodding along next to this gangly computer on legs with a baseball cap stuck on top.)  at my alma mater, the University of Michigan, they're setting records for the fastest bipedal running robot:

there is a fascination with how we walk and run, and with making machines move the same way we do.  it is not the most efficient way to get around; NASA doesn't send bipedal robots to Mars, it sends highly articulated six-wheelers that can navigate almost any rock in their path without falling over or getting stuck.  if all we want is machines that get from point A to point B, putting legs on them is foolish.  but that's not the point.  we want to understand what it is like to be human, and to build machines that are approximations of us.

i just hope that the neuroscientists, the computational linguists, and everyone else trying to emulate humanity with software instead of hardware don't lose sight of this.  be inquisitive.  ask exactly who we are and what we do, not how to hack it to 75% or 90% or even 99% accuracy.  figuring out exactly how the world works has been one of the pillars of scientific inquiry since the Renaissance.  by giving that up, science risks confining the master painter to a room, staring at an ugly collage he's not very happy with.