The Followup: Back to Work e43

blog note: this is the first installment in what i plan to be a regular feature that i'm tentatively calling The Followup. the theme is simple: i listen to podcasts, and as i'm listening i tend to have lots of thoughts. a lot of times i wish i could get right into the conversation. i don't have the luxury (or the desire) to listen to podcasts live as they're being recorded and give feedback that way, and half the benefit of the format is the ability to time-shift. hence, i have to write things up after the fact, or let them pass forgotten. even though i time-shift, i'll try to keep these fresh…at least to the extent that i can keep my listening backlog from getting too long. this post is based on episode 43 of the fantastic Back to Work podcast by Merlin Mann and Dan Benjamin, which was recorded two weeks ago.

on with the show.

discussing questions

at right around the 15:00 mark of the episode, Merlin said this and my ears perked up:

i want to talk about the potential role of questions in mostly helping you figure out what the next question should be. and that's not to say that a question cannot have an answer, but i'm interested in the idea of questions as part of an iterative process.

i got excited because, even though Merlin meant this in a very real-world, productivity-focused way, it sounds a lot like some key concepts in the areas of linguistics i've been doing research in lately. in particular, i've been working with and adding to a framework that formalizes discourse, developed by the Roberts / Tonhauser / Simons / Beaver research group. in plainer terms, i deal with a logic or way of computing how the pieces of a conversation have to line up in order to make sense or be useful.

stack 'em up

one important piece of the theory is that every conversation has a Question Under Discussion, or in some versions, a stack of them. you can think of this stack either in the plain physical sense, like a stack of cards, or in the computer science sense. either way, there's always one question that's on top. the goal of whoever is participating in the conversation is to answer that question. that might be an easy job, or it may be a very difficult one.
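just to make the stack idea concrete, here's a tiny toy sketch of a Question Under Discussion stack in Python. this is purely my own illustration, not code from the actual framework; the names and examples are made up.

```python
# a toy Question Under Discussion stack. purely illustrative; the real
# framework defines this formally, not as Python code.

class QUDStack:
    def __init__(self):
        self._questions = []            # last element is the question "on top"

    def push(self, question):
        """raise a new question; it becomes the one the conversation must address."""
        self._questions.append(question)

    def top(self):
        """the question currently under discussion (or None if the stack is empty)."""
        return self._questions[-1] if self._questions else None

    def pop(self):
        """retire the top question once it has been completely answered."""
        return self._questions.pop()

qud = QUDStack()
qud.push("where are my keys?")
qud.push("are they on the desk?")       # a sub-question goes on top
print(qud.top())                         # -> are they on the desk?
```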

kinds of answers

Merlin actually has a really good intuitive grasp of which sorts of questions have simple answers and which don't. semanticists typically say that the meaning of a question is the set of all its possible answers. one example he gave was "do you want lemonade with dinner?" – for this yes/no question, there are just two answers: {(yes) i want lemonade with dinner, (no) i don't want lemonade with dinner}. if you're having trouble picking one of those, it means you're just indecisive, not that there's anything faulty with the question.

on the other hand, Wh-questions (so called because they are introduced by words that, in English, mostly start with the letters wh) have lots and lots of answers. "where are my keys?" is represented by the set of every statement of the form "my keys are [in x place]". when we discuss these answers, even in formal logic, we tend to make some sort of reasonable or sensible partition. {my keys are on the desk, …on the nightstand, …under a pile of papers, …at work, etc.} are the options we present, not {my keys are on Olympus Mons, my keys are at the bottom of the ocean, etc.} nor {my keys are 1 inch from the north edge of my desk, my keys are 1.1 inches from the north edge of my desk, …1.2 inches…, …1.3 inches…, etc.}, even though these are all possibilities.
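to make the set-of-answers idea concrete, here's how those two examples look as plain data. this is a toy encoding of my own (strings standing in for propositions), not the formal semantics.

```python
# toy encoding: a question is just the set of its candidate answers.

# the yes/no question has exactly two members.
lemonade_q = {
    "i want lemonade with dinner",
    "i don't want lemonade with dinner",
}

# a sensible partition for "where are my keys?": coarse-grained locations,
# not Olympus Mons and not every tenth of an inch of the desk.
keys_q = {
    "my keys are on the desk",
    "my keys are on the nightstand",
    "my keys are under a pile of papers",
    "my keys are at work",
}

print(len(lemonade_q), "candidate answers vs.", len(keys_q), "(and that's after partitioning)")
```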

once we have a sensible set of choices, the goal of the conversation is to find the right choice or choices. this could take a single step: if i see the keys on the desk, i can say so, and the problem is resolved. if i'm looking at the desk and don't see them, i can say they're not there and, by doing so, refine the question. these are called complete and partial answers, respectively. in discourse theory, whatever you say next has to give a partial or complete answer to whatever question is on top of the stack.  if you fail to do that, your response is considered irrelevant, and whoever you're talking to will probably look at you funny.
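in the same toy encoding, the complete/partial distinction is just a matter of how far the answer set gets narrowed. again, this is my own sketch, not the formal definitions from the theory.

```python
# continuing the toy encoding: answering narrows the question's answer set,
# either down to a single candidate (complete) or to a smaller set (partial).

keys_q = {
    "my keys are on the desk",
    "my keys are on the nightstand",
    "my keys are under a pile of papers",
    "my keys are at work",
}

def classify(question, compatible):
    """keep only the candidates compatible with what was just said, and label
    the move as a complete answer, a partial answer, or irrelevant."""
    remaining = question & compatible
    if len(remaining) == 1:
        return "complete answer", remaining
    if remaining and remaining < question:
        return "partial answer", remaining
    return "irrelevant (expect funny looks)", remaining

# "i see them on the desk" -- settles the question
print(classify(keys_q, {"my keys are on the desk"}))

# "well, they're not on the desk" -- rules out one candidate, refines the question
print(classify(keys_q, keys_q - {"my keys are on the desk"}))
```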

questioning questions

one of the interesting things about relevance (and one of the major new ideas in a paper i presented at WECOL in November and am drafting for publication now) is that you don't have to assert things — that is, make statements — to be relevant; you can also ask questions or give commands. here's where the linguistics meshes with Merlin's ideas: it's possible to iterate questions to ask better questions, but there are rules on how to do it.  the new question has to be relevant to the previous question. since a question is always defined as a set containing multiple statements, it can never give a complete answer, but it can go a long way towards giving a partial one.

think of the two questions as a Venn diagram. each circle represents all the answers to a particular question, and the overlap is the set of statements that could answer both. that overlap is the new, refined question. it takes a little bit of logical thought to get a good overlap, but that's exactly the process Merlin is after. so there's a lot in common between sound, rational, productive work processes and the way every single conversation works. i'm guessing a lot of people haven't thought of it in those terms, but doing so might help, both in how we get things done and how we talk.
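the Venn picture translates almost literally into set intersection. same toy encoding as above, with a hypothetical follow-up question i made up for the example.

```python
# the overlap of two questions' answer sets is the new, refined question.

where_are_my_keys = {
    "my keys are on the desk",
    "my keys are on the nightstand",
    "my keys are at work",
    "my keys are in the car",
}

# hypothetical follow-up: "what's still here in the house?"
whats_in_the_house = {
    "my keys are on the desk",
    "my keys are on the nightstand",
    "my glasses are on the desk",
}

refined = where_are_my_keys & whats_in_the_house
print(refined)   # effectively "where in the house are my keys?" -- two candidates left
```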

building a better model

a couple weeks ago, news of some exciting new computational neuroscience research out of UC Berkeley was circulating around the internet — or, at least, credulous reactions to breathless articles cribbed from a press release were.  ("scientists can almost read your mind!", "watch your own dreams on YouTube!", "UC Berkeley invents mind-reading machine!", &c.  no, i am not linking these.  they don't deserve it.)

the research

first, i have to give credit where credit is due: the website of the lab that ran the experiment has an extremely good explanation of how it was performed, and exactly what the technology can and cannot do; it's even perfectly understandable by a lay audience, and is longer than the paper itself [university access required].  you really should go read that entire page to gain an appreciation of the project, but here's the tl;dr.

  1. three grad students strapped themselves into an fMRI machine and watched lots of 1-second clips from YouTube videos.
  2. they made a big database of the fMRI output matched to the clips.
  3. they fed new YouTube clips into a ranking algorithm that made several guesses as to which clip in the database was most similar to the new one.

that's it.  that's the real breakthrough: novel video can be scored for similarity against previously seen video, not by comparing actual visual content, but only by comparing a secondary measure that relies purely on the neurophysiological responses of the human brain.  it's a fantastic proof of concept.  but then, to drive the point home, they did the following:

  1. take the best guesses based on the ranking algorithm, look up the videos, overlay them and output a composite video.

this is the thing that freaked everybody out, because taken on their own — never mind what data they're based on — the images are fairly spooky.  good Halloween fodder.  take a look if you haven't seen it already.
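for the curious, the ranking step boils down to something like a nearest-neighbour lookup over brain-response vectors. the sketch below is mine, with made-up numbers and names; the Gallant Lab's actual pipeline fits an encoding model to voxel time-courses and ranks millions of candidate clips, which i'm glossing over entirely.

```python
import numpy as np

# hypothetical database: one response vector per training clip.
# (the real data are voxel time-courses; these tiny vectors are stand-ins.)
database = {
    "clip_adam_savage": np.array([0.9, 0.1, 0.4]),
    "clip_airplane":    np.array([0.2, 0.8, 0.5]),
    "clip_fish":        np.array([0.3, 0.7, 0.6]),
}

def rank_matches(new_response, database, top_n=3):
    """return the training clips whose responses look most like the
    response evoked by a new, unseen clip (cosine similarity)."""
    def cosine(v):
        return float(np.dot(new_response, v) /
                     (np.linalg.norm(new_response) * np.linalg.norm(v)))
    ranked = sorted(database, key=lambda clip: cosine(database[clip]), reverse=True)
    return ranked[:top_n]

new_response = np.array([0.8, 0.2, 0.5])     # fMRI response to a clip not in the database
print(rank_matches(new_response, database))  # best guesses, most similar first
```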

i have several directions that i want to take in analyzing this.  i'll start with the spooky result and proceed to the bigger question: should we bother making these models in the first place?

the opaqueness of transparencies

first, consider the reconstruction method used in this experiment: the algorithm makes several similarity guesses based on fMRI data, and then (with or without weighting, it's unclear) stacks them on top of each other, does some pixel math, and outputs new video.  we would have no idea what those underlying components of the reconstruction were, except that the Gallant Lab kindly provided another video that shows the steps of reconstruction.  let's look closely at two of the clips: Inspector Clouseau and the elephants.
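before digging into those two clips, here's roughly what i take the "pixel math" to be: a weighted average of the best-guess frames. the weighting scheme is an assumption on my part, since it isn't clear from the materials, and the frames below are random stand-ins.

```python
import numpy as np

def composite(frames, weights=None):
    """overlay the best-guess frames (H x W x 3 uint8 arrays) into one blurry
    composite by weighted averaging. a guess at the procedure, not the lab's code."""
    stacked = np.stack([f.astype(float) for f in frames])
    if weights is None:
        weights = np.ones(len(frames))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    blended = np.tensordot(weights, stacked, axes=1)   # weighted sum over the guesses
    return blended.astype(np.uint8)

# three stand-in "frames" of random pixels, weighted by similarity rank
guesses = [np.random.randint(0, 256, (90, 160, 3), dtype=np.uint8) for _ in range(3)]
spooky = composite(guesses, weights=[0.5, 0.3, 0.2])
print(spooky.shape)   # one composite frame, same size as the inputs
```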

in the second clip, which features Steve Martin portraying Inspector Clouseau, we can see just how badly visual similarity falls flat compared to what people actually see.  the algorithm, when using Subject 3's training data, matches this clip to ones depicting Mythbusters' Adam Savage a full 50% of the time.  why?  because he's a dude in a dark shirt standing over there.

Steve Martin ≠ Adam Savage

and this is actually a case of a particularly good match.  the output is completely dependent on the training data, and the sample fortunately contains at least a handful of videos with people in dark clothing standing on the left-hand side of the frame.  how about elephants?  does the training data include any elephants?

any elephants here?

apparently not.  here the best guesses run from fish to airplanes to Tom and Jerry.  the only thing they have in common is an area of contrast in a similar part of the frame.  (i think it's interesting that contrast seems to be the key criterion here; notice how the majority of the matches are light on dark, the opposite of the input.)  this makes the confusion in the Inspector Clouseau case look downright understandable.

clearly the model cannot actually show you what people see, or what exists in "the mind's eye", no matter what pageview-baiting headline writers try to sell you.  people do not look at Steve Martin and mis-see Adam Savage, regardless of whether they can identify them by name.  nobody with any sort of visual acuity mistakes elephants for cartoon mice or vehicles.  this is just not how people's visual systems work.

the visual system is remarkably complex, and understanding it would be a great breakthrough, but does this model get us any closer?  to me, it seems like a flawed approach.  say you wanted to study the technique of master painters, who, given the right time and tools, can paint photorealistic scenes.  you bring them in for a study, show them a photo and say "we want you to recreate this image using your artistic skills".  then, instead of letting them bring their paints, brushes, and canvas, you point them to a file cabinet in the corner of the room that is filled with other photos, printed on plastic transparency sheets.  "stack 'em up, and see how close you can get."  you can repeat this as many times as you like, and you will learn nothing about painting.

the stack hack for language

as a theoretical linguist, i constantly run into the fallacy that you can learn things about complex systems this way.  even worse, i sometimes see the claim that as long as you can make decent-looking transparency collages, it's not interesting or relevant to ask about the master artist's methods.  i'm not trying to put down computational models of language as a whole here, nor am i saying that they are completely worthless; that would be taking the converse argument too far.  it's just incredibly frustrating that theory and practice are so at odds with each other.

one of the first things taught when introducing the concept of syntax in introductory linguistics is that there is underlying structure to phrases and sentences, and that strings of words will never suffice.  stringing together words is the equivalent of making transparency collages instead of trying to replicate paintings.  yet so many computational systems that try to construct sentences (in translation or other applications) do it via ngram models: taking groups of words that co-occur, evaluating their likelihood, and concatenating them.  the parallel with the spooky movie reconstructions should be obvious.
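in case you've never met one, here's the transparency-collage version of sentence construction in a dozen lines: a toy bigram model that chains words together purely by co-occurrence, with no structure anywhere in sight. the corpus is two sentences i made up.

```python
import random
from collections import defaultdict

# toy bigram model: record which word follows which, then babble by
# chaining sampled continuations. concatenation, not syntax.
corpus = "the keys are on the desk . the keys are at work .".split()

following = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1].append(w2)

def babble(start, length=8):
    words = [start]
    for _ in range(length):
        nxt = following.get(words[-1])
        if not nxt:
            break
        words.append(random.choice(nxt))
    return " ".join(words)

print(babble("the"))   # locally plausible word strings, with no underlying structure
```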

of course, just as the neuroscientists can't peer into the exact workings of the brain, neither can linguists.  that is, none of us could run a painter experiment in which we set up multiple cameras, record the painter bringing in their tools and setting up their workspace, and thereby create a log of each brushstroke in order to build a perfect reconstruction of the steps involved in crafting the final product.  but we can do the equivalent of sending the painter into the room, saying "take your time, make a good painting, we're not watching", and then submitting the result to careful scrutiny.  ah, they always establish the base layers of paint in this order.  ah, for this effect, the brushstrokes always go in this direction.  from that, a theory of painting, if not an exact manual, could be reconstructed.  then, if we wanted to create a computer program to mimic the process, it would be built on the conclusions drawn from these observations.  it would deal in layers and brushstrokes, not hack up an approximation with photos.  theoretical linguists attempt the equivalent of this by looking at lots and lots of sentences, instead of lots and lots of paintings.

walking the walk; not talking the talk

as i pointed out in my Ignite Ithaca talk on writing, humans excel over their primate cousins most markedly in two areas: the ability to talk and the ability to walk on two feet.  the disparity in the prevailing philosophy on how best to model these two skills couldn't be greater.  how so?  people love walking robots.  Honda's ASIMO was hailed as a massive achievement, not just for being adorable, but for being the first robot able to walk on two feet like a human.  projects stemming from it have continued to improve on its abilities, making it more humanlike at each step: able to walk up stairs, able to walk backwards, able to bear more weight.  at Cornell, a robotics team set an unofficial world record for the longest unaided walk by a robot, around the school's indoor track.  (i would go to run laps and see the crew with their remote controls, plodding along next to this gangly computer on legs with a baseball cap stuck on top.)  at my alma mater, the University of Michigan, they're setting records for the fastest bipedal running robot.

there is a fascination with how we walk and run, and with making machines that move the same way we do.  it is not the absolute most efficient way to get around; NASA doesn't send bipedal robots to Mars, it sends highly articulated six-wheelers that can navigate almost any rock put in their path without falling over or getting stuck.  if all we wanted to do was build machines to get from point A to point B, putting legs on them would be foolish.  but that's not the point.  we want to understand what it is like to be human, and to build machines that are approximations of us.

i just hope that the neuroscientists, the computational linguists, and everyone else who's trying to emulate humanity with software instead of hardware don't lose sight of this.  be inquisitive.  ask exactly who we are and what we do, not how to hack it to 75% or 90% or even 99% accuracy.  figuring out the exact ways in which the world works has been one of the pillars of scientific inquiry since the Renaissance.  by giving that up, science risks confining the master painter to sitting in a room, staring at an ugly collage that he's not very happy with.