76: Indexicals are hard

Major improvements are coming to Siri in the 2016 editions of Apple's operating systems. This is the steady march of progress; in its five-year existence, Siri has consistently gained new abilities. From the user's perspective, those abilities fall into two categories: speaking new languages and using new APIs. (Of course, the addition of public APIs is the big step forward in iOS 10.)

Yet there is valid criticism that Siri hasn't made five years' worth of progress. In particular, it's easy to find areas where Siri is poor at the fundamentals, like taking the spoken word and constructing meaning from it. Take, for example, the seemingly simple Siri command "tell my wife that I love her." You can probably already guess how this will go wrong. Siri unthinkingly drafts the message with the text "I love her".

Fellow linguist Michael Erlewine summarized this sad state of affairs as "indexicals are hard". There are entire graduate-level linguistics seminars devoted to this sort of problem (I've taken some). The fact is, Siri should do better. I'll prove it to you. As an English speaker, you already know how to use indexicals; in just a couple of minutes, you'll understand how they work, too.

Indexicals are words that don't mean much without an additional frame of reference. "I" and "you" could be anyone if you don't know who's speaking or being spoken to. In a simple conversation with Siri, the context is completely predictable: when Siri hears "I", it means the iPhone user.
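To make that concrete, here's a toy Swift sketch of that fixed context. The types and names are mine, not anything from Apple's actual Siri internals; the point is only that once the speaker is known, first-person words stop being ambiguous.

```swift
// Toy sketch (not Apple's implementation): in a dictated Siri command,
// the speech situation is fixed, so first-person indexicals always
// resolve to the device owner.
struct SpeechContext {
    let speaker: String  // whoever is talking to Siri
}

/// Resolves a first-person indexical against the known context.
/// Returns nil for anything this toy resolver doesn't handle.
func resolve(_ word: String, in context: SpeechContext) -> String? {
    switch word.lowercased() {
    case "i", "me", "my", "mine":
        return context.speaker
    default:
        return nil
    }
}

let context = SpeechContext(speaker: "the iPhone user")
let referent = resolve("I", in: context)  // "the iPhone user"
```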

The misguided text message example adds complexity: there's additional context earlier in the sentence that identifies who "her" refers to. As English speakers we know that it matches up with "my wife", and that "my" is the possessive of "I" (which is predictable!), so the whole phrase should be interpreted as "the iPhone user's wife". That's information Siri knows exactly what to do with, as long as your contact cards are organized properly. Siri should also know that a pronoun referring to the recipient of a new message should always become "you".
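Here's the same kind of sketch for the message example, again with made-up types and a hypothetical contact ("Jane Appleseed") standing in for whatever contact-card lookup Siri really uses: resolve "my wife" from the user's contacts, then shift any pronoun whose antecedent is the recipient into second person.

```swift
// Rough sketch (hypothetical types, not Siri's real pipeline) of the two
// steps described above: find the contact that "my wife" names, then
// rewrite recipient-referring pronouns for the new point of view.
struct Contact {
    let name: String
    let relationToOwner: String?  // e.g. "wife", read off the contact card
}

/// Finds the contact whose card matches a possessive relation like "my wife".
func contact(forRelation relation: String, in contacts: [Contact]) -> Contact? {
    return contacts.first { $0.relationToOwner == relation }
}

/// From the recipient's point of view, a pronoun whose antecedent is the
/// recipient must shift to second person: "her"/"him" becomes "you".
func shiftPronoun(_ word: String, antecedentIsRecipient: Bool) -> String {
    guard antecedentIsRecipient else { return word }
    switch word.lowercased() {
    case "her", "him", "them": return "you"
    case "hers", "his", "theirs": return "yours"
    default: return word
    }
}

let contacts = [Contact(name: "Jane Appleseed", relationToOwner: "wife")]
let recipient = contact(forRelation: "wife", in: contacts)  // Jane's card
let body = ["I", "love", "her"]
    .map { shiftPronoun($0, antecedentIsRecipient: true) }
    .joined(separator: " ")  // "I love you"
```

The shift in point of view is the whole trick: the pronoun is third person in the command to Siri but second person in the drafted message, because the addressee changes from Siri to the wife.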

Yes, it's tricky, with a couple of frame shifts, but this is a solved natural language processing problem. The explanations of the new public Siri APIs indicate to me that Siri isn't even attempting to construct meanings based on the grammars of the languages it speaks. Instead, it goes heuristically from audio to text to "intent".

That's absurd for an organization with the massive NLP talent Apple has. Implementing a basic solution to this problem is the stuff of undergraduate NLP term projects. In other words, if you're Apple, literally a summer intern could do it, at least passably. Perhaps passable isn't good enough, and they're holding out for perfection, but that's a losing game; natural language is uniquely human, and humans are never perfect. In the meantime, the best way to register our intents isn't through a speech interface, but through the direct interface between our brains and our thumbs.