67: The scale of a Siri audio API

So many new technologies and features were unveiled at WWDC that it seems absurd to complain about what was lacking. But Apple deliberately left one list of features short: the "intents" available to the SiriKit API. A major missing member of that list is the ability for Siri to control audio in third-party applications.

I think it's hasty to say "This is clearly a top potential use case! There's no reason it shouldn't have been present at launch! Why does hailing a car rank ahead of this?" I sincerely doubt that it comes down to spite, the idea that Apple competes with Spotify but not with Uber. (And they arguably compete with Uber too; recall that they recently made a significant investment in a Chinese ride-sharing company.) No, it comes down to the fact that an audio intent would be far more complex than any of the seven domains that did make the cut.

The introduction to SiriKit made it clear that applications have to provide "vocabulary" — terms that will be relevant to the command being issued.
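
For user-specific terms, that hand-off is a single call in the Intents framework. A minimal sketch, using workout names, one of the handful of string types the iOS 10 SDK actually supports (the names themselves are invented; app-wide terms like ride fare classes go through a separate vocabulary file instead):

```swift
import Intents

/// Register per-user terms so Siri can recognize them in later commands.
/// Vocabulary is handed over ahead of time, not while a request is in flight.
func registerWorkoutVocabulary() {
    // Invented example names a user might have created in a fitness app.
    let workoutNames = NSOrderedSet(array: [
        "Sunrise Shakeout",
        "Tempo Tuesday",
        "Long Run, Long Weekend"
    ])

    // INVocabularyStringType offers only a small, fixed set of cases
    // (contact names, photo tags, workout names, and so on). None of them
    // describe audio content.
    INVocabulary.shared().setVocabularyStrings(workoutNames, of: .workoutActivityName)
}
```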

For the ride-booking intent, how broad could that vocabulary be? An app only has to provide a few terms, like the applicable fare classes. Some audio apps would also have small vocabularies, like a list of subscribed podcasts. That would be sufficient for instructing Siri to play an episode from that finite list.
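
An audio intent could plausibly reuse the same mechanism for that finite list. The sketch below is hypothetical: the iOS 10 SDK defines no media-related vocabulary type, so the call a podcast client would actually need is left commented out (`.podcastTitle` is an invented case):

```swift
import Intents

/// Hypothetical: teach Siri a user's subscribed podcast titles.
/// A subscription list is small and finite, so pushing the whole thing
/// up front would be cheap.
func registerSubscribedPodcasts(_ titles: [String]) {
    let vocabulary = NSOrderedSet(array: titles)

    // INVocabularyStringType has no case for media titles, so this is the
    // imaginary piece:
    //
    //     INVocabulary.shared().setVocabularyStrings(vocabulary, of: .podcastTitle)
    //
    // With something like it, "play the latest Picomac in Overcast" could
    // resolve against the registered titles.
    _ = vocabulary
}
```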

But what if I wanted to subscribe to a new podcast by voice? I think this may be where SiriKit, in its current form, hits a wall. Siri has access to a basic dictionary, but it needs help from apps for non-standard words. "Picomac," for example, is not in that dictionary and wouldn't be recognized. An app like Overcast could only offer a vocabulary lesson if it already knew the word, and it can't make a network query to find out, because at that point Siri hasn't recognized enough of the phrase to give the app anything to fuzzy-match against. To be comprehensive, the app would have to take on the insane burden of keeping a local copy of the entire iTunes podcast directory. Otherwise, Siri's vocabulary lessons are caught in a chicken-and-egg scenario.

Apple manages to handle Siri requests against a massive vocabulary with Apple Music, but that service has a privileged position: the logic for resolving speech against its artists and songs can be baked directly into the Apple servers that handle Siri requests. I imagine Spotify access via Alexa works similarly, with Spotify providing Amazon a special feed of metadata for its entire catalog. Apple can't accept a firehose like that from every single developer, and given that it competes with several would-be Siri audio apps, it presumably doesn't want to broker individual deals either. The goal is probably a Siri audio intent open to all; there's just a serious scale problem to solve first.