10: QuickType's quirky collocates

At work, I'm the grammar and vocabulary specialist. This means that I think a lot about the selection and position of words in the more cromulent uses of English. It also means that I'm fascinated by computer systems that try to do the same, often with amusing results. Perhaps you have a Twitter friend with an alter ego bot that mashes up their previous tweets in weird and wonderful ways. Or if you type with the default keyboard on an iOS device, you've certainly seen the QuickType bar, which since iOS 8 has provided suggestions for the word you're currently typing, or for the next word you might type if you've just hit space.

The technology varies slightly, but all of these language guessers are built on the basic premise of a Markov chain, a model that looks only at where it is now and guesses what comes next. The simplest version works with bigrams, or pairs of adjacent words. English speakers have a sense for common collocations, or words that go together, and QuickType tries to approximate that. For example, typing "physical" gives two close collocates, "therapy" and "activity", plus "and" (a default option). As little as I use QuickType, since my thumbs are pretty adroit on the iPhone keyboard, I keep it on to watch these word pairs fly by. Sometimes they're disconcerting, like when "murder" is suggested for a split second for an unrelated word that begins with "m".
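To make the bigram idea concrete, here's a minimal sketch in Python. It's a toy, not how QuickType is actually built: the tiny corpus is invented, and the names `build_bigrams` and `suggest` are made up for illustration. All it does is count which words follow which, then offer the most frequent followers.

```python
from collections import Counter, defaultdict

# Invented toy corpus; a real keyboard would learn from vastly more text.
CORPUS = (
    "physical therapy helps . physical therapy works . "
    "physical activity helps . physical therapy and physical activity ."
)

def build_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def suggest(following, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = following.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

bigrams = build_bigrams(CORPUS)
print(suggest(bigrams, "physical"))  # ['therapy', 'activity']
```

In this toy text "therapy" follows "physical" three times and "activity" twice, so they come back in that order. A word the model has never seen simply gets no suggestions.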

And sometimes they reveal all too much. As far as I've found, "teen" is the most loaded word in the QuickType lexicon, with suggestions "wolf", "mom", and "pregnancy". Yikes. But let's roll with it. In fact, when the feature was brand new, a lot of nonsense poetry generated by repeatedly choosing the first QuickType option was passed around the internet. These poems tend to get weird fast; the algorithm can only "look back" two words (maybe three? I'm not sure), and hitting common function words like "and" or "the" tends to reset things. It's hard to get a chain of five words or more in which each word is clearly conditioned on the ones before it.

So back to our "teen". "Wolf" comes first, and generates "pack". Next a function word, "of", which could send us astray. But "pack of" has a strong collocate: "cigarettes". It all makes sense in microcosm. But step back, and there's your next band name: Teen Wolf Pack of Cigarettes.
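The same game can be played in code. What follows is a rough sketch under loud assumptions: the corpus is invented so that its counts mirror the chain above, the two-word lookback with a fall-back to one word is my guess at the general shape of such a predictor rather than a description of QuickType, and `build_model`, `top_suggestion`, and `chain` are hypothetical names.

```python
from collections import Counter, defaultdict

# Invented toy text, arranged so its counts mirror the chain described above.
CORPUS = (
    "teen wolf pack of cigarettes . teen wolf pack of wolves . "
    "a pack of cigarettes . teen mom . "
    "a cup of tea . a cup of tea . a cup of tea ."
)

def build_model(text):
    """Count followers of single words and of two-word contexts."""
    words = text.lower().split()
    one = defaultdict(Counter)  # previous word -> next-word counts
    two = defaultdict(Counter)  # (two previous words) -> next-word counts
    for a, b in zip(words, words[1:]):
        one[a][b] += 1
    for a, b, c in zip(words, words[1:], words[2:]):
        two[(a, b)][c] += 1
    return one, two

def top_suggestion(model, history):
    """Pick the first suggestion, preferring the two-word context if known."""
    one, two = model
    context = tuple(history[-2:])
    if len(context) == 2 and context in two:
        return two[context].most_common(1)[0][0]
    if history and history[-1] in one:
        return one[history[-1]].most_common(1)[0][0]
    return None

def chain(model, start, length):
    """Repeatedly accept the first suggestion, like tapping the QuickType bar."""
    words = [start]
    while len(words) < length:
        nxt = top_suggestion(model, words)
        if nxt is None:
            break
        words.append(nxt)
    return words

model = build_model(CORPUS)
print(" ".join(chain(model, "teen", 5)))  # teen wolf pack of cigarettes
```

Note what the second word of lookback buys here. On its own, "of" is most often followed by "tea" in this toy text, which is exactly the kind of reset a function word causes. It's only because the model also remembers "pack" that it lands on "cigarettes".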