14: Scraping the web with PDFpen

Yesterday I talked about how decreased internet access influenced how I dealt with my iOS apps while on vacation. My trip also put me in a rare situation: being completely without internet for several hours. This is of course a first-world problem — I was only disconnected because I was in an airplane over the Pacific Ocean — but it made me realize that an airplane cabin is a place I ordinarily assume will have access to the wider world, for a price anyway.

Those few hours were the perfect (and only) opportunity I had to plan my following day in Seattle, where I was making a 36-hour stopover. But I didn't have any printed travel guides with me, and I couldn't do research en route. Fortunately, I remembered a little-used feature of PDFpen Pro that would work in a pinch: it can scrape websites and generate PDFs from them. Literally as I packed up my suitcase, I navigated to a TripAdvisor page on "things to do in Seattle" and set the machine in motion.

The process is simple: select File > New > From HTML…, provide a single URL, and tweak some parameters, such as how many levels of links to follow and a maximum size for the output. After initially setting the limit too low, I capped my file at 250 pages. In about 15 minutes, I had the world's ugliest Seattle tour book saved to my desktop, ready for access 30,000 feet above the Pacific. Formatting and images weren't perfect, and the scrape pulled in some irrelevant pages (PDFpen is smart enough to stay within the hierarchical structure of a site, but many large, modern sites have an extremely flat structure, which foils it). Nonetheless, I could see a list of top attractions and even many of the reviews for them, just enough info to plan an afternoon of touristing.
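For the curious, the two knobs described above (how many levels of links to follow, and a cap on the output size) map onto a simple breadth-first crawl. What follows is a minimal Python sketch of that general idea, not PDFpen's actual implementation, which I have no visibility into. The names `crawl`, `max_depth`, `max_pages`, and `fetch`, as well as the "same host, same directory" rule standing in for "stay within the site's hierarchy," are all assumptions made for illustration. It only collects HTML; turning the pages into a PDF is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, max_pages=250, fetch=fetch_html):
    """Breadth-first crawl returning {url: html} for at most max_pages pages."""
    start = urlparse(start_url)
    # Assumed scope rule: same host, and a path under the start page's directory.
    base_path = start.path.rsplit("/", 1)[0] + "/"
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        if depth >= max_depth:
            continue  # deep enough: keep the page but don't follow its links
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            target = urlparse(link)
            in_scope = (target.netloc == start.netloc
                        and target.path.startswith(base_path))
            if in_scope and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The sketch also hints at why flat sites cause trouble: if every page lives directly under the site root, a directory-based scope rule like this one matches nearly everything, so the page cap ends up doing most of the filtering.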

Obscure features like this seem silly at times when you don't need them. Why should a PDF app do something so esoteric? Certainly, basic PDF applications shouldn't need to scrape the web, but PDFpen is a professional app whose design is based upon having everything you need in an omnibus package. If I had needed to download a separate app in that moment of packing and planning, it would have been too much of a barrier, and all that data would have stayed on the ground.