Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.
— Robert Heinlein
Yesterday a friend came to me with a problem. He and his colleagues regularly make use of articles posted on a website. It’s a useful site, but it has two shortcomings. The first is that articles on the website are sometimes redacted. This is less-than-ideal—just because an article gets redacted doesn’t mean that they won’t be interested in it at some point in the future. The second is that the site’s search bar is terrible.
My friend wanted to solve both these problems at once by creating a mirror of the content. Roughly what he had in mind was a cronjob that polled the site for new updates every hour and saved them to his machine as flat files. This way, no data would ever be lost, and he could search over it however he liked. He didn’t actually use the words “cronjob” or “polling”, but this is the solution I read between the lines. This all sounded reasonable enough, so I agreed to go down the cronjob rabbit hole for him.
Before you start scraping a site you should always read two things: the robots.txt file and the terms of service. The robots.txt was fairly unrestrictive; it didn’t mention the path containing the updates he was interested in. Similarly, the TOS suggested that not only were they okay with users making copies of the data, they encouraged it. In particular, you could request that updates be emailed to you.
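For what it's worth, the robots.txt half of that check is easy to script. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `mirror-bot` user agent, and the `example.org` URLs are all made up for illustration; they are not the real site's. The terms of service still have to be read by a human.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "mirror-bot" and the URLs below are placeholders.
    for url in (
        "https://example.org/updates/latest",
        "https://example.org/admin/settings",
    ):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "mirror-bot", url) else "disallowed"
        print(url, "->", verdict)
```

Against a live site you would call `set_url()` and `read()` on the parser instead of feeding it a string. Keep in mind that a path robots.txt doesn't mention is treated as allowed, which is exactly the situation I found.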
At this point I realized I was going to do more harm than good. I suggested my friend fire me, set up a special-purpose email account, and request that the organization send HTML copies of all their updates. This is a much better solution than running a cronjob on a MacBook. In addition to avoiding the edge cases and workarounds that come with a cronjob, you have a straightforward way to share the data with colleagues, and you can use your email client’s filtering and search functions to organize and find documents. By firing myself, I saved my friend hours of work, as well as the inevitable headache that would result when he missed an important update because his MacBook died.
I heard a good joke at work once: “How many programmers does it take to screw in a lightbulb? / Why are you talking about a lightbulb—what you need is light.” The second person, if profoundly annoying, is probably a dynamite programmer. The possibility of an easier way shouldn’t just cross your mind on occasion. It should be front and centre, all the time, with respect to basically any problem you’re working on. I’m glad I didn’t end up helping my friend, and I hope that the next time he asks I don’t even consider it.