A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here.) Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm.
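For anyone who wants to poke at the data the same way, here is a minimal sketch of such a query using Python's built-in sqlite3 module. The table name May2015 and the column name body are assumptions about the database schema, and matching on a trailing " /s" is just one plausible way to define the flag; this is not the original Kaggle script.

```python
import sqlite3


def fetch_flagged(conn, limit=10):
    """Return up to `limit` comment bodies that end with a standalone /s."""
    # Assumption: comments live in a table called May2015 with a text column
    # called body. The leading space in the pattern avoids matching URLs that
    # happen to end in "/s".
    query = "SELECT body FROM May2015 WHERE body LIKE '% /s' LIMIT ?"
    return [row[0] for row in conn.execute(query, (limit,))]
```

Calling fetch_flagged(sqlite3.connect(path_to_database)) would return a handful of flagged comments, where path_to_database is wherever you saved the download.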
A natural follow-up question is whether we can train a model to predict which comments have /s flags given only the words occurring in them, in a supervised learning setting. Prima facie it’s a tough problem—recognizing irony depends upon context—but with this much data you can often find interesting trends without fully solving the problem. A bag of words model combined with a discriminative linear classifier typically does well on this sort of task. Knowing this, I spent a few hours over Thanksgiving weekend trying to find out whether my laptop can grasp the lowest form of wit.
(I should add here that there are swaths of literature on automatic irony detection, and if you want a serious treatment of the subject you could start with this paper or this paper. Most of them mention the sort of results posted here, and some even use reddit data. But to my knowledge none of these use the /s flag as a labelling heuristic.)
How often are redditors sarcastic?
There are about 54 million comments in the database, of which only 30,000 have the /s flag. A classifier that always predicts “not sarcastic” is therefore right more than 99.9% of the time, and it would be unreasonable to expect a machine to improve upon a majority baseline this strong. (Humans can’t even solve the problem with full accuracy, as evidenced by our need for the /s flag.) But a machine might be able to learn something if we re-frame the problem with a uniform class distribution, that is, with equal numbers of flagged and unflagged comments. If so, we can ask how the machine managed to do this, and learn which words redditors reserve for berating one another.
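To make the re-framing concrete, here is a rough sketch of the labelling heuristic and the rebalancing step, using only the standard library. The regular expression, the function names, and the choice to strip the flag from the text before training are my own assumptions for illustration, not a record of what was actually run.

```python
import random
import re

# Assumption: a comment is "sarcastic" when it ends with a standalone /s.
SARCASM_FLAG = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def label_comment(body):
    """Return (text_without_flag, is_sarcastic) for one comment body."""
    match = SARCASM_FLAG.search(body)
    if match is None:
        return body, False
    # Drop the flag itself so the label cannot leak into the features.
    return body[:match.start()].rstrip(), True


def balanced_sample(bodies, seed=0):
    """Pair every flagged comment with one randomly chosen unflagged comment."""
    sarcastic, serious = [], []
    for body in bodies:
        text, flagged = label_comment(body)
        (sarcastic if flagged else serious).append(text)
    n = min(len(sarcastic), len(serious))
    rng = random.Random(seed)
    positives = [(text, 1) for text in rng.sample(sarcastic, n)]
    negatives = [(text, 0) for text in rng.sample(serious, n)]
    return positives + negatives


# Majority baseline implied by the counts quoted above.
total_comments, flagged_comments = 54_000_000, 30_000
baseline_accuracy = 1 - flagged_comments / total_comments  # roughly 0.9994
```

On the balanced sample, always guessing one class is right only half the time, which leaves a classifier room to show it has learned something.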
How often can a machine pick up on this?
About 72% of the time. Details and code can be found in the final section.
What can this tell us about the language of sarcasm?
More interesting than the model’s accuracy is the collection of features to which it gives the most weight. Roughly speaking, we can interpret these as the words that separate sarcastic comments from serious ones. Intent and context ultimately determine irony, but we nonetheless tend to use certain words when we’re being sarcastic.
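Reading off the most heavily weighted features from a linear model amounts to sorting the vocabulary by coefficient. A small sketch, assuming the words and their learned weights are available as two parallel sequences (the function name and the example numbers are placeholders I made up, not the model's actual coefficients):

```python
def top_features(feature_names, weights, k=5):
    """Return the k (word, weight) pairs with the largest positive weight."""
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder values purely for illustration.
example_names = ["clearly", "thanks", "obviously", "the"]
example_weights = [2.0, -1.5, 1.2, 0.1]
print(top_features(example_names, example_weights, k=2))
```

With scikit-learn, the two inputs would typically come from the vectorizer's get_feature_names_out() and the first row of the classifier's coef_ attribute.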
Consistent with previous studies, the top five words that separate sarcastic comments from serious ones are intensifiers:
- clearly e.g. “Nefarious corporate interest clearly at work /s”
- obviously e.g. “Yes, captioning a 30+ second long Seinfeld GIF while staying faithful to the script is obviously low effort /s”
- totally e.g. “No agenda though, totally neutral broadcasting /s”
- must e.g. “You must have a great software background /s”
- dare (as in “how dare”) e.g. “Yeah, how dare she fact check Romney, that’s the last thing I want someone to do to a politician. /s”
Other high-ranking features are topic-specific: what sports fans call a “scrub,” gamers call a “shitlord,” and if you use sarcasm in a political debate, be sure to end with “amirite?”
You also see meta-features, resulting from discussion of the /s flag. This sort of problem frequently arises when you use a heuristic to obtain labels (as opposed to labelling your training data by hand). Examples include “forgot,” e.g. “You forgot the /s”; “dropped,” e.g. “You dropped this /s”; and “sarcasm,” e.g. “If you want to convey sarcasm just put a /s.”
While some of the results are charming, others serve as a depressing reminder of reddit’s well-documented sexism problem. The gender politics lexicon includes multiple strong indicators of sarcasm—“patriarchy,” “sexist,” “women,” and “privilege” all show up in the top 30. Reddit’s anti-harassment policy was introduced the same month this data was collected. It would be interesting to repeat the experiment on more recent data and see if there has been any change in the number of sarcastic mentions of these words.
Code, full results, and boring technical minutiae
I used TF-IDF feature scaling over a unigram bag-of-words model, and L2-regularized logistic regression. Words appearing fewer than five times were ignored and non-alphanumeric characters were stripped. (In retrospect this is a debatable choice, since some studies have found that punctuation correlates with irony, and ideally you’d want to do proper tokenization.) The heavy lifting was done with scikit-learn.
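The setup described above might look something like the following in scikit-learn. Treat this as a reconstruction under assumptions rather than the exact code: the helper names are mine, min_df=5 drops words that occur in fewer than five comments (a close stand-in for "fewer than five times"), and the regularization strength is left at the library default.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def strip_non_alphanumeric(text):
    """Lowercase the text and blank out anything that is not a letter, digit or space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def build_model(min_df=5):
    """Unigram TF-IDF features feeding an L2-regularized logistic regression."""
    vectorizer = TfidfVectorizer(
        preprocessor=strip_non_alphanumeric,  # replaces the default lowercasing step
        ngram_range=(1, 1),                   # unigrams only
        min_df=min_df,                        # ignore words seen in fewer than min_df comments
    )
    # LogisticRegression applies an L2 penalty by default; C=1.0 is an assumption.
    classifier = LogisticRegression(C=1.0, max_iter=1000)
    return make_pipeline(vectorizer, classifier)
```

Fitting is then build_model().fit(texts, labels), where texts is a list of comment bodies and labels is a matching list of 0s and 1s. Note that the vectorizer's default token pattern also drops single-character tokens.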