building a term-document matrix in spark

Leave a comment

A year-old stack overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a spark context. Let’s start by parallelizing some familiar sentences.

corpus = sc.parallelize([
    "We're talking about practice, man.",
    "We're talking about practice.",
    "We ain't talking about the game."])

The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, we would use a real tokenizer, but for illustrative purposes I’m going to keep it simple and split on spaces.

tokenized_corpus = corpus \
    .map(lambda raw_text: raw_text.split()) \

Next we need to map words to distinct indices and pull the resulting dictionary into the driver.

local_vocab_map = tokenized_corpus \
    .flatMap(lambda token: token) \
    .distinct() \
    .zipWithIndex() \

Then we can broadcast the vocabulary-index map (and its size) back out to the workers:

vocab_map = sc.broadcast(local_vocab_map)
vocab_size = sc.broadcast(len(local_vocab_map))

Finally, we need to:
1) Convert each document into a dictionary of word counts
2) Replace the keys in the dictionary with their indices
3) Convert the dictionary into a vector

With the help of two imports we can do this in three calls to map:

from pyspark.mllib.linalg import SparseVector
from collections import Counter

term_document_matix = tokenized_corpus \
    .map(Counter) \
    .map(lambda counts: {vocab_map.value[token]: float(counts[token]) for token in counts}) \
    .map(lambda index_counts: SparseVector(vocab_size.value, index_counts))


for vec in term_document_matix.collect():
    print vec


The practice rant turned fourteen today.

visualizing piero scaruffi’s music database

comments 5

Scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website.

I was introduced to the site when a friend began referring to certain records as “Scaruffi 7s,” in reference to their ratings in Scaruffi’s music database. One of the oldest components of the site, the database contains entries on thousands of acts. My initial reaction was along the lines of “7 doesn’t sound that great.” But scales are relative and Scaruffi is a judicious critic. Getting a 7 puts your record in the 84th percentile of albums that have received ratings (thousands don’t even get a number).

The Distribution of Scaruffi's Favor.png

In music writing, where homogeneity of style and opinion are typical, Scaruffi is a complete outlier. He reps for over-looked bands and fulminates against the orthodoxy. If top-40 is your thing you’ll likely be disappointed, but if you want to find out which Georgia Anne Muldrow album to start with, or about the interactive CD-ROM album Todd Rundgren made in 1993, well, Scaruffi is your guy. Having read the music database for some time, I decided I wanted to see what it looks like.


If you squint you can see band names

I wrote a web scraper to download Scaruffi’s entries on over 5000 bands. Using the networkx Python package, I converted the entries into a graph. Nodes represent bands and edges represent hyper-links between them. Finally, I plugged the results into the Gephi visualization tool.

Loosely speaking, we can interpret a node’s in-degree (the number of other nodes pointing to it) as the corresponding band’s level of influence. Why? Suppose Scaruffi ends up mentioning band x in his discussion of 20 other bands. The odds are that many of these mentions will discuss a creative debt to band x.


“I definitely don’t think of myself as being an influence.” – Kate Bush

Admittedly this interpretation is coarse. Not all references include hyper-links, and not all links represent an “influenced by” relation; other possibilities include shared members and side-projects. But empirically, it kind of pans out: the nodes of influential artists tend to have high in-degree.


Alternatively we could run PageRank, the original algorithm behind Google. PageRank measures a node’s influence not just in terms of its immediate neighbours, but throughout the entire network. So, for example, a large number of links from ’90s nu-metal bands probably won’t change your PageRank, but a single link from Bob Dylan helps a lot. The top five PageRank results were Dylan, the Beatles, ELO (see: a single link from Dylan helps a lot), Zappa, and Television.


Something about selling 30,000 copies

The two metrics have different flavours. The in-degree counts are more closely tied to Scaruffi’s interests and opinions (and if you don’t believe me read the first sentence of this).  Additionally, they are robust against outliers—missing and erroneous links won’t have a drastic effect. By contrast, PageRank is less tied to Scaruffi’s opinions. It may deem a band influential even when he doesn’t particularly like them. On the other hand, it can be extremely sensitive to outliers.

The in-degree metric is ultimately more interesting. The database is singular not because it is big enough to facilitate network analysis—both Wikipedia and AllMusic are considerably larger—but because it was crafted by a single person, and reflects his views.

(This post was updated on 10-02-2016)




the lowest form of wit: modelling sarcasm on reddit

Leave a comment

A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30Gb in SQLite format and you can still download it here). Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm.

A natural follow-up question is whether we can train a model to predict which comments have /s flags given only the words occurring in them, in a supervised learning setting. Prima facie it’s a tough problem—recognizing irony depends upon context—but with this much data you can often find interesting trends without fully solving the problem. A bag of words model combined with a discriminative linear classifier typically does well on this sort of task. Knowing this, I spent a few hours over Thanksgiving weekend trying to find out whether my laptop can grasp the lowest form of wit.

(I should add here that there are swaths of literature on automatic irony detection, and if you want a serious treatment of the subject you could start with this paper or this paper. Most of them mention the sort of results posted here, and some even use reddit data. But to my knowledge none of these use the /s flag as a labelling heuristic.)

How often are redditors sarcastic?

There are about 54 million comments in the database, of which only 30,000 have the /s flag. It would be unreasonable to expect a machine to improve upon a majority baseline this strong. (Humans can’t even solve the problem with full accuracy, as evidenced by our need for the /s flag). But a machine might able to learn something if we re-frame the problem with a uniform class distribution. If so we can ask how the machine managed to do this, and learn which words redditors reserve for berating one another.

How often can a machine pick up on this?

About 72% of the time. Details and code can be found in the final section.

What can this tell us about the language of sarcasm?

More interesting than the model’s accuracy is the collection of features to which it gives the most weight. Roughly speaking, we can interpret these as the words that separate sarcastic comments from serious ones. Intent and context ultimately determine irony, but we nonetheless tend to use certain words when we’re being sarcastic.

Consistent with the previous studies, the top five words that separate sarcastic comments from serious ones are intensifiers:

  1. clearly e.g. “Nefarious corporate interest clearly at work /s
  2. obviously e.g. “Yes, captioning a 30+ second long Seinfeld GIF while staying faithful to the script is obviously low effort /s
  3. totally e.g. “No agenda though, totally neutral broadcasting /s
  4. must e.g. “You must have a great software background /s
  5. dare (as in “how dare”) e.g. “Yeah, how dare she fact check Romney, that’s the last thing I want someone to do to a politician. /s

Other high-ranking features are topic-specific: what sports fans call a “scrub,” gamers call a “shitlord,” and if you use sarcasm in a political debate, be sure to end with “amirite?

You also see meta-features, resulting from discussion of the /s flag. This sort of problem frequently arises when you use a heuristic to obtain labels (as opposed to labelling your training data by hand). Examples include “forgot,” e.g. “You forgot the /s“; “dropped,” e.g. “You dropped this /s“; and “sarcasm,” e.g. “If you want to convey sarcasm just put a /s.

While some of the results are charming, others serve as a depressing reminder of reddit’s well-documented sexism problem. The gender politics lexicon includes multiple strong indicators of sarcasmpatriarchy,” “sexist,” “women,” and “privilege” all show up in the top 30. Reddit’s anti-harassment policy was introduced the same month this data was collected. It would be interesting to repeat the experiment on more recent data and see if there has been any change in the number of sarcastic mentions of these words.

Code, full results, and boring technical minutiae

I used TF-IDF feature scaling over a unigram bag-of-words model, and L2-regularized logistic regression. Words appearing less than 5 times were ignored and non-alphanumeric chars were stripped. (In retrospect this is a debatable choice—some studies have found that punctuation correlates with irony—and ideally you’d want to do proper tokenization.) The heavy lifting was done with scikit-learn.

My code is available here and a full list of output can be found here. If you have a Kaggle account you can run it yourself and play around with the parameters. Have fun /s.