solving SAT in python

SAT is hard, but there are algorithms that tend to do okay empirically. I recently learned about the Davis-Putnam-Logemann-Loveland (DPLL) procedure and rolled my own short Python implementation.

(I can’t get no) satisfaction

A boolean formula is called “satisfiable” if you can assign truth values to the underlying atoms in such a way that the entire formula comes out true. For example, the formula p \land \lnot q is satisfiable: just set p to true and q to false. By contrast, the formula p \land \lnot p is “unsatisfiable” because it comes out false irrespective of what value you assign to p.

The general problem – does a satisfying assignment exist for a given formula – is called boolean satisfiability (SAT). The problem of 3SAT is the same, except that we only consider formulae in 3CNF (conjunctive normal form): conjunctions in which every conjunct is a disjunction of three literals. It turns out that solving SAT is equivalent to solving the restricted problem of 3SAT, and the restriction makes representation easier, so I focused on it. For example, the formula (p \lor \lnot q \lor r) \land (p \lor q \lor \lnot s) is in 3CNF; the formula (p \lor \lnot q \land r) \lor (p \lor \lnot s) is not.

John Franco and John Martin have written extensively on the history of SAT, which goes as far back as Aristotle. The problem received attention from medieval, enlightenment, and early analytic philosophers. Modern computational approaches emerged in the middle of the last century. The DPLL procedure was published in 1962 (here’s the original paper). Its core ideas remain important ingredients in modern SAT solvers.


After a brief and irritating foray down the OOP path, I decided to use raw types to represent everything. My code uses the following conventions:

A literal is a tuple consisting of a name and a sign. E.g. \lnot p is represented as:

("p", False)

A disjunctive clause is a set of literals. E.g. p \lor \lnot q \lor r is represented as:

{("p", True), ("q", False), ("r", True)}

A CNF formula is a list of disjunctive clauses. E.g. (p \lor \lnot q) \land (p \lor r) is represented as:

[{("p", True), ("q", False)}, {("p", True), ("r", True)}]
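To sanity-check these conventions, here is a small helper (the name `evaluate` is mine, not part of the solver) that evaluates a CNF formula under an assignment, where the assignment is itself a set of literals:

```python
def evaluate(cnf, assignment):
    """True iff every disjunctive clause shares a literal with the assignment."""
    return all(clause & assignment for clause in cnf)

# (p ∨ ¬q) ∧ (p ∨ r)
cnf = [{("p", True), ("q", False)}, {("p", True), ("r", True)}]

# p=True, q=True, r=False satisfies both clauses via ("p", True):
assignment = {("p", True), ("q", True), ("r", False)}
print(evaluate(cnf, assignment))  # True
```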

the naive approach

The simplest way to solve SAT is to try everything. Given a single assignment (represented as a set of literals), we can determine whether it satisfies the formula by testing whether it intersects with every conjunct. To test if the formula is satisfiable, we iterate over all assignments, testing each as we go.

import itertools

def brute_force(cnf):
    literals = set()
    for conj in cnf:
        for disj in conj:
            literals.add(disj[0])  # collect the distinct atom names

    literals = list(literals)
    n = len(literals)
    for seq in itertools.product([True, False], repeat=n):
        a = set(zip(literals, seq))
        if all(disj.intersection(a) for disj in cnf):
            return True, a

    return False, None


The DPLL procedure is a recursive search algorithm. The base cases are those of an empty conjunction, which is true, and a conjunction containing an empty disjunct, which is false. (If you find the latter off-putting read this explanation). The recursive case then deals with a non-empty CNF formula that has no empty clauses. We pick an arbitrary literal and set it to true. This means that all conjuncts containing the literal drop out, and the literal’s negation is removed from every conjunct that contains it. Then we recurse. If this doesn’t result in a satisfying assignment, we set the same literal to false and repeat the same procedure.
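The simplification step can be seen in isolation before reading the full procedure. This standalone sketch (formula and variable names are mine) assigns p = True and simplifies:

```python
# (p ∨ ¬q) ∧ (¬p ∨ r) ∧ (q ∨ r), using the set-of-literals representation:
cnf = [{("p", True), ("q", False)},
       {("p", False), ("r", True)},
       {("q", True), ("r", True)}]
l = "p"

# Clauses containing p are now satisfied and drop out entirely:
new_cnf = [c for c in cnf if (l, True) not in c]
# ¬p can no longer satisfy anything, so strip it from the remaining clauses:
new_cnf = [c.difference({(l, False)}) for c in new_cnf]

# The first clause is gone; the second has shrunk to {("r", True)}.
```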

def __select_literal(cnf):
    for c in cnf:
        for literal in c:
            return literal[0]

def dpll(cnf, assignments={}):

    # Base case: an empty conjunction is trivially true.
    if len(cnf) == 0:
        return True, assignments

    # Base case: an empty disjunct can never be satisfied.
    if any(len(c) == 0 for c in cnf):
        return False, None

    l = __select_literal(cnf)

    # Try l = True: satisfied clauses drop out, ¬l is stripped from the rest.
    new_cnf = [c for c in cnf if (l, True) not in c]
    new_cnf = [c.difference({(l, False)}) for c in new_cnf]
    sat, vals = dpll(new_cnf, {**assignments, l: True})
    if sat:
        return sat, vals

    # Try l = False: the mirror image of the step above.
    new_cnf = [c for c in cnf if (l, False) not in c]
    new_cnf = [c.difference({(l, True)}) for c in new_cnf]
    sat, vals = dpll(new_cnf, {**assignments, l: False})
    if sat:
        return sat, vals

    return False, None


I used the following code to generate random 3CNF formulae:

import random

def random_kcnf(n_literals, n_conjuncts, k=3):
    result = []
    for _ in range(n_conjuncts):
        conj = set()
        for _ in range(k):
            index = random.randint(0, n_literals - 1)
            conj.add((
                str(index).rjust(10, '0'),  # zero-padded literal name
                random.random() < 0.5,      # random sign
            ))
        result.append(conj)
    return result

I then compared the performance of the brute force approach and DPLL over various numbers of literals. Each score is averaged over one hundred random formulae.


Full code can be found here.

active learning

Active learning is a subfield of machine learning that probably doesn’t receive as much attention as it should. The fundamental idea behind active learning is that some instances are more informative than others, and if a learner can choose the instances it trains on, it can learn faster than it would on an unbiased random sample.

This post is a high-level primer with some Python code. If you want a serious introduction to the topic I strongly recommend you buy Burr Settles’ textbook.

how it works

Suppose we have a small collection of labeled data and a large collection of unlabeled data. We want more labeled data so that we can train a better classifier. We have the option of paying to get some of our unlabeled data points labeled. (This is not an unrealistic scenario — examples include hiring experts to annotate data, or conducting lab tests). If we have a fixed budget, what’s the optimal way to spend our money?

One suggestion is random sampling. This is a pretty reasonable thing to do and it generally works fine in practice. But in many cases, we can outperform random sampling by learning actively. The idea is to:

  • Train a probabilistic supervised model on our labeled data
  • Use the model to perform inference on the unlabeled data
  • Identify instances that confuse our model, and request their labels

Of course, we have to say more about what it means for a model to be “confused” by an instance. Proposing various definitions of this — and seeing how they cash out in experiments — is what active learning is fundamentally about.
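The three steps above can be sketched end-to-end in a few lines. This is a minimal illustration, not the experiments from this post: the data is a synthetic toy set, and the pool sizes and budget are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the labeled and unlabeled pools:
X, y = make_classification(n_samples=120, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_labeled, y_labeled = X[:20], y[:20]
X_unlabeled = X[20:]
budget = 10

clf = LogisticRegression().fit(X_labeled, y_labeled)  # 1. train on labeled data
probs = clf.predict_proba(X_unlabeled)                # 2. infer on unlabeled data
scores = 1 - np.amax(probs, axis=1)                   # 3. find confusing instances...
query_idx = np.argsort(-scores)[:budget]
X_query = X_unlabeled[query_idx]                      # ...and request their labels
```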

uncertainty sampling

The simplest approach to active learning is uncertainty sampling. Despite being simple, it’s typically competitive. It comes in three forms. The first is “least confidence” sampling, in which we seek instances which minimize the probability the classifier assigns to the predicted label. In Python, it might look something like this:

# Perform inference:
probs = clf.predict_proba(X_unlabeled)

# Rank predictions by confidence:
scores = 1 - np.amax(probs, axis=1)

# Sort them:
rankings = np.argsort(-scores)

# Take the points for which we have a budget:
X_active = X_unlabeled[rankings[:budget]]

Another approach is to seek points that minimize the difference between the probabilities of the top two predictions, effectively looking for cases where the model is torn between two labels. This is called “max margin” sampling (it turns out this is what people do).

# Compute the margin of each instance:
ordered = np.partition(-probs, 1, axis=1)
margins = -np.abs(ordered[:,0] - ordered[:, 1])

# Sort them:
rankings = np.argsort(-margins)

One might criticize the first two approaches on the grounds that they throw away all the information about the relative probabilities of the other labels. A third technique, which doesn’t share this feature, is to maximize the entropy of the output distribution:

from scipy.stats import entropy
scores = np.apply_along_axis(entropy, 1, probs)
rankings = np.argsort(-scores)

It’s not hard to prove that in binary classification, all three approaches are equivalent. Full implementations of these techniques (and a couple others) can be found here.
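As a quick numerical sanity check of that claim (my own, with the entropy computed directly in numpy so no extra import is needed), all three scores induce the same ranking on binary predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(8)
probs = np.column_stack([p, 1 - p])  # binary predicted probabilities

lc = 1 - np.amax(probs, axis=1)                   # least confidence
ordered = np.partition(-probs, 1, axis=1)
margins = -np.abs(ordered[:, 0] - ordered[:, 1])  # negated margin
ent = -(probs * np.log(probs)).sum(axis=1)        # entropy of each row

# All three rankings agree:
print(np.array_equal(np.argsort(-lc), np.argsort(-margins)))  # True
print(np.array_equal(np.argsort(-lc), np.argsort(-ent)))      # True
```

Each score is a strictly monotone function of how far the positive-class probability sits from 0.5, which is why the orderings coincide.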

what it looks like

We can better see what is going on by letting a linear model learn actively on MNIST. It’s well-known that the MNIST instances that models tend to misclassify are not necessarily the same ones that people tend to misclassify. But if we look at the instances an active learner requests, they match our intuitions surprisingly well.

The top 50 instances requested by uncertainty sampling to help discern a seven from a non-seven

Above is a plot of the 50 most informative instances obtained after:

  • Restricting the MNIST task to a binary problem of “seven or not-seven”
  • Training a linear model on 2000 random samples
  • Performing uncertainty sampling

Most of them are either shittily drawn sevens, or other digits that have some of the distinctive properties of a seven. Additionally, the distribution of digits is heavily skewed to favour digits that are more similar in structure to seven.

On the one hand, it would be a mistake to fully equate the algorithm singling out a slanted “one” with a human asking whether the line across the top is an essential property of a “seven.” Statistical models of character recognition don’t operate in terms of necessary and sufficient conditions. On the other hand, it’s clear that both people and computers can glean insight from cases close to the margin for similar (if not identical) reasons.


building a term-document matrix in spark

A year-old Stack Overflow question that I’m able to answer? This is like spotting Bigfoot. I’m going to assume access to nothing more than a SparkContext. Let’s start by parallelizing some familiar sentences.

corpus = sc.parallelize([
    "We're talking about practice, man.",
    "We're talking about practice.",
    "We ain't talking about the game."])

The first step is to tokenize our documents and cache the resulting RDD. In practice, by which I mean the game, we would use a real tokenizer, but for illustrative purposes I’m going to keep it simple and split on spaces.

tokenized_corpus = corpus \
    .map(lambda raw_text: raw_text.split()) \
    .cache()

Next we need to map words to distinct indices and pull the resulting dictionary into the driver.

local_vocab_map = tokenized_corpus \
    .flatMap(lambda token: token) \
    .distinct() \
    .zipWithIndex() \
    .collectAsMap()
Then we can broadcast the vocabulary-index map (and its size) back out to the workers:

vocab_map = sc.broadcast(local_vocab_map)
vocab_size = sc.broadcast(len(local_vocab_map))

Finally, we need to:
1) Convert each document into a dictionary of word counts
2) Replace the keys in the dictionary with their indices
3) Convert the dictionary into a vector

With the help of two imports we can do this in three calls to map:

from pyspark.mllib.linalg import SparseVector
from collections import Counter

term_document_matrix = tokenized_corpus \
    .map(Counter) \
    .map(lambda counts: {vocab_map.value[token]: float(counts[token]) for token in counts}) \
    .map(lambda index_counts: SparseVector(vocab_size.value, index_counts))


for vec in term_document_matrix.collect():
    print(vec)
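For intuition, here is the same transformation sketched locally in plain Python (my own illustration, not part of the Spark pipeline; note that `zipWithIndex` assigns arbitrary indices, whereas this sketch sorts the vocabulary to make the result deterministic):

```python
from collections import Counter

corpus = [
    "We're talking about practice, man.",
    "We're talking about practice.",
    "We ain't talking about the game.",
]

tokenized = [doc.split() for doc in corpus]

# Map each distinct token to an index (sorted for determinism):
vocab = sorted({token for doc in tokenized for token in doc})
vocab_map = {token: index for index, token in enumerate(vocab)}
vocab_size = len(vocab_map)

# Count words per document, then re-key the counts by vocabulary index:
rows = [
    {vocab_map[token]: float(count) for token, count in Counter(doc).items()}
    for doc in tokenized
]

for row in rows:
    print(vocab_size, row)
```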


The practice rant turned fourteen today.

visualizing piero scaruffi’s music database

Scaruffi’s music database

Since the mid-1980s, Piero Scaruffi has written essays on countless topics, and published them all for free on the internet – which he helped develop. You can learn more about him (and pretty much anything else that might interest you) on his legendary website.

I was introduced to the site when a friend began referring to certain records as “Scaruffi 7s,” in reference to their ratings in Scaruffi’s music database. One of the oldest components of the site, the database contains entries on thousands of acts. My initial reaction was along the lines of “7 doesn’t sound that great.” But scales are relative and Scaruffi is a judicious critic. Getting a 7 puts your record in the 84th percentile of albums that have received ratings (thousands don’t even get a number).

The Distribution of Scaruffi's Favor

In music writing, where homogeneity of style and opinion are typical, Scaruffi is a complete outlier. He reps for overlooked bands and fulminates against the orthodoxy. If top-40 is your thing you’ll likely be disappointed, but if you want to find out which Georgia Anne Muldrow album to start with, or about the interactive CD-ROM album Todd Rundgren made in 1993, well, Scaruffi is your guy. Having read the music database for some time, I decided I wanted to see what it looks like.

If you squint you can see band names

I wrote a web scraper to download Scaruffi’s entries on over 5000 bands. Using the networkx Python package, I converted the entries into a graph. Nodes represent bands and edges represent hyperlinks between them. Finally, I plugged the results into the Gephi visualization tool.

Loosely speaking, we can interpret a node’s in-degree (the number of other nodes pointing to it) as the corresponding band’s level of influence. Why? Suppose Scaruffi ends up mentioning band x in his discussion of 20 other bands. The odds are that many of these mentions will discuss a creative debt to band x.

“I definitely don’t think of myself as being an influence.” – Kate Bush

Admittedly this interpretation is coarse. Not all references include hyperlinks, and not all links represent an “influenced by” relation; other possibilities include shared members and side-projects. But empirically, it kind of pans out: the nodes of influential artists tend to have high in-degree.


Alternatively we could run PageRank, the original algorithm behind Google. PageRank measures a node’s influence not just in terms of its immediate neighbours, but throughout the entire network. So, for example, a large number of links from ’90s nu-metal bands probably won’t change your PageRank, but a single link from Bob Dylan helps a lot. The top five PageRank results were Dylan, the Beatles, ELO (see: a single link from Dylan helps a lot), Zappa, and Television.
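The contrast between the two metrics can be reproduced on a toy graph. This pure-Python sketch (node names are hypothetical, and the power-iteration PageRank is my own simplified implementation, not Gephi’s) builds a network where one node collects many links from obscure sources while another gets a single link from a hub:

```python
def pagerank(graph, d=0.85, iters=50):
    """Basic power iteration; graph maps node -> list of outgoing links."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            targets = outs if outs else nodes  # dangling nodes spread evenly
            for m in targets:
                new[m] += d * rank[n] / len(targets)
        rank = new
    return rank

graph = {
    # Five well-linked fans point at the hub "dylan", who links to "elo";
    # three obscure nodes all point at "band_x".
    "fan1": ["dylan"], "fan2": ["dylan"], "fan3": ["dylan"],
    "fan4": ["dylan"], "fan5": ["dylan"],
    "dylan": ["elo"],
    "numetal1": ["band_x"], "numetal2": ["band_x"], "numetal3": ["band_x"],
    "elo": [], "band_x": [],
}

in_degree = {n: 0 for n in graph}
for outs in graph.values():
    for m in outs:
        in_degree[m] += 1

pr = pagerank(graph)
print(in_degree["band_x"] > in_degree["elo"])  # True: band_x wins on raw links
print(pr["elo"] > pr["band_x"])                # True: one link from a hub wins
```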

Something about selling 30,000 copies

The two metrics have different flavours. The in-degree counts are more closely tied to Scaruffi’s interests and opinions (and if you don’t believe me read the first sentence of this). Additionally, they are robust against outliers—missing and erroneous links won’t have a drastic effect. By contrast, PageRank is less tied to Scaruffi’s opinions. It may deem a band influential even when he doesn’t particularly like them. On the other hand, it can be extremely sensitive to outliers.

The in-degree metric is ultimately more interesting. The database is singular not because it is big enough to facilitate network analysis—both Wikipedia and AllMusic are considerably larger—but because it was crafted by a single person, and reflects his views.

(This post was updated on 10-02-2016)




the lowest form of wit: modelling sarcasm on reddit

A while back Kaggle introduced a database containing all the comments that were posted to reddit in May 2015. (The data is 30 GB in SQLite format and you can still download it here). Kagglers were encouraged to try NLP experiments with the data. One of the more interesting responses was a script that queried and displayed comments containing the /s flag, which indicates sarcasm.

A natural follow-up question is whether we can train a model to predict which comments have /s flags given only the words occurring in them, in a supervised learning setting. Prima facie it’s a tough problem—recognizing irony depends upon context—but with this much data you can often find interesting trends without fully solving the problem. A bag of words model combined with a discriminative linear classifier typically does well on this sort of task. Knowing this, I spent a few hours over Thanksgiving weekend trying to find out whether my laptop can grasp the lowest form of wit.

(I should add here that there are swaths of literature on automatic irony detection, and if you want a serious treatment of the subject you could start with this paper or this paper. Most of them mention the sort of results posted here, and some even use reddit data. But to my knowledge none of these use the /s flag as a labelling heuristic.)

How often are redditors sarcastic?

There are about 54 million comments in the database, of which only 30,000 have the /s flag. It would be unreasonable to expect a machine to improve upon a majority baseline this strong. (Humans can’t even solve the problem with full accuracy, as evidenced by our need for the /s flag). But a machine might be able to learn something if we re-frame the problem with a uniform class distribution. If so we can ask how the machine managed to do this, and learn which words redditors reserve for berating one another.
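The re-framing step amounts to downsampling the majority class until the two classes are the same size. A sketch with made-up stand-in data (the real figures are 30,000 /s comments out of roughly 54 million):

```python
import random

random.seed(0)
# Hypothetical stand-ins for the real comments:
sarcastic = ["/s comment %d" % i for i in range(30)]
serious = ["serious comment %d" % i for i in range(5400)]

# Downsample the majority class to the size of the minority class,
# giving a uniform class distribution:
balanced = sarcastic + random.sample(serious, len(sarcastic))
labels = [1] * len(sarcastic) + [0] * len(sarcastic)
```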

How often can a machine pick up on this?

About 72% of the time. Details and code can be found in the final section.

What can this tell us about the language of sarcasm?

More interesting than the model’s accuracy is the collection of features to which it gives the most weight. Roughly speaking, we can interpret these as the words that separate sarcastic comments from serious ones. Intent and context ultimately determine irony, but we nonetheless tend to use certain words when we’re being sarcastic.

Consistent with the previous studies, the top five words that separate sarcastic comments from serious ones are intensifiers:

  1. clearly e.g. “Nefarious corporate interest clearly at work /s”
  2. obviously e.g. “Yes, captioning a 30+ second long Seinfeld GIF while staying faithful to the script is obviously low effort /s”
  3. totally e.g. “No agenda though, totally neutral broadcasting /s”
  4. must e.g. “You must have a great software background /s”
  5. dare (as in “how dare”) e.g. “Yeah, how dare she fact check Romney, that’s the last thing I want someone to do to a politician. /s”

Other high-ranking features are topic-specific: what sports fans call a “scrub,” gamers call a “shitlord,” and if you use sarcasm in a political debate, be sure to end with “amirite?”

You also see meta-features, resulting from discussion of the /s flag. This sort of problem frequently arises when you use a heuristic to obtain labels (as opposed to labelling your training data by hand). Examples include “forgot,” e.g. “You forgot the /s”; “dropped,” e.g. “You dropped this /s”; and “sarcasm,” e.g. “If you want to convey sarcasm just put a /s.”

While some of the results are charming, others serve as a depressing reminder of reddit’s well-documented sexism problem. The gender politics lexicon includes multiple strong indicators of sarcasm: “patriarchy,” “sexist,” “women,” and “privilege” all show up in the top 30. Reddit’s anti-harassment policy was introduced the same month this data was collected. It would be interesting to repeat the experiment on more recent data and see if there has been any change in the number of sarcastic mentions of these words.

Code, full results, and boring technical minutiae

I used TF-IDF feature scaling over a unigram bag-of-words model, and L2-regularized logistic regression. Words appearing fewer than 5 times were ignored and non-alphanumeric characters were stripped. (In retrospect this is a debatable choice—some studies have found that punctuation correlates with irony—and ideally you’d want to do proper tokenization.) The heavy lifting was done with scikit-learn.

My code is available here and a full list of output can be found here. If you have a Kaggle account you can run it yourself and play around with the parameters. Have fun /s.