Austin Schwartz

# Developing Heuristics for Evaluating Jeopardy Answers

April 3, 2018

Jeopardy is an American quiz game show where contestants are presented with clues in the form of answers and have to present their responses in the form of questions. The board looks similar to the following, with categories at the top and prize amounts on each clue cell:

When a contestant picks a dollar amount, the clue is revealed and read aloud, and the first to buzz in gets to answer. An example from the category “Ancient Worlds” could be something like “Where all roads were supposed to have led” with the answer of “Rome”.

## Background

Back in 2013, I wrote a simple single-page app for playing Jeopardy. It presents a grid of clues, which you can click on and answer. You can play it at http://jeopardy.austinschwartz.com

Implementing all of this was pretty straightforward. However, grading the user’s input proved to be a headache.

## Problem

For a given clue, let’s assume that the system expects the input “Dan Aykroyd & Eddie Murphy”, but we enter “eddie murphy and dan ackroyd”. If I were grading it by hand, then I’d be inclined to mark it correct, but:

1. None of the words are capitalized
2. I used “and” instead of &
3. I spelled Aykroyd wrong
4. Everything is out of order

Somehow, we need to design a grading system that will mark inputs correct as long as they look mostly acceptable.

## Tests

It might be helpful to start off with a few test cases. We can’t evaluate our algorithm without tests to check against, right?

I built my list of test cases from my own answers and the answers of my friends while testing. Later, once I published the site, I accumulated more answers from other users in order to improve the accuracy.

Note that whether some of these should pass or not is subjective, but I lean more towards accepting an answer even though it’s incorrect (false positives), rather than marking answers wrong when they’re actually right (false negatives).

Here are some example tests from my own playing:

| Expected answer | Given answer |
| --- | --- |
| a shot | shot |
| apples and oranges | oranges and apples |
| keanu (reeves) | keanu |
| booze or ooze | booze |
| dan aykroyd & eddie murphy | eddie murphy and dan ackroyd |
| mammon (money later accepted) | money |
| the woodwork | wood work |
| baby tooth | baby teeth |
| toys “r” us | toys r us |
| kerouac | karuak |

And here are some that I don’t think should pass:

| Expected answer | Given answer |
| --- | --- |
| tin | chromium |
| bosons | electrons |
| john locke | descartes |
| tokyo | shanghai |
| scotland | england |
| espresso | coffee |
| elizabeth i | victoria |

We can go ahead and put our positive and negative test cases into lists and shuffle both. We’ll then need to combine our lists, along with making a new list containing the expected output labels (true/false). This format will help make our code more concise later on.

```python
import re
import numpy as np

positive_tests = [["a shot", "shot"], ["the woodwork", "wood work"], ... ]
np.random.shuffle(positive_tests)

negative_tests = [["cocoa puffs", "cookie crisps"], ["tin", "chromium"], ...]
np.random.shuffle(negative_tests)

def clean_string(x):
    # Strip anything that isn't a letter, number, or whitespace
    return re.sub(r'[^\w\s]', '', x)

def gen_tests(positive_tests, negative_tests):
    X = []
    y = []
    for good_case in positive_tests:
        X.append(list(map(clean_string, good_case)))
        y.append(True)
    for bad_case in negative_tests:
        X.append(list(map(clean_string, bad_case)))
        y.append(False)
    return (np.array(X), np.array(y))

(X, y) = gen_tests(positive_tests, negative_tests)
```


For testing, I have 2930 positive test cases and 1698 negative test cases. Ideally we would have many more, but we can always go back and add more later.

Also, as a first pass, we strip any character that isn’t a letter, number, or a space. We can get additional information if we keep those characters, but for now we just treat them as noise.

### Strcmp

Assume at first that we just want to run a simple string comparison on both strings. For every question, we have a single stored answer in a database. When the user inputs their answer, we compare their string with the stored string character by character, and return true or false depending on whether it matches or not.

We’re taking two strings as input, and outputting a boolean value. In other words, we have a binary classifier.

One easy way of determining how well our binary classifier is doing is to compute the accuracy, or percent of tests that are classified correctly. To compute it, we add the count of true positives and true negatives then divide by the total number of tests.

There are other methods of evaluation, but accuracy is straightforward enough for now.

Given a classifier function $f$, which takes in two string arguments and returns a boolean (pass/fail), we can compute the accuracy with the following code:

```python
def accuracy(X, y, f):
    correct = 0
    for i in range(len(X)):
        a, b = X[i]
        if f(a, b) == y[i]:
            correct += 1
    return float(correct) / len(X)

def print_acc(acc):
    print("Accuracy: {:.2f}%".format(acc * 100))
```


If we pass in a straightforward string compare function:

```python
def strcmp(a, b):
    return a == b

print_acc(accuracy(X, y, strcmp))
```

```
Accuracy: 73.54%
```


We get a surprising 73.54% accuracy simply checking if the user entered the exact answer string. Not bad, but we can do better.

### Stemming

If the user enters “wall” and the answer is “walls”, we should still accept the answer. If the English language were standardized and the plural of every word were formed by just adding an ‘s’, this would be easy. However, English kinda sucks, and we quickly run into edge cases like “calves” and “calf”, “elves” and “elf”, etc.

Along with plural forms, verb tense and other grammatical modifications cause issues as well. What if the answer contained the word “run”, but the user entered “running”? We should probably mark those correct as well.

Thankfully, most of these aren’t too difficult to deal with. We could use any stemming library to reduce input words to their base, i.e. “dogs” would become “dog”, but also “argue”, “argued”, and “arguing” would all reduce to something like “argu”. If we did this to every input word, along with every answer word, then the comparison would be exact.

We can use Python’s NLTK (Natural Language Toolkit) library for this!

```python
>>> from nltk.stem.porter import *
>>> stemmer = PorterStemmer()
>>> stemmer.stem("dogs")
'dog'
>>> stemmer.stem("swimming")
'swim'

>>> stemmer.stem("agoraphobe")
'agoraphob'
>>> stemmer.stem("agoraphobic")
'agoraphob'
```


So if we were comparing “agoraphobe” and “agoraphobic”, but stemmed both first, we would end up passing the test.

```python
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def strcmp_with_stemming(a, b):
    return stemmer.stem(a) == stemmer.stem(b)

print_acc(accuracy(X, y, strcmp_with_stemming))
```

```
Accuracy: 75.01%
```


We’ve bumped our accuracy a little over one percent, which isn’t too bad, but we can still do a lot better.

### Unnecessary words

If the answer is “The Beatles”, and the user gives “Beatles”, we should mark their answer as correct. Same for “a shot” vs “Shot” or “on Thursday” vs “Thursday”. We can use something called a part-of-speech tagger.

A part-of-speech tagger takes a sentence as input and adds a tag to every word. These tags state what “part of speech” each word is, based on its definition but also its context. From there, there are a few things we could do, but an easy solution would be to remove anything that isn’t a noun.

Here’s another example using NLTK:

```python
>>> import nltk
>>> sentence = "those elephants over there in the clearing"
>>> tagged_words = nltk.pos_tag(nltk.word_tokenize(sentence))
>>> tagged_words
[('those', 'DT'), ('elephants', 'NNS'), ('over', 'IN'), ('there', 'RB'), ('in', 'IN'), ('the', 'DT'), ('clearing', 'NN')]
```


Each word returns with a tag attached, with any tags starting with an ‘N’ representing nouns. We could use this to remove any non-noun words:

```python
>>> [word for (word, token) in tagged_words if token[0] == 'N']
['elephants', 'clearing']
```

```python
def strcmp_with_tokenizing(a, b):
    stemmer = PorterStemmer()

    def remove_non_nouns(s):
        tags = nltk.pos_tag(nltk.word_tokenize(s))
        filtered_tags = [tag[0] for tag in tags if tag[1][0] == 'N']
        return ' '.join(filtered_tags)

    a, b = map(remove_non_nouns, (a, b))
    return stemmer.stem(a) == stemmer.stem(b)

print_acc(accuracy(X, y, strcmp_with_tokenizing))
```

```
Accuracy: 82.63%
```


Around 7% more than before!

### Potpourri

Sooner or later, we’ll encounter examples where these techniques break down. Many of these strings are complicated, and we should accept a huge variety of answers. Unfortunately, we can’t exactly predict what the user’s going to give as input.

Any kind of misspelling won’t be helped by stemming, removing certain words, etc. We need a good way to tell the difference between two strings.

## Distance Functions

A distance function, or metric, $d(a, b)$ is a function that takes two strings as input and returns a value indicating the “distance” between them. In our case, we can use this distance to determine the similarity between the user’s input and the stored answer. We can then define a cutoff, so that if the similarity is above some number, we accept the pair as being the same and give the user points.

We’ll look at two types of distance functions: edit-distance based functions, such as Jaro-Winkler distance and Levenshtein distance, and the token-based Sørensen-Dice coefficient.

### Jaro-Winkler

Jaro distance, and by extension, Jaro-Winkler distance, are both distance functions for string similarity.

Jaro distance is based on the number of matching characters between the two strings and the number of single-character transpositions needed to line them up, which is why we refer to it as an edit-distance based function. Jaro-Winkler adds an additional weight, giving more favorable scores to strings where the first few characters match.

For two strings, $a$, and $b$, the Jaro distance function is as follows: $\displaystyle d_j(a, b) = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \big(\frac{m}{|a|} + \frac{m}{|b|} + \frac{m - t}{m} \big) & \text{otherwise } \end{cases}$ where $m$ is the number of matching characters, and $t$ is half the number of transpositions.

Jaro-Winkler takes Jaro distance, and adds some weight if the strings begin with the same prefix (up to 4 characters).

$\displaystyle d_w(a, b) = d_j(a, b) + \ell p(1 - d_j(a, b))$ where $d_j$ is the function for computing Jaro distance, $\ell$ is the length of the common prefix (up to 4 characters), and $p$ is a constant of $0.1$, used as a scaling factor.
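As a quick worked example, take the pair often used to illustrate these metrics, $a = \text{MARTHA}$ and $b = \text{MARHTA}$: there are $m = 6$ matching characters and one transposition ($t = 1$), so $d_j = \frac{1}{3}\big(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\big) \approx 0.944$. The shared prefix is “MAR”, so $\ell = 3$, giving $d_w = 0.944 + 3 \times 0.1 \times (1 - 0.944) \approx 0.961$.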

At the time, Winkler was working on improving duplicate string identification in Census records. He realized that errors were more likely to occur towards the end of a string. If two strings started with the same few characters, it was more likely that they were identical than if they ended with the same characters.

By forgiving some amount of dissimilarity based on how many of the first few characters match, Jaro-Winkler is able to get very good results in practice.

We can use an implementation available in the ‘jellyfish’ python package. We also should cache the results since we’ll be running this function pretty frequently.

```python
import jellyfish
from functools import lru_cache

@lru_cache(maxsize=None)
def jaro_winkler(a, b):
    return jellyfish.jaro_winkler(a, b)
```


Since our distance function takes two strings and returns a value in the $[0, 1]$ range, we’ll need to figure out a threshold $t$ such that any value above $t$ is considered a pass, and any value below is a fail.

We can check for every value $t$ in the range $[0, 1]$, what our accuracy would be if we just accepted the strings when $d_w(a,b) > t$.
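A brute-force version of that sweep might look like the following sketch. The function names here are my own, and `dist` stands in for any similarity function returning values in $[0, 1]$:

```python
import numpy as np

def best_threshold(X, y, dist, steps=101):
    """Try evenly spaced thresholds in [0, 1] and keep the one
    that classifies the most pairs correctly."""
    best_t, best_acc = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, steps):
        # Accept a pair whenever its similarity exceeds the threshold
        preds = [dist(a, b) > t for a, b in X]
        acc = sum(p == label for p, label in zip(preds, y)) / len(y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

This is cheap enough for a single parameter; with more parameters in play, a proper optimizer becomes more attractive.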

### Model training

Previously, we fed test data into a classification function and got back the function’s accuracy. Once we introduce new variables, attempting to find values that make our model fit the data well, we’re bringing training into the mix.

Originally we weren’t doing any sort of training, so it was fine to just calculate the accuracy. It makes sense in this new world to partition our data into two sets: training and testing. We can train our model on the training set, fitting our variables/parameters using that data. When it comes time to figure out our function’s accuracy, we’ll feed the test data into our trained model.

This will lower the chances that we “overfit” our data. If we had one single set of data that we try to find the highest accuracy model for, it doesn’t necessarily mean that our model will generalize to new data. Splitting our data lets us optimize on one set of data (train set), then check our accuracy using the other set of data (test set).

The python ‘sklearn’ package has some nice tools available. There’s a function available for splitting our data into train/test sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
```


This splits our data into train and test sets, with the test set containing 40% of our data, and training set containing 60%.

We can make a class for our classifier model, which takes in a classifier function (in this case Jaro-Winkler), then fits our model by finding good parameters/weights for the given function. We can then use our trained model to find the accuracy for our test data. We could also use k-fold cross validation, or other tools, which are provided by sklearn.

```python
from scipy import optimize
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels

def weighted_f(f, a, b, t, *args):
    return f(a, b, *args) > t

class LinearWeightClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, f, param_count=0):
        self.f = f
        # One extra parameter for the acceptance threshold t
        self.param_count = param_count + 1

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        self.X = X
        self.y = y

        # Negate accuracy so that minimizing opt_f maximizes accuracy
        def opt_f(args):
            return -1 * accuracy(self.X, self.y, lambda a, b: weighted_f(self.f, a, b, *args))

        bounds = [(0, 5)] * self.param_count
        opt = optimize.differential_evolution(opt_f, bounds=bounds, maxiter=2000)
        self.t = list(map(float, opt.x))
        self.weighted_f = lambda a, b: weighted_f(self.f, a, b, *self.t)
        return self

    def predict(self, X):
        check_is_fitted(self, ['X', 'y'])
        y_pred = []
        for x in X:
            y_pred.append(self.weighted_f(*x))
        print("Weight(s): {}".format(self.t))
        return y_pred
```


The fit function here uses a generic optimization routine, `optimize.differential_evolution` from scipy, which attempts to find the global minimum of a given function. We hand it the negation of our accuracy function from earlier (filled in with whatever classification function we pass in), so that minimizing it maximizes accuracy, and we save the resulting weights for later. This runs on the training set of data.

In the predict function, we feed in the test set of data and compute predictions using the weights found during the fit step. It prints out the stored weights; the accuracy then comes from the score method.

```python
clf = LinearWeightClassifier(jaro_winkler).fit(X_train, y_train)
print_acc(clf.score(X_test, y_test))
```

```
Weight(s): [0.6849205041170059]
Accuracy: 92.12%
```


### Levenshtein Distance

Levenshtein distance represents the minimum number of single-character edits required to change one string to another, edits here being insertions, deletions, and substitutions. It’s an edit-distance based metric, same as Jaro-Winkler.

The formula for Levenshtein distance is as follows:

$\displaystyle \text{lev}_{a,b}(i, j) = \begin{cases} \text{max}(i, j) & \text{if min}(i, j) = 0, \\ \text{min}\begin{cases} \text{lev}_{a,b}(i - 1, j) + 1 \\ \text{lev}_{a,b}(i, j - 1) + 1\\ \text{lev}_{a,b}(i - 1, j - 1) + 1_{(a_i \ne b_j)} \end{cases} & \text{otherwise} \end{cases}$

Mathematically, $\text{lev}_{a,b}(|a|, |b|)$ is the Levenshtein distance between the two strings $a$ and $b$ (of lengths $|a|$ and $|b|$ respectively).

Let’s say we have two strings like “cat” and “bats”. We count the substitution of c for b as 1, and the insertion of the s at the end as another 1, for a total Levenshtein distance of 2.
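The recurrence above translates directly into a small dynamic-programming table over string prefixes. Here’s a minimal sketch, purely for illustration (a library implementation will be faster):

```python
def lev(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

lev("cat", "bats")  # 2
```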

We can also normalize the output value to be within a [0, 1] range using the following formula:

$\displaystyle d_{ln}(a, b) = 1 - \frac{\text{lev}_{a,b}(|a|, |b|)}{\text{max}(|a|, |b|)}$

The other algorithms we’ll look at already output in the range of [0, 1], so it’s useful to keep them uniform.

The python package ‘distance’ has an implementation that we can use.

```python
import distance

@lru_cache(maxsize=None)
def levNorm(a, b):
    return 1 - (distance.levenshtein(a, b) / max(len(a), len(b)))
```


We can use our classifier class from earlier:

```python
clf = LinearWeightClassifier(levNorm).fit(X_train, y_train)
print_acc(clf.score(X_test, y_test))
```

```
Weight(s): [0.28736244416686807]
Accuracy: 94.87%
```


Slightly better than only using Jaro-Winkler, but still not ideal.

### Sørensen–Dice

The Sørensen–Dice index, aka Sørensen or Dice, was published independently by Thorvald Sørensen and Lee Raymond Dice, and was initially used to determine the similarity of plant species.

In the context of string comparison, Dice compares the number of matching adjacent character pairs, also referred to as bigrams, between two strings. The good thing about Dice is that these character pairs can be anywhere in the string! So if we were comparing “apples and oranges” with “oranges and apples”, we’d get a similarity score of 1, indicating a perfect match between the two strings!

The formula for Sørensen–Dice is a bit different from the previous two:

$\displaystyle d_s(a, b) = \frac{2|B_a \cap B_b|}{|B_a| + |B_b|}$ where $B_a$ is the multiset of bigrams in $a$, $B_b$ is the multiset of bigrams in $b$, $|B_a \cap B_b|$ is the number of bigrams shared between the two strings, and $|B_a|$ and $|B_b|$ are the numbers of bigrams in each string.
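As a small worked example, compare “night” and “nacht”: the bigrams of the first are $\{\text{ni}, \text{ig}, \text{gh}, \text{ht}\}$ and of the second $\{\text{na}, \text{ac}, \text{ch}, \text{ht}\}$. Only “ht” is shared, so with four bigrams on each side, $d_s = \frac{2 \cdot 1}{4 + 4} = 0.25$.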

Here’s a cute visual using the dan aykroyd example from earlier, courtesy of this website:

In the visual, every pair of matching bigrams has the same color. Even though the words in our string are out of order, they’re still counting as matched! The remaining bigrams aren’t colored in, as they don’t match up between the two strings.

I couldn’t find a library with a good implementation, but this site has one we can use:

```python
@lru_cache(maxsize=None)
def dice(a, b):
    if not len(a) or not len(b):
        return 0.0
    if a == b:
        return 1.0
    if len(a) == 1 or len(b) == 1:
        return 0.0

    a_bigram_list = [a[i:i + 2] for i in range(len(a) - 1)]
    b_bigram_list = [b[i:i + 2] for i in range(len(b) - 1)]

    a_bigram_list.sort()
    b_bigram_list.sort()

    lena = len(a_bigram_list)
    lenb = len(b_bigram_list)
    matches = i = j = 0
    while i < lena and j < lenb:
        if a_bigram_list[i] == b_bigram_list[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigram_list[i] < b_bigram_list[j]:
            i += 1
        else:
            j += 1

    score = float(matches) / float(lena + lenb)
    return score

dice("dan aykroyd & eddie murphy", "eddie murphy and dan ackroyd")
```

```
0.7692307692307693
```


Running this with our LinearWeightClassifier:

```python
clf = LinearWeightClassifier(dice).fit(X_train, y_train)
print_acc(clf.score(X_test, y_test))
```

```
Weight(s): [0.34486629479279207]
Accuracy: 97.46%
```


97.46%! That’s pretty high compared to using either of the other two functions.

## Testing

You can see some of these tests visualized on this handy-dandy slope graph:

Each of the 3 columns represents a distance function. The values represent the result of those distance functions, and we have the two strings we’re comparing on the left and right side of the graph. Note that the values have all been multiplied by 100.

The slope graph presents an interesting view of the differences between the three distance functions.

For example, “keanu (reeves)” -> “keanu” scored very highly on Jaro-Winkler, likely due to so many adjacent characters being the same (especially at the beginning), but it bombed on Levenshtein and Dice, likely due to it having so many extra characters.

Each distance function has its own strengths and weaknesses, and it might be good to see if we can combine them in some way, extracting the best of each.

## Multiple distance functions

We can introduce a new distance function, one that combines Jaro-Winkler, Levenshtein, and Dice.

$\displaystyle d(a, b) = w_1 \cdot d_w(a, b) + w_2 \cdot d_{ln}(a, b) + w_3 \cdot d_s(a, b)$

If $d(a, b)$ is above some threshold, then we accept the answer. Then we just need to find suitable weights!

```python
def linear_combination_f(a, b, w1, w2, w3):
    return w1 * jaro_winkler(a, b) + w2 * levNorm(a, b) + w3 * dice(a, b)

clf = LinearWeightClassifier(linear_combination_f, 3).fit(X_train, y_train)
print_acc(clf.score(X_test, y_test))
```

```
Weight(s): [2.3934858600956623, 0.3159910494971623, 2.3338070149206827, 4.105535998508461]
Accuracy: 97.52%
```


Surprisingly, we only got a small fraction of a percent higher than using Dice on its own!

## Further Work

The algorithm we’ve developed is pretty generic, but we have domain-specific information we can use to our advantage.

- A lot of these answers are “x or y or z”. We could easily split on the or’s and try each combination.
- Some answers contain “x (y later accepted)”, which we could easily parse and split on.
- Symbols like “&” are currently stripped, but we could replace them with “and”. Other symbols like parentheses give us information as well.
- Some answers like “Central Intelligence Agency” could be answered with the acronym, e.g. “CIA”. It would be very difficult for our current classifier to learn this, but we could maybe import acronym information from elsewhere.
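The first two ideas are easy to sketch. The helper below is hypothetical (the name and the exact “(y later accepted)” pattern are my own assumptions about the answer format), but it shows the general shape:

```python
import re

def expand_alternatives(answer):
    """Expand a stored answer into the list of acceptable variants.
    Hypothetical helper: handles "x (y later accepted)" and "x or y"."""
    variants = [answer]
    # "x (y later accepted)" -> accept both x and y
    paren = re.search(r'^(.*?)\s*\((.*?) later accepted\)', answer)
    if paren:
        variants = [paren.group(1), paren.group(2)]
    # "x or y or z" -> accept each alternative on its own
    out = []
    for v in variants:
        out.extend(part.strip() for part in v.split(' or '))
    return out

expand_alternatives("booze or ooze")                  # ['booze', 'ooze']
expand_alternatives("mammon (money later accepted)")  # ['mammon', 'money']
```

The grader would then run the distance function against every variant and accept the best score.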

We could also try other binary classifiers (SVM, random forests, neural nets, etc), along with other distance functions!

Here is the Jupyter Notebook with all of this code: Link