Paris - France + Germany = Berlin
This bizarre equation is a genuine result from a natural language processing technique that represents words as vectors of numbers. In this post we apply this tool to an analogical reasoning problem, then go on to explore some possible applications.
Words as vectors of numbers
Word “embeddings” are numerical representations of words or phrases. They are created by a machine-learning process that seeks to predict the probability of a given word occurring given the context of other co-occurring words. Thus the vectors embed a representation of the contexts in which a word arises. This means they also encode semantic and syntactic patterns. It works a bit like a form of compression. Instead of words being arbitrarily long sequences of letters, they become fixed-length sequences of decimal numbers.
The vectors themselves are just sequences of numbers. They don’t mean a lot in isolation, much like a single word in a foreign language written in an exotic alphabet. When these vectors are compared, however, they provide a fascinating insight into the relationships between words. This comparison allows us to interpret the vectors in natural language terms (to extend the above metaphor - this is like having a translator for that foreign alphabet).
We can explore this a little with visualisation. It is practically impossible to display vectors with e.g. 300 dimensions graphically in a way the human eye can interpret, but there are techniques available to compress high-dimensional vectors so that we can visualise a more-or-less faithful representation of them in 2-dimensional space. The t-SNE (t-Distributed Stochastic Neighbour Embedding) algorithm is one such approach, designed to preserve the local structure of the data - arranging the points according to how neighbours relate to one another.
This example, from Pennington, Socher and Manning (2014), GloVe: Global Vectors for Word Representation, shows comparisons of the vectors for pairs of words. In the diagram below, each word vector is displayed as a point in the 2d space, and the comparison between each pair of words is displayed as a dotted blue line. The vector differences for the pairs
man - woman,
king - queen, and
brother - sister all seem broadly similar. This suggests there is a common structure that represents the semantic distinction between these two genders.
By combining the vectors of several words, we can calculate new aggregate vectors. These transformations don’t produce vectors that correspond exactly to other words, but we can look for the word with the closest match. The distance can be measured with cosine similarity - a measure of the angle that separates the vectors.
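As a minimal sketch of this arithmetic - using invented 3-dimensional toy vectors rather than real 300-dimensional GloVe embeddings, and hypothetical helper names - the combination and comparison might look like this:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented toy vectors: real GloVe vectors have e.g. 300 dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.1],
    "woman": [0.5, 0.1, 0.9],
}

# king - man + woman: the aggregate doesn't match any word exactly...
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# ...so we look for the word with the closest match among the remaining words.
best = max((word for word in vectors if word not in {"king", "man", "woman"}),
           key=lambda word: cosine_similarity(vectors[word], target))
print(best)  # queen
```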
Ship - Water + Earth = Spacecraft
When we think about a Ship that sails not on Water, but on Earth, we have an idea that’s most closely related to the word Spacecraft.
Another way of describing these “equations” is to use the semantics of analogy: “What relates to Germany as Paris does to France?”. Or in analogy notation:
France : Paris :: Germany : Berlin
These analogies are used to test mental ability. In the US, the Miller Analogies Test is used to assess candidates for postgraduate education. The student is presented with a partial analogy - one in which a term is removed - and then asked to fill in the blank.
Can we use word embeddings to respond to a Miller Analogies Test?
Analogical reasoning with word embeddings
The Miller Analogies test presents a partial analogy:
Shard : Pottery :: ? : Wood
And a set of four options. You need to choose the one which best completes the analogy. In this case,
splinter is the correct answer.
We can think of the analogy A : B :: C : D as working like a mathematical formula: A - B = C - D. This equation may be rearranged to solve for the missing element. That gives us an expression which we can use to calculate a new vector representing the blanked-out term in the analogy. The resulting vector is then compared against the multiple-choice options to find the one that most closely resembles the analogical “idea” that results from the embeddings.
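A sketch of that rearrangement as code, using invented toy vectors and a hypothetical `solve_partial_analogy` helper (real embeddings would come from the GloVe data):

```python
from math import sqrt

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def solve_partial_analogy(a, b, d, options, vectors):
    """For A : B :: ? : D, rearrange A - B = ? - D to ? = A - B + D,
    then pick the multiple-choice option closest to that vector."""
    target = [ai - bi + di
              for ai, bi, di in zip(vectors[a], vectors[b], vectors[d])]
    return max(options, key=lambda w: cosine_similarity(vectors[w], target))

# Invented toy vectors; dimensions loosely read as [fragment, pottery, wood].
vectors = {
    "shard":    [1.0, 1.0, 0.0],
    "pottery":  [0.0, 1.0, 0.0],
    "wood":     [0.0, 0.0, 1.0],
    "splinter": [1.0, 0.0, 1.0],
    "tree":     [0.0, 0.0, 1.0],
    "acorn":    [0.2, 0.0, 0.9],
}

answer = solve_partial_analogy("shard", "pottery", "wood",
                               ["splinter", "tree", "acorn"], vectors)
print(answer)  # splinter
```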
We can visualise this for the splinter example. We take the vector difference shard - pottery (dashed blue line), then add that to wood to yield a vector for the missing term ?. We then choose the nearest option (choices in green).
Note that nearness is measured with cosine similarity, so it’s not exactly the same as the 2d Euclidean distance suggested by the plot (although it ought to be broadly equivalent). Indeed, proximity doesn’t necessarily mean anything in t-SNE visualisations. The plot is presented to help explain how we’re solving the partial analogies. The Distill journal (which provides excellent visual explanations of machine learning) has some insightful guidance on configuring and interpreting t-SNE effectively. In this case the plot is quite contrived - it only includes the 8 terms we’re interested in, and I generated several random variations until I found one that best demonstrated the overall shape that illustrates the idea.
Putting the theory to the test
I implemented this partial-analogy solving algorithm using the Deep Learning for Java library to ingest and analyse the GloVe data and a handful of Clojure functions to tie all the linear algebra together.
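The implementation itself uses DL4J and Clojure, but the GloVe data it ingests is just a plain text file: each line is a word followed by its vector components. A stdlib-only Python sketch of a parser, exercised on a fabricated two-word sample:

```python
import os
import tempfile

def load_glove(path):
    """Parse a GloVe-format text file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# A fabricated 3-dimensional sample; real GloVe files hold 50-300 dimensions
# for hundreds of thousands of words.
sample = "paris 0.1 0.9 0.2\nberlin 0.2 0.8 0.3\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

glove = load_glove(path)
os.remove(path)
print(sorted(glove))   # ['berlin', 'paris']
print(glove["paris"])  # [0.1, 0.9, 0.2]
```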
I then applied the algorithm to some Miller Analogy tests taken from this example set with the kind permission of majortests.com. I had to remove a couple of questions that didn’t use terms present in the vocabulary of the model - e.g. “Sri Lanka” is actually two words and “-ful” etc are suffixes not words.
This approach leads to 9 correct answers out of a possible 13 - a 70% success rate. This is far better than the 25% score we would get from responding at random, and is even better than the 55% average score for humans reported by majortests.com.
Which ones does it get wrong?
An incorrect answer was given to the question which asks about numbers:
17 : 19 :: ? : 37
The algorithm picked 39, while the correct answer was 31 - the question is looking for consecutive pairs of primes. As insightful as the word embeddings are, we wouldn’t expect the incidence of numbers in text (which the embeddings capture) to tell us anything about the mathematical properties of those numbers, primes or otherwise (except perhaps something like Benford’s Law, which describes the relative incidence of numbers!).
These aren’t logical equations. There’s no mathematical reasoning or understanding of physical cause and effect here. Some results from GloVe again:
Five - One + Two = Four
Fire - Oxygen + Water = Fires
Another incorrect answer was given to this question:
? : Puccini :: Sculpture : Opera
The algorithm suggests Petrarch, whereas the correct answer was Cellini. Ultimately the equation here becomes Puccini - Opera + Sculpture. It’s hard to say where things went wrong, but I wonder if Cellini was penalised because there is an opera named after him - leaving Petrarch as the only Italian. Reversing the relation (Cellini - Sculpture + Opera) results in a vector very close to
This leaves the following partial analogy that we might otherwise have expected the word embeddings to solve correctly:
Penury : Money :: Starvation : ?
The algorithm suggests Care is the best fit, whereas we are expecting Sustenance; as explained by majortests.com, penury is the result of having no money, and starvation is the result of having no food. The closest words found by the algorithm (i.e. an open answer, not the options specified in the test) are in the topic area we’d expect:
The Oxford English Dictionary defines two meanings of
Penury: the first is poverty, and the second is scarcity. Similarly
Sustenance can refer to a livelihood as well as food. If we replace Sustenance with Food (and likewise Penury with Poverty), the same algorithm finds the correct answer. This suggests that the multiple meanings are introducing some ambiguity. Curiously, these replacements aren’t in the list of the 10 words closest to the words we’re substituting them for - there is a substantive distinction to be made.
I think this example serves as a useful reminder that this approach is naively mechanistic. The human reader will automatically distinguish the appropriate meaning by re-appraising the context - the context of starvation means the nutrition-interpretation of sustenance comes to mind much more readily. Perhaps the algorithm could be further improved by doing something similar… We can use the idea of “Dog” to distinguish between “Cat as in Pet” and “Cat as in CAT scan”:
Cat - Dog = Tomography
Applications of word-embeddings
As we’ve seen above, the vectors give a numerical value for the similarity of words. A simple use would be to expand search requests using synonyms and associations, or even to automatically generate a thesaurus. This would be much more powerful than a keyword search because a measure of the proximity of meaning is encoded - results could match not just the literal word you searched for but also related terms. You could even make distinctions like “please find documents about cat as in scan, not as in pet”.
Compound vectors built-up from word embeddings can be used to represent larger structures too - like sentences or whole documents. These can be used to define the sentiment of a sentence or topic of a document. They could also help to identify documents that are related to a given topic (or corpus) for grouping or retrieval. You don’t necessarily need to identify the topic by name (or a collection of names) - the topic is identified by an abstract vector of numbers - so you can start with a seed document or corpus and request some suggestions of other related ones.
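One simple way to build such a compound vector - and only one of several - is to average the embeddings of a document’s in-vocabulary words. A toy sketch with invented 2-dimensional vectors:

```python
from math import sqrt

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def document_vector(tokens, vectors):
    """Average the embeddings of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

# Invented 2-d vectors: dimension 0 is loosely "animal", dimension 1 "vehicle".
vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2],
    "car": [0.1, 0.9], "truck": [0.2, 0.8],
}

pets = document_vector("the cat and the dog".split(), vectors)
traffic = document_vector("a car and a truck".split(), vectors)
seed = document_vector(["dog"], vectors)

# A seed document about dogs retrieves the pets document, not the traffic one.
print(cosine_similarity(seed, pets) > cosine_similarity(seed, traffic))  # True
```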
From retrieval we can extend further into discovery. Swanson linking is the idea of discovering new relationships by finding connections between existing (but separate) knowledge. This is not about finding new knowledge, but a process through which we can find lessons from one area and apply them to another.
More curiously this can also be extended to machine translation or the expansion of bilingual dictionaries. To achieve this feat, the embeddings are initialised using existing translations (“word alignments”) and then further trained monolingually (so they capture word context) or bilingually (so they optimise for between-language translational equivalence as well as learning within-language context). Similarities in semantic structures between two languages can become apparent once they are mapped onto the same feature space. This provides insight into new translations and incidentally makes for a novel way to explore alternative interpretations.
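As a toy illustration of mapping two vector spaces together - in the spirit of the linear translation matrix of Mikolov et al., though the words, 2-d vectors and exact-solve shortcut below are all invented:

```python
from math import sqrt

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def fit_linear_map(s1, s2, t1, t2):
    """Find the 2x2 matrix W with W @ s1 = t1 and W @ s2 = t2 (exact 2-d solve;
    real systems fit W by least squares over many translation pairs)."""
    det = s1[0] * s2[1] - s1[1] * s2[0]
    inv = [[ s2[1] / det, -s2[0] / det],   # inverse of the column matrix [s1 s2]
           [-s1[1] / det,  s1[0] / det]]
    # W = [t1 t2] @ inverse([s1 s2])
    return [[t1[0] * inv[0][0] + t2[0] * inv[1][0], t1[0] * inv[0][1] + t2[0] * inv[1][1]],
            [t1[1] * inv[0][0] + t2[1] * inv[1][0], t1[1] * inv[0][1] + t2[1] * inv[1][1]]]

def apply_map(W, v):
    return [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]

english = {"dog": [1.0, 0.0], "cat": [0.8, 0.6], "house": [0.0, 1.0]}
french = {"chien": [0.0, 1.0], "chat": [0.6, 0.8], "maison": [1.0, 0.0]}

# Fit the map from two known translation pairs (the "word alignments")...
W = fit_linear_map(english["dog"], english["cat"], french["chien"], french["chat"])

# ...then translate an unseen word by mapping it into the French space and
# taking the nearest neighbour there.
mapped = apply_map(W, english["house"])
translation = max(french, key=lambda w: cosine_similarity(french[w], mapped))
print(translation)  # maison
```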
Perhaps the most powerful prospect of word embeddings, though, is their potential for application in other algorithms that need a base representation of language. An important task in natural language processing, for example, is named-entity recognition (i.e. being able to spot the names of people, companies or places in documents). It has been shown that word embeddings can be used as a feature (i.e. a way of describing the input text) when training such systems, leading to improved accuracy.
Finally, although it’s not really a practical application, I’ve been quite entertained playing with these vectors. There is a dense network of meaning deep within the numbers. Of course many of the results are prosaic but others are very pleasing. There is only as much imagination present as you put in - the process is entirely mechanical after all - but the outputs can be surprisingly expressive…
life - death + stone = bricks
productivity - economic + social = absenteeism
word - mechanism + creativity = poetry
poetry - words + sound = music
home - house + building = stadium
trust + money - banking = charitable
lifestyle + happy - sad = habits
habit + happy - sad = eating
eating + happy - sad = vegetarian
power - energy + money = bribes