Christmas carol search using TF-IDF and cosine similarity
Using Christmas carols to explore how machine learning algorithms work with words rather than numbers.
Have you ever wondered how you could use word vector similarity to search for Christmas carols? No, me neither. But I recently spent some time looking at how machine learning algorithms work with words rather than numbers and thought the following was worth sharing. It starts with TF-IDF.
TF-IDF is short for Term Frequency - Inverse Document Frequency. It provides a numerical measure of how important a term is to one document, relative to the collection of documents it belongs to.
It is one of several techniques used within automated text analysis, machine learning algorithms and search engines. Perhaps unsurprisingly, there is an entire field of research concerned with mapping words and phrases to numbers, or more specifically, mathematical vectors.
The TF-IDF metric for a term within a document is calculated by multiplying two numbers: the term frequency and the inverse document frequency.
Term frequency (TF)
The term frequency is the number of times a term appears in a document. Because any particular term is likely to appear more often in a large document than in a small one, the counts are usually normalised. This can be as simple as dividing the term's count by the total number of terms in the document.
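As a rough illustration of that normalisation, here is a minimal sketch in plain Python. The function name `term_frequencies` and the whitespace tokenisation are my own assumptions for the example; real tokenisers handle punctuation and much more.

```python
from collections import Counter


def term_frequencies(document: str) -> dict[str, float]:
    """Return each term's count divided by the total number of terms."""
    # Assumption: lower-casing and splitting on whitespace is good enough here.
    terms = document.lower().split()
    counts = Counter(terms)
    total = len(terms)
    # An empty document has no terms, so the comprehension yields an empty dict.
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("silent night holy night")
print(tf["night"])   # 2 of the 4 terms, so 0.5
print(tf["silent"])  # 1 of the 4 terms, so 0.25
```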
Inverse document frequency (IDF)
The inverse document frequency is based on how many documents in the collection contain the term. Using a logarithmic formula, it acts to reduce the weight of terms that appear in most documents (like “the”) and increase the weight of rarer and potentially more significant terms.
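One common textbook form is the logarithm of the number of documents divided by the number of documents containing the term. The sketch below assumes that unsmoothed form; libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name and the three sample titles are just for illustration.

```python
import math


def inverse_document_frequencies(documents: list[str]) -> dict[str, float]:
    """Return log(N / df) for every term, where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq: dict[str, int] = {}
    for document in documents:
        # Use a set so a term counts once per document, however often it repeats.
        for term in set(document.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


carols = ["the first noel", "deck the halls", "joy to the world"]
idf = inverse_document_frequencies(carols)
print(idf["the"])   # appears in all 3 documents: log(3 / 3) = 0.0
print(idf["noel"])  # appears in 1 document: log(3 / 1)
```

A term that appears in every document ends up with a weight of zero, which is exactly the damping effect described above.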
TF-IDF and cosine similarity
With the TF-IDF values calculated, each document can be represented as a vector in a vector space that has one axis for each term. And now, without too much effort to reach this point, we have a collection of vectors (one for each document) which can be compared against each other, or against a query vector, using cosine similarity: the cosine of the angle between two vectors.
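For the curious, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch in plain Python; the function name and the toy vectors are made up for the example, and returning 0.0 for an all-zero vector is my own choice rather than part of the formula.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score (approximately) 1, whatever their length.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
# Vectors with no terms in common score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends on direction rather than length, a short query can still match a long document well if they emphasise the same terms.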
Using TF-IDF and cosine similarity to build a simple search
Using a notebook instance on SageMaker and the excellent scikit-learn library for Python, let’s step through how to create a (very) simple search for Christmas carols.