Christmas carol search using TF-IDF and cosine similarity

Dan Cooper AWS

Using Christmas carols to explore how machine learning algorithms work with words rather than numbers.

Have you ever wondered how you could use word vector similarity to search for Christmas carols? No, me neither. But I recently spent some time looking at how machine learning algorithms work with words rather than numbers and thought the following was worth sharing. It starts with TF-IDF.

TF-IDF is short for Term Frequency - Inverse Document Frequency. It provides a numerical measure of how important a term is to one document, relative to the collection of documents it belongs to.

It is one of several techniques used within automated text analysis, machine learning algorithms and search engines. Perhaps unsurprisingly, there is an entire field of research concerned with mapping words and phrases to numbers, or more specifically, mathematical vectors.

The TF-IDF metric for a term within a document is calculated by multiplying two numbers: the term frequency and the inverse document frequency.

Term frequency (TF)

The term frequency is the number of times a term appears in a document. Because any particular term is likely to appear more often in a large document than in a small one, the raw counts are usually normalised. This can be as simple as dividing the term's count by the total number of terms within the document.
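As a rough illustration of that simple normalisation, here is a minimal sketch in plain Python. The function name `term_frequencies` and the whitespace tokenisation are assumptions made for the example, not something a library prescribes.

```python
from collections import Counter


def term_frequencies(text):
    """Return each term's count divided by the total number of terms."""
    # Assumption: lower-casing and splitting on whitespace is good enough here.
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("jingle bells jingle bells jingle all the way")
print(tf["jingle"])  # 3 of the 8 tokens, so 0.375
```

A real tokeniser would also strip punctuation, but the idea is the same: the counts become proportions, so long and short documents are comparable.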

Inverse document frequency (IDF)

The inverse document frequency is based on how many of the documents in a collection contain the term. Using a logarithmic formula, it acts to reduce the weight of terms that occur in most documents (like “the”) and increase the weight of less frequent and potentially more significant terms.
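One common textbook form of that formula is the logarithm of (number of documents / number of documents containing the term). The sketch below assumes that form; libraries often use smoothed variants, so treat the exact numbers as illustrative. The function name and the three toy documents (carol titles standing in for full texts) are made up for the example.

```python
import math


def inverse_document_frequencies(documents):
    """Return log(N / df) for each term, where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a term is counted once per document, however often it repeats.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy collection: three carol titles used as tiny stand-in documents.
docs = ["the holly and the ivy", "deck the halls", "the first noel"]
idf = inverse_document_frequencies(docs)

print(idf["the"])    # appears in every document, so log(3 / 3) = 0.0
print(idf["holly"])  # appears in one document, so log(3 / 1)
```

Multiplying a term's frequency in a document by its IDF gives the TF-IDF value: "the" scores zero everywhere, while "holly" keeps its weight in the one document that mentions it.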

TF-IDF and cosine similarity

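With the TF-IDF values calculated, a vector can be derived for each document. Each vector exists in a vector space with one axis per term, and the document's coordinate on each axis is its TF-IDF value for that term. And now, without too much effort to reach this point, we have a collection of vectors (one for each document) which can be compared against each other, or against some other query vector, using the formula for cosine similarity: the dot product of two vectors divided by the product of their lengths. For TF-IDF vectors the result is close to 1 when two documents use the same terms in similar proportions, and 0 when they share no terms at all.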
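To make the comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The three short vectors are invented purely for illustration (imagine a vocabulary of only three terms); they are not taken from any real carol.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A vector with no terms has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF vectors over a three-term vocabulary.
doc_a = [0.5, 0.0, 0.5]
doc_b = [1.0, 0.0, 1.0]  # same direction as doc_a, just longer
doc_c = [0.0, 1.0, 0.0]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # approximately 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

Because only the angle matters, a short document and a long one that use the same terms in the same proportions come out as near-identical, which is exactly what a search wants.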

Using TF-IDF and cosine similarity to build a simple search

Using a notebook instance on SageMaker and the excellent scikit-learn library for Python, let’s step through how to create a (very) simple search for Christmas carols.
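The steps can be sketched roughly as follows, using scikit-learn's `TfidfVectorizer` and `cosine_similarity`. This is not the downloadable notebook: the column layout of carols.csv is not shown here, so the sketch uses a small hand-written dictionary of opening lines as a stand-in, and the `search` function and its `top_n` parameter are hypothetical names chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in data (an assumption): title -> opening line, instead of carols.csv.
carols = {
    "Silent Night": "silent night holy night all is calm all is bright",
    "Jingle Bells": "jingle bells jingle bells jingle all the way",
    "Deck the Halls": "deck the halls with boughs of holly",
}
titles = list(carols)

# Learn the vocabulary and build one TF-IDF vector per carol.
vectorizer = TfidfVectorizer(stop_words="english")
carol_vectors = vectorizer.fit_transform(list(carols.values()))


def search(query, top_n=3):
    """Return up to top_n (title, score) pairs, best match first."""
    # Project the query into the same vector space as the carols.
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, carol_vectors).ravel()
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    # Drop carols that share no terms with the query.
    return [(title, float(score)) for title, score in ranked[:top_n] if score > 0]


print(search("holy night"))
```

The query is transformed with the same fitted vectoriser as the carols, so both live in the same vector space. A query made up entirely of words the vectoriser has never seen produces an all-zero vector and therefore no results.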

Downloads

Download notebook

Download carols.csv

Dan Cooper

CEO and Principal Consultant, Allies Computing Professional Services

Dan has a proven track record in helping customers leverage their data and technology to gain a commercial edge. During his 20-year career, Dan has worked in roles covering solution design, hands-on development, product management and IT strategy within both SME and enterprise orgs. He is an AWS Certified Solution Architect, a Certified Scrum Master and qualified in business analysis, PRINCE2 and ITIL 4.
