Hacking together a search engine and getting it to improve as people use it

Full text search has lots of applications, from customized website search engines to enterprise knowledge management systems. There are very elaborate tools (e.g. Elasticsearch) that will allow you to set up a full text search index for pretty much anything. As soon as people start using your search, you start getting feedback: information about which search results get clicked on and which get skipped. Unfortunately, a lot of times this information isn’t used.

In this post we are going to build a toy search engine in Python, using nltk and scikit-learn to:

  1. Give you a basic understanding of how classic full text search works
  2. Propose a simple way to utilise user feedback to improve search results

You don’t need any knowledge of machine learning to understand this post, only Python and basic math notation.

Classic search

Suppose we have a collection of documents, for example, different web pages:

docs = [
    '''About us. We deliver Artificial Intelligence & Machine Learning
       solutions to solve business challenges.''',
    '''Contact information. Email [martin davtyan at filament dot ai]
       if you have any questions''',
    '''Filament Chat. A framework for building and maintaining a scalable
       chatbot capability''',
]

Consider a user who wants to find contact information and types in a query into a search box: query = 'contacts'

Just like with Google, our job is to come back with a set of documents, sorted by their relevance to the user’s query. We aim to have the most relevant results on top of the page. For that, we would like to come up with a number, a score, to quantify how relevant a particular document is to a given query.

Once we figure out how to compute these scores, our search can work like that:

  1. For a given query, compute scores for all documents in the collection
  2. Sort all documents according to their scores
  3. Return N documents with a highest score

Tokenization and stemming

If we want to quantify how a particular document is relevant to a given query, a simple idea would be to count a number of overlapping terms. Splitting a text into words (while removing punctuation, etc.) is called tokenization and it’s very easily done with NLTK:

import string
import nltk
from nltk.tokenize import TreebankWordTokenizer
​
REMOVE_PUNCTUATION_TABLE = str.maketrans({x: None for x in string.punctuation})
TOKENIZER = TreebankWordTokenizer()
​
example_doc = docs[1]
example_doc_tokenized = TOKENIZER.tokenize(
        example_doc.translate(REMOVE_PUNCTUATION_TABLE)
        )
example_doc_tokenized

['Contact',
 'information',
 'Email',
 'martin',
 'davtyan',
 'at',
 'filament',
 'dot',
 'ai',
 'if',
 'you',
 'have',
 'any',
 'questions']

We are still not going to find any overlapping words between the document and the query, because “contacts” is not capitalized and is in plural form. To solve this issue, we use stemming to strip words of their plural forms, formatting, etc.:

from nltk.stem.porter import PorterStemmer

STEMMER = PorterStemmer()
example_doc_tokenized_and_stemmed = [STEMMER.stem(token) for token
                                     in example_doc_tokenized]
example_doc_tokenized_and_stemmed

['contact',
 'inform',
 'email',
 'martin',
 'davtyan',
 'at',
 'filament',
 'dot',
 'ai',
 'if',
 'you',
 'have',
 'ani',
 'question']

We also transform a user query into a list of terms:

def tokenize_and_stem(s):
    return [STEMMER.stem(t) for t 
            in TOKENIZER.tokenize(s.translate(REMOVE_PUNCTUATION_TABLE))]

tokenize_and_stem(query)

['contact']

Great, if our score is now a number of overlapping terms between a query and a document, docs[1] scores 1 while the other two score 0. We can now sort documents based on that score and show to our user. Our first search system is ready.

Vector Space Model

One problem with our first approach is that all terms count the same towards a relevance score. For example, a query-document pair (“a cat”, “this is a page about dogs”) will have the same score as a pair (“a cat”, “my favourite cats”), as both of them have exactly one overlapping term (“a” for the first pair, “cat” for the second). Intuitively, if “cat” is a rare term in our document collection, any document which does mention it should score high for this query. On the other hand, “a” is so frequently seen that it shouldn’t matter much if it is metioned in a query or not: queries “cat” and “a cat” should return the same results.

Of course, to address this issue we could just have a list of so-called stop words — words we want to exclude from consideration as they are too popular. A more elaborate approach would be to weight counts of terms by their popularity.

To do this, it is useful to think of texts not as lists of terms, but rather as numerical vectors in a common vector space. In a field of Information Retrieval, this view is called a vector space model.

A simple way to map each text (document or query) to a vector in the same space is:

  1. Extract a set T of all unique terms mentioned in the collection of documents — this is called a vocabulary
  2. Sort the vocabulary alphabetically, so that each unique term t T has an index
  3. Map each document to a vector v , where v[i] is how many times term t[i] has been used in the document — so called term frequency (tf)

This representation is called bag of words.

To make “cat” count more because it’s less popular, we can multiply the tfcomponents of the vector corresponding to different terms with the measure of rarity, called inverse document frequency:

This way to represent documents as vectos is called tf-idf and can be easily computed with scikit-learn Vectorizer class. Let us use it to score documents.

Scoring

We first gather the vocabulary across the entire collection of documents by calling the fit method:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english')
vectorizer.fit(docs)

Every term now is mapped to a index in a vector we are going to be outputting:

vectorizer.vocabulary_

{'ai': 0,
 'ani': 1,
 'artifici': 2,
 'build': 3,
 'busi': 4,
 'capabl': 5,
 'challeng': 6,
 'chat': 7,
 'chatbot': 8,
 'contact': 9,
 'davtyan': 10,
 'deliv': 11,
 'dot': 12,
 'email': 13,
 'filament': 14,
 'framework': 15,
 'inform': 16,
 'intellig': 17,
 'learn': 18,
 'machin': 19,
 'maintain': 20,
 'martin': 21,
 'question': 22,
 'scalabl': 23,
 'solut': 24,
 'solv': 25}

Now we can transform texts into vectors, e. g. queries:

query = 'contact email to chat to martin'
query_vector = vectorizer.transform([query]).todense()
query_vector

matrix([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0.5, 0. , 0. , 0. ,
         0.5, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. ]])

As each document and query is now respresented as a vector, we can now say the score is a similarity of these two representations, for example, cosine similarity — cosine of an angle between two vectors. If two vectors in this (|T|-dimensional) space have similar directions, that means they have similar terms mentioned in them.

Let’s calculate cosine distance between the query vector and all vectors for all documents we have in our collection:

from sklearn.metrics.pairwise import cosine_similarity

doc_vectors = vectorizer.transform(docs)
similarity = cosine_similarity(query_vector, doc_vectors)
similarity

array([[0.        , 0.48466849, 0.18162735]])

These are our scores! Intuitively, similarity between the query and docs[0] is zero as there is no term overlap at all.

We can now order (rank) documents by their scores (from high to low):

ranks = (-similarity).argsort(axis=None)
ranks

array([1, 2, 0])

most_relevant_doc = docs[ranks[0]]
most_relevant_doc

'Contact information. Email [martin davtyan at filament dot ai] if you have any questions'

Success, we hacked together a search engine based on a tf-idf vectorizer: we extract a vocabulary from a collection of documents and, when a user comes with a query, return a document with the closest (in terms of cosine similarity) vector to the query. We could, of course, easily return top 10 documents as well.

Features

Our simple system only uses one bit of information to score documents for a query: similarity of tf-idf vectors. If we want to improve the search, we may take other things into consideration, i.e. sizes of documents, how close seach terms are in the text, how new the document is, etc. As long as all these things can be expressed as floating point numbers, we can just calculate the score as a weighted sum of these different features:

Weights w[i] allow us to control how relatively important features f[i] are. For example, if we think that bigger documents are going to be of more interest to users, we can add document size as a feature to our system with a small weight (compared to that of tf-idf similarity) and look at feedback to see if it improves results.

Improving search with feedback

Let’s imagine people started using our machine learning system and we have received some feedback on which results turned out relevant and which not:

feedback = {
        'who makes chatbots': [(2, 0.), (0, 1.), (1, 1.), (0, 1.)],
        'about page': [(0, 1.)]
}

In this dict keys are queries and values are (document index, feedback value) pairs, where 0. means irrelevant and 1. means relevant. Let’s see what our current search system would return for the given query:

similarity = cosine_similarity(vectorizer.transform(['who makes chatbots']), doc_vectors)
ranks = (-similarity).argsort(axis=None)
docs[ranks[0]]

'Filament Chat. A framework for building and maintaining a scalable chatbot capability'

So, our system returns a chatbot-related page as it mentions the right term, but people mostly want to see About us page, judging by the feedback. How do we incorporate this information in our results?

Machine-learned scoring

A widespread approach to improve search performance with feedback is using machine learning to optimise feature weights w[i] from the score formula. For example, imagine we only have two features and a bunch of examples of relevant (R) and non-relevant (N) documents from feedback. Feature weights will then define a direction of a separating line, where the score above a given threshold would capture most relevant items:

Given examples of relevant (R) and non-relevant (N) responses, you could train a classifier to separate the two classes

When stated this way, it is a classification problem: we are looking to classify documents into relevant and irrelevant based on their features. In a more complicated formulation, we want to not only estimate if documents are relevant, but rank them by relevance like in the above example. This could be viewed as a problemn of ordinal regression. For web search, a popular learning to rank algorithm is a so-called Ranking SVM.

One problem with using ML in this case is that even the most basic methods (e.g. Naive Bayes to classify documents into relevant and irrelevant) require quite a lot of data. If you just started gathering feedback or you don’t have a lot of users to learn from, they might not work. In the last bit of post we show a very simplistic approach to use feedback without using much data.

Nearest neighbour feedback as a feature

If you are given a particular query (e.g. 'who makes chatbots'), an intuitive thing to do when you have feedback data is to look if this query was asked before. If it did, we would have some data on which documents appeared relevant or irrelevant to other users before. We would then score relevant documents higher and irrelevant documents lower.

To combine this logic with our previous approach of scoring based on tf-idf similarity, let us define feedback features: positive and negative. Positive feedback feature would equal the number of times a given document appeared relevant for a given query. Negative feedback feature would be the number of times a document appeared irrelevant. Both features would be 0 if we don’t have feedback for a given query-document pair, so it doesn’t impact our score.

But what if the exact given query has not been asked before? For example, we need to score documents for a query 'who is making chatbots information'. We could find the nearest neighbor query – one closest (in terms of tf-idf, as always) query we do have feedback for. Then, we could weight our two feedback features by the similarity. This way, if we find a query which is very similar to the given one, feedback features impact the overall score a lot, but otherwise the features are close to zero.

For example, consider the positive feedback feature. For query q let us find a closest feedback query q’ , i.e. the query with the highest cosine similarity ρ(q, q’). For query q’, we have seen a set of documents {d[j]} with positive judgments.

We give positive feature values to documents d[1] and d[2]— they appeared relevant (R) for a query similar to q

Let us score each document with a proportion of postive feedback they received for q’. Then, let’s weigh all the scores by the similarity between the original query q and the nearest neighbour query q’:

Let’s quickly implement the positive feedback feature it and see it works. For a given query, find a nearest neighbor:

import numpy as np

query = 'who is making chatbots information'
feedback_queries = list(feedback.keys())

similarity = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(feedback_queries))
similarity

array([[0.70710678, 0.        ]])

max_idx = np.argmax(similarity)
feedback_queries[max_idx]

'who makes chatbots'

Look at all documents that received positive feedback

pos_feedback_doc_idx = [idx for idx, feedback_value 
                        in feedback[feedback_queries[max_idx]] 
                        if feedback_value == 1.]
pos_feedback_doc_idx

[0, 1, 0]

What is the proportion of positive feedback for different documents?

from collections import Counter

counts = Counter(pos_feedback_doc_idx)
counts

Counter({0: 2, 1: 1})

pos_feedback_proportions = {
        doc_idx: count / sum(counts.values()) for doc_idx, count in counts.items()
}
pos_feedback_proportions

{0: 0.6666666666666666, 1: 0.3333333333333333}

Finally, let’s scale with a similarity between the original query and nearest neighbor query, and output a feature vector:

nn_similarity = np.max(similarity)
pos_feedback_feature = [nn_similarity * pos_feedback_proportions.get(idx, 0.) 
                        for idx, _ in enumerate(docs)]
pos_feedback_feature

[0.4714045207910317, 0.23570226039551584, 0.0]

That is our feature vector! Positive feedback feature has the biggest value for the document most people thought was relevant for the nearest neighbour query. Makes sense.

Scorer

To combine the positive feedback feature with the tf-idf feature we developed before, let’s summarize our code in a Scorer class that can take a collection of documents and improve scoring based on positive feedback.

class Scorer():
    """ Scores documents for a search query based on tf-idf
        similarity and relevance feedback
        
    """
    def __init__(self, docs):
        """ Initialize a scorer with a collection of documents, fit a 
            vectorizer and list feature functions
        
        """
        self.docs = docs
        
        self.vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, 
                                          stop_words='english')
        self.doc_tfidf = self.vectorizer.fit_transform(docs)
        
        self.features = [
            self._feature_tfidf,
            self._feature_positive_feedback,
        ]
        self.feature_weights = [
            1.,
            2.,
        ]
        
        self.feedback = {}
        
    def score(self, query):
        """ Generic scoring function: for a query output a numpy array
            of scores aligned with a document list we initialized the
            scorer with
        
        """
        feature_vectors = [feature(query) for feature 
                           in self.features]
        
        feature_vectors_weighted = [feature * weight for feature, weight
                                    in zip(feature_vectors, self.feature_weights)]
        return np.sum(feature_vectors_weighted, axis=0)
    
    def learn_feedback(self, feedback_dict):
        """ Learn feedback in a form of `query` -> (doc index, feedback value).
            In real life it would be an incremental procedure updating the
            feedback object.
        
        """
        self.feedback = feedback_dict
        
    def _feature_tfidf(self, query):
        """ TF-IDF feature. Return a numpy array of cosine similarities
            between TF-IDF vectors of documents and the query
        
        """
        query_vector = vectorizer.transform([query])
        similarity = cosine_similarity(query_vector, self.doc_tfidf)
        return similarity.ravel()
    
    def _feature_positive_feedback(self, query):
        """ Positive feedback feature. Search the feedback dict for a query
            similar to the given one, then assign documents positive values
            if there is positive feedback about them.
        
        """
        if not self.feedback:
            return np.zeros(len(self.docs))
        
        feedback_queries = list(self.feedback.keys())
        similarity = cosine_similarity(self.vectorizer.transform([query]),
                                       self.vectorizer.transform(feedback_queries))
        nn_similarity = np.max(similarity)
        
        nn_idx = np.argmax(similarity)
        pos_feedback_doc_idx = [idx for idx, feedback_value in
                                self.feedback[feedback_queries[nn_idx]]
                                if feedback_value == 1.]
        
        feature_values = {
                doc_idx: nn_similarity * count / sum(counts.values()) 
                for doc_idx, count in Counter(pos_feedback_doc_idx).items()
        }
        return np.array([feature_values.get(doc_idx, 0.) 
                         for doc_idx, _ in enumerate(self.docs)])

Let’s see if it produces scores we expect before we show it any feedback:

scorer = Scorer(docs)
query

'who is making chatbots information'

scorer.score(query)

array([0.        , 0.22847492, 0.25685987])

Indeed, ‘chatbots’ and ‘information’ match the two last documents in the collection. Based on these scores we would return the last document:

docs[scorer.score(query).argmax()]

'Filament Chat. A framework for building and maintaining a scalable chatbot capability'

Now let’s give feedback to the scorer and see if changes the scores:

scorer.learn_feedback(feedback)
scorer.score(query)

array([0.94280904, 0.69987944, 0.25685987])

docs[scorer.score(query).argmax()]

'About us. We deliver Artificial Intelligence & Machine Learning solutions to solve business challenges.'

As expected, we now score the first document the highest, as it received some positive feedback. Note that our feature weights attribute larger importance to feedback, this is why the resulting scores changed so much. If valued the tf-idf feature more, scoring would change:

scorer.feature_weights = [0.6, 0.4]
scorer.score(query)

array([0.18856181, 0.23136585, 0.15411592])

docs[scorer.score(query).argmax()]

'Contact information. Email [martin davtyan at filament dot ai] if you have any questions'

Summary

In this post we showed how to hack together a simple text search engine that:

  1. Works based on document contents
  2. Learns from feedback

Obviously, for the sake of simplicity we excluded a lot of details that could improve the quality but, as you can see, to quickly come up with something that works is easy if you have a basic idea 🙂

If you have any questions about the material or you want to chat your NLP/ML project, feel free to email me — martin.davtyan@filament.ai.

Thanks for reading!