Search Engineering 101

Search is about providing users with information they find useful, for a wide variety of reasons. It can be something as simple as the contact details of a restaurant someone needs to make a booking, or something as life-defining as exploring new career opportunities. Setting up a system that does this well, giving users what they want, is a matter of search engineering and search quality. This field of work is as much an intellectual pursuit as it is a technological one.


Often, people think about building a search system purely from a technical perspective. This, however, provides for only half of what is actually required to engineer a user-centric search system. In this post, I will describe the steps taken to set up a non-production-ready search system for the purpose of demonstrating Continuous Measurable Improvement for search quality. As discussed in Relevance, Like Beauty Lies In The Eyes of The Beholder, different users are after different things from a search system. Tracking, measurement, monitoring and evaluation form a big part of setting up and maintaining a search system, in order to learn more about the things that matter to the users and to use that understanding to tweak the system to serve them better. These are described in Steps 7-10, which are covered in the second half of the post. We begin with Steps 1-6 by looking at the foundation work required to get a search system up and running. For this exercise, I use Elasticsearch, MySQL, PHP and Perl as my tools of choice to build a basic search system for job ads crawled from a free-to-post website.

Step 1 – Prepare the content

The first step is to get the content ready and available for indexing by Elasticsearch. I quickly put together a web crawler using Perl and the LWP and JSON modules. The crawler works in two stages: crawling and localising the HTML content, then extracting the pieces of information of interest and storing them in a database. For the purpose of this exercise, I picked a free job posting site called postjobfree.com. I read through the terms of use to ensure that the site does not prohibit crawlers from downloading the job ads and that they can be used for non-commercial, demo purposes. I pointed the crawler at the site and downloaded 3,090 job postings between 4-5 October 2016. The extractor then parsed the HTML and picked up five main attributes: doctitle, doctext, doccompany, doclocation and docdate. These values are stored in a MySQL database. A unique identifier is generated for each document by hashing the full HTML file. At this stage, we have some content for the next step.

Step 2 – Set up an Elasticsearch instance

Next, I downloaded Elasticsearch v2.4.1 and unzipped the files into a folder. As I am on a Windows machine, I ran elasticsearch.bat in the bin folder. I also made sure I had the right version of the JDK installed, which is 1.8, with the environment variable JAVA_HOME configured to point to the JDK root folder. If you need an interface to manage the index, you can try ElasticHQ. The source can be downloaded and run as a web app. You first have to configure the elasticsearch.yml file with the following settings: http.cors.enabled: true, http.cors.allow-origin: "*" and node.master: true.
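For reference, these entries are plain YAML key-value pairs added to the elasticsearch.yml file in the config folder; a minimal sketch for this local, single-node setup:

# config/elasticsearch.yml - additions for ElasticHQ (Elasticsearch 2.x)
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true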

There are two main parts in the JSON object for creating an index: the settings and the mappings of the properties for a type. As I am developing locally, the Elasticsearch host resides at http://127.0.0.1 where the default port for the endpoints is 9200. I created an index using the PUT method as shown below with the index name jobs. As part of that, I defined the type job with five properties covering the title, text, date, company and location of the documents.

PUT /jobs
{
      "settings": {
         ...
      },
      "mappings": {
         "job": {
            "properties": {
               "docdate": {
                  "type": "date",
                  "format": "yyyy-MM-dd"
               },
               "doctitle": {
                  "type": "string"
               },
               "doctext": {
                  "type": "string"
               },
               "doccompany": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "doclocation": {
                  "type": "string"
               }
            }
         }
      }
}

If you do not want a field to be searchable, set "index": "no". The default setting for fields of type string is to apply what Elasticsearch refers to as an analyzer. What an analyzer does is perform basic morphological and lexical transformations on the text to improve searchability. Things such as tokenising and stemming are done here, and a wide range of languages is supported. If you do not want the text in a field to be analysed, you can set "index": "not_analyzed". As for properties of type date, you can define the format that suits your data; more details are in the Elasticsearch documentation.
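As an illustration, a hypothetical docurl field that should be stored and returned but never matched against could be mapped like this:

"docurl": {
   "type": "string",
   "index": "no"
}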

I find it useful to have a console for interacting with the REST API of Elasticsearch. It is called Sense, and you can either use it as a Chrome plugin or install it locally as a standalone app. There is also a wide range of clients that you can use to programmatically interact with Elasticsearch.

Step 3 – Write a feeder to index documents

At this stage, we have an Elasticsearch instance with an empty index and a database of content. I created a feeder to send content off to the index. Documents are indexed using the POST method to the endpoint http://127.0.0.1:9200. Each document requires a unique identifier, which can be provided explicitly or will be generated if not provided. There are meta-fields or properties which are created by default, one of which is _id. This field will be populated with the unique identifier if it is provided, so there is no need to create a separate field just to hold unique identifiers for the documents in the index. Below is an example call to the endpoint with reference to the index jobs, the type job and an explicit identifier fb4b50b8fdf075db3ce859038d88fe7d for the document.

POST /jobs/job/fb4b50b8fdf075db3ce859038d88fe7d
{
  "docdate": "2016-10-04",
  "doctext": "Summit Management is currently seeking a highly..."
  "doclocation": "New Britain, Connecticut, United States",
  "doccompany": "Summit Management Corp.",
  "doctitle": "Firearm Industry Sales Representative"
}

As for the feeder, I wrote a PHP script that reads from the MySQL table of structured job content. The values are organised into JSON format as above, which is then sent to Elasticsearch. I coded in the ability to recognise documents which have already been indexed so that the next run of the feeder will not attempt to push the same content to the index again. This is to allow incremental addition of documents to the index.
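The feeder is not reproduced in full here, but a minimal sketch of its core loop might look like the following, assuming a jobs table with an indexed flag column to support the incremental behaviour described above (the table and column names are assumptions):

<?php
// feeder.php - push unindexed job records from MySQL to Elasticsearch
$db = new PDO('mysql:host=127.0.0.1;dbname=jobs', 'user', 'pass'); // credentials are placeholders

// Only fetch documents that have not been sent to the index yet
$rows = $db->query("SELECT id, doctitle, doctext, doccompany, doclocation, docdate
                    FROM jobs WHERE indexed = 0");

foreach ($rows as $row) {
    $id  = $row['id'];  // the hash of the original HTML file
    $doc = json_encode([
        'doctitle'    => $row['doctitle'],
        'doctext'     => $row['doctext'],
        'doccompany'  => $row['doccompany'],
        'doclocation' => $row['doclocation'],
        'docdate'     => $row['docdate'],
    ]);

    // POST the document to /jobs/job/{id} so _id is set explicitly
    $ch = curl_init("http://127.0.0.1:9200/jobs/job/$id");
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    // Mark as indexed so the next run skips this document
    $db->prepare("UPDATE jobs SET indexed = 1 WHERE id = ?")->execute([$id]);
}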

Step 4 – Write an API to interact with Elasticsearch for keyword search

Next, I created a PHP back-end called search.php which receives a number of parameters from a yet-to-be-created UI and constructs the JSON query for Elasticsearch. The parameters that I work with initially are keywords and location. This is where I experimented with the various ways of constructing the JSON query for keyword search in Elasticsearch. The queries are sent to the endpoint http://127.0.0.1:9200/jobs/job as such.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "match": {
      "doctitle": "project manager"
    }
  }
}
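To give an idea of the plumbing, here is a minimal sketch of how search.php might build and send such a query using PHP's cURL extension; the parameter names are illustrative and error handling is omitted:

<?php
// search.php - turn UI parameters into an Elasticsearch keyword query
$keywords = isset($_GET['keywords']) ? $_GET['keywords'] : '';
$page     = isset($_GET['page']) ? (int)$_GET['page'] : 1;

$query = json_encode([
    'from'  => ($page - 1) * 5,  // results are paged in increments of 5
    'size'  => 5,
    'query' => [
        'match' => ['doctitle' => $keywords],
    ],
]);

// POST the query to the _search endpoint of the jobs index
$ch = curl_init('http://127.0.0.1:9200/jobs/job/_search');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);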

I started off simple, with just keywords. Assuming that the 2-word search term “project manager” comes through from the UI, how would I want to structure the query? We start with the match query as above, targeting the doctitle field. This returns documents that match any of the query words. In other words, the default operator between words in the query is OR, and the query can be re-written as below. In our corpus of 3,090 documents, this query returned 243 results. We limit each search to return results in increments of 5.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "match": {
      "doctitle": {
        "query": "project manager",
        "operator": "or"
      }
    }
  }
}

The query above can also be expressed differently using a bool query and the should clause. There are four types of clauses for a bool query – must, should, filter and must_not – and they are pretty self-explanatory. It is worth noting that when a bool query does not contain a must or filter clause, one or more of the should clauses must match. Otherwise, the should clauses are optional and merely contribute to the score.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "bool": {
      "should": [
        {"term": {"doctitle": "project"}},
        {"term": {"doctitle": "manager"}}
      ]
    }
  }
}

We know that this approach to search, especially in a vertical search context, is suboptimal. It tends to return many irrelevant results. Instead of matching any of the words, we need to make sure that only documents containing both words are returned. I revised the JSON to use the must clause instead of should, as shown below. Alternatively, we could set the operator to AND if the query were written using the match clause. The result count dropped to 16.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "bool": {
      "must": [
        {"term": {"doctitle": "project"}},
        {"term": {"doctitle": "manager"}}
      ]
    }
  }
}

In order to find more potentially relevant results, we want to broaden the search to also include other fields. This time, I combined the bool and the match queries as shown below. What the query says is that I want to look for documents that contain both the words (in any order or distance apart at this stage) in either the doctext or the doctitle field. When I ran the query below in Sense, the retrieval size increased to 143.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "doctext": {
              "query": "project manager",
              "operator": "and"
            }
          }
        },
        {
          "match": {
            "doctitle": {
              "query": "project manager",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

What if we want to add another mandatory field to the lookup? For instance, say that we not only want to look for documents containing both “project” and “manager”, but they also have to be in a certain location. In this case, we cannot just append a must clause after the should clauses in the JSON above. The reason is that, as mentioned earlier, with the presence of the must clause, the matches on the doctitle and doctext fields would become optional. We need to instead wrap another bool query around the JSON above and group the existing bool query with a new match query that points to the doclocation field. As shown below, using “california” as an example location and sticking to the same two words “project manager”, the result size reduced to 19.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "doctext": {
                    "query": "project manager",
                    "operator": "and"
                  }
                }
              },
              {
                "match": {
                  "doctitle": {
                    "query": "project manager",
                    "operator": "and"
                  }
                }
              }
            ]
          }
        },
        {
          "match": {
            "doclocation": {
              "query" : "california",
              "operator" : "and"
            } 
          }
        }
      ]
    }
  }
}

The inner bool query above that matches “project manager” against the doctitle and doctext fields can be rewritten as shown below using the multi_match query. The most_fields type is set so that if the two words match multiple fields, the scores across those fields are added.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "project manager",
            "fields": ["doctitle","doctext"],
            "type": "most_fields",
            "operator": "and"
          }
        },
        {
          "match": {
            "doclocation": {
              "query": "california",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

There are times, especially with vertical search, when we want a view of how the results are distributed across certain dimensions. These dimensions are informative to the users and can be offered as facets for filtering. In the case of our corpus, assume that we need to use the doccompany field as a facet for filtering. To make this happen, we add an aggregations (or aggs) object to the JSON above to return the unique values in the designated field and the count of documents for each. I gave the aggregation the name doccompanies.

POST /_search
{
  "from": 0,"size": 5,
  "query" : {
    ...
  },
  "aggs": {
    "doccompanies": {
      "terms": {
        "field": "doccompany"
      }
    }
  }
}

The response returned contains the aggregations, appearing as shown below. These key-doc_count pairs can be used by the UI to show filtering options to the users. Note that we previously explained the purpose of the analyzer in Elasticsearch. If we had initially set the doccompany field as analyzed, the buckets that come back would have been grouped by the individual words rather than the complete strings in the field.

{
  ...
  "aggregations": {
    "doccompanies": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Stanford University",
          "doc_count": 5
        },
        {
          "key": "JPL Talent",
          "doc_count": 3
        },
        ...
      ]
    }
  }
}

Step 5 – Create UI for keyword search via the API

At this stage, we have a basic API that receives keywords and location, constructs the corresponding query in JSON and accepts the response from Elasticsearch. The response at this stage contains the individual hits and an aggregation around the doccompany field, as shown below.

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 19,
    "max_score": 2.9697855,
    "hits": [
      {
        "_index": "jobs",
        "_type": "job",
        "_id": "aac343be30bfa57d2908881b59b47734",
        "_score": 2.9697855,
        "_source": {
          "docdate": "2016-10-05",
          "doctext": "Henkels & McCoy, Inc. is a leading utility...",
          "doclocation": "Pomona, California, United States",
          "doccompany": "Henkels & McCoy",
          "doctitle": "Associate Project Manager"
        }
      },
      ...
      {
        "_index": "jobs",
        "_type": "job",
        "_id": "dc5918087ba6aecaddd83a9dc1221ab0",
        "_score": 0.19875972,
        "_source": {
          "docdate": "2016-10-05",
          "doctext": "The General Manager is the leader with...",
          "doclocation": "San Mateo, California, United States",
          "doccompany": "Gap Inc.",
          "doctitle": "General Manager - Gap Hillsdale"
        }
      }
    ]
  },
  "aggregations": {
    "doccompanies": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 2,
      "buckets": [
        {
          "key": "Stanford University",
          "doc_count": 5
        },
        {
          "key": "JPL Talent",
          "doc_count": 3
        }
        ...
      ]
    }
  }
}

Next, I created a PHP home page called index.php with two input boxes organised horizontally, using the AJAX methods of jQuery. On the submit event, these parameters are sent to the API. At this stage, we extend the API to parse the JSON response from Elasticsearch. The API loops through the individual hits, storing the key fields that I want to display in the UI. Similarly, the key-doc_count pairs in the aggregation are parsed and returned to the UI, shown as the COMPANIES facet in the screenshot below. Notice that pagination also has to be dealt with by passing the page number as another parameter between the UI, the API and Elasticsearch.
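As a sketch, the response-parsing part of the API might look like this, assuming the decoded Elasticsearch response (structured as in the JSON above) is in $response as an associative array:

<?php
// Collect the fields the UI needs from each hit
$results = [];
foreach ($response['hits']['hits'] as $hit) {
    $results[] = [
        'id'       => $hit['_id'],
        'title'    => $hit['_source']['doctitle'],
        'company'  => $hit['_source']['doccompany'],
        'location' => $hit['_source']['doclocation'],
        'date'     => $hit['_source']['docdate'],
    ];
}

// Collect the company facet options and their counts
$facets = [];
foreach ($response['aggregations']['doccompanies']['buckets'] as $bucket) {
    $facets[] = ['company' => $bucket['key'], 'count' => $bucket['doc_count']];
}

// Return both to the UI as one JSON payload
echo json_encode([
    'total'   => $response['hits']['total'],
    'results' => $results,
    'facets'  => $facets,
]);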

searchui1.PNG
Figure 1. The results for a search using the keywords “project manager” and “california” as location.

Step 6 – Extend the UI and the API for filtering and sorting

So far, we are only using keywords and location to determine what is retrieved and how it is scored (using out-of-the-box configuration) based on keyword matching. There will be cases where certain search criteria are only used for slicing and dicing the results and should not contribute to scoring and hence ranking. In this exercise, we use the values from the doccompany field for filtering, offered to the users via the COMPANIES facet. To do this, we further extend the parameters by adding the company filter. We revise the JSON object by wrapping the existing bool query with a filtered query as shown below. If the “stanford university” option in the company filter is selected, it is added as the third parameter to the search criteria, in addition to “project manager” and “california”.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          ...
        }
      },
      "filter": {
        "term": {
          "doccompany": "Stanford University"
        }
      }
    }
  },
  "aggs": {
    ...
  }
}

I ran the JSON above in Sense and it yielded the results shown below.

searchui2.PNG
Figure 2. The results for a search using the keywords “project manager” and “california” as location with the filter “stanford university”.

So far, the results are sorted by a score that is computed based on the out-of-the-box configuration for keyword matching. The users may want to re-sort the results differently. In the case of our content, the users may want to see newer documents first. This is where the sort parameter comes in. I appended the sort parameter to the end of the latest query shown above, sorting the results by the _score meta-field (which is keyword relevance). If you need the results sorted by date, use the docdate field instead.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "filtered": {
      "query": {
        "bool" : {
          ...
        }
      },
      "filter": {
        ...
      }
    }
  },
  "aggs": {
    ...
  },
  "sort": {
    "docdate": {
      "order": "_score"
    }
  }
}
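For the date-sorted mode, the sort clause might instead reference the docdate field (newest first), keeping _score as a tiebreaker; a sketch:

"sort": [
  { "docdate": { "order": "desc" } },
  "_score"
]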

At this stage, we have keyword, location, company, page number and sort mode as parameters. Adding a sort by option as shown below in the UI and then all the way through to the API and Elasticsearch would allow the users to toggle between keyword matches and date.

searchui3.PNG
Figure 3. The date-sorted results for a search using “project manager” keywords and “california” as location with the filter “stanford university”.

Step 7 – Add tracking of searches and results

Many people would consider search done at this point, but this could not be further from the truth. Looking at the screenshot above, we are plagued by quality-related concerns and questions such as:

  • The results for “project manager” in “california” by “stanford university” do not look overly relevant.
  • How do we know if the users find these results relevant or useful?
  • Do we know if users are seeing most of the results that they are meant to see?

In order to come up with answers to the questions above and more, it is important that we introduce proper tracking. Getting the right tracking in place should not be difficult. There are JavaScript-based tools out there that can be used for tracking, but as tracking is an integral part of search and the things that we want to track are very fine-grained, I did not bother experimenting with them and implemented my own tracker instead. From my experience, many of the free tools out there are designed for SEO purposes. Hence, they are only good for aggregated metrics regarding visits, users, page views and sessions.

Broadly speaking, there are four types of data that I track in this exercise: first, the search criteria; second, the results which are shown to the users; third, the unique identifiers for the user and the search; and lastly, the interactions with the results. The idea is that the majority of the tracking is done in the back-end by the API (i.e., search.php). At the moment, the search criteria, the results which are shown and the identifiers are all recorded (and some generated) in the API when a search is performed. Since the interactions are front-end events, they can only come from the UI. Whenever someone clicks on a result, an event is fired. This triggers a request to another API I created to log these events, called logger.php; a jQuery method sends the requests off asynchronously together with the document identifier, the position of the result and the timestamp. Currently, I store these in a MySQL table. Obviously, like most things set up for this exercise, this is not fit for production.
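A minimal sketch of what logger.php might look like is shown below; it simply records each click event in a MySQL table, and the table and column names are assumptions:

<?php
// logger.php - record a click event fired by the UI
$db = new PDO('mysql:host=127.0.0.1;dbname=jobs', 'user', 'pass'); // credentials are placeholders

$stmt = $db->prepare(
    "INSERT INTO clicks (search_id, user_id, doc_id, position, clicked_at)
     VALUES (?, ?, ?, ?, ?)"
);
$stmt->execute([
    $_POST['search_id'],      // identifier generated when the search was logged
    $_POST['user_id'],        // identifier for the user
    $_POST['doc_id'],         // which result was clicked
    (int)$_POST['position'],  // the rank position of the clicked result
    $_POST['timestamp'],      // when the click happened
]);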

Step 8 – Introduce basic metrics to gauge success

With the right tracking in place, we can now look at how well the results that we present to the users perform. As discussed at length in Can’t We Just Talk?, search is about helping users satisfy information needs. Whether we are doing that or not varies depending on the users. Until the users tell us explicitly what they are after, it is mostly a guessing game on our end when it comes to figuring out what success looks like for them. However, with properly tracked data and metrics, we can make assumptions about success, advance our understanding of the users and continuously improve search.

For this exercise, I simulated the information needs of three users as well as their interactions with the results. All three of them submitted the same query with the keywords “project manager” in “california”, and each user performed a single search. The three users are in the head space of looking for project manager roles in technology. I used the latest JSON query that we composed above to perform the search, and I consider it the base algorithm. With the tracking in place, we collected the results that were presented and the clicks that took place for all three users.

Referring to Figure 1 above, the users clicked on either the Cisco job in position 4 or the one by Stanford University in position 2, and all three of them paginated almost to the end. The figure below summarises the impression and click data for the three users and the calculation of four metrics which offer different views of success. These metrics are used by people in the Web Analytics and Information Retrieval communities for various purposes.

stats1.PNG
Figure 4. Summary of impressions, clicks and metrics for the base algorithm.

The first metric is impression depth (ID), which tells us, of all the results returned for a search, how many the users had the chance to paginate through and/or see. As with a lot of metrics, ID can be read in different ways. A low ID can mean that a large majority of the results returned are irrelevant. More likely, however, a low ID is the effect of the user’s appetite for pagination and the associated effort, and this varies across users. The second metric is clickthrough rate (CTR), not in the traditional online advertising sense. CTR captures, for a search, how many results were clicked on out of those that were seen. A low CTR can mean that the results we are returning are not relevant. It can also mean that the users have already seen and made use of the majority of the results and hence, these results are no longer useful and do not attract clicks.

The third metric is the average click position (ACP). Unlike the first two metrics, we want the ACP to be as close to 1 as possible. If a result is very relevant to a user query, it has to appear at the top. Hence, results that attract clicks at the top end indicate that the ranking algorithm is doing well. Lastly, precision at lowest click (PLC) is a hybrid of CTR and ACP. This metric uses the lowest click as a proxy to indicate that the users have seen and are serious about making use of all the results between the 1st position and the lowest click. If the PLC is low, it may indicate that the users do not find the top results useful enough to be worth clicking on.

The table above says that, on average, the users paginated up to the third page (i.e., 14.67 impressions). As this search returns 19 results in total, that is an impression depth (ID) of 77%. In other words, the remaining 23% were never seen by the users. The next metric is the clickthrough rate (CTR), which tells us, of all the results which were seen by the users, how many were clicked on. For the base algorithm, it is a low 11%. The average click position (ACP), on the other hand, is sitting pretty at 3.33. Another metric which is used widely in the research community is precision at lowest click (PLC). Using the position of the lowest click, this metric captures how many clicks took place among the results between the 1st position and that lowest position. For the base algorithm, the metric is 42%. The way to read this is that slightly less than half (i.e., 42%) of all the results that the users are very likely to have had the opportunity to assess were actually clicked on.
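To make the definitions concrete, below is a sketch of how the four metrics might be computed for a single search from its tracked impressions and clicks. The exact formulas, particularly for PLC, follow my reading of the definitions above and should be treated as assumptions:

<?php
// Compute per-search metrics from tracked data.
// $totalResults:   total hits returned for the search
// $impressions:    number of results the user actually saw
// $clickPositions: rank positions of the clicked results, e.g. [2, 4]
function searchMetrics($totalResults, $impressions, array $clickPositions) {
    $id  = $impressions / $totalResults;           // impression depth
    $ctr = count($clickPositions) / $impressions;  // clickthrough rate

    // Average click position: mean of the clicked ranks
    $acp = count($clickPositions)
         ? array_sum($clickPositions) / count($clickPositions)
         : null;

    // Precision at lowest click: clicks between position 1 and the
    // lowest (deepest) clicked position, over that position
    $plc = count($clickPositions)
         ? count($clickPositions) / max($clickPositions)
         : null;

    return ['ID' => $id, 'CTR' => $ctr, 'ACP' => $acp, 'PLC' => $plc];
}

// Example: a user who saw 15 of 19 results and clicked positions 2 and 4
print_r(searchMetrics(19, 15, [2, 4]));

The per-search values are then averaged across the three users to produce the summary figures.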

Step 9 – Improve precision of retrieval

With tracking and a basic form of measurement in place, we switch our focus to the results that we are currently returning to the users for the keywords “project manager” and location “california”. The screenshot below shows that if the users refine the results with the company filter, they will encounter irrelevant results. In this case, the user intent was to look for project manager jobs, yet the result that was shown when they selected “CIOX Health” was “Client Service Representative II”.

searchui4.PNG
Figure 5. The results for the search using the keywords “project manager” and “california” as location with the filter “CIOX Health”.

Would you say that the result above is relevant? Why was this document returned in the first place? To answer this, we have to look back at the bool query which we constructed to retrieve documents based on keyword matches on the doctext and doctitle fields.

searchui5
Figure 6. A snippet of text from the document with the title “Client Service Representative II”. Note the distance between the occurrences of the words “manager” and “project”.

Clearly, this document was not returned because of matches in the doctitle field. When I looked into the doctext field, as shown in Figure 6, there was one occurrence of the word “project” and multiple occurrences of “manager”. We can see that the word matches were out of context. When the user provides “project manager” as keywords, they clearly do not see them as just a bag of words. However, our search logic does, and this resulted in irrelevant results. This was discussed in detail in Faceted Search Needs Precise Retrieval.

The way to fix this is to restrict how far apart the word matches can be. In Elasticsearch, we change the type of the multi_match query to phrase and use slop instead of the AND operator. I revised the JSON as shown below and set the slop parameter to 1.

POST /_search
{
  "from": 0,"size": 5,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "project manager",
                "fields": ["doctitle","doctext"],
                "type": "phrase",
                "slop": 1
              }
            },
            {
              "match": {
                "doclocation": {
                  "query": "california",
                  "operator": "and"
                }
              }
            }
          ]
        }
      },
      ...
    }
  },
  "aggs": {
    ...
  },
  "sort": {
    ...
  }
}

I ran the search again for “project manager” in “california”. Instead of 19 jobs, 3 were returned. As you can see in the company filter, the option “CIOX Health” is no longer present, which means that the “Client Service Representative II” document was not retrieved. Depending on the content and the vertical that the search operates in, you will need to think about the right balance between being overly restrictive, and potentially missing relevant results, and being too relaxed, and bringing back irrelevant ones.

searchui6.PNG
Figure 7. The results for a search using the keywords “project manager” and “california” as location after restricting the distance between keyword matches.

To quantify the impact of that change, we replay the simulation of the three users to generate impressions and clicks with the revised algorithm. Users 1 and 3 would still click on the Stanford University job, and as the “Senior Program Manager” job by Cisco is now gone, User 2 did not click on anything. The ID increased to 100%, which simply says that all the results returned for the searches were seen. This happened because our change to the algorithm cut the retrieval from 19 to 3. Having 100% of the returned results seen by the users is not enough, but it is a good start; the results also have to be clicked on. The CTR is now 22%, up from the base algorithm’s 11%. Similarly, the ACP improved from 3.33 to 1 and the PLC from 42% to 67%.

stats2
Figure 8. Summary of impressions, clicks and metrics for the revised algorithm with tightening of word proximity.

Using the metrics, it seems that the tweak to the algorithm has done the right things by the users.

Step 10 – Improve recall via synonym expansion

Different people refer to the same things in different ways. The search engine, however, does not know this by default. As a result, the search may not return documents that might be relevant to the query. In our exercise, we assume that “project manager”, “program manager” and “project coordinator” all refer to the same thing. These synonyms can be curated by domain experts or learned from logs. Referring to Figure 7 above and Figure 9 below, there are 3 jobs returned for each of those searches respectively.

searchui7
Figure 9. The results for a search using the keywords “program manager” and “california” as location.

In addition, there is 1 job for “project coordinator”, as shown in Figure 10. As you can see, these three sets of results are non-overlapping in this case.

searchui10.PNG
Figure 10. The results for a search using the keywords “project coordinator” and “california” as location.

The intent now is to add these three phrases as synonyms that Elasticsearch can use to expand the keywords that users provide and bring back more results. We need to make a change to the analyzer and filter components of the index. Starting with the settings, the JSON below is sent to the endpoint http://127.0.0.1:9200, stating that we will add a new token filter called jobsynonym which uses the content of the file synonym.csv. As part of this JSON, a new analyzer called jobanalyzer is also created, which applies the jobsynonym filter during analysis to perform synonym expansion. Note that the index has to be closed before the analysis settings can be updated, and reopened afterwards.

POST /jobs/_close
PUT /jobs/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jobanalyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "standard",
            "lowercase",
            "kstem",
            "jobsynonym"
          ]
        }
      },
      "filter": {
        "jobsynonym": {
          "type": "synonym",
          "synonyms_path": "synonym.csv",
          "ignore_case": true
        }
      }
    }
  }
}
POST /jobs/_open

In order to use jobanalyzer on the specific fields that we want synonym-expanded on the fly, we need to change the mappings. The following JSON is sent to the _mapping endpoint.

PUT /jobs/job/_mapping
{
  "job": {
    "properties": {
      "doctext": {
        "type": "string",
        "analyzer": "standard",
        "search_analyzer": "jobanalyzer"
      },
      "doctitle": {
        "type": "string",
        "analyzer": "standard",
        "search_analyzer": "jobanalyzer"
      }
    }
  }
}

In the synonym.csv file, I have the following line: “project manager, program manager, project coordinator”. We submit the same keywords “project manager” with location “california” via index.php again, and this time, 7 results were returned, as shown below. As you can see, the results matching “program manager” and “project coordinator” are now retrieved as part of a “project manager” search. Assuming that the additional 4 results for “program manager” and “project coordinator” are all relevant, we are potentially improving recall. For more information about the concept of recall, read 8 Out of 10 (Brown) Cats.

searchui11.PNG
Figure 11. The results for a search using keywords “project manager” and “california” as location with synonym expansion.
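For reference, synonym.csv uses the Solr synonym format that Elasticsearch’s synonym token filter expects, with one group of equivalent phrases per line:

# synonym.csv - each line lists a group of phrases treated as equivalent
project manager, program manager, project coordinator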

The extension of the synonym file is an ongoing process. As new synonyms are added, we want them to take effect immediately. This can be achieved with the following requests, which force the filter to reload the file.

POST /jobs/_close
PUT /jobs/_settings
{
  "index": {
    "analysis.filter.jobsynonym.synonyms_path": "synonym.csv"
  }
}
POST /jobs/_open

We now look at what kind of impact synonym expansion has on user interaction with our results and on the metrics. Not surprisingly, bringing back more results has improved the CTR. In other words, of all the results seen, more of them were now clicked on, which is a good thing. Potentially, this is a sign that recall has improved. ACP, on the other hand, suffered a bit. The clicks are now happening lower down the ranking. This is not surprising, as synonym expansion can often bring back results which are ranked highly but are not as relevant to the query. As a result, they do not attract as many clicks. The same is reflected in the PLC, which has taken a dip to a point where it is even lower than the base algorithm’s. What this means is that there is room for improvement in the ranking of synonym-expanded matches.

stats3.PNG
Figure 12. Summary of impressions, clicks and metrics for the revised algorithm with synonym expansion.

Conclusion

Engineering a search system that is set up for success goes beyond just having the right technology. In this post, we discussed the steps for setting up a basic search engine for job ads using Elasticsearch. More importantly, we looked at tracking essential data on the search system to calculate basic metrics and enhance our understanding of how users interact with the results. We started off with a base algorithm and investigated two improvements, discussing how these changes moved the metrics, as summarised below.

stats4.PNG
Figure 13. Summary of the metrics for the past 3 versions of the algorithm.

Beyond this, the intent is to look for ways to optimise the ranking for different queries. In other words, results at the top which are less likely to attract clicks should be pushed down or, inversely, results which are more likely to be clicked on should be boosted. Further improvement to the ranking can be achieved with techniques that range from tweaking field weights or the weights of synonym matches to boosting certain documents based on user behaviour, which was briefly introduced in Using Clicks In Ranking.