The majority of search engines today present results to users in ranked order. This order reflects the underlying algorithm’s best guess at what the user needs and which documents are most likely to be relevant to that need. The document that the algorithm judges, based on a variety of factors, most likely to be the answer should appear in position 1, followed by the document with the next highest likelihood of being relevant, and so on.
We discussed that relevance is subjective and that the algorithm has to consider many factors to produce the optimal order. The actual factors vary depending on the vertical that the search operates in, such as Web, news, e-commerce, real estate and jobs. As mentioned in a previous post, topicality, novelty and authority are some of the fundamental ones. Increasingly, user interactions with the results in the form of clicks are being explored and used as another type of ranking factor. In this post, we elaborate on these different factors, including past clicks, and how they can be used to influence ranking to yield the desired outcomes.
Where do we start?
Let us use the example of a search using the keywords green curry. The algorithm first looks for all documents that contain both the words green and curry in them. For this discussion, we assume that a set of five documents is produced as shown below. The algorithm then needs to rank the results in the set for them to be shown to the users.
The question is, what are the factors to consider when deciding on the rank order? Let us go back to the basics and think about it this way. Would you be pleased to see the five results returned to you in the descending order above for your query green curry?
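The retrieval step described above, which keeps only the documents containing every query word, can be sketched roughly as follows. This is a minimal illustration and not how any particular engine implements it: the function names and the mini-corpus are hypothetical, and only the menulog and taste.com.au titles are taken from this post.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_all_terms(query, documents):
    """Return the ids of documents whose text contains every query word."""
    query_terms = set(tokenize(query))
    return [doc_id for doc_id, text in documents.items()
            if query_terms <= set(tokenize(text))]


# Hypothetical mini-corpus; the "nba" and "paint" entries are invented.
documents = {
    "nba": "Stephen Curry and Draymond Green lead the Warriors",
    "menulog": "green curry takeaway & delivery restaurants – menulog",
    "taste": "thai green chicken curry recipe – taste.com.au",
    "paint": "how to choose a green paint",
}

matches = retrieve_all_terms("green curry", documents)
```

Because matching is purely word-based, the sports document is retrieved alongside the food documents, while the paint document is dropped for lacking the word curry. Deciding which of the retrieved documents deserves position 1 is the job of the ranking factors discussed next.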
The importance of staying on topic
Is there anything wrong at all with the order above? We know that relevance is subjective. Depending on the person you ask, the answers may vary from one extreme, that the order is totally out of whack, to the other, that the order is perfect. In this particular scenario, however, I would say that the majority of people would ask why the topic of sports would ever come up in a search for green curry.
This has to do with the topicality aspect of relevance. What we mean by that is, when someone issues a query, they already have some topics in mind that they associate the keywords with. If the topic of the documents from a search deviates significantly from what the users had in mind, then the relevance of the results is poor. In this example, the likelihood that the query green curry pertains to food is far higher than the likelihood that it pertains to sports. The ranking aside, some of us might even question why the first result on NBA was retrieved in the first place. If you want an explanation of this, please refer to a previous post.
The first step to improve the topicality aspect of ranking is to think through how we use the keywords from the users. Instead of treating all word matches in the documents equally, we can give preference to word matches that appear next to one another in the text. For instance, the keywords green and curry matching on “green curry takeaway & delivery restaurants – menulog” can receive a higher weight compared to a match on “thai green chicken curry recipe – taste.com.au”. This is because the two words are immediately next to one another in the former, whereas they are one word apart in the latter. Another common technique is to weight the matches in different fields differently. In this example, assume that each document only has three fields: the title, the URL and the description. We can assign higher weights to query words that match the title field, followed by the URL and then the description.
By combining these two techniques, we changed the order of the results from one that is led by a sports-related document to one where all the documents related to food come up first, as shown above. What do you think about the order of the results now?
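The two techniques in this section, a proximity bonus and per-field weights, can be combined in a small scoring function. The sketch below is only one possible formulation: the weight values, the bonus formula and all function names are assumptions made for illustration, and real rankers use more elaborate scoring.

```python
import re

# Assumed weights: title matches count most, then URL, then description.
FIELD_WEIGHTS = {"title": 3.0, "url": 2.0, "description": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_gap(tokens, first, second):
    """Fewest words separating the two terms, or None if either is absent."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    if not first_positions or not second_positions:
        return None
    closest = min(abs(a - b) for a in first_positions for b in second_positions)
    return max(0, closest - 1)


def field_score(text, query_terms):
    """One point per distinct query term found, plus a proximity bonus."""
    tokens = tokenize(text)
    score = float(sum(1 for term in set(query_terms) if term in tokens))
    # Bonus of 1.0 for adjacent terms, shrinking as more words separate them.
    for first, second in zip(query_terms, query_terms[1:]):
        gap = min_gap(tokens, first, second)
        if gap is not None:
            score += 1.0 / (1 + gap)
    return score


def topicality_score(doc, query):
    """Sum the per-field scores, each scaled by its field weight."""
    terms = tokenize(query)
    return sum(weight * field_score(doc.get(field, ""), terms)
               for field, weight in FIELD_WEIGHTS.items())


# Only the title field is filled in; missing fields contribute nothing.
menulog = {"title": "green curry takeaway & delivery restaurants – menulog"}
taste = {"title": "thai green chicken curry recipe – taste.com.au"}
menulog_score = topicality_score(menulog, "green curry")
taste_score = topicality_score(taste, "green curry")
```

With these assumed weights, the menulog title outscores the taste.com.au title because its two query words are adjacent. A document that matches the query words only in its description, far apart from one another, would score lower than both.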
The tendency to go for results from reputable sources
Another factor that has been investigated a lot is the reputation, credibility or reliability of the source. All else being equal, users tend to gravitate towards information from sources that are perceived as more reputable.
In the context of Web search, information about the hyperlinks between documents is used to infer authority. A document or an entire site that gets a high number of mentions or references from other documents is highly prized. This is the essence of something like PageRank, which is one of the many factors going into Google’s search ranking.
Let us assume that taste.com.au has the highest authority of all the five sites based on a measure similar to PageRank, followed by coles.com.au and menulog.com.au. By incorporating this as the next factor into ranking, the algorithm would produce an order that looks something like the above.
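To make the idea of link-based authority concrete, here is a simplified PageRank-style power iteration. It is a textbook sketch, not Google’s production algorithm, and the link graph is entirely hypothetical: the three blog-*.example sites and their links are invented so that taste.com.au receives the most inbound links, followed by coles.com.au and menulog.com.au.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph, invented for illustration only.
links = {
    "blog-a.example": ["taste.com.au", "coles.com.au", "menulog.com.au"],
    "blog-b.example": ["taste.com.au", "coles.com.au"],
    "blog-c.example": ["taste.com.au"],
    "taste.com.au": [],
    "coles.com.au": [],
    "menulog.com.au": [],
}

ranks = pagerank(links)
by_authority = sorted(ranks, key=ranks.get, reverse=True)
```

On this toy graph, the three food-related sites come out in the order assumed in the text. The resulting score would be only one input to the ranker, combined with topicality rather than replacing it.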
The preference for things which are new or have not been seen, sometimes
Depending on the vertical, the temporal aspect of results can weigh in quite heavily. Users who search for green curry recipes may not require the latest results, but those who are looking for the latest news on Stephen Curry and Draymond Green do. For instance, the ordering below, where a more recent news article is ranked higher than a static Wikipedia page, would be preferred for a query on Stephen Curry.
Depending on the query, the ranker can decide whether or not to upweight the temporal feature. The types of queries where freshness has typically been highly valued are those that pertain to recent or recurring events in sports, politics, entertainment and the economy, or to content that gets updated frequently, such as product reviews.
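One simple way to picture a query-dependent freshness signal is an exponential decay on document age, blended with the base relevance score using a weight chosen per query. The sketch below assumes this particular formulation; the half-life, the weights, the scores and the function names are all hypothetical, and how the ranker decides that a query is time-sensitive is outside its scope.

```python
def freshness(age_days, half_life_days=7.0):
    """Exponential decay: 1.0 for brand-new content, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def combined_score(base_relevance, age_days, freshness_weight):
    """Blend base relevance with freshness; a weight of 0 ignores document age."""
    return ((1.0 - freshness_weight) * base_relevance
            + freshness_weight * freshness(age_days))


# Hypothetical scores: an older reference page with a higher base relevance
# and a day-old news article with a lower one.
wiki = {"base_relevance": 0.9, "age_days": 400}
news = {"base_relevance": 0.7, "age_days": 1}

# Time-sensitive query (assumed weight 0.5): freshness counts heavily.
news_query_wiki = combined_score(wiki["base_relevance"], wiki["age_days"], 0.5)
news_query_news = combined_score(news["base_relevance"], news["age_days"], 0.5)

# Evergreen query (weight 0.0): only base relevance matters.
evergreen_wiki = combined_score(wiki["base_relevance"], wiki["age_days"], 0.0)
evergreen_news = combined_score(news["base_relevance"], news["age_days"], 0.0)
```

With the freshness weight turned up, the recent article overtakes the older page; with the weight at zero, the original relevance order is kept.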
Better understanding of users through their clicks
Increasingly, past clicks are being used as a signal in ranking. This type of signal is an interesting one as it raises the question, “what does it provide?” or “what does it solve?”. As discussed in Using Clicks in Ranking, clicks as a concept carry many things with them. For one, users click (or do not click) based on what they see on the SERPs. More specifically, clicks reflect things such as the quality of the snippets, the presence of keyword highlighting or other forms of presentation bias. Besides indicating that a result looked unattractive on the SERP, a non-click can also mean other things. For example, a result that was not clicked on but sits in between two clicked results can mean that the user has already seen the result before. Similarly, if five further results were shown to the user after the lowest clicked position, the lack of clicks on them may mean that the user had already accomplished what they set out to do, rather than that these five results were irrelevant.
However, if used wisely, past clicks can provide valuable insights into user preferences. These preferences can in turn be used as a ranking factor. For instance, if the majority of searches for green curry attract clicks on recipes instead of product pages from online retailers, we can infer a relationship between the query and that type of document. This association can then be used in search ranking.
In this post, we discussed with examples some common factors in ranking search results. We highlighted the importance of having a clear motivation behind each factor and clarity on the role it plays in achieving the ranking that the users expect. We reiterated that, unlike tried and tested concepts such as topicality, authoritativeness and novelty, the use of clicks requires careful thought.
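As a rough illustration of turning past clicks into such an association, the sketch below counts which document types were clicked for each query and uses the resulting share as a mild boost. The click log, the document type labels, the boost formula and the function names are all hypothetical, and the sketch deliberately ignores the position and presentation biases mentioned earlier, which a real system would need to correct for.

```python
from collections import Counter, defaultdict


def click_type_preferences(click_log):
    """Map each query to the share of its clicks that went to each doc type."""
    counts = defaultdict(Counter)
    for query, doc_type in click_log:
        counts[query][doc_type] += 1
    preferences = {}
    for query, counter in counts.items():
        total = sum(counter.values())
        preferences[query] = {doc_type: count / total
                              for doc_type, count in counter.items()}
    return preferences


def boosted_score(base_score, query, doc_type, preferences, weight=0.2):
    """Scale the base score up in proportion to the clicked share of the type."""
    share = preferences.get(query, {}).get(doc_type, 0.0)
    return base_score * (1.0 + weight * share)


# Hypothetical click log of (query, type of the clicked document).
click_log = [
    ("green curry", "recipe"),
    ("green curry", "recipe"),
    ("green curry", "product"),
    ("green curry", "recipe"),
]

preferences = click_type_preferences(click_log)
recipe_score = boosted_score(1.0, "green curry", "recipe", preferences)
product_score = boosted_score(1.0, "green curry", "product", preferences)
```

Given two documents with the same base score, the recipe page ends up ahead of the product page for this query. A query with no click history is left unchanged, since its share defaults to zero.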