
Search engine and Drupal: relevance of results and Solr boosting

Published on 23 February 2023
Photo @chdugue – display board at Saint-Charles station, showing the first results.
Second part of our dive into Drupal's Solr Search API. Here, we discuss handling and configuring result relevance, as well as Solr boosting.

Relevance and Scoring

Solr’s scoring algorithm is known as the tf-idf model (a statistical measure that evaluates how important a word is to a document within a collection).

Lucene combines the Boolean model with the vector space model: 

  • Boolean Model of Information Retrieval (BM)
    The search is based on whether documents do or do not contain the query terms; this model determines which documents match, without ranking them.
  • Vector Space Model (VSM)
    This is an algebraic model that represents textual documents as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing, and relevance ranking.

Lucene combines both models – the documents approved by the BM are assigned a score by the VSM.

This scoring model involves a number of scoring factors, including:
r = query, d = document, and t = search term

  • tf(t in d): Term frequency. The frequency at which a term appears in a document. For a search query, the higher the term frequency, the higher the document’s score.
  • idf(t): Inverse document frequency. The rarer a term is across all documents in the index, the greater its contribution to the score.
  • coord(r, d): Coordination factor. The more query terms that appear in a document, the higher its score.
  • fieldNorm: Field length. The more words a field contains, the lower its score. This factor penalizes documents with longer fields.
  • queryNorm(r): A normalization factor used to make scores comparable between queries. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but simply tries to make scores from different queries (or even different indexes) comparable.

These factors and models lead us to the practical scoring function used by Solr:
score(r, d) = coord(r, d) * queryNorm(r) * Σ over t in r [ tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ]
Note that these scores are only meaningful relative to one another within a single query: queryNorm(r) tries to make scores from different queries comparable, but scores are not bounded to a fixed range such as 0 to 1.
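To make the formula concrete, here is a small Python sketch of the simplified classic scoring function. It is an illustration only: queryNorm and norm(t, d) are left out for readability, and the documents and statistics below are invented.

```python
import math

def tf(freq):
    """Term frequency factor: square root of the raw count (Lucene classic)."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms contribute more to the score."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, doc_tf, num_docs, doc_freq, boosts=None):
    """Simplified practical scoring: coord(r, d) * sum of tf * idf^2 * boost."""
    boosts = boosts or {}
    matched = [t for t in query_terms if doc_tf.get(t, 0) > 0]
    coord = len(matched) / len(query_terms)  # reward matching more query terms
    total = sum(tf(doc_tf[t]) * idf(num_docs, doc_freq[t]) ** 2 * boosts.get(t, 1.0)
                for t in matched)
    return coord * total

# Toy index of 1000 documents: "solr" is rare (10 docs), "drupal" common (400 docs).
doc_freq = {"solr": 10, "drupal": 400}
doc_a = {"solr": 2, "drupal": 1}   # matches both query terms
doc_b = {"drupal": 3}              # matches only the common term
q = ["solr", "drupal"]

assert score(q, doc_a, 1000, doc_freq) > score(q, doc_b, 1000, doc_freq)
# A query-time boost on "drupal" raises doc_b's score.
assert score(q, doc_b, 1000, doc_freq, {"drupal": 4.0}) > score(q, doc_b, 1000, doc_freq)
```

Here doc_a outranks doc_b both because it matches more query terms (coord) and because it contains the rarer term (idf).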

norm(t,d) encapsulates some boost and length factors (at index time):

  • Document boost – set by calling doc.setBoost() before adding the document to the index.
  • Field boost – set by calling field.setBoost() before adding the field to a document.
  • lengthNorm – calculated when the document is added to the index based on the number of tokens in this field in the document, so that shorter fields contribute more to the score. The length norm is calculated by the Similarity class in effect at the time of indexing.

The method computeNorm(java.lang.String, org.apache.lucene.index.FieldInvertState) is responsible for combining all these factors into a single floating-point value.

When a document is added to the index, all the above factors are multiplied together. If a document has multiple fields with the same name, all their boosts are multiplied together.

What is boosting?

t.getBoost() is a query-time boost for term t, as specified in the query text (see query syntax) or as set by application calls to setBoost(). Note that there is no direct API to read the boost of a term inside a multi-term query; instead, each term is represented in the query as its own TermQuery object, so a term’s boost is accessible by calling getBoost() on that sub-query.

Boosting is used to change the score of documents retrieved during a search. There is boosting at indexing time and boosting at query/search time. Index-time boosting is used to permanently boost document fields. Once a document is boosted during indexing, the boost is stored with the document. Therefore, when the search is performed and the relevance is calculated, the stored boost is taken into account. Boosting at query or search time is dynamic. Some fields can be boosted in the query, allowing certain documents to achieve a higher relevance score than others. For example, you can boost book scores by adding the cat:book^4 parameter in the query. This boosting will make books score relatively higher than other items in the index.

When indexing, users can specify that certain documents are more important than others by assigning them a boost. For this, each document’s score is also multiplied by its boost value doc-boost(d).

Solr search is field-based, so each query term applies to a single field, document length normalization is based on the length of that field, and in addition to the document boost there are also boosts for individual document fields.

The same field can be added several times to a document during indexing, so the boost of that field is the multiplication of the boosts of the separate additions (or parts) of that field in the document.

At query time, users can specify boosts for each query, sub-query, and each query term. Thus, the contribution of a query term to a document’s score is multiplied by the boost of that query term query-boost(q).

A document can match a multi-term query without containing all of the query’s terms (this is permitted by some query types), and users can further reward documents that match more query terms through the coordination factor, which is generally higher when more terms match: coord(r, d).

Boosting gives greater relevance to some documents over others. The boost factor is multiplied by the relevance score. If you have a base relevance score of 1.5 and you boost by a factor of 1, the relevance score remains 1.5; if you choose to boost by 2, the relevance score becomes 3 (1.5 * 2 = 3). It’s important to note that if you set the boost to 0, the relevance score will be zeroed out: 1.5 * 0 = 0.

Query-time boosts allow you to specify which terms/clauses are “more important”. The higher the boost factor, the more relevant the term, and thus the higher the scores of the matching documents.

A typical boosting technique is to assign higher boosts to title matches than to body content matches:
(title:foo OR title:bar)^1.5 (body:foo OR body:bar)

You should carefully examine the explanation output to determine suitable boost weights.
The official documentation for query parser syntax is here: http://lucene.apache.org/java/3_5_0/queryparsersyntax.html
The query syntax hasn’t changed significantly since Lucene 1.3 (now it’s 3.5.0).

Date Boosting

Sometimes it’s desirable to prioritize the result list by date, without overlooking the main search parameters that have been specified.

Scenarios where this is useful:

  • When there are no competing search parameters, and it is important for the results to be listed by date (for example, no term entered, all results listed).
  • When it is helpful to favor items with more recent dates (for example, in news article lists).

In Drupal, this functionality is handled by a processor (see below – Drupal Processors). For more advanced use, this article explains how to set up custom queries to vary this feature: https://www.solrtutorial.com/boost-documents-by-age.html.
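As a reference point, the approach described in the linked article adds a boost function that decays with document age. A sketch, assuming an edismax query and a hypothetical created date field:

```
defType=edismax
q=salary
bf=recip(ms(NOW,created),3.16e-11,1,1)
```

recip(x, m, a, b) computes a / (m·x + b); with m ≈ 1 divided by the number of milliseconds in a year, a document one year old receives roughly half the boost of a brand-new one.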

What is partial search?

If the user enters "ball" as a query, the search engine will consider a document a match if it contains "volleyball" or "beachvolleyball", etc. The searched term therefore only needs to be contained within an indexed word.
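In Solr, this kind of partial matching is typically implemented with an n-gram analyzer, which indexes the substrings of each word. A minimal schema sketch (the field type name and gram sizes are illustrative; Drupal’s Search API Solr module can provide such field types out of the box):

```xml
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index all substrings from 3 to 15 characters -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```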

Indexing

To make data available for search, it must first be recorded in the index.

On the index configuration page, you can also find indexing options:    

  • Read-only: Allow only reading of already indexed data, without allowing any changes on the server.
  • Index items immediately: New items are indexed and updated immediately. If you are indexing a large number of items, you might want to leave this option off to save waiting time. If the content to be indexed does not need to be displayed to users at the exact moment it is published, it is advisable not to activate this option. Otherwise, for content that needs immediate publication, such as news briefs or press releases, this option can be used but preferably by creating another specific index for that type of content.
  • Track changes in referenced entities: Automatically queue items for reindexing if any of the indexed field values from referenced entities are changed. Enabling this can cause performance issues on sites when saving certain types of entities. However, if disabled, the fields of referenced entities may become outdated in the index.
  • Select the Cron batch size: This is the number of items to index with each Cron run. The higher the number, the more resources will be consumed.

After creating and configuring the index, you must add fields to it: /admin/config/search/search-api/index/index_name/fields

On this page, you can add fields created in Drupal belonging to the content types chosen earlier in the “Data sources and content types” section. You must add all fields you need to create your search page (fields to display, fields for sorting, fields for use in conditions, etc.). 

In the “Type” column, you can find the data types available from Solr (outlined in a previous paragraph). Note that to use boost, you must choose the “fulltext” value. In practice, it’s advisable to do this for fields containing the most important data, such as: 

  • Title × 13.0
  • Body × 5.0
  • Text fields in paragraphs × 5.0
  • Other important text fields × 1.0

Partial search is performed on this kind of field.
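At query time, weights like these typically end up in an edismax qf parameter. A sketch with illustrative field names (the actual Solr field names depend on your Search API index configuration):

```
defType=edismax
q=salary
qf=title^13.0 body^5.0 paragraph_text^5.0 other_text^1.0
```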

For Drupal date fields, choose the “date” type; for any field that plays no role in relevance scoring (images, media, publication status, content type, etc.), it’s better to use “storage-only”. These fields remain available for use in the search page’s view but are not among the fields analyzed in the index.

Drupal Processors

Processors configure data at the time of indexing and search. The Solr server already has built-in pre- and post-processing data processors that are active. But you can also add/enable others related to Drupal’s data structure from its back office (/admin/config/search/search-api/index/index_name/processors).

Some of the most important:

  • Access control: Adds access controls for nodes and comments. In other words, it takes entity access permissions into account when active.
  • Entity status: Excludes inactive users and unpublished entities (those with “unpublished” status) from being indexed.
  • Boost recent dates: Favors more recent documents and penalizes older documents.
  • Highlight: Highlights or emphasizes returned field results.
  • Type-specific boosting by Drupal content type: Allows higher scores for some content types than others.

The order of processors is important in processing results. There are three processing phases:

  • Pre-processing before indexing;
  • Pre-processing before sending the query;
  • Post-processing of the query.

For each enabled processor, you can choose in which phase (and in what order within the same phase) to place it. For example, the “Highlight” processor should always be last since it acts on the content itself and modifies the data (fields and HTML attributes). Finally, some processors can be configured in the “Processor Settings” section.

Autocomplete

With the Search API Autocomplete module, you can configure result suggestions for free-text search fields. Autocomplete can be enabled per search view (/admin/config/search/search-api/index/index_name/autocomplete).

Suggestions can be displayed using Drupal’s display modes and configured by content type or by field. So, you just need to create a display mode in Drupal for the content type to be shown in suggestions, configure its template, and then select it on the autocomplete configuration page.

Example search engine
On https://www.cgt.fr/, we implemented a search based on Solr.

  • Drupal version: 9.3.22
  • Solr version: 6.6.6
  • Search API Solr module version: 4.2.7

[Screenshot: search results page for the term “salary”]

Caching and Performance

Let’s quickly recall what a cache is. Computers use cache memory to temporarily store the most frequently used data. It’s a great way to reuse data, since retrieval from cache is very fast; fetching the same data from main memory or disk is more costly. However, cache memory is limited in size, so there must be a way to decide which data to evict from the cache to make room for new data.

A classic example is LRU (Least Recently Used): a cache replacement algorithm that evicts the least recently used data to free up space for new data.

Imagine an application that displays .png images as a page loads. You want to keep them in your cache so they load quickly for your users. As the user interacts with the page, their actions trigger the loading of new images on screen. You still want to provide fast image loading, but you’re running out of cache space. The old is replaced by the new! That’s where a cache replacement algorithm comes in: it evicts an old image from your cache in favor of a new one.
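A minimal LRU cache can be sketched in Python with an OrderedDict (a simplified illustration of the policy, not Solr’s actual implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("a.png", b"...")
cache.put("b.png", b"...")
cache.get("a.png")           # "a.png" is now the most recently used
cache.put("c.png", b"...")   # evicts "b.png", the least recently used
assert cache.get("b.png") is None
assert cache.get("a.png") is not None
```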

FilterCache

The filter cache stores the results of filter queries and controls how they are handled to maximize performance. Its main advantage is that when a new searcher is opened, its cache can be pre-populated (autowarmed) with entries from the previous searcher’s cache, which helps maintain performance after index updates.

Example configuration:
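A typical solrconfig.xml entry looks like the following (the sizes are illustrative and should be tuned per project; autowarmCount controls how many entries are copied into a new searcher’s cache):

```xml
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>
```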

      

QueryResultCache 

The query result cache contains the results of previous searches: ordered lists of result IDs based on a query, sort, and the requested result range.
Example configuration:
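An illustrative solrconfig.xml entry (values to be tuned per project):

```xml
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="32"/>
```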

   

DocumentCache

This cache contains the result objects (the stored fields for each result). In other words, it contains the details from the QueryResultCache.
Example configuration:
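An illustrative solrconfig.xml entry; note that the document cache cannot be autowarmed, because internal document IDs change when a new searcher is opened:

```xml
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
```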

The last two caches often go together. They provide better performance in scenarios that are mostly read-only. Take the example of an article page containing both the article’s content and a comments section. For the article content, it makes sense to rely on these caches, since that content is read far more often than it is written.

But if the use cases are mostly write-heavy, it is better to disable these two caches: they are invalidated on every commit, so each data reload empties them and they bring little performance benefit. Keeping the article example in mind, it is advisable to disable these caches for the comments field.

We hope we have clarified some concepts. Most of all, we hope that this article will be useful to you. Using Solr and Drupal together has always allowed us to fully meet our clients’ and users’ needs for advanced searching on their Drupal sites. May it continue!

References
Lucidworks.com - Scaling Lucene and Solr
bluedrop.fr - Code tip for grouping different labels in the sort offered by the Drupal facets module
bluedrop.fr - Going further with Search API SolR for Drupal – the intersection of stemming and lemmatization
Medium - Configuring Solr for Optimum Performance
Opensenselabs.com - HowTo: Use Apache Solr with Drupal 8
Apache Solr - https://solr.apache.org/features.html
Ostraining.com - How to use Search API Solr Search in Drupal 8
Slideshare - Using Search API, Search API Solr and Facets in Drupal 8
Sematext.com - Getting Started with Apache Solr
Medium - What is an LRU Cache?
Solr Tutorial - https://www.solrtutorial.com/
Lucene Apache - Class Similarity
Lucene Apache - Class TFIDFSimilarity
