
Search Engine and Drupal: Added Value and Configuration of Apache SolR

Published on 22 February 2023
Illustration of the challenge of searching through a large amount of information
A deep dive into search on a Drupal site using Search API Solr. The first step of this journey is to assess the added value brought by SolR and to review the main configuration steps. These are the result of our internal procedures, summarized by Elie the almighty.

Default Search in Drupal

Drupal offers a default search engine that may be acceptable for basic searches but has many limitations as soon as we wish to provide users with certain advanced search features. Among these, we note difficulties in:

  • Indexing content from .doc, .pdf, or .ppt files;
  • Configuring search results outside of the native model;
  • Supporting a high load of indexed objects (large amounts of content).

Community members have also listed pitfalls to avoid when working with Drupal native search:
https://www.drupal.org/docs/8/modules/search-api/getting-started/common-pitfalls

We therefore need an alternative solution. Where our CMS falls short on its own, it remains adaptable and can be connected to other tools. For internal search, SolR is increasingly emerging as a low-cost, highly reliable solution. It offers:

  • Better search control;
  • Caching, replication, distributed search;
  • Faster indexing/search — indexes can be merged or optimized;
  • An excellent administration interface accessible via HTTP.

Introduction to SOLR

SolR is a robust open source search API platform. Initially developed for and by CNET Networks, this project based on Java was later donated to the Apache Software Foundation.

Paired with Drupal, Apache SolR is a strong choice for fast, reliable search applications. SolR is reliable, scalable, and fault-tolerant. It offers distributed indexing, replication and load-balanced querying, automatic failover and recovery, and centralized configuration. SolR powers the search and navigation functions of many of the world’s largest websites. Big names such as Netflix, Instagram, and Twitter, as well as various e-commerce sites and CMS, use Apache SOLR for their search features.

SolR is powered by Lucene, a powerful open source full-text search library, hence its acronym "Searching On Lucene with Replication". The relationship between SolR and Lucene is similar to that of a car to its engine.

Added Value of SOLR with Drupal

Drupal SolR enables full-text search, providing accurate results thanks to its near real-time indexing and search capabilities. Indexing with Drupal Apache SolR is not only faster; the resulting indexes can also be merged and optimized.

The solution offers faceted navigation, allowing users to add multiple filters to help them easily browse through stacks of information. Facets are navigation elements that can be queried.

The hit highlighting feature marks the searched words or phrases in the results for easy identification. The dynamic grouping feature allows you to group search results and offer related searches or recommendations. Additionally, there are processors that enable spell check and autocomplete suggestions for a better search experience.
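
To make facets and hit highlighting more concrete, here is a minimal Python sketch of both ideas on an in-memory list of results. It is only an illustration: the sample documents, the field names (`type`, `tags`) and the helper functions are assumptions made for this example, and SolR computes facets and highlights on the server in a far more efficient way.

```python
import re
from collections import Counter

# Hypothetical search results: each result maps a field name to a value.
results = [
    {"title": "Install Solr on Debian", "type": "article", "tags": ["solr", "devops"]},
    {"title": "Solr facets in Drupal", "type": "article", "tags": ["solr", "drupal"]},
    {"title": "Drupal release notes", "type": "page", "tags": ["drupal"]},
]


def field_values(doc, field):
    """Return the values of a field as a list (single values are wrapped)."""
    value = doc.get(field, [])
    return value if isinstance(value, list) else [value]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        counts.update(field_values(doc, field))
    return counts


def apply_filters(docs, filters):
    """Keep only the documents that match every (field, value) filter."""
    return [
        doc for doc in docs
        if all(value in field_values(doc, field) for field, value in filters)
    ]


def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap case-insensitive occurrences of `term` in highlight markers."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


type_facet = facet_counts(results, "type")
filtered = apply_filters(results, [("type", "article"), ("tags", "drupal")])
snippet = highlight(filtered[0]["title"], "solr")
```

Each facet value comes with a count, and selecting several values simply narrows the result list further, which is the behavior users see in a faceted search page.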

Requirements

First, you need to prepare and install an Apache SolR server:
https://solr.apache.org/downloads.html

The modules:

  • Search API Solr: https://www.drupal.org/project/search_api_solr
  • Search API Autocomplete: https://www.drupal.org/project/search_api_autocomplete
  • Facets (to use facets): https://www.drupal.org/project/facets

Then configure them with the prepared SOLR server, from the page /admin/config/search/search-api:

  • Configure the SOLR server;
  • Configure the SOLR index.

Data Representation

In SolR, a document (which represents the content) is the unit of search and index. An index is made up of one or more documents, and a document consists of one or more fields. In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.
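
The row and column analogy can be sketched in a few lines of Python. The document ids, field names, and helper functions below are hypothetical and only serve to illustrate the vocabulary; they are not how SolR stores data internally.

```python
# A document is a set of named fields, much like a table row with columns.
# The ids, field names and values are made up for this example.
index = [
    {"id": "node/1", "title": "Getting started with Solr", "type": "article"},
    {"id": "node/2", "title": "Drupal search basics", "type": "page"},
]


def column(documents, field):
    """Read one field across all documents, like selecting a table column."""
    return [doc.get(field) for doc in documents]


def row(documents, doc_id):
    """Fetch one document by its id, like selecting a table row."""
    return next((doc for doc in documents if doc["id"] == doc_id), None)


titles = column(index, "title")
first = row(index, "node/1")
```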

SOLR Indexing

Definition

SolR collects, stores, and indexes documents from different sources and makes them searchable in near real-time. It follows a three-step process: indexing, querying, and finally ranking the results.

Generally, indexing is a systematic arrangement of documents. Indexing allows users to locate information within a document.

  • Indexing collects, analyzes, and stores documents;
  • The purpose of indexing is to increase the speed and performance of a search query to find a required document.
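
As a rough illustration of the three steps (indexing, querying, ranking), the following Python sketch builds a tiny inverted index and ranks matches by summed term frequency. The sample documents are invented, and the scoring is a deliberate simplification chosen for readability; it is not the relevance formula SolR actually applies.

```python
from collections import defaultdict

# Hypothetical documents: id -> text.
documents = {
    1: "solr search for drupal",
    2: "drupal theming guide",
    3: "solr indexing and solr replication",
}


def build_index(docs):
    """Indexing: map each term to {doc_id: number of occurrences}."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1
    return inverted


def search(inverted, query):
    """Querying and ranking: return doc ids, best match first."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, freq in inverted.get(term, {}).items():
            scores[doc_id] += freq
    # Sort by descending score, then by id to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


inverted = build_index(documents)
ranked = search(inverted, "solr drupal")
```

The lookup only touches the terms present in the query, which is why an index speeds up searches compared with scanning every document.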

By adding content to an index, we make it searchable by SolR. A SolR index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from database tables, and files in common formats like Microsoft Word or PDF.
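
On a Drupal site, the Search API Solr module sends content to the server for you. To show what adding documents looks like at the HTTP level, here is a hedged Python sketch that prepares (without sending) a request to SolR's JSON update handler. The host, port, core name `drupal`, and the document fields are assumptions for the example.

```python
import json
import urllib.request


def build_update_request(base_url, core, docs, commit=True):
    """Prepare (but do not send) a POST to the JSON update handler of a core."""
    url = f"{base_url}/solr/{core}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: a local SolR on its default port and a core named "drupal".
request = build_update_request(
    "http://localhost:8983",
    "drupal",
    [{"id": "node/1", "title": "Hello Solr"}],
)
# To actually send it against a running server:
# urllib.request.urlopen(request)
```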

In Drupal, by going to the page /admin/config/search/search-api/index/index_name/edit, we can choose the desired data source and configure the types of content we want to index.
 

Analysis Phase

When data is added to SolR, it goes through a series of transformations before being added to the index. This is known as the analysis phase. Examples of transformations include lowercasing and stemming (data processors). The final result of the analysis is a series of “tokens” that are then added to the index. It’s these tokens, and not the original text, that are searched when you perform a search query.
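
A toy version of such an analysis chain could look like the Python sketch below. The suffix-stripping stemmer is a crude stand-in written for this example; real SolR analyzers are configured per field type and are considerably more sophisticated.

```python
import re


def naive_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase the text, split it into words, then stem each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(word) for word in words]


tokens = analyze("Indexing the Documents of a Drupal site")
```

Because the same chain is applied to the query, a search for "document" and a search for "Documents" end up looking for the same token.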

Data Storage

When we display search results to users, they usually expect to see the original document, not the machine-processed tokens (which may look very different from the source text). This is the purpose of the “stored” attribute: to tell SolR to save the original text somewhere in the index.

Sometimes, certain fields are not searched but need to be displayed in search results. To do this, simply set the field attributes to stored=true and indexed=false. In Drupal, you just select the value “storage-only” in the “type” column on the index field configuration page (/admin/config/search/search-api/index/solr_index_name/fields).

Why not always store all fields?
Because storing fields increases the size of the index, and the bigger the index, the slower the search. In physical computing terms, you could say that a larger index requires more disk lookups to reach the same amount of data. In Drupal, we often set media, content type, or even taxonomy fields to “storage-only” mode.
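
The difference between the indexed and stored attributes can be sketched with a small in-memory index in Python. The schema, field names, and the `ToyIndex` class are hypothetical; they only mimic the behavior described above.

```python
from collections import defaultdict

# Hypothetical schema: which fields are searchable and which are kept verbatim.
SCHEMA = {
    "title": {"indexed": True, "stored": True},
    "body": {"indexed": True, "stored": False},
    "thumbnail_url": {"indexed": False, "stored": True},  # "storage-only"
}


class ToyIndex:
    def __init__(self, schema):
        self.schema = schema
        self.terms = defaultdict(set)  # token -> ids of matching documents
        self.stored = {}               # id -> original values of stored fields

    def add(self, doc_id, fields):
        self.stored[doc_id] = {}
        for name, value in fields.items():
            options = self.schema[name]
            if options["indexed"]:
                for token in value.lower().split():
                    self.terms[token].add(doc_id)
            if options["stored"]:
                self.stored[doc_id][name] = value

    def search(self, word):
        """Return the stored fields of every document matching `word`."""
        return [self.stored[i] for i in sorted(self.terms.get(word.lower(), ()))]


index = ToyIndex(SCHEMA)
index.add(1, {
    "title": "Solr guide",
    "body": "Configure replication",
    "thumbnail_url": "https://example.com/solr.png",
})
hits = index.search("replication")
no_hits = index.search("https://example.com/solr.png")
```

The body is searchable but absent from the returned result, while the thumbnail URL is returned but cannot be searched.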

Data Types

The SolR schema refers to a configuration file that tells SolR how to index and search each field, which fields are required, and their types.

The SolR example schema ships with several predefined field types, which are well documented. You can also use them as templates to create new field types. In Drupal, the list is shown at the bottom of the page /admin/config/search/search-api/index/index_name/fields.

The most commonly used are:

  • Full text
    Full text (fulltext) fields are analyzed fields made available for full-text searching. This data type should be used for all fields (often with free text input by users) where you want to search for individual words. It uses WordDelimiterFilter to split and match words on case changes, alphanumeric boundaries, and non-alphanumeric characters, so a query for "wifi" or "wi fi" could match a document containing "Wi-Fi". Synonyms and stop words are customized by external files, and stemming is enabled.
  • Storage field (storage-only)
    A storage-only field. You can store any string and retrieve it from the index (in the Drupal view to display it on the site), but it cannot be searched.
  • String
    A UTF-8 encoded (Unicode) string. It is useful when you have a text field that you do not want to “tokenize” (split into index tokens), such as identifiers. The documentation describes it as follows: the StrField type is not analyzed, but indexed/stored as is. String fields have a strict limit of just under 32K, so this type is meant for small fields.
  • Date
    Useful for dates. The format of this date field is 1995-12-31T23:59:59Z, and is a stricter form of the canonical dateTime representation http://www.w3.org/TR/xmlschema-2/#dateTime.
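
The practical difference between these types can be approximated in Python. The sketch below is an assumption-laden simplification: `split_word_parts` only imitates the kind of splitting WordDelimiterFilter performs (the real filter produces more token variants), and all function names are invented for the example.

```python
import re
from datetime import datetime, timezone


def to_solr_date(moment):
    """Format an aware datetime as UTC in the 1995-12-31T23:59:59Z form."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def split_word_parts(token):
    """Split on non-alphanumerics, case changes and letter/digit boundaries."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", token)
    return [part.lower() for part in parts]


def normalize(text):
    """Concatenate the lowercased word parts of every word in the text."""
    return "".join(part for word in text.split() for part in split_word_parts(word))


def fulltext_matches(query, text):
    """Toy full-text match: compare the normalized forms of query and text."""
    return normalize(query) == normalize(text)


def string_matches(query, value):
    """String fields are not analyzed: only an exact match counts."""
    return query == value


stamp = to_solr_date(datetime(1995, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
```

With these helpers, `fulltext_matches("wi fi", "Wi-Fi")` is true while `string_matches("wifi", "Wi-Fi")` is false, which is why identifiers belong in string fields and free text in full text fields.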

Once the configuration principles are understood and set up, we can move on to relevance and boosting concepts. We will cover these topics in the next post dedicated to SolR search in Drupal.

