
Going further with Search API Solr for Drupal: where stemming meets lemmatization

Published on 14 December 2021
While setting up the Drupal Search API with Solr, our contact asked us an interesting question about how the search works algorithmically. Here we share the results of our research.

The question

During the testing phase of a large-scale project, we identified some apparent inconsistencies in the results returned by Search API Solr. The project team asked us to help explain why certain results were displayed.

Example: on the English version of the site, a search for the term "carbone" returns an article containing the word "carbon". The same happens with "impacte" or "vegetariane": the results page returns articles containing the terms "impact" and "vegetarian".
However, a search for the term "usefule" does not return any article containing the word "useful".

It therefore seems that the "e" at the end of the word is ignored in some cases, but not in others. We wanted to understand why. 

Definitions

Stemming (racinisation in French) aims to keep only the root of a word, that is, to strip it of any declension, agreement (inflection), or derivation. When done automatically (in French and English), it usually consists of removing part of the end of the term, at the risk of removing too much or too little.

Lemmatization consists of reducing a term, regardless of its agreement or declension, to its simplest form (in French, the infinitive for verbs and the masculine singular for nouns and adjectives).
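
To make the difference between the two definitions concrete, here is a minimal sketch in Python. It is purely illustrative: the function names `naive_stem` and `lemmatize` and the tiny `LEMMAS` table are hypothetical, and a real lemmatizer relies on a full dictionary rather than four entries.

```python
# Hypothetical lookup table: a real lemmatizer relies on a full dictionary.
LEMMAS = {"was": "be", "mice": "mouse", "better": "good", "studies": "study"}

def naive_stem(word: str) -> str:
    """Stemming by truncation: cut a known ending, even if the result is not a word."""
    for ending in ("ies", "es", "s"):
        if word.endswith(ending):
            return word[: -len(ending)]
    return word

def lemmatize(word: str) -> str:
    """Lemmatization by lookup: return the dictionary form when it is known."""
    return LEMMAS.get(word, word)

for word in ("studies", "mice"):
    print(word, naive_stem(word), lemmatize(word))
# studies stud study
# mice mice mouse
```

The stemmer removes too much from "studies" and cannot handle "mice" at all, while the lookup returns the dictionary form in both cases.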

How Solr indexing works

The Search API Solr module applies the "Stemming / Lemmatization" method to the search. This means it first reduces the search term to its root and then, in a second step, runs the search on that root. It does to words what we 
In English, to search on the root of the word, it removes recognized suffixes such as -ful, -s, -fully, -e, -es, -tion, -ism, -ing, -ization, -ize, -ed, and -ly.

Examples: 

Carbon -
A search for the word "carbon" returns results containing terms such as [carbonization, carboning, carbons, carbone, carbonful, carbones, carbonfully, carbonation, etc.]
A search for "carbone" therefore also returns the expected results: the "e" suffix is removed and the search is run on the root word "carbon".

Limit -
The same logic applies to "limit": the search returns [limite, limits, limites, limitful, limitfully, limitation, limitism, etc.]

Useful -
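This word already carries the "ful" suffix, which is stripped to obtain its root. In "usefule", the final "e" hides that suffix: removing the "e" yields "useful", not the root of "useful". That is why you cannot add an "e" at the end and get a valid result.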
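
The behaviour seen in these examples can be reproduced with a deliberately simplified suffix stripper. The sketch below is not the algorithm Solr uses (Solr ships real stemmers, such as its Porter stem filter). The suffix list is the one quoted in this article, the function names `strip_one_suffix` and `same_root` are hypothetical, and the model assumes that at most one suffix is removed per word. That assumption is enough to show why "carbone" matches "carbon" while "usefule" does not match "useful".

```python
# Toy model only: suffixes taken from the article, longest first so that
# "-ization" is tried before "-ize" or "-e".
SUFFIXES = sorted(
    ["ful", "s", "fully", "e", "es", "tion", "ism", "ing",
     "ization", "ize", "ed", "ly"],
    key=len,
    reverse=True,
)

def strip_one_suffix(word: str) -> str:
    """Remove at most one known suffix from the end of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so that short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def same_root(query: str, indexed: str) -> bool:
    """In this model, a query matches when both words reduce to the same root."""
    return strip_one_suffix(query) == strip_one_suffix(indexed)

for query, indexed in [("carbone", "carbon"), ("usefule", "useful")]:
    print(query, indexed, same_root(query, indexed))
# carbone carbon True
# usefule useful False
```

In this model "useful" is reduced to "use", whereas "usefule" only loses its final "e" and stays at "useful", so the two roots differ.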

Maintaining an enriched configuration

The state of the art in text analysis goes well beyond eliminating superficial differences between terms to solve more complex problems such as language-specific syntactic analysis, part-of-speech tagging, and lemmatization. Solr has a comprehensive framework for carrying out basic text analysis tasks, such as removing very common words called "stopwords", and for carrying out more complex analysis tasks. Solr comes with preconfigured field types in its sample schema.xml.

Ensuring Solr upgrades

Solr upgrade notes are available in the official Solr documentation. The usual approach is to upgrade the Solr nodes one at a time.

Step 1: Stop Solr
Start by stopping the Solr node you want to upgrade. 

Step 2: Install Solr as a service
Follow the instructions for installing Solr as a service in the Solr production documentation. Use the -n parameter to prevent the installation script from starting Solr automatically: the /etc/default/solr.in.sh file must be updated in the next step before Solr is started.

Step 3: Set environment variable overrides
Open /etc/default/solr.in.sh with a text editor and make sure the following variables are correctly set:

  • ZK_HOST=
  • SOLR_HOST=
  • SOLR_PORT=
  • SOLR_HOME=

Make sure the user that runs the Solr process has the required permissions on the SOLR_HOME directory.
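
One way to verify Step 3 before restarting the node is a small script that reads the include file and lists the variables that are missing or empty. This is a minimal sketch: the function name `missing_settings` and the sample values are hypothetical, and the parser only recognizes plain `NAME=value` lines (optionally prefixed with `export`), not everything a shell would accept.

```python
import re

REQUIRED = ("ZK_HOST", "SOLR_HOST", "SOLR_PORT", "SOLR_HOME")

# Matches uncommented assignments such as  SOLR_PORT="8983"
ASSIGNMENT = re.compile(r'^\s*(?:export\s+)?([A-Z_][A-Z0-9_]*)=(.*)$')

def missing_settings(text: str) -> list[str]:
    """Return the required variables that are absent or set to an empty value."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            name, raw = match.groups()
            # Drop surrounding whitespace and quotes around the value.
            values[name] = raw.strip().strip('"\'')
    return [name for name in REQUIRED if not values.get(name)]

# Hypothetical file content, for illustration only.
sample = '''
ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"
SOLR_HOST="solr1.example.org"
#SOLR_PORT="8983"
SOLR_HOME=
'''
print(missing_settings(sample))  # ['SOLR_PORT', 'SOLR_HOME']
```

On a real node, the same function can be given the content of /etc/default/solr.in.sh instead of the sample string.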

Step 4: Start Solr
You are now ready to start the updated Solr node: sudo service solr start

Step 5: Run the health check
You must run the Solr healthcheck command for all hosted collections before proceeding to upgrade the next node. 
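
Step 5 can be scripted when a node hosts several collections. The sketch below assumes the `bin/solr healthcheck` command with its `-c` (collection) and `-z` (ZooKeeper connection string) options, as described in the Solr control script reference. The `/opt/solr` install path, the helper names, and any collection names or ZooKeeper address passed in are hypothetical placeholders to adapt.

```python
import subprocess

def healthcheck_commands(collections, zk_host, solr_bin="/opt/solr/bin/solr"):
    """Build one `solr healthcheck` command line per collection."""
    return [[solr_bin, "healthcheck", "-c", name, "-z", zk_host]
            for name in collections]

def run_healthchecks(collections, zk_host):
    """Run the checks one after another.

    check=True only catches a non-zero exit code; the report printed by
    each command still has to be read before moving on to the next node.
    """
    for command in healthcheck_commands(collections, zk_host):
        subprocess.run(command, check=True)
```

For example, `run_healthchecks(["articles", "pages"], "zk1:2181/solr")` would run the check for two hypothetical collections in turn.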

Finally, repeat steps 1 to 5 for all nodes.
