
Rule-Based Matching with spaCy

In this article, we are going to learn about the rule-based matching features in spaCy.

Unlike regular expressions, which match fixed patterns of characters, rule-based matching lets us match words, phrases, or sometimes whole sentences against a predefined pattern of token attributes.

There are three kinds of matching available:

  • Token matcher
  • Phrase matcher
  • Entity ruler

We are going to look at all three methods in this article.

Token Matcher

As the name suggests, the token matcher, provided as the Matcher class, performs token-based matching: it operates over individual tokens. It uses spaCy's word-level token attributes, such as the following (illustrated in the snippet after this list):

  • LOWER
  • LENGTH
  • LEMMA
  • SHAPE
  • flags such as IS_PUNCT
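
To get a quick feel for these attributes, here are a few illustrative single-token patterns (the example values are our own; each dictionary describes one token, and the Matcher mechanics are explained below):

[{"LOWER": "hello"}]        # matches "hello", "Hello", "HELLO"
[{"LEMMA": "run"}]          # matches "run", "ran", "running"
[{"SHAPE": "dddd"}]         # matches any four-digit token, e.g. "2019"
[{"LENGTH": {">=": 10}}]    # matches tokens of ten or more characters
[{"IS_PUNCT": True}]        # a flag: matches any punctuation token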

How does this work? We create an object of the Matcher class using nlp.vocab, the shared vocabulary. We then add a pattern to be matched, along with a unique ID and, optionally, a callback function. The pattern is given as a list of dictionaries, where each dictionary describes one token.

Note: the callback is optional; in spaCy v3 it is passed via the on_match keyword argument and defaults to None.

Example 1:

Let’s say we want to find a combination of three tokens. We can specify the conditions like

  • A token whose lowercase form matches hello, e.g. hello or Hello
  • A token whose is_punct flag is set to True
  • A token whose lowercase form matches good, e.g. good or Good

Now that we have specified the conditions, let’s see how to write the pattern for them.

First, before starting, we import the required classes:

import spacy
from spacy.matcher import Matcher 
from spacy.tokens import Span 

We will be using spaCy's small English model. Load it into the nlp variable using the load() function (if the model is missing, install it first with python -m spacy download en_core_web_sm).

nlp = spacy.load("en_core_web_sm")

Instantiate an object of the Matcher class with the vocab object from the Language object we just created.

matcher = Matcher(nlp.vocab)

Now, we define the pattern for the conditions we specified earlier.

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "good"}]

After this, we add the pattern to the matcher object as below. The matcher must always share the same vocab as the documents it will operate on. We call matcher.add() with a match ID and a list of patterns.

matcher.add("Matching", [pattern])

Parameters

  • The first parameter is the match ID, a string key that identifies the patterns
  • The second parameter is a list of patterns, where each pattern specifies the different conditions to match a sentence, phrase, or word
  • An optional on_match callback can be passed as a keyword argument; it is invoked once for each match, as sketched after this list
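
If you do want to act on matches as they are found, spaCy calls the callback as on_match(matcher, doc, i, matches) once per match. A minimal sketch, reusing the pattern from above (the function name announce and the key Greeting are our own):

def announce(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple for the current match
    match_id, start, end = matches[i]
    print("Matched:", doc[start:end].text)

matcher.add("Greeting", [pattern], on_match=announce)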

Next, we process a string with nlp and store the resulting Doc in the doc variable.

doc = nlp("Hello, Good morning How was your day!")
print(doc)

# Output:
Hello, Good morning How was your day!

Now, call the matcher on the document object. It returns a list of (match_id, start, end) tuples, giving the start and end token indices of each matched span.

matches = matcher(doc)

Print the matched results and extract the matched spans using the loop below.

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    print(match_id, string_id, start, end, span.text)

# Output:
6895354335150655416 Matching 0 3 Hello, Good

Here, the first output value is the match_id, followed by the start and end token indices, and finally the matched text. The returned match_id is an integer hash, which we can convert back to its string form using nlp.vocab.strings.
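
Because the string store maps in both directions, the conversion works either way (continuing with the nlp object from above):

match_hash = nlp.vocab.strings["Matching"]   # 6895354335150655416
print(nlp.vocab.strings[match_hash])         # Matching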

The complete code:

import spacy
from spacy.matcher import Matcher 
from spacy.tokens import Span 

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "good"}]
matcher.add("Matching", None, pattern)
doc = nlp("Hello, Good morning How was your day!")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    print(match_id, string_id, start, end, span.text)

# Output:
6895354335150655416 Matching 0 3 Hello, Good

Now let’s look at another example, with two sets of conditions for the same string.

Example 2:

Let’s say we want to find a combination of three tokens. We can specify the conditions like

  • A token whose lowercase form matches hello, e.g., hello or Hello
  • A token whose is_punct flag is set to True
  • A token whose lowercase form matches good, e.g., good or Good

We then add a second, shorter set of conditions:

  • A token whose lowercase form matches hello, e.g., hello or Hello
  • A token whose lowercase form matches good, e.g., good or Good

We can add multiple sets of conditions as separate patterns to find all the matches we want. Here we are combining two different sets of conditions.

Let’s see how to write these patterns.

patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "good"}],
    [{"LOWER": "hello"}, {"LOWER": "good"}]
]

The Matcher only searches for and returns the matches; it won’t do anything else with them. This is most helpful when you want to write your own custom functions and pattern-specific logic.

Some examples of where we can use this include finding phone numbers in a whole document, or finding names that start with a title like Mr. or Miss, as sketched below.
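
Here is a sketch of those two use cases (the phone-number shape, the title list, and the sample sentence are our own assumptions; real-world phone numbers would need more robust patterns):

matcher = Matcher(nlp.vocab)

# A ddd-ddd-dddd number; SHAPE "ddd" means three digits, and the default
# tokenizer splits hyphens between digits into separate tokens.
phone_pattern = [{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"},
                 {"ORTH": "-"}, {"SHAPE": "dddd"}]
# A title token followed by a capitalized word.
name_pattern = [{"TEXT": {"IN": ["Mr.", "Mrs.", "Miss"]}}, {"IS_TITLE": True}]

matcher.add("PHONE", [phone_pattern])
matcher.add("TITLED_NAME", [name_pattern])

doc = nlp("Call Mr. Smith at 555-123-4567.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)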

Returning to our two-pattern example, the complete code is:

matcher = Matcher(nlp.vocab)

patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "good"}],
    [{"LOWER": "hello"}, {"LOWER": "good"}]
]
matcher.add("Matching", patterns)

doc = nlp("Hello Good morning How was your day!. Hello, Good Morning")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    print(match_id, string_id, start, end, span.text)
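
Running this should print one match per pattern, along the lines of the following (token indices depend on the tokenizer, so treat these as indicative):

# Output:
6895354335150655416 Matching 0 2 Hello Good
6895354335150655416 Matching 9 12 Hello, Good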