How to lemmatize text using NLTagger

Swift version: 5.6

Paul Hudson @twostraws

Apple’s NaturalLanguage framework can lemmatize text for us. Lemmatization is the process of converting words to the forms you would find in a dictionary: making plural nouns singular, finding the root forms of conjugated verbs, and so on, while also taking into account the context in which the words are used.

To do this, first create an instance of NLTagger with its .lemma scheme enabled, assign your text to its string property, then call enumerateTags() on it to find all the root word forms. Your closure will be passed the tag (the root word) if one exists, plus the range of the matching token in the original string. Return true from the closure to continue enumerating.

So, you could lemmatize a whole sentence like this:

import NaturalLanguage

let text = "This is text with plurals such as geese, people, and millennia."
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text

tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma) { tag, range in
    // Use the lemma when there is one; otherwise keep the original token.
    let stemForm = tag?.rawValue ?? String(text[range])
    print(stemForm, terminator: "")
    return true
}

Lemmas are returned in lowercase, and tokens without a lemma (such as whitespace and punctuation) fall back to the original text, so punctuation is preserved. So, that snippet will output “this be text with plural such as goose, person, and millennium.”
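If you want to see which word each lemma came from, one option is to skip whitespace and punctuation and collect the results as pairs. This is only a sketch: lemmaPairs(in:) is a hypothetical helper name, and it relies on the .omitWhitespace and .omitPunctuation tagger options so that only actual words reach the closure.

```swift
import NaturalLanguage

// Hypothetical helper (not from the article): pairs each word with its lemma.
func lemmaPairs(in text: String) -> [(word: String, lemma: String?)] {
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = text

    var pairs = [(word: String, lemma: String?)]()

    // Skip whitespace and punctuation so only words are enumerated.
    let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]

    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma, options: options) { tag, range in
        // tag is nil when the tagger has no lemma for this token.
        pairs.append((word: String(text[range]), lemma: tag?.rawValue))
        return true
    }

    return pairs
}

for pair in lemmaPairs(in: "The geese were swimming.") {
    print(pair.word, "->", pair.lemma ?? "(no lemma)")
}
```

Keeping the lemma optional makes it obvious when the tagger could not find a root form, rather than silently falling back to the original word.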

If you intend to lemmatize text frequently, consider making it an extension on String like this:

extension String {
    func lemmatized() -> String {
        let tagger = NLTagger(tagSchemes: [.lemma])
        tagger.string = self

        var result = [String]()

        tagger.enumerateTags(in: self.startIndex..<self.endIndex, unit: .word, scheme: .lemma) { tag, tokenRange in
            let stemForm = tag?.rawValue ?? String(self[tokenRange])
            result.append(stemForm)
            return true
        }

        return result.joined()
    }
}

With that in place you can now lemmatize text easily:

let text = "This is text with plurals such as geese, people, and millennia."
print(text.lemmatized())

Available from iOS 12.0
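If your project also targets older systems, one way to respect that availability is to guard the call. The following is a minimal sketch: the method name lemmatizedIfAvailable() is hypothetical, and the macOS, tvOS, and watchOS versions in the check are assumptions based on when NaturalLanguage shipped on those platforms, not something stated above.

```swift
import NaturalLanguage

extension String {
    // Hypothetical wrapper: returns the string unchanged where NLTagger is unavailable.
    func lemmatizedIfAvailable() -> String {
        // Platform versions other than iOS 12.0 are assumptions for illustration.
        guard #available(iOS 12.0, macOS 10.14, tvOS 12.0, watchOS 5.0, *) else {
            return self
        }

        let tagger = NLTagger(tagSchemes: [.lemma])
        tagger.string = self

        var result = [String]()

        tagger.enumerateTags(in: startIndex..<endIndex, unit: .word, scheme: .lemma) { tag, tokenRange in
            // Fall back to the original token when no lemma exists.
            result.append(tag?.rawValue ?? String(self[tokenRange]))
            return true
        }

        return result.joined()
    }
}

print("The geese were swimming.".lemmatizedIfAvailable())
```

Falling back to the original string keeps callers simple, although you may prefer to return an optional if you need to know whether lemmatization actually ran.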

