Advanced regular expression matching with NSRegularExpression

Match text using flexible search criteria

Paul Hudson December 29th 2018 @twostraws

Previously I wrote an article about how to use regular expressions in Swift, but I want to go a step further and discuss how to get more fine-grained control over your regexes by customizing the options used.

Whenever you create a regex you get an option set to work with: NSRegularExpression.Options. In this article we’ll be looking at what control each of those options gives us, with practical code examples along the way.


Setting up

We need a sandbox to work with, so please create a new playground in Xcode and give it this code:

// NSRegularExpression lives in Foundation
import Foundation

// look for the exact word "the"
let pattern = "the"

// we're starting with no options for creating the regex
let regexOptions: NSRegularExpression.Options = []
let regex = try NSRegularExpression(pattern: pattern, options: regexOptions)

// a nice multi-line string to work with
let testString = """
The cat
sat on
the mat
"""

// check whether the string matches, and print one of two messages
// (NSRange counts UTF-16 code units, so use utf16.count rather than utf8.count)
if regex.firstMatch(in: testString, range: NSRange(location: 0, length: testString.utf16.count)) != nil {
    print("Match!")
} else {
    print("No match.")
}

Our regex searches for “the” anywhere in the string, which will be found because it exists on the third line – hopefully Xcode should print out “Match!”, otherwise the rest of this article will be very confusing indeed.
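Every experiment in this article changes only the pattern and the options, so it can help to wrap the check in a small function. This is a minimal sketch rather than part of the original playground: the name hasMatch, its parameter labels, and the sampleText constant are made up for illustration, and it simply reports an invalid pattern as "no match".

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    // An invalid pattern makes the initializer throw; this sketch treats that as "no match".
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }

    // NSRange counts UTF-16 code units, so build it from the string's own indices.
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

// The same three lines as the article's test string.
let sampleText = """
The cat
sat on
the mat
"""

print(hasMatch("the", in: sampleText))   // true: "the" appears on the third line
print(hasMatch("dog", in: sampleText))   // false
```

Returning a Bool keeps the call sites short; if you need the position of the match, return the NSTextCheckingResult from firstMatch instead.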

allowCommentsAndWhitespace

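This option tells the regex engine to ignore whitespace and comments inside your pattern, so you can space out a complicated regex to make it easier to read. It changes how the pattern is read, not the text being searched: if you are parsing user-entered text where whitespace can be anywhere, you still need to allow for it explicitly with something like \s*. As an example, look at this function signature in Swift:

func getUsername(from: [String: String]) -> String

There are lots of ways of writing that, but even if you discount the extreme options you could still see code like this:

func getUsername ( from : [String : String]) -> String

A pattern that accepts both forms needs optional whitespace between almost every token, which gets hard to read when written on one line. Using the option .allowCommentsAndWhitespace means whitespace in the regex itself is ignored, so you can lay the pattern out however you like. So, this will match: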

let pattern = "t h e"
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]
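To show that the option affects the pattern rather than the text being searched, here is a small sketch. The hasMatch helper is a hypothetical convenience written for this example, not an NSRegularExpression API.

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

// With the option, the spaces in the pattern are dropped, so "t h e" behaves like "the".
print(hasMatch("t h e", options: [.allowCommentsAndWhitespace], in: "the mat"))     // true

// Without it, those spaces are literal and must appear in the text.
print(hasMatch("t h e", in: "the mat"))                                             // false

// The text itself is never cleaned up: spaced-out input does not contain "the".
print(hasMatch("t h e", options: [.allowCommentsAndWhitespace], in: "t h e"))       // false

// To require real whitespace while the option is on, ask for it with \s.
print(hasMatch("the \\s mat", options: [.allowCommentsAndWhitespace], in: "the mat")) // true
```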

As for the “comments” part of .allowCommentsAndWhitespace, once you ignore whitespace you can start to use comments inside your regular expression. These start with a # symbol, and everything after it up to the end of the line is ignored. Comments are tied to ignoring whitespace because they are usually written across lines to make them easier to read.

So, this will match:

let pattern = """
t # look for a lowercase t
[a-z] # then any lowercase letter
e # then an e
"""
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]
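Coming back to the function signature from earlier, one way to accept both spellings is to put \s* between the tokens and use this option to keep the pattern readable. The following is only a sketch: signaturePattern and matchesSignature are names invented here, the pattern handles just this one signature, and it uses a raw string literal (which needs Swift 5 or later) so the backslashes don't have to be doubled.

```swift
import Foundation

// Raw multi-line string, so backslashes reach the regex engine as written.
// With .allowCommentsAndWhitespace the pattern's own spaces and line breaks
// are ignored, and # starts a comment that runs to the end of the line.
let signaturePattern = #"""
func \s+ getUsername \s*                        # keyword and function name
\( \s* from \s* : \s*                           # opening parenthesis and parameter label
\[ \s* String \s* : \s* String \s* \] \s* \)    # dictionary type and closing parenthesis
\s* -> \s* String                               # return type
"""#

// Hypothetical helper for this example only.
func matchesSignature(_ source: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: signaturePattern,
                                               options: [.allowCommentsAndWhitespace]) else {
        return false
    }
    let fullRange = NSRange(source.startIndex..., in: source)
    return regex.firstMatch(in: source, options: [], range: fullRange) != nil
}

let compact = "func getUsername(from: [String: String]) -> String"
let spacedOut = "func getUsername ( from : [String : String]) -> String"

print(matchesSignature(compact))     // true
print(matchesSignature(spacedOut))   // true
```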

anchorsMatchLines

The ^ and $ metacharacters allow us to match the start and end of lines, but this often doesn’t work quite as you’d expect.

Regexes were originally designed to handle one line of text at a time, but nowadays it’s much more common to parse hundreds or even thousands of lines at a time. To preserve backwards compatibility, most programmatic regex engines (i.e., ones you use in code) consider the start and end of the line to be the start and end of your whole text no matter how many line breaks it has.

To demonstrate the problem, try using these settings:

let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = []

That looks for “sat” at the start of a line, and we can see that our test string has just that – but it won’t match, because by default ^ and $ match the start and end of the whole string.

To fix the problem we need to use the .anchorsMatchLines option, like this:

let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = [.anchorsMatchLines]

And that will now match correctly.
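Counting matches makes the difference easier to see. This sketch assumes a hypothetical matchCount helper and a sampleText constant holding the same three lines as the test string; the inline (?m) flag in the last line is the regex engine's own way of switching on the same behavior from inside the pattern.

```swift
import Foundation

let sampleText = "The cat\nsat on\nthe mat"

// Hypothetical helper: counts how many times `pattern` matches in `text`.
func matchCount(_ pattern: String,
                options: NSRegularExpression.Options = [],
                in text: String) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return 0
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, options: [], range: fullRange)
}

// By default ^ only matches at the very start of the string.
print(matchCount("^\\w+", in: sampleText))                                // 1 ("The")

// With .anchorsMatchLines it matches at the start of every line.
print(matchCount("^\\w+", options: [.anchorsMatchLines], in: sampleText))   // 3 ("The", "sat", "the")

// The inline (?m) flag has the same effect without passing any options.
print(matchCount("(?m)^sat", in: sampleText))                             // 1
```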

caseInsensitive

This is probably the most commonly used regular expression option, and unless you’re working with very large strings it doesn’t have much of a performance impact.

Right now, this will match because we have the substring “the” in our test string:

let pattern = "the"
let regexOptions: NSRegularExpression.Options = []

However, this will not match, because regexes are case-sensitive by default:

let pattern = "THE"
let regexOptions: NSRegularExpression.Options = []

If you want to search for “the”, “THE”, “tHe” and all other case variations, you can collapse the case by using the .caseInsensitive option like this:

let pattern = "THE"
let regexOptions: NSRegularExpression.Options = [.caseInsensitive]

That will match, because the regex treats “THE” and “the” as the same.

Although this is common, you might prefer to be clear about which case variations are allowed. For example, you might want to match precisely “The” with a capital T, but then any three-letter word after it regardless of case:

let pattern = "The [A-Za-z]{3}"
let regexOptions: NSRegularExpression.Options = []
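If only part of a pattern should ignore case, another possibility is a scoped inline flag, (?i:…), which applies case-insensitivity to one group and leaves the rest case-sensitive. This is a sketch under the assumption that you are happy putting flags inside the pattern; hasMatch is a hypothetical helper, not part of NSRegularExpression.

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

// "The" must be spelled exactly; "cat" may use any mix of cases.
print(hasMatch("The (?i:cat)", in: "The CAT"))   // true
print(hasMatch("The (?i:cat)", in: "THE cat"))   // false

// The option version applies to the whole pattern.
print(hasMatch("THE CAT", options: [.caseInsensitive], in: "The cat"))   // true
```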

dotMatchesLineSeparators

By default, the . metacharacter matches any single character except for line breaks, and is commonly used with quantifiers like * and + to match ranges of unknown text.

Because it doesn’t match line breaks, these settings won’t match anything:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = []

That will match “The” followed by anything except a line break, “cat” followed by anything except a line break, then “sat”, but in our test string “sat” appears on a new line and so . won’t work.

To fix this and make the test string match, add the .dotMatchesLineSeparators option, like this:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.dotMatchesLineSeparators]
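To see exactly what was matched, you can pull the matched text back out of the string. The firstMatchText helper below is a hypothetical function written for this sketch, and sampleText holds the same three lines as the test string.

```swift
import Foundation

let sampleText = "The cat\nsat on\nthe mat"

// Hypothetical helper: returns the text of the first match, or nil if there is none.
func firstMatchText(_ pattern: String,
                    options: NSRegularExpression.Options = [],
                    in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return nil
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: fullRange),
          let range = Range(match.range, in: text) else {
        return nil
    }
    return String(text[range])
}

// Without the option, . stops at the line break after "cat".
print(firstMatchText("The.+cat.+sat", in: sampleText) == nil)   // true

// With it, the match runs across the line break.
let acrossLines = firstMatchText("The.+cat.+sat", options: [.dotMatchesLineSeparators], in: sampleText)
print(acrossLines == "The cat\nsat")   // true

// The inline (?s) flag switches on the same behavior from inside the pattern.
print(firstMatchText("(?s)The.+cat.+sat", in: sampleText) != nil)   // true
```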

ignoreMetacharacters

Metacharacters are characters that have a special meaning rather than their literal one, e.g. . matches any character that isn’t a line break, * is the zero-or-more quantifier, and \d matches any digit.

Very rarely – perhaps if you were mixing regexes with non-regexes – you might want to treat your pattern string as a literal sequence of characters, ignoring the special meaning of any metacharacters. To do that, add the .ignoreMetacharacters option to your regex, like this:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.ignoreMetacharacters]

Because we’re ignoring the meanings of . and +, that pattern won’t match text such as “The cat sat”, but will match the literal string “The.+cat.+sat”.
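If you need literal text inside a larger regex, rather than a whole pattern that is literal, NSRegularExpression.escapedPattern(for:) escapes just that piece. The sketch below contrasts the two approaches; hasMatch is a hypothetical helper and the example strings are made up.

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

// Whole pattern treated literally.
print(hasMatch("The.+cat.+sat", options: [.ignoreMetacharacters], in: "The.+cat.+sat"))   // true
print(hasMatch("The.+cat.+sat", options: [.ignoreMetacharacters], in: "The cat sat"))     // false

// Literal text mixed with real metacharacters: escape only the literal part.
let escaped = NSRegularExpression.escapedPattern(for: "1+1=2")
print(hasMatch("^" + escaped + "$", in: "1+1=2"))   // true
print(hasMatch("^" + escaped + "$", in: "11=2"))    // false

// Unescaped, the + is a quantifier, so "11=2" matches by accident.
print(hasMatch("^1+1=2$", in: "11=2"))              // true
```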

useUnicodeWordBoundaries

Regular expressions were first used in code 50 years ago, and although they had a formal mathematical definition it took quite some time to add a formal lexical definition.

One gray area for a long time was word boundaries: what constitutes the start and end of a word? As an example, consider this test string:

let testString = """
The child's cat
sat on
the mat
"""

You can search for the word “child” in that string by using the word boundary metacharacter, \b:

let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = []

That will match our new test string. But should it match? Our test string has “child’s”, so if you were looking specifically for the string “child” as a standalone word it would match incorrectly.

Fortunately, the Unicode Consortium got busy doing their usual excellent work of studying language, and wrote a formal definition of what constitutes a word boundary. The result is called Unicode TR#29, and you can enable it with your regular expressions by adding the .useUnicodeWordBoundaries option like this:

let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = [.useUnicodeWordBoundaries]

That will no longer match, because “child” doesn’t appear as a standalone word in the test string.
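Here is the comparison as a runnable sketch. It assumes a hypothetical hasMatch helper and a sampleText constant with the same content as the new test string; the expected results in the comments are the ones described above.

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

let sampleText = "The child's cat\nsat on\nthe mat"

// Default boundaries: the apostrophe ends the word, so "child" is found.
print(hasMatch("\\bchild\\b", in: sampleText))                                         // true

// Unicode boundaries: "child's" is one word, so "child" alone is not found.
print(hasMatch("\\bchild\\b", options: [.useUnicodeWordBoundaries], in: sampleText))   // false

// Ordinary standalone words still match either way.
print(hasMatch("\\bcat\\b", options: [.useUnicodeWordBoundaries], in: sampleText))     // true
```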

useUnixLineSeparators

This is a more esoteric option for most of us, but if you’re working in a cross-platform environment it can be helpful.

Historically line breaks have been represented in a number of ways, and regexes are designed to work with them all. For example, Unix and macOS line breaks are written as \n, but Windows line breaks are written as \r\n.

If you specifically want to limit your regexes so they match only Unix/macOS line breaks you should use the .useUnixLineSeparators option, like this:

let regexOptions: NSRegularExpression.Options = [.useUnixLineSeparators]
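The option shows its effect through ., ^ and $, because those are the metacharacters that care about where lines end. The sketch below uses a made-up string containing a lone carriage return (\r) and a hypothetical hasMatch helper. By default \r counts as a line break; with this option only \n does.

```swift
import Foundation

// Hypothetical helper: returns true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: fullRange) != nil
}

// Two "lines" separated by a carriage return rather than \n.
let carriageReturnText = "The cat\rsat"

// Default: \r is a line break, so . refuses to match it.
print(hasMatch("cat.sat", in: carriageReturnText))                                      // false

// Unix separators only: \r is an ordinary character, so . matches it.
print(hasMatch("cat.sat", options: [.useUnixLineSeparators], in: carriageReturnText))   // true

// The same applies to ^ when combined with .anchorsMatchLines.
print(hasMatch("^sat", options: [.anchorsMatchLines], in: carriageReturnText))                          // true
print(hasMatch("^sat", options: [.anchorsMatchLines, .useUnixLineSeparators], in: carriageReturnText))  // false
```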

Where next?

We’ve covered the full range of NSRegularExpression.Options here, but if you want even more control you might want to investigate NSRegularExpression.MatchingOptions as well – these let you manipulate specific match calls rather than the regular expression itself.

You can also mix together most of the options listed above: NSRegularExpression.Options is a Swift option set, which means you can specify them as single items:

let regexOptions: NSRegularExpression.Options = .caseInsensitive

…or as arrays:

let regexOptions: NSRegularExpression.Options = [.caseInsensitive, .useUnicodeWordBoundaries]

Do whichever feels most natural for you.
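As a small sketch of both ideas, the code below builds an option set step by step and then passes a matching option to one specific call. The hasMatch helper and its matching: label are invented for this example; .anchored is one of the NSRegularExpression.MatchingOptions and limits matches to the start of the search range.

```swift
import Foundation

// Option sets can be built up gradually as well as written as literals.
var regexOptions: NSRegularExpression.Options = .caseInsensitive
regexOptions.insert(.anchorsMatchLines)
print(regexOptions.contains(.caseInsensitive))                   // true
print(regexOptions == [.caseInsensitive, .anchorsMatchLines])    // true

// Hypothetical helper that also accepts per-call matching options.
func hasMatch(_ pattern: String,
              options: NSRegularExpression.Options = [],
              matching: NSRegularExpression.MatchingOptions = [],
              in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: matching, range: fullRange) != nil
}

let sampleText = "The cat\nsat on\nthe mat"

// .anchored only accepts a match at the start of the range: "The" qualifies, "mat" does not.
print(hasMatch("the", options: regexOptions, matching: [.anchored], in: sampleText))   // true
print(hasMatch("mat", options: regexOptions, matching: [.anchored], in: sampleText))   // false
print(hasMatch("mat", options: regexOptions, in: sampleText))                          // true
```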
