Match text using flexible search criteria
Previously I wrote an article about how to use regular expressions in Swift, but I want to go a step further and discuss how to get more fine-grained control over your regexes by customizing the options used.
Whenever you create a regex you get an option set to work with: NSRegularExpression.Options. In this article we’ll be looking at what control each of those options gives us, with practical code examples along the way.
We need a sandbox to work with, so please create a new playground in Xcode and give it this code:
import Foundation

// look for the substring "the" (case-sensitive)
let pattern = "the"
// we're starting with no options for creating the regex
let regexOptions: NSRegularExpression.Options = []
let regex = try NSRegularExpression(pattern: pattern, options: regexOptions)
// a nice multi-line string to work with
let testString = """
The cat
sat on
the mat
"""
// NSRange is measured in UTF-16 code units, so use utf16.count rather than utf8.count
let range = NSRange(location: 0, length: testString.utf16.count)
// check whether the string matches, and print one of two messages
if regex.firstMatch(in: testString, range: range) != nil {
    print("Match!")
} else {
    print("No match.")
}
Our regex searches for “the” anywhere in the string, which will be found because it exists on the third line – hopefully Xcode should print out “Match!”, otherwise the rest of this article will be very confusing indeed.
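One detail worth making explicit: NSRange is measured in UTF-16 code units, which matters as soon as the text contains emoji or other non-ASCII characters. As a minimal sketch, the check above could be wrapped in a small helper that builds the range from String indices instead of counting by hand. The name hasMatch is hypothetical and not part of Foundation.

```swift
import Foundation

// Hypothetical helper for experimenting: returns true when `pattern`
// matches anywhere in `text` using the given options.
func hasMatch(_ pattern: String, in text: String, options: NSRegularExpression.Options = []) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false // the pattern was not a valid regex
    }
    // Building the NSRange from String indices gives the UTF-16 length automatically.
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

let testString = """
The cat
sat on
the mat
"""

print(hasMatch("the", in: testString))                 // true
print(hasMatch("dog", in: testString))                 // false
print(hasMatch("mat$", in: "The 🐈 sat on the mat"))   // true
```

Returning false for an invalid pattern keeps the call sites short, but in real code you would probably want to surface that error rather than hide it.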
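The .allowCommentsAndWhitespace option tells the regex engine to ignore whitespace and comments inside your pattern, so you can lay out a complicated regex in a readable way. It does not change how whitespace in the text being searched is treated. This is particularly helpful when parsing user-entered text, because whitespace can be anywhere and the patterns that handle it get long. As an example, look at this function signature in Swift:
func getUsername(from: [String: String]) -> String
There are lots of ways of writing that, but even if you discount the extreme options you could still see code like this:
func getUsername ( from : [String : String]) -> String
A pattern that accepts both forms needs \s* in many places, which quickly becomes hard to read. Using the option .allowCommentsAndWhitespace means unescaped whitespace in the pattern is ignored, so you can space the pattern out however you like. So, this will match “the” in our test string, exactly as the pattern "the" would: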
let pattern = "t h e"
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]
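To see that the option changes how the pattern is read rather than how the input is scanned, here is a small sketch. The countMatches helper is hypothetical and exists only for this example.

```swift
import Foundation

// Hypothetical helper: counts how many times `pattern` matches in `text`.
func countMatches(_ pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> Int {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.numberOfMatches(in: text, options: [], range: range)
}

// With the option, "t h e" is read as the pattern "the".
let spacedOut = try countMatches("t h e", in: "the mat", options: [.allowCommentsAndWhitespace])
print(spacedOut) // 1

// The spaces are no longer part of the pattern, so literal "t h e" text is not matched.
let literalSpaces = try countMatches("t h e", in: "t h e", options: [.allowCommentsAndWhitespace])
print(literalSpaces) // 0

// To accept optional whitespace in the input, ask for it explicitly with \s*.
let flexible = try countMatches("t \\s* h \\s* e", in: "t h e", options: [.allowCommentsAndWhitespace])
print(flexible) // 1
```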
As for the “comments” part of .allowCommentsAndWhitespace, once whitespace is ignored you can start to use comments inside your regular expression. These start with a # symbol, and everything after it on that line is ignored. Comments are tied to ignoring whitespace because commented patterns are usually written across several lines to make them easier to read.
So, this will match:
let pattern = """
t # look for a T
[a-z] # then any lowercase letter
e # then an e
"""
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]
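Coming back to the two function signatures from earlier, a commented pattern might look something like the sketch below. The pattern is an illustration written for this example (it only extracts the function name), and it uses a raw multi-line string so the backslashes don’t need doubling.

```swift
import Foundation

// Illustrative pattern: finds a function name while tolerating
// optional whitespace before the opening parenthesis.
let pattern = #"""
func \s+      # the keyword, then at least one whitespace character
(\w+)         # capture the function name
\s* \(        # optional whitespace, then a literal opening parenthesis
"""#

let regex = try NSRegularExpression(pattern: pattern, options: [.allowCommentsAndWhitespace])

let signatures = [
    "func getUsername(from: [String: String]) -> String",
    "func getUsername ( from : [String : String]) -> String"
]

var names: [String] = []
for signature in signatures {
    let range = NSRange(signature.startIndex..<signature.endIndex, in: signature)
    // Capture group 1 holds the function name.
    if let match = regex.firstMatch(in: signature, options: [], range: range),
       let nameRange = Range(match.range(at: 1), in: signature) {
        names.append(String(signature[nameRange]))
    }
}
print(names) // ["getUsername", "getUsername"]
```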
The ^ and $ metacharacters allow us to match the start and end of lines, but this often doesn’t work quite as you’d expect.
Regexes were originally designed to handle one line of text at a time, but nowadays it’s much more common to parse hundreds or even thousands at a time. To preserve backwards compatibility, most programmatic regex engines (i.e., ones you use in code) consider the start and end of the line to be the start and end of your whole text no matter how many line breaks it has.
To demonstrate the problem, try using these settings:
let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = []
That looks for “sat” at the start of a line, and we can see that our test string has just that – but it won’t match, because by default ^ and $ match the start and end of the whole string.
To fix the problem we need to use the .anchorsMatchLines option, like this:
let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = [.anchorsMatchLines]
And that will now match correctly.
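The difference is easier to see when you collect every match rather than just the first. This sketch uses a hypothetical allMatches helper to pull out the first word on each line:

```swift
import Foundation

let testString = """
The cat
sat on
the mat
"""

// Hypothetical helper: returns every substring matched by `pattern`.
func allMatches(of pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// By default ^ anchors to the start of the whole string...
let wholeString = try allMatches(of: "^\\w+", in: testString)
print(wholeString) // ["The"]

// ...and with .anchorsMatchLines it anchors to the start of every line.
let everyLine = try allMatches(of: "^\\w+", in: testString, options: [.anchorsMatchLines])
print(everyLine) // ["The", "sat", "the"]
```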
Case-insensitive matching is probably the most commonly used regular expression option, and unless you’re working with very large strings it doesn’t have much of a performance impact.
Right now, this will match because we have the substring “the” in our test string:
let pattern = "the"
let regexOptions: NSRegularExpression.Options = []
However, this will not match, because regexes are case-sensitive by default:
let pattern = "THE"
let regexOptions: NSRegularExpression.Options = []
If you want to search for “the”, “THE”, “tHe” and all other case variations, you can collapse the case by using the .caseInsensitive option like this:
let pattern = "THE"
let regexOptions: NSRegularExpression.Options = [.caseInsensitive]
That will match, because the regex treats “THE” and “the” as the same.
Although this is common, you might prefer to be clear about which case variations are allowed. For example, you might want to match precisely “The” with a capital T, but then any three-letter word after it regardless of case:
let pattern = "The [A-Za-z]{3}"
let regexOptions: NSRegularExpression.Options = []
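Another way to limit case-insensitivity to part of a pattern is a scoped inline flag, (?i:…), which the ICU regex syntax behind NSRegularExpression supports. This goes beyond the options discussed in this article, so treat it as a sketch; the hasMatch helper is hypothetical.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// "The" must have a capital T; only the three letters after it ignore case.
let pattern = "The (?i:[a-z]{3})"

let upperWord = try hasMatch(pattern, in: "The CAT")
let lowerWord = try hasMatch(pattern, in: "The cat")
let lowerThe = try hasMatch(pattern, in: "the cat")

print(upperWord) // true
print(lowerWord) // true
print(lowerThe)  // false
```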
By default, the . metacharacter matches any single character except for line breaks, and is commonly used with quantifiers like * and + to match runs of unknown text.
Because it doesn’t match line breaks, these settings won’t match anything:
let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = []
That looks for “The” followed by one or more characters that aren’t line breaks, then “cat” followed by one or more characters that aren’t line breaks, then “sat”. In our test string “sat” appears on a new line, so . can’t bridge the gap and the match fails.
To fix this and make the test string match, add the .dotMatchesLineSeparators option, like this:
let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.dotMatchesLineSeparators]
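If you want to see exactly what was matched, you can convert the match range back into a Swift substring. A minimal sketch, using a hypothetical firstMatch(of:in:options:) helper:

```swift
import Foundation

let testString = """
The cat
sat on
the mat
"""

// Hypothetical helper: returns the first matched substring, or nil.
func firstMatch(of pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> String? {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: range),
          let swiftRange = Range(match.range, in: text) else {
        return nil
    }
    return String(text[swiftRange])
}

let pattern = "The.+cat.+sat"

// Without the option, . stops at the line break after "cat".
let withoutOption = try firstMatch(of: pattern, in: testString)
print(withoutOption == nil) // true

// With the option, the second .+ is allowed to consume the line break.
let withOption = try firstMatch(of: pattern, in: testString, options: [.dotMatchesLineSeparators])
print(withOption == "The cat\nsat") // true
```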
Metacharacters are characters or sequences that have a special meaning rather than their literal one, e.g. . matches any character that isn’t a line break, * is the zero-or-more quantifier, and \d matches any digit.
Very rarely – perhaps if you were mixing regexes with non-regexes – you might want to treat your pattern string as a literal sequence of characters, ignoring the special meaning of any metacharacters. To do that, add the .ignoreMetacharacters option to your regex, like this:
let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.ignoreMetacharacters]
Because we’re ignoring the meanings of . and +, that pattern won’t match “The cat sat”, but will match the literal string “The.+cat.+sat”.
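The option applies to the whole pattern. If only part of a pattern should be literal, for example some user-entered text placed inside a larger regex, NSRegularExpression.escapedPattern(for:) is an alternative worth knowing about. The sketch below shows both approaches; the matches helper and the sample price text are made up for illustration.

```swift
import Foundation

// Hypothetical helper: true when `regex` matches anywhere in `text`.
func matches(_ regex: NSRegularExpression, _ text: String) -> Bool {
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Whole pattern treated as literal text.
let literal = try NSRegularExpression(pattern: "The.+cat.+sat", options: [.ignoreMetacharacters])
let matchesSentence = matches(literal, "The cat sat")
let matchesLiteral = matches(literal, "The.+cat.+sat")
print(matchesSentence) // false
print(matchesLiteral)  // true

// Only the user-supplied part is escaped; ^, \$ and $ keep their regex meaning.
let userInput = "3.50" // made-up user-entered text
let escaped = NSRegularExpression.escapedPattern(for: userInput)
let priceRegex = try NSRegularExpression(pattern: "^\\$" + escaped + "$")
let exactPrice = matches(priceRegex, "$3.50")
let wrongPrice = matches(priceRegex, "$3x50")
print(exactPrice) // true
print(wrongPrice) // false
```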
Regular expressions were first used in code 50 years ago, and although they had a formal mathematical definition it took quite some time to add a formal lexical definition.
One gray area for a long time was word boundaries: what constitutes the start and end of a word? As an example, consider this test string:
let testString = """
The child's cat
sat on
the mat
"""
You can search for the word “child” in that string by using the word boundary metacharacter, \b:
let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = []
That will match our new test string. But should it match? Our test string has “child’s”, so if you were looking specifically for the string “child” as a standalone word it would match incorrectly.
Fortunately, the Unicode Consortium got busy doing their usual excellent work of studying language, and wrote a formal definition of what constitutes a word boundary. The result is called Unicode TR#29, and you can enable it with your regular expressions by adding the .useUnicodeWordBoundaries option like this:
let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = [.useUnicodeWordBoundaries]
That will no longer match, because “child” doesn’t appear as a standalone word in the test string.
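You can compare the two behaviours side by side in a playground. This is only a sketch with a hypothetical hasMatch helper; the second result is printed rather than assumed, so you can confirm what your own OS version does.

```swift
import Foundation

let testString = """
The child's cat
sat on
the mat
"""

// Hypothetical helper: true when `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Default rules: the apostrophe is not a word character, so \b matches after "child".
let defaultBoundaries = try hasMatch("\\bchild\\b", in: testString)
print(defaultBoundaries) // true

// TR#29 rules: an apostrophe between letters stays inside the word,
// so this is where the article expects the match to disappear.
let unicodeBoundaries = try hasMatch("\\bchild\\b", in: testString, options: [.useUnicodeWordBoundaries])
print(unicodeBoundaries)
```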
The last option, .useUnixLineSeparators, is more esoteric for most of us, but it’s more helpful if you’re working in a cross-platform environment.
Historically line breaks have been represented in a number of ways, and regexes are designed to work with them all. For example, Unix and macOS line breaks are written as \n, but Windows line breaks are written as \r\n.
If you specifically want to limit your regexes so that only Unix/macOS line breaks (\n) count as line separators for ., ^, and $, you should use the .useUnixLineSeparators option, like this:
let regexOptions: NSRegularExpression.Options = [.useUnixLineSeparators]
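One way to observe the effect is with text that uses a lone carriage return, which counts as a line separator by default but not once this option is set. A minimal sketch, with a hypothetical hasMatch helper and made-up sample text:

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String, in text: String, options: NSRegularExpression.Options = []) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Made-up sample: two "lines" separated by a lone carriage return.
let carriageReturnText = "abc\rdef"

// By default \r is treated as a line separator, so . will not match it.
let defaultSeparators = try hasMatch("abc.def", in: carriageReturnText)
print(defaultSeparators) // false

// With only \n recognised as a line separator, . is free to match \r.
let unixSeparators = try hasMatch("abc.def", in: carriageReturnText, options: [.useUnixLineSeparators])
print(unixSeparators) // true
```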
We’ve covered the full range of NSRegularExpression.Options here, but if you want even more control you might want to investigate NSRegularExpression.MatchingOptions as well – these let you manipulate specific match calls rather than the regular expression itself.
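As a taste of what matching options look like, here is a short sketch using .anchored, which restricts a match to the very start of the search range. The sample text is made up for this example.

```swift
import Foundation

let regex = try NSRegularExpression(pattern: "cat")
let text = "The cat sat"
let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)

// No matching options: "cat" may be found anywhere in the range.
let anywhere = regex.firstMatch(in: text, options: [], range: fullRange) != nil
print(anywhere) // true

// .anchored: the match must begin at the start of the range, and "The" is there instead.
let anchored = regex.firstMatch(in: text, options: [.anchored], range: fullRange) != nil
print(anchored) // false

// Start the range at "cat" (UTF-16 offset 4) and the anchored search succeeds.
let fromCat = NSRange(location: 4, length: text.utf16.count - 4)
let anchoredAtCat = regex.firstMatch(in: text, options: [.anchored], range: fromCat) != nil
print(anchoredAtCat) // true
```

The same regex object is reused for all three calls, which is the point of matching options: they change one call without rebuilding the expression.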
You can also mix together most of the options listed above: NSRegularExpression.Options is a Swift option set, which means you can specify them as single items:
let regexOptions: NSRegularExpression.Options = .caseInsensitive
…or as arrays:
let regexOptions: NSRegularExpression.Options = [.caseInsensitive, .useUnicodeWordBoundaries]
Do whichever feels most natural for you.
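Because it is an option set, you can also build the value up step by step with the usual set operations. A short sketch combining two of the options from this article:

```swift
import Foundation

let testString = """
The cat
sat on
the mat
"""

// Start with one option and add another later.
var options: NSRegularExpression.Options = .caseInsensitive
options.insert(.anchorsMatchLines)

print(options.contains(.anchorsMatchLines))                 // true
print(options == [.caseInsensitive, .anchorsMatchLines])    // true

// "^THE" now matches at the start of any line, in any case.
let regex = try NSRegularExpression(pattern: "^THE", options: options)
let range = NSRange(testString.startIndex..<testString.endIndex, in: testString)
let lineStartCount = regex.numberOfMatches(in: testString, options: [], range: range)
print(lineStartCount) // 2
```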