How to use regular expressions in Swift

Match a variety of text using NSRegularExpression.

Paul Hudson March 29th 2019 @twostraws

Regular expressions allow us to run complex search and replace operations across thousands of text files in just a handful of seconds, so it’s no surprise they have been popular for over 50 years. Apple provides support for regular expressions on all of its platforms – iOS, macOS, tvOS, and even watchOS – all using the same class, NSRegularExpression. It's an extremely fast and efficient way to search and replace complex text tens of thousands of times, and it's all available for Swift developers to use.

In this tutorial you'll learn how to create regular expressions using NSRegularExpression, and how to match a variety of regular expressions using the most important syntax. For more advanced regular expression work, I've written a separate article that follows on from this one: Advanced regular expression matching with NSRegularExpression.

First, the basics

Let’s start with a couple of easy examples for folks who haven’t used regular expressions before. Regular expressions – regexes for short – are designed to let us perform fuzzy searches inside strings. For example, we know that "cat".contains("at") is true but what if we wanted to match any three-letter word that ends in “at”?

This is what regexes are designed to solve, although NSRegularExpression uses slightly clumsy syntax thanks to its Objective-C roots.

First you define the string you want to check:

let testString = "hat"

Next you create an NSRange instance that represents the full length of the string:

let range = NSRange(location: 0, length: testString.utf16.count)

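That uses the utf16 count because NSRange, like NSString, measures strings in UTF-16 code units. Using it avoids problems with emoji and other characters that take up more than one code unit.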
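To make the difference concrete, here is a small sketch comparing the two counts for a string containing an emoji. The string and constant names are just examples; the second way of building the range uses Foundation's NSRange(_:in:) initializer, which is an alternative and not something the code in this article relies on.

```swift
import Foundation

let emojiString = "hat 🎩"

// Swift counts user-visible characters, so the emoji counts once.
let characterCount = emojiString.count      // 5

// NSRange works in UTF-16 code units, where this emoji takes two.
let utf16Count = emojiString.utf16.count    // 6

// Two ways to build a range covering the whole string.
let manualRange = NSRange(location: 0, length: emojiString.utf16.count)
let convertedRange = NSRange(emojiString.startIndex..<emojiString.endIndex, in: emojiString)

print(characterCount, utf16Count, manualRange == convertedRange)
```

If you used count rather than utf16.count here, the range would stop one code unit short and the regex would never see the end of the string.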

Next you create an NSRegularExpression instance using some regex syntax:

let regex = try! NSRegularExpression(pattern: "[a-z]at")

[a-z] is regex’s way of specifying any letter from “a” through “z”. This is a throwing initializer because you might attempt to provide an invalid regular expression, but here we have hard-coded a correct regex so there’s no need to try to catch errors.

Finally, you call firstMatch(in:) on the regex you created, passing the string to search, any special options, and what range of the string to look inside. If your string matches the regex then you’ll get data back, otherwise nil. So, if you just want to check that the string matched at all, compare the result of firstMatch(in:) against nil, like this:

regex.firstMatch(in: testString, options: [], range: range) != nil

The use of NSRange is unfortunate, but sadly required right now – this API was designed for NSString and bridges poorly to Swift. The Swift String Manifesto does mention possible replacements, but these seem quite far away at the time of writing.

The regex “[a-z]at” will successfully match “hat”, as well as “cat”, “sat”, “mat”, “bat”, and so on – we focus on what we want to match, and NSRegularExpression does the rest.
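Putting those steps together, a minimal sketch that checks several words against the same regex might look like this. The word list is made up for illustration. Notice that “chat” also matches: firstMatch(in:) looks for the pattern anywhere inside the string, not only across the whole string.

```swift
import Foundation

let regex = try! NSRegularExpression(pattern: "[a-z]at")
let words = ["hat", "cat", "dog", "at", "chat"]

// For each word, ask whether the pattern appears anywhere inside it.
let results = words.map { word -> Bool in
    let range = NSRange(location: 0, length: word.utf16.count)
    return regex.firstMatch(in: word, options: [], range: range) != nil
}

for (word, matched) in zip(words, results) {
    print("\(word): \(matched)")
}
```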


Making NSRegularExpression easier to use

We’ll look at more regex syntax in a moment, but first let’s see if we can make NSRegularExpression a little friendlier.

Right now our code takes three lines of non-trivial Swift to match a simple string:

let range = NSRange(location: 0, length: testString.utf16.count)
let regex = try! NSRegularExpression(pattern: "[a-z]at")
regex.firstMatch(in: testString, options: [], range: range) != nil

We can improve this in a variety of ways, but probably the most sensible is to extend NSRegularExpression to make creating and matching expressions easier.

First, this line:

let regex = try! NSRegularExpression(pattern: "[a-z]at")

As I said, creating an instance of NSRegularExpression can throw errors because you might try to provide an illegal regular expression – [a-zat, for example, is illegal because we haven’t closed the ].

However, most of the time your regular expressions will be hard-coded by you: there are specific things you want to search for, and they are either correct or incorrect at compile time. If you make a mistake with these it’s not something you should really try to correct for at runtime, because you’re just hiding a coding error.

As a result, it’s common to create NSRegularExpression instances using try!. However, that can cause havoc with linting tools like SwiftLint, so a better idea is to create a convenience initializer that either creates a regex correctly or halts with a precondition failure naming the bad pattern, so you’ll spot the mistake as soon as that code runs while you’re developing:

extension NSRegularExpression {
    convenience init(_ pattern: String) {
        do {
            try self.init(pattern: pattern)
        } catch {
            preconditionFailure("Illegal regular expression: \(pattern).")
        }
    }
}

Note: If your app relies on regular expressions that were entered by your user, you should stick with the regular NSRegularExpression(pattern:) initializer so you can handle the inevitable errors gracefully.
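For user-entered patterns, one possible approach is to wrap the throwing initializer in a helper that reports the problem and returns nil. This is only a sketch: makeRegex(fromUserInput:) is a hypothetical name, and a real app would probably show the error in its UI and not just print it.

```swift
import Foundation

// Hypothetical helper: returns nil when the pattern can't be compiled.
func makeRegex(fromUserInput pattern: String) -> NSRegularExpression? {
    do {
        return try NSRegularExpression(pattern: pattern)
    } catch {
        // The thrown error describes what was wrong with the pattern.
        print("Invalid regular expression \(pattern): \(error.localizedDescription)")
        return nil
    }
}

let validRegex = makeRegex(fromUserInput: "[a-z]at")
let invalidRegex = makeRegex(fromUserInput: "[a-zat")

print(validRegex != nil, invalidRegex != nil)
```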

Second, these lines:

let range = NSRange(location: 0, length: testString.utf16.count)
regex.firstMatch(in: testString, options: [], range: range) != nil

The first one creates an NSRange encompassing our entire string, and the second looks for the first match in our test string. This is clumsy: most of the time you’re going to want to search the entire input string, and using firstMatch(in:) along with a nil check really muddies your intent.

So, let’s replace that with a second extension that wraps those lines up in a single matches() method:

extension NSRegularExpression {
    func matches(_ string: String) -> Bool {
        let range = NSRange(location: 0, length: string.utf16.count)
        return firstMatch(in: string, options: [], range: range) != nil
    }
}

If we put these two extensions together we can now make and check regexes much more naturally:

let regex = NSRegularExpression("[a-z]at")
regex.matches("hat")

We could take things further by using operator overloading to make Swift’s pattern matching operator, ~=, work with regular expressions:

extension String {
    static func ~= (lhs: String, rhs: String) -> Bool {
        guard let regex = try? NSRegularExpression(pattern: rhs) else { return false }
        let range = NSRange(location: 0, length: lhs.utf16.count)
        return regex.firstMatch(in: lhs, options: [], range: range) != nil
    }
}

That code lets us use any string on the left and a regex on the right, all in one:

"hat" ~= "[a-z]at"

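Note: There is a cost to creating an instance of NSRegularExpression, so if you intend to use a regex repeatedly it’s probably better to store the NSRegularExpression instance. Also keep in mind that Swift uses ~= to evaluate case patterns in switch statements, so an overload that takes two strings may change how switch statements over strings behave in the same module.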
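As a rough sketch of reusing one instance, you might create the regex once and then run it against many strings. The input list here is invented for the example.

```swift
import Foundation

// Compile the pattern once...
let storedRegex = try! NSRegularExpression(pattern: "[a-z]at")

let inputs = ["hat", "dog", "mat", "cot"]

// ...then reuse the same instance for every string we want to check.
let matching = inputs.filter { input in
    let range = NSRange(location: 0, length: input.utf16.count)
    return storedRegex.firstMatch(in: input, options: [], range: range) != nil
}

print(matching)
```

Compare that with the ~= overload shown earlier, which compiles its pattern again on every call.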

A tour of regular expression syntax

We already used [a-z] to mean “any letter from ‘a’ through ‘z’”, and in regex terms this is a character class. This lets you specify a group of letters that should be matched, either by specifically listing each of them or by using a character range.

Regex ranges don’t have to cover the full alphabet. For example, you can use [a-t] to exclude the letters “u” through “z”. On the other hand, if you want to be specific about the letters in the class just list them individually like this:

[csm]at

Regexes are case-sensitive by default, which means “Cat” and “Mat” won’t be matched by “[a-z]at”. If you want to ignore letter case then you can either use “[a-zA-Z]at” or create your NSRegularExpression object with the option .caseInsensitive.

As well as the uppercase and lowercase ranges, you can also specify digit ranges with character classes. This is most commonly [0-9] to allow any digit, or [A-Za-z0-9] to allow any alphanumeric character, but you might also use [A-Fa-f0-9] to match hexadecimal digits, for example.
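Here is a short sketch of the option in use, passing .caseInsensitive through the options parameter of the standard initializer. The constant names are just for illustration.

```swift
import Foundation

let caseSensitiveRegex = try! NSRegularExpression(pattern: "[a-z]at")
let caseInsensitiveRegex = try! NSRegularExpression(pattern: "[a-z]at", options: [.caseInsensitive])

let word = "Cat"
let range = NSRange(location: 0, length: word.utf16.count)

// The default regex rejects the uppercase "C"...
let strictResult = caseSensitiveRegex.firstMatch(in: word, options: [], range: range) != nil
// ...while the case-insensitive one accepts it.
let relaxedResult = caseInsensitiveRegex.firstMatch(in: word, options: [], range: range) != nil

print(strictResult, relaxedResult)
```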

If you want to match sequences of character classes you need a regex concept called quantification: the ability to say how many times something ought to appear.

One of the most common is the asterisk quantifier, *, which means “zero or more matches.” Quantifiers always appear after the thing they are modifying, so you might write something like this:

let regex = NSRegularExpression("ca[a-z]*d")

That looks for “ca”, then zero or more characters from “a” through “z”, then “d” – it matches “cad”, “card”, “camped”, and more.

As well as *, there are two other similar quantifiers: + and ?. If you use + it means “one or more”, which is ever so slightly different from the “zero or more” of *. And if you use ? it means “zero or one.”

These quantifiers are really fundamental in regexes, so I want to make sure you really understand the difference between them. So, consider these three regexes:

c[a-z]*d
c[a-z]+d
c[a-z]?d

I want you to look at each of those three regexes, then consider this: when given the test string “cd” what will each of those three match? What about when given the test string “camped”?

The first regex c[a-z]*d means “c then zero or more lowercase letters, then d,” so it will match both “cd” and “camped”.

The second regex c[a-z]+d means “c then one or more lowercase letters, then d,” so it won’t match “cd” but will match “camped”.

Finally, the third regex c[a-z]?d means “c then zero or one lowercase letters, then d,” so it will match “cd” but not “camped”.
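If you would like to check those answers yourself, a small sketch like this runs all three regexes against both test strings. isMatch(_:pattern:) is a hypothetical helper written for this example; it uses try! because the patterns are hard-coded.

```swift
import Foundation

// Hypothetical helper: true if the pattern is found anywhere in the string.
func isMatch(_ string: String, pattern: String) -> Bool {
    let regex = try! NSRegularExpression(pattern: pattern)
    let range = NSRange(location: 0, length: string.utf16.count)
    return regex.firstMatch(in: string, options: [], range: range) != nil
}

let quantifierPatterns = ["c[a-z]*d", "c[a-z]+d", "c[a-z]?d"]
let testStrings = ["cd", "camped"]

for pattern in quantifierPatterns {
    for string in testStrings {
        print("\(pattern) against \(string): \(isMatch(string, pattern: pattern))")
    }
}
```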

Quantifiers aren’t just restricted to character classes. For example, if you wanted to match the word “color” in both US English (“color”) and International English (“colour”), you could use the regex colou?r. That is, “match the exact string ‘colo’, then match zero or one ‘u’s, then an ‘r’.”
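As a sketch of that regex at work, this example collects every spelling found in a sentence. It uses matches(in:options:range:), which returns all matches and isn't covered above, and converts each NSRange back to a Swift range with Range(_:in:). The sample sentence is made up.

```swift
import Foundation

let text = "The color red and the colour blue."
let colorRegex = try! NSRegularExpression(pattern: "colou?r")
let textRange = NSRange(location: 0, length: text.utf16.count)

// Find every match, then pull the matched substring out of the text.
let foundSpellings: [String] = colorRegex
    .matches(in: text, options: [], range: textRange)
    .compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }

print(foundSpellings)
```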

If you prefer, you can also be more specific about your quantities: “I want to match exactly three characters.” This is done using braces, { and }. For example [a-z]{3} means “match exactly three lowercase letters.”

Consider a phone number formatted like this: 111-1111. We want to match only that format, and not “11-11”, “1111-111”, or “11111111111”, which means a regex like [0-9-]+ would be insufficient. Instead, we need a regex like this: [0-9]{3}-[0-9]{4}: precisely three digits, then a dash, then precisely four digits.

You can also use braces to specify ranges, either bounded or unbounded. For example, [a-z]{1,3} means “match one, two, or three lowercase letters”, and [a-z]{3,} means “match at least three, but potentially any number more.”
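To see the brace syntax in action, this sketch runs the phone number regex against the four example strings from above. The constant names are just for illustration.

```swift
import Foundation

let phoneRegex = try! NSRegularExpression(pattern: "[0-9]{3}-[0-9]{4}")
let candidates = ["111-1111", "11-11", "1111-111", "11111111111"]

// Only the first candidate has three digits, a dash, then four digits.
let phoneResults = candidates.map { candidate -> Bool in
    let range = NSRange(location: 0, length: candidate.utf16.count)
    return phoneRegex.firstMatch(in: candidate, options: [], range: range) != nil
}

for (candidate, matched) in zip(candidates, phoneResults) {
    print("\(candidate): \(matched)")
}
```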

Finally, meta-characters are special characters that regexes give extra meaning, and at least three of them are used extensively.

First up, the most used, most overused, and most abused of these is the . character – a period – which will match any single character except a line break. So, the regex c.t will match “cat” but not “cart”. If you use . with the * quantifier it means “match zero or more of anything that isn’t a line break,” which is probably the most common regex you’ll come across.

The reason for its use should be easy to recognize: you don’t need to worry about crafting a specific regex, because .* will match almost everything. The problem is, being specific is sort of the point of regular expressions: you can search for precise variations of text in order to apply some processing – too many people rely on .* as a crutch, without realizing it can introduce subtle mistakes into their expressions.

As an example, consider the regex we wrote to match phone numbers like 555-5555: [0-9]{3}-[0-9]{4}. You might think “maybe some people will write ‘555 5555’ or ‘5555555’”, and try to make your regex looser by using .* instead, like this: [0-9]{3}.*[0-9]{4}.

But now you have a problem: that will match “123-4567”, “123-4567890”, and even “123-456-789012345”. In the first instance, the .* will match “-”; in the second, it will match “-456”; and in the third it will match “-456-78901” – it will take as much as needed in order for the [0-9]{3} and [0-9]{4} matches to work.
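You can see exactly what .* swallows with a sketch like this one. It wraps .* in parentheses to form a capture group, which is a regex feature this article doesn't cover, and reads the group back with range(at:). capturedMiddle(of:) is a hypothetical helper name.

```swift
import Foundation

// The parentheses capture whatever .* ends up matching.
let greedyRegex = try! NSRegularExpression(pattern: "[0-9]{3}(.*)[0-9]{4}")

// Hypothetical helper: returns the text consumed by .*, or nil if there was no match.
func capturedMiddle(of string: String) -> String? {
    let range = NSRange(location: 0, length: string.utf16.count)
    guard let result = greedyRegex.firstMatch(in: string, options: [], range: range),
          let captured = Range(result.range(at: 1), in: string) else {
        return nil
    }
    return String(string[captured])
}

for number in ["123-4567", "123-4567890", "123-456-789012345"] {
    print("\(number): \(capturedMiddle(of: number) ?? "no match")")
}
```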

Instead, you can use character classes with quantifiers. For example, [0-9]{3}[ -]*[0-9]{4} means “find three digits, then zero or more spaces and dashes, then four digits.” You can also use negated character classes to match anything that isn’t a digit, so [0-9]{3}[^0-9]+[0-9]{4} will accept a space, a dash, a slash, and more between the two groups of digits – but it won’t accept extra numbers there. Because that version uses +, it needs at least one separator to be present.
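Here is a sketch comparing those two looser regexes against a handful of inputs. The inputs and the separatorMatch(_:pattern:) helper are made up for this example.

```swift
import Foundation

// Hypothetical helper: true if the pattern is found anywhere in the string.
func separatorMatch(_ string: String, pattern: String) -> Bool {
    let regex = try! NSRegularExpression(pattern: pattern)
    let range = NSRange(location: 0, length: string.utf16.count)
    return regex.firstMatch(in: string, options: [], range: range) != nil
}

let spacesAndDashes = "[0-9]{3}[ -]*[0-9]{4}"   // zero or more spaces or dashes
let anyNonDigit = "[0-9]{3}[^0-9]+[0-9]{4}"     // one or more non-digits

for input in ["555-5555", "555 5555", "5555555", "555/5555"] {
    let first = separatorMatch(input, pattern: spacesAndDashes)
    let second = separatorMatch(input, pattern: anyNonDigit)
    print("\(input): spaces and dashes \(first), any non-digit \(second)")
}
```

The first regex accepts “5555555” because * allows zero separators but rejects the slash, while the second accepts the slash but rejects “5555555” because + demands at least one separator.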

Where next?

This has been a brief introduction to regular expressions, but I hope it’s given you a taste of what they are capable of – and how to make them a little friendlier in Swift.

If you’re keen to learn more about regular expression functionality – including grouping, lookahead, and lazy matching – you should check out my book Beyond Code. It covers all those and a heck of a lot more, and includes a massive collection of videos demonstrating techniques.

I also have an article that goes into more detail on how the options for NSRegularExpression work: Advanced regular expression matching with NSRegularExpression.
