How to use regular expressions in Swift

Paul Hudson       @twostraws

Regular expressions allow us to run complex search and replace operations across thousands of text files in just a handful of seconds, so it’s no surprise they have been popular for over 50 years.

In fact, I consider regular expressions such an important skill for developers that they form one of four components in a book I wrote to help teach critical meta-coding skills, along with Unix commands, Git source control, and Scrum.

You can find out more about that book here, but in this article I want to give you a primer in using regular expressions in Swift. I wrote about the basic technique previously, and here we're going to go further and walk through some of the most important regular expression syntax along with some useful extensions to make them more convenient.

First, the basics

Let’s start with a couple of easy examples for folks who haven’t used regular expressions before. Regular expressions – regexes for short – are designed to let us perform fuzzy searches inside strings. For example, we know that "cat".contains("at") is true but what if we wanted to match any three-letter word that ends in “at”?

This is what regexes are designed to solve, although in Swift the API for using them is slightly clumsy thanks to its Objective-C roots.

First you define the string you want to check:

let testString = "hat"

Next you create an NSRange instance that represents the full length of the string:

let range = NSRange(location: 0, length: testString.utf16.count)

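That uses the utf16 count because NSRange measures length in UTF-16 code units, which avoids problems with emoji and similar characters that count as one character in Swift but take up more than one code unit.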
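To see why the UTF-16 count matters, here is a small sketch comparing the two counts for a string containing an emoji. The greeting string is just an illustration, and the NSRange(_:in:) initializer shown at the end is an alternative way to build the same range that is not used elsewhere in this article.

```swift
import Foundation

let greeting = "hi 👍"

// Swift counts user-visible characters: h, i, space, and the emoji.
print(greeting.count)        // 4

// NSRange counts UTF-16 code units, and this emoji needs two of them.
print(greeting.utf16.count)  // 5

// Alternative: let Foundation convert a Swift range to an NSRange.
let fullRange = NSRange(greeting.startIndex..., in: greeting)
print(fullRange.length)      // 5
```

If you built the range from greeting.count instead, the regex would stop one code unit short and never see the whole emoji.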

Next you create an NSRegularExpression instance using some regex syntax:

let regex = try! NSRegularExpression(pattern: "[a-z]at")

[a-z] is regex’s way of specifying any letter from “a” through “z”. This is a throwing initializer because you might attempt to provide an invalid regular expression, but here we have hard-coded a correct regex so there’s no need to try to catch errors.

Finally, you call firstMatch(in:) on the regex you created, passing the string to search, any special options, and what range of the string to look inside. If your string matches the regex then you’ll get data back, otherwise nil. So, if you just want to check that the string matched at all, compare the result of firstMatch(in:) against nil, like this:

regex.firstMatch(in: testString, options: [], range: range) != nil

The use of NSRange is unfortunate, but sadly required right now – this API was designed for NSString and bridges poorly to Swift. The Swift String Manifesto does mention possible replacements, but these seem quite far away at the time of writing.

The regex “[a-z]at” will successfully match “hat”, as well as “cat”, “sat”, “mat”, “bat”, and so on – we focus on what we want to match, and NSRegularExpression does the rest.
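Putting those steps together, a complete version you could paste into a playground might look like the sketch below. The second test string, "dog", is my own addition to show a failed match.

```swift
import Foundation

// The pattern is hard-coded and known to be valid, so try! is acceptable here.
let regex = try! NSRegularExpression(pattern: "[a-z]at")

let testString = "hat"
let range = NSRange(location: 0, length: testString.utf16.count)
let didMatch = regex.firstMatch(in: testString, options: [], range: range) != nil
print(didMatch) // true

// A string with no lowercase letter followed by "at" does not match.
let otherString = "dog"
let otherRange = NSRange(location: 0, length: otherString.utf16.count)
let otherMatch = regex.firstMatch(in: otherString, options: [], range: otherRange) != nil
print(otherMatch) // false
```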

Making NSRegularExpression easier to use

We’ll look at more regex syntax in a moment, but first let’s see if we can make NSRegularExpression a little friendlier.

Right now our code takes three lines of non-trivial Swift to match a simple string:

let range = NSRange(location: 0, length: testString.utf16.count)
let regex = try! NSRegularExpression(pattern: "[a-z]at")
regex.firstMatch(in: testString, options: [], range: range) != nil

We can improve this in a variety of ways, but probably the most sensible is to extend NSRegularExpression to make creating and matching expressions easier.

First, this line:

let regex = try! NSRegularExpression(pattern: "[a-z]at")

As I said, creating an instance of NSRegularExpression can throw errors because you might try to provide an illegal regular expression – [a-zat, for example, is illegal because we haven’t closed the ].

However, most of the time your regular expressions will be hard-coded by you: there are specific things you want to search for, and they are either correct or incorrect at compile time. If you make a mistake with these it’s not something you should really try to correct for at runtime, because you’re just hiding a coding error.

As a result, it’s common to create NSRegularExpression instances using try!. However, that can cause havoc with linting tools like SwiftLint, so a better idea is to create a convenience initializer that either creates a regex correctly or stops with a precondition failure that names the bad pattern:

extension NSRegularExpression {
    convenience init(_ pattern: String) {
        do {
            try self.init(pattern: pattern)
        } catch {
            preconditionFailure("Illegal regular expression: \(pattern).")
        }
    }
}

Note: If your app relies on regular expressions that were entered by your user, you should stick with the regular NSRegularExpression(pattern:) initializer so you can handle the inevitable errors gracefully.
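As a rough sketch of what handling those errors gracefully might look like, the function below wraps the throwing initializer in do/catch. The name userPatternMatches(_:in:) is hypothetical and not part of Foundation; it returns nil when the pattern is invalid so the caller can tell "bad pattern" apart from "no match".

```swift
import Foundation

// Hypothetical helper: returns true/false for a valid pattern,
// or nil if the user-supplied pattern could not be compiled.
func userPatternMatches(_ pattern: String, in text: String) -> Bool? {
    do {
        let regex = try NSRegularExpression(pattern: pattern)
        let range = NSRange(location: 0, length: text.utf16.count)
        return regex.firstMatch(in: text, options: [], range: range) != nil
    } catch {
        // In a real app you would show this to the user rather than print it.
        print("Invalid regular expression: \(error.localizedDescription)")
        return nil
    }
}

let valid = userPatternMatches("[a-z]at", in: "hat")   // Optional(true)
let invalid = userPatternMatches("[a-zat", in: "hat")  // nil, the ] is missing
```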

Second, these lines:

let range = NSRange(location: 0, length: testString.utf16.count)
regex.firstMatch(in: testString, options: [], range: range) != nil

The first one creates an NSRange encompassing our entire string, and the second looks for the first match in our test string. This is clumsy: most of the time you’re going to want to search the entire input string, and using firstMatch(in:) along with a nil check really muddies your intent.

So, let’s replace that with a second extension that wraps those lines up in a single matches() method:

extension NSRegularExpression {
    func matches(_ string: String) -> Bool {
        let range = NSRange(location: 0, length: string.utf16.count)
        return firstMatch(in: string, options: [], range: range) != nil
    }
}

If we put these two extensions together we can now make and check regexes much more naturally:

let regex = NSRegularExpression("[a-z]at")
regex.matches("hat")

We could take things further by using operator overloading to make Swift’s pattern match operator, ~=, work with regular expressions:

extension String {
    static func ~= (lhs: String, rhs: String) -> Bool {
        guard let regex = try? NSRegularExpression(pattern: rhs) else { return false }
        let range = NSRange(location: 0, length: lhs.utf16.count)
        return regex.firstMatch(in: lhs, options: [], range: range) != nil
    }
}

That code lets us use any string on the left and a regex on the right, all in one:

"hat" ~= "[a-z]at"

Note: There is a cost to creating an instance of NSRegularExpression, so if you intend to use a regex repeatedly it’s probably better to store the NSRegularExpression instance.
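For example, if you need to check a whole array of strings, one approach is to create the regex once and reuse it inside the loop. This is only a sketch, and the words array is made up for illustration.

```swift
import Foundation

let words = ["hat", "dog", "cat", "Mat", "bat"]

// Create the regex once, outside the loop, rather than once per word.
let regex = try! NSRegularExpression(pattern: "[a-z]at")

let matching = words.filter { word in
    let range = NSRange(location: 0, length: word.utf16.count)
    return regex.firstMatch(in: word, options: [], range: range) != nil
}

print(matching) // ["hat", "cat", "bat"]
```

Using "hat" ~= "[a-z]at" inside that closure would work too, but it would build a new NSRegularExpression for every word.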

A tour of regular expression syntax

We already used [a-z] to mean “any letter from ‘a’ through ‘z’”, and in regex terms this is a character class. This lets you specify a group of letters that should be matched, either by specifically listing each of them or by using a character range.

Regex ranges don’t have to be the full alphabet if you prefer. You can use [a-t] to exclude the letters “u” through “z”. On the other hand, if you want to be specific about the letters in the class just list them individually like this:

[csm]at

Regexes are case-sensitive by default, which means “Cat” and “Mat” won’t be matched by “[a-z]at”. If you want to ignore letter case you can either use “[a-zA-Z]at” or create your NSRegularExpression object with the .caseInsensitive option.
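Here is a short sketch of that difference, using the options parameter of the standard NSRegularExpression(pattern:options:) initializer:

```swift
import Foundation

let word = "Cat"
let range = NSRange(location: 0, length: word.utf16.count)

// Default behavior: the uppercase "C" is not in [a-z], so there is no match.
let caseSensitive = try! NSRegularExpression(pattern: "[a-z]at")
let strictResult = caseSensitive.firstMatch(in: word, options: [], range: range) != nil
print(strictResult) // false

// With .caseInsensitive, [a-z] also accepts uppercase letters.
let caseInsensitive = try! NSRegularExpression(pattern: "[a-z]at", options: [.caseInsensitive])
let relaxedResult = caseInsensitive.firstMatch(in: word, options: [], range: range) != nil
print(relaxedResult) // true
```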

As well as the uppercase and lowercase ranges, you can also specify digit ranges with character classes. This is most commonly [0-9] to allow any digit, or [A-Za-z0-9] to allow any alphanumeric character, but you might also use [A-Fa-f0-9] to match hexadecimal digits, for example.

If you want to match sequences of character classes you need a regex concept called quantification: the ability to say how many times something ought to appear.

One of the most common is the asterisk quantifier, *, which means “zero or more matches.” Quantifiers always appear after the thing they are modifying, so you might write something like this:

let regex = NSRegularExpression("ca[a-z]*d")

That looks for “ca”, then zero or more characters from “a” through “z”, then “d” – it matches “cad”, “card”, “camped”, and more.

As well as *, there are two other similar quantifiers: + and ?. If you use + it means “one or more”, which is ever so slightly different from the “zero or more” of *. And if you use ? it means “zero or one.”

These quantifiers are really fundamental in regexes, so I want to make sure you really understand the difference between them. So, consider these three regexes:

  1. ca[a-z]*d
  2. ca[a-z]+d
  3. ca[a-z]?d

I want you to look at each of those three regexes, then consider this: when given the test string “cad” what will each of those three match? What about when given the test string “camped”?

The first regex ca[a-z]*d means “ca then zero or more lowercase letters, then d,” so it will match both “cad” and “camped”.

The second regex ca[a-z]+d means “ca then one or more lowercase letters, then d,” so it won’t match “cad” but will match “camped”.

Finally, the third regex ca[a-z]?d means “ca then zero or one lowercase letters, then d,” so it will match “cad” but not “camped”.
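If you want to check the three quantifiers yourself, a sketch like the one below runs each regex against the test strings “cad” and “camped”. The free function matches(_:_:) is a hypothetical helper written for this example so the block stands alone.

```swift
import Foundation

// Hypothetical helper: true if the pattern matches anywhere in the string.
func matches(_ pattern: String, _ string: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return false }
    let range = NSRange(location: 0, length: string.utf16.count)
    return regex.firstMatch(in: string, options: [], range: range) != nil
}

for pattern in ["ca[a-z]*d", "ca[a-z]+d", "ca[a-z]?d"] {
    print(pattern, matches(pattern, "cad"), matches(pattern, "camped"))
}
// ca[a-z]*d true true
// ca[a-z]+d false true
// ca[a-z]?d true false
```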

Quantifiers aren’t just restricted to character classes. For example, if you wanted to match the word “color” in both US English (“color”) and International English (“colour”), you could use the regex colou?r. That is, “match the exact string ‘colo’, then match zero or one ‘u’s, then an ‘r’.”

If you prefer, you can also be more specific about your quantities: “I want to match exactly three characters.” This is done using braces, { and }. For example [a-z]{3} means “match exactly three lowercase letters.”

Consider a phone number formatted like this: 111-1111. We want to match only that format, and not “11-11”, “1111-111”, or “11111111111”, which means a regex like [0-9-]+ would be insufficient. Instead, we need a regex like this: [0-9]{3}-[0-9]{4}, meaning precisely three digits, then a dash, then precisely four digits.

You can also use braces to specify ranges, either bounded or unbounded. For example, [a-z]{1,3} means “match one, two, or three lowercase letters”, and [a-z]{3,} means “match at least three, but potentially any number more.”
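As a sketch of the phone number example, the code below runs both the loose regex and the braces version against the four strings mentioned above. Again, matches(_:_:) is a hypothetical helper defined only for this example.

```swift
import Foundation

// Hypothetical helper: true if the pattern matches anywhere in the string.
func matches(_ pattern: String, _ string: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return false }
    let range = NSRange(location: 0, length: string.utf16.count)
    return regex.firstMatch(in: string, options: [], range: range) != nil
}

let candidates = ["111-1111", "11-11", "1111-111", "11111111111"]

// Too loose: any run of digits and dashes is accepted.
let loose = candidates.filter { matches("[0-9-]+", $0) }
print(loose)  // ["111-1111", "11-11", "1111-111", "11111111111"]

// Exactly three digits, a dash, then exactly four digits.
let strict = candidates.filter { matches("[0-9]{3}-[0-9]{4}", $0) }
print(strict) // ["111-1111"]
```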

Finally, meta-characters are special characters that regexes give extra meaning, and at least three of them are used extensively.

First up, the most used, most overused, and most abused of these is the . character – a period – which will match any single character except a line break. So, the regex c.t will match “cat” but not “cart”. If you use . with the * quantifier it means “match zero or more of anything that isn’t a line break,” which is probably the most common regex you’ll come across.

The reason for its use should be easy to recognize: you don’t need to worry about crafting a specific regex, because .* will match almost everything. The problem is, being specific is sort of the point of regular expressions: you can search for precise variations of text in order to apply some processing – too many people rely on .* as a crutch, without realizing it can introduce subtle mistakes into their expressions.

As an example, consider the regex we wrote to match phone numbers like 555-5555: [0-9]{3}-[0-9]{4}. You might think “maybe some people will write ‘555 5555’ or ‘5555555’”, and try to make your regex looser by using .* instead, like this: [0-9]{3}.*[0-9]{4}.

But now you have a problem: that will match “123-4567”, “123-4567890”, and even “123-456-789012345”. In the first instance, the .* will match “-”; in the second, it will match “-456”; and in the third it will match “-456-78901” – it will take as much as needed in order for the [0-9]{3} and [0-9]{4} matches to work.

Instead, you can use character classes with quantifiers, or negated character classes. For example, [0-9]{3}[ -]*[0-9]{4} means “find three digits, then zero or more spaces and dashes, then four digits.” You can also use a negated character class to match anything that isn’t a digit, so [0-9]{3}[^0-9]+[0-9]{4} will accept a space, a dash, a slash, and more between the two groups of digits – but it won’t accept numbers there.
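To make the difference visible, the sketch below prints the text each regex actually matched rather than just true or false. The firstMatchText(_:in:) function is a hypothetical helper for this example; it converts the match’s NSRange back to a Swift range using Range(_:in:).

```swift
import Foundation

// Hypothetical helper: returns the text of the first match, or nil if
// the pattern is invalid or nothing in the string matches.
func firstMatchText(_ pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let range = NSRange(location: 0, length: text.utf16.count)
    guard let match = regex.firstMatch(in: text, options: [], range: range),
          let swiftRange = Range(match.range, in: text) else { return nil }
    return String(text[swiftRange])
}

// .* swallows everything between the first three and last four digits.
print(firstMatchText("[0-9]{3}.*[0-9]{4}", in: "123-456-789012345") ?? "no match")
// 123-456-789012345

// [ -]* only allows spaces and dashes between the two groups.
print(firstMatchText("[0-9]{3}[ -]*[0-9]{4}", in: "555 5555") ?? "no match")
// 555 5555
print(firstMatchText("[0-9]{3}[ -]*[0-9]{4}", in: "5555555") ?? "no match")
// 5555555

// [^0-9]+ accepts any non-digit separator, but needs at least one.
print(firstMatchText("[0-9]{3}[^0-9]+[0-9]{4}", in: "555/5555") ?? "no match")
// 555/5555
print(firstMatchText("[0-9]{3}[^0-9]+[0-9]{4}", in: "5555555") ?? "no match")
// no match
```

Note the last result: because + means “one or more”, the negated class version rejects “5555555”, so pick * or + depending on whether a separator is required.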

Where next?

This has been a brief introduction to regular expressions, but I hope it’s given you a taste of what they are capable of – and how to make them a little friendlier in Swift.

If you’re keen to learn more about regular expression functionality – including grouping, lookahead, and lazy matching – you should check out my book Beyond Code. It covers all those and a heck of a lot more, and includes a massive collection of videos demonstrating techniques.

 


About the author

Paul Hudson is the creator of Hacking with Swift, the most comprehensive series of Swift books in the world. He's also the editor of Swift Developer News, the maintainer of the Swift Knowledge Base, and Mario Kart world champion. OK, so that last part isn't true. If you're curious you can learn more here.
