Swift 4.2 improves Hashable with a new Hasher struct

The new implementation is faster, simpler, and more secure too.

Paul Hudson March 29th 2019 @twostraws

Swift 4.2 improved the Hashable protocol to make it faster, simpler, and more secure, all at the same time. If you currently generate hashes by hand you will almost certainly want to migrate to the new API, but even if you don’t you might still find the behavior of your code changes.

Hashable is one of Swift’s most important protocols, but it often goes unnoticed. You use it every time you create dictionaries or sets, for example, because those types create hashes of your data to figure out where it should be stored.

Before Swift 4.1, conforming to Hashable was complex because you needed to calculate a hashValue property by hand. In Swift 4.1 this improved so that hashValue could be synthesized on your behalf. So, it turned this kind of code:

struct Jedi: Hashable {
    var name: String
    var darkSideUsage: Int

    var hashValue: Int {
        return name.hashValue ^ darkSideUsage.hashValue &* 16777619
    }
}

Into this:

struct Jedi: Hashable {
    var name: String
    var darkSideUsage: Int
}

However, if you wanted your own hashing implementation – for example, if your type had many properties but you knew that one of them was enough to identify it uniquely – you still needed to implement a hashValue method using whatever algorithm you thought was best.

Swift 4.2 improved this situation further, thanks to its implementation of SE-0206: Hashable Enhancements. This introduces a new Hasher struct that provides a randomly seeded, universal hash function to make all our lives easier.

If you’re just relying on automatic synthesis of Hashable your code won’t change at all, although the end result might – more on that later. However, if you’re generating your own hashes then your code gets a lot simpler:

struct Jedi: Hashable {
    var name: String
    var darkSideUsage: Int

    func hash(into hasher: inout Hasher) {
        hasher.combine(name)
        hasher.combine(darkSideUsage)
    }
}

You can also use Hasher as a standalone hash generator: just provide it with whatever values you want to hash, then call finalize() to generate the final value. For example:

let obiwan = Jedi(name: "Obi-wan Kenobi", darkSideUsage: 0)
let anakin = Jedi(name: "Anakin Skywalker", darkSideUsage: 90)

var hasher = Hasher()
hasher.combine(obiwan)
hasher.combine(anakin)
let hash = hasher.finalize()

Note: You must not call finalize() if you’re writing a custom hash(into:) method for your types.

Sponsor Hacking with Swift and reach the world's largest Swift community!

Hash flooding and you

Now, all this might seem academic because the end result is a hash value integer just like we had before, but there is one important difference that might break your code: the Hasher struct uses a random seed to generate its hash values, which means that a given object’s hash value will change between runs of your program.

Random hash values are an important feature, because having a predictable hash value gives a potential opening to attackers. In particular, there’s a well-known attack called hash flooding: if I know your hashing algorithm ahead of time, I can easily generate a lot of different objects that have the same hash.

It’s helpful to look at a real example, albeit a deliberately simplified one. We might decide that our Jedi struct should be hashed based just on someone’s name, like this:

struct Jedi: Hashable {
    var name: String
    var darkSideUsage: Int

    var hashValue: Int {
        return name.hashValue &* 16777619
    }
}

An attacker could look at that code and realize they can generate any number of Jedi objects using the same name and a different darkSideUsage property. These objects are unique enough to added multiple times to the same set or dictionary, but similar enough to generate the same hash.

For example:

for _ in 1...100 {
    set.insert(Jedi(name: "Obi-wan Kenobi", darkSideUsage: Int(arc4random_uniform(UInt32.max))))
}

That will generate 100 Jedi called Obi-wan Kenobi, each with their own random darkSideUsage property.

Now, this matters because of how sets and dictionaries work. It’s a common misconception that hash values must always be unique, but that isn’t the case – it’s strongly preferable that they be unique, but sets and dictionaries are perfectly capable of holding two different objects that have the same hash value.

In our case, we just inserted 100 unique Jedi objects with the same hash value – I did a quick benchmark, and it took 0.002 seconds on my MacBook Pro. But what if were to insert 10,000 elements in the set – how long would that take?

You might guess 0.002 x 100, i.e. 0.2 seconds, but you’d be wrong: it actually takes 1.92 seconds. And inserting 30,000 of these objects takes 21.48 seconds – that’s 11 times longer for 3 times the objects.

A regular set or dictionary has O(1) look up and insertion, which means it can insert new objects and find them again at a constant speed regardless of how many objects you insert. Here, though, we have flooded our set with identical hashes – hence the name “hash flood attack” – which means that the O(1) behavior is no longer possible, and instead we get O(n²) performance. This is what causes the massive time lag: the set can no longer insert objects quickly because they collide.

Now imagine someone inserting 100,000 pseudo-identical objects – this is the kind of thing that could crash Swift apps or take down Swift-based web servers. This is one of the problems that Swift 4.2’s Hashable struct solves: it uses a pseudorandom function called SipHash that was specifically designed so that attackers can no longer predict what the hash value for a given object will be ahead of time.

Now that you understand the basics of why hash flooding is dangerous, I want to give you a slightly more real-world example so you can see the threat is real.

After a busy day of using the force, Jedi naturally want to go home and watch Netflix just like everyone else. So, we’re going to update the Jedi struct so they have an address:

struct Jedi: Hashable {
    var name: String
    var darkSideUsage: Int
    var address1: String
    var address2: String
    var city: String
    var state: String
    var postCode: String
    var country: String

    var hashValue: Int {
        return name.hashValue ^
            address1.hashValue ^
            address2.hashValue ^
            city.hashValue ^
            state.hashValue ^
            postCode.hashValue ^
            country.hashValue &* 16777619
    }
}

As you can see, that generates the hash value for the Jedi using their name and each of their address properties.

However, that hash algorithm is trivial for attackers to flood: they can just create Jedi instances with the same text strings in different places, like this:

let test1 = Jedi(name: "test1", darkSideUsage: 0, address1: "1", address2: "2", city: "3", state: "4", postCode: "5", country: "6")
let test2 = Jedi(name: "test1", darkSideUsage: 0, address1: "2", address2: "1", city: "3", state: "4", postCode: "5", country: "6")
let test3 = Jedi(name: "test1", darkSideUsage: 0, address1: "3", address2: "2", city: "1", state: "4", postCode: "5", country: "6")

That first Jedi uses 1, 2, 3 for its first three address values, the second uses 2, 1, 3, and the third uses 3, 2, 1. Because the actual data being stored is the same, the hash value will also be the same – it doesn’t matter that they appear in different properties, because a ^ b ^ c yields identical results to c ^ a ^ b.

With seven string properties there are over 5000 such combinations of our properties. If you use eight properties for your hash there are over 40,000, and if you use ten you’ll have over 3.5 million – all trivial for attackers to figure out, because it’s just using the same strings in different places.

The new Hasher protocol avoids this by maintaining a hash state. This is why you generate hashes using a combine() method: each time you add a new property into your hash, it’s combined with whatever the previous state was, ensuring that the order you add things matters.

So, we could rewrite our hashing code to use the new Swift 4.2 hash(into:) method, like this:

func hash(into hasher: inout Hasher) {
    hasher.combine(name)
    hasher.combine(address1)
    hasher.combine(address2)
    hasher.combine(city)
    hasher.combine(state)
    hasher.combine(postCode)
    hasher.combine(country)
}

And now each Jedi will generate completely different hash values even though the strings they store are the same.

Side effects of randomness

As you’ve seen, Hasher introduces a random seed every time it hashes an object. This change means the hash value for any object is effectively guaranteed to be different between runs of your app.

The documentation for the Hashable protocol has always warned that this might be the case:

Hash values are not guaranteed to be equal across different executions of your program. Do not save hash values to use during a future execution.

In Swift 4.2 and later this is now definitely the case, and this has an important side effect: elements that you add to a set or a dictionary are highly likely to have a different order each time you run your app.

Now, realistically you should never have relied on sets and dictionaries having the same order between runs, but if you did then now is the time to update your code.

That being said, you can force Swift to use predictable hashing for testing purposes only: set the environment variable SWIFT_DETERMINISTIC_HASHING to 1, and Swift will replace its random hashing seed with a constant value. This will not work in production, but it could in theory be helpful for some very specific tests.

If you’re excited by the Hasher struct, you should read my article detailing all the other new features in Swift 4.2 – there’s a lot of them!

Sponsor Hacking with Swift and reach the world's largest Swift community!