On-device machine learning went from “extremely hard to do” to “quite possible, and surprisingly powerful” in iOS 11, all thanks to one Apple framework: Core ML. A year later, Apple introduced a second framework called Create ML, which added “easy to do” to the list, and then a second year later Apple introduced a Create ML app that made the whole process drag and drop. As a result of all this work, it’s now within the reach of anyone to add machine learning to their app.
Core ML is capable of handling a variety of training tasks, such as recognizing images, sounds, and even motion, but in this instance we’re going to look at tabular regression. That’s a fancy name, which is common in machine learning, but all it really means is that we can throw a load of spreadsheet-like data at Create ML and ask it to figure out the relationship between various values.
Machine learning is done in two steps: we train the model, then we ask the model to make predictions. Training is the process of the computer looking at all our data to figure out the relationship between all the values we have, and in large data sets it can take a long time – easily hours, potentially much longer. Prediction is done on device: we feed it the trained model, and it will use previous results to make estimates about new data.
Let’s start the training process now: please open the Create ML app on your Mac. If you don’t know where this is, you can launch it from Xcode by going to the Xcode menu and choosing Open Developer Tool > Create ML.
The first thing the Create ML app will do is ask you to create a project or open a previous one – please click New Document to get started. You’ll see there are lots of templates to choose from, but if you scroll down to the bottom you’ll see Tabular Regressor; please choose that and press Next. For the project name please enter BetterRest, then press Next, select your desktop, then press Create.
This is where Create ML can seem a little tricky at first, because you’ll see a screen with quite a few options. Don’t worry, though – once I walk you through it isn’t so hard.
The first step is to provide Create ML with some training data. This is the raw statistics for it to look at, which in our case consists of four values: when someone wanted to wake up, how much sleep they thought they liked to have, how much coffee they drink per day, and how much sleep they actually need.
I’ve provided this data for you in BetterRest.csv, which is in the project files for this project. This is a comma-separated values data set that Create ML can work with, and our first job is to import that.
So, in Create ML look under Data Inputs and select Choose under the Training Data title. When you press Select File it will open a file selection window, and you should choose BetterRest.csv.
Important: This CSV file contains sample data for the purpose of this project, and should not be used for actual health-related work.
The next job is to decide the target, which is the value we want the computer to learn to predict, and the features, which are the values we want the computer to inspect in order to predict the target. For example, if we chose how much sleep someone thought they needed and how much sleep they actually needed as features, we could train the computer to predict how much coffee they drink.
In this instance, I’d like you to choose “actualSleep” for the target, which means we want the computer to learn how to predict how much sleep they actually need. Now press Select Features, and select all three options: wake, estimatedSleep, and coffee – we want the computer to take all three of those into account when producing its predictions.
Below the Select Features button is a dropdown button for the algorithm, and there are five options: Automatic, Random Forest, Boosted Tree, Decision Tree, and Linear Regression. Each takes a different approach to analyzing data, and although this isn’t a book about machine learning I want to explain what they do briefly.
Linear regressors are the easiest to understand, because it’s pretty much exactly how our brain works. They attempt to estimate the relationships between your variables by considering them as part of a linear function such as
applyAlgorithm(var1, var2, var3). The goal of a linear regression is to be able to draw one straight line through all your data points, where the average distance between the line and each data point is as small as possible.
Decision tree regressors form a natural tree structure letting us organize information as a series of choices. Try to envision this almost like a game of 20 questions: “are you a person or an animal? If you’re a person, are you alive or dead? If you’re alive, are you young or old?” And so on – each time the tree can branch off depending on the answer to each question, until eventually there’s a definitive answer.
Boosted tree regressors work using a series of decision trees, where each tree is designed to correct any errors in the previous tree. For example, the first decision tree takes its best guess at finding a good prediction, but it’s off by 20%. This then gets passed to a second decision tree for further refinement, and the process repeats – this time, though, the error is down to 10%. That goes into a third tree where the error comes down to 8%, and a fourth tree where the error comes down to 7%.
The random forest model is similar to boosted trees, but with a slight difference: with boosted trees every decision in the tree is made with access to all available data, whereas with random trees each tree has access to only a subset of data.
This might sound bizarre: why would you want to withhold data? Well, imagine you were facing a coding problem and trying to come up with a solution. If you ask a colleague for ideas, they will give you some based on what they know. If you ask a different colleague for ideas, they are likely to give you different ideas based on what they know. And if you asked a hundred colleagues for ideas, you’ll get a range of solutions.
Each of your colleagues will have a different background, different education, and a different job history than the others, which is why you get a range of suggestions. But if you average out the suggestions across everyone – go with whatever most people say, regardless of what led them to that decision – you have the best chance of getting the right solution.
This is exactly how random forest regressors work: each decision tree has its own view of your data that’s different to other trees, and by combining all their predictions together to make an average you stand a great chance of getting a strong result.
Helpfully, there is an Automatic option that attempts to choose the best algorithm automatically. It’s not always correct, and in fact it does limit the options we have quite dramatically, but for this project it’s more than good enough.
When you’re ready, click the Train button in the window title bar. After a couple of seconds – our data is pretty small! – you’ll see some result metrics appear. The value we care about is called Root Mean Squared Error, and you should get a value around about 180. This means on average the model was able to predict suggested accurate sleep time with an error of only 180 seconds, or three minutes.
Even better, if you look in the top-right corner you’ll see an MLModel icon saying “Output” and it has a file size of 438 bytes or so. Create ML has taken 180KB of data, and condensed it down to just 438 bytes – almost nothing.
Now, 438 bytes sounds tiny, I know, but it’s worth adding that almost all of those bytes are metadata: the author name is in there, as is the default description of “A machine learning model that has been trained for regression”. It even encodes the names of all the fields: wake, estimatedSleep, coffee, and actualSleep.
The actual amount of space taken up by the hard data – how to predict the amount of required sleep based on our three variables – is well under 100 bytes. This is possible because Create ML doesn’t actually care what the values are, it only cares what the relationships are. So, it spent a couple of billion CPU cycles trying out various combinations of weights for each of the features to see which ones produce the closest value to the actual target, and once it knows the best algorithm it simply stores that.
Now that our model is trained, I’d like you to drag that icon from Create ML to your desktop, so we can use it in code.
Tip: If you want to try training again – perhaps to experiment with the various algorithms available to us – click Make A Copy in the bottom-right corner of the Create ML window.
Link copied to your pasteboard.