# How to perform regression analysis using Create ML

With Create ML anyone can train machine learning models with only a little code.

Paul Hudson @twostraws

Regression analysis is one of several areas of machine learning, and is useful when you want to estimate values based on a previous set of values. For example, if we gave Create ML average house prices for the last year along with the size and condition of those houses, it should be able to do a good job predicting values for houses that haven’t been sold yet.

In this article I'll walk you through the different model types Create ML gives us, then create and improve a model that can then be used in a Core ML app – I think you'll be surprised how easy it is!

### Choosing a model

Create ML gives us four concrete types of regression models: linear, decision trees, boosted trees, and random forests. Each of those arrives at its conclusions in a different way, so you’ll find different tasks work better with different models.

Linear regressors look at all the values in your data – how many rooms the house has, how many bathrooms, whether or not it has a garage, etc. – then try to estimate the relationships between them as part of a linear function. So, it might guess that the house price is calculated by doubling the number of rooms, adding the number of bathrooms, then multiplying by 10,000. It’s called “linear” because the goal of a linear regression is to be able to draw one straight line through all your data points, where the average distance between the line and each data point is as small as possible.
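To make the “one straight line” idea concrete, here’s a minimal sketch of fitting a line to a single feature using ordinary least squares, which minimizes the squared distance between the line and each point. The `Line` type, the `fitLine(xs:ys:)` function, and the room and price numbers are all made up for illustration; this isn’t how Create ML is implemented internally.

```swift
// A hypothetical one-feature line: price = slope * rooms + intercept.
struct Line {
    let slope: Double
    let intercept: Double

    func predict(_ x: Double) -> Double {
        return slope * x + intercept
    }
}

/// Fits a straight line by minimizing the sum of squared errors.
/// Returns nil if there are too few points or every x is identical.
func fitLine(xs: [Double], ys: [Double]) -> Line? {
    guard xs.count == ys.count, xs.count > 1 else { return nil }

    let count = Double(xs.count)
    let meanX = xs.reduce(0, +) / count
    let meanY = ys.reduce(0, +) / count

    var covariance = 0.0
    var variance = 0.0
    for (x, y) in zip(xs, ys) {
        covariance += (x - meanX) * (y - meanY)
        variance += (x - meanX) * (x - meanX)
    }
    guard variance > 0 else { return nil }

    let slope = covariance / variance
    return Line(slope: slope, intercept: meanY - slope * meanX)
}

// Made-up data: each extra room adds exactly 20,000 to a base price of 50,000.
let rooms: [Double] = [1, 2, 3, 4, 5]
let prices: [Double] = [70_000, 90_000, 110_000, 130_000, 150_000]

if let line = fitLine(xs: rooms, ys: prices) {
    print(line.slope, line.intercept, line.predict(6))
}
```

Real data never sits perfectly on a line like this, which is why the fit settles for the line with the smallest overall error.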

Decision tree regressors form a tree structure letting us organize information as a series of choices: “does it have four legs or two? If it has four legs, does it have a wet nose? If it has a wet nose, does it have long fur?” Each time the tree branches off depending on the answer, until eventually there’s a definitive prediction.
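As a rough sketch of that “series of choices” idea applied to a number rather than an animal, here’s a tiny hand-built tree. The `DecisionNode` type, the feature names, and the thresholds are all hypothetical; a real regressor learns its questions from the training data rather than having them written by hand.

```swift
// A hand-built decision tree, purely for illustration.
indirect enum DecisionNode {
    // A final prediction.
    case leaf(value: Double)
    // A question about one feature: take `below` if the feature is under the threshold.
    case question(feature: String, threshold: Double, below: DecisionNode, atOrAbove: DecisionNode)

    func predict(_ features: [String: Double]) -> Double {
        switch self {
        case .leaf(let value):
            return value
        case .question(let feature, let threshold, let below, let atOrAbove):
            // A missing feature is treated as 0 to keep the sketch short.
            let answer = features[feature, default: 0]
            return answer < threshold ? below.predict(features) : atOrAbove.predict(features)
        }
    }
}

// "Does it have fewer than four rooms? If not, does it have a garage?"
let houseTree = DecisionNode.question(
    feature: "rooms", threshold: 4,
    below: .leaf(value: 150_000),
    atOrAbove: .question(
        feature: "garage", threshold: 1,
        below: .leaf(value: 250_000),
        atOrAbove: .leaf(value: 300_000)
    )
)

print(houseTree.predict(["rooms": 5, "garage": 1])) // prints 300000.0
```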

Boosted tree regressors work using a series of decision trees, where each tree is designed to correct any errors in the previous tree. For example, the first decision tree makes its best prediction, but it’s off by 20%. That result then gets passed to a second tree for refinement, and the error comes down to 10%. That goes into a third tree where the error comes down to 6%, and a fourth tree where the error comes down to 5%. This technique is called model ensembling, and lets our combined model make better predictions. The term “boosting” just means “each tree depends on the tree that came before it.”
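Here’s a minimal sketch of that chain of corrections. It assumes the simplest possible “tree” (a stump that asks one question), and it has each new stump fit whatever error the earlier stumps left behind, which is the approach used in gradient boosting with squared error. Every name and number below is made up, and the article doesn’t say whether Create ML works exactly this way internally.

```swift
// A decision stump: a tree with a single question and two possible answers.
struct Stump {
    let threshold: Double
    let below: Double
    let atOrAbove: Double

    func predict(_ x: Double) -> Double {
        return x < threshold ? below : atOrAbove
    }
}

func mean(_ values: [Double]) -> Double {
    return values.isEmpty ? 0 : values.reduce(0, +) / Double(values.count)
}

/// Sum of squared differences between the targets and a model's predictions.
func squaredError(xs: [Double], targets: [Double], model: (Double) -> Double) -> Double {
    return zip(xs, targets).reduce(0.0) { total, pair in
        let miss = pair.1 - model(pair.0)
        return total + miss * miss
    }
}

/// Tries every x as a threshold and keeps the stump with the smallest squared error.
func fitStump(xs: [Double], targets: [Double]) -> Stump {
    let pairs = Array(zip(xs, targets))
    var best = Stump(threshold: -Double.infinity, below: 0, atOrAbove: mean(targets))
    var bestError = Double.infinity

    for threshold in xs {
        let left = pairs.filter { $0.0 < threshold }.map { $0.1 }
        let right = pairs.filter { $0.0 >= threshold }.map { $0.1 }
        let candidate = Stump(threshold: threshold, below: mean(left), atOrAbove: mean(right))
        let error = squaredError(xs: xs, targets: targets, model: candidate.predict)
        if error < bestError {
            bestError = error
            best = candidate
        }
    }
    return best
}

/// Trains `rounds` stumps, each fitted to the error left over by the ones before it.
func boost(xs: [Double], ys: [Double], rounds: Int) -> [Stump] {
    var stumps = [Stump]()
    var residuals = ys
    for _ in 0..<rounds {
        let stump = fitStump(xs: xs, targets: residuals)
        stumps.append(stump)
        residuals = zip(xs, residuals).map { $0.1 - stump.predict($0.0) }
    }
    return stumps
}

/// The ensemble's answer is the sum of every stump's contribution.
func ensemblePrediction(for x: Double, using stumps: [Stump]) -> Double {
    return stumps.reduce(0) { $0 + $1.predict(x) }
}

func rmse(of stumps: [Stump], xs: [Double], ys: [Double]) -> Double {
    let total = squaredError(xs: xs, targets: ys) { ensemblePrediction(for: $0, using: stumps) }
    return (total / Double(ys.count)).squareRoot()
}

// Made-up data: one feature and the value we want to predict.
let sizes: [Double] = [1, 2, 3, 4, 5, 6]
let values: [Double] = [10, 12, 20, 22, 30, 32]

for rounds in 1...4 {
    let stumps = boost(xs: sizes, ys: values, rounds: rounds)
    print("Rounds: \(rounds), RMSE: \(rmse(of: stumps, xs: sizes, ys: values))")
}
```

The printed error shrinks as rounds are added, mirroring the 20%, 10%, 6%, 5% progression described above.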

Random forests also use an ensemble of decision trees, but with an important difference: with boosted trees every decision in the tree is made with access to all available data, whereas with random forests each tree has access to only a subset of the data. The trees then pool their predictions together to figure out which one is most likely. This is similar to asking your colleagues how to solve a coding problem: you’d get a variety of solutions based on their education and experience, but if you take the most common solution you stand the best chance of doing the right thing.
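The sketch below shows the “each tree sees only a subset” idea under some loud assumptions: each tree is a crude one-question `TinyTree`, subsets are drawn at random with replacement, and predictions are pooled by averaging, which is the usual choice when the answer is a number. All the names and data are hypothetical, and the output changes from run to run because the subsets are random.

```swift
struct Sample {
    let x: Double
    let y: Double
}

// A one-question tree, standing in for a full decision tree to keep the sketch short.
struct TinyTree {
    let threshold: Double
    let below: Double
    let atOrAbove: Double

    init(trainedOn samples: [Sample]) {
        precondition(!samples.isEmpty, "A tree needs at least one sample")
        // Split at the middle x value of whatever subset this tree was given.
        let sorted = samples.sorted { $0.x < $1.x }
        let split = sorted[sorted.count / 2].x
        let low = sorted.filter { $0.x < split }.map { $0.y }
        let high = sorted.filter { $0.x >= split }.map { $0.y }

        // `high` always contains at least the sample we split on.
        let highMean = high.reduce(0, +) / Double(high.count)
        let lowMean = low.isEmpty ? highMean : low.reduce(0, +) / Double(low.count)

        threshold = split
        below = lowMean
        atOrAbove = highMean
    }

    func predict(_ x: Double) -> Double {
        return x < threshold ? below : atOrAbove
    }
}

/// Grows a forest where every tree is trained on its own random subset.
func growForest(from samples: [Sample], treeCount: Int, samplesPerTree: Int) -> [TinyTree] {
    precondition(!samples.isEmpty && treeCount > 0 && samplesPerTree > 0)
    return (0..<treeCount).map { _ -> TinyTree in
        // Sampling with replacement, so some rows repeat and others are left out.
        let subset = (0..<samplesPerTree).map { _ in samples.randomElement()! }
        return TinyTree(trainedOn: subset)
    }
}

/// Pools the trees' answers by averaging them.
func forestPrediction(for x: Double, using forest: [TinyTree]) -> Double {
    let total = forest.reduce(0) { $0 + $1.predict(x) }
    return total / Double(forest.count)
}

// Made-up data: one feature and the value we want to predict.
let samples = [
    Sample(x: 1, y: 10), Sample(x: 2, y: 12), Sample(x: 3, y: 20),
    Sample(x: 4, y: 22), Sample(x: 5, y: 30), Sample(x: 6, y: 32)
]

let forest = growForest(from: samples, treeCount: 25, samplesPerTree: 4)
print(forestPrediction(for: 1, using: forest), forestPrediction(for: 6, using: forest))
```

No single tree here is much good, but just like polling your colleagues, the pooled answer tends to be steadier than any individual one.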

Helpfully, Create ML also provides a general-purpose regressor that looks at our data and tries to decide which concrete regression model is the best choice. While this won’t always be correct, it does give you a good starting point.

### Training the model

Let’s put your new-found machine learning knowledge into practice. I’ve generated some example data for us to work with, which is a collection of data about football players: how many appearances they’ve made, how many goals they scored, how many penalties they scored, how many set ups they made (helping someone else score), how many red and yellow cards they received, along with the amount a team last paid for them.

Note: This is just example data; don’t try to use it in a real app.

Here’s a snippet:

```json
[
    {
        "appearances": 65,
        "goals": 24,
        "penalties": 6,
        "setUps": 31,
        "redCards": 2,
        "yellowCards": 0,
        "value": 15810000
    },
    {
        "appearances": 151,
        "goals": 70,
        "penalties": 8,
        "setUps": 99,
        "redCards": 7,
        "yellowCards": 1,
        "value": 22540000
    }
]
```

Each thing inside braces is a single player, and the whole thing is wrapped inside square brackets, making this a JSON array.
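If you want to sanity check a file like this before handing it to Create ML, one option is to decode it with plain `Codable`. This is only a sketch: the `Player` struct is a hypothetical type that mirrors the keys in the snippet, and it isn’t needed for training because Create ML does its own parsing.

```swift
import Foundation

// Hypothetical type mirroring the keys in the snippet above.
struct Player: Codable {
    let appearances: Int
    let goals: Int
    let penalties: Int
    let setUps: Int
    let redCards: Int
    let yellowCards: Int
    let value: Int
}

// The same two players from the snippet, inlined so the example is self-contained.
let json = """
[
    {"appearances": 65, "goals": 24, "penalties": 6, "setUps": 31, "redCards": 2, "yellowCards": 0, "value": 15810000},
    {"appearances": 151, "goals": 70, "penalties": 8, "setUps": 99, "redCards": 7, "yellowCards": 1, "value": 22540000}
]
"""

do {
    // Decoding fails loudly if a key is missing or has the wrong type.
    let players = try JSONDecoder().decode([Player].self, from: Data(json.utf8))
    print("Loaded \(players.count) players; first value: \(players[0].value)")
} catch {
    print("Could not decode the player data: \(error)")
}
```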

Let’s train a model with that data. First, you need to create a new macOS playground, because Create ML isn’t available on iOS. Now add an import for the CreateML framework:

``import CreateML``

The next step is to load our player JSON into an instance of `MLDataTable`, which is responsible for parsing the JSON into something Create ML can work with.

So, put this into the playground:

``let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/Users/twostraws/Desktop/players.json"))``

Warning: That path points to my Desktop directory. Make sure you change it to wherever you put players.json.

In order to check that Create ML did a good job, we’re going to have it split our data into two parts: 80% will be training data that it can use to try to find correlations in our data, and the remaining 20% will be testing data that it can use to evaluate the results of its training.

``let (trainingData, testingData) = data.randomSplit(by: 0.8)``
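To see what a random 80/20 split means in practice, here’s a sketch using a plain array. The `holdOutSplit(_:trainingFraction:)` function is made up for illustration and makes no claim about how `randomSplit(by:)` is implemented; it just shuffles the rows and cuts the result in two.

```swift
/// Shuffles the rows, then hands the first chunk to training and the rest to testing.
func holdOutSplit<T>(_ rows: [T], trainingFraction: Double) -> (training: [T], testing: [T]) {
    precondition((0...1).contains(trainingFraction), "The fraction must be between 0 and 1")
    let shuffled = rows.shuffled()
    let cut = Int((Double(rows.count) * trainingFraction).rounded())
    return (Array(shuffled[..<cut]), Array(shuffled[cut...]))
}

// Ten stand-in rows; in the playground these would be players.
let rows = Array(1...10)
let (trainingRows, testingRows) = holdOutSplit(rows, trainingFraction: 0.8)
print(trainingRows.count, testingRows.count)
```

The important part is that the testing rows are never shown to the model during training, so they give an honest measure of how it copes with data it hasn’t seen.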

Now that we have data for Create ML to train on, we’re going to create an instance of `MLRegressor`: this is the general-purpose class that attempts to figure out which model makes most sense for our data.

``let playerPricer = try MLRegressor(trainingData: trainingData, targetColumn: "value")``

That passes in our training data, and tells it that “value” is the column we want to predict.

Once the model has been trained, the next job is to test it to make sure Create ML can make accurate predictions. This will take our testing data, look at all the player data, and attempt to predict the value. It can then look at the actual value to see how far off it was, thus giving us an accurate idea of how good the model is.

To test the model, add this code to the playground:

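```swift
// Score the trained model against the 20% of rows it has never seen.
let evaluationMetrics = playerPricer.evaluation(on: testingData)
// The typical size of the error, weighted towards large misses.
print(evaluationMetrics.rootMeanSquaredError)
// The single worst prediction.
print(evaluationMetrics.maximumError)
```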

The first of those outputs is calculated by looking at how much each prediction varied from the actual result, squaring that, calculating the mean of all the squared errors, then taking the square root of the result. This technique assigns more weight to bigger errors compared to a simple mean error, making it more useful: an error of 5 has a weight of 25 in its calculation, but an error of 10 has a weight of 100 – four times as much rather than twice as much.

The second of those outputs is the largest error across all predictions. This is interesting to know, but in practice the root mean squared error (RMSE) is more useful.
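Both calculations are short enough to write out by hand. The sketch below uses two made-up predictions that miss by 5 and by 10, matching the example above; the function names are my own, and I’m assuming the maximum error means the largest absolute difference.

```swift
/// Square each error, average the squares, then take the square root.
func rootMeanSquaredError(predicted: [Double], actual: [Double]) -> Double {
    precondition(predicted.count == actual.count && !actual.isEmpty)
    let squaredErrors = zip(predicted, actual).map { ($0 - $1) * ($0 - $1) }
    return (squaredErrors.reduce(0, +) / Double(squaredErrors.count)).squareRoot()
}

/// The largest absolute difference between a prediction and the real value.
func maximumError(predicted: [Double], actual: [Double]) -> Double {
    return zip(predicted, actual).map { abs($0 - $1) }.max() ?? 0
}

// Made-up numbers: the first prediction is off by 5, the second by 10.
let predictedValues: [Double] = [105, 190]
let actualValues: [Double] = [100, 200]

// Squared errors are 25 and 100, so the mean is 62.5 and the RMSE is its square root.
print(rootMeanSquaredError(predicted: predictedValues, actual: actualValues))
print(maximumError(predicted: predictedValues, actual: actualValues))
```

A simple mean of those two errors would be 7.5, whereas the RMSE comes out a little higher because the bigger miss counts for more.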

All being well, you should have the values 1010842.3852471417 and 3554860.0 printed out.

### Improving the model

Let’s take another look at the results so far:

• The average error was \$1,010,842
• The largest error was \$3,554,860

Like I said, the largest error is interesting but not critical. What matters is the average error (RMSE), which is over \$1 million. The player values in the snippet above are only around 15 to 23 million, so being off by a million on average isn’t great.

To fix this problem we’re going to use one of the specific regressors I mentioned earlier – linear, boosted tree, and so on. When you create a specific regressor instance you can provide it with custom parameters that control how it behaves. The one we care about here is called maximum iterations, and it controls how many times the algorithm can update its parameters to reflect its findings so far. The higher the number, the more chances the model has to get a good answer.

By default, Create ML uses 10 iterations for creating models, which is low – it’s common to start with something like 500 when performing regression analysis, then tweak it upwards or downwards from there.

If you look in the output pane of your playground, you should see “BoostedTreeRegressor” next to the `let playerPricer =` line – that’s Swift telling us that `MLRegressor` looked at all our data and decided that a boosted tree model was the best choice.

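So, let’s try a boosted tree model with 500 iterations. Replace the earlier `MLRegressor` line with this, because Swift won’t let us declare `playerPricer` twice: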

```swift
// Allow up to 500 iterations rather than the default of 10.
let params = MLBoostedTreeRegressor.ModelParameters(maxIterations: 500)
let playerPricer = try MLBoostedTreeRegressor(trainingData: trainingData, targetColumn: "value", parameters: params)
```

This time we get the following results:

• The average error was \$348,517 Swiftovian dollars
• The largest error was \$1,257,868 Swiftovian dollars

So, we’ve taken our RMSE down by two thirds just by making a small change – a huge improvement, particularly when you remember this is weighted towards larger errors as discussed earlier.

Boosted tree regressors have lots of other options you can tweak if you have specific needs – just try looking at the initializer for its parameters! – but for our purposes we now have a model that is able to predict our data accurately enough.

### Saving the model

Now that we have a model that works reasonably well, the final step is to write the model to disk so we can use it in an iOS Core ML app.

``let metadata = MLModelMetadata(author: "Paul Hudson", shortDescription: "A model trained to predict player values.", version: "1.0")``

``try playerPricer.write(to: URL(fileURLWithPath: "/Users/twostraws/Desktop/PlayerValues.mlmodel"), metadata: metadata)``

Warning: Again, that path points to my Desktop directory. Make sure you change it to your own desktop.

Run your playground now, and it should write out PlayerValues.mlmodel to your desktop. That model is all set to go with Core ML – good job!

Paul Hudson is the creator of Hacking with Swift, the most comprehensive series of Swift books in the world. He's also the editor of Swift Developer News, the maintainer of the Swift Knowledge Base, and a speaker at Swift events around the world.