Last Thursday I attended a Cloudera Breakfast Briefing where Sean Owen spoke about Spark, with examples on building decision trees and random forests. It was a good session in general.

Sean started his talk with an example in R using the Iris dataset and, in particular, the “party” library. He then moved on to talk about Spark and MLlib.

For the rest of the talk he used the “Covertype” data set, which contains 581,012 data points describing trees using 54 features (elevation, slope, soil type, etc.) to predict forest cover type (spruce, aspen, etc.). A very apt dataset for the construction of random forests, right? I was very pleased to see a new (for me) dataset being used!

Sean went over some bits and pieces about using Spark, highlighting the compactness of the code. He also turned his attention to the tuning of hyperparameters and its importance.

There are different ways to approach this, but it is always about finding a balance, a trade-off. For a tree we can play with the depth of the tree, the maximum number of bins (i.e. the number of different decision rules to be tried), and the impurity measure (Gini or entropy).
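To make the two impurity measures a bit more concrete, here is a minimal sketch in plain Python (not the code from the talk, and not MLlib). The function names `gini` and `entropy` and the toy label lists are my own, chosen just for illustration. Both measures are zero for a node that contains a single class and grow as the classes get more mixed.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


# Toy nodes (made-up labels, only for illustration).
pure = ["spruce"] * 4
mixed = ["spruce", "spruce", "aspen", "aspen"]

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

A split is considered good when the child nodes it produces are less impure than the parent, whichever of the two measures is used.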

If we don’t know the right values for the hyperparameters, we can try several combinations, particularly if there is enough room on the cluster.
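Trying several values usually boils down to a grid search. The sketch below is only an assumption of how that might look in plain Python: the parameter names (`max_depth`, `max_bins`, `impurity`), the candidate values and especially `toy_evaluate` are hypothetical. In a real run `evaluate` would train a tree with the given settings and return its accuracy on a held-out validation set; here it is an arbitrary stand-in so the loop can be run on its own.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values: 3 * 2 * 2 = 12 combinations.
grid = {
    "max_depth": [1, 10, 20],
    "max_bins": [10, 100],
    "impurity": ["gini", "entropy"],
}


def toy_evaluate(params):
    """Stand-in score, NOT a real model: an arbitrary function of the settings."""
    score = 1.0
    score -= abs(params["max_depth"] - 10) / 10
    score -= abs(params["max_bins"] - 100) / 1000
    if params["impurity"] == "entropy":
        score += 0.01
    return score


best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params)
```

The number of combinations grows multiplicatively with every hyperparameter added, which is where the spare room on the cluster comes in handy: each combination can be evaluated independently of the others.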

To build a random forest, we let various trees see only a subset of the data and then combine their predictions. Another approach is to let each tree see only a subset of the features. The latter is a nice idea, as it may be a more reasonable approach for large clusters, where communication among nodes should be kept to a minimum. That makes it a good fit for Spark or Hadoop.
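The ingredients of that recipe can be sketched in a few lines of plain Python. This is a simplified illustration, not how MLlib does it: the helper names (`bootstrap_sample`, `feature_subset`, `majority_vote`) are my own, sampling the rows with replacement is one common way of giving each tree a subset of the data, and the actual tree training is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Subset of the data: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def feature_subset(n_features, k, rng):
    """Subset of the features: pick k distinct column indices."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Combine the trees: the most common predicted class wins."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)   # fixed seed so the sketch is repeatable
rows = list(range(10))    # stand-in for 10 data points
n_features = 54           # as in the Covertype data set

# Each (hypothetical) tree would be trained on its own rows and columns.
for tree_id in range(3):
    sample = bootstrap_sample(rows, rng)
    columns = feature_subset(n_features, 7, rng)
    print(tree_id, sample, columns)

# Made-up predictions from three trees for a single data point.
print(majority_vote(["spruce", "aspen", "spruce"]))  # spruce
```

Since every tree only needs its own sample and its own columns, the trees can be trained independently of one another, which is what keeps the communication between nodes low.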

Sean finished with some suggestions of things one can try:

- Try SVM and LogisticRegression in MLlib
- Real-time scoring with Spark Streaming
- Use random decision forests for regression

Nonetheless, the best bit of it all was that after asking a couple of questions I managed to get my hands on a “Tofu Scientist” T-shirt! Result!