Installing Spark 1.6.1 on a Mac with Scala 2.11

I have recently gone through the process of installing Spark on my Mac for testing and development purposes. I also wanted to make sure I could use the installation not only with Scala, but also with PySpark through a Jupyter notebook.

If you are interested in doing the same, here are the steps I followed. First of all, here are the packages you will need:

  • Python 2.7 or higher
  • Java SE Development Kit
  • Scala and Scala Build Tool
  • Spark 1.6.1 (at the time of writing)
  • Jupyter Notebook


Python

You can choose the Python distribution that best suits your needs. I find Anaconda to be fine for my purposes, and you can obtain a graphical installer from the Anaconda website. I am using Python 2.7 at the time of writing.

Java SE Development Kit

You will need to download the Oracle Java SE Development Kit 7 or 8 from the Oracle JDK downloads page. In my case, at the time of writing I am using 1.7.0_80. You can check the version you have by opening a terminal and typing:

java -version

You also have to make sure that the appropriate environment variable is set up. In your .bash_profile add the following line:

export JAVA_HOME=$(/usr/libexec/java_home)

Scala and Scala Build Tool

In this case, I found it much easier to use Homebrew to install and manage the Scala language. If you have never used Homebrew, I recommend that you take a look. To install it you have to type the following in your terminal:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Once you have Homebrew you can install Scala and the Scala Build Tool (sbt) as follows:

> brew install scala
> brew install sbt

You may want to set the appropriate environment variable in your .bash_profile as well:

export SCALA_HOME=/usr/local/bin/scala

Spark 1.6.1

Obtain Spark from the Apache Spark downloads page.

Note that for building Spark with Scala 2.11 you will need to download the Spark source code and build it appropriately.


Once you have downloaded the tgz file, unzip it into an appropriate location (your home directory, for example) and navigate to the unzipped folder (for example, ~/spark-1.6.1).



To build Spark with Scala 2.11 you need to type the following commands:

> ./dev/change-scala-version.sh 2.11
> build/sbt clean assembly

This may take a while, so sit tight! When finished, you can check that everything is working by launching either the Scala shell:

> ./bin/spark-shell

or the Python shell:

> ./bin/pyspark

Once again there are some environment variables that are recommended:

export SPARK_PATH=~/spark-1.6.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias sparknb='$SPARK_PATH/bin/pyspark --master local[2]'

The last two lines tell PySpark to use Jupyter as its driver and to start it in notebook mode, and the alias will let us launch a Jupyter notebook with PySpark in one command. Totally optional!

Jupyter Notebook

If all is working well you are ready to go. Source your .bash_profile and launch a Jupyter notebook:

> sparknb

Et voilà!
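Once the notebook is up, a quick sanity check is to run a trivial job in the first cell. Note that `sc` is the SparkContext that the PySpark shell creates for you automatically:

```python
# Run this in the first notebook cell; `sc` is provided by PySpark.
rdd = sc.parallelize(range(1000))
print(rdd.sum())  # sum of 0..999, i.e. 499500
```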

Strata+Hadoop World – London 2016

I had the opportunity to attend the Strata+Hadoop World conference in London last week on the 2nd and 3rd of June. It was held in the ExCeL Centre in East London. Given the size of the venue, I had the expectation that it was going to be a massive event… Don’t get me wrong, it was indeed big, but I thought it was going to be even bigger. Colleagues who had attended other editions in San Jose also remarked that this one was on the smaller side of things.

In any case, I had the opportunity to talk to a lot of very interesting and engaging people, and heard about the work that large and small companies in the scene are doing. I had the chance to present a demo on the use of Spark in Bluemix and I think it went really well.

Who knows, I may even come to the next one…

Cloudera Breakfast Briefing and Tofu Scientists

Last Thursday I attended a Cloudera Breakfast Briefing where Sean Owen was speaking about Spark and the examples were related to building decision trees and random forests. It was a good session in general.

Sean started his talk with an example using the Iris dataset using R, in particular the “party” library. He then moved on to talk about Spark and MLlib.


For the rest of the talk he used the “Covertype” data set, which contains 581,012 data points describing trees with 54 features (elevation, slope, soil type, etc.) used to predict the forest cover type (spruce, aspen, etc.). A very apt dataset for the construction of random forests, right? I was very pleased to see a new (for me) dataset being used!

Sean went over some bits and pieces about using Spark, highlighting the compactness of the code. He also turned his attention to the tuning of hyperparameters and its importance.

There are different ways to approach this, but it is always about finding a balance, a trade-off. For a tree we can play with the depth of the tree, the maximum number of bins (i.e. the number of different decision rules to be tried) and the impurity measure (Gini or entropy).

If we don’t know the right values for the hyperparameters, we can simply try several of them, particularly if you have enough room on your cluster.
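In its simplest form, trying several values is just a grid search over the hyperparameter combinations. Here is a minimal Python sketch; the `evaluate` function is a hypothetical stand-in for "train a tree with these settings and score it on held-out data", not MLlib's actual API:

```python
from itertools import product

# Candidate values for the tree hyperparameters mentioned above.
depths = [4, 8, 16]
max_bins = [16, 32, 64]
impurities = ["gini", "entropy"]

def evaluate(depth, bins, impurity):
    # Hypothetical stand-in for training and scoring a tree;
    # in a real setting this would call into MLlib (or any other library).
    return 1.0 / (1 + abs(depth - 8) + abs(bins - 32) / 16.0)

# Exhaustive grid search: try every combination and keep the best one.
best = max(product(depths, max_bins, impurities),
           key=lambda params: evaluate(*params))
print(best)
```

On a cluster, the different combinations are independent of each other, so they can be evaluated in parallel.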


  • Building a random forest: let various trees see only a subset of the data, then combine them. Another approach is to let the trees see only a subset of the features. The latter is a nice idea as it may be a more reasonable approach for large clusters, where communication among nodes is kept to a minimum, which is good for Spark or Hadoop.
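The feature-subsetting idea can be sketched in a few lines of plain Python. Here `tree_predict` is a hypothetical stand-in for an individual trained tree; the point is only that each tree is bound to its own random subset of features and the forest combines them by majority vote:

```python
import random
from collections import Counter

random.seed(0)

N_FEATURES = 54   # as in the Covertype data
N_TREES = 10
SUBSET_SIZE = 8   # each tree sees only a few of the features

# Each "tree" is trained on (i.e. bound to) its own random feature subset.
feature_subsets = [random.sample(range(N_FEATURES), SUBSET_SIZE)
                   for _ in range(N_TREES)]

def tree_predict(features, point):
    # Hypothetical stand-in for a trained tree's prediction: it only
    # looks at the features this tree was given.
    return sum(point[f] for f in features) % 3

def forest_predict(point):
    # Combine the individual trees by majority vote.
    votes = Counter(tree_predict(fs, point) for fs in feature_subsets)
    return votes.most_common(1)[0][0]

point = list(range(N_FEATURES))  # one toy data point
print(forest_predict(point))     # a class label in {0, 1, 2}
```

Since each tree only ever needs its own slice of the features, the trees can be built on different nodes with very little data shuffled between them.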


Sean finished with some suggestions of things one can try:

  • Try SVM and LogisticRegression in MLlib
  • Real-time scoring with Spark Streaming
  • Use random decision forests for regression

Nonetheless, the best bit of this all was that after asking a couple of questions I managed to get my hands on a “Tofu Scientist” T-shirt! Result!
