
CoreML – Boston Prices exploration

In the previous post of this series we described some of the basics of linear regression, one of the best-known models in machine learning. We saw that we can relate the values of the input parameters x_i to the target variable y to be predicted. In this post we are going to create a linear regression model to predict the price of houses in Boston (based on valuations from the 1970s). The dataset provides information such as the per capita crime rate (CRIM), the proportion of non-retail business acres per town (INDUS), the proportion of owner-occupied units built before 1940 (AGE) and the average number of rooms per dwelling (RM), among other attributes, together with the median value of owner-occupied homes in $1000s (MEDV).

Let us start by exploring the data. We are going to use Scikit-learn and, fortunately, the dataset ships with the library. The input variables are held in the data attribute and the prices in the target attribute. We are going to load the input variables into the dataframe boston_df and the prices into the array y:

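from sklearn import datasets
import pandas as pd

# load_boston was removed in scikit-learn 1.2, so this needs an earlier release
boston = datasets.load_boston()
# Input variables as a DataFrame, with the dataset's feature names as columns
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Target: median home value (MEDV) in $1000s
y = boston.target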

We are going to build our model using only a limited number of inputs. In this case let us pay attention to the average number of rooms and the crime rate:

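# Keep only the crime rate and the average number of rooms, with friendlier names
X = boston_df[['CRIM', 'RM']].rename(columns={'CRIM': 'Crime', 'RM': 'Rooms'})
X.describe()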

The description of these two attributes is as follows:

            Crime       Rooms
count  506.000000  506.000000
mean     3.593761    6.284634
std      8.596783    0.702617
min      0.006320    3.561000
25%      0.082045    5.885500
50%      0.256510    6.208500
75%      3.647423    6.623500
max     88.976200    8.780000

As we can see, the minimum average number of rooms is 3.56 and the maximum is 8.78. For the crime rate the minimum is 0.006 and the maximum is 88.98, yet the median is only 0.26, so the distribution is heavily skewed towards low values. We will use some of these values to define the ranges offered to our users when they request price predictions.
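As a rough illustration of how these statistics could become input ranges, here is a minimal sketch in plain Python. The stats dictionary simply copies values from the table above; the helper name input_range, the choice of evenly spaced steps and the use of the median as a default are all assumptions made for this example, not part of the app described in this series.

```python
# Summary statistics copied from the describe() output above.
stats = {
    "Crime": {"min": 0.00632, "median": 0.25651, "max": 88.9762},
    "Rooms": {"min": 3.561, "median": 6.2085, "max": 8.78},
}


def input_range(name, n_steps=10):
    """Return n_steps + 1 evenly spaced values from min to max for one input.

    Hypothetical helper: evenly spaced steps are an assumption for illustration.
    """
    low, high = stats[name]["min"], stats[name]["max"]
    step = (high - low) / n_steps
    return [round(low + i * step, 5) for i in range(n_steps + 1)]


# Candidate values a user could pick from for each input.
rooms_choices = input_range("Rooms")
crime_choices = input_range("Crime")

# The median is one sensible starting value for each input (an assumption).
defaults = {name: s["median"] for name, s in stats.items()}
```

Because the crime rate is so skewed, evenly spaced steps place most choices far above the typical value; a narrower upper bound or a logarithmic spacing may suit that input better.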

Finally, let us visualise the data:
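One way to draw such a plot is sketched below with matplotlib. The side-by-side scatter layout, the axis labels and the helper name plot_inputs_vs_price are assumptions made for illustration; this is not necessarily how the figure for this post was produced.

```python
import matplotlib.pyplot as plt


def plot_inputs_vs_price(crime, rooms, price):
    """Scatter each input against the median price, side by side.

    Hypothetical helper for illustration; returns the figure and both axes.
    """
    fig, (ax_crime, ax_rooms) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

    # Left panel: crime rate against price
    ax_crime.scatter(crime, price, alpha=0.5)
    ax_crime.set_xlabel("Crime (per capita crime rate)")
    ax_crime.set_ylabel("Median value ($1000s)")

    # Right panel: average number of rooms against price
    ax_rooms.scatter(rooms, price, alpha=0.5)
    ax_rooms.set_xlabel("Rooms (average per dwelling)")

    fig.tight_layout()
    return fig, (ax_crime, ax_rooms)


# With the variables defined earlier in this post:
# fig, axes = plot_inputs_vs_price(X["Crime"], X["Rooms"], y)
# plt.show()
```

Sharing the vertical axis keeps the two panels directly comparable, since both show the same target variable.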

We shall bear these values in mind when building our regression model in subsequent posts.

You can look at the code (in development) on my GitHub site here.