Random thoughts about random subjects… From science to literature and between manga and watercolours, by way of data science and rugby; including film, physics and fiction, programming, pictures and puns.
Right!!! It is early December and this post has been in the inkwell for a few months now. Earlier in the year I received the comments and suggestions from reviewers and the final approval from the excellent team at CRC Press for my 4th book.
After a few weeks of frank procrastination, and a few more spent structuring the proposed ideas, I have a clear head to start writing. So I am pleased to announce that I am officially starting to write “Statistics and Data Visualisation with #Python”.
“Statistics and Data Visualisation with Python” builds from the ground up the basis for the statistical analysis underpinning a number of applications and algorithms in business analytics, machine learning and applied machine learning. The book will cover the basics of programming in Python as well as data analysis, building a solid background in statistical methods and hypothesis testing useful in a variety of modern applications.
Anyone interested in creating their own data visualizations should be giddy with delight at the quickly growing number of tools available to create them without any need for programming skills, and in most cases for free: Tableau, Flourish, Datawrapper, RawGraphs, Chartbuilder or QGIS (for mapping) are some of the best, and the list goes on and on. I’m convinced that in a relatively short time drag-and-drop tools will be as powerful and flexible as D3.js and other developer tools, making data visualization accessible to everyone.
The exciting news is seeing two software giants entering the field with new web-based tools: Adobe launched Data Illustrator a few months ago in a collaboration with the Georgia Institute of Technology, and Microsoft Research is behind the just released Charticulator. Both work very intuitively, allowing the author to bind multiple attributes of data to graphical elements. They are indeed powered by D3.js, among other libraries.
Both offer introduction videos on their home pages. Here is Data Illustrator:
And here is Charticulator:
The tools offer tutorial sections and multiple step-by-step videos in their galleries; and they link to the research papers describing the tools, which are worth reading (Data Illustrator, Charticulator).
Creating complex visualizations like the chord diagram below seems ridiculously simple in Charticulator, and the same can be said of Data Illustrator’s visualizations. See the video:
This is not a review as I have just started playing with them, but on first look both tools are impressive. It’s still really early in their development, but if Adobe and Microsoft throw their mighty resources behind supporting and improving them, we can expect great things in the near future. Perhaps one day Data Illustrator could be embedded within Adobe Illustrator, allowing designers to work fluidly between D3 and Illustrator without leaving the graphical interface. And Charticulator could integrate into PowerPoint. Stay tuned!
I recently came across Flourish, a data visualisation tool that makes things easy and can be used even if your programming skills are a bit rusty. The tool is the brainchild of studio Kiln, who have made the tool entirely web-based and they even offer a free public version.
Starting up is easy as you are encouraged to use templates and can upload your data from a CSV or Excel file. Some of the templates offer the usual scatterplots and bar charts, but you also have things like Sankey diagrams or 3D globe maps. If you are interested, you can also create your own custom templates.
Flourish’s free version allows you to publish and share visualisations, or to embed them in your website. Beware that the data will be visible to everyone once you publish. Give it a go and let me know what you think.
In the previous post of this series we described some of the basics of linear regression, one of the best-known models in machine learning. We saw that we can relate the values of the input variables to the target variable to be predicted. In this post we are going to create a linear regression model to predict the price of houses in Boston (based on valuations from the 1970s). The dataset provides attributes such as the per capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied homes built before 1940 (AGE) and the average number of rooms (RM), as well as the median value of homes in $1000s (MEDV), among other attributes.
Let us start by exploring the data. We are going to use Scikit-learn and fortunately the dataset comes with the module. The input variables are included in the data attribute and the prices are given by the target attribute. We are going to load the input variables into the dataframe boston_df and the prices into the array y:
from sklearn import datasets
import pandas as pd

# Load the Boston housing dataset bundled with Scikit-learn
boston = datasets.load_boston()

# Put the input variables in a dataframe, named after the dataset's features
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names

# The median house prices are our target
y = boston.target
We are going to build our model using only a limited number of inputs. In this case let us pay attention to the average number of rooms and the crime rate:
# Keep only the crime rate and the average number of rooms,
# copying so that we can rename the columns safely
X = boston_df[['CRIM', 'RM']].copy()
X.columns = ['Crime', 'Rooms']
The description of these two attributes is as follows:
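This summary comes from pandas’ describe method:

print(X.describe())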
            Crime       Rooms
count  506.000000  506.000000
mean     3.593761    6.284634
std      8.596783    0.702617
min      0.006320    3.561000
25%      0.082045    5.885500
50%      0.256510    6.208500
75%      3.647423    6.623500
max     88.976200    8.780000
As we can see, the minimum number of rooms is about 3.56 and the maximum is 8.78, whereas for the crime rate the minimum is 0.006 and the maximum is a whopping 88.98, even though the median is only 0.26. We will use some of these values to define the ranges that will be offered to our users to find price predictions.
Finally, let us visualise the data:
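Here is a minimal sketch of one way to do this with matplotlib, plotting each of our two inputs against the median house price held in y:

import matplotlib.pyplot as plt

# Scatter each input variable against the median house price
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, column in zip(axes, X.columns):
    ax.scatter(X[column], y, alpha=0.5)
    ax.set_xlabel(column)
    ax.set_ylabel('Median value (MEDV, $1000s)')
plt.tight_layout()
plt.show()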
We shall bear these values in mind when building our regression model in subsequent posts.
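In the meantime, here is a minimal sketch of what fitting such a model could look like with Scikit-learn’s LinearRegression; the values passed to predict are hypothetical examples:

from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on our two inputs
model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)

# Predicted price (in $1000s) for a hypothetical house with
# a crime rate of 0.25 and an average of 6 rooms
print(model.predict([[0.25, 6.0]]))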
Yesterday I had the pleasure of giving a community talk at Campus London as part of the events organised by General Assembly London. The place was packed and I was quite pleased to see that the audience was very engaged, asking questions and making great comments and remarks.
As expected, the audience was quite varied, from students interested in breaking into the field to seasoned analysts and startup entrepreneurs. The questions were all very pertinent and I hope that the answers provided were useful to all of them.
The talk was effectively an introduction to what data science is, the tools used, and the opportunities and challenges in the field. You can find a handout of the slides here.
Last week I had the opportunity not only of hosting, but also of speaking at the Data+Visual Meetup organised by Eric Hannell. The occasion was well attended and not just by those interested in data visualisation, but also in big data as the Big Data Developers in London meetup took place concurrently.
As for my talk, well, I wanted to use it as a reminder of the uses that visualisation of data and information has in everyday life, and of the best practices that one should bear in mind when putting together a visual. Since data visualisation can be used (among other things) for:
Seeing data in context
Supporting graphical calculation
Presenting an argument
Telling a story
we should take into account how the visualisation is going to be consumed, the audience, and the message that we want to transmit. During the talk I showed some examples where data visualisation has been used effectively, but also some where it hasn’t (and how they could be improved). The aim is not to criticise (no-one deliberately goes out of their way to make a bad visual), but to learn.
It has been in development for nearly twelve months, almost to the day, and I am very pleased to tell you that the first full draft of my new book, entitled “Data Science and Analytics with Python”, is ready.
The book is aimed at data enthusiasts and professionals with some knowledge of programming principles, as well as developers and business people interested in learning more about data science and analytics. The proposed table of contents is as follows:
The Trials and Tribulations of a Data Scientist
First Slithers with Python
The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
The Relationship Conundrum: Regression
Jackalopes and Hares, Unicorns and Horses: Clustering and Classification
Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
Dimensionality Reduction and Support Vector Machines
At the moment the book contains 53 figures and 18 tables, plus plenty of bits and pieces of code ready to be tried.
The next step is to start the re-reading, re-drafting and revising in preparation for the final version and submission to my publisher, CRC Press, later in the year. I will keep you posted on how things go.