A collection of Data Science and Data Visualisation related posts, pics and thoughts. Take a look and enjoy.
Data Science & Augmented Intelligence - Reblog from "Data Science: a new discipline to change the world" by Alan Wilson
This is a reblog of the post by Alan Wilson that appeared in the EPSRC blog. You can see the original here.
Data science - the new kid on the block
I have re-badged myself several times in my research career: mathematician, theoretical physicist, economist (of sorts), geographer, city planner, complexity scientist, and now data scientist. This is partly personal idiosyncrasy but also a reflection of how new interdisciplinary research challenges emerge. I now have the privilege of being the Chief Executive of The Alan Turing Institute - the national centre for data science. 'Data science' is the new kid on the block. How come?
First, there is an enormous amount of new 'big' data; second, this has had a powerful impact on all the sciences; and third, on society, the economy and our way of life. Data science represents these combinations. The data comes from wide-spread digitisation combined with the 'open data' initiatives of government and extensive deployment of sensors and devices such as mobile phones. This generates huge research opportunities.
In broad terms, data science has two main branches. First, what can we do with the data? Applications of statistics and machine learning fall under this branch. Second, how can we transform existing science with this data and these methods? Much of the second is rooted in mathematics. To make this work in practice, there is a time-consuming first step: making the data useable by combining different sources in different formats. This is known as 'data wrangling', which coincidentally is the subject of a new Turing research project to speed up this time-consuming process. The whole field is driven by the power of the computer, and computer science. Understanding the effects of data on society, and the ethical questions it provokes, is led by the social sciences.
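To give a flavour of what 'data wrangling' involves, here is a small pandas sketch (all names and values are invented for illustration): two hypothetical sources describe the same records with different column names, separators and date formats, and must be reconciled before any analysis can start.

```python
import io
import pandas as pd

# Two "sources" in inconsistent formats: different column names,
# different date conventions, and a stray duplicate row.
csv_a = io.StringIO("id,date,count\n1,2017-01-05,10\n2,2017-01-06,12\n2,2017-01-06,12\n")
csv_b = io.StringIO("ID;when;sensor\n1;05/01/2017;0.7\n2;06/01/2017;0.9\n")

a = pd.read_csv(csv_a, parse_dates=["date"]).drop_duplicates()
b = pd.read_csv(csv_b, sep=";")

# Harmonise column names and parse the day-first dates.
b = b.rename(columns={"ID": "id", "when": "date"})
b["date"] = pd.to_datetime(b["date"], dayfirst=True)

# Only now can the two sources be combined into one usable table.
merged = pd.merge(a, b, on=["id", "date"], how="inner")
print(merged)
```

Even in this toy case, most of the code is cleaning rather than analysis, which is exactly the time-consuming step the Turing wrangling project aims to speed up.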
All of this combines in the idea of artificial intelligence, or AI. While the 'machine' has not yet passed the 'Turing test' and cannot compete with humans in thought, in many applications AI and data science now support human decision making. The current buzz phrase for this is 'augmented intelligence'.
I can illustrate the research potential of data science through two examples, the first from my own field of urban research; the second from medicine - with recent AI research in this field learned, no doubt imperfectly, from my Turing colleague Mihaela van der Schaar.
There is a long history of developing mathematical and computer models of cities. Data arrives very slowly for model calibration - the census, for example, is critical. A combination of open government data and real-time flows from mobile phones and social media networks has changed this situation: real-time calibration is now possible. This potentially transforms both the science and its application in city planning. Machine learning complements, and potentially integrates with, the models. Data science in this case adds to an existing deep knowledge base.
Medical diagnosis is also underpinned by existing knowledge - physiology, cell and molecular biology for example. It is a skilled business, interpreting symptoms and tests. This can be enhanced through data science techniques - beginning with advances in imaging and visualisation and then the application of machine learning to the variety of evidence available. The clinician can add his or her own judgement. Treatment plans follow. At this point, something really new kicks in. 'Live' data on patients, including their responses to treatment, becomes available. This data can be combined with personal data to derive clusters of 'like' patients, enabling the exploration of the effectiveness of different treatment plans for different types of patients. This combination of data science techniques and human decision making is an excellent example of augmented intelligence. This opens the way to personalised intelligent medicine, which is set to have a transformative effect on healthcare (for those interested in finding out more, reserve a place for Mihaela van der Schaar's Turing Lecture on 4 May).
An exciting new agenda
These kinds of developments of data science, and the associated applications, are possible in almost all sectors of industry. It is the role of the Alan Turing Institute to explore both the fundamental science underpinnings, and the potential applications, of data science across this wide landscape.
We currently work in fields as diverse as digital engineering, defence and security, computer technology and finance as well as cities and health. This range will expand as this very new Institute grows. We will work with and through universities and with commercial, public and third sector partners, to generate and develop the fruits of data science. This is a challenging agenda but a hugely exciting one.
Listening to O'Reilly Data Show - O'Reilly Media Podcast (Becoming a machine learning engineer): https://www.oreilly.com/ideas/becoming-a-machine-learning-engineer
The O’Reilly Data Show Podcast: Aurélien Géron on enabling companies to use machine learning in real-world products.
In this episode of the Data Show, I spoke with Aurélien Géron (https://www.linkedin.com/in/aur%25C3%25A9lien-g%25C3%25A9ron-02720b83/), a serial entrepreneur, data scientist, and author of a popular new book entitled Hands-on Machine Learning with Scikit-Learn and TensorFlow. Géron's book is aimed at software engineers who want to learn machine learning and start deploying machine learning models in real-world products.
As more companies adopt big data and data science technologies, there is an emerging cohort of individuals who have strong software engineering skills and are experienced using machine learning and statistical techniques. The need to build data products has given rise to what many are calling “machine learning engineers”: individuals who can work on both data science prototypes and production systems.
Géron is finding strong demand for his services as a consulting machine learning engineer, and he hopes his new book will be an important resource for those who want to enter the field.
Here are some highlights from our conversation:
From product manager to machine learning engineer
I decided to join Google. They offered me a job as the lead product manager of YouTube's video classification team. The goal is to create a system that can automatically find out what each video is about. Google has a huge knowledge graph with hundreds of millions of topics in it, and the goal is to actually connect each video with all the topics in the knowledge graph covered in the video.
... I was a product manager, and I had always been a software engineer. I felt a little bit far from the technical aspects; I wanted to code again. That was the first thing. The second thing is, TensorFlow came out and there was a lot of communication internally at Google. I began using TensorFlow, and loved it. I knew TensorFlow would become popular, and I felt it would make for a good book.
Writing a machine learning book for engineers
I had gone through all the classes I could; there are internal classes at Google for learning machine learning, and they had great teachers there. I also learned as much as I could from books, from Andrew Ng's Coursera class, and everything you can think of to learn machine learning. I was a bit frustrated by the books. The books are really good, but a lot of them are from researchers and they don't feel hands-on. I'm a software engineer; I wanted to code. That's when I decided that I wanted to write a book about TensorFlow that was really hands-on, with examples of code and things that engineers would pick up and start using right away. The other thing is that while there were a few books targeted at engineers, they really stayed as far away from the underlying math as possible. In addition, many of the existing books relied on toy functions, toy examples of code, and that was also a bit frustrating because I wanted to have production-ready code. That's how the idea grew: write a book about TensorFlow for engineers, with production-ready examples.
Business metrics are distinct from machine learning metrics
You can spend months tuning a great classifier that will detect with 98% precision a particular set of topics, but then you launch it and it really doesn't affect your business metrics whatsoever.
The first step is to really understand what the business metrics, or objectives, are. How are you going to measure them? Then, go and see if you have a chance at improving things. An interesting technique is to try to manually achieve the task. Have a human try to achieve the task and see if that has an impact. It's not always possible, but if you can do that, it might be worth spending months building an architecture to do it automatically. If a human cannot improve things, it might be challenging for a machine to do better. It might still be possible, but it might be tougher.
Make sure you know what the business objective is and never lose track of it. I've seen people start improving models, but they don't really have metrics to see whether or not things have improved. It sounds stupid, but one of the very first things you need to do is make sure you have clear metrics that everybody agrees on. It's very tempting to say, 'I feel this architecture is going to work better' and then work on it, but it hasn't improved anything because you're working without metrics.
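As a toy illustration of Géron's point, the sketch below (entirely made-up numbers) computes a machine learning metric alongside a separate, hypothetical business metric; the two need not move together, and only the latter tells you whether the model matters.

```python
from sklearn.metrics import precision_score

# Ground truth and predictions for a topic classifier (toy data).
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# The machine learning metric: every predicted positive is correct.
ml_metric = precision_score(y_true, y_pred)

# A hypothetical business metric: extra engagement attributed to
# correctly tagged items, say 0.1 units each. Agree this number first!
extra_engagement = 0.1 * sum(t == p == 1 for t, p in zip(y_true, y_pred))

print(f"precision: {ml_metric:.2f}, engagement lift: {extra_engagement:.1f}")
```

Here precision is perfect, yet the classifier misses two relevant items, so the (invented) business metric barely moves: exactly the 98%-precision-but-no-impact scenario Géron describes.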
- Hands-on Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron’s new book)
- What is Hardcore Data Science - in practice? (Mikio Braun on how to bring data science into production)
- The Deep Learning video collection (Strata Data 2016)
- Fundamentals of Deep Learning
- Advanced Analytics with Spark
"What is machine learning?" is a question a lot of us often encounter. From email filtering to recommendation engines, machine learning is used is many of our daily activities.
Here is a video from the Oxford Sparks outreach programme at the University of Oxford with a two-minute explanation. Enjoy!
Well, I am very pleased to show you the cover that will be used for my "Data Science and Analytics with Python" book. Not long to publication day!
Very pleased to have given an intro talk on Data Science and Analytics at General Assembly yesterday.
I have now received comments and corrections for the proofreading of my “Data Science and Analytics with Python” book.
Two weeks and counting to return corrections and comments back to the editor and project manager.
During the weekend a member of the team got in touch because he was unable to get a Python package working. He had just installed Python on his machine, but things were not quite right... For example, pip was not working and he had a bit of bother setting some environment variables... I recommended that he have a look at installing Python via the Anaconda distribution. Today he was up and running with his app.
Given that outcome, I thought it was a great coincidence that the latest episode of Talk Python To Me that started playing on my way back home happened to be about Conda and Conda-Forge. I highly recommend listening to it. Take a look:
Talk Python To Me - Python conversations for passionate developers - #94 Guarenteed packages via Conda and Conda-Forge
Have you ever had trouble installing a package you wanted to use in your Python app? Likely it contained some odd dependency, required a compilation step, maybe even using an uncommon compiler like Fortran. Did you try it on Windows? How many times have you seen "Cannot find vcvarsall.bat" before you had to take a walk?
If this sounds familiar, you might want to check out conda (the package manager), Anaconda (the distribution), conda-forge, and conda-build. They dramatically lower the bar for installing packages on all platforms.
This week you'll meet Phil Elson, Kale Franz, and Michael Sarahan who all work on various parts of this ecosystem.
Links from the show:
Anaconda distribution: continuum.io/anaconda-overview
I have been thinking of making a post about CRISP-DM... in the meantime here is one from Steph Locke.
The Cross Industry Standard Process for Data Mining (CRISP-DM) was a concept developed 20 years ago now. I’ve read about it in various data mining and related books and it’s come in very handy over the years. In this post, I’ll outline what the model is and why you should know about it, even if it has that terribly out of vogue phrase data mining in it!
Data / R people. Do you know what the CRISP-DM model is?
— Steph Locke (@SteffLocke) January 8, 2017
The model splits a data mining project into six phases and allows for going back and forth between different stages. I'd personally stick in a few more backwards arrows, but it's generally fine. The CRISP-DM model applies equally well to a data science project.
CRISP-DM Process diagram by Kenneth Jensen (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Typical activities in each phase
In Data Mining Techniques in CRM, a very readable book, the authors outline in Table 1.1 some typical activities within each phase:
- Business understanding
  - Understanding the business goal
  - Situation assessment
  - Translating the business goal into a data mining objective
  - Development of a project plan
- Data understanding
  - Considering data requirements
  - Initial data collection, exploration, and quality assessment
- Data preparation
  - Selection of required data
  - Data acquisition
  - Data integration and formatting […]
  - Data cleaning
  - Data transformation and enrichment […]
- Modelling
  - Selection of appropriate modelling technique
  - […] Splitting of the dataset into training and testing subsets for evaluation purposes
  - Development and examination of alternative modelling algorithms and parameter settings
  - Fine-tuning of the model settings according to an initial assessment of the model's performance
- Model evaluation
  - Evaluation of the model in the context of the business success criteria
  - Model approval
- Deployment
  - Creation of a report of findings
  - Planning and development of the deployment procedure
  - Deployment of the […] model
  - Distribution of the model results and integration into the organisation's operational […] system
  - Development of a maintenance / update plan
  - Review of the project
  - Planning the next steps
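One of the modelling activities listed above, splitting the dataset into training and testing subsets, might look like this with scikit-learn (using the bundled iris data purely as an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for evaluation, stratified so each class
# is represented proportionally in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # → 105 45
```

Keeping the test subset untouched until evaluation is what makes the later "evaluation in the context of the business success criteria" honest.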
The CRISP-DM process outlines the steps involved in performing data science activities from business need to deployment, and most importantly it indicates how iterative this process is and that you never get things perfectly right.
We know that at the beginning of our first ever project we may not have a lot of domain knowledge, or there might be problems with the data, or the model might not be valuable enough to put into production. These things happen, and the really nice thing about the CRISP-DM model is that it allows for all of this. It's not a single linear path from project kick-off to deployment. It helps you remember not to beat yourself up over having to go back a step. It also equips you with something upfront to explain to managers that sometimes you will need to bounce between some phases, and that's ok.
All models are wrong but some are useful (George Box)
We also know that our model is not going to be perfect. By the end of the project, our model's value is already deteriorating! We get new customers, people change, the world changes, the business changes. Everything is conspiring against your model. This means it requires regular TLC for it to remain of value. We might just need to make slight regular adjustments for the latest view of the world (re-calibration), or we might need to take another tilt at modelling the problem. The big circle around the process shows this fact of a data scientist's life.
Working from the expectation that we will be iterative, we can start planning cycles of work. These might start with a short, small, simple model cycle to get a basic model quickly. Then further iterations can develop stronger models. The business gets some immediate benefit and it can then continue getting additional benefit from further cycles, or people could be moved onto building the next quick and simple model.
This gives the business a better high-level view of where data scientists are adding value and it means if the company is evolving the processes and data engineering capabilities at the same time, then a broad range of simple models can be first developed and implemented, giving learning experiences for all involved.
Estimation of project work and scoping is often difficult for data science projects, and that does need to change. One thing we can do is take the CRISP-DM phases and typical activities and build checklists and process frameworks around them. We can start moving each “bespoke” activity into a “cookie-cutter” activity.
One simple way of doing this is to start with a checklist. I am a big fan of checklists, more so after reading The Checklist Manifesto. You can build a manual checklist for people to work through to make sure important tasks are completed, that lessons from past projects are taken into account, and that ethical, regulatory, and legal considerations are addressed at the right points in the development cycle.
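A checklist of this kind needs nothing fancy; here is a minimal sketch in Python, with phases taken from CRISP-DM and the individual items invented purely for illustration:

```python
# A minimal, hypothetical project checklist keyed by CRISP-DM phase.
checklist = {
    "Business understanding": ["Business goal agreed", "Success metrics defined"],
    "Data understanding": ["Data sources listed", "Quality assessed"],
    "Data preparation": ["Cleaning steps documented"],
    "Modelling": ["Train/test split recorded", "Baseline model built"],
    "Evaluation": ["Results checked against business metrics"],
    "Deployment": ["Maintenance / update plan written"],
}

done = set()

def tick(item):
    """Mark a checklist item as completed."""
    done.add(item)

def outstanding():
    """Return the items not yet ticked, grouped by phase."""
    return {phase: [i for i in items if i not in done]
            for phase, items in checklist.items()}

tick("Business goal agreed")
print(outstanding()["Business understanding"])  # → ['Success metrics defined']
```

Even a sketch like this makes the "bespoke" activities visibly repeatable, which is the first step towards the cookie-cutter processes mentioned above.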
The Microsoft Team Data Science Process is a developing framework that broadly follows the CRISP-DM model and is bringing in templates and tools to help data scientists. It’s proving quite interesting and I would recommend it as follow up reading.
Thinking about how we work
I read a lot of productivity, project management, and framework books. I'm always interested in how we can do our jobs better. Usually, this boils down to making things simpler and helping ensure we do the right things at the right time. The CRISP-DM is one simple thing that has helped me put that structure onto what often seems a chaotic process. I hope it could offer you some benefit and I'd be really interested to hear your thoughts, experiences, and tips for building better data science workflows.
I am very pleased to tell you about some news I received a couple of weeks ago from my editor: my book "Data Science and Analytics with Python" has been transferred to the production department so that they can begin the publication process!
The book has been assigned a Project Editor who will handle the proofreading and oversee all aspects of the production process. This came after clearing the review process I told you about some time ago. The review was lengthy, but it was very positive and the comments of the reviewers have definitely improved the manuscript.
As a result of the review, the table of contents has changed a bit since the last update I posted. Here is the revised table:
- The Trials and Tribulations of a Data Scientist
- Python: For Something Completely Different!
- The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
- The Relationship Conundrum: Regression
- Jackalopes and Hares: Clustering
- Unicorns and Horses: Classification
- Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
- Less is More: Dimensionality Reduction
- Kernel Trick Under the Sleeve: Support Vector Machines
Each of the chapters is intended to be sufficiently self-contained. There are occasions where reference to other sections is needed, but I am confident that this is a good thing for the reader. Chapter 1 is effectively a discussion of what data science and analytics are, paying particular attention to the data exploration process and munging. It also offers my perspective on the skills and roles required to build a successful data science function.
Chapter 2 is a quick reminder of some of the most important features of Python. We then move into the core machine learning concepts that are used in the rest of the book. Chapter 4 covers regression, from ordinary least squares to LASSO and ridge regression. Chapter 5 covers clustering (k-means, for example) and Chapter 6 covers classification algorithms such as logistic regression and Naïve Bayes.
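As a taste of the regression material, here is a small scikit-learn sketch (synthetic data, not an excerpt from the book) contrasting ordinary least squares with ridge and LASSO; note how LASSO drives the coefficients of the irrelevant features towards zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # five features, only the first two matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: zeroes out the useless ones

for name, model in [("OLS", ols), ("ridge", ridge), ("LASSO", lasso)]:
    print(name, np.round(model.coef_, 2))
```

The regularisation strength `alpha` is the knob that trades a little bias for a lot less variance, which is exactly the kind of trade-off those chapters discuss.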
In Chapter 7 we introduce hierarchical clustering and decision trees, and talk about ensemble techniques such as bagging and boosting.
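A quick scikit-learn sketch of the bagging and boosting idea (again synthetic data, not from the book): both combine many weak tree learners, bagging by averaging independently trained trees, boosting by fitting each tree to the errors of the previous ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0).fit(X_tr, y_tr)

# Boosting: trees added sequentially, each correcting its predecessors.
boosting = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("bagging :", round(bagging.score(X_te, y_te), 2))
print("boosting:", round(boosting.score(X_te, y_te), 2))
```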
Dimensionality reduction techniques such as Principal Component Analysis are discussed in Chapter 8, and Chapter 9 covers the support vector machine algorithm and the all-important kernel trick, in applications such as regression and classification.
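To see why the kernel trick matters, here is a small sketch (synthetic data, not from the book): a linear SVM cannot separate two concentric circles, while an RBF kernel handles them easily; PCA then illustrates projecting the same data onto fewer dimensions.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # struggles: no separating line exists
rbf = SVC(kernel="rbf").fit(X, y)         # kernel trick: separable implicitly

print("linear:", round(linear.score(X, y), 2),
      "rbf:", round(rbf.score(X, y), 2))

# PCA: project the 2-D data onto its first principal component.
X1 = PCA(n_components=1).fit_transform(X)
print("reduced shape:", X1.shape)
```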
The book contains 55 figures and 18 tables, plus plenty of bits and pieces of Python code to play with.
I guess I will have to sit and wait for the proofreading to be completed and then start the arduous process of going through the comments and suggestions. As ever, I will keep you posted as to how things go.
Ah! By the way, I will start a mailing list to tell people when the book is ready, so if you are interested, please let me know!
Keep in touch!
PS. The table of contents is also now available at CRC Press here.
A few weeks ago I was invited by General Assembly to give a short intro to Data Science to a group of interested (and interesting) students. They all had different backgrounds, but they all shared an interest for technology and related subjects.
While I was explaining some of the differences between supervised and unsupervised machine learning, I used my example of an alien life form trying to cluster (and eventually classify) cats and dogs. If you are interested in knowing more about this, you will probably have to wait for the publication of my "Data Science and Analytics with Python" book... I digress...
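For a flavour of the alien example, here is a toy sketch (invented features and numbers): given only measurements and no labels, k-means recovers the two groups on its own, which is the unsupervised half of the story.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical features: [weight in kg, snout length in cm].
cats = rng.normal([4, 4], 0.5, size=(30, 2))
dogs = rng.normal([20, 10], 2.0, size=(30, 2))
X = np.vstack([cats, dogs])

# Unsupervised: the "alien" has no labels, only the measurements.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.labels_[-5:])  # two well-separated groups
```

Supervised learning would be the next step: once the alien has names for the clusters, it can train a classifier to label new animals.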
So, Ed Shipley - one of the admissions managers at GA London - asked me and the students if we had seen the videos that Facebook had produced to explain machine learning... He was reminded of them as they use an example about a machine distinguishing between dogs and cars... (see what they did there?...). If you haven't seen the videos, here you go:
Intro to AI
Convolutional Neural Nets
Last week I had the opportunity to attend the annual IBM conference in Las Vegas. The World of Watson conference, formerly known as Insight, provided me with an opportunity to meet interesting new people, talk to colleagues and customers, learn new things and share some ideas with like-minded people. As you can imagine, with Watson at centre stage of the event, there were a large number of presentations, stands and marketing featuring Watson-related things: from cognitive chocolate and brews through to cognitive computing and beyond.
My session took place on Monday, October 24th, and I was very pleased to see a full room, and even standing-room only just minutes before the start. We covered some of the fundamentals of data science and machine learning, and took the pulse of their use in the insurance industry in particular. I then had the opportunity to share some of the results of the work we have been doing over the past 12 months at the Data Science Studio in London. The case studies showcased included examples in insurance, banking, wealth management and retail.
All in all, it was a very successful and enjoyable trip, in spite of the constant flashing lights of the slot machines around Las Vegas' different venues.
The original blog can be seen here.
R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, and not overly encumbered with formatting or generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (embedded below, or download the PDF slides here), in practice few spreadsheets have "a clean little rectangle of data in the upper-left corner", because most people use spreadsheets not just as a file format for data retrieval, but also as a reporting/visualization/analysis tool.
Nonetheless, for a practicing data scientist, there's a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.
To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets into R as new "linen" objects from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell, which is useful if you're trying to figure out why an apparently simple spreadsheet isn't importing as you expect. (Check out the "weekend getaway winner" story near the end of Jenny's talk for a great example.)
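jailbreakr is an R package, but the underlying idea, pulling a clean rectangle of data out of a messy sheet, can be illustrated in Python with pandas (here on a made-up CSV export standing in for a spreadsheet):

```python
import io
import pandas as pd

# A messy "sheet": a title row, a blank row, the real table, then notes.
sheet = io.StringIO(
    "Quarterly report,,\n"
    ",,\n"
    "region,quarter,sales\n"
    "north,Q1,100\n"
    "south,Q1,80\n"
    ",,\n"
    "prepared by J. Smith,,\n"
)

# Extract just the data rectangle: skip the junk above the header,
# and stop reading after the two real data rows.
df = pd.read_csv(sheet, skiprows=2, nrows=2)
print(df)
```

The manual `skiprows`/`nrows` bookkeeping is exactly what tools like jailbreakr try to automate by detecting the sub-tables for you.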
The jailbreakr package isn't yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.
Github (rsheets): jailbreakr
I am very pleased to have finally received the Raspberry Pi 3 that I ordered the other day. I also got a Sense HAT, an add-on board for the Raspberry Pi made especially for the Astro Pi mission.
The Sense HAT has an 8×8 RGB LED matrix, a five-button joystick and includes the following sensors:
- Barometric pressure
- Temperature
- Humidity
- Gyroscope
- Accelerometer
- Magnetometer
There is even a Python library providing easy access to everything on the board. I can't wait to start using it with some of the APIs available at Bluemix, for example. Any ideas are more than welcome.
I am joining forces with Bob Yelland (IBM) again to organise a joint meetup. I say again as we organised a joint session a few months back between the Big Data Developers in London and the Data+Visual meetup. I even gave a talk at that one, about "Data Visualisation: The good, the bad and the ugly". Unlike the previous one, this time we are actually physically joining the attendees rather than having parallel sessions.
The event is now live and will take place on the 25th of August. Will I see you there?
On the Skills Matter site: https://skillsmatter.com/meetups/8259-datapalooza-nights-meetup#overview
and the MeetUp site: https://www.meetup.com/Big-Data-Developers-in-London/events/232919166/