A collection of Data Science and Data Visualisation related posts, pics and thoughts. Take a look and enjoy.
"Data Science and Analytics with Python" was published yesterday and now it is already appearing as a suggested book for related titles.
You can find it via the link above or on Amazon here.
Very pleased to see that the publication of my "Data Science and Analytics with Python" book has finally arrived.
It has been a long road, one filled with unicorns and Jackalopes, decision trees and random forests, variance and bias, cats and dogs, and targets and features.
Well over a year ago, the idea of writing another book seemed like a far-fetched proposition. Writing the book came about from the work that I have been doing in the area as well as from discussions with my colleagues and students, and with practitioners and beneficiaries of data science and analytics.
It is my sincere hope that the book is useful to those coming afresh to this new field as well as to those more seasoned data scientists.
This afternoon I had the pleasure of approving the final version of the book that will be sent to the printers in the next few days.
Data Science & Augmented Intelligence - Reblog from "Data Science: a new discipline to change the world" by Alan Wilson
This is a reblog of the post by Alan Wilson that appeared in the EPSRC blog. You can see the original here.
Data science - the new kid on the block
I have re-badged myself several times in my research career: mathematician, theoretical physicist, economist (of sorts), geographer, city planner, complexity scientist, and now data scientist. This is partly personal idiosyncrasy but also a reflection of how new interdisciplinary research challenges emerge. I now have the privilege of being the Chief Executive of The Alan Turing Institute - the national centre for data science. 'Data science' is the new kid on the block. How come?
First, there is an enormous amount of new 'big' data; second, this has had a powerful impact on all the sciences; and third, on society, the economy and our way of life. Data science represents these combinations. The data comes from widespread digitisation combined with the 'open data' initiatives of government and extensive deployment of sensors and devices such as mobile phones. This generates huge research opportunities.
In broad terms, data science has two main branches. First, what can we do with the data? Applications of statistics and machine learning fall under this branch. Second, how can we transform existing science with this data and these methods? Much of the second is rooted in mathematics. To make this work in practice, there is a time-consuming first step: making the data useable by combining different sources in different formats. This is known as 'data wrangling', which coincidentally is the subject of a new Turing research project to speed up this time-consuming process. The whole field is driven by the power of the computer, and computer science. Understanding the effects of data on society, and the ethical questions it provokes, is led by the social sciences.
All of this combines in the idea of artificial intelligence, or AI. While the 'machine' has not yet passed the 'Turing test' and cannot compete with humans in thought, in many applications AI and data science now support human decision making. The current buzz phrase for this is 'augmented intelligence'.
I can illustrate the research potential of data science through two examples, the first from my own field of urban research; the second from medicine - with recent AI research in this field learned, no doubt imperfectly, from my Turing colleague Mihaela van der Schaar.
There is a long history of developing mathematical and computer models of cities. Data arrives very slowly for model calibration - the census, for example, is critical. A combination of open government data and real-time flows from mobile phones and social media networks has changed this situation: real-time calibration is now possible. This potentially transforms both the science and its application in city planning. Machine learning complements, and potentially integrates with, the models. Data science in this case adds to an existing deep knowledge base.
Medical diagnosis is also underpinned by existing knowledge - physiology, cell and molecular biology for example. It is a skilled business, interpreting symptoms and tests. This can be enhanced through data science techniques - beginning with advances in imaging and visualisation and then the application of machine learning to the variety of evidence available. The clinician can add his or her own judgement. Treatment plans follow. At this point, something really new kicks in. 'Live' data on patients, including their responses to treatment, becomes available. This data can be combined with personal data to derive clusters of 'like' patients, enabling the exploration of the effectiveness of different treatment plans for different types of patients. This combination of data science techniques and human decision making is an excellent example of augmented intelligence. This opens the way to personalised intelligent medicine, which is set to have a transformative effect on healthcare (for those interested in finding out more, reserve a place for Mihaela van der Schaar's Turing Lecture on 4 May).
An exciting new agenda
These kinds of developments of data science, and the associated applications, are possible in almost all sectors of industry. It is the role of the Alan Turing Institute to explore both the fundamental science underpinnings, and the potential applications, of data science across this wide landscape.
We currently work in fields as diverse as digital engineering, defence and security, computer technology and finance as well as cities and health. This range will expand as this very new Institute grows. We will work with and through universities and with commercial, public and third sector partners, to generate and develop the fruits of data science. This is a challenging agenda but a hugely exciting one.
Listening to O'Reilly Data Show - O'Reilly Media Podcast (Becoming a machine learning engineer): https://www.oreilly.com/ideas/becoming-a-machine-learning-engineer
The O’Reilly Data Show Podcast: Aurélien Géron on enabling companies to use machine learning in real-world products.
In this episode of the Data Show, I spoke with Aurélien Géron (https://www.linkedin.com/in/aur%25C3%25A9lien-g%25C3%25A9ron-02720b83/), a serial entrepreneur, data scientist, and author of a popular, new book entitled Hands-on Machine Learning with Scikit-Learn and TensorFlow. Géron’s book is aimed at software engineers who want to learn machine learning and start deploying machine learning models in real-world products.
As more companies adopt big data and data science technologies, there is an emerging cohort of individuals who have strong software engineering skills and are experienced using machine learning and statistical techniques. The need to build data products has given rise to what many are calling “machine learning engineers”: individuals who can work on both data science prototypes and production systems.
Géron is finding strong demand for his services as a consulting machine learning engineer, and he hopes his new book will be an important resource for those who want to enter the field.
Here are some highlights from our conversation:
From product manager to machine learning engineer
I decided to join Google. They offered me a job as the lead product manager of YouTube's video classification team. The goal is to create a system that can automatically find out what each video is about. Google has a huge knowledge graph with hundreds of millions of topics in it, and the goal is to actually connect each video with all the topics in the knowledge graph covered in the video.
... I was a product manager, and I had always been a software engineer. I felt a little bit far from the technical aspects; I wanted to code again. That was the first thing. The second thing is, TensorFlow came out and there was a lot of communication internally at Google. I began using TensorFlow, and loved it. I knew TensorFlow would become popular, and I felt it would make for a good book.
Writing a machine learning book for engineers
I had gone through all the classes I could; there are internal classes at Google for learning machine learning, and they had great teachers there. I also learned as much as I could from books, from Andrew Ng's Coursera class, and everything you can think of to learn machine learning. I was a bit frustrated by the books. The books are really good, but a lot of them are from researchers and they don't feel hands-on. I'm a software engineer; I wanted to code. That's when I decided that I wanted to write a book about TensorFlow that was really hands-on, with examples of code and things that engineers would pick up and start using right away. The other thing is that while there were a few books targeted at engineers, they really stayed as far away from the underlying math as possible. In addition, many of the existing books relied on toy functions, toy examples of code, and that was also a bit frustrating because I wanted to have production-ready code. That's how the idea grew: write a book about TensorFlow for engineers, with production-ready examples.
Business metrics are distinct from machine learning metrics
You can spend months tuning a great classifier that will detect with 98% precision a particular set of topics, but then you launch it and it really doesn't affect your business metrics whatsoever.
The first step is to really understand what the business metrics, or objectives, are. How are you going to measure them? Then, go and see if you have a chance at improving things. An interesting technique is to try to manually achieve the task. Have a human try to achieve the task and see if that has an impact. It's not always possible, but if you can do that, it might be worth spending months building an architecture to do it automatically. If a human cannot improve things, it might be challenging for a machine to do better. It might still be possible, but it might be tougher.
Make sure you know what the business objective is and never lose track of it. I've seen people start improving models, but they don't really have metrics to see whether or not things have improved. It sounds stupid but one of the very first things you need to do is to make sure you have clear metrics that everybody agrees on. It's very tempting to say, ‘I feel this architecture is going to work better’ and then try to work on it, but it hasn't improved anything because you're working without metrics.
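To make the distinction between a machine learning metric and a business metric concrete, here is a minimal sketch in Python. The labels, the click-through rates and the choice of click-through rate as the business metric are all made-up assumptions for illustration; they do not come from the podcast.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


# Hypothetical labels: 1 = the video is about the topic, 0 = it is not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

# Hypothetical business metric: click-through rate before and after launch.
ctr_before = 0.120
ctr_after = 0.121

model_precision = precision(y_true, y_pred)
business_lift = (ctr_after - ctr_before) / ctr_before

print(f"precision: {model_precision:.2f}")
print(f"relative change in click-through rate: {business_lift:.1%}")
```

In this toy case the classifier looks excellent on its own metric while the business metric barely moves, which is the situation described above. Agreeing on both numbers before tuning anything is the point of the exercise.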
- Hands-on Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron’s new book)
- What is Hardcore Data Science - in practice? (Mikio Braun on how to bring data science into production)
- The Deep Learning video collection (Strata Data 2016)
- Fundamentals of Deep Learning
- Advanced Analytics with Spark
"What is machine learning?" is a question a lot of us often encounter. From email filtering to recommendation engines, machine learning is used in many of our daily activities.
Here is a video from the Oxford Sparks outreach program of the University of Oxford with a two-minute explanation. Enjoy
Well, I am very pleased to show you the cover that will be used for my "Data Science and Analytics with Python" book. Not long to publication day!
Very pleased to have given an intro talk on Data Science and Analytics at General Assembly yesterday.
I have now received comments and corrections for the proofreading of my “Data Science and Analytics with Python” book.
Two weeks and counting to return corrections and comments back to the editor and project manager.
During the weekend a member of the team got in touch because he was unable to get a Python package working. He had just installed Python on his machine, but things were not quite right... For example, pip was not working and he had a bit of bother setting some environment variables... I recommended that he have a look at installing Python via the Anaconda distribution. Today he was up and running with his app.
Given that outcome, I thought it was a great coincidence that the latest episode of Talk Python To Me that started playing on my way back home happened to be about Conda and Conda-Forge. I highly recommend listening to it. Take a look:
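If you find yourself in a similar spot, a quick check of which interpreter and which pip are actually being picked up can save some time. The snippet below is only a sketch using the standard library; the function name `describe_python_environment` is my own and the set of things it reports is an assumption about what is useful, not an exhaustive diagnostic.

```python
import importlib.util
import os
import shutil
import sys


def describe_python_environment():
    """Collect a few facts that help diagnose a broken Python install."""
    return {
        # Path of the interpreter that is running this script.
        "interpreter": sys.executable,
        "version": sys.version.split()[0],
        # First `pip` and `conda` executables found on PATH, or None.
        "pip_on_path": shutil.which("pip"),
        "conda_on_path": shutil.which("conda"),
        # Whether this interpreter can import pip as a module.
        "pip_importable": importlib.util.find_spec("pip") is not None,
        # An environment variable that often causes confusion when set.
        "PYTHONPATH": os.environ.get("PYTHONPATH"),
    }


for key, value in describe_python_environment().items():
    print(f"{key}: {value}")
```

If the interpreter path and the pip path point to different installations, that mismatch is often the source of the trouble.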
Talk Python To Me - Python conversations for passionate developers - #94 Guarenteed packages via Conda and Conda-Forge
Have you ever had trouble installing a package you wanted to use in your Python app? Likely it contained some odd dependency, required a compilation step, maybe even using an uncommon compiler like Fortran. Did you try it on Windows? How many times have you seen "Cannot find vcvarsall.bat" before you had to take a walk?
If this sounds familiar, you might want to check out conda, the package manager; Anaconda, the distribution; conda-forge; and conda-build. They dramatically lower the bar for installing packages on all the platforms.
This week you'll meet Phil Elson, Kale Franz, and Michael Sarahan who all work on various parts of this ecosystem.
Links from the show:
Anaconda distribution: continuum.io/anaconda-overview
I have been thinking of making a post about CRISP-DM... in the meantime here is one from Steph Locke.
The Cross Industry Standard Process for Data Mining (CRISP-DM) was a concept developed 20 years ago now. I’ve read about it in various data mining and related books and it’s come in very handy over the years. In this post, I’ll outline what the model is and why you should know about it, even if it has that terribly out of vogue phrase data mining in it!
Data / R people. Do you know what the CRISP-DM model is?
— Steph Locke (@SteffLocke) January 8, 2017
The model splits a data mining project into six phases and it allows for needing to go back and forth between different stages. I’d personally stick a few more backwards arrows but it’s generally fine. The CRISP-DM model applies equally well to a data science project.
CRISP-DM Process diagram by Kenneth Jensen (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Typical activities in each phase
In Data Mining Techniques in CRM, a very readable book, they outline in Table 1.1 some typical activities within each phase:
- Business understanding
  - Understanding the business goal
  - Situation assessment
  - Translating the business goal into a data mining objective
  - Development of a project plan
- Data understanding
  - Considering data requirements
  - Initial data collection, exploration, and quality assessment
- Data preparation
  - Selection of required data
  - Data acquisition
  - Data integration and formatting […]
  - Data cleaning
  - Data transformation and enrichment […]
- Modeling
  - Selection of appropriate modeling technique
  - […] Splitting of the dataset into training and testing subsets for evaluation purposes
  - Development and examination of alternative modeling algorithms and parameter settings
  - Fine tuning of the model settings according to an initial assessment of the model’s performance
- Model evaluation
  - Evaluation of the model in the context of the business success criteria
  - Model approval
- Deployment
  - Create a report of findings
  - Planning and development of the deployment procedure
  - Deployment of the […] model
  - Distribution of the model results and integration in the organisation’s operational […] system
  - Development of a maintenance / update plan
  - Review of the project
  - Planning the next steps
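As an illustration of one of the activities listed above, splitting a dataset into training and testing subsets, here is a minimal sketch in plain Python. The function name `train_test_split`, its parameters and the toy `records` list are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 20 records identified by their index.
records = list(range(20))
train, test = train_test_split(records, test_fraction=0.25)
print(len(train), "training rows,", len(test), "testing rows")
```

The model is then fitted on `train` only and assessed on `test`, so the evaluation reflects data the model has not seen.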
The CRISP-DM process outlines the steps involved in performing data science activities from business need to deployment, and most importantly it indicates how iterative this process is and that you never get things perfectly right.
Within a given project, we know that at the beginning of our first ever project we may not have a lot of domain knowledge, or there might be problems with the data or the model might not be valuable enough to put into production. These things happen, and the really nice thing about the CRISP-DM model is it allows for us to do that. It’s not a single linear path from project kick-off to deployment. It helps you remember not to beat yourself up over having to go back a step. It also equips you with something upfront to explain to managers that sometimes you will need to bounce between some phases, and that’s ok.
All models are wrong but some are useful (George Box)
We also know that our model is not going to be perfect. By the end of the project, our model’s value is already deteriorating! We get new customers, people change, the world changes, the business changes. Everything is conspiring against your model. This means it requires regular TLC for it to remain of value. We might just need to adjust it slightly and regularly for the latest view of the world (re-calibration) or we might need to take another tilt at modelling the problem. The big circle around the process shows this fact of a data scientist’s life.
Working from the expectation that we will be iterative, we can start planning cycles of work. These might start with a short, small, simple model cycle to get a basic model quickly. Then further iterations can develop stronger models. The business gets some immediate benefit and it can then continue getting additional benefit from further cycles, or people could be moved onto building the next quick and simple model.
This gives the business a better high-level view of where data scientists are adding value and it means if the company is evolving the processes and data engineering capabilities at the same time, then a broad range of simple models can be first developed and implemented, giving learning experiences for all involved.
Estimation of project work and scoping is often difficult for data science projects, and that does need to change. One thing we can do is take the CRISP-DM phases and typical activities and build checklists and process frameworks around them. We can start moving each “bespoke” activity into a “cookie-cutter” activity.
One simple way of doing this is to start with a checklist. I am a big fan of checklists, more so after reading The Checklist Manifesto. You can build a manual checklist for people to work through to make sure important tasks are completed, that considerations from past projects are addressed, and you can ensure that ethical, regulatory, and legal considerations are considered at the right points in the development cycle.
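A checklist of this kind does not need much tooling. As a rough sketch, and assuming nothing beyond the Python standard library, a per-phase checklist could look like the following; the class name `PhaseChecklist` and the example tasks are illustrative choices, not part of CRISP-DM itself.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseChecklist:
    """Checklist for one project phase; maps each task to a done flag."""
    phase: str
    tasks: dict = field(default_factory=dict)

    def add(self, task):
        # New tasks start as not done; re-adding a task keeps its state.
        self.tasks.setdefault(task, False)

    def complete(self, task):
        if task not in self.tasks:
            raise KeyError(f"unknown task: {task}")
        self.tasks[task] = True

    def outstanding(self):
        """Return the tasks that are still open, in insertion order."""
        return [task for task, done in self.tasks.items() if not done]


checklist = PhaseChecklist("Business understanding")
for task in (
    "Understand the business goal",
    "Assess the situation",
    "Review ethical, regulatory, and legal considerations",
):
    checklist.add(task)

checklist.complete("Understand the business goal")
print(checklist.outstanding())
```

Keeping one such checklist per phase makes it easy to see what is still open before moving on, or when stepping back to an earlier phase.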
The Microsoft Team Data Science Process is a developing framework that broadly follows the CRISP-DM model and is bringing in templates and tools to help data scientists. It’s proving quite interesting and I would recommend it as follow up reading.
Thinking about how we work
I read a lot of productivity, project management, and framework books. I’m always interested in how we can do our jobs better. Usually, this boils down to making things simpler and helping ensure we do the right things at the right time. The CRISP-DM is one simple thing that has helped me put that structure onto what often seems a chaotic process. I hope it could offer you some benefit and I’d be really interested to hear your thoughts, experiences, and tips for building better data science workflows.
I am very pleased to tell you about some news I received a couple of weeks ago from my editor: my book "Data Science and Analytics with Python" has been transferred to the production department so that they can begin the publication process!
The book has been assigned a Project Editor who will handle the proofreading and all other aspects of the production process. This was after clearing the review process I told you about some time ago. The review was lengthy but it was very positive and the comments of the reviewers have definitely improved the manuscript.
As a result of the review, the table of contents has changed a bit since the last update I posted. Here is the revised table:
- The Trials and Tribulations of a Data Scientist
- Python: For Something Completely Different!
- The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
- The Relationship Conundrum: Regression
- Jackalopes and Hares: Clustering
- Unicorns and Horses: Classification
- Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
- Less is More: Dimensionality Reduction
- Kernel Trick Under the Sleeve: Support Vector Machines
Each of the chapters is intended to be sufficiently self-contained. There are some occasions where reference to other sections is needed, and I am confident that it is a good thing for the reader. Chapter 1 is effectively a discussion of what data science and analytics are, paying particular attention to the data exploration process and munging. It also offers my perspective as to what skills and roles are required to get a successful data science function.
Chapter 2 is a quick reminder of some of the most important features of Python. In Chapter 3 we then move into the core machine learning concepts that are used in the rest of the book. Chapter 4 covers regression from ordinary least squares to LASSO and ridge regression. Chapter 5 covers clustering (k-means for example) and Chapter 6 classification algorithms such as Logistic Regression and Naïve Bayes.
In Chapter 7 we introduce the use of hierarchical clustering, decision trees and talk about ensemble techniques such as bagging and boosting.
Dimensionality reduction techniques such as Principal Component Analysis are discussed in Chapter 8, and Chapter 9 covers the support vector machine algorithm and the all-important kernel trick in applications such as regression and classification.
The book contains 55 figures and 18 tables, plus plenty of bits and pieces of Python code to play with.
I guess I will have to sit and wait for the proofreading to be completed and then start the arduous process of going through the comments and suggestions. As ever I will keep you posted as to how things go.
Ah! By the way, I will start a mailing list to tell people when the book is ready, so if you are interested, please let me know!
Keep in touch!
PS. The table of contents is also now available at CRC Press here.