Data Science

A collection of Data Science and Data Visualisation related posts, pics and thoughts. Take a look and enjoy.

Core ML - Preparing the environment

Hello again! In preparation to training a model to be converted by Core ML to be used in an application, I would like to make sure we have a suitable environment to work on. One of the first things that came to my attention looking at the coremltools module is the fact that it only supports Python 2! Yes, you read correctly, you will have to make sure you use Python 2.7 if you want to make this work. As you probably know, Python 2 will be retired in 2020, so I hope that Apple is considering in their development cycles. In the meantime you can see the countdown to Python 2's retirement here, and thanks Python 2 for the many years of service...

Anyway, if you are a Python 2 user, then you are good to go. If on the other hand you have moved with the times you may need to make appropriate installations. I am using Anaconda (you may use your favourite distro) and I will be creating a conda environment (I'm calling it coreml) with Python 2.7 and some of the libraries I will be using:

> conda create --name coreml python=2.7 ipython jupyter scikit-learn

> source activate coreml

(coreml) > pip install coremltools

I am sure there may be other modules that will be needed, and I will make appropriate installations (and additions to this post) as that becomes clearer.

You can get a look at Apple's coremltools github repo here.

ADDITIONS: As I mentioned, there may have been other modules that needed installing in the new environment here is a list:

  • pandas
  • matplotlib
  • pillow
Read me...

Core ML - What is it?

In a previous post I mentioned that I will be sharing some notes about my journey with doing data science and machine learning by Apple technology. This is the firsts of those posts and here I will go about what Core ML is...

Core ML is a computer framework. So what is a framework?  Well, in computer terms is a software abstraction that enables generic functionality to be modified as required by the user to transform it into software for specific purposes to enable the development of a system or even a humble project.

Core ML integrates a trained machine learning model into your app.

So Core ML is an Apple provided framework to speed apps that use trained machine learning models. Notice that word in bold - trained - is part of the description of the framework. This means that the model has to be developed externally with appropriate training data for the specific project in mind. For instance if you are interested in building a classifier that distinguishes cats from cars, then you need to train the model with lots of cat and car images.

As it stands Core ML supports a variety of machine learning models, from generalised linear models (GLMs for short) to neural nets. Furthermore it helps with the tests of adding the trained machine learning model to your application by automatically creating a custom programmatic interface that supplies an APU to your model. All this within the comfort of Xcode!

There is an important point to remember. The model has to be developed externally from Core ML, in other words you may want to use your favourite machine learning framework (that word again), computer language and environment to cover the different aspects of the data science workflow. You can read more in that in Chapter 3 of my "Data Science and Analytics with Python" book. So whether you use Scikit-learnm, Keras or Caffe, the model you develop has to be trained (tested and evaluated) beforehand. Once you are ready, then Core ML will support you in bringing it to the masses via your app.

As mentioned in the Core ML documentation:

Core ML is optimized for on-device performance, which minimizes memory footprint and power consumption. Running strictly on the device ensures the privacy of user data and guarantees that your app remains functional and responsive when a network connection is unavailable.

OK, so in the next few posts we will be using Python and coremltools, to generate a so-called .mlmodel file that Xcode can use and deploy. Stay tuned!

Read me...

What Is Artificial Intelligence?

Original article by JF Puget here.

Here is a question I was asked to discuss at a conference last month: what is Artifical Intelligence (AI)?  Instead of trying to answer it, which could take days, I decided to focus on how AI has been defined over the years.  Nowadays, most people probably equate AI with deep learning.  This has not always been the case as we shall see.

Most people say that AI was first defined as a research field in a 1956 workshop at Dartmouth College.  Reality is that is has been defined 6 years earlier by Alan Turing in 1950.  Let me cite Wikipedia here:

The Turing test, developed by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviorequivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine's ability to render words as speech.[2] If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give.

The test was introduced by Turing in his paper, "Computing Machinery and Intelligence", while working at the University of Manchester(Turing, 1950; p. 460).[3] It opens with the words: "I propose to consider the question, 'Can machines think?'" Because "thinking" is difficult to define, Turing chooses to "replace the question by another, which is closely related to it and is expressed in relatively unambiguous words."[4] Turing's new question is: "Are there imaginable digital computers which would do well in the imitation game?"[5] This question, Turing believed, is one that can actually be answered. In the remainder of the paper, he argued against all the major objections to the proposition that "machines can think".[6]

image

So, the first definition of AI was about thinking machines.  Turing decided to test thinking via a chat.

The definition of AI rapidly evolved to include the ability to perform complex reasoning and planing tasks.  Early success in the 50s led prominent researchers to make imprudent predictions about how AI would become a reality in the 60s.  The lack of realization of these predictions led to funding cut known as the AI winter in the 70s.

In the early 80s, building on some success for medical diagnosis, AI came back with expert systems.  These systems were trying to capture the expertise of humans in various domains, and were implemented as rule based systems.  This was the days were AI was focusing on the ability to perform tasks at best human expertise level.  Success like IBM Deep Blue beating the chess world champion, Gary Kasparov, in  1997 was the acme of this line of AI research.

Let's contrast this with today's AI.  The focus is on perception: can we have systems that recognize what is in a picture, what is in a video, what is said in a sound track?  Rapid progress is underway for these tasks thanks to the use of deep learning.  Is it AI still?  Are we automating human thinking?  Reality is we are working on automating tasks that most humans can do without any thinking effort. Yet we see lots of bragging about AI being a reality when all we have is some ability to mimic human perception.  I really find it ironic that our definition of intelligence is that of mere perception  rather than thinking.

Granted, not all AI work today is about perception.  Work on natural language processing (e.g. translation) is a bit closer to reasoning than mere perception tasks described above.  Success like IBM Watson at Jeopardy, or Google AlphaGO at Go are two examples of the traditional AI aiming at replicate tasks performed by human experts.    The good news (to me at least) is that the progress is so rapid on perception that it will move from a research field to an engineering field in the coming years.  We will then see a re-positioning of researchers on other AI related topics such as reasoning and planning.  We'll be closer to Turing's initial view of AI.

Read me...

Now... presenting at ODSC Europe

Data science is definitely in everyone’s lips and this time I had the opportunity of showcasing some of my thoughts, practices and interests at the Open Data Science Conference in London.

The event was very well attended by data scientists, engineers and developers at all levels of seniority, as well as business stakeholders. I had the great opportunity to present the landscape that newcomers and seasoned practitioners must be familiar with to be able to make a successful transition into this exciting field.

It was also a great opportunity to showcase “Data Science and Analytics with Python” and to get to meet new people including some that know other members of my family too.

-j

Read me...

Data Science and Analytics with Python - New York Team

Earlier this week I received this picture of the team in New York. As you can see they have recently all received a copy of my "Data Science and Analytics with Python" book.

Thanks guys!

TeamNY.PNG

Read me...

Python overtakes R - Reblog

Did you use R, Python (along with their packages), both, or other tools for Analytics, Data Science, Machine Learning work in 2016 and 2017?

Python did not quite "swallow" R, but the results, based on 954 voters, show that in 2017 Python ecosystem overtook R as the leading platform for Analytics, Data Science, Machine Learning.

While in 2016 Python was in 2nd place ("Mainly Python" had 34% share vs 42% for "Mainly R"), in 2017 Python had 41% vs 36% for R.

The share of KDnuggets readers who used both R and Python in significant ways also increased from 8.5% to 12% in 2017, while the share who mainly used other tools dropped from 16% to 11%.

Python, R, Other Analytics, Data Science platform, 2016-2017
Fig. 1: Share of Python, R, Both, or Other platforms usage for Analytics, Data Science, Machine Learning, 2016 vs 2017

Next, we examine the transitions between the different platforms.

Python vs R vs Other, 2016 to 2017 Transitions
Fig. 2: Analytics, Data Science, Machine Learning Platforms
Transitions between R, Python, Both, and Other from 2016 to 2017

This chart looks complicated, but we see two key aspects, and Python wins on both:

  • Loyalty: Python users are more loyal, with 91% of 2016 Python users staying with Python. Only 74% of R users stayed, and 60% of other platforms users did.
  • Switching: Only 5% of Python users moved to R, while twice as many - 10% of R users moved to Python. Among those who used both in 2016, only 49% kept using both, 38% moved to Python, and 11% moved to R.

Net we look at trends across multiple years.

In our 2015 Poll on R vs Python we did not offer an option for "Both Python and R", so to compare trends across 4 years, we replace the shares of Python and R in 2016 and 2017 by
Python* = (Python share) + 50% of (Both Python and R)
R* = (R share) + 50% of (Both Python and R)

We see that share of R usage is slowly declining (from about 50% in 2015 to 36% in 2017), while Python share is steadily growing - from 23% in 2014 to 47% in 2017. The share of other platforms is also steadily declining.

Python R Other 2014 17 Trends
Fig. 3: Python vs R vs Other platforms for Analytics, Data Science, and Machine Learning, 2014-17

Finally, we look at trends and patterns by region. The regional participation was:

  • US/Canada, 40%
  • Europe, 35%
  • Asia, 12.5%
  • Latin America, 6.2%
  • Africa/Middle East, 3.6%
  • Australia/NZ, 3.1%

To simplify the chart we split "Both" votes among R and Python, as above, and also combine 4 regions with smaller participation of Asia, AU/NZ, Latin America, and Africa/Middle East into one "Rest" region.

Python R Other Region 2016 2017
Fig. 4: Python* vs R* vs Rest by Region, 2016 vs 2017

We observe the same pattern across all regions:

  • increase in Python share, by 8-10%
  • decline in R share, by about 2-4%
  • decline in other platforms, by 5-7%

The future looks bright for Python users, but we expect that R and other platforms will retain some share in the foreseeable future because of their large embedded base.

Read me...

Languages for Data Science

Very often the question about what programming language is best for data science work. The answer may depend on who you ask, there are many options out there and they all have their advantages and disadvantages. Here are some thoughts from Peter Gleeson on this matter:

While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:

Specificity

When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!

Generality

A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages are going to help.

Productivity

In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.

Performance

In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise statically typed languages are considerably more fail-proof than dynamically typed. The obvious trade-off is against productivity.

To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below fall somewhere on these spectra.

With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and personal experience of myself, friends and colleagues — but it is by no means definitive! In approximately order of popularity, here goes:

R

What you need to know

Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Computing.

License

Free!

Pros

  • Excellent range of high-quality, domain specific and open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
  • The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.
  • Data visualization is a key strength with the use of libraries such as ggplot2.

Cons

  • Performance. There’s no two ways about it, R is not a quick language.
  • Domain specificity. R is fantastic for statistics and data science purposes. But less so for general purpose programming.
  • Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.

Verdict — “brilliant at what it’s designed for”

R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.

Python

What you need to know

Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently 3.6 and 2.7.

License

Free!

Pros

  • Python is a very popular, mainstream general purpose programming language. It has an extensive range of purpose-built modules and community support. Many online services provide a Python API.
  • Python is an easy language to learn. The low barrier to entry makes it an ideal first language for those new to programming.
  • Packages such as pandas, scikit-learn and Tensorflow make Python a solid option for advanced machine learning applications.

Cons

  • Type safety: Python is a dynamically typed language, which means you must show due care. Type errors (such as passing a String as an argument to a method which expects an Integer) are to be expected from time-to-time.
  • For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. For general purpose languages, there are faster and safer alternatives to Python.

Verdict — “excellent all-rounder”

Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s Tensorflow make Python a very exciting language to work in for machine learning.

SQL

What you need to know

SQL (‘Structured Query Language’) defines, manages and queries relational databases. The language appeared by 1974 and has since undergone many implementations, but the core principles remain the same.

License

Varies — some implementations are free, others proprietary

Pros

  • Very efficient at querying, updating and manipulating relational databases.
  • Declarative syntax makes SQL an often very readable language . There’s no ambiguity about what SELECT name FROM users WHERE age > 18 is supposed to do!
  • SQL is very used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.

Cons

  • SQL’s analytical capabilities are rather limited — beyond aggregating and summing, counting and averaging data, your options are limited.
  • For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.
  • There are many different implementations of SQL such as PostgreSQL, SQLite, MariaDB . They are all different enough to make inter-operability something of a headache.

Verdict — “timeless and efficient”

SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.

Java

What you need to know

Java is an extremely popular, general purpose language which runs on the (JVM) Java Virtual Machine. It’s an abstract computing system that enables seamless portability between platforms. Currently supported by Oracle Corporation.

License

Version 8 — Free! Legacy versions, proprietary.

Pros

  • Ubiquity . Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.
  • Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.
  • Java is a high-performance, general purpose, compiled language . This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.

Cons

  • For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.
  • Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.

Verdict — “a serious contender for data science”

There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and and type safety are real advantages. However, you’ll be without the range of stats-specific packages available to other languages. That said, definitely one to consider — especially if you already know one of R and/or Python.

Scala

What you need to know

Developed by Martin Odersky and released in 2004, Scala is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework Apache Spark is written in Scala.

License

Free!

Pros

  • Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.
  • Multi-paradigmatic: Scala programmers can have the best of both worlds. Both object-oriented and functional programming paradigms available to them.
  • Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.

Cons

  • Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
  • The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.

Verdict — “perfect, for suitably big data”

When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too. Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.

Julia

What you need to know

Released just over 5 years ago, Julia has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by several major organizationsincluding many in the finance industry.

License

Free!

Pros

  • Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.
  • Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.
  • Readability. Many users of the language cite this as a key advantage

Cons

  • Maturity. As a new language, some Julia users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.
  • Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).

Verdict — “one for the future”

The main issue with Julia is one that cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives Python and R. But, if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.

MATLAB

What you need to know

MATLAB is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.

License

Proprietary — pricing varies depending on your use case

Pros

  • Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.
  • Data Visualization. MATLAB has some great inbuilt plotting capabilities.
  • MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.

Cons

  • Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as Octave. This is something you should give real consideration to.
  • MATLAB isn’t an obvious choice for general-purpose programming.

Verdict — “best for mathematically intensive applications”

MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science. The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality; indeed, MATLAB was specifically designed for this.

Other Languages

There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!

C++

C++ is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.

As one Quora user puts it:

“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”

The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low-level.

Verdict — “not for day-to-day work, but if performance is critical…”

Read me...


Probably more likely than probable - Reblog

This is a reblog from here Probably more likely than probable // Revolutions

What kind of probability are people talking about when they say something is "highly likely" or has "almost no chance"? The chart below, created by Reddit user zonination, visualizes the responses of 46 other Reddit users to "What probability would you assign to the phase: <phrase>" for various statements of probability. Each set of responses has been converted to a kernel destiny estimate and presented as a joyplot using R.

Probable

Somewhat surprisingly, the results from the Redditors hew quite closely to a similar study of 23 NATO intelligence officers in 2007. In that study, the officers — who were accustomed to reading intelligence reports with assertions of likelihood — were giving a similar task with the same descriptions of probability. The results, here presented as a dotplot, are quite similar.

CIA

For details on the analysis of the Redditors, including the data and R code behind the joyplot chart, check out the Github repository linked below.

Github (zonination): Perceptions of Probability and Numbers

Read me...

Another "Data Science and Analytics with Python" Delivered

Another "Data Science and Analytics with Python" Delivered. Thanks for sharing the picture Dave Groves.

Read me...

Data Science and Analytics with Python already being suggested!

"Data Science and Analytics with Python" was published yesterday and now it is already appearing as a suggested book for related titles.

You can find it with the link above or in Amazon here.

 

 

Read me...

"Data Science and Analytics with Python" is published

Very pleased to see that finally the publication of my "Data Science and Analytics with Python" book has arrived.

Read me...

Final version of "Data Science and Analytics with Python" approved

It has been a long road, one filled with unicorns and Jackalopes, decision trees and random forests, variance and bias, cats and dogs, and targets and features.

Well over a year ago, the idea of writing another book seemed like a farfetched proposition. Writing the book came about from the work that I have been doing in the area as well as from discussions with my colleagues and students, including also practitioners and beneficiaries of data science and analytics.

It is my sincere hope that the book is useful to those coming afresh to this new field as well as to those more seasoned data scientists.

This afternoon I had the pleasure of approving the final version of the book that will be sent to the printers in the next few days.

Once the book is available you can get a copy directly with CRC Press or from Amazon.

Enjoy!

-j

Read me...

One Reply to “Data Science”

%d bloggers like this: