Random thoughts about random subjects… From science to literature and between manga and watercolours, passing by data science and rugby; including film, physics and fiction, programming, pictures and puns.
With the lockdown and social distancing rules forcing all of us to adjust our calendars, events and even lesson plans and lectures, I was not surprised to hear of speaking opportunities that otherwise might not have arisen.
A great example is the reprise of a talk I gave about a year ago while visiting Mexico. It was a great opportunity to talk to Social Science students at the Political Science Faculty of the Universidad Autónoma del Estado de México. The subject was open but had to cover the use of technology and I thought that talking about the use of natural language processing in terms of digital humanities would be a winner. And it was…
In March this year I was approached by the Faculty to re-run the talk but this time instead of doing it face to face we would use a teleconference room. Not only was I, the speaker, talking from the comfort of my own living room, but also all the attendees would be at home. Furthermore, some of the students may not have access to the live presentation (lack of broadband, equipment, etc.) and recording the session for later viewing was the best option for them.
I didn’t hesitate in saying yes, and I enjoyed the interaction a lot. Today I learnt that the session was the focus of a small note in a local newspaper. The session was run in Spanish and the note in Portal, the local newspaper, is in Spanish too. I really liked that they picked a line I used in the session to convince the students that technology is not just for the natural sciences:
“Hay que hacer ciencias sociales con técnicas del Siglo XXI… El mundo es de los geeks.”
“We should study social sciences applying techniques of the 21st Century. The world today belongs to us, the geeks.”
The point is that although qualitative and quantitative techniques are widely used in social science, the use of new platforms and even programming languages such as Python opens up opportunities for social scientists too.
Python retains its top spot in the fifth annual IEEE Spectrum top programming language rankings, and also gains a designation as an “embedded language”. Data science language R remains the only domain-specific slot in the top 10 (where it is listed as an “enterprise language”) and drops one place compared to its 2017 ranking to take the #7 spot.
Looking at other data-oriented languages, Matlab is at #11 (up 3 places), SQL is at #24 (down 1), Julia at #32 (down 1) and SAS at #40 (down 3). Click the screenshot below for an interactive version of the chart where you can also explore the top 50 rankings.
The IEEE Spectrum rankings are based on search, social media, and job listing trends, GitHub repositories, and mentions in journal articles. You can find details on the ranking methodology here, and discussion of the trends behind the 2018 rankings at the link below.
Very often the question arises of which programming language is best for data science work. The answer may depend on who you ask: there are many options out there and they all have their advantages and disadvantages. Here are some thoughts from Peter Gleeson on this matter:
While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:
When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!
A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages is going to help.
In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.
In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise statically typed languages are considerably more fail-proof than dynamically typed. The obvious trade-off is against productivity.
To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below falls somewhere on these spectra.
With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and the personal experience of myself, friends and colleagues, but it is by no means definitive! In approximate order of popularity, here goes:
What you need to know
Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Computing.
Excellent range of high-quality, domain specific and open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.
Data visualization is a key strength with the use of libraries such as ggplot2.
Domain specificity. R is fantastic for statistics and data science purposes. But less so for general purpose programming.
Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.
Verdict — “brilliant at what it’s designed for”
R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.
What you need to know
Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently 3.6 and 2.7.
Type safety: Python is a dynamically typed language, which means you must show due care. Type errors (such as passing a String as an argument to a method which expects an Integer) are to be expected from time-to-time.
For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. For general purpose languages, there are faster and safer alternatives to Python.
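Going back to the type safety point above, here is a minimal sketch of how such an error shows up in practice. The function name and the sample values are made up for illustration. Note that the type hints are only documentation: Python does not enforce them when the code runs, so the mistake is only noticed once the addition is attempted.

```python
from typing import List


def average_age(ages: List[int]) -> float:
    """Return the mean of a list of integer ages (hypothetical helper)."""
    return sum(ages) / len(ages)


# Correct call: every element is an int.
print(average_age([18, 25, 40]))

# Mistaken call: one value arrived as a string, for example straight from a CSV file.
# Nothing complains until sum() tries to add an int and a str at run time.
try:
    average_age([18, "25", 40])
except TypeError as exc:
    print(f"TypeError caught: {exc}")
```

A static checker such as mypy can flag the second call before the code is run, which is one way of showing the due care mentioned above.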
Verdict — “excellent all-rounder”
Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s Tensorflow make Python a very exciting language to work in for machine learning.
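To give a flavour of what an ETL step looks like, here is a small sketch using only the Python standard library. Everything in it is an assumption made for illustration: the raw text, the `people` table and the cleaning rule (drop rows with a missing age) are all hypothetical, and a real pipeline would likely rely on a library such as pandas.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with stray spaces and a missing value.
RAW = """name,age
 Ana ,17
Ben,
Carla,34
"""


def extract(text):
    """Extraction: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim whitespace, drop rows without an age, cast age to int."""
    clean = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # assumed cleaning rule: skip incomplete rows
        clean.append((row["name"].strip(), int(age)))
    return clean


def load(records, conn):
    """Loading: write the cleaned records into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
print(conn.execute("SELECT name, age FROM people").fetchall())
```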
What you need to know
SQL (‘Structured Query Language’) defines, manages and queries relational databases. The language first appeared in 1974 and has since undergone many implementations, but the core principles remain the same.
Varies — some implementations are free, others proprietary
Very efficient at querying, updating and manipulating relational databases.
Declarative syntax often makes SQL a very readable language. There’s no ambiguity about what
SELECT name FROM users WHERE age > 18
is supposed to do!
SQL is widely used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.
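As a quick illustration of mixing SQL with another language, the sketch below runs the query quoted above from Python. It uses the built-in sqlite3 module rather than SQLAlchemy, simply to keep the example dependency-free, and the `users` table and its rows are invented for the purpose.

```python
import sqlite3

# In-memory database with a hypothetical `users` table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Ana", 17), ("Ben", 18), ("Carla", 34), ("Dev", 52)],
)

# The query from the text: names of users strictly older than 18.
adults = [name for (name,) in conn.execute("SELECT name FROM users WHERE age > 18")]
print(adults)

# Simple aggregation: count the rows and average the ages.
count, mean_age = conn.execute("SELECT COUNT(*), AVG(age) FROM users").fetchone()
print(count, mean_age)
```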
SQL’s analytical capabilities are rather limited — beyond aggregating and summing, counting and averaging data, your options are limited.
For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.
There are many different implementations of SQL such as PostgreSQL, SQLite and MariaDB. They are all different enough to make inter-operability something of a headache.
Verdict — “timeless and efficient”
SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.
What you need to know
Java is an extremely popular, general purpose language which runs on the Java Virtual Machine (JVM), an abstract computing system that enables seamless portability between platforms. It is currently supported by Oracle Corporation.
Version 8 — Free! Legacy versions, proprietary.
Ubiquity. Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.
Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.
Java is a high-performance, general purpose, compiled language. This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.
For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.
Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.
Verdict — “a serious contender for data science”
There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and type safety are real advantages. However, you’ll be without the range of stats-specific packages available to other languages. That said, definitely one to consider, especially if you already know R and/or Python.
What you need to know
Developed by Martin Odersky and released in 2004, Scala is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework Apache Spark is written in Scala.
Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.
Multi-paradigmatic: Scala programmers can have the best of both worlds. Both object-oriented and functional programming paradigms are available to them.
Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.
Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.
Verdict — “perfect, for suitably big data”
When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too. Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.
What you need to know
Released just over 5 years ago, Julia has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by several major organizations, including many in the finance industry.
Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.
Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.
Readability. Many users of the language cite this as a key advantage
Maturity. As a new language, some Julia users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.
Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).
Verdict — “one for the future”
The main issue with Julia is one that it cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives Python and R. But, if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.
What you need to know
MATLAB is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.
Proprietary — pricing varies depending on your use case
Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.
Data Visualization. MATLAB has some great inbuilt plotting capabilities.
MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.
Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as Octave. This is something you should give real consideration to.
MATLAB isn’t an obvious choice for general-purpose programming.
Verdict — “best for mathematically intensive applications”
MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science. The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality; indeed, MATLAB was specifically designed for this.
There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!
C++ is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.
“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”
The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low-level.
Verdict — “not for day-to-day work, but if performance is critical…”
What kind of probability are people talking about when they say something is “highly likely” or has “almost no chance”? The chart below, created by Reddit user zonination, visualizes the responses of 46 other Reddit users to “What probability would you assign to the phrase: <phrase>” for various statements of probability. Each set of responses has been converted to a kernel density estimate and presented as a joyplot using R.
Somewhat surprisingly, the results from the Redditors hew quite closely to a similar study of 23 NATO intelligence officers in 2007. In that study, the officers, who were accustomed to reading intelligence reports with assertions of likelihood, were given a similar task with the same descriptions of probability. The results, here presented as a dotplot, are quite similar.
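For readers wondering what turning a set of responses into a kernel density estimate involves, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. It is not the code behind the chart (that was done in R), and the responses and the bandwidth below are made-up values chosen purely for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def estimate(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return estimate


# Hypothetical answers (in %) to a single phrase; NOT the actual survey data.
responses = [80, 85, 90, 90, 95, 75, 88]
density = gaussian_kde(responses, bandwidth=5.0)  # bandwidth is an arbitrary choice

# Evaluate the smooth curve on a 0-100 grid, as one would before plotting it.
curve = [(x, density(x)) for x in range(0, 101, 5)]
peak_x = max(curve, key=lambda point: point[1])[0]
print(peak_x)
```

A joyplot is then simply one such curve per phrase, stacked vertically with a slight overlap.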
For details on the analysis of the Redditors, including the data and R code behind the joyplot chart, check out the Github repository linked below.
R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, and not overly encumbered with formatting or generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (and embedded below, or download PDF slides here), in practice few spreadsheets have “a clean little rectangle of data in the upper-left corner”, because most people use spreadsheets not just as a file format for data retrieval, but also as a reporting/visualization/analysis tool.
Nonetheless, for a practicing data scientist, there’s a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.
To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets into R as new “linen” objects from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell, which is useful if you’re trying to figure out why an apparently simple spreadsheet isn’t importing as you expect. (Check out the “weekend getaway winner” story near the end of Jenny’s talk for a great example.)
The jailbreakr package isn’t yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.
Today I had the opportunity of running a #DataScience bootcamp in London. It was an all-day affair and although the attendees were engaged, I’m sure that by the end of the 6th hour they were quite tired.
The discussions ranged from what data science is to the skills required to become a data scientist, and also how to manage data scientists. Finally we implemented some data analyses based on linear regression, all using R. I was very pleased to see some of the results.
Last Thursday I attended a Cloudera Breakfast Briefing where Sean Owen was speaking about Spark and the examples were related to building decision trees and random forests. It was a good session in general.
Sean started his talk with an example using the Iris dataset using R, in particular the “party” library. He then moved on to talk about Spark and MLlib.
For the rest of the talk he used the “Covertype” data set that contains 581,012 data points describing trees using 54 features (elevation, slope, soil type, etc.) predicting forest cover type (spruce, aspen, etc.). A very apt dataset for the construction of random forests, right? I was very pleased to see a new (for me) dataset being used!
Sean went over some bits and pieces about using Spark, highlighting the compactness of the code. He also turned his attention to the tuning of hyper-parameters and its importance.
There are different ways to approach this, but it is always about finding a balance, a trade-off. For a tree we can play with the depth of the tree, the maximum number of bins (i.e. the number of different decision rules to be tried) and the impurity measure (Gini or entropy).
If we don’t know the right values for the hyperparameters, we can try several, particularly if we have enough room on the cluster.
Building a random forest: let various trees see only a subset of the data, then combine. Another approach is to let the trees see a subset of the features. The latter is a nice idea as this may be a more reasonable approach for large clusters, where communication among nodes is kept to a minimum, which makes it a good fit for Spark or Hadoop.
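As a reminder of what the two impurity measures compute, here is a small sketch in plain Python. It is my own illustration and not code from the talk. Both functions assume a non-empty list of class labels for the data points that end up in a node.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


pure_node = ["spruce"] * 4            # a single class: nothing left to split
mixed_node = ["spruce", "aspen"] * 2  # an even two-class mix: maximally impure

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

In both cases a pure node scores zero, and the tree prefers the splits that bring the impurity of the child nodes down the most.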
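The two ideas (a subset of the data and a subset of the features per tree) can be sketched in a few lines of plain Python. This is only an illustration of the sampling and combining steps, not Spark code: the function names are hypothetical, drawing rows with replacement (a bootstrap sample) and using half of the features are assumptions on my part, and the training of the trees themselves is left out.

```python
import random
from collections import Counter


def sample_for_tree(n_rows, n_features, rng, feature_fraction=0.5):
    """Pick the row indices and feature indices that one tree gets to see."""
    # Assumed choice: bootstrap sample, i.e. n_rows indices drawn with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random subset of the features, drawn without replacement.
    k = max(1, int(n_features * feature_fraction))
    features = sorted(rng.sample(range(n_features), k))
    return rows, features


def majority_vote(predictions):
    """Combine the per-tree predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, features = sample_for_tree(n_rows=10, n_features=54, rng=rng)
print(len(rows), len(features))
print(majority_vote(["spruce", "aspen", "spruce"]))
```

Since each tree only needs its own rows and features, the trees can be grown independently of one another and only their votes need to be brought together at the end.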
Sean finished with some suggestions of things one can try:
Try SVM and LogisticRegression in MLlib
Real-time scoring with Spark Streaming
Use random decision forests for regression
Nonetheless, the best bit of it all was that after asking a couple of questions I managed to get my hands on a “Tofu Scientist” T-Shirt! Result!
If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.
When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.
Luckily, you can’t really go wrong with either.
The Case for R
R has a long and trusted history and a robust supporting community in the data industry. Together, those facts mean that you can rely on online support from others in the field if you need assistance or have questions about using the language. Plus, there are plenty of publicly released packages, more than 5,000 in fact, that you can download to use in tandem with R to extend its capabilities to new heights. That makes R great for conducting complex exploratory data analysis. R also integrates well with other computer languages like C++, Java, and C.
When you need to do heavy statistical analysis or graphing, R’s your go-to. Common mathematical operations like matrix multiplication work straight out of the box, and the language’s array-oriented syntax makes it easier to translate from math to code, especially for someone with no or minimal programming background.
The Case for Python
Python is a general-purpose programming language that can pretty much do anything you need it to: data munging, data engineering, data wrangling, website scraping, web app building, and more. It’s simpler to master than R if you have previously learned an object-oriented programming language like Java or C++.
In addition, because Python is an object-oriented programming language, it’s easier to write large-scale, maintainable, and robust code with it than with R. Using Python, the prototype code that you write on your own computer can be used as production code if needed.
Although Python doesn’t have as comprehensive a set of packages and libraries available to data professionals as R, the combination of Python with tools like pandas, NumPy, SciPy, scikit-learn, and seaborn will get you pretty darn close. The language is also slowly becoming more useful for tasks like machine learning, and basic to intermediate statistical work (formerly just R’s domain).
Choosing Between Python and R
Here are a few guidelines for determining whether to begin your data language studies with Python or with R.
Choose the language to begin with based on your personal preference, on which comes more naturally to you, which is easier to grasp from the get-go. To give you a sense of what to expect, mathematicians and statisticians tend to prefer R, whereas computer scientists and software engineers tend to favor Python. The best news is that once you learn to program well in one language, it’s pretty easy to pick up others.
You can also make the Python vs. R call based on a project you know you’ll be working on in your data studies. If you’re working with data that’s been gathered and cleaned for you, and your main focus is the analysis of that data, go with R. If you have to work with dirty or jumbled data, or to scrape data from websites, files, or other data sources, you should start learning, or advancing your studies in, Python.
Once you have the basics of data analysis under your belt, another criterion for evaluating which language to further your skills in is what language your teammates are using. If you’re all literally speaking the same language, it’ll make collaboration—as well as learning from each other—much easier.
Job listings calling for Python skills and those calling for R skills have increased at a similar rate over the last few years.
That said, as you can see, Python has started to overtake R in data jobs. Thanks to the expansion of the Python ecosystem, tools for nearly every aspect of computing are readily available in the language. In addition, since Python can be used to develop web applications, it enables companies to employ crossover between Python developers and data science teams. That’s a major boon given the shortage of data experts in the current marketplace.
The Bottom Line
In general, you can’t err whether you choose to learn Python first or R first for data analysis. Each language has its pros and cons for different scenarios and tasks. In addition, there are actually libraries to use Python with R, and vice versa—so learning one won’t preclude you from being able to learn and use the other. Perhaps the best solution is to use the above guidelines to decide which of the two languages to begin with, then fortify your skill set by learning the other one.