Random thoughts about random subjects… From science to literature and between manga and watercolours, passing by data science and rugby; including film, physics and fiction, programming, pictures and puns.
I came across the image above in the Slack channel of the University of Hertfordshire Centre for Astrophysics Research. It summarises some of the fundamental knowledge in computer science that was assumed necessary at some point in time: binary, CPU execution and algorithms.
It refers to 7 algorithms, although these are really classes of algorithms rather than individual algorithms:
String Matching and Parsing
Working with dates and times in programming can be a painful test at times. In Python, there are some excellent libraries that help with all the pain, and recently I became aware of Pendulum. It is effectively a replacement for the standard datetime class and it has a number of improvements. Check out the documentation for further information.
Installation of the package is straightforward with pip:
$ pip install pendulum
For example, some simple manipulations involving time zones:
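import pendulum
now = pendulum.now('Europe/Paris')
# Changing timezone (the target zone here is illustrative)
now.in_timezone('UTC')
# Default support for common datetime formats
now.to_iso8601_string()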
Duration can be used as a replacement for the standard timedelta class:
It also supports the definition of a period, i.e. a duration that is aware of the DateTime instances that created it. For example:
import pendulum
dt1 = pendulum.now()
dt2 = dt1.add(days=3)
# A period is the difference between 2 instances
period = dt2 - dt1
# A period is iterable
for dt in period:
    print(dt)
Give it a go, and let me know what you think of it.
Python retains its top spot in the fifth annual IEEE Spectrum top programming language rankings, and also gains a designation as an “embedded language”. Data science language R remains the only domain-specific slot in the top 10 (where it is listed as an “enterprise language”) and drops one place compared to its 2017 ranking to take the #7 spot.
Looking at other data-oriented languages, Matlab is at #11 (up 3 places), SQL is at #24 (down 1), Julia at #32 (down 1) and SAS at #40 (down 3). Click the screenshot below for an interactive version of the chart where you can also explore the top 50 rankings.
The IEEE Spectrum rankings are based on search, social media, and job listing trends, GitHub repositories, and mentions in journal articles. You can find details on the ranking methodology here, and discussion of the trends behind the 2018 rankings at the link below.
It’s Ada Lovelace Day, celebrating the work of women in mathematics, science, technology and engineering. To join the celebration, Plus Magazine revisits a collection of interviews with female mathematicians produced earlier this year. The interviews accompany the Women of Mathematics photo exhibition, which celebrates female mathematicians from institutions throughout Europe. It was launched in Berlin in the summer of 2016 and is now touring European institutions.
To watch the interviews with the women or read the transcripts, and to see the portraits that featured in the exhibition, click on the links below. For more content by or about female mathematicians click here.
Very often the question arises of which programming language is best for data science work. The answer may depend on who you ask; there are many options out there and they all have their advantages and disadvantages. Here are some thoughts from Peter Gleeson on this matter:
While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:
When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!
A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages is going to help.
In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.
In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise, statically typed languages are considerably more fail-proof than dynamically typed ones. The obvious trade-off is against productivity.
To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below falls somewhere on these spectra.
With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and the personal experience of myself, friends and colleagues, but it is by no means definitive! In approximate order of popularity, here goes:
What you need to know
Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Computing.
Excellent range of high-quality, domain-specific and open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.
Data visualization is a key strength with the use of libraries such as ggplot2.
Domain specificity. R is fantastic for statistics and data science purposes. But less so for general purpose programming.
Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.
Verdict — “brilliant at what it’s designed for”
R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.
What you need to know
Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently 3.6 and 2.7.
Type safety: Python is a dynamically typed language, which means you must take due care. Type errors (such as passing a str as an argument to a function which expects an int) are to be expected from time to time.
For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. For general purpose languages, there are faster and safer alternatives to Python.
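To make the type-safety caveat concrete, here is a minimal sketch. The function names `double` and `add_one` are made up for illustration. The point is that type hints document intent but are not enforced when the code runs, so a wrong type either fails late or, worse, quietly does something else.

```python
def double(n: int) -> int:
    """Return twice n; the annotation documents intent but is not enforced."""
    return n * 2


def add_one(n: int) -> int:
    """Return n + 1; raises TypeError at call time if n is a str."""
    return n + 1


# A str slips through silently: "21" * 2 is string repetition, not arithmetic.
surprising = double("21")

# Here the mistake does surface, but only when this line actually runs.
try:
    add_one("21")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(repr(surprising))
print(error_message)
```

A static checker such as mypy would flag both calls before the program runs, which is one way to recover some of the safety of a statically typed language.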
Verdict — “excellent all-rounder”
Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s TensorFlow make Python a very exciting language to work in for machine learning.
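As a rough sketch of what a tiny ETL step can look like using only the standard library, the example below parses some CSV text, cleans it, and loads it into an in-memory SQLite table. The data, the `users` table and the function names `extract`, `transform` and `load` are all invented for illustration; a real pipeline would read from files or an API.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,age
 Alice ,34
bob,
Carol,29
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert ages to int, drop rows with no age."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip().title(), int(age)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, age FROM users ORDER BY name").fetchall()
print(loaded)
```

Keeping the three stages as separate functions is a design choice rather than a requirement; it simply makes each stage easy to test on its own.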
What you need to know
SQL (‘Structured Query Language’) defines, manages and queries relational databases. The language appeared in 1974 and has since undergone many implementations, but the core principles remain the same.
Varies — some implementations are free, others proprietary
Very efficient at querying, updating and manipulating relational databases.
Declarative syntax often makes SQL a very readable language. There’s no ambiguity about what
SELECT name FROM users WHERE age > 18
is supposed to do!
SQL is widely used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.
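As a rough illustration of the readability and integration points above, the query from the text can be run from Python. This sketch uses the standard library's sqlite3 module rather than SQLAlchemy, purely to stay dependency-free. The `users` table and its rows are invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical `users` table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ana", 17), ("Ben", 25), ("Cleo", 42)],
)

# The query from the text: it states what is wanted, not how to find it.
rows = conn.execute("SELECT name FROM users WHERE age > 18").fetchall()
adult_names = sorted(name for (name,) in rows)
print(adult_names)
```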
SQL’s analytical capabilities are rather limited: beyond aggregating, summing, counting and averaging data, there is little on offer.
For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.
There are many different implementations of SQL such as PostgreSQL, SQLite and MariaDB. They are all different enough to make inter-operability something of a headache.
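For a sense of where SQL's built-in analytics stop, the sketch below runs the kind of aggregates mentioned above (counting, summing and averaging) against SQLite through Python's sqlite3 module. The `sales` table and its figures are hypothetical.

```python
import sqlite3

# Hypothetical sales figures, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0)],
)

# Count, sum and average per region: the bread and butter of SQL analytics.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)
```

Anything much beyond this kind of summary, such as fitting a model, usually means pulling the data into a language like R or Python.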
Verdict — “timeless and efficient”
SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.
What you need to know
Java is an extremely popular, general purpose language which runs on the Java Virtual Machine (JVM), an abstract computing system that enables seamless portability between platforms. It is currently supported by Oracle Corporation.
Version 8 — Free! Legacy versions, proprietary.
Ubiquity. Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.
Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.
Java is a high-performance, general purpose, compiled language. This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.
For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.
Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.
Verdict — “a serious contender for data science”
There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and type safety are real advantages. However, you’ll be without the range of stats-specific packages available to other languages. That said, it is definitely one to consider, especially if you already know R and/or Python.
What you need to know
Developed by Martin Odersky and released in 2004, Scala is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework Apache Spark is written in Scala.
Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.
Multi-paradigmatic: Scala programmers can have the best of both worlds. Both object-oriented and functional programming paradigms are available to them.
Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.
Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.
Verdict — “perfect, for suitably big data”
When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too. Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.
What you need to know
Released just over 5 years ago, Julia has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by several major organizations, including many in the finance industry.
Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.
Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.
Readability. Many users of the language cite this as a key advantage.
Maturity. As a new language, some Julia users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.
Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).
Verdict — “one for the future”
The main issue with Julia is one that it cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives, Python and R. But if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.
What you need to know
MATLAB is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.
Proprietary — pricing varies depending on your use case
Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.
Data Visualization. MATLAB has some great inbuilt plotting capabilities.
MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.
Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as Octave. This is something you should give real consideration to.
MATLAB isn’t an obvious choice for general-purpose programming.
Verdict: “best for mathematically intensive applications”
MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science. The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality; indeed, MATLAB was specifically designed for this.
There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!
C++ is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.
“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”
The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low-level.
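The trade-off in the quote can be worked through with a little arithmetic. The sketch below uses the figures from the quote (30 minutes to write and 10 seconds to run, versus 10 minutes to write and 1 minute to run); the function names are made up for illustration, and it assumes writing time is paid once while running time is paid on every run.

```python
def total_seconds(write_minutes, run_seconds, runs):
    """Total time spent: writing the program once plus running it `runs` times."""
    return write_minutes * 60 + run_seconds * runs


def break_even_runs(slow_write, fast_run, quick_write, slow_run):
    """Number of runs at which both approaches cost the same total time."""
    extra_writing = (slow_write - quick_write) * 60  # seconds spent up front
    saved_per_run = slow_run - fast_run              # seconds saved on each run
    return extra_writing / saved_per_run


# Figures from the quote: 30 min to write / 10 s to run vs 10 min / 60 s.
one_off_optimised = total_seconds(30, 10, runs=1)
one_off_quick = total_seconds(10, 60, runs=1)
runs_needed = break_even_runs(30, 10, 10, 60)
print(one_off_optimised, one_off_quick, runs_needed)
```

For a single run the quick program wins comfortably (660 seconds against 1810), and under these assumptions the optimised version only pays for itself after 24 runs.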
Verdict — “not for day-to-day work, but if performance is critical…”
As data scientists, it’s easy to get bogged down in the details. We’re busy implementing Python and R code to extract valuable insights from data, train effective machine learning models, or put a distributed computation system together. Many of these tasks, especially those relating to data ingestion or wrangling, are time-consuming but are the bread and butter of the data scientist’s daily grind. What we often forget, however, is that we must not only be data engineers, but also contributors to the data science corpus of knowledge.
If a data product derives its value from data and generates more data in return, then a data scientist derives their value from previously published works and should generate more publications in return. Indeed, one of the reasons that Machine Learning has grown ubiquitous (see the many Python-tagged questions related to ML on Stack Overflow) is thanks to meticulous blog posts and tools from scientific research (e.g. Scikit-Learn) that enable the rapid implementation of a variety of algorithms. Google in particular has driven the growth of data products by publishing systems papers about their methodologies, enabling the creation of open source tools like Hadoop and Word2Vec.
By building on a firm base for both software and for modeling, we are able to achieve greater results, faster. Exploration, discussion, criticism, and experimentation all enable us to have new ideas, write better code, and implement better systems by tapping into the collective genius of a data community. Publishing is vitally important to keeping this data science gravy train on the tracks for the foreseeable future.
In academia, the phrase “publish or perish” describes the pressure to establish legitimacy through publications. Clearly, we don’t want to take our role as authors that far, but the question remains, “How can we effectively build publishing into our workflow?” The answer is through markup languages – simple, streamlined markup that we can add to plain text documents that build into a publishing layout or format. For example, the following markup languages/platforms build into the accompanying publishable formats:
The great thing about markup languages is that they can be managed inline with your code workflow in the same software versioning repository. Github goes even further as to automatically render Markdown files! In this post, we’ll get you started with several markup and publication styles so that you can find what best fits into your workflow and deployment methodology.
Markdown is the most ubiquitous of the markup languages we’ll describe in this post, and its simplicity means that it is often chosen for a variety of domains and applications, not just publishing. Markdown, originally created by John Gruber, is a text-to-HTML processor, where lightweight syntactic elements are used instead of the more heavyweight HTML tags. Markdown is intended for folks writing for the web, not designing for the web, and in some CMS systems, it is simply the way that you write, no fancy text editor required.
Markdown has seen special growth thanks to Github, which has an extended version of Markdown, usually referred to as “Github-Flavored Markdown.” This style of Markdown extends the basics of the original Markdown to include tables, syntax highlighting, and other inline formatting elements. If you create a Markdown file in Github, it is automatically rendered when viewing files on the web, and if you include a README.md in a directory, that file is rendered below the directory contents when browsing code. Github Issues are also expected to be in Markdown, further extended with tools like checkbox lists.
Markdown is used for so many applications it is difficult to name them all. Below are a select few that might prove useful to your publishing tasks.
Jekyll allows you to create static websites that are built from posts and pages written in Markdown.
Github Pages allows you to quickly publish Jekyll-generated static sites from a Github repository for free.
Silvrback is a lightweight blogging platform that allows you to write in Markdown (this blog is hosted on Silvrback).
Day One is a simple journaling app that allows you to write journal entries in Markdown.
Stack Overflow expects questions, answers, and comments to be written in Markdown.
MkDocs is a software documentation tool that builds sites from files written in Markdown, which can be hosted on ReadTheDocs.org.
GitBook is a toolchain for publishing books written in Markdown to the web or as an eBook.
There is also a wide variety of editors, browser plugins, viewers, and tools available for Markdown. Both Sublime Text and Atom support Markdown and automatic preview, as do most text editors you’ll use for coding. Mou is a desktop Markdown editor for Mac OSX and iA Writer is a distraction-free writing tool for Markdown for iOS. (Please comment your favorite tools for Windows and Android). For Chrome, extensions like Markdown Here make it easy to compose emails in Gmail via Markdown, and Markdown Preview lets you view Markdown documents directly in the browser.
Clearly, Markdown enjoys a broad ecosystem and diverse usage. If you’re still writing HTML for anything other than templates, you’re definitely doing it wrong at this point! It’s also worth including Markdown rendering for your own projects if you have user submitted text (also great for text-processing).
Rendering Markdown can be accomplished with the Python Markdown library, usually combined with the Bleach library for sanitizing bad HTML and linkifying raw text. A simple demo of this is as follows:
First install markdown and bleach using pip:
$ pip install markdown bleach
Then create a markdown parsing function as follows:
import bleach
from markdown import markdown

def htmlize(text):
    """
    This helper method renders Markdown then uses Bleach to sanitize it as
    well as converting all links in text to actual anchor tags.
    """
    text = bleach.clean(text, strip=True)  # Clean the text by stripping bad HTML tags
    text = markdown(text)                  # Convert the markdown to HTML
    text = bleach.linkify(text)            # Add links from the text and add nofollow to existing links
    return text
Given a markdown file test.md whose contents are as follows:
# My Markdown Document
For more information, search on [Google](http://www.google.com).
The following code:
>>> with open('test.md', 'r') as f:
...     print(htmlize(f.read()))
Will produce the following HTML output:
<h1>My Markdown Document</h1>
<p>For more information, search on <a href="http://www.google.com" rel="nofollow">Google</a>.</p>
Hopefully this brief example has also served as a demonstration of how Markdown and other markup languages work to render much simpler text with lightweight markup constructs into a larger publishing framework. Markdown itself is most often used for web publishing, so if you need to write HTML, then this is the choice for you!
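To give a feel for what a renderer does under the hood, here is a deliberately tiny, standard-library-only sketch that translates two constructs (a `# ` heading and an inline `[label](url)` link) into HTML. It is a toy for illustration, not how the Python Markdown library is implemented, and the name `toy_markdown` is made up.

```python
import html
import re

# Matches an inline link of the form [label](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def toy_markdown(text):
    """Convert '# ' headings and inline links; any other line becomes a paragraph."""
    out = []
    for line in text.splitlines():
        # Escape raw HTML first so only our own tags reach the output.
        line = html.escape(line.strip(), quote=False)
        if not line:
            continue
        line = LINK.sub(r'<a href="\2">\1</a>', line)
        if line.startswith("# "):
            out.append("<h1>" + line[2:] + "</h1>")
        else:
            out.append("<p>" + line + "</p>")
    return "\n".join(out)


sample = (
    "# My Markdown Document\n"
    "For more information, search on [Google](http://www.google.com)."
)
rendered = toy_markdown(sample)
print(rendered)
```

Real Markdown has many more rules (nesting, emphasis, lists, code), which is exactly why a maintained library is the better choice for anything beyond a demonstration.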
iPython Notebook is a web-based, interactive environment that combines Python code execution, text (marked up with Markdown), mathematics, graphs, and media into a single document. The motivation for iPython Notebook was purely scientific: How do you demonstrate or present your results in a repeatable fashion where others can understand the work you’ve done? By creating an interactive environment where code, graphics, mathematical formulas, and rich text are unified and executable, iPython Notebook gives a presentation layer to otherwise unreadable or inscrutable code. Although Markdown is a big part of iPython Notebook, it deserves a special mention because of how critical it is to the data science community.
iPython Notebook is interesting because it combines both the presentation layer as well as the markup layer. When run as a server, usually locally, the notebook is editable, explorable (a tree view will present multiple notebook files), and executable – any code written in Python in the notebook can be evaluated and run using an interactive kernel in the background. Math formulas written in LaTeX are rendered using MathJax. To enhance the delivery and shareability of these notebooks, the NBViewer allows you to share static notebooks from a Github repository.
iPython Notebook comes with most scientific distributions of Python like Anaconda or Canopy, but it is also easy to install iPython with pip:
$ pip install ipython
iPython itself is an enhanced interactive Python shell or REPL that extends the basic Python REPL with many advanced features, primarily allowing for a decoupled two-process model that enables the notebook. This process model essentially runs Python as a background kernel that receives execution instructions from clients and returns responses back to them.
To start an iPython notebook execute the following command:
$ ipython notebook
This will start a local server at
and automatically open your default browser to it. You’ll start in the “dashboard view”, which shows all of the notebooks available in the current working directory. Here you can create new notebooks and start to edit them. Notebooks are saved as .ipynb files in the local directory, a format called “Jupyter” that is simple JSON with a specific structure for representing each cell in the notebook. The Jupyter notebook files are easily versioned via Git and Github since they are also plain text.
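Because a notebook file is just JSON, it can be inspected with nothing more than the standard library. The sketch below builds a minimal notebook document by hand, assuming the version 4 notebook layout (a top-level `cells` list in which each cell has a `cell_type` and a `source`), and round-trips it through text. The cell contents are invented for illustration.

```python
import json

# A minimal notebook document, assuming the version 4 notebook layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Results\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not yet executed
            "outputs": [],
            "source": ["print(1 + 1)\n"],
        },
    ],
}

# Round-trip through text, as happens when a notebook is saved and reopened.
text = json.dumps(notebook, indent=1)
reloaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in reloaded["cells"]]
print(cell_types)
```

The same few lines of `json.loads` are enough to pull the code cells out of an existing .ipynb file, which is handy when reviewing notebook changes in version control.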
reStructuredText is an easy-to-read plaintext markup syntax specifically designed for use in Python docstrings or to generate Python documentation. In fact, the reStructuredText parser is a component of Docutils, an open-source text processing system that is used by Sphinx to generate intelligent and beautiful software documentation, in particular the native Python documentation.
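As an illustration of reStructuredText inside a docstring, here is a small sketch using the field-list style (`:param:`, `:returns:` and so on) that Sphinx understands. The function `mean` is just a made-up example.

```python
def mean(values):
    """Return the arithmetic mean of *values*.

    :param values: a non-empty sequence of numbers
    :type values: list[float]
    :returns: the sum of the values divided by their count
    :rtype: float
    :raises ValueError: if *values* is empty

    Inline markup works too: ``mean([1, 2, 3])`` gives ``2.0``.
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


average = mean([1, 2, 3])
print(average)
print(mean.__doc__.splitlines()[0])
```

The docstring stays readable as plain text in the source, which is much of the appeal; the markup only comes into play when a documentation tool renders it.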
Python software has a long history of good documentation, particularly because of the idea that batteries should come included. And documentation is a very strong battery! PyPI, the Python Package Index, ensures that third party packages provide documentation, and that the documentation can be easily hosted online through Python Hosted. Because of the ease of use and ubiquity of the tools, Python programmers are known for having very consistently documented code; sometimes it’s hard to tell the standard library from third party modules!
In How to Develop Quality Python Code, I mentioned that you should use Sphinx to generate documentation for your apps and libraries in a docs directory at the top-level. Generating reStructuredText documentation in a docs directory is fairly easy:
$ mkdir docs
$ cd docs
$ sphinx-quickstart
The quickstart utility will ask you many questions to configure your documentation. Aside from the project name, author, and version (which you have to type in yourself), the defaults are fine. However, I do like to change a few things:
> todo: write "todo" entries that can be shown or hidden on build (y/n) [n]: y
> coverage: checks for documentation coverage (y/n) [n]: y
> mathjax: include math, rendered in the browser by MathJax (y/n) [n]: y
Similar to iPython Notebook, reStructuredText can render mathematical formulas written in LaTeX syntax. This utility will create a Makefile for you; to generate HTML documentation, simply run the following command in the docs directory:
$ make html
The output will be built in the folder _build/html where you can open the index.html in your browser.
While hosting documentation on Python Hosted is a good choice, a better choice might be Read the Docs, a website that allows you to create, host, and browse documentation. One great part of Read the Docs is the stylesheet that they use; it’s more readable than older ones. Additionally, Read the Docs allows you to connect a Github repository so that whenever you push new code (and new documentation), it is automatically built and updated on the website. Read the Docs can even maintain different versions of documentation for different releases.
Note that even if you aren’t interested in the overhead of learning reStructuredText, you should use your newly found Markdown skills to ensure that you have good documentation hosted on Read the Docs. See MkDocs for document generation in Markdown that Read the Docs will render.
When writing longer publications, you’ll need a more expressive tool that is just as lightweight as Markdown but able to handle constructs that go beyond simple HTML, for example cross-references, chapter compilation, or multi-document build chains. Longer publications should also move beyond the web and be renderable as an eBook (ePub or Mobi formats) or for print layout, e.g. PDF. These requirements add more overhead, but simplify workflows for larger media publication.
Writing for O’Reilly, I discovered that I really enjoyed working in AsciiDoc – a lightweight markup syntax, very similar to Markdown, which renders to HTML or DocBook. DocBook is very important, because it can be post-processed into other presentation formats such as HTML, PDF, EPUB, DVI, MOBI, and more, making AsciiDoc an effective tool not only for web publishing but also print and book publishing. Most text editors have an AsciiDoc grammar for syntax highlighting, in particular sublime-asciidoc and Atom AsciiDoc Preview, which make writing AsciiDoc as easy as Markdown.
AsciiDoctor is an AsciiDoc-specific toolchain for building books and websites from AsciiDoc. The project connects the various AsciiDoc tools and allows a simple command-line interface as well as preview tools. AsciiDoctor is primarily used for HTML and eBook formats, but at the time of this writing there is a PDF renderer, which is in beta. Another interesting project of O’Reilly’s is Atlas, a system for push-button publishing that manages AsciiDoc using a Git repository and wraps editorial build processes, comments, and automatic editing in a web platform. I’d be remiss not to mention GitBook which provides a similar toolchain for publishing larger books, though with Markdown.
If you’ve done any graduate work in a STEM field then you are probably already familiar with using LaTeX to write and publish articles, reports, conference and journal papers, and books. LaTeX is not a simple markup language, to say the least, but it is effective. It is able to handle almost any publishing scenario you can throw at it, including (and in particular) rendering complex mathematical formulas correctly from a text markup language. Most data scientists still use LaTeX, using MathJax or the Daum Equation Editor, if only for the math.
If you’re going to be writing PDFs or reports, I can provide two primary tips for working with LaTeX. First consider cloud-based editing with Overleaf or ShareLaTeX, which allows you to collaborate and edit LaTeX documents similarly to Google Docs. Both of these systems have many of the classes and stylesheets already so that you don’t have to worry too much about the formatting, and instead just get down to writing. Additionally, they aggregate other tools like LaTeX templates and provide templates of their own for most document types.
My personal favorite workflow, however, is to use the Atom editor with the LaTeX package and the LaTeX grammar. When using Atom, you get very nice Git and Github integration – perfect for collaboration on larger documents. If you have a TeX distribution installed (and you will need to do that on your local system, no matter what), then you can automatically build your documents within Atom and view them in PDF preview.
Software developers agree that testing and documentation are vital to the successful creation and deployment of applications. However, although Agile workflows are designed to ensure that documentation and testing are included in the software development lifecycle, too often testing and documentation are left to last, or forgotten. When managing a development project, team leads need to ensure that documentation and testing are part of the “definition of done.”
In the same way, writing is vital to the successful creation and deployment of data products, and is similarly left to last or forgotten. Through publication of our work and ideas, we open ourselves up to criticism, an effective methodology for testing ideas and discovering new ones. Similarly, by explicitly sharing our methods, we make it easier for others to build systems rapidly, and in return, write tutorials that help us better build our systems. And if we translate scientific papers into practical guides, we help to push science along as well.
Don’t get bogged down in the details of writing, however. Use simple, lightweight markup languages to include documentation alongside your projects. Collaborate with other authors and your team using version control systems, and use free tools to make your work widely available. All of this is possible because of lightweight markup languages, and the more proficient you are at including writing in your workflow, the easier it will be to share your ideas.
This post is particularly link-heavy with many references to tools and languages. For reference, here are my preferred guides for each of the markup languages discussed: