Python overtakes R – Reblog

Did you use R, Python (along with their packages), both, or other tools for Analytics, Data Science, Machine Learning work in 2016 and 2017?

Python did not quite “swallow” R, but the results, based on 954 voters, show that in 2017 the Python ecosystem overtook R as the leading platform for Analytics, Data Science, Machine Learning.

While in 2016 Python was in 2nd place (“Mainly Python” had 34% share vs 42% for “Mainly R”), in 2017 Python had 41% vs 36% for R.

The share of KDnuggets readers who used both R and Python in significant ways also increased from 8.5% to 12% in 2017, while the share who mainly used other tools dropped from 16% to 11%.

Fig. 1: Share of Python, R, Both, or Other platforms usage for Analytics, Data Science, Machine Learning, 2016 vs 2017

Next, we examine the transitions between the different platforms.

Fig. 2: Analytics, Data Science, Machine Learning Platforms
Transitions between R, Python, Both, and Other from 2016 to 2017

This chart looks complicated, but we see two key aspects, and Python wins on both:

  • Loyalty: Python users are more loyal, with 91% of 2016 Python users staying with Python. Only 74% of R users stayed, and only 60% of users of other platforms did.
  • Switching: Only 5% of Python users moved to R, while twice as many – 10% of R users moved to Python. Among those who used both in 2016, only 49% kept using both, 38% moved to Python, and 11% moved to R.

Next, we look at trends across multiple years.

In our 2015 Poll on R vs Python we did not offer an option for “Both Python and R”, so to compare trends across 4 years, we replace the shares of Python and R in 2016 and 2017 by:

  • Python* = (Python share) + 50% of (Both Python and R)
  • R* = (R share) + 50% of (Both Python and R)
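As a quick sketch of this adjustment (the function name is ours; the input shares are the 2017 figures quoted above):

```python
# Split the "Both Python and R" share evenly between Python and R,
# making the 2016/2017 results comparable with the 2015 poll,
# which had no "Both" option.
def adjusted_shares(python, r, both):
    """Return (Python*, R*) with the 'Both' share split 50/50."""
    return python + both / 2, r + both / 2

# 2017 shares in percent, as reported above: Python 41, R 36, Both 12
py_star, r_star = adjusted_shares(python=41, r=36, both=12)
print(py_star, r_star)  # 47.0 42.0
```

Applied to the 2017 numbers, this reproduces the 47% Python* figure used in the trend chart.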

We see that the share of R usage is slowly declining (from about 50% in 2015 to 36% in 2017), while the Python share is steadily growing – from 23% in 2014 to 47% in 2017. The share of other platforms is also steadily declining.

Fig. 3: Python vs R vs Other platforms for Analytics, Data Science, and Machine Learning, 2014-17

Finally, we look at trends and patterns by region. The regional participation was:

  • US/Canada, 40%
  • Europe, 35%
  • Asia, 12.5%
  • Latin America, 6.2%
  • Africa/Middle East, 3.6%
  • Australia/NZ, 3.1%

To simplify the chart we split the “Both” votes between R and Python, as above, and also combine the 4 regions with smaller participation (Asia, AU/NZ, Latin America, and Africa/Middle East) into one “Rest” region.

Fig. 4: Python* vs R* vs Rest by Region, 2016 vs 2017

We observe the same pattern across all regions:

  • increase in Python share, by 8-10%
  • decline in R share, by about 2-4%
  • decline in other platforms, by 5-7%

The future looks bright for Python users, but we expect that R and other platforms will retain some share in the foreseeable future because of their large embedded base.

The percentage of excellence in books – Reblog

I recently asked the guys at the Data Science class at GA to bring in a good example of a “daft graph”, and they all did so with gusto. As usual, there was a certain cable news channel that was mentioned many times for its misleading use of graphics.

However, there was one example that an attendee sent from a blog by Matt Reed, and it was so great that I could not help re-posting it here. The original can be seen on this site.


The percentage of excellence in books…

1919 – Extracts from an Investigation into the Physical Properties of Books as They Are At Present Published. The Society of Calligraphers, Boston.
This is a small pamphlet that was designed and authored by the graphic designer W.A. Dwiggins and his cousin L.B. Siegfried. It pilloried the format of books, reflecting Dwiggins’s concern about the poor methods used for printing trade books in the US at that time.
The pamphlet was published by the imaginary Society of Calligraphers, and the stinging investigation was a hoax cooked up by Dwiggins – nevertheless it did have an effect on publishing in the US following its wide distribution.
The graph by Dwiggins shows the reduction in book quality since 1910.

How to choose between learning Python or R first – Reblog

This post is a reblog of a post by Cheng Han Lee; the original can be seen at Udacity.

How to Choose Between Learning Python or R First
by Cheng Han Lee

January 12, 2015

If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.

When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.

Luckily, you can’t really go wrong with either.

The Case for R

R has a long and trusted history and a robust supporting community in the data industry. Together, those facts mean that you can rely on online support from others in the field if you need assistance or have questions about using the language. Plus, there are plenty of publicly released packages, more than 5,000 in fact, that you can download to use in tandem with R to extend its capabilities to new heights. That makes R great for conducting complex exploratory data analysis. R also integrates well with other computer languages like C++, Java, and C.

When you need to do heavy statistical analysis or graphing, R’s your go-to. Common mathematical operations like matrix multiplication work straight out of the box, and the language’s array-oriented syntax makes it easier to translate from math to code, especially for someone with no or minimal programming background.

The Case for Python

Python is a general-purpose programming language that can pretty much do anything you need it to: data munging, data engineering, data wrangling, website scraping, web app building, and more. It’s simpler to master than R if you have previously learned an object-oriented programming language like Java or C++.

In addition, because Python is an object-oriented programming language, it’s easier to write large-scale, maintainable, and robust code with it than with R. Using Python, the prototype code that you write on your own computer can be used as production code if needed.

Although Python doesn’t have as comprehensive a set of packages and libraries available to data professionals as R, the combination of Python with tools like pandas, NumPy, SciPy, scikit-learn, and Seaborn will get you pretty darn close. The language is also slowly becoming more useful for tasks like machine learning, and basic to intermediate statistical work (formerly just R’s domain).

Choosing Between Python and R

Here are a few guidelines for determining whether to begin your data language studies with Python or with R.

Personal preference

Choose the language to begin with based on your personal preference: whichever comes more naturally to you, whichever is easier to grasp from the get-go. To give you a sense of what to expect, mathematicians and statisticians tend to prefer R, whereas computer scientists and software engineers tend to favor Python. The best news is that once you learn to program well in one language, it’s pretty easy to pick up others.

Project selection

You can also make the Python vs. R call based on a project you know you’ll be working on in your data studies. If you’re working with data that’s been gathered and cleaned for you, and your main focus is the analysis of that data, go with R. If you have to work with dirty or jumbled data, or to scrape data from websites, files, or other data sources, you should start learning, or advancing your studies in, Python.


Once you have the basics of data analysis under your belt, another criterion for evaluating which language to further your skills in is what language your teammates are using. If you’re all literally speaking the same language, it’ll make collaboration—as well as learning from each other—much easier.

Job market

Jobs calling for skill in Python and jobs calling for skill in R have both increased at similar rates over the last few years.

R v Python



That said, as you can see, Python has started to overtake R in data jobs. Thanks to the expansion of the Python ecosystem, tools for nearly every aspect of computing are readily available in the language. In addition, since Python can be used to develop web applications, it enables companies to take advantage of crossover between their Python developers and their data science teams. That’s a major boon given the shortage of data experts in the current marketplace.

The Bottom Line

In general, you can’t go wrong whether you choose to learn Python first or R first for data analysis. Each language has its pros and cons for different scenarios and tasks. In addition, there are actually libraries for using Python with R, and vice versa, so learning one won’t preclude you from being able to learn and use the other. Perhaps the best solution is to use the above guidelines to decide which of the two languages to begin with, then fortify your skill set by learning the other one.

Is your brain warmed up enough yet? Get to it!

15 Unique Illnesses You Can Only Come Down With in German – Reblog

This is a reblog of a post by Arika Okrent, published January 15, 2015.

The German language is so perfectly suited for these syndromes, coming down with them in any other language just won’t do.


At some point in the last couple of decades, parents in Germany started coming down with Kevinismus – a strange propensity to give their kids wholly un-German, American-sounding names like Justin, Mandy, Dennis, Cindy, and Kevin. Kids with these names tend to be less successful and have more behavior problems in school. Studies of the Kevinismus phenomenon attribute these effects to a combination of teachers’ prejudices toward the names, and the lower social status of parents who choose names like Kevin.


Föhn is the name for a specific wind that cools the air as it rises up one side of a mountain, and then warms it as it compresses coming down the other side. These winds are believed to cause headaches and other feelings of illness. Many a 19th century German lady took to her fainting couch with a cold compress, suffering from Föhnkrankheit.


Kreislaufzusammenbruch, or “circulatory collapse,” sounds deathly serious, but it’s used quite commonly in Germany to mean something like “feeling woozy” or “I don’t think I can come into work today.”


Hörsturz refers to a sudden loss of hearing, which in Germany is apparently frequently caused by stress. Strangely, while every German knows at least 5 people who have had a bout of Hörsturz, it is practically unheard of anywhere else.


Frühjahrsmüdigkeit or “early year tiredness” can be translated as “spring fatigue.” Is it from the change in the weather? Changing sunlight patterns? Hormone imbalance? Allergies? As afflictions go, Frühjahrsmüdigkeit is much less fun than our “spring fever,” which is instead associated with increased vim, vigor, pep, and randiness.


Fernweh is the opposite of homesickness. It is the longing for travel, or getting out there beyond the horizon, what you might call… awaysickness.


Putzen means “to clean” and Fimmel is a mania or obsession. Putzfimmel is an obsession with cleaning. It is not unheard of outside of Germany, but elsewhere it is less culturally embedded and less fun to say.


Werthersfieber is an old-fashioned type of miserable lovesickness, named “Werther’s fever” for the hero of Goethe’s The Sorrows of Young Werther. Poor young Werther suffers for the love of a peasant girl who is already married. Death is his only way out. A generation of sensitive young men made Werthersfieber quite fashionable in the late 18th century.


Ostalgie is nostalgia for the old way of life in East Germany (“ost” means East). If you miss your old Trabant and those weekly visits from the secret police, you may have Ostalgie.


Zeitkrankheit is “time sickness” or “illness of the times.” It’s a general term for whatever the damaging mindset or preoccupations of a certain era are.


Weltschmerz or “world pain,” is a sadness brought on by a realization that the world cannot be the way you wish it would be. It’s more emotional than pessimism, and more painful than ennui.


Ichschmerz is like Weltschmerz, but it is dissatisfaction with the self rather than the world. Which is probably what Weltschmerz really boils down to most of the time.


Lebensmüdigkeit translates as despair or world-weariness, but it also more literally means “life tiredness.” When someone does something stupidly dangerous, you might sarcastically ask, “What are you doing? Are you lebensmüde?!”


Zivilisationskrankheit, or “civilization sickness” is a problem caused by living in the modern world. Stress, obesity, eating disorders, carpal tunnel syndrome and diseases like type 2 diabetes are all examples.


Torschlusspanik or “gate closing panic” is the anxiety-inducing awareness that as time goes on, life’s opportunities just keep getting fewer and fewer and there’s no way to know which ones you should be taking before they close forever. It’s a Zivilisationskrankheit that may result in Weltschmerz, Ichschmerz, or Lebensmüdigkeit.

How to start thinking like a Data Scientist – by T Redman

This is a reblog of a post by Thomas Redman in the Harvard Business Review:

Slowly but steadily, data are forcing their way into every nook and cranny of every industry, company, and job. Managers who aren’t data savvy, who can’t conduct basic analyses, interpret more complex ones, and interact with data scientists are already at a disadvantage. Companies without a large and growing cadre of data-savvy managers are similarly disadvantaged.

Fortunately, you don’t have to be a data scientist or a Bayesian statistician to tease useful insights from data. This post explores an exercise I’ve used for 20 years to help those with an open mind (and a pencil, paper, and calculator) get started. One post won’t make you data savvy, but it will help you become data literate, open your eyes to the millions of small data opportunities, and enable you to work a bit more effectively with data scientists, analytics, and all things quantitative.

While the exercise is very much a how-to, each step also illustrates an important concept in analytics — from understanding variation to visualization.

First, start with something that interests, even bothers, you at work, like consistently late-starting meetings. Whatever it is, form it up as a question and write it down: “Meetings always seem to start late. Is that really true?”

Next, think through the data that can help answer your question, and develop a plan for creating them. Write down all the relevant definitions and your protocol for collecting the data. For this particular example, you have to define when the meeting actually begins. Is it the time someone says, “Ok, let’s begin.”? Or the time the real business of the meeting starts? Does kibitzing count?

Now collect the data. It is critical that you trust the data. And, as you go, you’re almost certain to find gaps in data collection. You may find that even though a meeting has started, it starts anew when a more senior person joins in. Modify your definition and protocol as you go along.

Sooner than you think, you’ll be ready to start drawing some pictures. Good pictures make it easier for you to both understand the data and communicate main points to others. There are plenty of good tools to help, but I like to draw my first picture by hand. My go-to plot is a time-series plot, where the horizontal axis has the date and time and the vertical axis has the variable of interest. Thus, a point on the graph below is the date and time of a meeting versus the number of minutes it started late.


Now return to the question that you started with and develop summary statistics. Have you discovered an answer? In this case, “Over a two-week period, 10% of the meetings I attended started on time. And on average, they started 12 minutes late.”
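As a minimal sketch of this step, with invented minutes-late values standing in for two weeks of meetings (the numbers below are illustrative, not Redman's data):

```python
# Invented minutes-late values for ten meetings over two weeks
minutes_late = [0, 12, 15, 8, 0, 20, 11, 0, 9, 14]

# Share of meetings that started exactly on time
on_time_share = sum(1 for m in minutes_late if m == 0) / len(minutes_late)
# Average lateness across all meetings
avg_late = sum(minutes_late) / len(minutes_late)

print(f"{on_time_share:.0%} started on time; average lateness {avg_late:.1f} min")
# 30% started on time; average lateness 8.9 min
```

The point is not the code but the habit: reduce the raw observations to a couple of summary numbers that answer the question you wrote down.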

But don’t stop there. Answer the “so what?” question. In this case, “If those two weeks are typical, I waste an hour a day. And that costs the company $X/year.”

Many analyses end because there is no “so what?” Certainly if 80% of meetings start within a few minutes of their scheduled start times, the answer to the original question is, “No, meetings start pretty much on time,” and there is no need to go further.

But this case demands more, as some analyses do. Get a feel for variation. Understanding variation leads to a better feel for the overall problem, deeper insights, and novel ideas for improvement. Note on the picture that 8-20 minutes late is typical. A few meetings start right on time, others nearly a full 30 minutes late. It might be better if one could judge, “I can get to meetings 10 minutes late, just in time for them to start,” but the variation is too great.

Now ask, “What else do the data reveal?” It strikes me that five meetings began exactly on time, while every other meeting began at least seven minutes late. In this case, bringing meeting notes to bear reveals that all five meetings were called by the Vice President of Finance. Evidently, she starts all her meetings on time.
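Grouping lateness by whoever called the meeting makes this kind of pattern jump out. A sketch with invented records (the names and values are hypothetical):

```python
from collections import defaultdict

# (organizer, minutes late) for each meeting -- hypothetical records
meetings = [
    ("VP Finance", 0), ("VP Finance", 0), ("Eng lead", 12),
    ("Eng lead", 9), ("Product", 15), ("VP Finance", 0),
]

# Collect lateness values per organizer
late_by_organizer = defaultdict(list)
for organizer, late in meetings:
    late_by_organizer[organizer].append(late)

# Average lateness per organizer
for organizer, lates in sorted(late_by_organizer.items()):
    print(f"{organizer}: {sum(lates) / len(lates):.1f} min late on average")
```

A grouped average like this is often the quickest way to turn "what else do the data reveal?" into a concrete follow-up question.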

So where do you go from here? Are there important next steps? This example illustrates a common dichotomy. On a personal level, results pass both the “interesting” and “important” test. Most of us would give anything to get back an hour a day. And you may not be able to make all meetings start on time, but if the VP can, you can certainly start the meetings you control promptly.

On the company level, results so far only pass the interesting test. You don’t know whether your results are typical, nor whether others can be as hard-nosed as the VP when it comes to starting meetings. But a deeper look is surely in order: Are your results consistent with others’ experiences in the company? Are some days worse than others? Which starts later: conference calls or face-to-face meetings? Is there a relationship between meeting start time and most senior attendee? Return to step one, pose the next group of questions, and repeat the process. Keep the focus narrow — two or three questions at most.

I hope you’ll have fun with this exercise. Many find a primal joy in data. Hooked once, hooked for life. But whether you experience that primal joy or not, do not take this exercise lightly. There are fewer and fewer places for the “data illiterate” and, in my humble opinion, no more excuses.

ICM2014 ― opening ceremony

// Gowers’s Weblog

I’d forgotten just how full the first day of an ICM is. First, you need to turn up early for the opening ceremony, so you end up sitting around for an hour and a half or so before it even starts. Then there’s the ceremony itself, which lasts a couple of hours. Then in the afternoon you have talks about the four Fields Medallists and the Nevanlinna Prize winner, with virtually no breaks. Then after a massive ten-minute break, the Nevanlinna Prize winner talks about his (in this case) own work, about which you have just heard, but in a bit more detail. That took us to 5:45pm. And just to round things off, Jim Simons is giving a public lecture at 8pm, which I suppose I could skip but I think I’m not going to. (The result is that most of this post will be written after it, but right at this very moment it is not yet 8pm.)

I didn’t manage to maintain my ignorance of the fourth Fields medallist, because I was sitting only a few rows behind the medallists, and when Martin Hairer turned up wearing a suit, there was no longer any room for doubt. However, there was a small element of surprise in the way that the medals were announced. Ingrid Daubechies (president of the IMU) told us that they had made short videos about each medallist, and also about the Nevanlinna Prize winner, who was Subhash Khot. So for each winner in turn, she told us that a video was about to start. An animation of a Fields medal then rotated on the large screens at the front of the hall, and when it settled down one could see the name of the next winner. The beginning of each video was drowned out by the resulting applause (and also a cheer for Bhargava and an even louder one for Mirzakhani), but they were pretty good. At the end of each video, the winner went up on stage, to more applause, and sat down. Then when the five videos were over, the medals were presented, to each winner in turn, by the president of Korea.

Here they are, getting their medals/prize. It wasn’t easy to get good photos with a cheap camera on maximum zoom, but they give some idea.

After those prizes were announced, we had the announcements of the Gauss prize and the Chern medal. The former is for mathematical work that has had a strong impact outside mathematics, and the latter is for lifetime achievement. The Gauss medal went to Stanley Osher and the Chern medal to Phillip Griffiths.

If you haven’t already seen it, the IMU page about the winners has links to very good short (but not too short) summaries of their work. I’m quite glad about that because I think it means I can get away with writing less about them myself. I also recommend this Google Plus post by John Baez about the work of Mirzakhani.

I have one remark to make about the Fields medals, which is that I think that this time round there were an unusually large number of people who could easily have got medals, including other women. (This last point is important — one should think of Mirzakhani’s medal as the new normal rather than as some freak event.) I have two words to say about them: Mikhail Gromov. To spell it out, he is an extreme, but by no means unique, example of a mathematician who did not get a Fields medal but whose reputation would be pretty much unaltered if he had. In the end it’s the theorems that count, and there have been some wonderful theorems proved by people who just missed out this year.

Other aspects of the ceremony were much as one would expect, but there was rather less time devoted to long and repetitive speeches about the host country than I have been used to at other ICMs, which was welcome.

That is not to say that interesting facts about the host country were entirely ignored. The final speech of the ceremony was given by Martin Groetschel, who told us several interesting things, one of which was the number of mathematics papers published in international journals by Koreans in 1981. He asked us to guess, so I’m giving you the opportunity to guess before reading on.

Now Korea is 11th in the world for the number of mathematical publications. Of course, one can question what this really means, but it certainly means something when you hear that the answer to the question above is 3. So in just one generation a serious mathematical tradition has been created from almost nothing.

He also told us the names of the people on various committees. Here they are, except that I couldn’t quite copy all of them down fast enough.

The Fields Medal committee consisted of Daubechies, Ambrosio, Eisenbud, Fukaya, Ghys, Dick Gross, Kirwan, Kollar, Kontsevich, Struwe, Zeitouni and Günter Ziegler.

The program committee consisted of Carlos Kenig (chair), Bolthausen, Alice Chang, de Melo, Esnault, me, Kannan, Jong Hae Keum, Le Bris, Lubotsky, Nesetril and Okounkov.

The ICM executive committee (if that’s the right phrase) for the next four years will be Shigefumi Mori (president), Helge Holden (secretary), Alicia Dickenstein (VP), Vaughan Jones (VP), Dick Gross, Hyungju Park, Christiane Rousseau, Vasudevan Srinivas, John Toland and Wendelin Werner.

He also told us about various initiatives of the IMU, one of which sounded interesting (by which I don’t mean that the others didn’t). It’s called the adopt-a-graduate-student initiative. The idea is that the IMU will support researchers in developed countries who want to provide some kind of mentorship for graduate students in less developed countries working in a similar area who might otherwise not find it easy to receive appropriate guidance. Or something like that.

Ingrid Daubechies also told us about two other initiatives connected with the developing world. One was that the winner of the Chern Medal gets to nominate a good cause to receive a large amount of money. Stupidly I seem not to have written it down, but it may have been $250,000. Anyhow, that order of magnitude. Phillip Griffiths chose the African Mathematics Millennium Science Initiative, or AMMSI. The other was that the five winners of the Breakthrough Prizes in mathematics, Donaldson, Kontsevich, Lurie, Tao and Taylor, have each given $100,000 towards a $500,000 fund for helping graduate students from the developing world. I don’t know exactly what form the help will take, but the phrase “breakout graduate fellowships” was involved.

When I get time, I’ll try to write something about the Laudationes, but right now I need to sleep. I have to confess that during Jim Simons’s talk, my jet lag caught up with me in a major way and I simply couldn’t keep awake. So I don’t really have much to say about it, except that there was an amusing Q&A session where several people asked long rambling “questions” that left Jim Simons himself amusingly nonplussed. His repeated requests for short pithy questions were ignored.

Just before I finish, I’ve remembered an amusing thing that happened during the early part of the ceremony, when some traditional dancing was taking place (or at least I assume it was traditional). At one point some men in masks appeared, who looked like this.

Masked dancers


Just while we’re at it, here are some more dancers.

Dancers of various kinds


Anyhow, when the men in masks came on stage, there were screams of terror from Mirzakhani’s daughter, who looked about two and a half, and delightful, and she (the daughter) took a long time to be calmed down. I think my six-year-old son might have felt the same way — he had to leave a pantomime version of Hansel and Gretel, to which he had been taken as a birthday treat when he was five, almost the instant it started, and still has those tendencies.

Choosing the Right Visualisation Tool for your Task – Reblog

This is a reblog from the DataMarket blog.

We’re frequently asked: What is the best tool to visualize data?

There is obviously no single answer to that question. It depends on the task at hand, and what you want to achieve.

Here’s an attempt to categorize these tasks and point to some of the tools we’ve found to be useful to complete them:

The right tool for the task

Simple one-off charts

The most common tool for simple charting is clearly Excel. It is possible to make near-perfect charts of most chart types using Excel – if you know what you’re doing. Many Excel defaults are sub-optimal, and some of the chart types it offers are simply for show and have no practical application. 3D cone-shaped “bars”, anyone? And Excel makes no attempt at guiding a novice user to the best chart for what she wants to achieve. Here are three alternatives we’ve found useful:

  • Tableau is fast becoming the number one tool for many data visualization professionals. It’s client software (Windows only) that’s available for $999 and gives you a user-friendly way to create well crafted visualizations on top of data that can be imported from all of the most common data file formats. Common charting in Tableau is straight-forward, while some of the more advanced functionality may be less so. Then again, Tableau enables you to create pretty elaborate interactive data applications that can be published online and work on all common browser types, including tablets and mobile handsets. For the non-programmer that sees data visualization as an important part of his job, Tableau is probably the tool for you.
  • DataGraph is a little-known tool that deserves a lot more attention. A very different beast, DataGraph is a Mac-only application ($90 on the AppStore) originally designed to create proper charts for scientific publications, but has become a powerful tool to create a wide variety of charts for any occasion. Nothing we’ve tested comes close to DataGraph when creating crystal-clear, beautiful charts that are also done “right” as far as most of the information visualization literature is concerned. The workflow and interface may take a while to get to grips with, and some of the more advanced functionality may lie hidden even from an avid user for months of usage, but a wide range of samples, aggressive development and an active user community make DataGraph a really interesting solution for professional charting. If you are looking for a tool to create beautiful, yet easy to understand, static charts, DataGraph may be your tool of choice. And if your medium is print, DataGraph outshines any other application on the market.
    • The best way to see samples of DataGraph’s capabilities is to download the free trial and browse the samples/templates on the application’s startup screen.
  • R is an open-source programming environment for statistical computing and graphics. A super-powerful tool, R takes some programming skills to even get started, but is becoming a standard tool for any self-respecting “data scientist”. An interpreted, command-line-controlled environment, R does a lot more than graphics, as it enables all sorts of crunching and statistical computing, even with enormous data sets. In fact we’d say that graphics are a bit of a weak spot of R: there is nothing to complain about from the information visualization standpoint, but most of the charts that R creates would not be considered refined, and therefore need polishing in other software such as Adobe Illustrator to be ready for publication. Not to be missed if working with R is the ggplot2 package, which helps overcome some of the thornier aspects of making R’s charts and graphs look proper. If you can program, and need a powerful tool to do graphical analysis, R is your tool, but be prepared to spend significant time making your output look good enough for publication, either in R or by exporting the graphics to another piece of software for touch-up.
    • The R Graphical Manual holds an enormous collection of browsable samples of graphics created using R – and the code and data used to make a lot of them.

Videos and custom high-resolution graphics

If you are creating data visualization videos or high-resolution data graphics, Processing is your tool. Processing is an open source integrated development environment (IDE) that uses a simplified version of Java as its programming language and is especially geared towards developing visual applications.

Processing is great for rapid development of custom data visualization applications that can either be run directly from the IDE, compiled into stand-alone applications or published as Java Applets for publishing on the web.

Java Applets are less than optimal for web publication (ok, they simply suck for a variety of reasons), but a complementary open-source project – Processing.js – has ported Processing to JavaScript using the canvas element for rendering the visuals (canvas is a way to render and control bitmap rendering in modern web browsers using JavaScript). This is a far superior way to take Processing work online, and strongly recommended over the Applets.

The area where we have found that Processing really shines as a data visualization tool is in creating videos. It comes with a video class called MovieMaker that allows you to compose videos programmatically, frame by frame. Each frame may well require some serious crunching and take a long time to calculate before it is appended to a growing video file. The results can be quite stunning; many of the best known data visualization videos have been made using this method.

Many other great examples showing the power of Processing – and for a lot more than just videos – can be found in Processing’s Exhibition archives.

As can be seen from these examples, Processing is obviously also great for rendering static, high-resolution bitmap visualizations.
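The frame-by-frame approach described above is simple enough to sketch outside of Processing itself. The following plain-JavaScript illustration is only conceptual – the `makeFrames` helper is made up for this sketch, and in Processing you would draw each interpolated state and append it to a MovieMaker instance instead:

```javascript
// Sketch of frame-by-frame composition: compute each frame's state in turn.
// In Processing you would render this state and append it to the video file;
// here each "frame" is just the interpolated value you would draw.
function makeFrames(startValue, endValue, frameCount) {
  const frames = [];
  for (let i = 0; i < frameCount; i++) {
    const t = i / (frameCount - 1); // animation progress, 0..1
    frames.push(startValue + t * (endValue - startValue));
  }
  return frames;
}

const frames = makeFrames(0, 100, 5);
console.log(frames); // [ 0, 25, 50, 75, 100 ]
```

Each step can involve arbitrarily heavy computation before its frame is emitted, which is exactly why pre-rendered video is such a good fit for expensive visualizations.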

So if data driven videos, or high-resolution graphics are your thing, and you’re not afraid of programming, we recommend Processing.

Charts for the Web

There are plenty – dozens, if not hundreds – of programming libraries that allow you to add charts to your web sites. Frankly, most of them are sh*t. Some of the more flashy ones use Flash or even Silverlight for their graphics, and there are strong reasons for not depending on browser plugins for delivering your graphics.

We believe we have tested most of the libraries out there, and there are only two we feel comfortable recommending; each has its pros and cons depending on what you are looking for:

  • Highcharts is a JavaScript charting library that renders vector-based, interactive charts in SVG (or VML for older versions of Internet Explorer). It is free for non-commercial use, and commercial licenses start at $80. It is a flexible and well-designed library that includes all the most common chart types, with plenty of customization and interactivity options. Interestingly enough, even though Highcharts is a commercial solution, the source code is available to developers who want to make their own modifications or additions. With plenty of examples, good documentation and active user forums, Highcharts is a great choice for most development projects that need charting.
  • gRaphaël is another JavaScript charting library, built on top of Raphaël (see below). Like Highcharts, gRaphaël renders SVG graphics on modern browsers, falling back to VML for IE <9. While it holds a lot of promise, gRaphaël is not a very mature library: it has limited capabilities, few chart types, even fewer examples and pretty much non-existent documentation. It is, however, available under proper open source licenses and could serve as a base for great things for those who want to extend these humble beginnings.
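For a sense of what working with Highcharts looks like, here is an illustrative options object – the container id, series name and data values are invented for this sketch, but chart, title, xAxis, yAxis and series are Highcharts’ documented top-level options. In a page you would load the library and hand the object to `new Highcharts.Chart(...)`:

```javascript
// Illustrative Highcharts configuration (ids and data are made up).
const options = {
  chart: { renderTo: 'container', type: 'line' }, // render into <div id="container">
  title: { text: 'Monthly visitors' },
  xAxis: { categories: ['Jan', 'Feb', 'Mar'] },
  yAxis: { title: { text: 'Visitors' } },
  series: [{ name: 'example.com', data: [2900, 7100, 6300] }]
};
// In the browser, with the library loaded: new Highcharts.Chart(options);
```

Most day-to-day charting work with a library like this is exactly such declarative configuration, which is a large part of its appeal.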

Other libraries and solutions that may be worth checking out are the popular commercial solution amCharts, Google’s hosted Chart Tools and jQuery library Flot.

Special Requirements and Custom Visualizations

If you want full control of the look, feel and interactivity of your charts, or if you want to create a custom data visualization for the web from scratch, the out-of-the box libraries mentioned above will not suffice.

In fact – you’ll be surprised how soon you run into limitations that force you to compromise on your design. Voice a seemingly simple preference such as “I don’t want drop shadows on the lines in my line chart” or “I want to control what happens when a user clicks the X-axis”, and you may already be stretching your chosen library past its limits. But consider yourself warned: the compromises may well be worth it. You may not have the time and resources to dive deeper, let alone write yet-another-charting-tool™.

However, if you are not one to compromise on your standards, or if you want to take it up a notch and follow the lead of some of the wonderful and engaging data journalism happening at the likes of the NY Times and The Guardian, you’re looking for something that a charting library is simply not designed to do.

The tool for you will probably be one of the following:

  • Raphaël, gRaphaël’s (see above) big brother. Raphaël is a powerful JavaScript library to work with vector graphics. It renders SVG graphics for modern browsers and falls back to VML for Internet Explorer 6, 7 and 8. It comes with a range of good looking samples and decent documentation. Raphaël is open source, and any developer should be able to hit the ground running with it to develop nice looking things quite fast. We don’t recommend Raphaël for the advanced charting part, but for entirely custom data visualizations or small data apps it may very well be the right tool for the task.
  • Protovis is an open source JavaScript visualization toolkit. Rather than simply controlling at a low level the lines and areas to be drawn, Protovis lets the developer specify how data should be encoded in marks – such as bars, dots and lines – that represent it. This approach allows inheritance and scales that let a developer construct custom chart types and layouts that can easily take in new data without the need to write any additional code. Protovis natively uses SVG to render graphics, but a couple of efforts have been made to enable VML rendering, making Protovis an option for older versions of Internet Explorer that still account for a significant proportion of traffic on the web. Protovis was originally written by Mike Bostock (now data scientist at Square) and Jeffrey Heer of the Stanford Visualization Group. Their architectural approach is ingenious, but it also takes a bit of effort to wrap your head around, so be prepared for somewhat of a learning curve. Luckily there are plenty of complete and well-written examples and decent documentation. Once you get going, you will be amazed at the flexibility and power the Protovis approach provides.
  • D3.js, or “D3” for short, is in many ways the successor of Protovis. In fact, Protovis is no longer under active development by the original team, as its primary developer – Mike Bostock – is now working on D3 instead. D3 builds on many of the concepts of Protovis. The main difference is that instead of having an intermediate representation that separates the rendering of the SVG (or HTML) from the programming interface, D3 binds the data directly to the DOM representation. If you don’t understand what that means – don’t worry, you don’t have to. But it has a couple of consequences that may or may not make D3 more attractive for your needs. The first is that it – almost without exception – makes rendering faster, and thereby makes animations and smooth transitions from one state to another more feasible. The second is that it will only work on browsers that support SVG, so you will be leaving Internet Explorer 7 and 8 users behind – and due to the deep DOM integration, enabling VML rendering for D3 is a far bigger task than for Protovis, and one that nobody has embarked on yet.
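To make the “marks and scales” idea concrete, here is a toy sketch in plain JavaScript. It uses neither Protovis nor D3, and `linearScale` is a made-up helper, but it shows the pattern both libraries are built around: scales map data values to visual values, and each datum becomes a mark specification a renderer could draw:

```javascript
// A minimal linear scale: maps a domain interval onto a range interval.
function linearScale([d0, d1], [r0, r1]) {
  return v => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
}

const data = [4, 8, 15, 16, 23];
const x = linearScale([0, data.length], [0, 250]); // index -> pixel position
const y = linearScale([0, 25], [0, 100]);          // value -> bar height

// Each datum becomes a "bar" mark: a rectangle specification that a
// renderer could turn into an SVG <rect>.
const bars = data.map((d, i) => ({
  x: x(i),
  width: x(1) - x(0) - 2, // leave a 2px gap between bars
  height: y(d)
}));

console.log(bars);
```

Because the marks are derived from the data rather than drawn by hand, feeding in a new data array yields a new set of bars with no further code – the property the Protovis and D3 designs are built to exploit.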

After thorough research of the available options, we chose Protovis as the base for building out DataMarket’s visualization capabilities with an eye on D3 as our future solution when modern browsers finally saturate the market. We see that horizon about 2 years from now.


Elliptical Answers to Why Winter Mornings Are So Long – Reblogged

Reblogged from Elliptical Answers to Why Winter Mornings Are So Long – by John O’Neil.

As the parent of teenage boys who have to be dragged out of bed on school days, I had been looking forward to earlier sunrises once the winter solstice was past. But early January mornings seemed darker than ever while at the same time, the sky was clearly lighter around 5 p.m.

Tony Cenicola/The New York Times

FIGURE 8 An analemma shows the Sun’s varying positions over a year.

It turned out that what I suspected was actually true — by Jan. 2, there were 12 more minutes of sunlight in the afternoons, but 3 fewer minutes in the morning. It also turned out that the reasons for it were complicated, as I discovered in a series of phone and e-mail conversations with Jay M. Pasachoff, a professor of astronomy at Williams College, and a former student of his, Joseph Gangestad, who received his Ph.D. in orbital mechanics from Purdue.

They pointed me to the Equation of Time, a grandly named formula relating to the fact that not all days are 24 hours long if you track noon by the position of the Sun instead of by a clock.

We’ve all seen a readout of the Equation of Time, Dr. Pasachoff said: it’s that uneven figure 8 found on globes in a deserted part of the Pacific, a shape known as an analemma.

If Earth’s axis were perpendicular to its orbit instead of tilted, and if its orbit were a circle instead of an ellipse, the Sun would appear in the same spot in the sky each day and clocks and sundials would always match. Instead, they can be as much as 16 minutes apart, and that’s where things get complicated.
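The discrepancy itself can be approximated with a well-known textbook formula for the Equation of Time; the coefficients below come from that standard approximation, not from the article:

```javascript
// Approximate Equation of Time (sundial time minus clock time, in minutes)
// for a given day of the year, using the common 9.87 / 7.53 / 1.5 coefficients.
function equationOfTime(dayOfYear) {
  const B = (2 * Math.PI / 365) * (dayOfYear - 81); // "day angle" in radians
  return 9.87 * Math.sin(2 * B) - 7.53 * Math.cos(B) - 1.5 * Math.sin(B);
}

// Early November: the sundial runs about 16 minutes ahead of the clock.
console.log(equationOfTime(307).toFixed(1)); // "16.4"
// Mid-February: the sundial runs roughly 14-15 minutes behind.
console.log(equationOfTime(42).toFixed(1));  // "-14.6"
```

The two extremes are the “as much as 16 minutes apart” the article mentions.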

As Earth moves toward winter solstice, you have “different things going on at the same time,” Dr. Pasachoff said.

Earth’s tilt means that every day during the fall, the angle at which we view the Sun changes. It appears farther south and travels a shorter arc across the sky, affecting sunrise and sunset equally, and making the day shorter.

The changes in the solar time follow a different cycle. In the early 1600s, Kepler discovered that planets move faster at the part of their orbit that is closest to the sun, the perihelion. For Earth, perihelion comes a little after the winter solstice, so from November on, Earth is accelerating.

That increased speed means Earth has to turn a little further each day before the Sun returns to its peak, which pushes solar noon later against clock time. That shift is amplified because the Sun is traveling a little south each day, while clocks only count its east-to-west traverse.

Add it all together and you get sunrise and sunset times that are not symmetrical. In the weeks before the winter solstice, sunrise is being pushed later by both the changing angle of the Sun and the slowing of solar time. But sunset is being pushed in both directions — earlier by the Sun’s angle and later by the change in solar time.

The result is more darkness in the morning and less in the afternoon. That’s why the earliest sunset of 2012 in New York, at 4:29 p.m., fell as early as Nov. 30, according to the National Oceanic and Atmospheric Administration’s solar calculator, while mornings continued to stay dark later. After the solstice, Earth continued its acceleration until reaching perihelion on Jan. 2. So the sunrise continued to slide, reaching its latest point, 7:20 a.m., on Dec. 28. There it stood until Jan. 11, when we finally got another minute of morning light. By Feb. 7, sunrise will be all the way back to 7 a.m.
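The asymmetry is easy to reproduce numerically. The sketch below combines textbook approximations of solar declination and the Equation of Time for roughly New York’s latitude; the exact minutes differ slightly from NOAA’s calculator, which the article used, but the earliest sunset still lands before the Dec. 21 solstice (around day 355 of the year) and the latest sunrise after it:

```javascript
const LAT = 40.7 * Math.PI / 180; // approximate latitude of New York

// Equation of Time: sundial minus clock, in minutes (standard approximation).
function eot(day) {
  const B = (2 * Math.PI / 365) * (day - 81);
  return 9.87 * Math.sin(2 * B) - 7.53 * Math.cos(B) - 1.5 * Math.sin(B);
}

// Approximate sunrise and sunset in local mean time, in hours.
function sunTimes(day) {
  const B = (2 * Math.PI / 365) * (day - 81);
  const decl = (23.44 * Math.PI / 180) * Math.sin(B);   // solar declination
  const H = Math.acos(-Math.tan(LAT) * Math.tan(decl)); // half-day hour angle
  const halfDay = (H * 12) / Math.PI;                   // half the day length, hours
  const solarNoon = 12 - eot(day) / 60;                 // clock time of solar noon
  return { sunrise: solarNoon - halfDay, sunset: solarNoon + halfDay };
}

// Scan Nov 1 (day 305) through Jan 31 (day 365 + 31); the trigonometry is
// periodic, so days past 365 simply wrap into the next year.
let earliestSunset = 305;
let latestSunrise = 305;
for (let d = 306; d <= 396; d++) {
  if (sunTimes(d).sunset < sunTimes(earliestSunset).sunset) earliestSunset = d;
  if (sunTimes(d).sunrise > sunTimes(latestSunrise).sunrise) latestSunrise = d;
}
console.log(earliestSunset, latestSunrise); // earliest sunset falls before day 355,
                                            // latest sunrise after it
```

The shortest day and the latest solar noon peak weeks apart, so their sum (sunset) bottoms out before the solstice while their difference (sunrise) peaks after it.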

“It’s hard to wrap the mind around this problem, which is really a figment of our timekeeping system,” Dr. Gangestad said. That is, we would never notice it if we all just used sundials.