Now Reading: Dark Data

Now Reading: Dark Data by David Hand

I first came across this book in the Summer 2020 issue of Imperial, the magazine for the Imperial College community, in a feature note about it.

It sounded like an interesting read, so I had a look for the Princeton University Press book and, to my surprise, I found a version in Italian published by Rizzoli a few months earlier… I wonder how that worked out. It was cheaper and I was tempted to give it a go in Italian, under the title Il tradimento dei numeri (i.e. “The betrayal of the numbers”…). I wonder what hidden story is behind all this…

In the end I decided to go for the English version… Let’s see how it goes.

David Hand is emeritus professor of mathematics at Imperial College London, a former president of the Royal Statistical Society, and a Fellow of the British Academy.

There is a website dedicated to the book: https://darkdata.website

A new “Mathematician’s Apology” – Reblog

In the two and a half years (or so) since I left academia for industry, I’ve worked with a number of math majors and math PhDs outside of academia and talked to a number of current grad students who were considering going into industry. As a result, my perspective on the role of the math research […]

via A new “Mathematician’s Apology” — Low Dimensional Topology

Artificial Intelligence – Debunking Myths

Exploring around the interwebs, I came across this article by Rupert Goodwins on Ars Technica debunking myths about Artificial Intelligence.

HAL 9000 in the film 2001.

It is a good read and if you have a few minutes to spare, do give it a go.

Rupert addresses the following myths:

  1. AI makes machines that can think.
  2. AI will not be bound by human ethics.
  3. AI will get out of control.
  4. Breakthroughs in AI will all happen in sudden jumps.

It is true that there are a number of efforts to try to replicate (and therefore understand) human thought. Some examples include the Blue Brain project at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. However, this does not imply that we will immediately get a machine such as HAL or C-3PO.

This is because the brain is far more complex than the current efforts are able to simulate. As a matter of fact, even much simpler brains remain a significant challenge to simulate. This does not mean that we should not try to understand and learn how brains work.

Part of the problem is that it is difficult to even define what we mean by “thought” (the so-called hard problem). So a solution to the strong AI problem is not going to arrive any time soon, but we should definitely try.

So, once that myth is out of the way, the idea that a Terminator-like robot is around the corner is put into perspective. Sure, there are attempts at getting some self-driving cars and such but we are not quite there yet. All in all, it is true that a number of technological advances can be used for good or bad causes, and that is surely something that we all should bear in mind.

Should You Ever Use a Pie Chart?

Much has been said for and against the use of pie charts… And the discussion is by no means new. For example, in 1923 the American economist Karl G. Karsten warned us against the pie chart. Karsten’s claims in his book Charts and Graphs are remarkably similar to those heard today:

The disadvantages of the pie chart are many. It is worthless for study and research purposes. In the first place the human eye cannot easily compare as to length the various arcs about the circle, lying as they do in different directions. In the second place, the human eye is not naturally skilled in comparing angles… In the third place, the human eye is not an expert judge of comparative sizes or areas, especially those as irregular as the segments of parts of the circle. There is no way by which the parts of this round unit can be compared so accurately and quickly as the parts of a straight line or bar.

If you are interested in reading more about the history of the controversial use of the pie chart, take a look at this post by Dan Kopf.

Pie Chart
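To make Karsten’s point concrete, here is a small, purely illustrative Python sketch (standard matplotlib calls; the shares are made up for the example). It draws the same data as a pie and as a bar chart, and telling which category is largest is noticeably easier on the bars than on the slices:

```python
import matplotlib.pyplot as plt

# Made-up shares that are deliberately hard to rank by eye on a pie chart
labels = ["A", "B", "C", "D", "E"]
shares = [21, 19, 22, 18, 20]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
ax_pie.pie(shares, labels=labels)   # readers must compare angles and areas
ax_bar.bar(labels, shares)          # readers only compare lengths
ax_pie.set_title("Pie")
ax_bar.set_title("Bar")
plt.show()
```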

You have big data? MIT researchers can help you shrink it!

In a lot of machine learning and data science applications, it is not unusual to use matrices to represent data. It is indeed a very convenient way to keep the information but also to do manipulations, calculations and other useful tricks. As the size of the data increases, of course the size of the matrices grows too and that can be a bit problematic. Finding a way to reduce the size of these matrices while keeping the information is a challenge that a lot of us have faced. Using techniques that exploit the sparsity of the matrices, or even reducing the dimensionality via principal components is common practice.

Reading the latest World Economic Forum Newsletter, I came to find out about a new algorithm that MIT researchers will present at the ACM Symposium on Theory of Computing in June. The algorithm is said to find the smallest possible approximation of an original matrix while guaranteeing reliable computations. Indeed, to determine how well the “reduced” matrix approximates the original one, you need to measure the “distance” between them, and a common distance to use is the typical Euclidean measure that we are all familiar with… What? You say you aren’t?… Remember Pythagoras?

The square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.

Pythagoras

There you go… all you have to do is extend it to n dimensions et voilà… That is not the only way to measure distance, though. Think for example of the way in which you move in a grid city such as Manhattan, New York City… You cannot move diagonally (except on Broadway, I suppose…) so you need to go north-south or east-west. That distance is actually called the “Manhattan distance“.

Mathematicians refer to “norms” when talking about distance measurement and indeed both the Euclidean and Manhattan distances mentioned above are norms:

  • Manhattan distance is a 1-norm measure: the absolute differences are raised to the power of 1 and then summed.
  • Euclidean distance is a 2-norm measure: the differences are raised to the power of 2 and then summed (with a square root taken at the end).
  • etc…
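To make the two measures concrete, here is a tiny Python/NumPy illustration; the points and the small p_norm helper are made up purely for this example:

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

manhattan = np.abs(p - q).sum()            # 1-norm: sum of absolute differences
euclidean = np.sqrt(((p - q) ** 2).sum())  # 2-norm: Pythagoras extended to n dimensions

def p_norm(x, y, power):
    """General p-norm distance, of which the two above are special cases."""
    return (np.abs(x - y) ** power).sum() ** (1.0 / power)

print(manhattan, p_norm(p, q, 1))  # 5.0 5.0
print(euclidean, p_norm(p, q, 2))  # ~3.606 ~3.606
```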

So what about the MIT algorithm proposed by Richard Peng and Michael Cohen? Well they show that their algorithm is optimal for “reducing” matrices under any norm! The first step is to assign each row of the original matrix a “weight”. A row’s weight is related to the number of other rows that it is similar to. It also determines the likelihood that the row will be included in the reduced matrix.

Let us imagine that the row is indeed included in the reduced matrix. Then its values will be multiplied according to its weight. So, for instance, if 10 rows are similar to one another, but not to any other rows of the matrix, each will have a 10 percent chance of getting into the condensed matrix. If one of them does, its entries will all be multiplied by 10, so that it will reflect the contribution of the other nine rows.
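The paper has all the details, but here is a rough Python/NumPy sketch of the keep-with-some-probability-and-rescale mechanics described above. It is not the Peng–Cohen algorithm: the squared row norm is used as a crude stand-in for their similarity-based weights, purely to illustrate the idea:

```python
import numpy as np

def sample_rows(A, k, seed=None):
    """Toy weighted row sampling: keep roughly k rows of A and rescale them.

    Each row i is kept with probability p_i derived from its weight; a kept
    row is multiplied by 1/p_i so that, in expectation, it also accounts for
    the similar rows that were dropped (the "10 rows, 10% chance, multiply
    by 10" example above). The weights here are NOT the ones from the paper.
    """
    rng = np.random.default_rng(seed)
    weights = (A ** 2).sum(axis=1)                # crude stand-in weights
    probs = np.minimum(1.0, k * weights / weights.sum())
    keep = rng.random(A.shape[0]) < probs         # include row i with prob p_i
    return A[keep] / probs[keep, None]            # rescale surviving rows

# 10 near-identical rows plus 3 distinctive ones: the sketch keeps far fewer rows
A = np.vstack([np.tile([1.0, 2.0, 3.0], (10, 1)), np.eye(3) * 5.0])
print(A.shape, sample_rows(A, k=5, seed=0).shape)
```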

You would think that using the Manhattan distance would be simpler than the Euclidean one when calculating the weights… Well, you would be wrong! The previous best effort to reduce a matrix under the 1-norm would return a matrix whose number of rows was proportional to the number of columns of the original matrix raised to the power of 2.5. In the case of the Euclidean distance it would return a matrix whose number of rows is proportional to the number of columns of the original matrix times its own logarithm.

The MIT algorithm of Peng and Cohen is able to reduce matrices under the 1-norm as well as it does under the 2-norm. One important thing is that, for the Euclidean norm, the reduction is as good as that of other algorithms… and that is because they use the best existing algorithm out there… However, for the 1-norm they apply it recursively!

Interested in reading the paper? Well, go to the arXiv and take a look!

Choosing the Right Visualisation Tool for your Task – Reblog

This is a reblog from the DataMarket blog.

We’re frequently asked: What is the best tool to visualize data?

There is obviously no single answer to that question. It depends on the task at hand, and what you want to achieve.

Here’s an attempt to categorize these tasks and point to some of the tools we’ve found to be useful to complete them:

The right tool for the task

Simple one-off charts

The most common tool for simple charting is clearly Excel. It is possible to make near-perfect charts of most chart types using Excel – if you know what you’re doing. Many Excel defaults are sub-optimal, and some of the chart types it offers are simply for show and have no practical application. 3D cone-shaped “bars”, anyone? And Excel makes no attempt at guiding a novice user to the best chart for what she wants to achieve. Here are three alternatives we’ve found useful:

  • Tableau is fast becoming the number one tool for many data visualization professionals. It’s client software (Windows only) that’s available for $999 and gives you a user-friendly way to create well crafted visualizations on top of data that can be imported from all of the most common data file formats. Common charting in Tableau is straight-forward, while some of the more advanced functionality may be less so. Then again, Tableau enables you to create pretty elaborate interactive data applications that can be published online and work on all common browser types, including tablets and mobile handsets. For the non-programmer that sees data visualization as an important part of his job, Tableau is probably the tool for you.
  • DataGraph is a little-known tool that deserves a lot more attention. A very different beast, DataGraph is a Mac-only application ($90 on the AppStore) originally designed to create proper charts for scientific publications, but it has become a powerful tool to create a wide variety of charts for any occasion. Nothing we’ve tested comes close to DataGraph when creating crystal-clear, beautiful charts that are also done “right” as far as most of the information visualization literature is concerned. The workflow and interface may take a while to get to grips with, and some of the more advanced functionality may lie hidden even from an avid user for months of usage, but a wide range of samples, aggressive development and an active user community make DataGraph a really interesting solution for professional charting. If you are looking for a tool to create beautiful, yet easy to understand, static charts, DataGraph may be your tool of choice. And if your medium is print, DataGraph outshines any other application on the market.
    • The best way to see samples of DataGraph’s capabilities is to download the free trial and browse the samples/templates on the application’s startup screen.
  • R is an open-source programming environment for statistical computing and graphics. A super powerful tool, R takes some programming skills to even get started, but it is becoming a standard tool for any self-respecting “data scientist”. An interpreted, command-line-controlled environment, R does a lot more than graphics as it enables all sorts of crunching and statistical computing, even with enormous data sets. In fact we’d say that the graphics are a little bit of a weak spot of R. Not to complain about the data presentation from the information visualization standpoint, but most of the charts that R creates would not be considered refined and therefore need polishing in other software such as Adobe Illustrator to be ready for publication. Not to be missed if working with R is the ggplot2 package, which helps overcome some of the thornier aspects of making R charts and graphs look proper. If you can program, and need a powerful tool to do graphical analysis, R is your tool, but be prepared to spend significant time to make your output look good enough for publication, either in R or by exporting the graphics to another piece of software for touch-up.
    • The R Graphical Manual holds an enormous collection of browsable samples of graphics created using R – and the code and data used to make a lot of them.

Videos and custom high-resolution graphics

If you are creating data visualization videos or high-resolution data graphics, Processing is your tool. Processing is an open source integrated development environment (IDE) that uses a simplified version of Java as its programming language and is especially geared towards developing visual applications.

Processing is great for rapid development of custom data visualization applications that can either be run directly from the IDE, compiled into stand-alone applications or published as Java Applets for publishing on the web.

Java Applets are less than optimal for web publication (ok, they simply suck for a variety of reasons), but a complementary open-source project – Processing.js – has ported Processing to JavaScript, using the canvas element for rendering the visuals (canvas is a way to render and control bitmap graphics in modern web browsers using JavaScript). This is a far superior way to take Processing work online, and strongly recommended over the Applet approach.

The area where we have found that Processing really shines as a data visualization tool is in creating videos. It comes with a video class called MovieMaker that allows you to compose videos programmatically, frame by frame. Each frame may well require some serious crunching and take a long time to calculate before it is appended to a growing video file. The results can be quite stunning. Many of the best known data visualization videos have been made using this method.

Many other great examples showing the power of Processing – and for a lot more than just videos – can be found in Processing.org’s Exhibition Archives.

As can be seen from these examples Processing is obviously also great for rendering static, high-resolution bitmap visualizations.

So if data driven videos, or high-resolution graphics are your thing, and you’re not afraid of programming, we recommend Processing.

Charts for the Web

There are plenty – dozens, if not hundreds – of programming libraries that allow you to add charts to your web sites. Frankly, most of them are sh*t. Some of the more flashy ones use Flash or even Silverlight for their graphics, and there are strong reasons for not depending on browser plugins for delivering your graphics.

We believe we have tested most of the libraries out there, and there are only two we feel comfortable recommending, each has its pros and cons depending on what you are looking for:

  • Highcharts is a JavaScript charting library that renders vector based, interactive charts in SVG (or VML for older versions of Internet Explorer). It is free for non-commercial use, and commercial licenses start at $80. It is a flexible and well designed library that includes all the most common chart types with plenty of customization and interactivity options. Interestingly enough even though Highcharts is a commercial solution, the source code is available to developers that want to make their own modifications or additions. With plenty of examples, good documentation and active user forums, Highcharts is a great choice for most development projects that need charting.
  • gRaphaël is another JavaScript charting library built on top of Raphaël (see below). Like Highcharts, gRaphaël renders SVG graphics on modern browsers, falling back to VML for IE <9. While holding a lot of promise, gRaphaël is not a very mature library, with limited capabilities, few chart types, even fewer examples and pretty much non-existent documentation. It is however available under proper open source licenses and could serve as a base for great things for those that want to extend these humble beginnings.

Other libraries and solutions that may be worth checking out are the popular commercial solution amCharts, Google’s hosted Chart Tools and jQuery library Flot.

Special Requirements and Custom Visualizations

If you want full control of the look, feel and interactivity of your charts, or if you want to create a custom data visualization for the web from scratch, the out-of-the box libraries mentioned above will not suffice.

In fact – you’ll be surprised how soon you run into limitations that will force you to compromise on your design. Seemingly simple preferences such as “I don’t want drop shadows on the lines in my line chart” or “I want to control what happens when a user clicks the X-axis” may already be stretching it with your chosen library. But consider yourself warned: the compromises may well be worth it. You may not have the time and resources to spend diving deeper, let alone writing yet-another-charting-tool™.

However, if you are not one to compromise on your standards, or if you want to take it up a notch and follow the lead of some of the wonderful and engaging data journalism happening at the likes of the NY Times and The Guardian, you’re looking for something that a charting library is simply not designed to do.

The tool for you will probably be one of the following:

  • Raphaël, gRaphaël’s (see above) big brother. Raphaël is a powerful JavaScript library to work with vector graphics. It renders SVG graphics for modern browsers and falls back to VML for Internet Explorer 6, 7 and 8. It comes with a range of good looking samples and decent documentation. Raphaël is open source, and any developer should be able to hit the ground running with it to develop nice looking things quite fast. We don’t recommend Raphaël for the advanced charting part, but for entirely custom data visualizations or small data apps it may very well be the right tool for the task.
  • Protovis is an open source JavaScript visualization toolkit. Rather than simply controlling at a low level the lines and areas that are to be drawn, Protovis allows the developer to specify how data should be encoded in marks – such as bars, dots and lines – to represent it. This approach allows inheritance and scales that enable a developer to construct custom chart types and layouts that can easily take in new data without the need to write any additional code. Protovis natively uses SVG to render graphics, but a couple of efforts have been made to enable VML rendering, making Protovis an option for older versions of Internet Explorer that still account for a significant proportion of traffic on the web. Protovis was originally written by Mike Bostock (now data scientist at Square) and Jeffrey Heer of the Stanford Visualization Group. Their architectural approach is ingenious, but it also takes a bit of an effort to wrap your head around, so be prepared for somewhat of a learning curve. Luckily there are plenty of complete and well-written examples and decent documentation. Once you get going, you will be amazed at the flexibility and power that the Protovis approach provides.
  • D3.js, or “D3” for short, is in many ways the successor of Protovis. In fact Protovis is no longer under active development by the original team due to the fact that its primary developer – Mike Bostock – is now working on D3 instead. D3 builds on many of the concepts of Protovis. The main difference is that instead of having an intermediate representation that separates the rendering of the SVG (or HTML) from the programming interface, D3 binds the data directly to the DOM representation. If you don’t understand what that means – don’t worry, you don’t have to. But it has a couple of consequences that may or may not make D3 more attractive for your needs. The first one is that it – almost without exception – makes rendering faster and thereby animations and smooth transitions from one state to another more feasible. The second is that it will only work on browsers that support SVG, so you will be leaving Internet Explorer 7 and 8 users behind – and due to the deep DOM integration, enabling VML rendering for D3 is a far bigger task than for Protovis – and one that nobody has embarked on yet.

After thorough research of the available options, we chose Protovis as the base for building out DataMarket’s visualization capabilities with an eye on D3 as our future solution when modern browsers finally saturate the market. We see that horizon about 2 years from now.