Celebrating π day at the office thanks to the nice people of people.io
Now Reading: Thinking Mahines by Luke Dormehl
In the two and a half years (or so) since I left academia for industry, I’ve worked with a number of math majors and math PhDs outside of academia and talked to a number of current grad students who were considering going into industry. As a result, my perspective on the role of the math research […]
It is a good read and it you have a few minutes to spare, do give it a go.
Rupert addresses the following myths:
It is true that there are a number of effort to try to replicate (and therefore understand) human thought. Some examples include the Blue Brain project in the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. However, this does not imply that they will get immediately a machine such as HAL or C3-PO.
This is because the brain is fat more complex than the current efforts are able to simulate. As a matter of fact, even simpler brains are significantly more complex for simulation. This does not mean that we should not try to understand and learn how brains work.
Part of the problem is that it is difficult to even define what we mean by “thought”— the so called hard problem. So finding a solution to the strong AI problem is not going to be here soon, but we should definitely try.
So, once that myth is out of the way, the idea that a Terminator-like robot is around the corner is put into perspective. Sure, there are attempts at getting some self-driving cars and such but we are not quite there yet. All in all, it is true that a number of technological advances can be used for good or bad causes, and that is surely something that we all should bear in mind.
Much has been said in pro and against the use of pie charts… And the discussion is by no means something new For example in 1923, the American economist Karl G. Karsten warned us against the pie chart. Karsten’s claims in his book Charts and Graphs are remarkably similar to those heard today:
The disadvantages of the pie chart are many. It is worthless for study and research purposes. In the first place the human eye cannot easily compare as to length the various arcs about the circle, lying as they do in different directions. In the second place, the human eye is not naturally skilled in comparing angles… In the third place, the human eye is not an expert judge of comparative sizes or areas, especially those as irregular as the segments of parts of the circle. There is no way by which the parts of this round unit can be compared so accurately and quickly as the parts of a straight line or bar.
If you are interested to read more about the history of the controversial use of the pie chart, take a look at this post by Dan Kopf.
In a lot of machine learning and data science applications, it is not unusual to use matrices to represent data. It is indeed a very convenient way to keep the information but also to do manipulations, calculations and other useful tricks. As the size of the data increases, of course the size of the matrices grows too and that can be a bit problematic. Finding a way to reduce the size of these matrices while keeping the information is a challenge that a lot of us have faced. Using techniques that exploit the sparsity of the matrices, or even reducing the dimensionality via principal components is common practice.
Reading the latest World Economic Forum Newsletter I came to find out about a new algorithm that MIT researchers will present in the ACM Symposium on Theory of Computing in June. The algorithm is said to find the smallest possible approximation of an original matrix, guaranteeing reliable computations. Indeed the best way to determine how well de “reduced” matrix approximated the original one you need to measure the “distance” between them and a common distance to use the the typical Euclidean measure that we are all familiar with… What? You say you aren’t?… Remember Pythagoras?
The square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.
There you go… all you have to do is extend it to n-dimensions et voilà… That is not the only way to measure distance. Think for example the way in which you more in a grid-city such in Manhattan, New York City… You cannot take move diagonally (except in Broadway I suppose…) so you need to go north-sound or east-west. That distance is actually called the “Manhattan distance“.
Mathematicians refer to “norms” when talking about distance measurement and indeed both the Euclidean and Manhattan distances mentioned above are norms:
So what about the MIT algorithm proposed by Richard Peng and Michael Cohen? Well they show that their algorithm is optimal for “reducing” matrices under any norm! The first step is to assign each row of the original matrix a “weight”. A row’s weight is related to the number of other rows that it is similar to. It also determines the likelihood that the row will be included in the reduced matrix.
Let us imagine that the row is indeed included in the reduced matrix. Then its values will be multiplied according to its weight. So, for instance, if 10 rows are similar to one another, but not to any other rows of the matrix, each will have a 10 percent chance of getting into the condensed matrix. If one of them does, its entries will all be multiplied by 10, so that it will reflect the contribution of the other nine rows.
You would think that using the Manhattan distance would be simpler than the Euclidian one when calculating the weights… Well you would be wrong! The previous best effort to reduce a matrix under the 1-norm would return a matrix whose number of rows was proportional to the number of columns of the original matrix raised to the power of 2.5. In the case of the Euclidean distance it would return a matrix whose number of rows is proportional to the number of columns of the original matrix times its own logarithm.
The MIT algorithm of Peng and Cohen is able to reduce matrices under the 1-norm as well as it does under the 2-norm. One important thing is that for the Euclidean norm, the reduction is as good as that of other algorithms… and that is because they use the same best algorithm out there… However, for the 1-norm it uses it recursively!
Interested in reading the paper? Well go to the ArXiV and take a look!
It’s not who has the best algorithm that wins. It’s who has the most data!
This is a reblog of DataMarket blog.
We’re frequently asked: What is the best tool to visualize data?
There is obviously no single answer to that question. It depends on the task at hand, and what you want to achieve.
Here’s an attempt to categorize these tasks and point to some of the tools we’ve found to be useful to complete them:
The most common tool for simple charting is clearly Excel. It is possible to make near-perfect charts of most chart types using Excel – if you know what you’re doing. Many Excel defaults are sub-optimal, some of the chart types they offer are simply for show and have no practical application. 3D cone shaped “bars” anyone? And Excel makes no attempt at guiding a novice user to the best chart for what she wants to achieve. Here are three alternatives we’ve found useful:
If you are creating data visualization videos or high-resolution data graphics, Processing is your tool. Processing is an open source integrated development environment (IDE) that uses a simplified version of Java as its programming language and is especially geared towards developing visual applications.
Processing is great for rapid development of custom data visualization applications that can either be run directly from the IDE, compiled into stand-alone applications or published as Java Applets for publishing on the web.
The area where we have found that Processing really shines as a data visualization tool, is in creating videos. It comes with a video class called MovieMaker that allows you to compose videos programmatically, frame-by-frame. Each frame may well require some serious crunching and take a long time to calculate before it is appended to a growing video file. The results can be quite stunning. Many of the best known data visualization videos are made using this method, including:
Many other great examples showing the power of Processing – and for a lot more than just videos – can be found inProcessing.org’s Exhibition Archives.
As can be seen from these examples Processing is obviously also great for rendering static, high-resolution bitmap visualizations.
So if data driven videos, or high-resolution graphics are your thing, and you’re not afraid of programming, we recommend Processing.
There are plenty – dozens, if not hundreds – of programming libraries that allow you to add charts to your web sites. Frankly, most of them are sh*t. Some of the more flashy ones use Flash or even Silverlight for their graphics, and there are strong reasons for not depending on browser plugins for delivering your graphics.
We believe we have tested most of the libraries out there, and there are only two we feel comfortable recommending, each has its pros and cons depending on what you are looking for:
If you want full control of the look, feel and interactivity of your charts, or if you want to create a custom data visualization for the web from scratch, the out-of-the box libraries mentioned above will not suffice.
In fact – you’ll be surprised how soon you run into limitations that will force you to compromise on your design. Seemingly simple preferences such as “I don’t want drop shadows on the lines in my line chart”, or “I want to control what happens when a user clicks the X-axis” and you may already be stretching it with your chosen library. But consider yourself warned: The compromises may well be worth it. You may not have the time and resources to spend diving deeper, let alone writing yet-another-charting-tool™
However, if you are not one to compromise on your standards, or if you want to take it up a notch and follow the lead of some of the wonderful and engaging data journalism happening at the likes of the NY Times and The Guardian, you’re looking for something that a charting library is simply not designed to do.
The tool for you will probably be one of the following:
After thorough research of the available options, we chose Protovis as the base for building out DataMarket’s visualization capabilities with an eye on D3 as our future solution when modern browsers finally saturate the market. We see that horizon about 2 years from now.