Random thoughts about random subjects… From science to literature and between manga and watercolours, by way of data science and rugby; including film, physics and fiction, programming, pictures and puns.
I recently asked the students in the Data Science class at GA to bring in a good example of a “daft graph” and they all did so with gusto. As usual, there was a certain cable news channel that was mentioned many times for its misleading use of graphics.
1919 – Extracts from an Investigation into the Physical Properties of Books as They Are At Present Published. The Society of Calligraphers, Boston.
This is a small pamphlet that was designed and authored by the graphic designer W.A. Dwiggins and his cousin L.B. Siegfried. It pilloried the format of contemporary books, reflecting Dwiggins’s concern about the poor methods used in printing trade books in the US at that time.
The book was published by the imaginary Society of Calligraphers and the stinging investigation was a hoax cooked up by Dwiggins – nevertheless it did have an effect on publishing in the US following its wide distribution.
The graph by Dwiggins shows the reduction in book quality since 1910.
If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.
When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.
Luckily, you can’t really go wrong with either.
The Case for R
R has a long and trusted history and a robust supporting community in the data industry. Together, those facts mean that you can rely on online support from others in the field if you need assistance or have questions about using the language. Plus, there are plenty of publicly released packages, more than 5,000 in fact, that you can download to use in tandem with R to extend its capabilities to new heights. That makes R great for conducting complex exploratory data analysis. R also integrates well with other computer languages like C++, Java, and C.
When you need to do heavy statistical analysis or graphing, R’s your go-to. Common mathematical operations like matrix multiplication work straight out of the box, and the language’s array-oriented syntax makes it easier to translate from math to code, especially for someone with no or minimal programming background.
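To make the contrast concrete, here is a minimal Python sketch of the operation described above: the matrix product that R writes natively as `A %*% B` needs NumPy in Python (the matrices here are arbitrary toy values).

```python
import numpy as np

# In R, matrix multiplication is built in: A %*% B.
# In Python, the idiomatic equivalent uses NumPy's @ operator.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B  # matrix product
print(C)
```

Either way, the array-oriented style means the code reads much like the underlying mathematics.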
The Case for Python
Python is a general-purpose programming language that can pretty much do anything you need it to: data munging, data engineering, data wrangling, website scraping, web app building, and more. It’s simpler to master than R if you have previously learned an object-oriented programming language like Java or C++.
In addition, because Python is an object-oriented programming language, it’s easier to write large-scale, maintainable, and robust code with it than with R. Using Python, the prototype code that you write on your own computer can be used as production code if needed.
Although Python doesn’t have as comprehensive a set of packages and libraries available to data professionals as R, the combination of Python with tools like Pandas, Numpy, Scipy, Scikit-Learn, and Seaborn will get you pretty darn close. The language is also slowly becoming more useful for tasks like machine learning, and basic to intermediate statistical work (formerly just R’s domain).
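As a small, purely illustrative taste of that stack (the column names and numbers below are invented), a few lines of Pandas handle tabular data and per-group summaries:

```python
import pandas as pd

# Hypothetical toy data; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "team":    ["a", "a", "b", "b"],
    "revenue": [10.0, 12.0, 7.0, 9.0],
})

# Mean revenue per team, computed in one line
means = df.groupby("team")["revenue"].mean()
print(means)
```

The same split-apply-combine pattern scales from toy frames like this one to millions of rows.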
Choosing Between Python and R
Here are a few guidelines for determining whether to begin your data language studies with Python or with R.
Choose the language to begin with based on personal preference: whichever comes more naturally to you and is easier to grasp from the get-go. To give you a sense of what to expect, mathematicians and statisticians tend to prefer R, whereas computer scientists and software engineers tend to favor Python. The best news is that once you learn to program well in one language, it’s pretty easy to pick up others.
You can also make the Python vs. R call based on a project you know you’ll be working on in your data studies. If you’re working with data that’s been gathered and cleaned for you, and your main focus is the analysis of that data, go with R. If you have to work with dirty or jumbled data, or to scrape data from websites, files, or other data sources, you should start learning, or advancing your studies in, Python.
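To give a flavor of the kind of cleanup work Python is often chosen for, here is a minimal sketch using only the standard library (the records are invented for illustration):

```python
# Toy example of "dirty" data: inconsistent casing, stray whitespace,
# and missing values, cleaned with nothing but plain Python.
raw = ["  Alice ", "BOB", "", "carol", None]

def clean(name):
    """Normalize one record, mapping blanks and None to a sentinel."""
    if not name or not name.strip():
        return "unknown"
    return name.strip().lower()

cleaned = [clean(n) for n in raw]
print(cleaned)  # ['alice', 'bob', 'unknown', 'carol', 'unknown']
```

Real cleanup jobs are messier, but they tend to reduce to many small normalization steps like this one.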
Once you have the basics of data analysis under your belt, another criterion for evaluating which language to further your skills in is what language your teammates are using. If you’re all literally speaking the same language, it’ll make collaboration—as well as learning from each other—much easier.
Job postings calling for Python skills and those calling for R skills have grown at similar rates over the last few years.
That said, as you can see, Python has started to overtake R in data jobs. Thanks to the expansion of the Python ecosystem, tools for nearly every aspect of computing are readily available in the language. In addition, since Python can be used to develop web applications, it lets companies share talent between their web development and data science teams. That’s a major boon given the shortage of data experts in the current marketplace.
The Bottom Line
In general, you can’t go wrong whether you choose to learn Python first or R first for data analysis. Each language has its pros and cons for different scenarios and tasks. In addition, there are actually libraries for using Python with R, and vice versa—so learning one won’t preclude you from being able to learn and use the other. Perhaps the best solution is to use the above guidelines to decide which of the two languages to begin with, then fortify your skill set by learning the other one.
The German language is so perfectly suited for these syndromes, coming down with them in any other language just won’t do.
At some point in the last couple of decades, parents in Germany started coming down with Kevinismus— a strange propensity to give their kids wholly un-German, American-sounding names like Justin, Mandy, Dennis, Cindy, and Kevin. Kids with these names tend to be less successful and have more behavior problems in school. Studies of the Kevinismus phenomenon attribute these effects to a combination of teachers’ prejudices toward the names, and the lower social status of parents who choose names like Kevin.
Föhn is the name for a specific wind that cools air as it draws up one side of a mountain, and then warms it as it compresses coming down the other side. These winds are believed to cause headaches and other feelings of illness. Many a 19th century German lady took to her fainting couch with a cold compress, suffering from Föhnkrankheit.
Kreislaufzusammenbruch, or “circulatory collapse,” sounds deathly serious, but it’s used quite commonly in Germany to mean something like “feeling woozy” or “I don’t think I can come into work today.”
Hörsturz refers to a sudden loss of hearing, which in Germany is apparently frequently caused by stress. Strangely, while every German knows at least 5 people who have had a bout of Hörsturz, it is practically unheard of anywhere else.
Frühjahrsmüdigkeit or “early year tiredness” can be translated as “spring fatigue.” Is it from the change in the weather? Changing sunlight patterns? Hormone imbalance? Allergies? As afflictions go, Frühjahrsmüdigkeit is much less fun than our “spring fever,” which is instead associated with increased vim, vigor, pep, and randiness.
Fernweh is the opposite of homesickness. It is the longing for travel, or getting out there beyond the horizon, what you might call… awaysickness.
Putzen means “to clean” and Fimmel is a mania or obsession. Putzfimmel is an obsession with cleaning. It is not unheard of outside of Germany, but elsewhere it is less culturally embedded and less fun to say.
Werthersfieber is an old-fashioned type of miserable lovesickness that was named “Werther’s fever” for the hero of Goethe’s The Sorrows of Young Werther. Poor young Werther suffers for the love of a peasant girl who is already married. Death is his only way out. A generation of sensitive young men made Werthersfieber quite fashionable in the late 18th century.
Ostalgie is nostalgia for the old way of life in East Germany (“ost” means East). If you miss your old Trabant and those weekly visits from the secret police, you may have Ostalgie.
Zeitkrankheit is “time sickness” or “illness of the times.” It’s a general term for whatever the damaging mindset or preoccupations of a certain era are.
Weltschmerz or “world pain,” is a sadness brought on by a realization that the world cannot be the way you wish it would be. It’s more emotional than pessimism, and more painful than ennui.
Ichschmerz is like Weltschmerz, but it is dissatisfaction with the self rather than the world. Which is probably what Weltschmerz really boils down to most of the time.
Lebensmüdigkeit translates as despair or world-weariness, but it also more literally means “life tiredness.” When someone does something stupidly dangerous, you might sarcastically ask, “What are you doing? Are you lebensmüde?!”
Zivilisationskrankheit, or “civilization sickness” is a problem caused by living in the modern world. Stress, obesity, eating disorders, carpal tunnel syndrome and diseases like type 2 diabetes are all examples.
Torschlusspanik or “gate closing panic” is the anxiety-inducing awareness that as time goes on, life’s opportunities just keep getting fewer and fewer and there’s no way to know which ones you should be taking before they close forever. It’s a Zivilisationskrankheit that may result in Weltschmerz, Ichschmerz, or Lebensmüdigkeit.
Slowly but steadily, data are forcing their way into every nook and cranny of every industry, company, and job. Managers who aren’t data savvy, who can’t conduct basic analyses, interpret more complex ones, and interact with data scientists are already at a disadvantage. Companies without a large and growing cadre of data-savvy managers are similarly disadvantaged.
Fortunately, you don’t have to be a data scientist or a Bayesian statistician to tease useful insights from data. This post explores an exercise I’ve used for 20 years to help those with an open mind (and a pencil, paper, and calculator) get started. One post won’t make you data savvy, but it will help you become data literate, open your eyes to the millions of small data opportunities, and enable you to work a bit more effectively with data scientists, analysts, and all things quantitative.
While the exercise is very much a how-to, each step also illustrates an important concept in analytics — from understanding variation to visualization.
First, start with something that interests, even bothers, you at work, like consistently late-starting meetings. Whatever it is, form it up as a question and write it down: “Meetings always seem to start late. Is that really true?”
Next, think through the data that can help answer your question, and develop a plan for creating them. Write down all the relevant definitions and your protocol for collecting the data. For this particular example, you have to define when the meeting actually begins. Is it the time someone says, “Ok, let’s begin.”? Or the time the real business of the meeting starts? Does kibitzing count?
Now collect the data. It is critical that you trust the data. And, as you go, you’re almost certain to find gaps in data collection. You may find that even though a meeting has started, it starts anew when a more senior person joins in. Modify your definition and protocol as you go along.
Sooner than you think, you’ll be ready to start drawing some pictures. Good pictures make it easier for you to both understand the data and communicate main points to others. There are plenty of good tools to help, but I like to draw my first picture by hand. My go-to plot is a time-series plot, where the horizontal axis has the date and time and the vertical axis has the variable of interest. Thus, a point on the graph below is the date and time of a meeting versus the number of minutes late.
Now return to the question that you started with and develop summary statistics. Have you discovered an answer? In this case, “Over a two-week period, 10% of the meetings I attended started on time. And on average, they started 12 minutes late.”
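The summary statistics above are simple arithmetic; as a sketch, here is how they might be computed in Python with nothing but the standard library (the minutes-late values are invented, chosen to match the 10% and 12-minute figures quoted in the text):

```python
from statistics import mean

# Hypothetical minutes-late for ten meetings over two weeks;
# 0 means the meeting started on time.
minutes_late = [0, 15, 12, 8, 20, 11, 14, 9, 18, 13]

on_time_share = sum(1 for m in minutes_late if m == 0) / len(minutes_late)
avg_late = mean(minutes_late)

print(f"{on_time_share:.0%} started on time")      # 10% started on time
print(f"{avg_late:.0f} minutes late on average")   # 12 minutes late on average
```

A calculator does the job just as well; the point is that the step from raw observations to summary statistics is small.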
But don’t stop there. Answer the “so what?” question. In this case, “If those two weeks are typical, I waste an hour a day. And that costs the company $X/year.”
Many analyses end because there is no “so what?” Certainly if 80% of meetings start within a few minutes of their scheduled start times, the answer to the original question is, “No, meetings start pretty much on time,” and there is no need to go further.
But this case demands more, as some analyses do. Get a feel for variation. Understanding variation leads to a better feel for the overall problem, deeper insights, and novel ideas for improvement. Note on the picture that 8-20 minutes late is typical. A few meetings start right on time, others nearly a full 30 minutes late. It might be better if one could judge, “I can get to meetings 10 minutes late, just in time for them to start,” but the variation is too great.
Now ask, “What else do the data reveal?” It strikes me that five meetings began exactly on time, while every other meeting began at least seven minutes late. In this case, bringing meeting notes to bear reveals that all five meetings were called by the Vice President of Finance. Evidently, she starts all her meetings on time.
So where do you go from here? Are there important next steps? This example illustrates a common dichotomy. On a personal level, results pass both the “interesting” and “important” test. Most of us would give anything to get back an hour a day. And you may not be able to make all meetings start on time, but if the VP can, you can certainly start the meetings you control promptly.
On the company level, results so far only pass the interesting test. You don’t know whether your results are typical, nor whether others can be as hard-nosed as the VP when it comes to starting meetings. But a deeper look is surely in order: Are your results consistent with others’ experiences in the company? Are some days worse than others? Which starts later: conference calls or face-to-face meetings? Is there a relationship between meeting start time and most senior attendee? Return to step one, pose the next group of questions, and repeat the process. Keep the focus narrow — two or three questions at most.
I hope you’ll have fun with this exercise. Many find a primal joy in data. Hooked once, hooked for life. But whether you experience that primal joy or not, do not take this exercise lightly. There are fewer and fewer places for the “data illiterate” and, in my humble opinion, no more excuses.
I’d forgotten just how full the first day of an ICM is. First, you need to turn up early for the opening ceremony, so you end up sitting around for an hour and a half or so before it even starts. Then there’s the ceremony itself, which lasts a couple of hours. Then in the afternoon you have talks about the four Fields Medallists and the Nevanlinna Prize winner, with virtually no breaks. Then after a massive ten minutes, the Nevanlinna Prize winner talks about his (in this case) own work, about which you have just heard, but in a bit more detail. That took us to 5:45pm. And just to round things off, Jim Simons is giving a public lecture at 8pm, which I suppose I could skip but I think I’m not going to. (The result is that most of this post will be written after it, but right at this very moment it is not yet 8pm.)
I didn’t manage to maintain my ignorance of the fourth Fields medallist, because I was sitting only a few rows behind the medallists, and when Martin Hairer turned up wearing a suit, there was no longer any room for doubt. However, there was a small element of surprise in the way that the medals were announced. Ingrid Daubechies (president of the IMU) told us that they had made short videos about each medallist, and also about the Nevanlinna Prize winner, who was Subhash Khot. So for each winner in turn, she told us that a video was about to start. An animation of a Fields medal then rotated on the large screens at the front of the hall, and when it settled down one could see the name of the next winner. The beginning of each video was drowned out by the resulting applause (and also a cheer for Bhargava and an even louder one for Mirzakhani), but they were pretty good. At the end of each video, the winner went up on stage, to more applause, and sat down. Then when the five videos were over, the medals were presented, to each winner in turn, by the president of Korea.
Here they are, getting their medals/prize. It wasn’t easy to get good photos with a cheap camera on maximum zoom, but they give some idea.
After those prizes were announced, we had the announcements of the Gauss Prize and the Chern Medal. The former is for mathematical work that has had a strong impact outside mathematics, and the latter is for lifetime achievement. The Gauss Prize went to Stanley Osher and the Chern Medal to Phillip Griffiths.
If you haven’t already seen it, the IMU page about the winners has links to very good short (but not too short) summaries of their work. I’m quite glad about that because I think it means I can get away with writing less about them myself. I also recommend this Google Plus post by John Baez about the work of Mirzakhani.
I have one remark to make about the Fields medals, which is that I think that this time round there were an unusually large number of people who could easily have got medals, including other women. (This last point is important — one should think of Mirzakhani’s medal as the new normal rather than as some freak event.) I have two words to say about them: Mikhail Gromov. To spell it out, he is an extreme, but by no means unique, example of a mathematician who did not get a Fields medal but whose reputation would be pretty much unaltered if he had. In the end it’s the theorems that count, and there have been some wonderful theorems proved by people who just missed out this year.
Other aspects of the ceremony were much as one would expect, but there was rather less time devoted to long and repetitive speeches about the host country than I have been used to at other ICMs, which was welcome.
That is not to say that interesting facts about the host country were entirely ignored. The final speech of the ceremony was given by Martin Groetschel, who told us several interesting things, one of which was the number of mathematics papers published in international journals by Koreans in 1981. He asked us to guess, so I’m giving you the opportunity to guess before reading on.
Now Korea is 11th in the world for the number of mathematical publications. Of course, one can question what this really means, but it certainly means something when you hear that the answer to the question above is 3. So in just one generation a serious mathematical tradition has been created from almost nothing.
He also told us the names of the people on various committees. Here they are, except that I couldn’t quite copy all of them down fast enough.
The Fields Medal committee consisted of Daubechies, Ambrosio, Eisenbud, Fukaya, Ghys, Dick Gross, Kirwan, Kollar, Kontsevich, Struwe, Zeitouni and Günter Ziegler.
The program committee consisted of Carlos Kenig (chair), Bolthausen, Alice Chang, de Melo, Esnault, me, Kannan, Jong Hae Keum, Le Bris, Lubotsky, Nesetril and Okounkov.
The ICM executive committee (if that’s the right phrase) for the next four years will be Shigefumi Mori (president), Helge Holden (secretary), Alicia Dickenstein (VP), Vaughan Jones (VP), Dick Gross, Hyungju Park, Christiane Rousseau, Vasudevan Srinivas, John Toland and Wendelin Werner.
He also told us about various initiatives of the IMU, one of which sounded interesting (by which I don’t mean that the others didn’t). It’s called the adopt-a-graduate-student initiative. The idea is that the IMU will support researchers in developed countries who want to provide some kind of mentorship for graduate students in less developed countries working in a similar area who might otherwise not find it easy to receive appropriate guidance. Or something like that.
Ingrid Daubechies also told us about two other initiatives connected with the developing world. One was that the winner of the Chern Medal gets to nominate a good cause to receive a large amount of money. Stupidly I seem not to have written it down, but it may have been $250,000. Anyhow, that order of magnitude. Phillip Griffiths chose the African Mathematics Millennium Science Initiative, or AMMSI. The other was that the five winners of the Breakthrough Prizes in mathematics, Donaldson, Kontsevich, Lurie, Tao and Taylor, have each given $100,000 towards a $500,000 fund for helping graduate students from the developing world. I don’t know exactly what form the help will take, but the phrase “breakout graduate fellowships” was involved.
When I get time, I’ll try to write something about the Laudationes, but right now I need to sleep. I have to confess that during Jim Simons’s talk, my jet lag caught up with me in a major way and I simply couldn’t keep awake. So I don’t really have much to say about it, except that there was an amusing Q&A session where several people asked long rambling “questions” that left Jim Simons himself amusingly nonplussed. His repeated requests for short pithy questions were ignored.
Just before I finish, I’ve remembered an amusing thing that happened during the early part of the ceremony, when some traditional dancing was taking place (or at least I assume it was traditional). At one point some men in masks appeared, who looked like this.
Just while we’re at it, here are some more dancers.
Dancers of various kinds
Anyhow, when the men in masks came on stage, there were screams of terror from Mirzakhani’s daughter, who looked about two and a half, and delightful, and she (the daughter) took a long time to be calmed down. I think my six-year-old son might have felt the same way — he had to leave a pantomime version of Hansel and Gretel, to which he had been taken as a birthday treat when he was five, almost the instant it started, and still has those tendencies.
An interrobang writ in wine? (Photo courtesy of Alasdair Gillon.)
Happy new year! Are you ready for a hair of the dog? Earlier this month, Dr Jesús Rogel-Salazar, a physicist with interests in quantum mechanics, ultra cold matter, nonlinear optics, computational physics — and punctuation, as it turns out — got in touch on Twitter to ask:
Any idea if inverted interrobangs are/were in use, or are still people using the ¡combination?/¿combination!
Dr Rogel-Salazar didn’t say so explicitly, but I understood his question to refer to the use of punctuation in Spanish, where questions and exclamations are book-ended by normal and rotated marks, like ¿this? and ¡this!
The interrobang, of course, is this mark, ‘‽’, the single-character union of ‘?’ and ‘!’ invented by Martin K. Speckter back in 1962. Since then, however, “interrobang” has also passed into (relatively) common usage to refer to the use of both marks at the end of a sentence, thus: ‘?!’ or ‘!?’.
Now there is technically an inverted interrobang intended for use in Spanish and culturally-related languages such as Catalan and Galician. (Assuming that your browser can display it, it looks like this: ‘⸘’.) As far as I know, the “gnaborretni”, as it is called, is a purely theoretical mark; while the interrobang occasionally surfaces in public (notably in an opinion of the Court of Appeals), I don’t recall ever having come across a gnaborretni. I passed Dr Rogel-Salazar’s query on to Alasdair Gillon, a friend of mine who lives and works in Spain, to see if he could shed some light on it. Here is his reply:
I have never seen the ¿combination! Not anywhere. I may have seen ¡¿this?! once or twice.
Actually, especially in social networking, the upside down marks are disappearing altogether, and people are just going with the rest of the world. You never see it in WhatsApp, SMS or Facebook messages, etc.
I have definitely never seen the inverted interrobang. In fact, I would say I’ve never seen an upright one in Spain, except perhaps for this advert for wine [top right], which caught my eye in Barcelona recently and made me think of you. What else could it be?
What else indeed?
So, have any Shady Characters readers come across the gnaborretni, in either its pure (⸘) or debased forms (¡¿)? Is Spanish losing the pleasing rotational symmetry of its questions and exclamations?
The science-associated blogosphere and Twitterverse were abuzz today with the news of a Gotcha! story published in today’s Science, the premier science publication from the American Association for the Advancement of Science. Reporter John Bohannon, working for Science, fabricated a completely fictitious research paper detailing the purported “anti-cancer properties of a substance extracted from a lichen”, and submitted it under an assumed name to no fewer than 304 Open Access journals all over the world over the course of 10 months. He notes:
… it should have been promptly rejected. Any reviewer with more than a high-school knowledge of chemistry and the ability to understand a basic data plot should have spotted the paper’s shortcomings immediately. Its experiments are so hopelessly flawed that the results are meaningless.
Nevertheless, 157 journals, out of the 255 that provided a decision to the author’s nom de guerre, accepted the paper. As Bohannon indicates:
Acceptance was the norm, not the exception. The paper was accepted by journals hosted by industry titans Sage and Elsevier (Note: Bohannon also mentions Wolters Kluwer in the report). The paper was accepted by journals published by prestigious academic institutions such as Kobe University in Japan. It was accepted by scholarly society journals. It was even accepted by journals for which the paper’s topic was utterly inappropriate, such as the Journal of Experimental & Clinical Assisted Reproduction.
This operation, termed a ‘sting’ in Bohannon’s story, ostensibly tested the weaknesses, especially the poor quality control, of the peer review process in Open Access publishing. Bohannon chose only those journals which adhered to the standard Open Access model, in which the author pays if the paper is published. When a journal accepted either the original or a superficially revised version (retaining all the fatal flaws), Bohannon sent an email requesting to withdraw the paper, citing a ‘serious flaw’ in the experiment which ‘invalidates the conclusion’. Bohannon notes that about 60% of the final decisions appeared to have been made with no apparent sign of any peer review; among the submissions that were reviewed, the acceptance rate was 70%, only 12% of the reviews identified any scientific flaws, and about half of the papers with bad reviews were nevertheless accepted at editorial discretion.
As noted by some scientists and Open Access publishers like Hindawi whose journals rejected the submission, the poor quality control evinced by this sting is not directly attributable to the Open Access model. A scientific journal that doesn’t perform peer review, or does a shoddy job of it, is critically detrimental to the overall ethos of scientific publishing and actively undermines the process and credibility of scientific research and the communication of the observations thereof, regardless of whether the journal is Open Access or Pay-for-Play.
And that is one of the major criticisms of this report. Wrote Michael B Eisen, UC Berkeley Professor and co-founder of the Public Library of Science (PLoS; incidentally, the premier Open Access journal PLOS One was one of the few to flag the ethical flaws in, as well as reject, the submission) in his blog today:
… it’s nuts to construe this as a problem unique to open access publishing, if for no other reason than the study didn’t do the control of submitting the same paper to subscription-based publishers […] We obviously don’t know what subscription journals would have done with this paper, but there is every reason to believe that a large number of them would also have accepted the paper […] Like OA journals, a lot of subscription-based journals have businesses based on accepting lots of papers with little regard to their importance or even validity…
I agree. This report cannot support any kind of comparison between Open Access and subscription-based journals. The shock-and-horror comes only if one places Open Access journals a priori on a hallowed pedestal for no good reason. For me, one aspect of the revealed deplorable picture stood out in particular – the question: Are all Open Access Journals created equal? The answer to that would seem to be an obvious ‘No’, especially given the outcome of this sting. But then it begs the follow-up question: if this had indeed been a serious and genuine paper, would the author (in this case, Bohannon) have sought out obscure OA journals for publishing it?
As I commented on Prof. Eisen’s blog, rather than criticizing the Open Access model, the most obvious solution to ameliorate this kind of situation seems to be to institute a measure of quality assessment for Open Access journals. I am not an expert in the publishing business, but surely some kind of reasonable and workable metric can be worked out in the same way Thomson Reuters did all those years ago for Pay-for-Play journals? Dr. Eva Amsen of the Faculty of 1000 (and an erstwhile blog colleague at Nature Blogs) pointed out in reply that a simple solution would be to quality control for peer review via an Open Peer Review process. She wrote:
… This same issue of Science features an interview with Vitek Tracz, about F1000Research’s open peer review system. We include all peer reviewer names and their comments with all papers, so you can see exactly who looked at a paper and what they said.
Prof. Eisen, a passionate proponent of the Open Access system and someone who has been trying for a long time to reform the scientific publishing industry from within, agrees that more than a “repudiation [of the Open Access model] for enabling fraud”, what this report reveals is the disturbing lesson that the Peer Review system, as it currently exists, is broken. He wrote:
… the lesson people should take home from this story not that open access is bad, but that peer review is a joke. If a nakedly bogus paper is able to get through journals that actually peer reviewed it, think about how many legitimate, but deeply flawed, papers must also get through. […] there has been a lot of smoke lately about the “reproducibility” problem in biomedical science, in which people have found that a majority of published papers report facts that turn out not to be true. This all adds up to showing that peer review simply doesn’t work. […] There are deep problems with science publishing. But the way to fix this is not to curtain open access publishing. It is to fix peer review.
I couldn’t agree more. Even those who swear by peer review must acknowledge that the peer review system, as it exists now, is not a magic wand that can separate the grain from the chaff by a simple touch. I mean, look at the thriving Elsevier Journal Homeopathy, allegedly peer reviewed… Has that ever stemmed the bilge it churns out on a regular basis?
But the other question that really, really bothers me is more fundamental: As Bohannon notes, “about one-third of the journals targeted in this sting are based in India — overtly or as revealed by the location of editors and bank accounts — making it the world’s largest base for open-access publishing; and among the India-based journals in my sample, 64 accepted the fatally flawed papers and only 15 rejected it.”
Yikes! How and when did India become this haven for dubious, low-quality Open-Access publishing? (For context, see this interactive map of the sting.)
We’re frequently asked: What is the best tool to visualize data?
There is obviously no single answer to that question. It depends on the task at hand, and what you want to achieve.
Here’s an attempt to categorize these tasks and point to some of the tools we’ve found to be useful to complete them:
The right tool for the task
Simple one-off charts
The most common tool for simple charting is clearly Excel. It is possible to make near-perfect charts of most chart types using Excel – if you know what you’re doing. Many Excel defaults are sub-optimal, and some of the chart types it offers are simply for show, with no practical application (3D cone-shaped “bars”, anyone?). And Excel makes no attempt at guiding a novice user to the best chart for what she wants to achieve. Here are three alternatives we’ve found useful:
Tableau is fast becoming the number one tool for many data visualization professionals. It’s client software (Windows only) that’s available for $999 and gives you a user-friendly way to create well-crafted visualizations on top of data that can be imported from all of the most common data file formats. Common charting in Tableau is straightforward, while some of the more advanced functionality may be less so. Then again, Tableau enables you to create pretty elaborate interactive data applications that can be published online and work on all common browser types, including tablets and mobile handsets. For the non-programmer who sees data visualization as an important part of their job, Tableau is probably the tool for you.
DataGraph is a little-known tool that deserves a lot more attention. A very different beast, DataGraph is a Mac-only application ($90 on the App Store) originally designed to create proper charts for scientific publications, but it has become a powerful tool for creating a wide variety of charts for any occasion. Nothing we’ve tested comes close to DataGraph when creating crystal-clear, beautiful charts that are also done “right” as far as most of the information visualization literature is concerned. The workflow and interface may take a while to get to grips with, and some of the more advanced functionality may lie hidden even from an avid user for months, but a wide range of samples, aggressive development and an active user community make DataGraph a really interesting solution for professional charting. If you are looking for a tool to create beautiful yet easy-to-understand static charts, DataGraph may be your tool of choice. And if your medium is print, DataGraph outshines any other application on the market.
The best way to see samples of DataGraph’s capabilities is to download the free trial and browse the samples/templates on the application’s startup screen.
R is an open-source programming environment for statistical computing and graphics. A super powerful tool, R takes some programming skills to even get started, but it is becoming a standard tool for any self-respecting “data scientist”. An interpreted, command-line-controlled environment, R does a lot more than graphics, as it enables all sorts of crunching and statistical computing, even with enormous data sets. In fact, we’d say that graphics are a bit of a weak spot of R. There is little to complain about from the information visualization standpoint, but most of the charts that R creates would not be considered refined and therefore need polishing in other software, such as Adobe Illustrator, to be ready for publication. Not to be missed if working with R is the ggplot2 package, which helps overcome some of the thornier aspects of making R’s charts and graphs look proper. If you can program and need a powerful tool for graphical analysis, R is your tool, but be prepared to spend significant time making your output look good enough for publication, either in R or by exporting the graphics to another piece of software for touch-up.
The R Graphical Manual holds an enormous collection of browsable samples of graphics created using R – and the code and data used to make a lot of them.
Videos and custom high-resolution graphics
If you are creating data visualization videos or high-resolution data graphics, Processing is your tool. Processing is an open source integrated development environment (IDE) that uses a simplified version of Java as its programming language and is especially geared towards developing visual applications.
Processing is great for rapid development of custom data visualization applications that can be run directly from the IDE, compiled into stand-alone applications, or published to the web as Java applets.
The area where we have found that Processing really shines as a data visualization tool, is in creating videos. It comes with a video class called MovieMaker that allows you to compose videos programmatically, frame-by-frame. Each frame may well require some serious crunching and take a long time to calculate before it is appended to a growing video file. The results can be quite stunning. Many of the best known data visualization videos are made using this method, including:
As can be seen from these examples, Processing is also great for rendering static, high-resolution bitmap visualizations.
So if data driven videos, or high-resolution graphics are your thing, and you’re not afraid of programming, we recommend Processing.
Charts for the Web
There are plenty – dozens, if not hundreds – of programming libraries that allow you to add charts to your web sites. Frankly, most of them are sh*t. Some of the more flashy ones use Flash or even Silverlight for their graphics, and there are strong reasons for not depending on browser plugins for delivering your graphics.
We believe we have tested most of the libraries out there, and there are only two we feel comfortable recommending, each with its pros and cons depending on what you are looking for:
Other libraries and solutions that may be worth checking out are the popular commercial solution amCharts, Google’s hosted Chart Tools and jQuery library Flot.
Special Requirements and Custom Visualizations
If you want full control of the look, feel and interactivity of your charts, or if you want to create a custom data visualization for the web from scratch, the out-of-the-box libraries mentioned above will not suffice.
In fact, you’ll be surprised how soon you run into limitations that force you to compromise on your design. Voice a seemingly simple preference such as “I don’t want drop shadows on the lines in my line chart” or “I want to control what happens when a user clicks the X-axis”, and you may already be stretching your chosen library. But consider yourself warned: the compromises may well be worth it. You may not have the time and resources to dive deeper, let alone write yet-another-charting-tool™.
However, if you are not one to compromise on your standards, or if you want to take it up a notch and follow the lead of some of the wonderful and engaging data journalism happening at the likes of the NY Times and The Guardian, you’re looking for something that a charting library is simply not designed to do.
The tool for you will probably be one of the following:
D3.js, or “D3” for short, is in many ways the successor of Protovis. In fact, Protovis is no longer under active development by the original team, because its primary developer, Mike Bostock, is now working on D3 instead.
D3 builds on many of the concepts of Protovis. The main difference is that instead of having an intermediate representation that separates the rendering of the SVG (or HTML) from the programming interface, D3 binds the data directly to the DOM representation. If you don’t understand what that means, don’t worry: you don’t have to. But it has a couple of consequences that may or may not make D3 more attractive for your needs. The first is that it, almost without exception, makes rendering faster, and thereby makes animations and smooth transitions from one state to another more feasible. The second is that it will only work on browsers that support SVG, so you will be leaving Internet Explorer 7 and 8 users behind; and due to the deep DOM integration, enabling VML rendering for D3 is a far bigger task than for Protovis, and one that nobody has embarked on yet.
After thorough research of the available options, we chose Protovis as the base for building out DataMarket’s visualization capabilities with an eye on D3 as our future solution when modern browsers finally saturate the market. We see that horizon about 2 years from now.
As the parent of teenage boys who have to be dragged out of bed on school days, I had been looking forward to earlier sunrises once the winter solstice was past. But early January mornings seemed darker than ever while at the same time, the sky was clearly lighter around 5 p.m.
[Figure 8: An analemma shows the Sun’s varying positions over a year.]
It turned out that what I suspected was actually true — by Jan. 2, there were 12 more minutes of sunlight in the afternoons, but 3 fewer minutes in the morning. It also turned out that the reasons for it were complicated, as I discovered in a series of phone and e-mail conversations with Jay M. Pasachoff, a professor of astronomy at Williams College, and a former student of his, Joseph Gangestad, who received his Ph.D. in orbital mechanics from Purdue.
They pointed me to the Equation of Time, a grandly named formula relating to the fact that not all days are 24 hours if you track noon by the position of the Sun instead of by a clock.
“We’ve all seen a readout of the Equation of Time,” Dr. Pasachoff said. It’s that uneven figure 8 that can be found on globes, printed over a deserted part of the Pacific, a shape known as an analemma.
If Earth’s axis were perpendicular to its orbit instead of tilted, and if its orbit were a circle instead of an ellipse, the Sun would appear in the same spot in the sky each day and clocks and sundials would always match. Instead, they can be as much as 16 minutes apart, and that’s where things get complicated.
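That gap between clock and sundial can be reproduced with a textbook formula. A minimal sketch, assuming the common three-term approximation of the Equation of Time (the coefficients vary slightly between references and are not from this article); positive values mean the sundial runs ahead of the clock:

```python
import math

def equation_of_time(day_of_year):
    """Approximate the sundial-minus-clock offset, in minutes.

    Uses the common three-term approximation; positive values mean
    the sundial runs ahead of the clock.
    """
    b = 2 * math.pi * (day_of_year - 81) / 365
    return 9.87 * math.sin(2 * b) - 7.53 * math.cos(b) - 1.5 * math.sin(b)

# Over a year the offset swings between roughly -14 and +16 minutes,
# matching the "as much as 16 minutes" figure.
offsets = [equation_of_time(n) for n in range(1, 366)]
print(max(offsets), min(offsets))
```

The peak near +16 minutes falls in early November and the trough near −14 minutes in mid-February, which is why the analemma is lopsided rather than a symmetric figure 8.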
As Earth moves toward winter solstice, you have “different things going on at the same time,” Dr. Pasachoff said.
Earth’s tilt means that every day during the fall, the angle at which we view the Sun changes. It appears farther south and travels a shorter arc across the sky, affecting sunrise and sunset equally, and making the day shorter.
The changes in solar time follow a different cycle. In the early 1600s, Kepler discovered that planets move fastest at the part of their orbit closest to the Sun, the perihelion. For Earth, perihelion comes a little after the winter solstice, so from November on, Earth is accelerating.
That increased speed means Earth must turn a little extra each day to bring the Sun back to its highest point, so solar noon slips later against clock time. That shift is amplified because the Sun is also traveling a little south each day, while clocks only count its east-to-west traverse.
Add it all together and you get sunrise and sunset times that are not symmetrical. In the weeks before the winter solstice, sunrise is being pushed later by both the changing angle of the Sun and the slowing of solar time. But sunset is being pushed in both directions — earlier by the Sun’s angle and later by the change in solar time.
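That asymmetry shows up even in a back-of-the-envelope model. A hedged sketch (not from the article): it combines the common three-term Equation of Time approximation with the standard day-length formula for New York's latitude, treating sunrise and sunset as solar noon minus and plus half the day length. The coefficients and simplified geometry are assumptions, so the dates it produces are only approximate.

```python
import math

LATITUDE = math.radians(40.7)  # New York City, assumed for illustration

def equation_of_time(n):
    # Sundial minus clock, in minutes (common three-term approximation).
    b = 2 * math.pi * (n - 81) / 365
    return 9.87 * math.sin(2 * b) - 7.53 * math.cos(b) - 1.5 * math.sin(b)

def half_day_minutes(n):
    # Half the day length, from solar declination and the sunrise hour angle.
    decl = math.radians(-23.44) * math.cos(2 * math.pi * (n + 10) / 365)
    hour_angle = math.acos(-math.tan(LATITUDE) * math.tan(decl))
    return hour_angle / math.pi * 12 * 60

def sunrise(n):
    # Clock minutes after midnight; solar noon is 12:00 minus the offset.
    return (12 * 60 - equation_of_time(n)) - half_day_minutes(n)

def sunset(n):
    return (12 * 60 - equation_of_time(n)) + half_day_minutes(n)

days = range(1, 366)
earliest_sunset = min(days, key=sunset)   # under this model, early December
latest_sunrise = max(days, key=sunrise)   # under this model, early January
print(earliest_sunset, latest_sunrise)
```

Even in this crude model, the shortest day falls at the solstice while the earliest sunset lands a week or two before it and the latest sunrise a week or two after it, exactly the lopsidedness described above.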
The result is more darkness in the morning and less in the afternoon. That’s why the earliest sunset of 2012 in New York, at 4:29 p.m., came as early as Nov. 30, according to the National Oceanic and Atmospheric Administration’s solar calculator, while mornings continued to stay dark later. After the solstice, Earth continued to accelerate until reaching perihelion on Jan. 2, so sunrise continued to slide, reaching its latest point, 7:20 a.m., on Dec. 28. There it stood until Jan. 11, when we finally got another minute of morning light. By Feb. 7, sunrise will be all the way back to 7 a.m.
“It’s hard to wrap the mind around this problem, which is really a figment of our timekeeping system,” Dr. Gangestad said. That is, we would never notice it if we all just used sundials.