Working with data and approaching data-based competitions
We are getting close to the end of my 11-week Data Science class at General Assembly. As last time, I had a whale of a time talking to people who are genuinely interested in data, analytics, science and models. Some of the projects this time have been Kaggle competitions. This brings some advantages, as the data is readily available, but other challenges arise: sometimes the data is masked or hashed, and there may be too much of it or too little information. It is effectively a game of whack-a-mole, right?
In any case, the fact that you can submit your predictions and be ranked among other competitors does raise the question of how (and more importantly why) you gain an extra basis point in your score. In some cases this may indeed be important, but my view is that since “all models are wrong”, the truly important thing is to ask how comfortable you are with the score obtained and whether your business or application is resilient to that kind of error (think airplane safety versus ice-cream flavour choices). This discussion reminded me of a recent episode of Talking Machines, a podcast about machine learning that I recommended to you readers some time ago.
In episode 13 of Talking Machines, Katherine Gorman and Ryan Adams interviewed Claudia Perlich, Chief Data Scientist at Dstillery. Claudia has won a number of competitions. She was trying to avoid talking about the subject, and I am glad the interviewers steered the conversation that way. Her secret to winning so many competitions, she says, is that she “finds something wrong with the data”. She explains that she likes getting intimately familiar with the data, and she often comes across something that should not be there and can thus be exploited.
She talked about a particular breast cancer modelling competition where they built the most predictive model “not because we understand medicine”, she explained, “but because we realised that the patient identifier, which was just a random number, was by far the most predictive feature”. The dataset had been compiled from different sources: some records came from screening centres, others from treatment centres. As such, she explains, “the base rate, i.e. the natural percentage of the population that was affected, was very different and you could back this out from the patient identifier”. Had the organisers been explicit about this, the modelling would have been approached differently. I particularly like that she highlights that these exploits matter in a competition environment but not in “real applications”.
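To make that kind of leak concrete, here is a minimal sketch of how a “random” identifier can end up predictive. Everything here is invented for illustration (the ID ranges, base rates and counts are not from the actual competition); the point is only that if IDs are assigned in blocks per data source, the ID silently encodes each source's base rate:

```python
import random

random.seed(0)

# Invented reconstruction of the leak: records compiled from two sources
# with very different base rates, where patient IDs happen to be assigned
# in different numeric ranges per source.
records = []
for pid in range(0, 1000):        # screening centre: low base rate
    records.append((pid, 1 if random.random() < 0.05 else 0))
for pid in range(1000, 2000):     # treatment centre: high base rate
    records.append((pid, 1 if random.random() < 0.60 else 0))

# A "model" that looks only at the supposedly random identifier.
def predict(pid):
    return 1 if pid >= 1000 else 0

accuracy = sum(predict(pid) == y for pid, y in records) / len(records)
print(f"accuracy from the ID alone: {accuracy:.2f}")  # well above chance
```

In a real project, a quick sanity check along these lines, fitting a throwaway model on the identifier alone and getting suspicious if it scores well, is a cheap way to catch this before it quietly inflates your leaderboard score.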
When asked about her approach to finding these exploits, she explains that she looks at the data “on the screen, like The Matrix, you have these things flashing down and what works very well for me is a certain expectation or intuition of what you should be seeing”. As an example, Claudia mentions that columns which should not be sorted but appear sorted in the dataset may be an indication of manipulation. Another example: a feature that should be numeric but in which certain values appear over and over again for no apparent reason typically means that someone replaced missing values with, for instance, the mean or the median. A practical tip she offers: if a nearest-neighbour model performs better than other algorithms, that is an indication of potential duplicates in the dataset!
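The mean/median-imputation tell is easy to check for yourself. A minimal sketch, with an entirely made-up column (the imputed value 42.0, the 50 missing rows and the 2% threshold are all invented for illustration): a continuous measurement should rarely repeat exactly, so flag any single value that accounts for an implausible share of the rows:

```python
import random
from collections import Counter

random.seed(1)

# Invented example: a continuous measurement where an upstream process
# silently replaced 50 missing values with the column mean (42.0).
column = [round(random.gauss(40, 10), 1) for _ in range(950)] + [42.0] * 50
random.shuffle(column)

# Exact repeats of a continuous value should be rare; flag any value
# that accounts for more than, say, 2% of the rows.
counts = Counter(column)
threshold = 0.02 * len(column)
spikes = [(value, count) for value, count in counts.items() if count > threshold]
print(spikes)  # the imputed 42.0 stands out
```

The same frequency table, glanced at as the data “flashes down the screen”, is essentially what Claudia describes doing by eye: you develop an expectation of what the distribution should look like and notice when one value breaks it.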
As I was explaining to one of the students in the course, a lot of the time it is not just about having tools and models at your disposal; experience with their use and outcomes is very important too. I was glad to hear Claudia echo those thoughts: “There is no grand theory behind it, no recommended toolset”, she says. After all, she has been quoted as saying that:
There is no clean or dirty data, just data you don’t understand
Like me, Claudia dislikes it when someone else cleans data on her behalf, as that can create more issues: in many cases assumptions about the data are made prior to its usage. That is not to say that you should never manipulate your data, but when you do it yourself you at least know what transformations you have applied and what assumptions you have made.
I highly recommend that you listen to the podcast; it is a very good and informative episode. You can do so here.