Data science continues to generate excitement and yet real-world results can often disappoint business stakeholders. How can we mitigate risk and ensure results match expectations? Working as a technical data scientist at the interface between R&D and commercial operations has given me an insight into the traps that lie in our path. I present a personal view on the most common failure modes of data science projects.
The long version, with slides and explanatory text, is below. The slides are also available as a single PDF here.
There is some discussion at Hacker News.
First, about me:
This talk is based on conversations I've had with many senior data scientists over the last few years. Many companies seem to go through a pattern of hiring a data science team only for the entire team to quit or be fired around 12 months later. Why is the failure rate so high?
Do a data audit before you begin. Check for missing data, or dirty data. For example, you might find that a database has different transactions stored in dollar and yen amounts, without indicating which was which. This actually happened.
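As a rough illustration of what a first-pass audit might look like, here is a minimal sketch in plain Python. The `transactions` rows, the column names and the `missing_fraction` helper are all invented for this example; a real audit would run against your own tables and would check far more than missing values.

```python
def missing_fraction(rows, columns):
    """Return, for each column, the fraction of rows where it is missing or blank."""
    total = len(rows)
    result = {}
    for col in columns:
        missing = sum(
            1 for row in rows
            if row.get(col) is None or str(row.get(col)).strip() == ""
        )
        result[col] = missing / total if total else 0.0
    return result


# Hypothetical sample: amounts recorded without a reliable currency label.
transactions = [
    {"id": 1, "amount": 120.0, "currency": "USD"},
    {"id": 2, "amount": 13500.0, "currency": None},
    {"id": 3, "amount": None, "currency": "JPY"},
    {"id": 4, "amount": 99.5, "currency": ""},
]

report = missing_fraction(transactions, ["id", "amount", "currency"])
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Even a check this crude would show that half of the currency labels in the sample are absent, which is the kind of thing you want to know before any modelling starts.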
Don't torture your data scientists by withholding access to the data and tools they need to do their job. This senior data scientist took six weeks to get permission to install Python and R. She was so happy!
Well, she was until she sent me this shortly afterwards:
Now, allow me to introduce this guy:
He was a product manager at an online auction site that you may have heard of. His story was about an A/B test of a new prototype algorithm for the main product search engine. The test was successful and the new algorithm was moved to production.
Unfortunately, after much time and expense, it was realised that there was a bug in the A/B test code: the prototype had not been used. They had accidentally tested the old algorithm against itself. The results were nonsense.
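One cheap guard against this kind of bug is to confirm, before the experiment starts, that the two arms really do behave differently on a sample of inputs. The sketch below is only an illustration: `old_ranker`, `new_ranker` and `queries` are toy stand-ins I have made up, not the auction site's code, and `arms_differ` is a hypothetical helper.

```python
def arms_differ(control, treatment, sample_inputs):
    """Return the fraction of sample inputs on which the two arms disagree."""
    if not sample_inputs:
        raise ValueError("need at least one sample input")
    differing = sum(1 for x in sample_inputs if control(x) != treatment(x))
    return differing / len(sample_inputs)


# Toy stand-ins for the production and prototype search algorithms.
def old_ranker(query):
    return sorted(query.split())


def new_ranker(query):
    return sorted(query.split(), key=len)


queries = ["red vintage bicycle", "lamp", "antique oak table"]

# A correctly wired test should show some disagreement between the arms.
wired_correctly = arms_differ(old_ranker, new_ranker, queries)

# If the prototype was never hooked up, the "test" compares old against old.
wired_wrongly = arms_differ(old_ranker, old_ranker, queries)

print(f"old vs new: {wired_correctly:.0%} of queries differ")
print(f"old vs old: {wired_wrongly:.0%} of queries differ")
```

A disagreement rate of zero does not prove the wiring is broken, but it is a strong hint that the treatment arm deserves a second look before anyone reads the results.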
This was the problem:
Oh, by the way:
Likewise, the opposite is very often true:
The neural network they developed had a very high accuracy but, strangely, it always decided to send asthma sufferers home. Weird, since asthmatics are actually at high risk of complications from pneumonia.
It turned out that asthmatics who present with pneumonia are always admitted to Intensive Care. Because of this, there were no cases of any asthmatics dying in the training data. The model concluded that asthmatics were low risk, when the opposite was actually true. The model had great accuracy but if deployed in production it would certainly have killed people.
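A simple tabulation of outcome rates per subgroup can surface this sort of artefact before a model is trained. The sketch below uses a handful of invented records (not the study's data) and two hypothetical helpers, `outcome_rates` and `suspicious_groups`; it is not the method the original researchers used. A group in which the bad outcome never occurs is not necessarily low risk; it may simply be a group that always received a protective intervention.

```python
from collections import defaultdict


def outcome_rates(records, group_key, outcome_key):
    """Return {group: (n_records, outcome_rate)} for each value of group_key."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_records, n_outcomes]
    for record in records:
        tally = counts[record[group_key]]
        tally[0] += 1
        if record[outcome_key]:
            tally[1] += 1
    return {group: (n, k / n) for group, (n, k) in counts.items()}


def suspicious_groups(rates, min_n=1):
    """Flag groups whose outcome rate is exactly 0 or 1: worth a manual look."""
    return [
        group for group, (n, rate) in rates.items()
        if n >= min_n and rate in (0.0, 1.0)
    ]


# Invented toy records, for illustration only.
patients = [
    {"asthma": True, "died": False},
    {"asthma": True, "died": False},
    {"asthma": True, "died": False},
    {"asthma": False, "died": False},
    {"asthma": False, "died": True},
    {"asthma": False, "died": False},
    {"asthma": False, "died": False},
]

rates = outcome_rates(patients, "asthma", "died")
flagged = suspicious_groups(rates, min_n=3)
print(rates)
print("Ask a domain expert about:", flagged)
```

The flag is only a prompt for a conversation with someone who knows how the data was generated; the tabulation cannot tell you why a group looks safe.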
The real data will have weird outliers, or be boring. It will be too dynamic. It will be either too predictable or not predictable enough. Use live data from the beginning or your project will end in misery and self-hatred. Just like this poor leopard, weasel thing.
See here for more.
Please get in touch with your own data disaster. Let's get some data behind this and move Data Science forwards. All results will be fully anonymised. Use the form below or contact me via:
Add Your Data to the Survey
I'll email you with the results of the survey when it's completed.