Developing a high performance machine learning algorithm

[Image: text boxes drawn around some words (AI-generated nonsense words, in fact).]

[The following is my mini-chapter contribution to High Performance Python, 2nd Edition by Ian Ozsvald and Micha Gorelick (O'Reilly)]

We had overpromised to our customers and our machine learning models were making a stream of errors in production. Our banking customers depend on our software to accurately extract data from financial documents, but our claim to ‘human-level accuracy’ was looking extremely shaky. I dreaded receiving another email from a customer, complaining that our accuracy was inadequate.

Like good data scientists, we had performed error analysis, finding that most of our errors were a result of failing to locate text accurately. ‘Text detection’ was a crucial early step in our pipeline, but the existing open-source and commercial solutions were all deeply inadequate. We needed far higher accuracy than existing methods could deliver.

I initiated an R&D project to build a new text detector based on the then-cutting-edge convolutional neural network (CNN) object detection models. I assembled a great team of machine learning researchers - and a bunch of GPUs - and we set to work.

Our problem seemed simple - the aim of a text detector is to find the coordinates of all of the words in an image, represented as rectangles called ‘text boxes’. We were fortunate in that we could easily create synthetic documents with text at known coordinates - we wouldn’t need to use expensively hand-annotated data to train the models.

Unfortunately, far from being simple, the research project turned into a mess of complexity. At every turn, our technical approach seemed to be the wrong one. More than once, team members told me that what we were trying to do was impossible. But the journey was rich with hard-won lessons, which I want to share with you here.

Eyeballing the data

First, we didn’t spend enough time eyeballing the data at the outset. Training neural network models is a slow process, and far too many times we found out after a week or more of training that the resulting model was defective. In nearly all cases it was because of minor errors in the data generation process that we could have discovered if we’d spent an hour or two just looking at the training data before starting the training phase.

For example, we eventually realised that our synthetic text generation script was removing lines of text if they extended beyond the edges of the page. Unfortunately, the script wasn’t removing the ground truth annotation at the same time - so our ground truth didn’t match the training images! Careful examination of the training data would have revealed this immediately.
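A check along these lines might have caught the mismatch early. This is only a minimal sketch, not our actual tooling: the `Box` format, the `boxes_outside_page` helper and the page dimensions are all hypothetical, and it assumes annotations are stored as pixel rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box in pixel coordinates (hypothetical format)."""
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_outside_page(boxes, page_width, page_height):
    """Return the boxes that are not fully inside the page."""
    return [
        b for b in boxes
        if b.x0 < 0 or b.y0 < 0 or b.x1 > page_width or b.y1 > page_height
    ]


# Made-up annotations for a 200 x 300 pixel page.
annotations = [Box(10, 10, 90, 30), Box(150, 40, 260, 60)]
suspicious = boxes_outside_page(annotations, page_width=200, page_height=300)
print(suspicious)
```

Any box returned here describes text that cannot be fully visible in the image, which is a good reason to open that sample and look at it before training.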

Bespoke analysis tools

Which brings me to the second lesson: the importance of creating bespoke analysis tools. Our research accelerated once we had designed interactive visualisation tools to navigate the data sets and perform error analysis. The tools let our scientists quickly discover the weaknesses of each model, so we could iterate rapidly, fixing one issue at a time. If we had created the right tools at the outset, we would have avoided wasting months of time.

For example, we devised a heatmap visualisation with the height and width of text boxes on the X and Y axes, which used the intensity of colour to represent the accuracy in each size bin. Using the heatmap, we discovered that our model - although strong on benchmarks - had a severe defect: it was poor at detecting long words. Long words are rare, so they don’t affect benchmark performance much, but we could hardly tell our customers that we just can’t extract long words from their documents!
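The binning behind such a heatmap can be sketched in a few lines. This is an illustration under assumptions rather than our real tool: the `accuracy_by_size` helper, the 50-pixel bin size and the sample data are all made up.

```python
from collections import defaultdict


def accuracy_by_size(samples, bin_size=50):
    """Group (width, height, detected) samples into size bins.

    Returns {(width_bin, height_bin): fraction detected}, where each bin
    index is the box dimension floor-divided by bin_size (in pixels).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for width, height, detected in samples:
        key = (int(width // bin_size), int(height // bin_size))
        totals[key] += 1
        hits[key] += int(detected)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up samples: (box width, box height, was the box detected?)
samples = [
    (40, 12, True), (45, 12, True), (30, 10, True), (35, 14, False),
    (310, 12, False), (320, 14, False), (330, 12, True),
]
grid = accuracy_by_size(samples)
for (w_bin, h_bin), acc in sorted(grid.items()):
    print(f"width bin {w_bin}, height bin {h_bin}: {acc:.2f}")
```

In this toy data the wide boxes land in their own bin with a much lower detection rate, which is the kind of pattern an overall benchmark score hides. The resulting grid could then be drawn as a colour-coded image with whatever plotting library is to hand.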

We solved this problem by changing the shape of the convolutional kernel (https://en.wikipedia.org/wiki/Convolutional_neural_network) from the more standard square to a rectangle. Since words are normally wider than they are tall, setting hyper-parameters such as kernel shape and size correctly was critical to gaining very high performance.
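To see why kernel shape matters, it helps to compute how much of the image a single output unit can "see" (its receptive field). The sketch below uses the standard receptive-field calculation for stacked convolutions; the 3 x 3 and 3 x 7 kernel sizes and the four-layer depth are illustrative assumptions, not the architecture we used.

```python
def receptive_field(layers):
    """Receptive field (height, width) of a stack of conv layers.

    Each layer is ((kernel_h, kernel_w), (stride_h, stride_w)).
    """
    rf_h = rf_w = 1
    jump_h = jump_w = 1  # distance in input pixels between adjacent outputs
    for (k_h, k_w), (s_h, s_w) in layers:
        rf_h += (k_h - 1) * jump_h
        rf_w += (k_w - 1) * jump_w
        jump_h *= s_h
        jump_w *= s_w
    return rf_h, rf_w


# Four stride-1 layers with square versus wide kernels (assumed sizes).
square = [((3, 3), (1, 1))] * 4
wide = [((3, 7), (1, 1))] * 4
print(receptive_field(square))  # (9, 9)
print(receptive_field(wide))    # (9, 25)
```

With the same depth, the wide kernels cover almost three times as much horizontal context, which is what a detector needs in order to take in a long word in one go.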

Another very useful tool was an interactive 2D scatter plot of all of the errors made by the model, using various dimensions on the axes (e.g. confidence score, spatial position etc). Each datapoint on the scatter plot represented a single error made by the model. By hovering over a data point, we could instantly see a crop of the text that created the error - allowing us to quickly formulate a hypothesis for what might be causing the issue.

Using the appropriate evaluation metrics

Third, we learnt to be very careful about evaluation metrics. We realised quite late in the project that the academic task of object detection was very different from our real-world task of text detection, and that the standard metrics were unsuitable - even downright misleading.

The academic metrics were designed to check that a model can find the rough coordinates of a cat in an image, not the exact coordinates of a word in a document page. Unfortunately, data extraction from documents requires extremely precise coordinates. Missing a digit from a financial document because your text box was in slightly the wrong place can have serious repercussions!

We had to create an entirely new metric for our use case - a non-trivial undertaking. If you need extremely high performance, then you must optimise for a metric that matches your real-world use case very closely.
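The gap between the two kinds of metric is easy to demonstrate. The sketch below compares intersection over union (IoU), the overlap score commonly used in object detection benchmarks, with a stricter containment check. The `covers` function, its two-pixel tolerance and the box coordinates are hypothetical, and this is not the metric we ended up creating.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def covers(pred, truth, tolerance=2.0):
    """True if pred contains truth, allowing `tolerance` pixels of slack."""
    return (
        pred[0] <= truth[0] + tolerance
        and pred[1] <= truth[1] + tolerance
        and pred[2] >= truth[2] - tolerance
        and pred[3] >= truth[3] - tolerance
    )


truth = (0, 0, 100, 10)  # a ten-digit number, roughly 10 px per digit
pred = (10, 0, 100, 10)  # a prediction that cuts off the first digit

score = iou(pred, truth)
print(score)
print(covers(pred, truth))
```

The prediction scores an IoU of about 0.9, which clears the 0.5 threshold often used to count a detection as correct, yet it fails the containment check because the first digit is lost.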

Classical scientific methods: hypothesis-driven research

But the most important lesson was this: we succeeded because we used hypothesis-driven research methods at every step. For every hurdle we encountered, we formulated a list of hypotheses and devised experiments to test each hypothesis in turn. Structuring the research in this way allowed us to make consistent progress over the 18 months we’d allocated for the research.

It was a complex project and we had to overcome many technical challenges in order to gain the exceptional performance we were aiming for. Classical scientific methods helped us to rein in the complexity. Careful experimental design and basic research skills like good note-taking were critical.

Our team combined our internal machine learning scientists with members of a university research team, so effective scientific communication was also critical - the importance of regular research meetings and oral presentations should not be overlooked.

In the end, after much misery and frustration, we overcame the obstacles and produced a neural network model that performed better than anything else available. Our products started working properly and, most importantly, our customers stopped contacting me with complaints about poor performance. The project was a success.