The latest episode in the Government’s testing debacle is an IT glitch blamed for thousands of coronavirus cases being left out of last week’s daily briefings. The data blunder meant almost 16,000 cases were missing from the official numbers between late September and early October. The glitch reportedly occurred while data was being transferred from one system to another, and the Government has blamed IT and human error.
The reputational fallout from issues like this can be significant, particularly in a situation as sensitive as this one. The Government could do without negative headlines that could have been avoided. Systems do not break themselves – there are several ways this could have happened, and they are worth bearing in mind when dealing with new systems:
- Bugs: there was a bug in the system that had either not been identified or not been prioritised for a fix, and the data triggered it. Platforms are often tested to expected capacity and then assumed to work with any volume of data – but databases and feeds slow down and break if they cannot scale to cope with significant surges. A data volume limit may stay hidden until volumes balloon, and it can only be fixed once it has been identified.
- Lack of testing: another possible reason is that an update was made to the system without sufficient testing before the change went live, resulting in an error that took a few days to spot. For example, there could have been a break in the feed to the contact tracing part of the system which went unnoticed while backlogs were being worked through. Eventually someone would have noticed that the area storing new contacts to trace was not being added to, and the bug would then have been found and corrected.
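Both scenarios share a common safeguard: reconcile what arrived against what was sent every time data moves between systems. The sketch below is illustrative only. The `MAX_ROWS` cap, the `lossy_load` destination and the `id` field are all hypothetical stand-ins, not a description of the system involved.

```python
MAX_ROWS = 1000  # hypothetical hidden capacity limit of the destination


class TransferError(Exception):
    """Raised when the destination does not hold every source record."""


def lossy_load(records: list[dict]) -> list[dict]:
    """Simulate a destination that silently drops rows beyond MAX_ROWS."""
    return records[:MAX_ROWS]


def reconcile(source: list[dict], loaded: list[dict]) -> None:
    """Fail loudly if any source record is missing after the transfer."""
    source_ids = {record["id"] for record in source}
    loaded_ids = {record["id"] for record in loaded}
    missing = source_ids - loaded_ids
    if missing:
        raise TransferError(
            f"{len(missing)} of {len(source_ids)} records missing after transfer"
        )


def transfer(records: list[dict]) -> list[dict]:
    """Load the records, then check that nothing was lost on the way."""
    loaded = lossy_load(records)
    reconcile(records, loaded)
    return loaded
```

With a check like this, a transfer that exceeds the hidden limit raises an error on the day it happens, rather than being discovered days later when someone notices the numbers look low.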
With new platforms there are always bugs, and fixing them should be prioritised based on potential impact. A ‘Criticality One’ bug should stop a go-live, as it is fundamental and will cause the system to fall over. A ‘Criticality Two’ bug might be allowed into the live environment, provided there are not so many of them that together they compromise the system. Volume testing a platform is also important, to ensure it can cope with unexpected data surges.
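As a rough illustration, a go-live rule of this kind can be written down as a simple gate. This is a minimal sketch: the function name and the tolerance of three ‘Criticality Two’ bugs are assumptions made for the example, and a real threshold would be a judgment call for each project.

```python
from collections import Counter

MAX_CRITICALITY_TWO = 3  # hypothetical tolerance, set per project


def go_live_decision(
    open_bugs: list[int], max_crit_two: int = MAX_CRITICALITY_TWO
) -> tuple[bool, str]:
    """Decide whether to go live, given the criticality of each open bug."""
    counts = Counter(open_bugs)
    if counts[1] > 0:
        # Any Criticality One bug is fundamental and blocks the release.
        return False, f"{counts[1]} Criticality One bug(s) open"
    if counts[2] > max_crit_two:
        # Individually tolerable bugs can still compromise the system together.
        return False, f"{counts[2]} Criticality Two bugs exceed the limit of {max_crit_two}"
    return True, "no blocking bugs"
```

For example, `go_live_decision([2, 2, 3])` allows the release, while `go_live_decision([1, 2])` blocks it because of the single Criticality One bug.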
Detecting glitches and testing is key
Once live, systems are always subject to change: improvements are continually being made and fixes applied. Good practice ensures that before a fix goes live, it is fully tested so that no unexpected consequences occur. If a system is being rushed through and too little time is given to testing, or if the testing environment doesn’t resemble the ‘live’ environment, then a fix could introduce a new error.
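One inexpensive way to check that a testing environment resembles live is to compare the settings of the two before a fix is signed off. The sketch below assumes each environment’s configuration can be exported as a flat dictionary. The setting names in the usage example are invented for illustration.

```python
def environment_drift(test_env: dict, live_env: dict) -> dict:
    """Return every setting whose value differs between test and live.

    Each result maps a setting name to a (test_value, live_value) pair.
    A setting missing from one side is reported with None on that side.
    """
    keys = test_env.keys() | live_env.keys()
    return {
        key: (test_env.get(key), live_env.get(key))
        for key in sorted(keys)
        if test_env.get(key) != live_env.get(key)
    }


# Hypothetical settings: the test system has a far smaller row limit than live.
drift = environment_drift(
    {"db_version": "12", "max_rows": 1000},
    {"db_version": "12", "max_rows": 1000000, "feed_enabled": True},
)
```

Any entry in `drift` is a difference someone should be able to explain before results from the test environment are trusted.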
Detecting glitches and testing is key, but it is a balance between time, rigour and cost versus benefit – all tricky to evaluate in an intense environment. This situation is a pressure cooker for the Government – and they can’t afford to make any more mistakes.
If you’d like to find out more about how we can help you, please call me on +44 (0) 7813 900 337.