Oops, Did I Send that to Everyone? Validation Lessons from One Bad Morning.

The Setup

This morning, I accidentally sent an invoice from a colleague to the wrong email address. Not just any email address, but to a professional association mailing list. The entire mailing list, essentially making this private document public.

So I asked for everyone’s cooperation in deleting the message immediately, and then begged the colleague’s forgiveness. There was little else I could do to correct the error.

Besides being very embarrassing and a breach of my colleague’s privacy, this also might be a great lesson on a few aspects of Computerized Systems Validation!

So without further ado, let’s squeeze the proverbial lemon…

[Image: a flattened lemon on pavement. Caption: There’s more than one way to squeeze a lemon…]

Our Robot Overlords

This morning’s story involves three different software systems:

Rico the Receipt Bot: I subscribe to a receipt processing service. Conveniently, any receipts, invoices or bills that I receive by email can just be forwarded to Rico, who then queues the receipts for further processing en route to my accounting system.
Emma the Email Client: When forwarding an email, Emma (like most email clients) conveniently suggests recipients based on the first few characters being typed. Even better, it chooses the best candidates from a large contact list based on various factors such as where the match occurs, contact frequency and time since last contact. Unless I select something different, the current top match is auto-completed in the ‘To:’ field.
Lisa the ListServer Bot: The professional association’s mailing list is run by software that conveniently makes sending messages to the whole mailing list as easy as sending the message to one address. Lisa even appends instructions to each email, and probably has some other features that we don’t see, like scanning our emails to minimize spam or viruses while ensuring our legitimate messages get through.
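To make Emma's behaviour concrete, here is a minimal sketch of how a suggestion ranking of this kind might work. I don't know Emma's real algorithm: the scoring weights, the Contact fields and both addresses below are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    address: str
    times_contacted: int     # how often I have written to this address
    days_since_contact: int  # days since my last message to it


def score(contact, typed):
    """Hypothetical ranking: match position, then frequency and recency."""
    position = contact.address.lower().find(typed.lower())
    if position < 0:
        return None  # no match, so this contact is never suggested
    prefix_bonus = 10.0 if position == 0 else 0.0
    recency = 5.0 / (1 + contact.days_since_contact)
    return prefix_bonus + contact.times_contacted + recency


def top_suggestion(contacts, typed):
    """Return the address that would be auto-completed, or None."""
    scored = [(score(c, typed), c) for c in contacts]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1].address


# Made-up address book: the receipt bot and the mailing list.
contacts = [
    Contact("receipts@rico.example", times_contacted=40, days_since_contact=0),
    Contact("reply@list.example", times_contacted=3, days_since_contact=30),
]

print(top_suggestion(contacts, "rec"))  # the keystrokes I meant to type
print(top_suggestion(contacts, "rep"))  # one wrong key
```

In this sketch, one wrong key is enough to drop the intended contact from the candidates, and the function still returns a perfectly valid address.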

Everything is a Go!

On the surface it makes sense to look at these three systems separately – after all, they are from different vendors, seem to be doing loosely related tasks, and even sit in different geographic locations.

We would naturally want to confirm that Lisa, Emma and Rico work correctly and as expected. Their providers have certainly tested each of them to a degree I could never duplicate as an end user, and so conceivably, having vetted each software/service before using them, I could trust (but verify) their work.

And of course a few simple tests and a review of historic interactions confirm they work as expected: Rico receives the receipts I send and consistently pulls the correct information out of them; Emma lists valid email addresses from the contact list in order of preference, auto-completing the ‘To:’ field with the selected email; and Lisa quickly and happily forwards any emails received to people on the list, whether they want them or not.

All is well, everything is working correctly, our three systems are validated – sign the forms and archive! Until…

Book-Keeping Day

It’s bookkeeping day in my office, so I was forwarding several receipts in a row to the receipt service. I know from operational experience that Rico will let me know if I send something that doesn’t look like a receipt, or is a duplicate. There were no such errors, so Rico was doing its job correctly.

I thought everything was fine and carried on with my task. A few minutes later, I got a copy of the last receipt I sent in my mailbox, apparently addressed from me. I was confused, thinking it might have bounced back from Rico. It took a few minutes to realise it was from the mailing list! What did I do?

I went to the audit trail (the Sent Items folder) to see what happened – sure enough, I somehow sent the receipt to the entire mailing list. How? I’m not sure – there are several things that could have happened here, but my audit trail doesn’t give enough detail to figure out which. I could have struck the wrong key when typing in the email address; or perhaps my recent emailing history changed the order of suggestions?

Either way, Emma guarantees that a legitimate email address goes into the ‘to’ field, whether I’ve made a typo or not (a feature!). In this case the auto-completed ‘to’ address was the reply-address for a mailing list, not my intention at all. Probably the worst candidate in my address book for sheer number of recipients. But Emma was doing its job correctly.

And of course, upon receiving the email from a legitimate member of the mailing list, Lisa dutifully forwarded the email along with its attachment, to everyone on the list. Lisa was doing its job correctly too.

PEBKAC and Murphy’s law

As often happens, the error seems to have started at the interface with Humans. Otherwise, each part of the system was working correctly, as expected, to produce a highly undesirable outcome.

This brings up an important point: when complex systems made up of otherwise reliable components fail, it is unlikely that the failure was caused by one of those components failing to operate as expected. This may seem like an obvious statement – I’m basically saying that a component proven reliable will probably be reliable!

However, in practice we tend to put most of our validation effort into proving that a system works as expected under normal operation.

Since we’ve established the reliability of its components, it’s far more likely that such a system will either fail completely and obviously (e.g. due to complete hardware failure), or fail in some subtle way at the boundaries between components. In the worst cases the moving parts of the system end up working together to intensify the effects of this initial, small failure.

The Boundary Law

In complex systems, almost everything interesting happens at the boundaries between components. Take the ocean, for example. All the action happens at the shore; at the surface; at the coral reef; or on the ocean floor. In comparison, the vast volume of empty ocean has little going on save a few creatures moving from one boundary to the next.

In a modern multi-component computerized system, the boundaries between components are where we have to make assumptions about the validity of inputs and outputs. They also tend to be where the humans interact with the system, for example to operate, maintain, or just observe things.

In today’s email scenario, the error happened at that user interface, and was exacerbated by the convenient features of the program. The input of just two erroneous characters, subtle and undetected, created a major problem.

But Ms. Auditor, Brendan once said…

Now please don’t misunderstand! I’m not suggesting that we skip qualifying individual components and just black-box everything. Nor am I saying we shouldn’t test for normal operation under expected conditions. These are fundamental and necessary parts of any validation process.

What I am saying is that we can’t stop there. Stopping there is especially tempting when we’re purchasing components off the shelf with qualification certificates, software verification routines, and other assurances from the manufacturer.

We need to validate the combined, overall system, scoped for each intended use as installed. We need to identify risks and their consequences, and ensure coverage for the failure modes and the extreme or unexpected inputs that would lead to the least desirable consequences. Finally, the scope of validation needs to cover the human interaction component, including training, the UI, and any SOPs that interface with the system.
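One simple way to decide where that coverage matters most is to list the ‘what if’ scenarios and rank them. The sketch below uses a plain severity times likelihood score, which is just one common choice and not a prescribed method. The scenarios and their 1 to 5 scores are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    what_if: str
    severity: int    # 1 (nuisance) to 5 (worst consequence), assumed scale
    likelihood: int  # 1 (rare) to 5 (frequent), assumed scale

    @property
    def priority(self):
        # Simple risk score: higher means test this first.
        return self.severity * self.likelihood


def rank(scenarios):
    """Sort scenarios from highest to lowest priority."""
    return sorted(scenarios, key=lambda s: s.priority, reverse=True)


# Illustrative entries only; the scores are guesses for the example.
scenarios = [
    Scenario("Rico rejects a duplicate receipt", severity=1, likelihood=3),
    Scenario("Receipt auto-completes to the mailing list", severity=5, likelihood=2),
    Scenario("Rico misreads an amount on a receipt", severity=3, likelihood=2),
]

for s in rank(scenarios):
    print(s.priority, s.what_if)
```

Even with rough scores, a list like this pushes attention towards the rare but severe cases that normal-operation testing tends to miss.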

This may sound like a lot of work, and for some systems it would be. Here’s the thing though – in regulated studies, the failures we most want to avoid are ones where either we have a catastrophic loss of data, or else the system is quietly producing erroneous data. We’ve become very good at worrying about the former, since it affects the bottom line. But from a Human Health perspective the latter is arguably much more important.

Better than a Stick in the Eye

The automation around us that we take for granted can be so convenient that we often forget (or don’t realize) how many different systems are acting together for each action we take. We also tend to overstate the reliability of a system when it seems to be made up of multiple, individually reliable components.

In my case, a few good ‘what if’ questions around my bookkeeping system could have made it obvious that every time I emailed a receipt out, there was a risk that sensitive information might go to the wrong recipient.

So what now? I really don’t want this mistake to happen again, so it’s worthwhile doing something about it. Obviously adding a QC step to my bookkeeping SOP would help. But that adds some burden: When I’m forwarding multiple emails in a row to a robot, might I be tempted to skip my QC here and there?

Can I turn off auto-complete on bookkeeping days? Or maybe some additional automation to ensure these emails can ONLY go to Rico (a ‘send to Rico’ button)? Or perhaps email is just a bad idea for transferring private information? Hmmm…
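The ‘send to Rico’ idea could look something like the sketch below: the recipient is fixed in code instead of typed, and a check refuses any message addressed to anyone else. This is only a sketch under assumptions. The address, the allowlist and the function names are hypothetical, and nothing is actually sent here.

```python
from email.message import EmailMessage
from email.utils import getaddresses

RICO_ADDRESS = "receipts@rico.example"  # hypothetical address for the bot
ALLOWED_RECIPIENTS = {RICO_ADDRESS}


class RecipientNotAllowed(Exception):
    """Raised when a message is addressed to anyone outside the allowlist."""


def check_recipients(message):
    """Refuse a message unless every To/Cc/Bcc address is allowlisted."""
    headers = (
        message.get_all("To", [])
        + message.get_all("Cc", [])
        + message.get_all("Bcc", [])
    )
    addresses = [addr.lower() for _name, addr in getaddresses(headers) if addr]
    if not addresses:
        raise RecipientNotAllowed("message has no recipients")
    for address in addresses:
        if address not in ALLOWED_RECIPIENTS:
            raise RecipientNotAllowed(f"refusing to send to {address}")


def build_receipt_forward(subject, body):
    """Build a message whose only recipient is the receipt bot."""
    message = EmailMessage()
    message["To"] = RICO_ADDRESS  # fixed here, never typed or auto-completed
    message["Subject"] = subject
    message.set_content(body)
    check_recipients(message)  # guard at the boundary, before any sending
    return message


receipt = build_receipt_forward("Invoice", "Forwarded receipt attached.")
print(receipt["To"])
```

The check sits at the boundary between me and the mail system, so it holds even on the days when I would be tempted to skip a manual QC step.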

Now, isn’t this so much more fun than bookkeeping?