Ben Torkington

NZ Covid Locations bot post mortem

My wrap-up post on what I learned when making a Covid-19 Location of Interest bot on Twitter

On Tuesday 17th August 2021, the Delta variant of Covid-19 was discovered in Aotearoa New Zealand, and the country moved once again into its most stringent lockdown protocol, Alert Level 4. By the time it was discovered, the virus had been spreading in the community for almost two weeks, and as more and more positive cases were identified, so too were locations they had visited while potentially infectious.

All eyes were on the Ministry of Health's (MoH) official website where the locations were listed. As the number of locations grew, so did anxiety, and one frustration in particular: there was no easy way to identify which locations were newly discovered, and people had to scan an ever-growing list manually.

The solution #

The kind of idea that's just perfect. Delightfully simple, needing nothing added or taken away. It looked like something I could whip up without much trouble, and a few hours later:

Every few minutes, the bot looked at the official MoH list of locations, isolated any new ones, and posted a tweet with the name, address and approximate exposure time.

It was instantly popular, gaining 1000 followers by noon and eventually growing to 3700. It felt good to make something that was so simple and not only useful, but quite importantly so. It would have been great if the story ended there. It didn't.

How it works #

The bot was made up of three main parts:

There's nothing particularly sophisticated going on here; this is all digital plumbing in its simplest form. It'd be an approachable project for a junior developer. I was just… there.
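To make the plumbing concrete, here is a minimal sketch of one polling cycle: take the current list, isolate anything not yet posted, and post it. This is not the bot's actual code. The `Location` type, the function names and the `post` callback are all assumptions made for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """One row from the list (hypothetical shape). Fields are kept as displayed text."""
    name: str
    address: str
    exposure: str


def isolate_new(current, seen):
    """Return the locations in the current list that have not been posted yet."""
    return [loc for loc in current if loc not in seen]


def format_tweet(loc):
    """Build the tweet text from the three fields, passed through unmodified."""
    return f"{loc.name}\n{loc.address}\n{loc.exposure}"


def run_once(current, seen, post):
    """One polling cycle: post each new location, then remember it."""
    for loc in isolate_new(current, seen):
        post(format_tweet(loc))
        seen.add(loc)
```

In a real bot, `post` would call the Twitter API and `seen` would be persisted between runs. Here both are left to the caller.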

HTML Scraping #

Ideally, when you access data with an automated process, you want to use something called an Application Programming Interface, or API. When data is presented for human consumption, it's usually on a web page with formatted tables, styled text, branding images and colour schemes. It's what makes your favourite websites look familiar, and helps visually distinguish them from other websites.

Computers don't benefit from any of that. Font styles, nicely formatted dates, the use of commas to separate thousands in large numbers, these are all comforts for human creatures, but noisy distractions for an automated system, and so are not present when accessing data via an API.

Unfortunately, the MoH didn't have an API for Covid exposure locations, and the only way to access it was with HTML Scraping.

In short, HTML Scraping involves writing a program which understands how information is formatted on a web page, and turns it into the kind of data we'd hope to be able to get from an API. It's fallible, because designs of web pages often change, and aren't strictly specified in the first place.
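As a rough illustration of what "understanding how information is formatted on a web page" means in practice, the sketch below pulls the text out of table cells using Python's standard `html.parser` module. It assumes the data sits in plain `<tr>`/`<td>` rows, which is a guess for the sake of the example and not a description of the MoH page. The class and function names are made up.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with no <td> cells (e.g. header rows) are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Every assumption baked into this parser (which tags are used, where the headings live) is a place where a redesign of the page can silently break it.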

For example, say there is a count of the current number of Covid infections on the web page. Let's say right now it's displaying 950 cases. What's going to happen when that number reaches a thousand? Will it be displayed as "1000", or "1,000"? Short of asking the people who make the website, we can't know. Extrapolate this out to every data point you want to acquire, and you've got too many questions to ask in an email.
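One hedge against the "1000" versus "1,000" question is to accept the formats you can anticipate and fail loudly on anything else. A minimal sketch, with a made-up function name:

```python
import re


def parse_count(text):
    """Parse a displayed whole number such as "950", "1000", "1,000" or "1 000"."""
    # Drop the separators a human-facing page might add: commas, spaces,
    # and non-breaking spaces.
    cleaned = text.strip().replace(",", "").replace("\u00a0", "").replace(" ", "")
    if not re.fullmatch(r"[0-9]+", cleaned):
        # Anything unexpected is an error, not a guess.
        raise ValueError(f"unrecognised count format: {text!r}")
    return int(cleaned)
```

This only covers the formats someone thought of in advance. A page that switches to "1k" or "one thousand" still breaks it, which is the point of the paragraph above.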

It's not that HTML Scraping is difficult to get right (I found it very easy), it's that it's impossible to know that it'll continue to be right, as the way the data is presented changes over time. The MoH did change the presentation, as they are at liberty to do. The last thing a web author needs is to be beholden to some third party who's using their content as input to an automated system.

In summary, HTML Scraping is only the right way when there is literally no other way. It's all I had at the time, so that's how I did it.

The first breakdown #

On the very first day of operation, the bot broke. Table headings identifying the separate regions of locations - at the time Auckland and Coromandel - were moved to somewhere I wasn't looking for them. Two things were possible: either the bot would start printing the word undefined in place of the location, or it would simply fail. My bot did the latter.

The fix was easy, but as I was already fearing at that point, temporary.

Clean Data #

Another issue is the ultimate source of the data. While I did have some help from MoH staff, I was only able to surmise the exact workflow taking place inside the organisation. It seems that the data was being entered into the website not out of any automated process, but manually, by humans under intense pressure. These people make mistakes.

The mistakes can manifest in any way. An address can have a typo, or simply be outright wrong. The timing of a potential exposure could be entered the wrong way around, and appear to finish before it had started.

As far as the bot was concerned, this was truly unavoidable. Garbage in, garbage out, as the saying goes. The bot simply took the textual information from the website, and posted it verbatim on Twitter, without trying to interpret or verify it in any way. Normally, you want to be assured of Clean Data in any automated system, but again, this is all we had.

I particularly didn't want to verify or clean any data myself, either manually or automatically. Doing so would create something called a Second Source of Truth, and in the context of Covid exposure locations, which are arguably a matter of life or death, would be irresponsible in my opinion. I am not privy to the actual correct information possessed by the MoH, and trying to correct the data myself would be second-guessing the health department in the middle of a public health crisis.

Readers of the bot were always encouraged to verify all information concerning them via the official channels. Repeated queries from Twitter users about the correctness of each piece of information could not be answered.

The lack of an obvious Primary Key #

A Primary Key is a means of uniquely identifying a thing. A system that stores names of people, for example, will inevitably end up storing two or more people with the same name. People change their names. This makes a name alone a poor choice of primary key. For this reason, databases use some other means of uniquely identifying people and things. This might be your driver license number, or NHI number. Sometimes it is simply a large randomly generated number such as 2A0090D7-2398-4B77-ABC0-68BEEBE46DF8 when it's not important for the identifier to be handled by humans.

Nothing like this was available from scraping the site, so the bot simply took the name, address, and exposure date together for this purpose. The name and address alone wouldn't work, as a location often had several distinct exposures.
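A composite key like this can be sketched as below. The function name, the separator and the use of a hash are illustrative choices, not a record of what the bot did. The hash just turns the three fields into one fixed-length string that is easy to store.

```python
import hashlib


def composite_key(name, address, exposure):
    """Derive a stand-in primary key from the three displayed fields."""
    # Join with a separator that won't occur in normal text, so that
    # ("ab", "c") and ("a", "bc") can't collide.
    joined = "\x1f".join((name, address, exposure))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

The same location with two different exposure times gets two different keys, which is what we want. The weakness is that any edit to any field, however small, also produces a different key.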

This turned out to be quite fallible:

The second breakage #

A number of locations had their details edited. For many this involved a postcode being added to their address. Sometimes only an errant whitespace was removed. All these changes caused the bot to not recognise the locations as having been already tweeted, and the bot's followers were subjected to around a hundred locations being tweeted in duplicate.

This was a consequence of me assuming the addresses of locations wouldn't be edited (even slightly), and the extremely robotic characteristic of treating "321 Queen St Auckland" and "321 Queen St, Auckland" as two completely different addresses.
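One way to soften that robotic behaviour, sketched here as a hypothetical and not something the bot did, is to compare a normalised form of the text while still posting the original verbatim. The normalised string is only ever used for matching, so it doesn't become a second source of truth for readers.

```python
import re
import unicodedata


def normalise_for_matching(text):
    """Reduce text to a comparison form: case, punctuation and spacing are ignored."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())         # collapse runs of whitespace
```

With this, "321 Queen St Auckland" and "321 Queen St, Auckland" compare equal. It would not have caught an added postcode, and looser matching brings its own risk of treating two genuinely different locations as one, so it is a trade-off and not a fix.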

I apologised for the duplicates, and despite numerous assurances from my followers I was doing a good job, felt a huge sense of professional embarrassment. What I thought was beautifully simple code turned out to have sharp edges.

I realised then that I'd focussed on my strengths and not my weaknesses. I mostly work with clean data, and that wasn't what I had. Perhaps other people are better suited to this job, which has clearly hit gaps in my expertise. I fretted. Am I doing more harm than good? Will this happen again? I stressed, and I cried.

Ethics #

I considered, but never ended up writing, a small charter of the principles I intended to adhere to in operating the bot, but as it grew in popularity I felt a distinct sense of responsibility to operate with the most care possible. It must be neutral, factual, and accessible by all.

The discipline of medicine has as its founding principle a saying: Primum non nocere, meaning: first of all, do no harm. I think software developers would do well to similarly be guided by this.

To that end, the obvious interpretation here would be to ensure that the bot must to the best extent possible not introduce any errors or omissions in the data. Essential workers were bound by law to familiarise themselves with exposure locations on a daily basis. If one were to miss a location due to carelessness on my part, this would be egregiously negligent.

While "only" 3700 people were using the bot (possibly many more, I can't count people who viewed the feed using Twitter Lists, or by manually visiting the bot's profile page), the sense of responsibility started to feel immediately awkward. People were relying on this for critical information at a time of crisis, and the task of relaying official government information in a crisis really should sit with the government alone.

It was clear though, that the technical team within the MoH did not have the resources to take this on, and were fighting their own fires, trying to build a plane while flying it. After only a couple of days where my lunch was replaced with panicked scrambling to fix the bot once again, I also started to feel the pressure and resulting burnout. This thing had all but consumed me for two days. I continued to maintain the bot, but with trepidation.

The third breakage #

This time, the MoH had improved their site by allowing people to change the sort order of the table to put the newest locations at the top. This again broke the bot, at the same time as solving the problem the bot set out to solve in the first place.

In retrospect, this is when I should have pulled the plug. The bot had served its purpose now, and people could get what they need from the official source.

Tweets of encouragement and anticipation of me fixing it yet again continued to flow, so once again I patched the bot up. As a consequence of the change MoH made, different regions were no longer split into separate tables, making it difficult for the bot to distinguish regions without examining the address. I imagine many of my colleagues were a bit miffed at my reluctance to do this, as it's relatively easy and cheap to do, but I was always adamant I didn't want to add a Second Source of Truth and clung to this harder than I should have.

By now, locations in Wellington and across the Central North Island had been added. The bot no longer distinguished them, and was objectively worse for this.

In my haste to fix it, I botched something and tweeted out another whole heap of duplicates, this time all my fault. No tears were shed, no emotions were felt, by now I was burned out and living in a nightmare.

Enter the API #

In response to the now numerous requests for a machine-readable data source, the MoH published a GitHub repository with a CSV and a GeoJSON. It wasn't an API per se, but it was 90% of one. There was much rejoicing. By now there were several projects making use of the data, and this would enable them all to completely do away with error-prone HTML Scraping. In theory.

However, there was still no Primary Key, the name field inexplicably had a date mixed in with it, and there were other awkward issues. Still, it meant we were no longer relying on a web page that kept changing, and many of us grabbed it and ran.

Eventually, a unique primary key was added, the name field was fixed, but by now it was clear that the data was no cleaner than that on the site. For a day, it looked like the GitHub repo was updated earlier than the website. The next day, it was the other way around. Some projects were now better served by HTML Scraping, others by the GitHub repo. Messy.

The bot finally ran unattended for several days, and I finally got back to my paying job which I was now really behind on, and starting to get pressure there.

The fourth, and final breakage #

On the evening of August 31, I noticed no new locations had been posted, at the end of a long day catching up on other work. I was exhausted by then, so I let out a sigh before going to bed, intending to look at it in the morning.

On taking a look, I noticed the GeoJSON feed I was using was missing just under 50 locations. Communicating with my new friends running other Covid location projects, it turned out there were a different number of records depending on whether you looked at the web, the CSV, or the GeoJSON.
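A disagreement like this can at least be detected by comparing the identifiers present in each source. The sketch below assumes a CSV with an `id` column and a GeoJSON FeatureCollection whose features carry an `id` property. Both field names are placeholders, not the real schema of the MoH files.

```python
import csv
import io
import json


def ids_from_csv(text, id_column="id"):
    """Collect the identifier column from CSV text (column name is an assumption)."""
    return {row[id_column] for row in csv.DictReader(io.StringIO(text))}


def ids_from_geojson(text, id_property="id"):
    """Collect identifiers from a GeoJSON FeatureCollection (property name is an assumption)."""
    data = json.loads(text)
    return {str(feature["properties"][id_property]) for feature in data["features"]}


def compare_sources(csv_ids, geojson_ids):
    """Report which identifiers appear in one source but not the other."""
    return {
        "only_in_csv": sorted(csv_ids - geojson_ids),
        "only_in_geojson": sorted(geojson_ids - csv_ids),
    }
```

Detecting the gap is as far as code can go. It can tell you that the sources disagree, but not which one is right.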

Data was actually missing. This can't be compensated for. There was only one thing I could do about that: STOP.

Stop running a data feed of crucial health information that was unofficial and fallible, that I was in no position to be able to fix, and which was starting to have a great cost on my happiness and wages. I wept again, and this time fired off the tweet before I talked myself out of it once more. I put on my trainers and went for a long, slow walk. My watch congratulated me on a 7-day workout week, which I had no recollection of.

Conclusion #

We need open data. This is the kind of thing made possible by it.

I won't speculate on exactly why we didn't have this from the get-go for this outbreak 17 months after Covid first hit Aotearoa. I simply don't know the inner workings of the MoH, but I know the immense pressure they have been under meant that trying to implement it in the midst of an outbreak was always going to be hit and miss. In terms of getting the contact tracing data out to the general public, they've done a good job.

I'm proud of what I did, how I went about it, and my reasons for stopping doing it. It genuinely helped people, and that's a good feeling I'll never forget.

I probably wouldn't do something like this again without better access to the source of the data.

I connected with a huge number of people from the media and the tech community in Aotearoa, and every interaction was uplifting and helpful.

Acknowledgements #

First to @NatDudley for the idea that kicked this off, and for their guidance on matters of usability. These are important to me, but I know from following Nat's work they'd done the mahi here, and deferring to their judgement on this was the easiest decision to make.

Harkanwal Singh, who'd already been working overnight by the time I started. He's been immersed in this the whole time, manually checking and processing the data into high quality visualisations for The Spinoff. I admire his aptitude, but most of all respect his ethics. It's been great being in touch.

Ken Tsang (@jxeeno), who runs Covid19NearMe, a very impressive interactive project spanning Australia and Aotearoa.

Nat Torkington, who I know I can always rely on for a reality check and objective advice.

The Aotearoa tech community, too many to name, who've given me a lot of feedback and advice, all of it gracefully and constructively. I am proud to count you as colleagues.

Makers of other bots, to whom I pass the torch. Good luck, go well. 💪

To all of you who've sent me notes of thanks and encouragement. These lovely messages have been the wind in my sails. It's truly been a pleasure to help you. 🥰

The Media! It ain't so bad when we work together, eh? 😏

Kia kaha, kia tupato ❤️
