Matt Peperell | Invisible characters

A few days ago I wrote about an issue which was particularly tricky to debug. Here’s another such story from a few months ago.

Context

At this client, we support the websites used by members of the public when applying for or renewing passports online, along with the associated applications, systems, and third-party providers needed for these.

Problem

A specific customer’s passport application was rejected by the automated validation checks and appeared in one of our error queues with a message telling us that the phone number was missing. We’re used to that; it happens once in a while. The engineer working on the problem asked the customer service team to obtain the applicant’s phone number, and then manually updated the record by copying and pasting the number from the email the customer service team sent back.

The engineer then, as was established procedure for amending records in this way, resubmitted the application from the error queue. It was again rejected and landed back in the error queue. This time the message notified us that the phone number was invalid.

An initial visual inspection showed nothing out of the ordinary.

Investigation

After a basic investigation (i.e. a visual inspection of the phone number), our team’s next port of call is typically the application developers. They confirmed that the front end applies a regular expression when users make applications, in order to prevent malformed phone numbers from being introduced.

At this point, the engineer sought advice from a senior engineer, and this is where I joined the investigation.

Inspecting visually (i.e. not using the website), I again confirmed that the phone number matched the regular expression. A second set of eyes helps, and performing these checks also helps the senior engineer to get up to speed with the details of the problem.

I then clarified the origin of the phone number and realised that it had never come from the website (having instead been entered manually), and concluded that the website’s regular expression had never been applied to it. It was at this point I suspected Unicode: somewhere between luck and suspicion born of experience.

To prove or disprove this hypothesis, I used the tool hexdump (with the -C option) to view the byte-level representation of the number. My expectation was that every byte would be in the hex range 30-39 (digits 0-9), or be 2b (a plus, +, for international numbers) or 20 (a space). Instead we saw:

echo +6421333333​ | hexdump -C
00000000  2b 36 34 32 31 33 33 33  33 33 33 e2 80 8b 0a     |+6421333333....|
0000000f

(number masked for privacy reasons)
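The same byte-level check can be sketched in Python. This is an illustration rather than part of the original investigation: the function name `unexpected_bytes` and the `EXPECTED` set are made up for the example, and the zero-width space is written as the escape `\u200b` so that it is visible in the source.

```python
# Bytes we expect to see in a phone number: digits 0-9, a plus, and a space.
EXPECTED = set(b"0123456789+ ")


def unexpected_bytes(text):
    """Return (offset, hex byte) pairs for UTF-8 bytes outside EXPECTED."""
    encoded = text.encode("utf-8")
    return [(i, f"{b:02x}") for i, b in enumerate(encoded) if b not in EXPECTED]


# The masked number from the post, with the stray character made explicit.
pasted = "+6421333333\u200b"

print(pasted.encode("utf-8").hex(" "))
# 2b 36 34 32 31 33 33 33 33 33 33 e2 80 8b
print(unexpected_bytes(pasted))
# [(11, 'e2'), (12, '80'), (13, '8b')]
print(unexpected_bytes("+64 21 333 333"))
# []
```

Unlike the echo pipeline, this does not append a trailing newline, so there is no final 0a byte.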

So there’s a stray character. It would be a simple matter of deleting the erroneous character and re-amending the record, but having gone this far it’s both prudent and intellectually/forensically interesting to identify the nature of the malformation.

My initial reading was 808b, which is not a valid byte sequence in UTF-8: both bytes are continuation bytes, with no lead byte before them. After a bit of circling around I noticed I’d missed a byte out and the full value was e2808b. A quick web search for this sequence identified it as the UTF-8 encoding of U+200B, the zero-width space character.
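For anyone who would rather not rely on a web search, the lookup can also be done locally. The following is a minimal sketch using Python’s standard unicodedata module; the helper name `describe` is invented for this example.

```python
import unicodedata


def describe(raw):
    """Decode raw bytes as UTF-8 and name each code point, or report the error."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid UTF-8: {exc.reason}"
    return ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text
    )


# The two-byte misreading: continuation bytes with no lead byte.
print(describe(bytes.fromhex("808b")))

# The full three-byte sequence.
print(describe(bytes.fromhex("e2808b")))
# U+200B ZERO WIDTH SPACE
```

The three bytes also decode by hand: e2 is 1110 0010, 80 is 10 000000 and 8b is 10 001011. Dropping the 1110 and 10 prefixes leaves 0010 000000 001011, which is hex 200B.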

Conclusion: this was indeed a funky Unicode character.

Summary of findings

The phone number was mangled in a way invisible to the eye: it contained a non-printing Unicode character, specifically a zero-width space. This was brought along when the phone number was copied and pasted from the original email from the customer service team. To be clear: this is no fault of the original engineer, nor of the customer service team who sent us the document in this format. It’s an artifact of how office-based text editors such as Word sometimes present text, and end users should not be expected to know about this behaviour.

Recommended actions

I forget the reason why there was a missing phone number despite the website having validation. Perhaps the validation had only recently been added. That detail isn’t important to the story but I felt I should cover it here in case anyone wonders. But with that validation in place, no further application updates are needed because the issue should not occur again.

Manual updates to records are rare; we see approximately two a month. We already perform input validation in the form of a regular expression when we get the phone number directly from the user via the website. The receiving systems to which we send applications (e.g. for identity checking or for referee validation) also have error detection, which is why requests periodically end up in the error queue for manual investigation. Given this low occurrence rate, the effort required to change the application to detect these cases and give a more specific error is probably not justified by the time it would save.

There might be some merit in teaching the application support team about the likely culprits when non-ASCII Unicode characters cause problems. Non-breaking spaces and smart quotes are also occasionally encountered.
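As a starting point for that kind of teaching, here is a small sketch of a helper that lists every non-ASCII character in a pasted value along with its Unicode name. It is hypothetical (the name `find_suspects` and the sample values are invented for illustration) and is not something we run in production.

```python
import unicodedata


def find_suspects(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Made-up sample values; the culprits are written as escapes to keep them visible.
samples = [
    "+64 21 333 333",               # clean
    "+6421333333\u200b",            # trailing zero-width space
    "+64\u00a021\u00a0333",         # non-breaking spaces
    "\u201c+6421333333\u201d",      # smart quotes
]

for value in samples:
    print(ascii(value), find_suspects(value))
```

Printing with ascii() shows the escapes rather than the invisible characters themselves, which is the point: the problem becomes something a person can see.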

One could suggest typing the number by hand rather than risking formatting errors such as this, but the risk of introducing a typographical, transpositional or some other error is far too great.

Addendum: But what is a zero-width space?

It’s a way to indicate that a line break may appear at a given point in a chunk of text. If the column of text is wide enough that no line break is needed there, the character has no visible effect.

To demonstrate the effect, in this paragraph of text I’ve written out numbers in English but separated them with ZWSPs. Resize this window by making it narrow to see it in action.

OneTwoThreeFourFiveSixSevenEightNineTenElevenTwelveThirteenFourteenFifteenSixteenSeventeenEighteenNineteenTwenty.
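A line like the one above can be built programmatically. The sketch below is illustrative only (the variable names are invented for the example); it also shows why the character is easy to miss in code: Python does not treat it as whitespace, so a plain strip() leaves it in place.

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space, written as an escape so it is visible here

words = [
    "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
    "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
    "Seventeen", "Eighteen", "Nineteen", "Twenty",
]

plain = "".join(words) + "."
with_breaks = ZWSP.join(words) + "."

# The two strings render identically but are not equal.
print(plain == with_breaks)            # False
print(len(with_breaks) - len(plain))   # 19, one ZWSP between each pair of words

# U+200B is a format character (category Cf), not whitespace.
print(unicodedata.category(ZWSP))      # Cf
print(ZWSP.isspace())                  # False

# So stripping whitespace from a pasted value does not remove it.
print(("+6421333333" + ZWSP).strip().endswith(ZWSP))  # True
```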

It doesn’t make sense to have line breaks in a phone number, but the document that was sent to us was a text document, and the text editor used (Word, I think) doesn’t know what it means to be a phone number, so it didn’t know not to add one. ZWSPs are invisible to the naked eye, both on screen and when printed, so they only matter in very subtle cases such as this. That is far too nuanced for something like Word to care about having intelligent heuristics to determine whether to add one or not!