Troubleshooting Invalid Characters in Website Forms

I found this when I had trouble pasting a phone number from Google Voice into Zoho Contacts.

4158826300 typed in, works
415-882-6300 typed in, works
(415) 882-6300 typed in, works
(415) 882-6300‬ pasted from Google Voice, invalid (has an invisible non-ASCII character after the last digit)
415-882-6300, pasted from Google Voice, invalid (it’s not the parentheses)
(415) 882-6300 pasted into Atom from Google Voice, copied from Atom into the form, invalid
+14158826300 Google Voice, right-click Copy Phone Number, works
(415) 882-6300 from Google Voice, Edit, select, paste into my programmer’s editor, copy to the form, works

Change the Website Form

The website form should use UTF-8 (or perhaps UTF-16) encoding for the characters.

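If you are the developer of the website, you can easily specify what encoding to use, for example with a <meta charset="utf-8"> tag in the page head or the accept-charset attribute on the form.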

Then, with HTML5, you can specify which characters are acceptable and how many are allowed, using the pattern and maxlength attributes. For example, for a City field, allow only the characters that make sense for the language of your website, and allow perhaps 30 characters. Maybe you allow only Latin alphabet characters: the Latin alphabet is used across Europe and North and South America, and most languages that aren’t written in it have “Latin spellings”.
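To make the City rule concrete, here is a minimal Python sketch of the same kind of check that the HTML5 pattern and maxlength attributes perform in the browser. The name CITY_PATTERN, the function is_valid_city, and the exact character set are assumptions chosen for illustration; they are not taken from any real site.

```python
import re

# Hypothetical rule: 1 to 30 characters made up of Latin letters (including
# accented letters from the Latin-1 Supplement and Latin Extended-A blocks),
# spaces, apostrophes, periods, and hyphens.
CITY_PATTERN = re.compile(
    r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F .'\-]{1,30}"
)


def is_valid_city(value: str) -> bool:
    """Return True if the whole value matches the hypothetical city rule."""
    return CITY_PATTERN.fullmatch(value) is not None


print(is_valid_city("San Francisco"))
print(is_valid_city("San Francisco\u202c"))  # trailing invisible character
```

fullmatch is used because the HTML pattern attribute also has to match the entire field value, not just part of it. Note that a rule like this rejects an invisible pasted character, so it works best after such characters have been stripped.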

Using JavaScript to validate form entries is more complex, and harder to debug and maintain. I suggest only using JavaScript for validation after the HTML5 validation is done.

But don’t reject a form entry for characters that are not visible to your users. That is a very bad user interface.
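One way to avoid rejecting invisible characters is to strip them before validating. The sketch below is in Python, but the same idea would apply in a form’s JavaScript. The function names are made up for illustration, and treating every Unicode “format” character (category Cf) as safe to drop is an assumption that suits a phone number field, not every field.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Remove Unicode format characters (category Cf), which have no visible glyph."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def normalize_phone(pasted: str) -> str:
    """Drop invisible characters and surrounding whitespace before validating."""
    return strip_invisible(pasted).strip()


pasted = "(415) 882-6300\u202c"  # U+202C is an invisible character at the end
print(repr(normalize_phone(pasted)))  # prints '(415) 882-6300'
```

Category Cf also covers characters such as the zero-width joiner, which some scripts and emoji sequences need, so a free-text field would want a narrower list.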

Are There Simple Filter Tools to Show Lines with Invalid Characters?

“How to Find Non-ASCII Characters in Text Files in Linux” on Baeldung on Linux (https://www.baeldung.com/linux/find-non-ascii-chars) is good; it shows a lot of Linux commands for checking file contents.

grep --color='auto' -n "[^[:alnum:]]" junk.txt visibly highlights only (, ), and - as not alphanumeric. So that isn’t a valid test for these stray characters.

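(The Baeldung page has an option that doesn’t work on my Linux Mint: grep --color='auto' -n "[^[:ascii:]]" fails because [:ascii:] is not a POSIX character class, so grep’s default regex syntax doesn’t know it. It is a Perl/PCRE extension and needs grep -P. Substituting alnum runs, but as shown above it isn’t a useful test.)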

grep --color='auto' -P -n "[^\x00-\x7F]" junk.txt shows only the lines that have non-ASCII characters, but doesn’t show you what those characters are.

Similarly, perl -ne 'print if /[^[:ascii:]]/' junk.txt doesn’t show what those characters are.

On my computer, pcregrep --color='auto' -n "[^[:ascii:]]" junk.txt gives “Command ‘pcregrep’ not found, but can be installed with: sudo apt install pcregrep”.
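Since none of the commands above say which characters they found, here is a small Python sketch that reports each non-ASCII character with its line, column, code point, and Unicode name. The function name describe_non_ascii and the sample text are made up for illustration.

```python
import unicodedata


def describe_non_ascii(text: str) -> list:
    """Return (line, column, code point, Unicode name) for each non-ASCII character."""
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


sample = "415-882-6300\n(415) 882-6300\u202c\n"
for finding in describe_non_ascii(sample):
    print(finding)  # prints (2, 15, 'U+202C', 'POP DIRECTIONAL FORMATTING')
```

To check a file such as junk.txt, read it with open("junk.txt", encoding="utf-8").read() and pass the result to the function.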

How Do You See The “Invisible” Characters?

Linux Mint (and many other Linux varieties) has utilities that show a hexadecimal dump of a file, such as hexdump or xxd.

Atom in UTF-8 encoding doesn’t show the characters. Setting Atom to “Windows 1252” encoding shows them. The encoding selector is on Atom’s bottom toolbar, to the right of where it shows “LF” or “CRLF” (the type of line endings); it probably shows “UTF-8”.

In Atom, those stray characters display like this in Windows 1252 encoding:

(415) 882-6300â€¬ pasted from Google Voice, invalid

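As hexadecimal bytes, the invisible character that the form rejects (copied from a site that uses UTF-8 encoding) is: e2 80 ac. That is the UTF-8 encoding of U+202C, POP DIRECTIONAL FORMATTING, a control character for text direction that has no visible glyph.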
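Those three bytes can be checked directly. This short Python sketch decodes them once as UTF-8 and once as Windows 1252, which shows why Atom displays one invisible character in the first encoding and three odd-looking characters in the second. The variable names are just for illustration.

```python
import unicodedata

raw = bytes.fromhex("e2 80 ac")  # the three bytes from the hex dump

as_utf8 = raw.decode("utf-8")     # decodes to a single character
as_cp1252 = raw.decode("cp1252")  # decodes to three separate characters

print(f"U+{ord(as_utf8):04X}", unicodedata.name(as_utf8))
print(as_cp1252)
```

The first line printed is U+202C POP DIRECTIONAL FORMATTING, and the second is the same â€¬ sequence that Atom shows in Windows 1252 mode.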

Is The Problem the Page Encoding?

Google Voice, Zoho Mail, and Zoho Contacts all serve their pages with content-type text/html; charset=utf-8.

So it isn’t the Zoho Contacts form’s encoding; it’s most likely the JavaScript they use to test for valid phone numbers.

How to Convert Text File Encoding

“Bash/Linux Find non-ASCII character in a .txt file and replace it with an ASCII character” on Stack Overflow (https://stackoverflow.com/questions/44663963/bash-linux-find-non-ascii-character-in-a-txt-file-and-replace-it-with-an-ascii) has a good way to convert a multibyte encoding to a different encoding that is single-byte.

“I think the problem is that È in utf-8 is a multibyte character consisting of \xc3 and \x88 and sed can’t seem to deal with that for whatever reason… Another method could be to convert the whole file using iconv. iso-8859-15 (latin-9) is one example of single-byte character encoding. The command to convert the file using iconv would be:”

iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
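The same kind of conversion can be sketched in Python, which also makes one catch visible: a character like the invisible one in the pasted phone number has no equivalent in iso-8859-15, so a strict conversion fails and you have to decide what to do with it. The function name to_single_byte is made up for illustration.

```python
def to_single_byte(text: str, encoding: str = "iso-8859-15", errors: str = "strict") -> bytes:
    """Encode text to a single-byte encoding; `errors` controls unconvertible characters."""
    return text.encode(encoding, errors=errors)


print(to_single_byte("È"))  # one byte in Latin-9: b'\xc8'

pasted = "(415) 882-6300\u202c"
try:
    to_single_byte(pasted)
except UnicodeEncodeError as exc:
    print("cannot convert:", exc.reason)

# errors="ignore" silently drops characters the target encoding lacks
print(to_single_byte(pasted, errors="ignore"))
```

Dropping characters silently is fine for stray invisible ones, but it would also discard real text that the target encoding cannot represent, so the strict default is the safer starting point.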

Atom Changing File Formats

Note: Opening a file in Atom in UTF-8 mode, switching to “Windows 1252” encoding and saving it, then switching back to UTF-8 changes the non-ASCII characters.