Troubleshooting Invalid Characters in Website Forms

I found this when I had trouble pasting a phone number from Google Voice into Zoho Contacts.

4158826300 typed in, works
415-882-6300 typed in, works
(415) 882-6300 typed in, works
(415) 882-6300 pasted from Google Voice, invalid (has invisible non-ASCII characters)
415-882-6300 pasted from Google Voice, invalid (so it’s not the parentheses)
(415) 882-6300 pasted into Atom from Google Voice, copied from Atom into the form, invalid
+14158826300 Google Voice, right-click Copy Phone Number, works
(415) 882-6300 from Google Voice, Edit, select, paste into my programmer’s editor, copy to the form, works

Change the Website Form

The website form should use UTF-8 (or perhaps UTF-16) encoding for the characters.

If you are the developer of the website, you can easily specify what encoding to use.

Then, with HTML5 you can specify which character ranges are acceptable (the pattern attribute) and how many characters are allowed (the maxlength attribute). For example, for a City field, allow only the characters that make sense for the language of your website, and allow only about 30 characters. Maybe you allow only Latin alphabet characters, which are used in Europe, North America, and South America; most languages that aren’t written in the Latin alphabet have “Latin spellings”.

Using JavaScript to validate form entries is more complex, and harder to debug and maintain. I suggest only using JavaScript for validation after the HTML5 validation is done.

But don’t reject a form entry for characters that are not visible to your users. That is a very bad user interface.
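One way to avoid rejecting input for characters the user cannot see is to strip invisible formatting characters before validating. The sketch below is in Python rather than the form’s own JavaScript. The function names and the phone pattern are hypothetical, made up for illustration; this is not how Zoho validates phone numbers.

```python
import re
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), which render as nothing."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical phone pattern, for illustration only: an optional "+",
# then 7 to 20 digits, spaces, parentheses, or hyphens.
PHONE_RE = re.compile(r"\+?[0-9 ()-]{7,20}")


def is_valid_phone(raw: str) -> bool:
    """Validate after removing invisible characters and outer whitespace."""
    cleaned = strip_invisible(raw).strip()
    return PHONE_RE.fullmatch(cleaned) is not None


# The number as Google Voice put it on the clipboard, wrapped in
# U+202A and U+202C (the bytes e2 80 aa and e2 80 ac in a hex dump).
pasted = "\u202a(415) 882-6300\u202c"
print(is_valid_phone(pasted))
```

Category Cf also includes characters such as the zero-width joiner, which some scripts and emoji need, so this kind of stripping suits a phone field better than a name field.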

Are There Simple Filter Tools to Show Lines with Invalid Characters?

“How to Find Non-ASCII Characters in Text Files in Linux” on Baeldung on Linux (https://www.baeldung.com/linux/find-non-ascii-chars) is good, and shows a lot of Linux commands for checking file contents.

grep --color='auto' -n "[^[:alnum:]]" junk.txt shows only (, ), and - as not alphanumeric. So, that isn’t a useful test for these stray characters.

(The Baeldung page has an option that doesn’t work on my Linux Mint: grep --color='auto' -n "[^[:ascii:]]" junk.txt. Without -P, grep doesn’t know “ascii”, because [:ascii:] is not one of the standard POSIX character classes.)

grep --color='auto' -P -n "[^\x00-\x7F]" junk.txt shows only the lines that have irregular characters, but doesn’t show you what those characters are.

Similarly, perl -ne 'print if /[^[:ascii:]]/' junk.txt doesn’t show what those characters are.

On my computer, pcregrep --color='auto' -n "[^[:ascii:]]" junk.txt gives “Command ‘pcregrep’ not found, but can be installed with: sudo apt install pcregrep”.
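Since none of these commands says which characters are the problem, a small script can name them. This is a minimal Python sketch, not one of the tools from the Baeldung page; the function name find_non_ascii and the two sample lines are assumptions for illustration.

```python
import unicodedata


def find_non_ascii(lines):
    """Yield (line_no, column, code_point, name) for each non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                # Some code points have no name; fall back to a placeholder.
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, col, f"U+{ord(ch):04X}", name


# Two sample lines modeled on junk.txt; the second carries the stray
# characters as explicit escapes so they are visible in the source.
sample = [
    "4158826300 works\n",
    "\u202a(415) 882-6300\u202c copied from Google Voice, invalid\n",
]

for hit in find_non_ascii(sample):
    print(*hit)
```

To check a real file, pass an open file instead of the sample list, for example find_non_ascii(open("junk.txt", encoding="utf-8")).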

How Do You See The “Invisible” Characters?

Linux Mint (and many other Linux varieties) has a utility to show the hexadecimal dump of a file.

hd junk.txt
00000000  34 31 35 38 38 32 36 33  30 30 20 77 6f 72 6b 73  |4158826300 works|
00000010  0a 34 31 35 2d 38 38 32  2d 36 33 30 30 20 77 6f  |.415-882-6300 wo|
00000020  72 6b 73 0a e2 80 aa 28  34 31 35 29 20 38 38 32  |rks....(415) 882|
00000030  2d 36 33 30 30 e2 80 ac  20 63 6f 70 69 65 64 20  |-6300... copied |
00000040  66 72 6f 6d 20 47 6f 6f  67 6c 65 20 56 6f 69 63  |from Google Voic|
00000050  65 2c 20 69 6e 76 61 6c  69 64 0a e2 80 aa 28 34  |e, invalid....(4|
00000060  31 35 29 20 38 38 32 2d  36 33 30 30 e2 80 ac 20  |15) 882-6300... |
00000070  70 61 73 74 65 64 20 68  65 72 65 20 66 72 6f 6d  |pasted here from|
00000080  20 47 6f 6f 67 6c 65 20  56 6f 69 63 65 2c 20 63  | Google Voice, c|
00000090  6f 70 69 65 64 20 66 72  6f 6d 20 68 65 72 65 2c  |opied from here,|
000000a0  20 69 6e 76 61 6c 69 64  0a 28 34 31 35 29 20 38  | invalid.(415) 8|

Atom in UTF-8 encoding doesn’t show the characters. Setting Atom to “Windows 1252” encoding shows the invalid characters. The encoding selector is on the bottom toolbar in Atom, to the right of where it shows either “LF” or “CRLF” (for the type of line endings); it probably shows “UTF-8”.

In Atom, those stray characters display like this in Windows 1252 encoding:

â€ª(415) 882-6300â€¬ pasted from Google Voice, invalid

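The invalid characters, copied from a site that uses UTF-8 encoding, that don’t work when pasted into that form are, as hexadecimal bytes: e2 80 aa before the number and e2 80 ac after it. These are the UTF-8 encodings of U+202A (LEFT-TO-RIGHT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING), invisible characters that control text direction.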
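The byte sequences from the hex dump can be decoded to confirm what they are, and to see why they only become visible in Windows 1252 encoding. This is a minimal Python sketch; the helper name describe is an assumption for illustration.

```python
import unicodedata


def describe(hex_bytes: str):
    """Decode hex-dump bytes as UTF-8 and show the Windows-1252 misreading."""
    raw = bytes.fromhex(hex_bytes)
    ch = raw.decode("utf-8")         # the single character these bytes encode
    mojibake = raw.decode("cp1252")  # the same bytes read one byte per character
    return f"U+{ord(ch):04X}", unicodedata.name(ch), mojibake


for seq in ("e2 80 aa", "e2 80 ac"):
    print(seq, *describe(seq))
```

Read as UTF-8, each three-byte sequence is one invisible character. Read as Windows 1252, the same three bytes become three visible characters, which is why changing the encoding in the editor exposes them.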

Is The Problem the Page Encoding?

Google Voice, Zoho Mail, and Zoho Contacts all serve their pages with content-type text/html; charset=utf-8.

So the problem isn’t the encoding of the Zoho Contacts form; it’s the JavaScript they use to test for valid phone numbers.

How to Convert Text File Encoding

The Stack Overflow question “Bash/Linux Find non-ASCII character in a .txt file and replace it with an ASCII character” (https://stackoverflow.com/questions/44663963/bash-linux-find-non-ascii-character-in-a-txt-file-and-replace-it-with-an-ascii) has a good way to convert a multi-byte encoding to a single-byte one.

“I think the problem is that È in utf-8 is a multibyte character consisting of \xc3 and \x88 and sed can’t seem to deal with that for whatever reason… Another method could be to convert the whole file using iconv. iso-8859-15 (latin-9) is one example of single-byte character encoding. The command to convert the file using iconv would be:”

iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
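One caveat for the stray characters in this post: U+202A and U+202C have no equivalent in iso-8859-15, so a strict conversion fails on them instead of silently fixing the text. GNU iconv has a -c option that omits characters it cannot convert. The Python sketch below shows the same two behaviors; the function name convert and its drop_unmappable parameter are assumptions for illustration.

```python
def convert(text: str, encoding: str = "iso-8859-15",
            drop_unmappable: bool = False) -> bytes:
    """Encode text to a single-byte encoding, optionally dropping what won't fit."""
    errors = "ignore" if drop_unmappable else "strict"
    return text.encode(encoding, errors=errors)


pasted = "\u202a(415) 882-6300\u202c"

try:
    convert(pasted)
except UnicodeEncodeError as err:
    # Strict mode refuses characters the target encoding cannot represent.
    print("strict conversion failed:", err.reason)

# Dropping unmappable characters leaves only the visible phone number.
print(convert(pasted, drop_unmappable=True))
```

Dropping characters is safe here because the stray ones carry no visible content; for text with accented or non-Latin letters it would lose real data.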

Atom Changing File Formats

Note: Opening a file in Atom in UTF-8 mode, then switching to “Windows 1252” encoding and saving it, then switching back to UTF-8, changes the non-ASCII characters.

