Manually Decoding UTF-8 From ANSI Or Similar

You have probably seen this problem before: you open a document and, for some reason, some of the symbols have morphed into something much less intelligible. For example, you’re reading along and suddenly you see “onâ€‘premise” and have to work out what was originally there. Normally you can pick it up from context, but if it’s important that you transcribe the correct character, I can walk you through recovering it.

Simple Solution

The easiest fix is to force the document to open in UTF-8 encoding, which 9 times out of 10 will solve your problem. How you do that depends on the programs you’re using and what you can install, but once you do, you should see everything returned to its proper character.
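If the thing opening the document happens to be a script rather than an editor, forcing UTF-8 can be as small as naming the encoding when you read the file. Here’s a minimal Python sketch of that idea; the function name read_as_utf8 is my own placeholder, not part of any particular tool.

```python
from pathlib import Path


def read_as_utf8(path):
    """Read a text file, decoding its bytes as UTF-8 rather than the platform default."""
    # Hypothetical helper: naming the encoding explicitly stops Python from
    # falling back to a locale encoding such as Windows-1252.
    return Path(path).read_text(encoding="utf-8")
```

If the file really is UTF-8, the text comes back clean. If it isn’t, Python raises a UnicodeDecodeError instead of quietly showing you the wrong characters, which is at least a clear signal.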

Manual Process

If you can’t change the encoding in the program you’re using (maybe it’s not available or IT has locked out that feature) you can usually solve it with some creative online researching. This process involves a few different steps, so let’s get started.

Determine Problematic Encoding

First we need to figure out how the document is currently being read. Given this issue crops up a lot on Windows machines, it’s usually safe to assume the culprit is Windows-1252 encoding, which often gets referred to as ANSI encoding. In the cases where it’s not, it’s probably ISO/IEC 8859-1 or ISO/IEC 8859-15, which are slightly different but can be solved using a similar method.

The way to determine exactly which encoding you’re dealing with is to look at the characters you can see. For punctuation and symbols there will usually be three of them (one for each byte of a three-byte UTF-8 sequence, although UTF-8 characters can take anywhere from one to four bytes), and you’ll need to look very closely. Using the earlier example we have â€‘ which, if you look closely, is an a with a circumflex above it, a Euro symbol, and an opening single quotation mark (not a standard apostrophe!). That last character is found in the ANSI encoding set but not in either of the other two candidates, so by process of elimination we know we’re dealing with ANSI/Windows-1252.
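The same process of elimination can be done mechanically: try to encode the garbled characters in each candidate encoding and see which ones can represent all of them. This is only a rough sketch; the candidate list and the function name encodings_that_fit are my own choices, and the garbled characters are written as escapes so the example survives copy and paste.

```python
# Candidate encodings, using Python's codec names for
# Windows-1252, ISO/IEC 8859-1 and ISO/IEC 8859-15.
CANDIDATES = ["cp1252", "iso-8859-1", "iso-8859-15"]


def encodings_that_fit(garbled):
    """Return the candidate encodings able to represent every garbled character."""
    fits = []
    for name in CANDIDATES:
        try:
            garbled.encode(name)
        except UnicodeEncodeError:
            # At least one character does not exist in this encoding.
            continue
        fits.append(name)
    return fits


# a-circumflex, Euro sign, opening single quotation mark
print(encodings_that_fit("\u00e2\u20ac\u2018"))
```

For the example characters only cp1252 survives, which matches the conclusion reached by eye. If more than one candidate fits, you’ll need a longer sample of garbled text to tell them apart.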

Determine Hex Values

Now we need to determine the hex values of those characters. Since each character is one byte long, the hex value will be exactly 2 digits, each from 0-F (if you’re unfamiliar, hex goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F because it is hexadecimal, a base-16 counting system). You can use the tables on the wiki pages for these encodings to work this out: find the character in the Character Set section, then read the row heading for the first digit and the column heading for the second digit.

In this case we start with â, which is in row E_ and column _2, and thus its hex value is E2. Next we have €, which is in row 8_ and column _0, and thus is 80. Lastly, ‘ is in row 9_ and column _1, or 91.
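If you’d like to double-check a table lookup, encoding each character with the suspected encoding gives the same byte values. A small sketch, assuming Windows-1252 (Python calls it cp1252) is the encoding you settled on; the variable names are mine.

```python
# The three garbled characters, written as escapes:
# a-circumflex, Euro sign, opening single quotation mark.
garbled = "\u00e2\u20ac\u2018"

for ch in garbled:
    # Each character maps to exactly one byte in Windows-1252.
    byte = ch.encode("cp1252")
    print(repr(ch), byte.hex().upper())

# All three bytes together, ready to look up as a UTF-8 sequence.
hex_bytes = garbled.encode("cp1252").hex()
print(hex_bytes)
```

The combined value comes out as e28091, the same E2, 80, 91 read off the table.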

Determine Unicode Character

There’s a wonderful page that lists all the Unicode characters and their descriptive names that you can find here. It uses a table system, and we just need to work out how our bytes translate. On this page, make sure you set the “display format for UTF-8 encoding” setting to “hex”. Now we need to find the right initial character. You’ll notice in the UTF-8 column on the first page that it shows a single UTF-8 hex value, so we need to go much higher. In the “go to other block” select box, scroll down to “U+08A0… U+08FF: Arabic Extended-A”. This should reload the page, and we’ll see that we’re into UTF-8 characters with three hex bytes, and that they start with e0. We’re looking for e2, so we still need to go a bit further. In the “go to other block” section, change to “U+2000… U+206F: General Punctuation”. You’ll notice the hex section now starts with e2 80, which is really close. Scroll down and you’ll see e2 80 91 is ‑, also known as NON-BREAKING HYPHEN. We’ve found the character that had issues being encoded!
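The lookup can also be cross-checked in a few lines of Python, since its standard unicodedata module knows the official character names. This is a sketch rather than a polished tool: identify and repair are names I made up, and repair assumes the text was misread as Windows-1252.

```python
import unicodedata


def identify(hex_bytes):
    """Decode hex-encoded UTF-8 bytes and describe each resulting character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>")) for ch in text]


def repair(garbled, wrong_encoding="cp1252"):
    """Undo the misreading: recover the original bytes, then decode them as UTF-8."""
    # Assumption: the whole string was garbled by the same wrong encoding.
    return garbled.encode(wrong_encoding).decode("utf-8")


print(identify("e28091"))  # [('U+2011', 'NON-BREAKING HYPHEN')]
print(repair("on\u00e2\u20ac\u2018premise") == "on\u2011premise")  # True
```

Bear in mind that repair will raise an error if the text contains characters the wrong encoding can’t represent, or if the recovered bytes aren’t valid UTF-8, so treat it as a check on the manual steps rather than a guaranteed fix.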

If you’re worried about how many Unicode pages there are, and how you’d find the right one without working through them in order until you hit the right bytes, you can narrow things down with a few assumptions. I knew it was probably some form of punctuation given the type of sentence, and probably a type of hyphen or dash, so I guessed it would be in a block to do with punctuation or diacritical marks, which cut down where I needed to look.

Good luck and happy decoding!