Localization Engineering

Why do I get garbage characters, boxes or question marks?


Part 1 - Introduction to character sets, glyphs, fonts and encoding

Character corruption, or "garbage" characters, is no fun. It used to be a concern mostly for developers, but it has become a common issue that anybody can face now that the Internet and computers are part of our daily life. In this series, I will go over the most common corruption issues, the reasons behind character corruption and how to avoid it. I will stay as far away from technical jargon as I can, but will provide links to other resources for those who are interested in related content.

Part 1 focuses on what a "character" is and how it ends up being displayed on the computer screen. In the following posts, I will take a deeper look into the most common corruption issues we face on the web and on the desktop.

"This is an A and next time I hit this key, I want to see an A on my screen"

Unfortunately, computers are not that simple. They can only speak in numbers, so this statement means nothing to them as it stands. To overcome this, engineers translated it into computer language and defined a set of binary codes to represent the alphabet. But which alphabet? The English alphabet? The German alphabet? The Russian alphabet? All of them? This is where the term "character set" comes up...

Character sets, character encoding and why we need them

Easy enough: a character set is a set of characters. The Latin-1 character set, for example, contains the characters used in Western European languages and some others (as I said, I won't get into details; you can refer to Wikipedia for complete coverage of the Latin-1 character set). Latin-1 contains a total of 191 printable characters.

A character encoding pairs each character in a given character set with a number that a computer can understand. For instance, the letter "A" is represented by the hexadecimal code 0041 in the Latin-1 code page. (Latin-1 stores it as the single byte 41; the four-digit form 0041 matches the Unicode notation U+0041.)
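For readers who want to see this pairing for themselves, here is a minimal sketch in Python (my choice of language, not something the series depends on). It assumes Python 3 and its built-in "latin-1" codec, and the variable names are only illustrative.

```python
# Encode the character "A" with the Latin-1 (ISO 8859-1) encoding.
letter = "A"
encoded = letter.encode("latin-1")   # a bytes object holding a single byte

# Show the number behind the letter in hexadecimal.
code_hex = encoded.hex().upper()
padded_hex = f"{encoded[0]:04X}"     # same value, padded to four digits

print(code_hex)     # 41
print(padded_hex)   # 0041

# Going the other way: the number is turned back into the character.
decoded = encoded.decode("latin-1")
print(decoded)      # A
```

The encoding works in both directions: the same table turns a character into a number and a number back into a character.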

What about the other characters though? If Latin-1 only contains 191 characters, how does the computer understand and display the rest? Chinese alone has thousands of different characters. This is why there are many character sets, such as JIS for Japanese, Guobiao for Chinese and so on.
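The limits of a single character set can be sketched in a few lines of Python. This is only an illustration: the helper can_encode is hypothetical, and I am assuming Python's built-in "gb2312" (a Guobiao encoding) and "shift_jis" (a JIS-based encoding) codecs as stand-ins for the character sets named above.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Raised when the character set has no slot for a character.
        return False

sample = "中"  # a Chinese character, also used in Japanese

results = {
    enc: can_encode(sample, enc)
    for enc in ("latin-1", "gb2312", "shift_jis")
}
print(results)
# {'latin-1': False, 'gb2312': True, 'shift_jis': True}
```

Latin-1 simply has no number assigned to this character, while the Chinese and Japanese character sets do.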

So?

Let me try to explain this with an example. Let's say English is the only language you speak. When somebody speaks German next to you, you still hear it, but only as sound; you do not relate any of the sounds to emotions, actions and so on. In some cases, you cannot even tell which language the person next to you is speaking unless you are familiar with its sounds. The German speaker still uses a, b, c and so on, but the combinations do not mean anything to you because your brain does not know how to interpret them.

The same applies to computers. When I tell the computer "Give me the 0041" (remember the A in Latin-1?), I also need to tell it which code page to use to interpret 0041. As usual, there is a default: if I don't tell the computer what I want, it will use the default character set that it understands and display the corresponding character.
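To make this concrete, here is a small Python sketch that hands the same number to the computer under two different code pages. The choice of byte E9 and of the "cp1251" (Cyrillic) code page is mine, purely for illustration.

```python
import locale

# One and the same byte, interpreted with two different code pages.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # Western European code page
as_cp1251 = raw.decode("cp1251")    # Cyrillic code page

print(as_latin1)   # é
print(as_cp1251)   # й

# Plain "A" happens to sit at the same number in both code pages.
same_a = bytes([0x41]).decode("latin-1") == bytes([0x41]).decode("cp1251")
print(same_a)      # True

# The default this machine falls back to (the value varies per system).
print(locale.getpreferredencoding(False))
```

The number never changed; only the code page used to interpret it did.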

Glyphs

There is one more thing that we need to touch on before we can move on to the fun stuff. A glyph is a graphical representation of a character. In a nutshell, the "A" understood by our brain is the character, and the glyph is what you see on paper. The same applies to the computer: 0041 in the Latin-1 character set is an entity called "LATIN CAPITAL LETTER A". What you see on your screen when you type an "A" is the glyph, and this is where the term "font" comes in. 0041 is always "LATIN CAPITAL LETTER A" for the computer, but depending on the font that you choose, the graphical representation on the screen will be different.
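The character side of this split can be looked up directly. The sketch below assumes Python 3 and its standard unicodedata module, which reports the official name of a character; the glyph side cannot be shown this way, because it depends on the font your screen uses.

```python
import unicodedata

# Turn the number 0041 (hexadecimal) into a character.
char = chr(0x0041)

# Ask for the name of the abstract character, independent of any font.
name = unicodedata.name(char)

print(char)   # A
print(name)   # LATIN CAPITAL LETTER A
```

Whatever shape the "A" takes on your screen, the name printed here stays the same.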

Even with this much information, I think you can tell why character corruption occurs. There are two main reasons:

- You give 0041 to the computer, but the computer interprets it using the wrong code page, so 0041 does not mean "LATIN CAPITAL LETTER A" to the computer.

- 0041 and the corresponding code page are fine but the computer does not know how to graphically represent "LATIN CAPITAL LETTER A" (glyph).

This concludes our introduction to code pages, character sets, glyphs and fonts. In Part 2 we will investigate the reasons with some examples and see how we can avoid character corruption and garbage characters. Stay tuned...




 