Thursday, February 22, 2007

Monday Rant: Text Substitution

The alphabet, in English at least, is widely agreed to consist of 26 letters. Each letter has a capital and lower-case form, so from the point of view of a typewriter or computer, there are 52 symbols to deal with. I have yet to meet a computer or typewriter that has serious difficulties with these 52 basic English-language symbols.

Other languages use accent marks to denote changes in pronunciation in letters or syllables. Computers designed for English sometimes do have difficulties with these accents, which is understandable given the design parameters. Recent computers typically offer software settings that make it easy to include such marks when working in a language other than English.

However, despite all the work that has been conducted in shifting the basic set of characters from language to language, other essential features of written languages seem to have been neglected.

Written English includes many more than the basic set of 52 characters, although for some reason many people ignore the other characters. Numbers are trivially easy, and are rarely a problem. But why did the software developers neglect the comma, the period, and especially the apostrophe?

These are not superfluous symbols, unnecessary for standard English. Apostrophes have their place in normal, everyday English, and are not restricted to some rarefied ivory-tower usage. So why do I see so many emails, the most basic of computer-mediated communication, with inconsistent and apparently random symbols inserted for apostrophes, quotation marks, and other symbols? How many times have you seen ? or & or $ thrown into the text of a message in place of " or ' ? It's jarring, and disrupts reading, to see such inappropriate shyte. Why does the software feel the need to insert the symbol for the Euro in place of something so pedestrian as a quotation mark?

What the hell? I simply do not understand why this happens at all, let alone how amazingly frequent it is. This bizarre, stupid phenomenon is not restricted to translations between languages - English-to-English gets just as mangled as any phrase sent through machine-translation! Why? WHY?

Comic from Bob the Angry Flower


Richard Gadsden said...

Since you asked...

The problem is that there are multiple different symbols for apostrophes and quotation marks.

The standard apostrophe is ' which is U+0027 in Unicode, or 39 in ASCII. Stick to that and you will be OK.

The problem is that some applications (e.g. Microsoft Word) replace it with a right single quotation mark, which is 146 in Windows (aka the Windows-1252 codepage), but nothing else recognises 146 as being a curly quote. This means that if it passes through a UNIX system, which is pretty normal, then there's a good chance that the 146 won't get recognised properly. Unicode has U+2019 for the character, and that should be recognised consistently. But too many times, applications convert from Unicode to a code page and then don't declare the codepage correctly. Windows is very inclined to declare Windows-1252 as being ISO-8859-1, and the two are not the same (rsquo is one of the differences).
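For anyone who wants to see the mismatch for themselves, here is a minimal Python sketch. The variable names are made up for illustration, and it only imitates what a mislabelled message might go through; real mail software may behave differently.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the curly apostrophe Word likes to insert.
curly = "It\u2019s"

# In Windows-1252 the curly quote is the single byte 146 (0x92).
cp1252_bytes = curly.encode("cp1252")
print(list(cp1252_bytes))

# Mislabelled as ISO-8859-1, byte 146 decodes to an invisible control
# character (U+0092) instead of a quote.
as_latin1 = cp1252_bytes.decode("iso-8859-1")
print(repr(as_latin1))

# Plain ASCII has no slot for the character at all, so one common
# fallback is to substitute "?".
as_ascii = curly.encode("ascii", errors="replace")
print(as_ascii)

# UTF-8 stores U+2019 as three bytes. Read back as Windows-1252, they
# turn into three separate characters, one of which is the Euro sign.
mojibake = curly.encode("utf-8").decode("cp1252")
print(mojibake)

# The straight apostrophe (U+0027, 39) is the same single byte in all
# four encodings, which is why sticking to it is safe.
plain = "It's"
same = {plain.encode(enc) for enc in ("ascii", "cp1252", "iso-8859-1", "utf-8")}
print(same)
```

The fourth case is probably where the stray Euro signs come from: the sender wrote UTF-8 and the reader assumed Windows-1252.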

HTML "solves" this by using character entities - like &rsquo; - but they create at least as many problems as they solve.
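A small Python sketch of the entity approach, using the standard library's html module. The double-escaping case at the end is just one example of the kind of problem entities can cause; it is my guess at a typical failure, not an exhaustive list.

```python
import html

# The named entity and the numeric reference both decode to U+2019.
named = html.unescape("It&rsquo;s")
numeric = html.unescape("It&#8217;s")
print(named == numeric == "It\u2019s")

# If text that already contains an entity gets escaped a second time,
# the ampersand itself is escaped and the reader sees the raw entity
# name instead of a quote.
escaped_twice = html.escape("It&rsquo;s")
print(escaped_twice)
```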

Carlo said...

Wow, so I guess the problem may be non-standardization? I have no love for Microsoft, so the list of 'things I wish MS did better' goes on...

I always assume with such things that the problem has to do with some archaic system that was originally put into practice, is now no longer efficient, but stays in practice... like the American keyboard itself...

TheBrummell said...

Thanks for the info, Richard. I figured it was something like that, but I didn't know the details.

This, of course, leads right into the next question, which is "why do these different applications code COMMONLY USED characters differently?" They convert other characters, like A B C, without problem. It's the difference between a proximate cause (codepage differences) and an ultimate cause (arbitrary decision-making in creating those codepages).

Is there one single person who made the decision to ignore apostrophes and quotation marks, that we can beat savagely?