Full Sail: Power User Tips

Browsing with Character.

Which browser has the most character(s)?

 

          Which type of system can display the greater number of different characters, a text-based system or a graphical system? Which can handle the greater number of different characters, Lynx or the Netscape that you get with Sympatico? The answer to the first question is obviously "a graphical system". Surprisingly enough, the answer to the second question is "Lynx".

          Most Windows-based graphical browsers can only handle the single Windows character set they have installed as the default character set (and handle that incorrectly). Usually that is an incomplete version of the "cp1252" or "iso-8859-1-windows-3.1-latin-1" character set (missing the 's', 'S', 'z', and 'Z' with the caron accents and missing the new Euro symbol). The Unicode numeric entities for the Windows characters may or may not be recognised properly but the entity names are usually not. The numeric HTML character entity "&#8230;" may display as a horizontal ellipsis, '…', but even the latest browsers at the MT&T/Sympatico display will not recognise the synonymous entity name, "&hellip;". Lynx does recognise it: it uses the appropriate character if you have it available and otherwise (if lynx has been configured correctly) displays it as '...'.

          With graphical browsers, if a web page specifies an HTML numeric character entity for a character that is not in the limited character set installed, the browser just displays '?' for the character. HTML entity names get printed in full without rendering or translation. Even those characters that are in the cp1252 character set are not properly recognised if the standard HTML entity numbers or names are used for them instead of the invalid "&#128;" to "&#159;" which Windows-based web-design software is prone to use.
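
          As a rough illustration of the 128-to-159 problem, the Python sketch below (not part of the original column) uses Python's built-in "cp1252" codec to show which Unicode number a few of those Windows byte values really stand for. The sample values and the helper name proper_entity are chosen here purely for illustration.

```python
# Values 128-159 are control codes in ISO-8859-1; the Windows character set
# (cp1252) puts printable characters there instead. A correct numeric entity
# uses the character's Unicode number, not its Windows byte value.
WINDOWS_BYTES = [0x80, 0x85, 0x8A, 0x9F]  # sample values, for illustration only


def proper_entity(byte_value):
    """Return (character, correct numeric entity) for a cp1252 byte value."""
    char = bytes([byte_value]).decode("cp1252")
    return char, "&#%d;" % ord(char)


for b in WINDOWS_BYTES:
    char, entity = proper_entity(b)
    print("invalid &#%d;  ->  %s  (%s)" % (b, entity, char))
```

          A few values in that range (129, for example) have no character assigned in cp1252, so the sketch sticks to values that do.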

          Lynx, on the other hand, can determine (if it is told) which characters your system can display and can substitute an approximation. If your PC does not have the appropriate character, Lynx can:

  1. drop an accent and display a letter unaccented -- "&#225;" or "&aacute;" ('á'), small a with an acute accent, displays as 'a';
     
  2. add a punctuation mark to indicate the accent -- "&#228;" or "&auml;" ('ä'), small a with a dieresis or umlaut mark, displays as 'a:';
     
  3. substitute a character string for the single character -- "&#8730;" or "&radic;" ('√'), the radical or square root sign, is displayed as ' SQRT ' (with the spaces included).
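
          The three kinds of approximation listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not lynx's actual code: the two small tables and the function name approximate are hypothetical, and lynx's own translation tables are far more extensive.

```python
import unicodedata

# Hypothetical tables, for illustration only.
PUNCTUATION_FOR_MARK = {"\u0308": ":"}      # combining dieresis -> ':'
STRING_SUBSTITUTES = {"\u221a": " SQRT "}   # square root sign


def approximate(char, displayable):
    """Return char if it can be shown, otherwise a rough approximation."""
    if char in displayable:
        return char
    if char in STRING_SUBSTITUTES:
        # Strategy 3: substitute a character string.
        return STRING_SUBSTITUTES[char]
    # Split an accented letter into its base letter and combining marks.
    decomposed = unicodedata.normalize("NFD", char)
    base, marks = decomposed[0], decomposed[1:]
    if marks and base in displayable:
        # Strategy 2 when a punctuation stand-in is known for the accent,
        # strategy 1 (drop the accent) when it is not.
        return base + "".join(PUNCTUATION_FOR_MARK.get(m, "") for m in marks)
    return "?"


ASCII = {chr(c) for c in range(32, 127)}
for ch in ("\u00e1", "\u00e4", "\u221a"):   # a-acute, a-dieresis, radical
    print(repr(approximate(ch, ASCII)))
```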

 

 

What do you mean when you write "if it is told"?

          <OLD FOGEY> In the "old days" we didn't have all of those characters. We had to make do with sixty-four characters -- upper-case letters, the digits '0' to '9', and some punctuation characters. And we considered ourselves to be lucky to have them. </OLD FOGEY>

          As bits in computers became cheaper, most computer character sets were expanded to include lower-case letters as well as upper-case. Each computer company used its own proprietary encoding scheme. IBM's EBCDIC encoding was developed to minimize the computing needed to convert punch-card data to characters and vice versa. Other companies preferred an encoding scheme that made sorting character data easier. Everyone went their own way, making exchange of data difficult until the ASCII standard was developed. Then only IBM dared to be different, still using EBCDIC. The others used the ASCII standard which specified the encoding of 128 printable and control characters using seven bits.

          Then integrated circuits and microprocessors were invented. Personal computers came on the market that used eight-bit bytes. It was convenient to use all eight bits to encode characters. Most manufacturers defined the 95 characters from 32 to 126 in their computers to display the same printable characters as were defined by the ASCII character set. Each one, however, chose to display a completely different set of characters for the rest of the possible codes. Some displayed graphical symbols and some chose to include accented letters to increase their computer's utility to non-English users. Some, like IBM, chose to use a mixture of accented characters and characters suitable for producing charts and forms with boxes and bar charts.

          At the time Windows was being developed, an international standard for the use of the upper 128 characters out of the possible 256 had emerged. Microsoft chose to incorporate this standard character set into Windows and then include a few additional characters in the space reserved for high control codes.

          Different languages have different needs for special characters so someone (I think at IBM, but I am not sure) came up with the idea of special "code pages" for the computer. While similar in the lower half of the character set, the code pages differed in which characters they provided in the upper half of the character set.

          An international effort has also been made to catalog and define codes for all the characters needed for various languages. This standard is called "Unicode" and it has become the standard for the Internet. Not all languages have been included in the standard yet but more are added with each new version. The characters for writing Cherokee are a recently approved addition and proposals for other additions include Egyptian Hieroglyphics and the fictional Klingon Alphabet.

          While different computers have different character sets depending on the manufacturer or the installed code pages, almost all of the commonly-used characters have found their way into the Unicode standard. It is possible, then, to determine which characters are displayable on your computer and present you with those when they appear on a web page. Those that are unsupported by your computer can be approximated. Lynx was written with all of these in mind -- at least, all of those which have defined code pages or fixed character sets that lynx has been informed about.

          As an example, "&#12363;&#12431;&#12373;&#12365;" ("かわさき" in Japanese hiragana text) would be displayed by lynx as "kawasaki" unless lynx had been told that you were using a Japanese syllabic code page, in which case it would show up as Japanese text on any computer that supported it. Most graphical browsers, however, would default to displaying "????" if they were not specially configured to display Japanese.
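
          The two outcomes described in that example can be imitated in Python. The sketch below is an illustration only: the four-entry ROMAJI table is hypothetical and covers just these four hiragana, and the conversion to "cp1252" merely stands in for a display that has no Japanese characters.

```python
import html

ENTITIES = "&#12363;&#12431;&#12373;&#12365;"

# Hypothetical romanisation table covering only these four hiragana.
ROMAJI = {"\u304b": "ka", "\u308f": "wa", "\u3055": "sa", "\u304d": "ki"}

# Decode the numeric entities into the characters they name.
text = html.unescape(ENTITIES)

# A display limited to the Windows Latin 1 set has none of these characters,
# so converting with errors="replace" yields one '?' per character.
as_question_marks = text.encode("cp1252", errors="replace").decode("cp1252")

# A transliterating fallback, in the spirit of the behaviour described above.
as_romaji = "".join(ROMAJI.get(ch, "?") for ch in text)

print(as_question_marks)
print(as_romaji)
```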

 

 

Some character sets your computer could have:

Code Page 437 (the standard IBM PC character set)

[A picture of the PC character set.]

Code Page 850 (the IBM PC's multilingual character set)

[A picture of the code-page 850 characters.]

The Windows character set.

[A picture of the typical Windows character set]

The Macintosh character set.

[A picture of a typical Macintosh character set]

You could be using a Cyrillic font.

[A picture of a Russian font.]
... and, yes, the font designer messed up a few characters.

 

 

 

How do I configure lynx to get the right characters?

Configuring lynx to configure lynx.

          <CONFESSION> I almost left this part out. I was so used to operating with lynx in the "advanced" mode that I forgot that not all of the configuration options were available for users operating in "novice" or "intermediate" modes. Thanks to a proofreader ("Hi, Barbara!"), the oversight was noticed before press time. </CONFESSION>

          You have to have lynx set to "ADVANCED" user mode in order to have available the options for changing your character set configuration. If you are already operating in advanced mode, skip to "Actually changing the character set." below. If you are not (or are not sure) then:

          1. Press 's' to get the Setup menu. If your User mode setting reads "User mode: [ADVANCED____]" then skip to the character set configuration.

          2. If the setting reads "[NOVICE______]" or "[INTERMEDIATE]" then you have to select the User mode: field and press ENTER to get the pop-up menu for selecting your user mode. It will look something like this (but the selections may be in a different order):

                        +--------------+
                        | NOVICE       |
                        | INTERMEDIATE |
                        | ADVANCED     |
                        +--------------+

          3. Move to the "ADVANCED" choice in the pop-up window and press ENTER to select it.

          4. Press '>' (shifted '.' on most PCs) to get to the bottom of the Setup form and select the "[Save Settings]" button. Press the ENTER key to save your setting. The next time you get into the Setup form you will have more options available including the ones changed below.

          Note: If you miss the menu and prompts at the bottom of the screen, you can always switch back to novice or intermediate mode once the character set changes below have been made. Just repeat the steps above, selecting the appropriate mode instead of "ADVANCED".

 

 

Actually changing the character set.

          1. Press 's' to get the Setup menu. Part of it will look like this:

   Character set:
   [ISO Latin 1.........]

   Assume charset if unknown :
   [iso-8859-1....................]

(Your settings may vary but these are usually the default settings.)

          2. Use your down-arrow key to move to the "Character set:" field with the default [ISO Latin 1.........] setting and press your ENTER key to get a menu of the character sets supported. There are currently forty of them:

  1. ISO Latin 1
  2. ISO Latin 2
  3. Other ISO Latin
  4. WinLatin1 (cp1252)
  5. DEC Multinational
  6. Macintosh (8 bit)
  7. NeXT character set
  8. KOI8-R Cyrillic
  9. Chinese
  10. Japanese (EUC)
  11. Japanese (SJIS)
  12. Korean
  13. Taipei (Big5)
  14. Vietnamese (VISCII)
  15. 7 bit approximations
  16. Transparent
  17. IBM PC character set
  18. IBM PC codepage 850
  19. PC Latin2 CP 852
  20. DosCyrillic (cp866)
  21. DosArabic (cp864)
  22. DosGreek (cp737)
  23. DosGreek2 (cp869)
  24. DosHebrew (cp862)
  25. WinLatin2 (cp1250)
  26. WinCyrillic (cp1251)
  27. WinGreek (cp1253)
  28. WinHebrew (cp1255)
  29. WinArabic (cp1256)
  30. ISO Latin 3
  31. ISO Latin 4
  32. ISO 8859-5 Cyrillic
  33. ISO 8859-6 Arabic
  34. ISO 8859-7 Greek
  35. ISO 8859-8 Hebrew
  36. ISO 8859-9 (Latin 5)
  37. ISO 8859-10
  38. UNICODE UTF 8
  39. RFC 1345 w/o Intro
  40. RFC 1345 Mnemonic

          The most likely selections that users on the Chebucto Community Net will require are:

  • "7 bit approximations" -- use this if you have an older computer that doesn't support the full IBM PC, Windows, or Macintosh character sets. (TRS-80, Commodore 64, Apple ][e, TI-99/4A and the like come to mind.)
     
  • "IBM PC character set" -- select this if you are connecting with a 'vanilla' IBM PC and a text-based terminal programme or if you are using a Windows-based terminal programme (such as "Terminal" or "HyperTerminal") that insists on using a font with the PC character set.
     
  • "IBM PC codepage 850" -- if you really want to see all of the ISO-8859-1 characters but don't have an ISO-8859-1 font, you may be able to configure your PC to switch to 'code page 850'. It contains all of the characters in the ISO-8859-1 character set but in a different order and also contains some of the box-drawing characters. (The box characters that have single horizontal lines and double vertical ones (or vice versa) and some of the math symbols are replaced with the characters from the ISO-8859-1 character set that would otherwise be missing.) If you then select "IBM PC codepage 850" in the lynx configuration menu, lynx will correctly translate all of the ISO-8859-1 characters to their code page 850 counterparts.
     
  • "Macintosh (8 bit)" -- select this if you are using a relatively new Macintosh with the default character set.
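
          The claim about code page 850 in the list above can be checked with Python's built-in "cp850" and "cp437" codecs. This is a small sketch, not anything lynx itself does, and the helper name missing_from is made up for the example.

```python
# The 96 ISO-8859-1 characters in the upper half of the set (160-255).
latin1_upper = bytes(range(0xA0, 0x100)).decode("iso-8859-1")


def missing_from(codec_name):
    """List the ISO-8859-1 upper-half characters a code page cannot encode."""
    missing = []
    for ch in latin1_upper:
        try:
            ch.encode(codec_name)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


print("missing from cp850:", len(missing_from("cp850")))
print("missing from cp437:", len(missing_from("cp437")))

# Same character, different position: e-acute is 233 in ISO-8859-1
# but 130 in code page 850, which is why a translation step is needed.
print("\u00e9".encode("iso-8859-1")[0], "\u00e9".encode("cp850")[0])
```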

          The rest of the possible selections are rather exotic and I suspect that anyone using one of them will know it and select the appropriate one.

          3. You can then scroll down to the "Assume charset if unknown :" field with its default setting of "[iso-8859-1....................]". Presumably, the selection here should match your "Character set:" setting but I have found that some settings will be changed to "iso-8859-1" when the configuration is saved no matter what you select. The safest thing to select is probably that charset selection that corresponds to your "Character set:" setting and then let lynx change it if it wants to do so. The correspondence is:

Character Set:            Assume charset if unknown :

ISO Latin 1               iso-8859-1
ISO Latin 2               iso-8859-2
Other ISO Latin           x-iso-8859-other
WinLatin1 (cp1252)        iso-8859-1-windows-3.1-latin-1
DEC Multinational         dec-mcs
Macintosh (8 bit)         macintosh
NeXT character set        x-next
KOI8-R Cyrillic           koi8-r
Chinese                   euc-cn
Japanese (EUC)            euc-jp
Japanese (SJIS)           shift_jis
Korean                    euc-kr
Taipei (Big5)             big5
Vietnamese (VISCII)       viscii
7 bit approximations      us-ascii
Transparent               x-transparent
IBM PC character set      cp437
IBM PC codepage 850       cp850
PC Latin2 CP 852          cp852
DosCyrillic (cp866)       cp866
DosArabic (cp864)         cp864
DosGreek (cp737)          cp737
DosGreek2 (cp869)         cp869
DosHebrew (cp862)         cp862
WinLatin2 (cp1250)        windows-1250
WinCyrillic (cp1251)      windows-1251
WinGreek (cp1253)         windows-1253
WinHebrew (cp1255)        windows-1255
WinArabic (cp1256)        windows-1256
ISO Latin 3               iso-8859-3
ISO Latin 4               iso-8859-4
ISO 8859-5 Cyrillic       iso-8859-5
ISO 8859-6 Arabic         iso-8859-6
ISO 8859-7 Greek          iso-8859-7
ISO 8859-8 Hebrew         iso-8859-8
ISO 8859-9 (Latin 5)      iso-8859-9
ISO 8859-10               iso-8859-10
UNICODE UTF 8             unicode-1-1-utf-8
RFC 1345 w/o Intro        mnemonic+ascii+0
RFC 1345 Mnemonic         mnemonic
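
          For anyone who wants to work with the correspondence in a script, a few rows of the table above can be held in a Python dictionary. The sketch below is an illustration only: the dictionary holds a hand-picked subset of rows, the helper name python_knows is made up, and whether Python happens to recognise a name says nothing about what lynx accepts.

```python
import codecs

# A few rows of the table above: menu name -> charset name.
ASSUME_CHARSET = {
    "ISO Latin 1": "iso-8859-1",
    "IBM PC character set": "cp437",
    "IBM PC codepage 850": "cp850",
    "KOI8-R Cyrillic": "koi8-r",
    "Transparent": "x-transparent",
}


def python_knows(charset):
    """True if Python's codec registry recognises this charset name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


for menu_name, charset in ASSUME_CHARSET.items():
    print("%-22s %-16s %s" % (menu_name, charset, python_knows(charset)))
```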

          4. Once you have told lynx what character set you use and what to assume, press '>' (shifted '.' on most PCs) to get to the bottom of the Setup form and select the "[Save Settings]" button. Press the ENTER key to save your settings. If you changed your settings while viewing a web page that might be affected by the change, the change may not be visible yet: the page will still be displayed with the old settings, and pressing Control-R to reload the page may still use the old settings. In that case, press the double-quote character (") twice to toggle lynx's double-quote parsing away from and back to normal. Lynx will then also use your new configuration settings when the page is re-rendered.

          With your lynx configuration set, you may now wish to see how the upper characters 160 to 255 in the ISO-8859-1 character set are displayed on your computer.

          That's all for now. Tune in to our next episodes,

  • "Creating Web Pages with Character(s)" and
     
  • "You Meet the Strangest Characters on the Web".

 

You may direct comments or suggestions about this column to:

Norman L. De Forest,  af380@chebucto.ns.ca

 
