Description of the IRIZ KanjiBase on the ZenBase CD



     _________________________________________________________________

                              The IRIZ KanjiBase

  by Christian Wittern and Urs App


     _________________________________________________________________

    1. Why KanjiBase?
    2. What is KanjiBase?
    3. How to use KanjiBase
    4. Technical information about our coding approach


     _________________________________________________________________

    Summary

   KanjiBase was developed by Christian Wittern in the framework of the
   Zen KnowledgeBase project. It is a new method of supplying missing
   Chinese characters through placeholders that are both standardized
   and system-independent. It uses the Taiwanese government's CNS code
   (so far 48,000 characters) to supplement existing codes such as JIS
   or Big5, or future ones such as Unicode. You can thus keep using your
   habitual word processing and database programs, whether under
   Japanese, Taiwanese, mainland Chinese, or Korean Windows, and assign
   stable, portable codes to characters that are not present in those
   codes. The Macintosh (and later the Unix) environment will be
   supported, and the KanjiBase will soon be accessible on the Internet.
   In the present implementations, characters not available in your
   system can be looked up in the KanjiBase and pasted into your document
   as a printable graphic image linked to an unambiguous, portable,
   SGML-conformant code. Documents containing KanjiBase characters can be
   printed on your ordinary printer by using, for example, MS Word for
   Windows, Word for Macintosh, or Werner Lemberg's CJK TeX method, which
   works on several platforms.
     _________________________________________________________________

Why KanjiBase?



   Due to the structure of the Chinese script and the tools available
   today for processing it on computers, there are always Chinese
   characters that cannot be input. Although they may make up no more
   than 1 to 5% of a classical text, they pose a serious problem. So far,
   each individual and institution has created its own code or
   placeholder for such characters, resulting in data that cannot be
   exchanged and that conforms to no commonly accepted standard.

   Rather than defining ad hoc a private encoding for every character
   missing from the code set in use, it is advisable to use standard
   references wherever possible; in this way, data become exchangeable
   and database maintenance possible. We carefully evaluated all
   available character codes for Chinese characters and came to the
   conclusion that the Taiwanese CNS code furnishes the best starting
   point as it is large, well defined, and builds on the widely used
   Big-5 code.

   However, what was needed was not just a large character set; rather,
   it was a method to use those characters in combination with whatever
   system and kanji code you have installed on your machine. In other
   words, a good method needs to be system-independent while not
   preventing the use of those systems. Like the way accented characters
   are handled on the World Wide Web, an entirely ASCII-based method of
   encoding characters was sought -- but in our case, we needed thousands
   and thousands of such references.

What is KanjiBase?

   The foundation of KanjiBase, the method invented by Christian Wittern
   to encode such an extended character set, works by inserting ASCII
   placeholders where a character is missing in your system or in the
   national code that you are using. This can be useful for text
   databases or ordinary word processing requirements. However, through
   these references, one can also more easily convert texts among
   different encodings (such as JIS or GB or Big5) or achieve varied
   levels of unification for specific needs.

   Unlike other large code sets, the Chinese National Standard (CNS),
   from which KanjiBase takes its codepoints, has a very close
   relationship to the Big-5 code that is widely used today. Although
   other East-Asian code sets do not merge as well with KanjiBase as
   Big5, the same references can also be used to represent characters not
   in those other code sets (for example JIS in Japan or GB in mainland
   China). KanjiBase thus is a way to extend any of these code sets, not
   just Big5, and to let you continue working in the habitual OS and
   application environment while having many more Chinese characters at
   your disposal.

   The KanjiBase encoding not only facilitates and standardizes the use
   of lacking characters but can also serve as the foundation for
   character code conversions of various kinds. For example, in a Big-5
   to JIS conversion, many characters will be lacking in JIS. The
   KanjiBase encoding strategy allows representing these lacking
   characters by its placeholders which can be transformed into printable
   bitmaps if needed (for example for proofreading). Another example:
   When doing the same conversion, one can use the KanjiBase encoding in
   order to achieve different degrees of strictness of code conversion
   depending on one's needs. If one uses the characters in a scholarly
   article, one may want the strictest conversion which reflects even
   slight differences of the glyphs. On the other hand, when one aims at
   making a concordance, a higher degree of unification may be needed to
   facilitate looking up characters in the printed product. The code
   conversion tool suite that is currently being developed at the IRIZ
   includes a tool that demonstrates such different degrees of conversion
   strictness from JIS to Big5 and vice versa. However, other
   codes such as the mainland Chinese GB code or the Korean KSC can also
   be accommodated on this basis.
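
   As a rough illustration of the fallback idea described above, the
   following Python sketch converts text to a target code set and writes
   an ASCII placeholder wherever the target lacks a character. It is not
   the IRIZ conversion tool: the handler name "kanjibase", the function
   names, and the caller-supplied lookup table are all assumptions made
   for this example. Because no CNS lookup table is included, characters
   absent from the table are written as Unicode references such as
   &U-4E00;.

```python
import codecs


def make_placeholder_handler(code_table):
    """Return an encode-error handler that writes ASCII placeholders.

    code_table is a hypothetical, caller-supplied dict mapping a
    character to a KanjiBase code such as "C3-213A". Characters that
    are not in the table are written as &U-XXXX; references instead.
    """
    def handler(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        pieces = []
        # exc.start:exc.end spans the characters the codec could not encode.
        for ch in exc.object[exc.start:exc.end]:
            code = code_table.get(ch, "U-%04X" % ord(ch))
            pieces.append("&" + code + ";")
        # Resume encoding right after the unencodable span.
        return "".join(pieces), exc.end
    return handler


# The handler name "kanjibase" is arbitrary; an empty table means every
# missing character falls back to the &U-XXXX; form.
codecs.register_error("kanjibase", make_placeholder_handler({}))


def encode_with_placeholders(text, encoding):
    """Encode text, replacing unencodable characters by placeholders."""
    return text.encode(encoding, errors="kanjibase")
```

   Calling encode_with_placeholders(text, "shift_jis"), for instance,
   would keep every character that Python's Shift JIS codec knows and
   write the rest as placeholders. A real converter would also need the
   variant tables that decide how strictly glyph differences are
   preserved, which this sketch does not model.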

How to use KanjiBase



   Due to the whole logic of KanjiBase, no specialized tools or expensive
   equipment is needed for using our codes in your Chinese texts. Using
   our Windows implementation, or the Electronic Bodhidharma home page on
   the Internet (WWW.iijnet.or.jp/IRIZ/irizhome.html), you can look up a
   character and copy the code into your texts. While the Internet and
   Macintosh implementations of KanjiBase are still in preparation, our
   ZenBase CD1 contains KanjiBase for Windows, which provides a tool
   to select characters and insert them into a word processing document,
   or to copy them to the clipboard, where they are available to any
   Windows application. For the Macintosh, support is more limited at
   this time; we only include a set of macros for use with Word6 that
   converts codes into bitmaps for reading and printing purposes.

  Implementation for Windows



   As a standalone implementation for Windows would defeat the purpose of
   supplementing current user environments, we have for a start built one
   that interfaces with the most commonly used word processing program
   today, MS Word for Windows version 6 (English, Japanese, and Chinese
   versions were tested). After having installed KanjiBase for Windows on
   your system, you can set the option "Paste to Word" to on, and it will
   automatically paste the KanjiBase code of the needed character into
   your document. The CEF2BMP macro in our Kanjitools for Winword
   transforms the code into a displayable and printable bitmap. The code
   itself is embedded as a hidden comment, so that even saving the
   document as a text file will not obliterate the inserted KanjiBase
   code.
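
   Since the codes survive in a plain-text export, they can be collected
   with a simple scan. The Python sketch below counts the KanjiBase
   placeholders found in a string. The function name and the pattern
   (classifier digit or X/Y, four uppercase hexadecimal digits) are
   assumptions based on the examples in this document; they are not part
   of KanjiBase for Windows.

```python
import re
from collections import Counter

# Assumed shape: "&C", one classifier character, "-", four hex digits, ";".
CODE = re.compile(r"&C[0-9XY]-[0-9A-F]{4};")


def count_codes(text):
    """Count how often each KanjiBase placeholder occurs in text."""
    return Counter(CODE.findall(text))


sample = "abc &C3-213A; def &C4-423A; ghi &C3-213A;"
for code, n in count_codes(sample).most_common():
    print(code, n)
```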

   Interfaces for other word processing applications on Windows are of
   course possible, but for the time being we prefer to focus on the
   Macintosh and Internet implementations. Please contact us if you are
   willing to construct such an interface yourself.

  Implementation for Macintosh

   At present, we have only a set of macros that work with Word6 on the
   Mac. They allow converting KanjiBase codes into bitmaps for reading
   and printing. The KanjiBase code is embedded as a hidden comment; thus
   you can save the file as a text file and will not lose this
   information. A fuller implementation on the Macintosh that also allows
   searching the KanjiBase and pasting codes into documents is in
   preparation.

  Implementations for other platforms



   Currently no other platforms are supported, but we are working on an
   Internet implementation that will serve the needs of students,
   teachers, and researchers.

  Werner Lemberg's CJK TeX



   Werner Lemberg has developed CJK TeX, a platform-independent
   implementation with great potential since the TeX typesetting system
   is available on most platforms. The CJK TeX package allows you to use
   Chinese, Korean and Japanese text in your LaTeX documents; if needed,
   these languages can even be used at the same time. Mr. Lemberg also
   added support for CNS via the KanjiBase code references. The most
   recent version, version 2.5, is included on the ZenBase CD1; please
   refer to the included documentation for the details.

Technical information



   The KanjiBase placeholders are constructed as in the following
   example: &C3-213A;. For better understanding, a detailed description
   is given here; less technically minded readers can skip it without
   fear of missing important information. Several elements can be
   distinguished in the above example:
   The first and the last characters, & and ;, are the opening and
   closing delimiters; they signal to the processing software and the
   human reader that the characters in between are to be treated
   differently from the other parts of the data stream. The following C
   signals to KanjiBase-aware software that what follows is a CNS code
   (for a code reference table of this code see ***).

   What follows up to the ; is the code designating the character itself.
   This code again consists of two parts: a classifier that specifies
   which of the areas covered by KanjiBase the code belongs to, and
   (after the dash) a four-digit hexadecimal code. The following is a
   list of the allowed classifiers and their semantics:
     * 0 (example &C0-1234;): Big5 code. This will appear only in texts
       that do not have Big5 as their base code. It covers the same area
       as CNS levels 1 and 2, but current implementations like KanjiBase
       for Windows allow only Big5 codes here. The codes are valid only
       in the ranges from A440 to C67E and from C940 to F9D5, with the
       exception of the codes C94A and DDFC.
     * 3-7 (example &C4-423A;): CNS levels 3 to 7. The codes are in the
       range from 2121 to a maximum of 7C51; the upper limit varies from
       level to level. Additional levels will be added here as they are
       published.
     * X, Y (example &CY-1234;): These are temporary encodings of
       characters that have not yet been assigned a CNS code. The
       assignment is temporary because these characters might be included
       in additional levels of CNS that are planned. However, the need to
       add characters that are not yet part of public codes will always
       be there. X codes are reserved for use by the IRIZ; Y codes are
       generally available. Hexadecimal codes are assigned sequentially
       beginning at 2121.
     * U (see below)
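
   The rules in this list can be sketched in Python as follows. This is
   an illustration only, not code from KanjiBase: the names Ref, parse
   and is_valid are invented here, the pattern accepts uppercase
   hexadecimal digits only, and levels 3 to 7 are checked against the
   overall maximum 7C51 because the per-level limits are not listed in
   this document.

```python
import re
from typing import NamedTuple, Optional

# "&C" + classifier + "-" + four hex digits + ";", or the "&U-XXXX;" form.
PLACEHOLDER = re.compile(r"&(?:C([0-9XY])|(U))-([0-9A-F]{4});")


class Ref(NamedTuple):
    classifier: str  # "0", "3".."7", "X", "Y" or "U"
    code: int        # value of the four hexadecimal digits


def parse(token: str) -> Optional[Ref]:
    """Split one placeholder into classifier and code, or return None."""
    m = PLACEHOLDER.fullmatch(token)
    if m is None:
        return None
    return Ref(m.group(1) or m.group(2), int(m.group(3), 16))


def is_valid(ref: Ref) -> bool:
    """Apply the range rules given in the list of classifiers."""
    c, code = ref
    if c == "0":  # Big5 code
        in_range = 0xA440 <= code <= 0xC67E or 0xC940 <= code <= 0xF9D5
        return in_range and code not in (0xC94A, 0xDDFC)
    if c in "34567":  # CNS levels 3 to 7 (overall maximum only)
        return 0x2121 <= code <= 0x7C51
    if c in "XY":  # temporary codes, assigned sequentially from 2121
        return code >= 0x2121
    return c == "U"  # Unicode reference; other classifiers are rejected


print(parse("&C3-213A;"), is_valid(parse("&C3-213A;")))
```

   Returning None for malformed input, rather than raising an error,
   lets a caller leave ordinary text containing & and ; untouched.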



   Applications that support KanjiBase tags should at least be able to
   process the first two types; but support for X and Y codes is strongly
   recommended.

   Another type of character will be encountered in documents. For
   characters from other East Asian code sets that are not available in
   KanjiBase (typically modern simplified characters), no private
   encoding should be used but rather a reference to the corresponding
   codepoint in Unicode. Such references should follow the
   recommendations developed by Rick Jelliffe for SGML Open and look like
   &U-4E00; for the Unicode character U+4E00.
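
   On a system that handles Unicode text, such references can be resolved
   directly. The short Python sketch below replaces each &U-XXXX;
   reference by the character it names; the function name is an
   assumption, and skipping the surrogate range D800 to DFFF is a choice
   made for this example rather than part of the recommendation.

```python
import re

UREF = re.compile(r"&U-([0-9A-F]{4});")


def expand_unicode_refs(text):
    """Replace each &U-XXXX; reference by the character it names."""
    def replace(match):
        codepoint = int(match.group(1), 16)
        # Surrogate code points are not characters; leave them as written.
        if 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)
    return UREF.sub(replace, text)


print(expand_unicode_refs("The character &U-4E00; means 'one'."))
```
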
    Authors: Christian Wittern and Urs App
    Last updated: 95/04/23