---------- Forwarded message ----------
Date: Mon, 28 Aug 1995 22:02:17 +0800
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
To: b83050@cctwin.ee.ntu.edu.tw
Subject: cefintro.htm#WHY
_________________________________________________________________
The IRIZ KanjiBase
by Christian Wittern and Urs App
_________________________________________________________________
1. Why KanjiBase?
2. What is KanjiBase?
3. How to use KanjiBase
4. Technical information about our coding approach
_________________________________________________________________
Summary
KanjiBase was developed by Christian Wittern in the framework of the
Zen KnowledgeBase project. It is a new method of supplying missing
Chinese characters through placeholders that are both standardized and
system-independent. It uses the Taiwanese government's CNS code (so
far 48,000 characters) to supplement existing codes such as JIS or
Big5, or future ones such as Unicode. Thus one can keep using one's
usual word processing and database programs and assign stable,
portable codes to characters that are not present in those codes.
The approach works in whatever environment you are used to, whether
it is Japanese, Taiwanese, mainland Chinese, or Korean Windows. The
Macintosh (and later the Unix) environment will be supported, and the
KanjiBase will soon be accessible on the Internet.
In the present implementations, characters not available in your
system can be looked up in the KanjiBase and pasted into your document
as a printable graphic image linked to an unambiguous, portable,
SGML-conformant code. Documents containing KanjiBase characters can be
printed on your ordinary printer by using, for example, MS Word for
Windows, Word for Macintosh, or Werner Lemberg's CJK TeX method which
works on several platforms.
_________________________________________________________________
Why KanjiBase?
Due to the structure of the Chinese script and the tools available
today for processing it on computers, there are always Chinese
characters that cannot be input. Although they may make up from less
than 1% to 5% of a classical text, they pose a serious problem. So far,
each individual and institution has created its own code or placeholder
for such characters, resulting in data that cannot be exchanged and
conform to no commonly accepted standard.
Rather than defining ad hoc a private encoding for every character
missing from the code set in use, it is advisable to use standard
references wherever possible; in this way, data become exchangeable
and database maintenance possible. We carefully evaluated all
available character codes for Chinese characters and came to the
conclusion that the Taiwanese CNS code furnishes the best starting
point as it is large, well defined, and builds on the widely used
Big-5 code.
However, what was needed was not just a large character set; rather,
it was a method to use those characters in combination with whatever
system and kanji code you have installed on your machine. In other
words, a good method needs to be system-independent while not
preventing the use of those systems. Like the way accented characters
are handled on the World Wide Web, an entirely ASCII-based method of
encoding characters was sought -- but in our case, we needed thousands
and thousands of such references.
What is KanjiBase?
The foundation of KanjiBase, the method invented by Christian Wittern
to encode such an extended character set, works by inserting ASCII
placeholders where a character is missing in your system or the
national code that you are using. This can be useful for text
databases or ordinary word processing requirements. However, through
these references, one can also more easily convert texts among
different encodings (such as JIS or GB or Big5) or achieve varied
levels of unification for specific needs.
Unlike other large code sets, the Chinese National Standard
(CNS) from which KanjiBase takes its codepoints has a very close
relationship to the Big-5 code that is widely used today. Although
other East-Asian code sets do not merge as well with KanjiBase as
Big5, the same references can also be used to represent characters not
in those other code sets (for example JIS in Japan or GB in mainland
China). KanjiBase thus is a way to extend any of these code sets, not
just Big5, and to let you continue working in the habitual OS and
application environment while having many more Chinese characters at
your disposal.
The KanjiBase encoding not only facilitates and standardizes the use
of lacking characters but can also serve as the foundation for
character code conversions of various kinds. For example, in a Big-5
to JIS conversion, many characters will be lacking in JIS. The
KanjiBase encoding strategy allows representing these lacking
characters by its placeholders which can be transformed into printable
bitmaps if needed (for example for proofreading). Another example:
When doing the same conversion, one can use the KanjiBase encoding in
order to achieve different degrees of strictness of code conversion
depending on one's needs. If one uses the characters in a scholarly
article, one may want the strictest conversion which reflects even
slight differences of the glyphs. On the other hand, when one aims at
making a concordance, a higher degree of unification may be needed to
facilitate looking up characters in the printed product. The code
conversion tool suite that is currently being developed at the IRIZ
includes a tool that demonstrates such different degrees of conversion
strictness from JIS to Big5 and vice versa. However, other
codes such as the mainland Chinese GB code or the Korean KSC can also
be accommodated on this basis.
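As an illustration of the placeholder fallback described above, the
following minimal Python sketch converts a text to a target encoding
and substitutes an ASCII reference for every character the target
lacks. It is not the IRIZ conversion tool. The function name
encode_with_placeholders and the lookup table are hypothetical; the
single table entry is a Y-area code as a user might assign it, not a
real CNS assignment. Characters missing from the table fall back to
the &U-...; Unicode reference form described under Technical
information.

```python
from typing import Dict


def encode_with_placeholders(text: str, encoding: str,
                             table: Dict[str, str]) -> bytes:
    """Encode text, replacing unencodable characters by ASCII references.

    `table` maps a character to its KanjiBase placeholder (hypothetical
    data supplied by the caller). Characters missing from the table are
    written as a Unicode reference of the form &U-XXXX;.
    """
    out = bytearray()
    for ch in text:
        try:
            # Encode one character at a time so a failure is local.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            ref = table.get(ch) or "&U-%04X;" % ord(ch)
            out += ref.encode("ascii")
    return bytes(out)


# Hypothetical user-assigned Y code, for illustration only.
EXAMPLE_TABLE = {"\u4e00": "&CY-2121;"}
```

Encoding character by character keeps the sketch short, but it
assumes a stateless target encoding such as Big5 or EUC-JP; a stateful
encoding such as ISO-2022-JP would need an incremental encoder
instead.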
How to use KanjiBase
Because of the way KanjiBase is designed, no specialized tools or
expensive equipment are needed for using our codes in your Chinese
texts. Using
our Windows implementation, or the Electronic Bodhidharma home page on
the Internet (WWW.iijnet.or.jp/IRIZ/irizhome.html), you can look up a
character and copy the code into your texts. While the Internet and
Macintosh implementations of KanjiBase are still in preparation, our
ZenBase CD1 contains the KanjiBase for Windows, which provides a tool
to select characters and insert them into a word processing document,
or to copy them to the clipboard where they are available to any
Windows application. For the Macintosh, the support is more limited at
this time; we only include a set of macros for use with Word6 that
converts codes into bitmaps for reading and printing purposes.
Implementation for Windows
As a standalone implementation for Windows would defeat the purpose of
supplementing current user environments, we have for a start built one
that interfaces with the most commonly used word processing program
today, MS Word for Windows version 6 (English, Japanese, and Chinese
versions were tested). After having installed KanjiBase for Windows on
your system, you can set the option "Paste to Word" to on, and it will
automatically paste the KanjiBase code of the needed character into
your document. The CEF2BMP macro in our Kanjitools for Winword
transforms the code into a displayable and printable bitmap. The code
itself is embedded as a hidden comment, so that even saving the
document as a text file will not obliterate the inserted KanjiBase
code.
Interfaces for other word processing applications on Windows are of
course possible, but for the time being we prefer to focus on Macintosh
and Internet implementations. Please contact us if you are willing to
construct such an interface yourself.
Implementation for Macintosh
At present, we have only a set of macros that work with Word6 on the
Mac. They allow converting KanjiBase codes into bitmaps for reading
and printing. The KanjiBase code is embedded as a hidden comment; thus
you can save the file as a text file and will not lose this
information. A fuller implementation on the Macintosh that also allows
searching the KanjiBase and pasting codes into documents is in
preparation.
Implementations for other platforms
Currently no other platforms are supported, but we are working on an
Internet implementation that will serve the needs of students,
teachers, and researchers.
Werner Lemberg's CJK TeX
Werner Lemberg has developed CJK TeX, a platform-independent
implementation with great potential since the TeX typesetting system
is available on most platforms. The CJK TeX package allows you to use
Chinese, Korean and Japanese text in your LaTeX documents; if needed,
these languages can even be used at the same time. Mr. Lemberg also
added support for CNS via the KanjiBase code references. The most
recent version, version 2.5, is included on the ZenBase CD1; please
refer to the included documentation for the details.
Technical information
KanjiBase placeholders are constructed as in the following example:
&C3-213A;. A detailed description is given here for better
understanding; less technically minded people can skip it without fear
of missing important information. Several elements can be
distinguished in the above example:
The first and the last characters, & and ;, are the opening and closing
delimiters. They signal to the processing software and the human
reader that the characters in between are to be treated differently
from the other parts of the data stream. The following C signals to
KanjiBase-aware software that what follows is a CNS code (for a
code reference table of this code see ***).
What follows up to the ; is the code designating the character itself.
This code again consists of two parts: a classifier that specifies
what kind of code from which of the areas covered by KanjiBase
follows; and (after the dash) a four-digit hexadecimal code. The
following is a list of the allowed classifiers and their semantics:
* 0 (example &C0-1234;): Big5 code. This will appear only in texts
  whose base code is not Big5. It covers the same area as CNS levels
  1 and 2, but current implementations such as KanjiBase for Windows
  allow only Big5 codes here. The codes are valid only in the ranges
  A440 to C67E and C940 to F9D5, with the exception of the codes
  C94A and DDFC.
* 3-7 (example &C4-423A;): CNS levels 3 to 7. The codes range from
  2121 to a maximum of 7C51; the upper limit varies between levels.
  Additional levels will be added here as they are published.
* X, Y (example &CY-1234;): Temporary encodings of characters that
  have not yet been assigned a CNS code. The assignment is temporary
  because these characters might be included in additional CNS levels
  that are planned. However, the need to add characters that are not
  yet part of public codes will always be there. X codes are reserved
  for use by the IRIZ; Y codes are generally available. Hexadecimal
  codes are assigned sequentially beginning at 2121.
* U (see below)
Applications that support KanjiBase tags should at least be able to
process the first two types; but support for X and Y codes is strongly
recommended.
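The placeholder syntax and the ranges listed above can be sketched in
Python as follows. This is only an illustration under stated
assumptions, not the KanjiBase software: the names KanjiRef,
code_in_range and parse_refs are hypothetical, the range test is a
plain numeric comparison, the 7C51 maximum is applied to every CNS
level although the text says the limit varies, and no upper bound is
assumed for X and Y codes because none is given.

```python
import re
from typing import List, NamedTuple

# &C, one classifier (0, 3-7, X or Y), a dash, four hex digits, then ;
PLACEHOLDER = re.compile(r"&C([03-7XY])-([0-9A-Fa-f]{4});")


class KanjiRef(NamedTuple):
    classifier: str   # "0", "3".."7", "X" or "Y"
    code: int         # value of the four-digit hexadecimal code
    in_range: bool    # True if the code lies in the range listed above


def code_in_range(classifier: str, code: int) -> bool:
    """Check a code against the ranges given in the classifier list."""
    if classifier == "0":
        # Big5: A440-C67E and C940-F9D5, except C94A and DDFC.
        ok = 0xA440 <= code <= 0xC67E or 0xC940 <= code <= 0xF9D5
        return ok and code not in (0xC94A, 0xDDFC)
    if classifier in "34567":
        # CNS levels 3 to 7: 2121 up to at most 7C51 (assumed for all).
        return 0x2121 <= code <= 0x7C51
    # X and Y: assigned sequentially from 2121; upper bound not stated.
    return code >= 0x2121


def parse_refs(text: str) -> List[KanjiRef]:
    """Return every syntactically valid &C...; placeholder in text."""
    refs = []
    for m in PLACEHOLDER.finditer(text):
        classifier, code = m.group(1), int(m.group(2), 16)
        refs.append(KanjiRef(classifier, code,
                             code_in_range(classifier, code)))
    return refs
```

The sketch reports the range check separately from the syntax check
because the syntax examples &C0-1234; and &CY-1234; given in the list
fall numerically outside the stated ranges; they appear to illustrate
the form only.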
Another type of reference may be encountered in documents. For
characters from other East Asian code sets that are not available in
KanjiBase (typically modern simplified characters), no private
encoding should be used but rather a reference to the corresponding
codepoint in Unicode. Such references should follow the
recommendations developed by Rick Jelliffe for SGML Open and look like
&U-4E00; for the Unicode character U+4E00.
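For software that works with Unicode strings, resolving such a
reference is straightforward. The following Python sketch is only an
illustration: the function name expand_unicode_refs is hypothetical,
and the pattern assumes exactly four hexadecimal digits, based on the
single example &U-4E00; given above.

```python
import re

# &U-, four hex digits, then ; (digit count assumed from the example)
UNICODE_REF = re.compile(r"&U-([0-9A-Fa-f]{4});")


def expand_unicode_refs(text: str) -> str:
    """Replace each &U-XXXX; reference by the character U+XXXX."""
    return UNICODE_REF.sub(lambda m: chr(int(m.group(1), 16)), text)
```

CNS references such as &C3-213A; do not match the pattern and are
left untouched, so they can be handled by a separate step.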
Authors: Christian Wittern and Urs App
Last updated: 95/04/23