KERMIT PROTOCOL FOR JAPANESE TEXT FILE TRANSFER

Christine Gianone, Frank da Cruz
The Kermit Project
Columbia University

Dr. Hirofumi Fujii
Japan National Laboratory for High Energy Physics

30 September 1999
[ Updating the original draft of 28 March 1991 ]

ABSTRACT

Several different Kanji computer codes exist, so a unified method is
needed for transferring Kanji text between computers that use different
codes. Several methods are examined, and one is chosen. This proposal
does not address transfer of Japanese files composed only of Roman and
Katakana single-byte characters, for which well-defined mechanisms
already exist.

BACKGROUND

The Kermit protocol transfers text files using a common intermediate
representation for text. The protocol was extended in 1989-90 to allow
for use of different standard character sets within Kermit packets. To
transfer text written in a particular language, the sending program
translates from its local character set to the standard set for that
language, and the receiver translates from the standard set to its own
local set. Thus, each Kermit program needs to know only its local
character sets and the standard ones, and does not need to know about
nonstandard sets used by other computers.

MS-DOS Kermit 3.0 for the IBM PC, C-Kermit 5A for UNIX and VAX/VMS, and
IBM mainframe Kermit 4.2 for MVS/TSO and VM/CMS were the first to
implement character-set translation; this included most European
languages written with Roman and Cyrillic alphabets (and, in some cases,
also Greek, Hebrew, and Japanese Katakana).

Extending the protocol to the full Japanese writing system, however,
poses special problems because it requires a mixture of three distinct
character sets -- Roman, Katakana, and Kanji -- and because the Kanji
character set is very large compared to Western alphabets or
syllabaries.

CHARACTER SETS USED IN JAPAN

Japanese standards exist for three different character sets:

1. JIS Roman, ISO 646 Japanese Version, ISO registration number 14.
   A 94-character single-byte set identical to US ASCII except in
   positions 05/12 (ASCII backslash replaced by Yen sign) and 07/14
   (ASCII tilde replaced by macron or overbar). Hereafter referred to
   as Roman.

2. JIS Katakana, ISO registration number 13. A 94-character
   single-byte set containing Katakana characters in columns 2 through
   5, with columns 6 and 7 unused.

   JIS X 0201 specifies the combination of (1) and (2) into a
   single-byte 8-bit character set.

3. JIS X 0208 Multiple-Byte Character Set, ISO Registration Number 87.
   A two-byte character set comprising approximately 6000 Kanji
   characters, in which each byte is a 7-bit value, and the high-order
   ("8th") bit of each byte is unused. JIS X 0208 includes not only
   Kanji, but also Roman, Hiragana, Katakana, Greek, and Cyrillic
   characters. The non-Kanji JIS X 0208 characters are double width and
   not normally used. JIS X 0208 consists of approximately 6400 defined
   characters, with additional space reserved for nonstandard
   characters ("Gaiji").

Some Japanese computers use entirely different character sets, for
example the EBCDIC Kanji that is used on IBM and similar mainframes.
Most Japanese computers, however, use a combination of the three
standard character sets. Different methods are used to allow characters
from these different sets to coexist within a file.

Shift-JIS, commonly found on PCs, uses special byte values 81-9F (hex)
and E0-EF (hex), extended through FC by some vendors, as lead-ins for
two-byte Kanji sequences. The second byte lies in the range 40-7E or
80-FC (hex), so its 8th bit can be 0 or 1. Bytes in the 00-7F range are
Roman single-byte characters, and bytes in the A1-DF range are Katakana
single-byte characters. Shift-JIS shifts each 96-byte Kanji plane left
by two columns, so all control regions are filled with graphics, and
also fills up the unused columns in Katakana with Kanji. Shift-JIS can
be translated to and from EUC by algorithm; no table is necessary.
Microsoft Code Page 932 and Hewlett Packard HP-15 are the same as
Shift-JIS.
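
The algorithmic translation between Shift-JIS and EUC for double-byte
characters can be sketched as follows. This is an illustrative sketch
only, not code taken from any Kermit program. The function names are
hypothetical. It assumes that the EUC form of a Kanji is its two-byte
JIS X 0208 code with the 8th bit of each byte set to 1, and it accepts
only lead bytes 81-9F and E0-EF (hex) and second bytes 40-7E and 80-FC
(hex), the ranges onto which the 94 rows of JIS X 0208 map. Single-byte
Roman and Katakana characters are outside its scope.

```python
def sjis_to_euc_pair(b1, b2):
    """Convert one Shift-JIS double-byte character to an EUC byte pair.

    Hypothetical helper for illustration; handles JIS X 0208 only.
    """
    if not (0x81 <= b1 <= 0x9F or 0xE0 <= b1 <= 0xEF):
        raise ValueError("not a Shift-JIS lead byte: %02X" % b1)
    if not (0x40 <= b2 <= 0xFC) or b2 == 0x7F:
        raise ValueError("not a Shift-JIS second byte: %02X" % b2)

    # Each Shift-JIS lead byte covers two consecutive JIS rows.
    row_pair = b1 - (0x81 if b1 <= 0x9F else 0xC1)   # 0..46

    if b2 >= 0x9F:
        # Upper half of the second-byte range: the even-numbered row.
        j1 = row_pair * 2 + 0x22
        j2 = b2 - 0x7E
    else:
        # Lower half: the odd-numbered row; skip the gap at 7F.
        j1 = row_pair * 2 + 0x21
        j2 = b2 - 0x1F - (1 if b2 >= 0x80 else 0)

    # EUC sets the 8th bit of both JIS X 0208 bytes.
    return j1 | 0x80, j2 | 0x80


def euc_to_sjis_pair(e1, e2):
    """Convert one EUC double-byte Kanji to a Shift-JIS byte pair."""
    if not (0xA1 <= e1 <= 0xFE and 0xA1 <= e2 <= 0xFE):
        raise ValueError("not an EUC Kanji byte pair")

    j1, j2 = e1 & 0x7F, e2 & 0x7F        # back to 7-bit JIS X 0208
    row_pair = (j1 - 0x21) // 2
    b1 = row_pair + (0x81 if row_pair < 0x1F else 0xC1)

    if j1 & 1:
        # Odd JIS first byte (21, 23, ...): lower second-byte range.
        b2 = j2 + 0x1F
        if b2 >= 0x7F:
            b2 += 1                       # 7F is never a second byte
    else:
        # Even JIS first byte: upper second-byte range.
        b2 = j2 + 0x7E
    return b1, b2
```

For example, sjis_to_euc_pair(0x93, 0xFA) returns (0xC6, 0xFC), and
euc_to_sjis_pair(0xC6, 0xFC) returns (0x93, 0xFA). Only arithmetic on
the byte values is involved, which is why no translation table is
needed for this pair of encodings.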
Shift-JIS is also used on the Macintosh and on certain UNIX platforms
such as Sony NEWS.

JIS-7 embeds ISO 2022 character set designation sequences in the text
to switch among double-byte Kanji and single-byte Roman/Katakana. All
Kanji bytes are encoded with the 8th bit set to zero. HP-16 is the same
as JIS-7.

AT&T EUC (Extended Unix Code) for Japan (sometimes called JAE, for
Japanese Application Environment) sets the 8th bit of each Kanji byte
to 1, allowing Kanji bytes to be easily distinguished from Roman
(ASCII) bytes, whose 8th bits are 0. A single-shift mechanism is used
to select single-byte Katakana characters:

  0XXXXXXX  A Roman character (control or graphic). Stands alone.

  10001110  Single Shift 2 (SS2). This means the next byte is a JIS
            Katakana character. The following byte also has its
            high-order bit set to 1.

  100XXXXX  A C1 control character. C1 controls other than SS2 can be
            used to designate Gaiji.

  1YYXXXXX  (YY is not 00) The first byte of a 2-byte Kanji code. The
            second byte also has its high-order bit set to 1.

This scheme is compliant with ISO 2022 (JIS X 0202) as used in the
8-bit environment, with JIS Roman designated to G0 and invoked to GL,
JIS Kanji designated to G1 and invoked to GR, and JIS Katakana
designated to G2 and invoked on a per-character basis with SS2.

ALTERNATIVES

Kermit's transfer character set for Japanese should be a national or
international standard, or closely related to one. This leaves us with
the following choices:

1. JIS X 0208 (Level 1)

   Advantages:

   . Simplicity. This character set contains all the characters of
     Roman and Katakana as well as Kanji, and all characters are the
     same size.
   . Clarity. JIS X 0208 has an ISO registration number that can be
     used as an announcer.

   Disadvantages:

   . Noninvertibility: the distinction between single-byte and
     double-byte Roman and Katakana characters is lost.
   . Transmission overhead of representing each Roman (ASCII) value in
     two bytes.
   . No computer uses JIS X 0208 by itself, so all Kermit programs will
     have to translate between file and transfer character sets.

2. Japanese EUC

   Advantages:

   . Matches common usage, which mixes three character sets within a
     file.
   . Half-width Roman and Katakana are not sacrificed.
   . No translation necessary for many computer systems, such as UNIX
     and VMS, which already use EUC.

   Disadvantages:

   . Variable-length characters.
   . High transmission overhead for 8-bit values in the 7-bit
     environment.

3. ISO 10646 / UNICODE (Not ready in 1991, see below)

4. A combination of JIS Roman, JIS Katakana, and JIS Kanji, with ISO
   2022 designators and shifts (ISO 2022-JP):

   Advantages:

   . Efficient transmission in the 7-bit environment.

   Disadvantages:

   . Complexity. Kermit program must fully implement ISO 2022.
   . Kermit's ISO 2022 extension has never been implemented [the idea
     was later dropped].

Japanese EUC was chosen as the transfer character set for Japanese text
after consultation among various constituencies in Japan.
Implementations appeared in all major Kermit programs in 1991, with
appropriate file character sets (JIS-7, Shift-JIS, DEC Kanji, various
EBCDIC Kanjis) for each platform. The transmission penalty on 7-bit
connections was addressed by Kermit's Locking Shift option, which also
benefits other "right-handed" character sets, such as ISO 8859
Cyrillic, Hebrew, Greek, etc.

  Kermit name:               JAPAN-EUC
  ISO Registration Numbers:  14 (JIS Roman, left half of JIS X 0201),
                             87 (JIS X 0208),
                             13 (JIS Katakana, right half of JIS X 0201).
  Kermit Designator:         I14/87/13.

Unicode / ISO 10646 support was added in 1999, with UCS-2 and UTF-8
allowed as both file and transfer character sets. In principle any
Japanese character set can be converted to Unicode, and Unicode can be
converted to any Japanese character set (with the obvious potential for
loss of non-JIS characters).

REFERENCES

1. Gianone, Christine M., "A Kermit Protocol Extension for
   International Character Sets", Columbia University, April 1990.
2. JIS X 0208 Multiple-Byte Character Set.
3. JIS X 0201 Single-Byte Character Set.
4. ISO 2022 "... Code Extension Techniques" (also JIS X 0202).

(End)