The Library of Congress >> Especially for Librarians and Archivists >> Standards
MARC Standards

MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media

CHARACTER SETS AND ENCODING OPTIONS: Part 3

Unicode Encoding Environment

December 2007



INTRODUCTION

Use of the Universal Coded Character Set (UCS or ISO/IEC 10646) was approved in 1998 as a second encoding for MARC 21 records. To facilitate the movement of records between MARC-8 and Unicode environments, it was recommended for an initial period that the use of Unicode be restricted to a repertoire identical in extent to the MARC-8 repertoire. In 2007, however, such a restriction is no longer appropriate. The full UCS repertoire, as currently defined at the Unicode web site, is valid for encoding MARC 21 records, subject only to the constraints described below.

The Unicode Consortium web site is the most complete and authoritative resource to supplement the information on Unicode given in this document. Correspondences between the MARC-8 and Unicode encodings can be found in Part 5: Code Tables. Some of the complexities of conversion between MARC-8 and Unicode will be discussed in Part 4: Conversions between Encoding Environments.

Note: Only one encoding scheme may be used in a MARC 21 record: MARC-8 or Unicode.


CONSTRAINTS ON UNICODE REPERTOIRE

Exclusions a priori

There are many undefined code points in Unicode codespace; they are reserved for character set expansion and may not be used by any application. Unicode designates a small number of assigned code points as either non-characters or deprecated characters; none of these should be included in a MARC 21 record. Neither is use of surrogate pairs allowed for representing code points beyond FFFF(hex) because surrogates (D800 to DFFF (hex)) have meaning only in a UTF-16 context. UTF-8 allows code points beyond FFFF(hex) to be encoded directly.

MARC 21 as a matter of policy avoids the use of characters in the Private Use Area (PUA) (E000 to F8FF (hex)) as detrimental to effective information exchange. While the initial mapping of EACC to Unicode assigned several ideograph variants and certain other CJK characters to the PUA, those assignments have subsequently been remapped to standard Unicode code points. The latter should be used to represent those characters in future exchanges and, wherever feasible, to replace existing instances of the PUA code points.
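
The exclusions above lend themselves to a mechanical check. The following Python sketch is illustrative only and is not part of the MARC 21 specification. The function name excluded_code_points is hypothetical, and the check relies on the general categories in the Unicode Character Database bundled with the Python interpreter, so what counts as "unassigned" depends on that Unicode version. Deprecated characters are not detected, because the unicodedata module does not expose that property. The private use category also covers the supplementary private use planes, not only the range named above.

```python
import unicodedata

# General categories treated as excluded in this sketch:
#   Cn = unassigned (this category also covers noncharacters)
#   Cs = surrogate code points (D800 to DFFF)
#   Co = private use
EXCLUDED_CATEGORIES = {"Cn", "Cs", "Co"}


def excluded_code_points(text):
    """Return (index, 'U+XXXX', category) for each excluded character."""
    problems = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in EXCLUDED_CATEGORIES:
            problems.append((index, f"U+{ord(ch):04X}", category))
    return problems
```

An empty list means that none of the three categories was found; it does not by itself establish that the data are suitable for exchange.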

Cautions

The original restriction of MARC 21 Unicode character repertoire to the MARC-8 repertoire is no longer practicable because of the increased availability of Unicode-encoded data sources that are not bound by such a limitation. Through a variety of techniques, the most common being copy-and-paste, non-MARC-8 characters can and do get introduced into MARC 21 records. Frequently these characters will escape detection when a record is created, or even when used locally, but they may impede the effectiveness of the data interchange that is the primary purpose of MARC 21. Characters such as the single quotation marks and the apostrophe, compressed to a single character in ASCII because of space limitations, are among the most common to be encountered accidentally. Data in European languages are likely to contain precomposed Latin characters. Users of CJK data may discover characters from the Halfwidth and Fullwidth Forms block (FF00 to FFEF (hex)).

It is infeasible to identify a particular collection of Unicode characters to be prohibited from MARC 21 records. But creators of MARC 21 records should take into account the capabilities of their likely exchange partners as they choose to expand their working repertoire. For limited distributions, agreements among exchange partners can support aggressive repertoire expansion. For general distribution, a more conservative approach is warranted. Such an approach would minimize or avoid entirely the use of certain types of characters. For example, characters in the CJK Compatibility Ideographs area and the several Presentation Forms blocks were included in the Unicode repertoire primarily to accommodate pre-existing standards. In the future fewer applications can be expected to continue supporting the old standards; so avoiding these characters is wise.
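
A creator following the conservative approach might want a report of characters from the blocks mentioned above before distributing records. The Python sketch below is one possible reporting aid, not a list of prohibited characters (the text above is explicit that no such list is feasible). The selection of blocks and the names CAUTION_BLOCKS and caution_characters are assumptions made for illustration; the block ranges themselves are those published by the Unicode Consortium.

```python
# Blocks that exist mainly to accommodate pre-existing standards.
# Which blocks to report is a local policy choice; this list is an example.
CAUTION_BLOCKS = [
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFB00, 0xFB4F, "Alphabetic Presentation Forms"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
]


def caution_characters(text):
    """Return ('U+XXXX', block name) for each character in a listed block."""
    found = []
    for ch in text:
        code_point = ord(ch)
        for low, high, name in CAUTION_BLOCKS:
            if low <= code_point <= high:
                found.append((f"U+{code_point:04X}", name))
                break
    return found
```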

The control function codes defined for MARC 21 are listed in Part 1. In addition, there are a few other format control characters available in Unicode encoding that may be useful for controlling bidirectional display. Aside from those, introduction of new control characters into MARC 21 records should be done only with the greatest caution. In particular, code points in the 0 to 1C(hex) range should not be used.


IMPLEMENTATION

UTF-8 encoding form

Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records. UTF-8 transforms a full 32-bit representation of Unicode code points, or the original 16-bit representation of Unicode (now known as UTF-16), into 8-bit units (octets). A Unicode character can be represented in a single octet or a sequence of two, three, or four octets, depending on its code point.

Only values from 00(hex) to 7F(hex) require a single octet. This part of the repertoire is identical in its UTF-8 encoding to ASCII. This is the reason only ASCII characters are allowed in the leader and other parts of the MARC 21 record on which the parsing of the record depends; and conversely, the reason that UTF-8 is the only Unicode encoding form currently permitted in MARC 21.

In many contexts it is unnecessary to know what the transformed code points look like; knowing the scalar values is sufficient. In other situations, such as examining a dump of a MARC 21 record, or creating certain tables of values, it is necessary to be able to interpret the transformed octets. (See the section UTF-8 Transformation Details below for more information.)

Expressing lengths

Lengths in MARC 21 records are generally expressed in octets rather than characters. This distinction is important in Unicode encoding because of the variability of character length inherent in UTF-8. The record length contained in Leader positions 0-4, and field lengths and starting positions in directory entries are counts of octets, not characters.
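
The difference between character counts and octet counts can be seen in a short Python sketch. The helper names octet_length and leader_record_length are hypothetical. The zero-padded five-digit format reflects the five positions (0-4) named above and is otherwise an assumption about the record structure described in Part 1.

```python
def octet_length(text):
    """Number of octets the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def leader_record_length(record_octets):
    """Format an octet count as the five digits of Leader positions 0-4."""
    if not 0 <= record_octets <= 99999:
        raise ValueError("record length does not fit in five digits")
    return f"{record_octets:05d}"


sample = "\u0160koda"  # U+0160 (S with caron) takes two octets in UTF-8
print(len(sample), octet_length(sample))  # 5 characters, 6 octets
```

A length computed with len() on the decoded string would understate the field length by one octet in this case.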

MARC 21 encoding marker

A Unicode-encoded MARC 21 record must have value a in Leader position 9 (Character coding scheme).
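
As a minimal sketch, a program reading raw records could test this marker before choosing a decoder. The function name is_unicode_record and the sample leaders are hypothetical; the sketch assumes the record is available as bytes and that the leader begins at octet 0 of the record.

```python
def is_unicode_record(record):
    """True if Leader position 9 holds the value 'a' (Unicode encoding)."""
    # Slicing (rather than indexing) yields bytes and tolerates short input.
    return record[9:10] == b"a"


unicode_leader = b"00000nam a2200000 a 4500"  # position 9 is 'a'
marc8_leader = b"00000nam  2200000 a 4500"    # position 9 is blank
print(is_unicode_record(unicode_leader), is_unicode_record(marc8_leader))
```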

MARC field 066

Field 066 (Character Sets Present) is not used in Unicode-encoded MARC 21 records. During conversion of MARC 21 records from MARC-8 encoding to Unicode, field 066 should be deleted.

MARC subfield $6 (Linkage)

Subfield $6 (Linkage) is used in MARC 21 records to link alternate graphic representations of the same data, to identify the presence of specific scripts in a field, and to flag fields in which the display/print directionality of data is right-to-left (e.g., for Arabic script). The subfield $6 script identification code in MARC-8-encoded MARC 21 records identifies MARC-8 character sets, rather than scripts per se; hence the code is irrelevant in the Unicode environment because the character set is always UCS, which has no script identification code value. The script identification code should be dropped from subfield $6 when converting to Unicode from MARC-8 encoding. The Field Orientation Code, which flags a field as having right-to-left display directionality, should be used in Unicode-encoded MARC 21 records. When present, the Field Orientation code is separated from the subfield $6 tag linkage data by two solidus (slash) characters (002F(hex)).

Combining marks (diacritics)

Unicode requires that separately encoded diacritical marks and similar combining characters used with base letters from the Latin and other scripts be encoded following the base letter they modify. This is the opposite of the MARC-8 rule for encoding order. Further, the rules that apply to base letters with more than one combining mark differ between the encodings. In MARC-8, the rule is to encode the combining marks from top to bottom. In Unicode, if one of the marks displays below the base letter and the other above, it is preferable to encode the one below the letter first. Multiple marks in the same typographic space (e.g., above the letter) should be encoded starting with the one that appears nearest the base letter, or, when at the same height, in the order in which they appear in the writing direction of the script, reading left to right (or right to left with right-to-left scripts).
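
The preference for encoding a mark below the letter before a mark above it matches Unicode canonical ordering, which sorts adjacent combining marks by combining class. The Python sketch below illustrates this with the standard unicodedata module. It is an illustration only: this section does not state that records must be normalized, and NFD normalization also decomposes precomposed letters, which may or may not be wanted locally.

```python
import unicodedata

DOT_BELOW = "\u0323"  # combining dot below, combining class 220
ACUTE = "\u0301"      # combining acute accent, combining class 230
MACRON = "\u0304"     # combining macron, combining class 230

# Marks entered "above first" are reordered so the mark below comes first.
typed = "a" + ACUTE + DOT_BELOW
ordered = unicodedata.normalize("NFD", typed)

# Marks of the same class (both above the letter) keep their entered order,
# so the nearest-to-base-first rule must be applied by the data creator.
stacked = "a" + ACUTE + MACRON
unchanged = unicodedata.normalize("NFD", stacked)

print([f"U+{ord(c):04X}" for c in ordered])
```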

Directionality of text

Data are recorded in logical order, from the first character to the last, regardless of field orientation. In the scripts included in the MARC-8 repertoire there are no exceptions to this rule. (One known exception occurs in the Thai script where, to conform with Thai data input standards, vowel characters that display before the consonants they are associated with are recorded before the consonants, instead of after them, which would be the logical way.) In bidirectional scripts such as Arabic and Hebrew, where the dominant writing direction is right to left but numbers are written left to right, the logical order rule still obtains. A full explanation of bidirectionality in Unicode encoding will be found in: The Bidirectional Algorithm; Unicode Standard Annex # 9.


UTF-8 TRANSFORMATION DETAILS

The UTF-8 transformation of a Unicode scalar value into an octet sequence is accomplished by reallocating its bits into octets that begin with bit sequences identifying the function of the octet.

Left-most bits    Meaning of left-most bits for character encoding
0                 character composed of 1 octet
110               first octet of 2-octet character
1110              first octet of 3-octet character
11110             first octet of 4-octet character
10                octet is not the first octet of a character; it is the 2nd, 3rd, or 4th octet of a multi-octet character
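
When examining a dump of a record, the leading bits of each octet can be tested with bit masks. The Python sketch below follows the table above; the function name octet_role and the returned labels are hypothetical. It classifies a single octet only and does not check that a whole sequence is well formed.

```python
def octet_role(octet):
    """Classify one octet (0-255) by its left-most bits."""
    if (octet & 0b10000000) == 0b00000000:
        return "single"          # 0xxxxxxx
    if (octet & 0b11000000) == 0b10000000:
        return "continuation"    # 10xxxxxx
    if (octet & 0b11100000) == 0b11000000:
        return "first of 2"      # 110xxxxx
    if (octet & 0b11110000) == 0b11100000:
        return "first of 3"      # 1110xxxx
    if (octet & 0b11111000) == 0b11110000:
        return "first of 4"      # 11110xxx
    return "invalid"             # 11111xxx never starts a UTF-8 sequence


# E2 84 93 is the UTF-8 form of script small l (2113 hex).
print([octet_role(o) for o in bytes.fromhex("E28493")])
```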

The following patterns show how bits of the scalar value are allocated to UTF-8 octets.

Range (hex)        Unicode scalar value          UTF-8 value
0000 to 007F       00000000 0xxxxxxx             0xxxxxxx
0080 to 07FF       00000yyy yyxxxxxx             110yyyyy 10xxxxxx
0800 to FFFF       zzzzyyyy yyxxxxxx             1110zzzz 10yyyyyy 10xxxxxx
10000 to 10FFFF    000uuuuu zzzzyyyy yyxxxxxx    11110uuu 10uuzzzz

Note: x, y, z, and u are used to show how bits are distributed among UTF-8 octets; the final two octets of a four-octet sequence, not shown here, are identical to the final two of the three-octet sequence. Observe that the first hex digit of a four-octet sequence will always be F because the initial octet begins with bits 11110. Similarly, the first hex digit of a three-octet sequence will always be E, and the first one of a two-octet sequence will be C or D. A second or subsequent octet will begin with hex 8, 9, A, or B.
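
The bit allocation patterns above can be written out with shifts and masks. The following Python sketch is for study purposes; in practice the built-in str.encode("utf-8") does the same work. The function name utf8_octets is hypothetical.

```python
def utf8_octets(scalar):
    """Transform a Unicode scalar value into its UTF-8 octet sequence."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar <= 0x7F:       # 0xxxxxxx
        return bytes([scalar])
    if scalar <= 0x7FF:      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar <= 0xFFFF:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


print(utf8_octets(0x0304).hex().upper())  # combining macron
```

The surrogate range is rejected because surrogate code points are not scalar values and have no UTF-8 form.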

Example of three encodings expressed in binary and hexadecimal notation:

Character               MARC-8      Unicode scalar value    Unicode UTF-8
Comma                   00101100    00000000 00101100       00101100
                        2C (hex)    002C (hex)              2C (hex)
Latin small letter h    01101000    00000000 01101000       01101000
                        68 (hex)    0068 (hex)              68 (hex)
Macron                  11100101    00000011 00000100       11001100 10000100
                        E5 (hex)    0304 (hex)              CC84 (hex)
Hebrew letter tav       01111010    00000101 11101010       11010111 10101010
                        7A (hex)    05EA (hex)              D7AA (hex)
Script small l          11000001    00100001 00010011       11100010 10000100 10010011
                        C1 (hex)    2113 (hex)              E28493 (hex)

No example of a four-octet sequence is shown, but the preceding table gives the pattern. Characters beyond FFFF (hex) are very rarely encountered in MARC 21.
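
For readers who want to see a four-octet sequence, the Python sketch below prints one using the interpreter's built-in UTF-8 encoder. The function name show_utf8 is hypothetical, and the code point 20000 (hex), the first character of the CJK Unified Ideographs Extension B block, is chosen purely as an illustration.

```python
def show_utf8(code_point):
    """Return the scalar value with its UTF-8 octets in binary and hex."""
    octets = chr(code_point).encode("utf-8")
    binary = " ".join(f"{octet:08b}" for octet in octets)
    return f"{code_point:04X} (hex)  {binary}  {octets.hex().upper()} (hex)"


print(show_utf8(0x20000))
# 20000 (hex)  11110000 10100000 10000000 10000000  F0A08080 (hex)
```

The first hex digit of the sequence is F and each following octet begins with 10, as the pattern predicts.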


( 12/04/2007 )