Korean Characters in Unicode

This text was created in the context of the GESS lecture ‘Cultural and Scientific History of East-Asia’, taught by Prof. Dr. Viktoria Eschbach-Szabo during the spring semester 2012 at ETH Zurich, where the author studies computer science. One can also find the text as a PDF file or the LaTeX source here.

Introduction

Communicating across the world is easier than ever before. Messages in many different languages travel between all kinds of devices. Computers have to cope with this heterogeneity and therefore rely on standards for exchanging textual data, which lay the foundations for modelling the symbols of various languages in digital form. A basic understanding of the layers between human writing systems and their representation as bits is relevant for software developers in particular. One standard that has spread widely in recent years and is now well supported by software tools is Unicode.

This article introduces some aspects of how text is processed digitally. It first gives an overview of the Korean writing system. Korean has several unique properties that make it a prime example for studying Unicode. The next part lays out the foundations of the Unicode standard and explains the different layers of Unicode text, for instance how a character abstracts from glyphs. The following part describes the problems of processing the Korean writing system with computers and how Unicode addresses them. One interesting question that arises is to what degree Korean characters should be decomposed into their parts for computing. Finally, the conclusion summarises important concepts to remember when software works with different languages.

Korean writing system

The Korean language has about 78 million speakers worldwide, most of them living on the Korean Peninsula. It can be written with two distinct scripts, both of which use Korean phonology: 한자 hanja and 한글 hangeul (see footnote 1). To take a first example, the most common Korean family name is 金 in hanja or 김 in hangeul. The pronunciation is identical, namely gim. [5]

Development

The hanja writing system dominated in Korea for over a millennium. It is based on traditional Chinese characters, some of which are adapted in a way unique to Korean usage. Hanja is an example of a script using ideograms, that is, characters symbolising ‘the idea of a thing without indicating the sounds used to say it’ [2]. Only a literary elite was able to master the difficulties of the ideograms, leaving the majority of Koreans illiterate until the end of the 19th century. [4]

Hangeul, on the other hand, is an alphabet. It has a manageable number of 40 letters [3], each of which corresponds to a clearly defined sound. Despite some irregularities, these letters make reading and writing far simpler than with hanja. King Sejong the Great, together with selected scholars, developed hangeul and made it known on the Korean Peninsula during the 1440s. However, many of the literary elite were sceptical, since they considered hanja the only legitimate script and were probably also afraid of losing their privileged status. Hangeul therefore had a hard time until it eventually spread widely in the 20th century. While still present in historical documents, hanja characters are rarely found in texts later than the 1940s. Nowadays, some newspapers might use them to disambiguate homonyms or for some abbreviations. The success of hangeul reduced South Korea’s illiteracy rate to less than 2 % [3], and its invention is celebrated every year on ‘Hangeul Day’, a Korean national commemorative day. [4]

Ingredients of hangeul

The individual letters of hangeul are called 자모 jamo. Usually two or three jamo, sometimes more, are placed together in a block, which forms a syllable. There are patterns which specify how to stack jamo in a block. Besides the placement, the relative size of the jamo is traditionally adjusted, so that the whole block is a square of the same size as a hanja character. In order to build words, blocks are concatenated, usually horizontally from left to right, even though the vertical direction with columns growing to the left may be used as well. [4]

Modern hangeul blocks are composed of either two or three jamo. These jamo are divided into the three classes of initial, peak and final characters according to their place inside a syllable. Table 1 gives the possibilities for each of the three jamo classes, which are limited to 19 initial consonants, 21 vowels as peak characters and 27 final consonants. To create a syllable, choose an initial followed by a peak and optionally a final character. Thus, the number of syllables is 19 × 21 = 399 for those with two jamo and 19 × 21 × 27 = 10 773 for those with three jamo, which results in a total count of 11 172 modern hangeul blocks. [1]

Table 1: Modern block composition
Class Jamo
Initial ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ
Peak ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ
Final ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ
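
The count of modern blocks can be checked with a few lines of Python. This is only a minimal sketch: the jamo are copied from Table 1, and the variable names are chosen for this illustration.

```python
# Jamo classes copied from Table 1 (variable names are illustrative only).
INITIALS = "ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ".split()
PEAKS = "ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ".split()
FINALS = "ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ".split()

# Blocks with two jamo: one initial and one peak.
two_jamo = len(INITIALS) * len(PEAKS)
# Blocks with three jamo: every two-jamo block extended by one final.
three_jamo = two_jamo * len(FINALS)
total = two_jamo + three_jamo

print(two_jamo, three_jamo, total)  # 399 10773 11172
```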

Let us look at an example. The English word ‘diary’ can be translated into Korean as 일기 ilgi. Table 2 shows how 일기 consists of the syllables 일 and 기, which in turn comprise three and two jamo respectively. Every jamo is associated with a consonant or a vowel and is pronounced approximately like the English letters given. One exception is ㅇ, which is silent as an initial and sounds like the nasal ng at the end of a block. What is more, 일기 can also mean ‘weather’. As mentioned earlier, hanja sometimes helps to distinguish homonyms. In this case, the hanja characters 日記 for ‘diary’ and 日氣 for ‘weather’ would remove the ambiguity (see footnote 2).

Table 2: Levels of hangeul
Level Hangeul Romanisation
Word 일기 ilgi
Block 일 기 il gi
Jamo ㅇ ㅣ ㄹ ㄱ ㅣ - i l g i

Unicode standard

Unicode tries to collect the characters of virtually all writing systems of the world. It is an established standard and therefore simplifies the development of multilingual software, which was much harder in the past. Version 6.1 of Unicode contains 110 116 characters [1].

Overview

A database provides semantic information about each Unicode character. All characters are given a unique name and are partitioned, based on their usage, into general categories like letters, numbers, punctuation or separators. Another example of semantic information is the numeric value some ideograms represent, such as the hanja character 四 associated with the number 4. The Unicode database stores the number symbolised by such characters, thereby allowing computers to calculate with them. There are many such properties which prove useful for handling characters automatically. They all help to work with or structure the characters, which is crucial given their vast number. [1]
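
As an illustration, Python's standard `unicodedata` module exposes part of this database. The sketch below assumes the Unicode version bundled with the interpreter, which is generally newer than 6.1; the helper `describe` is a name made up for this example.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A hangeul block, a hanja character, a digit and a separator.
for ch in ("밥", "四", "7", " "):
    print(describe(ch))

# The numeric value stored for the hanja character 四 is returned as a float.
four = unicodedata.numeric("四")
print(four + 1)  # ordinary arithmetic works on the stored value
```

The second element of each category code refines the first: `Lo` stands for ‘letter, other’, `Nd` for ‘number, decimal digit’ and `Zs` for ‘separator, space’.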

Moreover, Unicode specifies algorithms to process text. This includes descriptions of how to correctly break text into lines, sort words to arrange them in a dictionary, format numbers to improve their readability, search for a word using a particular equivalence relation and so on. Notice that these processes might depend on the language as well as the origin of a piece of text. The alphabetical order of hangeul, for instance, differs between South Korea and North Korea [4], which is why the language alone is not enough. [1]

Layers of text processing

So far, we have not defined what a character is. Precise definitions of the relevant terms help to structure the explanations about text processing and clarify the role Unicode plays. Think of the selected terms in table 3 as layers, where the hangeul block 밥 bab, meaning ‘steamed rice’, serves as an example.

Table 3: Unicode text layers
Layer Example
Glyph 밥 (as drawn in a particular font)
Abstract character hangeul block 밥 bab
Code point U+BC25
Sequence of bits 00100101 10111100 (binary, UTF-16LE)

A glyph is a written mark with visual features such as form, size and orientation. As soon as the symbolic nature rather than the concrete shape is important, the term abstract character is preferred. Unicode defines an abstract character as a ‘unit of information used for the organization, control, or representation of textual data [1, page 66]’. This means that an abstract character is visually represented by one or more glyphs, which may be written on paper or rendered on a screen. Unicode abstracts from glyphs and instead deals with the interpretation of symbols. [1]

Once the abstract characters are selected, Unicode assigns a number to them. Such a number, called a code point, is one of the first 1 114 112 non-negative integers and is usually written in hexadecimal with a U+ prefix. In other words, the code points range from U+0000 to U+10FFFF. It may be surprising at first that the relationship between abstract characters and code points is not a one-to-one correspondence. Unicode maps some abstract characters to multiple code points or to sequences of code points, the latter seeming natural for hangeul blocks. In fact, the abstract character 밥 in table 3 can be associated not only with the single code point U+BC25, but also with the code point sequence <U+1107, U+1161, U+11B8>. More details are given later. [1]
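
The two representations of 밥 can be inspected directly in Python, whose strings are sequences of code points. This is a small sketch; the helper `code_points` and the variable names are invented for the illustration.

```python
import sys

def code_points(text):
    """List the code points of a string in the usual U+ notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

precomposed = "\uBC25"              # 밥 as a single code point
conjoining = "\u1107\u1161\u11B8"   # 밥 as a sequence of three code points

print(code_points(precomposed))  # ['U+BC25']
print(code_points(conjoining))   # ['U+1107', 'U+1161', 'U+11B8']

# The code space runs from U+0000 to U+10FFFF.
number_of_code_points = sys.maxunicode + 1
print(number_of_code_points)     # 1114112
```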

Code points allow text to be viewed as a sequence of mathematical objects. However, this is still a relatively high level of abstraction, for computers have limited resources to process numbers. In order to store text in a file or exchange it between computers of possibly different architectures, Unicode specifies encoding schemes. An encoding scheme defines how a sequence of code points is translated into a sequence of bits, so that the information can be reconstructed unambiguously by any other program. What needs to be standardised, among other things, is the order of the octets (see footnote 3) belonging to the same number. The code point U+BC25 in table 3, for instance, is converted with the encoding scheme UTF-16LE, which uses little-endian order. This is why the less significant octet, hexadecimal 25 or binary 00100101, takes the left place with the lower address. Unicode provides a special code point U+FEFF, known as the byte order mark or BOM, which can be placed at the start of textual data to indicate its encoding scheme. [1]
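
The effect of the byte order can be reproduced with Python's built-in codecs. The following is a minimal sketch; the variable names are chosen for this example only.

```python
import codecs

text = "\uBC25"  # the hangeul block 밥

little = text.encode("utf-16-le")  # less significant octet first
big = text.encode("utf-16-be")     # more significant octet first

print(little.hex(), big.hex())     # 25bc bc25

# The bit sequence shown in table 3, octet by octet.
bits = " ".join(f"{octet:08b}" for octet in little)
print(bits)                        # 00100101 10111100

# The byte order mark is U+FEFF itself, encoded in the respective order.
bom_little = "\uFEFF".encode("utf-16-le")
print(bom_little == codecs.BOM_UTF16_LE)  # True
```

A decoder that finds the octets FF FE at the start of a UTF-16 stream can therefore conclude that little-endian order is used.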

Equivalence of texts

A common operation is comparing two texts for some kind of equality. For this purpose, Unicode introduces the notion of canonical equivalence. Roughly speaking, two or more sequences of code points are canonically equivalent when they should be interpreted and displayed identically. An equivalence test cannot simply check whether the code points are the same, since abstract characters can be represented differently, as seen before. Instead, the texts are first transformed into a unique representation, and only then are the code points compared. This process of finding one representation or normal form for each set of equivalent texts is called normalisation. A normalisation can either decompose all precomposed code points into their components or vice versa. For example, a decomposing normalisation called NFD maps the code point U+BC25 in table 3 to the canonically equivalent sequence <U+1107, U+1161, U+11B8>. Furthermore, Unicode defines compatibility equivalence, which is a looser form of equivalence. [1]
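
Such an equivalence test might look as follows in Python, using `unicodedata.normalize` from the standard library. The function name `canonically_equal` is an assumption of this sketch, not a term from the standard.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two texts after bringing both into the normal form NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\uBC25"              # 밥 as one code point
conjoining = "\u1107\u1161\u11B8"   # 밥 as three conjoining jamo

# A plain comparison of code points reports a difference.
print(precomposed == conjoining)                   # False
# After normalisation, the two texts turn out to be equivalent.
print(canonically_equal(precomposed, conjoining))  # True
```

Normalising both sides to the composing form NFC instead would work equally well, as long as the same form is used for both texts.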

How Unicode models the Korean writing system

With these basics in place, we can take a closer look at how Unicode handles some peculiarities of the Korean writing system.

Nondecomposable characters

Unicode collects hanja ideograms in a large repertoire of Han characters. This set of 75 215 code points provides ideograms of Chinese origin, which are used across East Asian scripts. Besides Korean, other languages like Japanese and Vietnamese partially borrow Chinese characters, too. Unicode attempts to unify these characters across languages as Han characters, because their appearance remains identical and many even have similar meanings. For example, the primary meaning of the Han character 湯 has evolved over the centuries from ‘hot water’ to ‘soup’ in Chinese, whereas in the context of Korean, this hanja still denotes ‘hot water’. Nevertheless, Unicode assigns 湯 to a single code point U+6E6F. [1]

How does Unicode model hangeul? Every hangeul block is naturally a sequence of jamo, so it makes sense to map jamo separately to code points. Unicode calls such a code point associated with one jamo a conjoining jamo, since several of them join together to form a block. The most common conjoining jamo are found in the range U+1100–U+11FF. The word 일기 in table 2, for instance, consists of the jamo ㅇ, ㅣ, ㄹ, ㄱ and lastly ㅣ again and can accordingly be represented by the code point sequence <U+110B, U+1175, U+11AF, U+1100, U+1175>. As expected, U+1175 appears twice and consequently must be the repeated jamo ㅣ. [1]

To revisit an earlier example, the code point sequence <U+1107, U+1161, U+11B8> corresponds to the hangeul syllable 밥 in table 3. Interestingly, the same jamo ㅂ is associated with distinct code points, namely U+1107 at first and U+11B8 later. The reason is that Unicode lets the code point of a jamo depend on its role, that is to say, whether it is an initial, peak or final character in its block. Table 1 shows that many consonant jamo, including ㅂ, can appear both as an initial and as a final character, which results in different code points.
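
The role of a conjoining jamo is visible in its Unicode character name, where choseong, jungseong and jongseong are the Korean terms for initial, peak and final. The sketch below reads the names through Python's `unicodedata` module; the helper `jamo_role` is a hypothetical name, and it assumes that the role is always the second word of the name, which holds for the code points used here.

```python
import unicodedata

def jamo_role(code_point):
    """Extract the role word from a conjoining jamo's character name."""
    # For example "HANGUL CHOSEONG PIEUP" yields "CHOSEONG".
    return unicodedata.name(chr(code_point)).split()[1]

# The three conjoining jamo of 밥: the same letter ㅂ occurs in two roles.
for cp in (0x1107, 0x1161, 0x11B8):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), jamo_role(cp))
```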

Precomposed characters

All modern hangeul syllables can be represented by a sequence of conjoining jamo. However, Unicode provides an alternative: every modern block is also assigned a single code point of its own. These are the 11 172 precomposed blocks in the range U+AC00–U+D7A3. So, the word 일기 may be represented by the code point sequence <U+C77C, U+AE30> as well. Such precomposed characters can be handled more easily by software and generally use less space. Nonetheless, the former approach based on the decomposition into jamo is the only choice when writing archaic hangeul syllables, present in the Old Korean language. Instead of being precomposed by Unicode, they are built from conjoining jamo. [1]
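
Both claims, the size of the precomposed range and the saving in space, can be checked with a short Python sketch. UTF-8 is picked here merely as one possible encoding for the size comparison, and the variable names are illustrative.

```python
import unicodedata

FIRST, LAST = 0xAC00, 0xD7A3
precomposed_count = LAST - FIRST + 1
print(precomposed_count)  # 11172

# 일기 as five conjoining jamo, then composed by the normal form NFC.
jamo = "\u110B\u1175\u11AF\u1100\u1175"
blocks = unicodedata.normalize("NFC", jamo)
print([f"U+{ord(ch):04X}" for ch in blocks])  # ['U+C77C', 'U+AE30']

# Size of both representations in octets when encoded as UTF-8.
jamo_size = len(jamo.encode("utf-8"))
blocks_size = len(blocks.encode("utf-8"))
print(jamo_size, blocks_size)  # 15 6
```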

When Korean texts are compared for equality, precomposed blocks deserve special attention, since each of them is canonically equivalent to a sequence of conjoining jamo. Therefore, the decomposing normalisation NFD converts hangeul syllables into their sequence of conjoining jamo. Unicode defines algorithms for the decomposition of precomposed blocks as well as the inverse operation of composition. In addition to the specifications in pseudocode, implementations in the Java programming language are given. An alphabetical ordering of Korean words requires comparisons, too, and normally takes hanja ideograms and hangeul syllables as units. Syllables are then sorted according to an order of their jamo. [1]
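
Because the precomposed blocks are laid out systematically, decomposition and composition reduce to integer arithmetic. The following Python sketch illustrates the idea under stated assumptions and is not the code from the standard: the base values are the first code points of the ranges mentioned above, the constant and function names are this sketch's own, and `compose` assumes that its arguments are valid modern conjoining jamo.

```python
import unicodedata

S_BASE = 0xAC00  # first precomposed block
L_BASE = 0x1100  # first initial jamo
V_BASE = 0x1161  # first peak jamo
T_BASE = 0x11A7  # one before the first final jamo, so index 0 means "no final"
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # 27 finals plus "no final"
N_COUNT = V_COUNT * T_COUNT             # blocks per initial: 588
S_COUNT = L_COUNT * N_COUNT             # all modern blocks: 11172

def decompose(block):
    """Split one precomposed block into two or three conjoining jamo."""
    s_index = ord(block) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return block  # not a precomposed hangeul block
    jamo = chr(L_BASE + s_index // N_COUNT)
    jamo += chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    if t_index:
        jamo += chr(T_BASE + t_index)
    return jamo

def compose(initial, peak, final=""):
    """Join conjoining jamo into one precomposed block (inputs assumed valid)."""
    l_index = ord(initial) - L_BASE
    v_index = ord(peak) - V_BASE
    t_index = ord(final) - T_BASE if final else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

print([f"U+{ord(j):04X}" for j in decompose("밥")])  # ['U+1107', 'U+1161', 'U+11B8']
print(compose("\u1100", "\u1175"))                   # 기
# Cross-check against the normalisation of the standard library.
print(decompose("일") == unicodedata.normalize("NFD", "일"))  # True
```

The index of a block is a mixed-radix number with the initial as the most significant digit, which is why integer division and remainder recover the three jamo.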

Conclusion

To sum up, Unicode is a sophisticated standard for text processing, which does much more than merely assigning code points to abstract characters. Knowledge from linguistics and computer science is combined to collect character properties in databases and define algorithms. One aspect concerns the granularity of characters. While hangeul blocks are simply assembled jamo, it may make sense to treat syllables as precomposed characters like hanja ideograms. Because Unicode offers both possibilities for many scripts, it defines notions of equivalence to abstract from these differences on a higher level. This enables users to browse the World Wide Web for a term in their language, without having to care about the underlying representation. Such standards help to make the world smaller.

Footnotes

  1. As a romanisation for this text, the ‘Revised Romanisation of Korean’ was chosen.
  2. Thanks to 노희주 for the idea and help with the example.
  3. One octet is a sequence of eight bits.

References

  1. The Unicode Consortium. The Unicode Standard, Version 6.1.0. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-02-3. URL http://www.unicode.org/versions/Unicode6.1.0/.
  2. Oxford Dictionaries. “ideogram”. Oxford Dictionaries, 2010. URL http://oxforddictionaries.com/definition/ideogram. [Online; accessed 21 April 2012].
  3. Henry J. Amen IV and Kyubyong Park. Korean for Beginners: Mastering Conversational Korean. Tuttle Publishing, first edition, 2010. ISBN 978-0-8048-4100-9.
  4. Wikipedia. Hangul — Wikipedia, The Free Encyclopedia, 2012. URL http://en.wikipedia.org/w/index.php?title=Hangul&oldid=487464426. [Online; accessed 21 April 2012].
  5. Wikipedia. Korean language — Wikipedia, The Free Encyclopedia, 2012. URL http://en.wikipedia.org/w/index.php?title=Korean_language&oldid=488354078. [Online; accessed 21 April 2012].