Hebrew Cantillation Marks And Their Encoding

by Helmut Richter

III. Character Codes

Cantillation Marks In Modern Character Codes

Here, the representation of the cantillation marks in two character codes is given:

Unicode, a unified framework for worldwide character encoding. The basis for the assignment of Unicode values to cantillation marks seems to have been the Israeli national code SI 1311-2, whose code values are lower than the Unicode values by a constant difference of 0x04F0. There is a major ambiguity in the definition of two of the marks, which seem to be swapped in the Unicode tables. A separate chapter of this WWW page describes this problem and two minor points in ample detail, hoping that this will assist in removing the ambiguities.
The Michigan-Claremont encoding, used in a collection of computer readable Biblical texts. The rationale is explained in the Michigan Manual. As far as the codes for the cantillation marks are concerned, this encoding looks rather arbitrary; it is intended to be mnemonic regarding the shape and position of the marks.
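
The constant difference between SI 1311-2 and Unicode mentioned above can be illustrated with a minimal sketch. It assumes that the offset of 0x04F0 holds for the code value in question; the function names are hypothetical, and the sample value 0xA1 is merely computed back from U+0591, not taken from the standard itself.

```python
import unicodedata

# Difference between Unicode and SI 1311-2 code values as stated in the text.
SI_1311_2_OFFSET = 0x04F0


def si_to_unicode(si_code: int) -> int:
    """Map an SI 1311-2 code value to the corresponding Unicode code point."""
    return si_code + SI_1311_2_OFFSET


def unicode_to_si(code_point: int) -> int:
    """Map a Unicode code point back to the SI 1311-2 code value."""
    return code_point - SI_1311_2_OFFSET


cp = si_to_unicode(0xA1)
print(f"U+{cp:04X}", unicodedata.name(chr(cp)))  # U+0591 HEBREW ACCENT ETNAHTA
```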

When it comes to the question of how cantillation marks are represented in these character codes, we need to make an important distinction:

A cantillation mark is a diacritical symbol attached to a letter or a word. It is characterised by its shape and its position relative to the letter or word it is attached to.
A cantillation symbol consists of one or two cantillation marks. It is characterised by its meaning and by the rules for its usage in a given context.

In particular, if marks have the same shape and position but different meanings, such as Tipcha, Tarkha and Meayla, they are considered the same mark but distinct symbols.

Now, a character coding can adhere to one of the following strategies:

Codes are assigned to the marks, irrespective of their meaning and their combination to symbols. This includes marks that cannot occur other than in the combination with another mark to form a symbol.
Codes are assigned to the symbols. The corresponding marks must then be derived during the rendering process.
Codes are assigned to the marks, but dependent on what symbols they are components of.
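
The difference between the first two strategies can be sketched as follows. The example assumes that the abbreviations Tip1, May and Tar in the systematic code chart of this article stand for Tipcha, Meayla and Tarkha, and it uses U+0596 (HEBREW ACCENT TIPEHA) as the single mark they share; the dictionary and function names are hypothetical.

```python
# Strategy 2: one code per symbol (values read from the systematic code
# chart in this article, under the assumption stated in the lead-in).
SYMBOL_CODES = {"Tipcha": 0x18, "Meayla": 0x6E, "Tarkha": 0xEE}

# Strategy 1: one code per mark. All three symbols are written with the
# same mark, which Unicode encodes once.
TIPEHA_MARK = "\u0596"  # HEBREW ACCENT TIPEHA

SYMBOL_TO_MARK = {code: TIPEHA_MARK for code in SYMBOL_CODES.values()}


def render(symbol_code: int) -> str:
    """Derive the mark from a symbol code (needed under strategy 2)."""
    return SYMBOL_TO_MARK[symbol_code]


def candidates(mark: str) -> list[str]:
    """List the symbols a mark may stand for (the ambiguity of strategy 1)."""
    return sorted(
        name for name, code in SYMBOL_CODES.items()
        if SYMBOL_TO_MARK[code] == mark
    )


print(candidates(TIPEHA_MARK))
```

Going from symbols to marks is a plain lookup, whereas going from a mark back to a symbol yields several candidates and requires knowledge of the context.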

Each of these ways to proceed has its benefits and drawbacks. Concentrating on marks facilitates reading and writing (i.e. the transformation between written or printed text and its coded representation) but complicates processing the content of the coded text. Violations of the usage standard for symbols can much more easily be dealt with if only the marks are coded, thus making it possible to encode texts even if the marks do not always combine to meaningful symbols: in this spirit, the Michigan-Claremont manual insists "Code what is written, not what is meant."

The two codings presented in the table both follow the first of the strategies, with the Michigan-Claremont code making a few distinctions in the spirit of strategy 3. Therefore, there is sometimes more than one MC code value that corresponds to the same Unicode value.

As both these codes are based on marks as distinct from symbols, a new private code for the symbols and their constituents was developed for the purpose of this exposition. Without such a code, it would have been very difficult to maintain tables of symbols. Such a code can also serve as a basis for privately used characters in a publicly standardised code when a distinction of marks according to strategy 3 above is needed. This private code is explained in the next section together with the abbreviations used in the syntax chapters.

A Systematic Code Reflecting The Semantics Of The Symbols

All publicly standardised character codes for cantillation marks covered in this article are codes for marks irrespective of their combination to symbols and irrespective of the semantics of the symbols. In contrast to that, the syntax description showed that the same mark can be part of different symbols (e.g. Legarmeh (=Paseq) as part of Shalshelet Gadol, of Mahpakh Legarmeh and of several others), and the same symbol can have different semantic significance (e.g. Revia as king or as duke in the 3 books, or Mahpakh Legarmeh with three different possible ranks). When these differences are important, one needs a code reflecting them.

The code proposed below has been developed in order to have a sorting criterion for the tables in this article. It can, however, also be used in texts containing cantillation marks if a finer distinction of the marks is needed than the one by the mere shape and position of the marks, for instance, when a program is written to distinguish the different possible semantics of each given mark. It can be used together with Unicode if its range is embedded into the private use area, e.g. at U+E100 to U+E1FF.

The design principles of this code are:

The code uses code positions 00 to FF, that is, 256 different possible codes.
The top distinction is 21 books (codes 00 to 7F) vs. 3 books (codes 80 to FF).
The next distinction is distinctive symbols (codes 00 to 5B and 80 to DB) vs. conjunctive symbols (codes 60 to 7B and E0 to FB); the remaining code positions are reserved in case one wants to include in the same code three characters which are not cantillation symbols but which can influence the positioning of cantillation symbols: Paseq, Maqqef, and Meteg.
The codes assigned to distinctive symbols are divided into contiguous blocks according to the ranks of the symbols, emperors first, then kings, dukes, and officers.
Within the distinctive symbols, blocks of four consecutive code points, the first one ending with 0, 4, 8, or C, are assigned to pairs of cantillation symbols which can replace each other (the braces in the syntax charts).
Throughout the whole code, blocks of 2 consecutive code points, the first one even (ending with 0, 2, 4, 6, 8, A, C, or E), are assigned to a cantillation symbol each.
Whenever possible, symbols that are related to one another, in particular a lower symbol serving exclusively a particular higher symbol, are given code point distances that are multiples of 20, so that they appear in the same horizontal line in the code chart below:

21 books				3 books
distinctive			conj.	distinctive			conj.
00 SoP0	20 Rvi2		60 Mun	80 SoP0	A0 RvG2		E0 Mun
00 SoP0	20 Rvi2			80 SoP0	A0 RvG2
		44 Paz3		84 OYr1	A4 RvQ2	C4 Paz3	E4 AtH
		46 QaP3	66 Glg	86 AzL1	A4 RvQ2	C4 Paz3	E6 Glg
08 Atn0		48 TlG3	68 Mer	88 Atn1	A8 Dhi2		E8 Mer
08 Atn0		48 TlG3		8A Paz1	AA MpL2		EA MrM
			6C TlQ	8C Rvi1			EC ShQ
			6E May	8C Rvi1			EE Tar
10 Sgl1	30 Zar2	50 Ger3	70 Qad		B0 Tsi2	D0 AzL3	F0 Qad
12 Sha1	30 Zar2	52 Grm3			B0 Tsi2	D2 MpL3	F2 Ill
14 ZqQ1	34 Psh2		74 Mhp	94 RvM1			F4 Mhp
16 ZqG1	36 Ytv2			94 RvM1			F6 MpM
18 Tip1	38 Tvr2	58 Lgm3	78 MeK	98 ShG1
18 Tip1	38 Tvr2	58 Lgm3	7A Dar	98 ShG1
			7C Mf	9C MpL1			FC Mf
		5E Pq	7E Mg	9C MpL1		DE Pq	FE Mg

The above rules assign even code points to symbols. If a symbol consists of two marks, the primary mark gets the same code point, and the secondary mark gets the code point that is higher by one. The primary mark is defined as follows:
- For symbols consisting of a mandatory mark and an optional mark (i.e. a mark which is not always present), the primary mark is the mandatory mark.
- For symbols consisting of two mandatory marks, the primary mark is the mark placed at the consonant of the stressed syllable.
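
The design principles and the rule for primary and secondary marks can be read off a code value mechanically. The sketch below is one possible way to do so; the class and function names are hypothetical, and the code only decodes the structure of a value without checking whether the chart actually assigns a symbol to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    books: str        # "21 books" or "3 books"
    kind: str         # "distinctive", "conjunctive", or "other"
    symbol_code: int  # even code point of the symbol the value belongs to
    secondary: bool   # True if the value denotes the secondary mark


def decode(sc: int) -> Decoded:
    """Split a systematic code 0x00..0xFF into its structural components."""
    if not 0x00 <= sc <= 0xFF:
        raise ValueError(f"systematic code out of range: {sc:#x}")
    # Top distinction: codes 00..7F are 21 books, 80..FF are 3 books.
    books = "3 books" if sc & 0x80 else "21 books"
    low = sc & 0x7F
    if low <= 0x5B:
        kind = "distinctive"
    elif 0x60 <= low <= 0x7B:
        kind = "conjunctive"
    else:
        kind = "other"  # positions reserved for Paseq, Maqqef, Meteg
    # Symbols sit on even code points; the secondary mark is one higher.
    return Decoded(books, kind, sc & 0xFE, bool(sc & 1))


print(decode(0x19))
```

An odd value is meaningful only for symbols that consist of two marks; for all other symbols the value next to the symbol's code point is simply unused.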

In the code chart above, the same abbreviations for the symbols have been used as in the syntax charts. What the abbreviations stand for is listed in the tables. There, similar abbreviations will be defined for the marks as well. The design principles for these abbreviations are:

The abbreviations for Paseq, Maqqef, and Meteg consist of the first and the last letter of the name. The abbreviation of a cantillation symbol begins with three letters from the name. The first letter taken from each Hebrew word in the name appears as a capital letter in the abbreviation. The Latin letter "e" standing for a Schwa (e.g. the "e" in "Revia") is not used in the abbreviation.
The abbreviations for conjunctive symbols consist just of these three letters; the abbreviations for distinctive symbols consist of the three letters followed by a number indicating the rank, as given in the table below.
The abbreviation for a mark consists of the abbreviation of the symbol it belongs to, followed by a letter indicating the position of the mark (see legend under position). This additional letter is omitted when the symbol consists of only one mark.

Here is a summary of how the ranks of the symbols are denoted by the numbers in the abbreviations and by the colours in the syntax charts and in the tables:

Abbr.	rank of symbol
xxx0	final emperor
xxx0	non-final emperor
xxx1	non-final king
xxx1	final king
xxx1	king after Atnach (3 books only)
xxx2	non-final duke
xxx2	final duke
xxx3	non-final officer
xxx3	final officer
xxx	servant = conjunctive symbol
xx	other character
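
The rank can be recovered from a symbol abbreviation with a few lines of code. This is a minimal sketch with hypothetical names; it handles symbol abbreviations only (not the mark abbreviations with their additional position letter), and it cannot tell final from non-final symbols, since that distinction is expressed by colour rather than by the digit.

```python
import re

RANKS = {"0": "emperor", "1": "king", "2": "duke", "3": "officer"}

# Three letters from the name, optionally followed by a rank digit.
_SYMBOL = re.compile(r"([A-Z][A-Za-z]{2})([0-3])?")
# Two letters: first and last letter of Paseq, Maqqef, or Meteg.
_OTHER = re.compile(r"[A-Z][a-z]")


def classify(abbr: str) -> str:
    """Return the rank denoted by a symbol abbreviation."""
    if _OTHER.fullmatch(abbr):
        return "other character"
    match = _SYMBOL.fullmatch(abbr)
    if match is None:
        raise ValueError(f"not a symbol abbreviation: {abbr!r}")
    digit = match.group(2)
    # Distinctive symbols carry a rank digit; conjunctive ones do not.
    return RANKS[digit] if digit else "servant"


for abbr in ("SoP0", "Tip1", "Rvi2", "Paz3", "Mun", "Pq"):
    print(abbr, classify(abbr))
```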

Legend For The Tables

SC = systematic code

a private code for both cantillation symbols and cantillation marks based on their semantics.

For the rationale and explanations, see above.

Abbr. = Abbreviation

abbreviation used in the code charts of the syntax description.

Both the digit in the abbreviation and the background colour indicate the rank of the symbol. For detailed explanations, see above.

Position

the position of the mark(s) relative to the text

The position of a mark consists of two features: its place in the word and its position relative to the letter. For the former, the following codes are used:

codes for marks and for symbols consisting of only one mark:
a	mark prepositive to the word, carries no information about stress
b	mark on unstressed syllable
c	mark at the initial consonant of stressed syllable
d	mark postpositive to the word, indicates stress on ultimate syllable unless other mark present
e	mark postpositive to the word, carries no information about stress
code combinations for symbols consisting of two marks:
ac	marks a and c, both mandatory
a(c)	mark a prepositive to the word with no information about stress; infrequently extra mark c at stressed syllable
bc	marks b and c, both mandatory, marks mostly on the same word but mark b sometimes on the preceding word
(b)c	marks b and c, marks mostly on the same word but mark b sometimes on preceding word or missing
(c)d	mark d if stress on ultimate syllable, otherwise two marks c and d
ce	marks c and e, both mandatory
(c)e	mark e postpositive to the word with no information about stress; infrequently extra mark c at stressed syllable
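
The code combinations listed above follow a simple notation: one or two letters from a to e, with parentheses around a mark that is not always present. A minimal sketch of a parser for this notation, with hypothetical names, could look like this:

```python
import re

# One or two place codes, each optionally enclosed in parentheses.
_VALID = re.compile(r"(?:[a-e]|\([a-e]\)){1,2}")
_ITEM = re.compile(r"\(([a-e])\)|([a-e])")


def parse_position(code: str) -> list[tuple[str, bool]]:
    """Split a position code into (place, mandatory) pairs."""
    if not _VALID.fullmatch(code):
        raise ValueError(f"not a position code: {code!r}")
    result = []
    for optional, mandatory in _ITEM.findall(code):
        # Exactly one of the two groups is non-empty for each item.
        result.append((optional or mandatory, not optional))
    return result


print(parse_position("a(c)"))  # [('a', True), ('c', False)]
```

The parser captures only the notation; the additional remarks in the table (for instance, that mark b sometimes stands on the preceding word) are not encoded in the position code itself.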

In addition, the codes in the mark table contain an indication of the position of the mark relative to the letter where it is placed: a and b for above and below, optionally with l or r for left and right, and f for final (a mark that is placed after the word like a spacing character).

The information given by these codes is supplemented by a symbol in the next column. There, the space occupied by the entire word is depicted as grey area, so that a mark at the right or left of this area denotes a mark that is prepositive (or postpositive, resp.) to the entire word. If the symbol consists of two marks, their positions are shown in red and pale blue as follows:

In the cantillation mark table, the red spot shows the position of the mark at hand, and the pale blue spot shows the position of the mark with which the mark at hand is combined to form a symbol.
In the cantillation symbol tables, the red spot shows the position of the primary mark (typically the mark placed on the consonant of the stressed syllable).

Shape

the shape of the marks without indication where they are positioned relative to the text

If there are two marks, they are read from right to left. For instance, if the position code is ac, the mark in position a is shown right of the mark in position c.

Name

the name of the symbol or mark

In the cantillation mark table, the name of the mark is given if the mark has a name of its own; otherwise the name of the symbol is given. This can lead to a situation like the following: the symbol Sof Pasuq consists of two marks; one of them, Silluq, has a name of its own, whereas the other one has none and is listed as Sof Pasuq. In any case, only one name is given for a mark, even if it also has other names.
In the cantillation symbol tables, one column of the table contains one or more synonyms or different Latin-script spellings of the symbol's name; another column contains one of the names (not necessarily the first in the other column) in Hebrew script.

MC

code value in the Michigan-Claremont encoding

Note that this is not a decimal or hexadecimal number but a string consisting of two decimal digits.
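
A program handling these values should therefore keep them as strings. The following minimal sketch shows why; the function name is hypothetical, and the sample strings are arbitrary two-digit strings chosen for illustration, not values taken from the Michigan-Claremont tables.

```python
import re


def is_mc_code(value: str) -> bool:
    """Check that a value is a string of exactly two decimal digits."""
    return re.fullmatch(r"[0-9]{2}", value) is not None


print(is_mc_code("05"))  # True
print(is_mc_code("5"))   # False
# Converting to a number loses the leading zero and thus the code itself.
print(str(int("05")))    # 5
```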

Unicode

code value of the mark in Unicode

Unicode name