Here, the representation of the cantillation marks in two character codes is given:
Unicode, a unified framework for worldwide character encoding. The basis for the assignment of Unicode values to cantillation marks seems to have been the Israeli national code SI 1311-2, whose code values are lower than the corresponding Unicode values by a constant 0x04F0. There is a major ambiguity in the definition of two of the marks, which seem to be swapped in the Unicode tables. A separate chapter of this WWW page describes this problem and two minor points in ample detail, in the hope that this will assist in removing the ambiguities.
The Michigan-Claremont encoding, used in a collection of computer-readable Biblical texts. The rationale is explained in the Michigan Manual. As far as the codes for the cantillation marks are concerned, this encoding looks rather arbitrary, although it is intended to be mnemonic regarding the shape and position of the marks.
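The constant offset between SI 1311-2 and Unicode mentioned above can be illustrated with a short sketch. It assumes that the offset holds exactly as described for every cantillation mark; the function names are hypothetical, and the range check against the Unicode Hebrew cantillation range U+0591 to U+05AF is an added safeguard rather than part of either standard.

```python
# Sketch only: assumes the constant offset described in the text applies
# to every cantillation mark. All names here are illustrative.
SI_TO_UNICODE_OFFSET = 0x04F0

# Range of the Hebrew cantillation marks in current Unicode versions.
UNICODE_ACCENT_FIRST = 0x0591
UNICODE_ACCENT_LAST = 0x05AF


def si_to_unicode(si_code: int) -> str:
    """Map an SI 1311-2 code value to a Unicode character via the offset."""
    code_point = si_code + SI_TO_UNICODE_OFFSET
    if not UNICODE_ACCENT_FIRST <= code_point <= UNICODE_ACCENT_LAST:
        raise ValueError(f"0x{si_code:02X} does not map to a cantillation mark")
    return chr(code_point)


def unicode_to_si(char: str) -> int:
    """Inverse mapping: subtract the offset from the Unicode code point."""
    code_point = ord(char)
    if not UNICODE_ACCENT_FIRST <= code_point <= UNICODE_ACCENT_LAST:
        raise ValueError(f"U+{code_point:04X} is not a cantillation mark")
    return code_point - SI_TO_UNICODE_OFFSET
```

Because the mapping is a pure offset, a round trip through both functions returns the original value for any code inside the checked range.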
To describe how cantillation marks are represented in these character codes, we need to make an important distinction:
A cantillation mark is a diacritical symbol attached to a letter or a word. It is characterised by its shape and its position relative to the letter or word it is attached to.
A cantillation symbol consists of one or two cantillation marks. It is characterised by its meaning and by the rules for its usage in a given context.
In particular, if marks have the same shape and position but different meanings, such as Tipcha, Tarkha and Meayla, they are considered the same mark but distinct symbols.
Now, a character coding can adhere to one of the following strategies:
Codes are assigned to the marks, irrespective of their meaning and their combination to symbols. This includes marks that cannot occur other than in the combination with another mark to form a symbol.
Codes are assigned to the symbols. The corresponding marks must then be derived during the rendering process.
Codes are assigned to the marks, but dependent on what symbols they are components of.
Each of these strategies has its benefits and drawbacks. Concentrating on marks facilitates reading and writing (i.e. the transformation between written or printed text and its coded representation) but complicates processing the content of the coded text. Violations of the usage standard for symbols can be dealt with much more easily if only the marks are coded, which makes it possible to encode texts even if the marks do not always combine to meaningful symbols. In this spirit, the Michigan-Claremont manual insists: "Code what is written, not what is meant."
The two codings presented in the table both follow the first of the strategies, with the Michigan-Claremont code making a few distinctions in the spirit of strategy 3. Therefore, there is sometimes more than one MC code value that corresponds to the same Unicode value.
As both these codes are based on marks as distinct from symbols, a new private code for the symbols and their constituents was developed for the purpose of this exposition. Without such a code, it would have been very difficult to maintain tables of symbols. Such a code can also serve as a basis for privately used characters in a publicly standardised code when a distinction of marks according to strategy 3 above is needed. This private code is explained in the next section together with the abbreviations used in the syntax chapters.
All publicly standardised character codes for cantillation marks covered in this article are codes for marks irrespective of their combination to symbols and irrespective of the semantics of the symbols. In contrast to that, the syntax description showed that the same mark can be part of different symbols (e.g. Legarmeh (=Paseq) as part of Shalshelet Gadol, of Mahpakh Legarmeh and of several others), and the same symbol can have different semantic significance (e.g. Revia as king or as duke in the 3 books, or Mahpakh Legarmeh with three different possible ranks). When these differences are important, one needs a code reflecting them.
The code proposed below has been developed in order to have a sorting criterion for the tables in this article. It can, however, also be used in texts containing cantillation marks if a finer distinction of the marks is needed than the one by the mere shape and position of the marks, for instance, when a program is written to distinguish the different possible semantics of each given mark. It can be used together with Unicode if its range is embedded into the private use area, e.g. at U+E100 to U+E1FF.
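As a minimal sketch of the embedding into the private use area, the following maps a systematic code value (00 to FF) to a character in the example range U+E100 to U+E1FF and back. The base value is the one suggested in the text; the function names are assumptions made for illustration.

```python
from typing import Optional

# Example base from the text; any free 256-code-point stretch of the
# private use area would serve equally well.
PUA_BASE = 0xE100


def sc_to_pua(sc: int) -> str:
    """Return the private use character for a systematic code 0x00..0xFF."""
    if not 0x00 <= sc <= 0xFF:
        raise ValueError("systematic code must be in the range 00 to FF")
    return chr(PUA_BASE + sc)


def pua_to_sc(char: str) -> Optional[int]:
    """Return the systematic code for a character, or None if outside the range."""
    offset = ord(char) - PUA_BASE
    return offset if 0x00 <= offset <= 0xFF else None
```

Text that mixes ordinary Unicode Hebrew with these private characters can then be scanned with `pua_to_sc`, which leaves all other characters untouched by returning `None` for them.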
The design principles of this code are:
The code uses code positions 00 to FF, that is, 256 different possible codes.
The top distinction is 21 books (codes 00 to 7F) vs. 3 books (codes 80 to FF).
The next distinction is distinctive symbols (codes 00 to 5B and 80 to DB) vs. conjunctive symbols (codes 60 to 7B and E0 to FB). The remaining code positions are reserved in case one wants to include in the same code three characters which are not cantillation symbols but which can influence the positioning of cantillation symbols: Paseq, Maqqef, and Meteg.
The codes assigned to distinctive symbols are divided into contiguous blocks according to the ranks of the symbols, emperors first, then kings, dukes, and officers.
Within the distinctive symbols, blocks of four consecutive code points, the first one ending with 0, 4, 8, or C, are assigned to pairs of cantillation symbols which can replace each other (the braces in the syntax charts).
Throughout the whole code, blocks of 2 consecutive code points, the first one even (ending with 0, 2, 4, 6, 8, A, C, or E), are assigned to a cantillation symbol each.
Whenever possible, symbols that are related to one another, in particular a lower symbol serving exclusively a particular higher symbol, are given code point distances that are multiples of 20 (hexadecimal), so that they appear in the same horizontal line in the code chart below:
| 21 books | | | | 3 books | | | |
|---|---|---|---|---|---|---|---|
| distinctive | | | conj. | distinctive | | | conj. |
| 00 SoP0 | 20 Rvi2 | | 60 Mun | 80 SoP0 | A0 RvG2 | | E0 Mun |
| | | 44 Paz3 | | 84 OYr1 | A4 RvQ2 | C4 Paz3 | E4 AtH |
| | | 46 QaP3 | 66 Glg | 86 AzL1 | | | E6 Glg |
| 08 Atn0 | | 48 TlG3 | 68 Mer | 88 Atn1 | A8 Dhi2 | | E8 Mer |
| | | | | 8A Paz1 | AA MpL2 | | EA MrM |
| | | | 6C TlQ | 8C Rvi1 | | | EC ShQ |
| | | | 6E May | | | | EE Tar |
| 10 Sgl1 | 30 Zar2 | 50 Ger3 | 70 Qad | | B0 Tsi2 | D0 AzL3 | F0 Qad |
| 12 Sha1 | | 52 Grm3 | | | | D2 MpL3 | F2 Ill |
| 14 ZqQ1 | 34 Psh2 | | 74 Mhp | 94 RvM1 | | | F4 Mhp |
| 16 ZqG1 | 36 Ytv2 | | | | | | F6 MpM |
| 18 Tip1 | 38 Tvr2 | 58 Lgm3 | 78 MeK | 98 ShG1 | | | |
| | | | 7A Dar | | | | |
| | | | 7C Mf | 9C MpL1 | | | FC Mf |
| | | 5E Pq | 7E Mg | | | DE Pq | FE Mg |
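The range-based design principles listed before the chart can be sketched as a small classifier. This is an illustration under the stated ranges only; the function names are hypothetical, and the rank blocks within the distinctive symbols are deliberately not modelled because their exact boundaries are not spelled out here.

```python
from typing import Tuple


def classify(sc: int) -> Tuple[str, str]:
    """Return (book group, kind) for a systematic code 0x00..0xFF."""
    if not 0x00 <= sc <= 0xFF:
        raise ValueError("systematic code must be in the range 00 to FF")
    # Top distinction: the high bit separates the 3 books from the 21 books.
    books = "3 books" if sc & 0x80 else "21 books"
    low = sc & 0x7F
    if low <= 0x5B:
        kind = "distinctive"
    elif 0x60 <= low <= 0x7B:
        kind = "conjunctive"
    else:
        # Positions left over for Paseq, Maqqef, and Meteg.
        kind = "reserved"
    return books, kind


def chart_row(sc: int) -> int:
    """Codes whose distance is a multiple of 0x20 share this row value."""
    return sc & 0x1F
```

For instance, the codes 10, 30, 50, 70 and F0 all yield the same `chart_row` value, which is what places them on one horizontal line of the chart.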
The above rules assign even code points to symbols. If a symbol consists of two marks, the primary mark gets the same code point, and the secondary mark gets the code point that is higher by one. The primary mark is defined as follows:
For symbols consisting of a mandatory mark and an optional mark (i.e. a mark which is not always present), the primary mark is the mandatory mark.
For symbols consisting of two mandatory marks, the primary mark is the mark placed at the consonant of the stressed syllable.
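The relationship between symbol codes and mark codes can be sketched as follows. Which symbols actually consist of two marks is not encoded here; the caller has to supply that information, and all names are illustrative assumptions.

```python
from typing import Dict


def mark_codes(symbol_code: int, two_marks: bool) -> Dict[str, int]:
    """Return the code points of the mark(s) belonging to a symbol."""
    if not 0x00 <= symbol_code <= 0xFF or symbol_code % 2 != 0:
        raise ValueError("symbol codes are even values in the range 00 to FE")
    codes = {"primary": symbol_code}
    if two_marks:
        # The secondary mark sits one code point above the primary mark.
        codes["secondary"] = symbol_code + 1
    return codes


def symbol_of(mark_code: int) -> int:
    """Clear the lowest bit to get the symbol a mark belongs to."""
    return mark_code & 0xFE


def is_secondary(mark_code: int) -> bool:
    """Odd code points denote secondary marks."""
    return bool(mark_code & 0x01)
```

With this layout, going from a mark back to its symbol needs no lookup table, since the symbol code is recovered by masking the lowest bit.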
In the code chart above, the same abbreviations for the symbols have been used as in the syntax charts. What the abbreviations stand for is listed in the tables. There, similar abbreviations will be defined for the marks as well. The design principles for these abbreviations are:
The abbreviation for Paseq, Maqqef, and Meteg consists of the first and the last letter of the name. The abbreviation of a cantillation symbol begins with three letters from the name. The first letter taken from each Hebrew word in the name appears as a capital letter in the abbreviation. The Latin letter "e" standing for a Schwa (e.g. the "e" in "Revia") is not used in the abbreviation.
The abbreviations for conjunctive symbols consist just of these three letters; the abbreviations for distinctive symbols consist of the three letters followed by a number indicating the rank, as given in the table below.
The abbreviation for a mark consists of the abbreviation of the symbol it belongs to, followed by a letter indicating the position of the mark (see legend under position). This additional letter is omitted when the symbol consists of only one mark.
Here is a summary of how the ranks of the symbols are denoted by the numbers in the abbreviations and by the colours in the syntax charts and in the tables:
Abbr. | rank of symbol |
---|---|
xxx0 | final emperor |
xxx0 | non-final emperor |
xxx1 | non-final king |
xxx1 | final king |
xxx1 | king after Atnach (3 books only) |
xxx2 | non-final duke |
xxx2 | final duke |
xxx3 | non-final officer |
xxx3 | final officer |
xxx | servant = conjunctive symbol |
xx | other character |
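The abbreviation scheme for symbols can be sketched as a small parser. It covers only symbol abbreviations and the two-letter abbreviations of the other characters; mark abbreviations with their trailing position letter are left out. The names and the returned structure are assumptions made for illustration.

```python
import re
from typing import NamedTuple

# Rank digit to rank name, following the table above.
RANKS = {"0": "emperor", "1": "king", "2": "duke", "3": "officer"}


class ParsedAbbr(NamedTuple):
    letters: str
    rank: str


_SYMBOL = re.compile(r"([A-Za-z]{3})([0-3])?")
_OTHER = re.compile(r"[A-Za-z]{2}")


def parse_abbr(abbr: str) -> ParsedAbbr:
    """Split a symbol abbreviation into its letters and its rank."""
    if _OTHER.fullmatch(abbr):
        # Two letters: Paseq, Maqqef, or Meteg.
        return ParsedAbbr(abbr, "other character")
    match = _SYMBOL.fullmatch(abbr)
    if match is None:
        raise ValueError(f"not a symbol abbreviation: {abbr!r}")
    letters, digit = match.groups()
    # No digit means a conjunctive symbol, i.e. a servant.
    rank = RANKS[digit] if digit else "servant"
    return ParsedAbbr(letters, rank)
```

The digit alone does not distinguish final from non-final symbols of the same rank, so the parser reports only the rank itself.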
SC = systematic code | a private code for both cantillation symbols and cantillation marks based on their semantics. For the rationale and explanations, see above. |
Abbr. = Abbreviation | abbreviation used in the code charts of the syntax description. Both the digit in the abbreviation and the background colour indicate the rank of the symbol. For detailed explanations, see above. |
Position | the position of the mark(s) relative to the text. The position of a mark consists of two features: its place in the word and its position relative to the letter. For the former, the following codes are used:
In addition, the codes in the mark table contain an indication of the position of the mark relative to the letter where it is placed: a and b for above and below, optionally with l or r for left and right, and f for final (a mark that is placed after the word like a spacing character). The information given by these codes is supplemented by a symbol in the next column. There, the space occupied by the entire word is depicted as grey area, so that a mark at the right or left of this area denotes a mark that is prepositive (or postpositive, resp.) to the entire word. If the symbol consists of two marks, their positions are shown in red and pale blue as follows:
Shape | the shape of the marks without indication where they are positioned relative to the text. If there are two marks, they are read from right to left. For instance, if the position code is ac, the mark in position a is shown right of the mark in position c. |
Name | the name of the symbol or mark
MC | code value in the Michigan-Claremont encoding. Note that this is not a decimal or hexadecimal number but a string consisting of two decimal digits. |
Unicode | code value of the mark in Unicode |
Unicode name | name of the mark in Unicode |