.. automodule:: selkie.corpus.rom

Romanizations — ``selkie.corpus.rom``
=====================================

Definition
----------

A **romanization** defines ASCII key sequences for entering non-ASCII
characters. It can be thought of as a keyboard for entering a non-roman
script, or as an orthography. For example, using the Salish romanization,
one can type ``l**x@c'`` to obtain the character sequence ƛ̣̓xəc̓.

In CLD, text stored in files is stored in romanized (ASCII) form.

It is easiest if we can associate a unique romanization with a given
language. That is problematic in cases where different orthographies are in
use. For example, Ojibwe is sometimes written using Canadian syllabics. We
probably want to convert such texts to the "standard" orthography before
analyzing them. The alternative is to treat alternative orthographies as
introducing variant forms of all words, which is not an attractive option.

We store the romanizations in the top-level directory ``roms``.
We need to be able to specify a separate registry.

Usage
-----

Romanizations provide one-way codecs: they can be used to decode ASCII byte
sequences, producing Unicode strings as output. The reverse mappings are not
currently provided.

The romanizations currently defined are: ``'gothic'``, ``'gothic-student'``,
and ``'salish'``. They are enabled when ``selkie.corpus.rom`` is imported.
One uses them as one uses any decoder. For example:

>>> import selkie.corpus.rom
>>> s = b'c*a'.decode('salish')
>>> from selkie.pyx.string import unidescribe
>>> unidescribe(s)
0 0x10d LATIN SMALL LETTER C WITH CARON
1 0x61 LATIN SMALL LETTER A

The string prints out as "ča."
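This section does not show how a romanization becomes available to ``bytes.decode()`` by name. One plausible mechanism is the standard library's ``codecs.register()``, and the following is a minimal sketch of that idea, not the code ``selkie`` actually uses. The codec name ``toyrom``, the one-entry table ``TOY_TABLE``, and the helper functions are all hypothetical.

```python
import codecs

# Hypothetical one-entry table: "c*" -> U+010D, as in the Salish example above.
TOY_TABLE = {"c*": "\u010d"}


def _toy_decode(data, errors="strict"):
    """Decode ASCII bytes to str, replacing two-character keys from TOY_TABLE."""
    text = bytes(data).decode("ascii", errors)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in TOY_TABLE:
            out.append(TOY_TABLE[pair])
            i += 2
        else:
            # No key matches here: copy one character through unchanged.
            out.append(text[i])
            i += 1
    # A codec decode function returns (result, number of input units consumed).
    return "".join(out), len(data)


def _toy_encode(text, errors="strict"):
    # The codec is one-way, so the reverse mapping is deliberately missing.
    raise NotImplementedError("toyrom is a decode-only codec")


def _search(name):
    """Codec search function: answer only for the hypothetical name 'toyrom'."""
    if name == "toyrom":
        return codecs.CodecInfo(encode=_toy_encode, decode=_toy_decode,
                                name="toyrom")
    return None


codecs.register(_search)

decoded = b"c*a".decode("toyrom")
```

Once the search function is registered, the name works anywhere Python accepts an encoding name for decoding, which is consistent with the remark that a romanization is used like any other decoder.
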
There is also a ``decode()`` function:

>>> from selkie.corpus.rom import decode
>>> s2 = decode("a'tho:", 'gothic-student')
>>> unidescribe(s2)
0 0xe1 LATIN SMALL LETTER A WITH ACUTE
1 0xfe LATIN SMALL LETTER THORN
2 0x6f LATIN SMALL LETTER O
3 0x304 COMBINING MACRON

To convert the output to an ASCII string containing HTML entities of the
form ``&#dddd;`` for non-ASCII characters:

>>> from selkie.corpus.rom import to_html
>>> to_html(s2)
b'&#225;&#254;o&#772;'

To see the graph::

   >>> student.print_graph() # doctest: +SKIP

Decoder
-------

A Decoder applies a romanization. It is similar to the reader for a codec,
but it maps text to text, not bytes to text.

The romanization behaves as a dict mapping strings to strings. It is
interpreted as a prefix code. At any point in the input stream, the longest
matching key is used to determine the output string at that point.

If no key matches, the unicoder checks whether the next thing in the input
stream is one of the directives in the following table. The first thirteen
are identical to escapes allowed in Python. The symbol *d* represents any
octal digit (0-7), and *h* represents any hex digit (0-9, a-f, A-F).

.. list-table::

   * - ``\``\ *newline*
     - A backslash followed by newline produces no output.
   * - ``\\``
     - A literal backslash.
   * - ``\a``
     - ASCII bell, U+0007.
   * - ``\b``
     - ASCII backspace, U+0008.
   * - ``\f``
     - ASCII formfeed, U+000C.
   * - ``\n``
     - ASCII newline (line feed), U+000A.
   * - ``\r``
     - ASCII carriage return, U+000D.
   * - ``\t``
     - ASCII tab, U+0009.
   * - ``\v``
     - ASCII vertical tab, U+000B.
   * - ``\``\ *ddd*
     - The Unicode character whose codepoint, in octal, is *ddd*.
       One to three digits may be given; the longest match will be taken
       up to three digits.
   * - ``\x``\ *hh*
     - The Unicode character U+00\ *hh*. Exactly two hex digits must be
       provided.
   * - ``\u``\ *hhhh*
     - The Unicode character U+\ *hhhh*. Exactly four hex digits must be
       provided. The named codepoint is inserted.
   * - ``\U``\ *hhhhhhhh*
     - The codepoint U+\ *hhhhhhhh* is inserted. Exactly eight hex digits
       must be provided.
   * - ``\.``\ *name*
     - The unicoder switches to the named romanization. *Name* consists of
       any mix of letters, digits, and underscore. The longest match is
       taken. To force a shorter match, when the next intended character is
       a letter, digit, or underscore, one may terminate the name with
       ``\.`` (backslash period).
   * - ``\[``\ *name*
     - The unicoder switches to the named romanization, but pushes the old
       one on the stack.
   * - ``\]``
     - The unicoder pops the previous romanization off the stack and
       resumes using it.
   * - ``\.``
     - Produces no output; can be used to terminate *name* or *ddd*.

If the next thing in the input is not one of the romanization's keys, and
not one of the directives in the table, then a single character is copied
unmodified to the output.

A ``.rom`` file is loaded using ``load_dict()`` of ``seal.io``. Keys may not
be null. To get Unicode characters into the value part of a ``.rom`` file,
use numeric escapes and pass it through Unicoder.

The function ``decoder`` produces a decoder for a given romanization, and
the function ``reader`` produces an input stream.

In JavaScript, the coder accepts strings or single characters via
``append()``. The input must consist of seven-bit ASCII, so characters and
code points are the same. There is no one-to-one correspondence between
input characters and output characters, and in some cases, lookahead is
required to determine what the output sequence should be. If the output
sequence is still ambiguous, but no further input remains, one can force all
pending output to be produced by calling ``flush()``.

Catalog
-------

To get a list of the defined romanizations:

>>> from selkie.corpus.rom import default_registry
>>> default_registry.reset()
>>> sorted(default_registry)
['gothic', 'gothic-student', 'korean', 'otw-jones', 'otw-webkamigad', 'salish']

To get the romanization itself, access the registry like a dict:

>>> salish = default_registry['salish']

The file in which the romanization resides is ``salish.filename``.
Calling ``print(salish)`` prints its contents. One can also use
``salish.items()`` to get an iteration over the pairs, and
``salish.print_graph()`` to see the finite-state graph.

Gothic
......

Here are the contents of the 'gothic' romanization:

.. list-table::

   * - a
     - 𐌰
   * - b
     - 𐌱
   * - g
     - 𐌲
   * - d
     - 𐌳
   * - e
     - 𐌴
   * - q
     - 𐌵
   * - z
     - 𐌶
   * - h
     - 𐌷
   * - th
     - 𐌸
   * - i
     - 𐌹
   * - k
     - 𐌺
   * - l
     - 𐌻
   * - m
     - 𐌼
   * - n
     - 𐌽
   * - j
     - 𐌾
   * - u
     - 𐌿
   * - p
     - 𐍀
   * - 90
     - 𐍁
   * - r
     - 𐍂
   * - s
     - 𐍃
   * - t
     - 𐍄
   * - w
     - 𐍅
   * - f
     - 𐍆
   * - x
     - 𐍇
   * - hv
     - 𐍈
   * - o
     - 𐍉
   * - 900
     - 𐍊

Here are the contents of 'gothic-student':

.. list-table::

   * - A:
     - Ā
   * - E:
     - Ē
   * - O:
     - Ō
   * - U:
     - Ū
   * - A'
     - Á
   * - I'
     - Í
   * - U'
     - Ú
   * - A:'
     - Ā́
   * - I:'
     - Ī́
   * - U:'
     - Ū́
   * - a:
     - ā
   * - e:
     - ē
   * - o:
     - ō
   * - u:
     - ū
   * - a'
     - á
   * - i'
     - í
   * - u'
     - ú
   * - a:'
     - ā́
   * - i:'
     - ī́
   * - u:'
     - ū́
   * - th
     - þ

Defining a new romanization
---------------------------

Here is an example of a romanization definition::

   a'	\(00e1)
   N	\(004b)
   L-	\(019b)
   l-	\(026c)
   ?	\(02c1)

Nota bene: the columns are separated by a single tab, not spaces.

If the preceding is the contents of ``romtest.rom`` in the current
directory, it is immediately available as encoding ``romtest``. If it
resides in directory DIR, one may make it available by adding DIR to the
default registry's path:

>>> import selkie
>>> default_registry.path.insert(0, selkie.data.path('examples'))
>>> b"l-a'L-e ?u".decode('romtest')
'ɬáƛe ˁu'

API
---

.. py:function:: load_rom(fn)

   Opens the file in binary mode. Returns an iteration over (key, value)
   pairs. The values are not expanded.

.. class:: Romanization

   .. py:method:: __init__([name], [fn])

      Initialize. If *fn* is provided, ``load_rom()`` is used to read it,
      and the values are decoded.

   .. py:attribute:: name

      The name.

   .. py:attribute:: filename

      The filename.

   .. py:attribute:: start

      The start state.

   .. py:method:: __setitem__(k, v)

      Add a new association.

   .. py:method:: items()

      Calls ``load_rom()`` on its filename and returns the resulting
      iteration.

   .. py:method:: __str__()

      Prints the contents of the file.

   .. py:method:: print_graph()

      Prints out the state graph.

   .. py:method:: match(input, i=0)

      Finds the longest match in *input* beginning at index *i*. The return
      value is a pair (j, value).

   .. py:method:: decode(input, output=None, errors='strict')

      Creates a Decoder from itself and calls it on *input* and *output*.
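The longest-match rule behind ``match()`` and ``decode()`` can be illustrated with a plain dict. This is only a sketch of the rule as described in this document: it assumes that *j* in the returned pair is the index just past the match, it ignores the backslash directives, and it does not use the finite-state graph that ``Romanization`` builds. The names ``longest_match``, ``decode_text``, and ``TABLE`` are hypothetical, and the code points in ``TABLE`` are illustrative.

```python
def longest_match(table, text, i=0):
    """Return (j, value) for the longest key of `table` found at text[i:].

    j is the index just past the match. If no key matches, return (i, None).
    """
    max_len = max(map(len, table), default=0)
    # Try the longest candidate first so longer keys win over their prefixes.
    for n in range(min(max_len, len(text) - i), 0, -1):
        value = table.get(text[i:i + n])
        if value is not None:
            return i + n, value
    return i, None


def decode_text(table, text):
    """Apply `table` to `text`, copying unmatched characters through unchanged."""
    out = []
    i = 0
    while i < len(text):
        j, value = longest_match(table, text, i)
        if value is None:
            out.append(text[i])
            i += 1
        else:
            out.append(value)
            i = j
    return "".join(out)


# A few entries modeled on the 'gothic-student' listing (illustrative values).
TABLE = {
    "a:": "\u0101",
    "a'": "\u00e1",
    "a:'": "\u0101\u0301",
    "o:": "o\u0304",
    "th": "\u00fe",
}

result = decode_text(TABLE, "a'tho:")
```

Trying candidates from longest to shortest is what makes ``a:'`` win over its prefix ``a:``. A state graph reaches the same answer in a single left-to-right pass, which matters more for large tables.
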