.. automodule:: selkie.nlp.tok Tokenizer for Latin scripts — ``selkie.nlp.tok`` ================================================ Usage ----- .. py:function:: tokenized(s) The main function is ``tokenized()``, which takes a text (string) and returns a list of tokens. >>> from selkie.nlp.tok import tokenized >>> t = tokenized('"Hi, @#!"\nsaid 42-J.\n') >>> t[:4] [, , , ] The return value is a list of Token instances. .. py:class:: selkie.nlp.tok.Token .. py:method:: type() The value is one of: ``'word'``, ``'number'``, ``'hyphen'``, or ``'punct'``. (Internally, the tokenizer also creates ``'space'`` and ``'newline'`` tokens, but they are not returned. Newlines are implicit in the line numbers, and spaces can be reconstructed from the column numbers.) .. py:method:: string() The original characters. .. py:method:: line() The line number (from 1). .. py:method:: column() The column number (from 0). .. py:method:: endcolumn() The ending column number (exclusive). .. py:method:: start() The character offset (from 0, inclusive). .. py:method:: end() Ending character offset (exclusive). For example: >>> t[5].string() 'J.' >>> t[5].line() 2 >>> t[5].column() 9 One can compute the space between two adjacent tokens by subtracting the endcolumn of the first from the column of the second: >>> t[5].column() - t[4].endcolumn() 0 >>> t[2].string() 'said' >>> t[3].column() - t[2].endcolumn() 2 This is legitimate only if the two tokens are on the same line. Algorithm --------- To be robust to OCR errors, tokens typically mix alphanumeric and punctuation characters. The only exception is that a non-peripheral hyphen will break a token into two pieces. The definition in detail is as follows: * *Whitespace* is a sequence of one or more characters that satisfy ``isspace()``. * A *hyphen* is a sequence of one or more hyphen characters. It is *peripheral* if it is preceded or followed by whitespace or by the beginning or end of the text. It is *embedded* if it is not peripheral. * A *separator* is whitespace or an embedded hyphen. Note that a peripheral hyphen is not a separator, and will be included as part of another token. * A *regular token* is a maximal sequence of non-separators. Its type is ``word`` if it contains any letters, ``number`` if it contains digits but no letters, and ``punct`` if it contains neither letters nor digits. * A *token* is either a regular token or an embedded hyphen. The tokenizer returns a sequence of tokens. Note that whitespace is discarded. However, the tokenizer does keep track of line numbers; each token has line number as an attribute.