Programmatic Interface to Corpus

Selkie provides a programmatic interface to SLF corpora. The first column of the table below shows the directory structure of an SLF corpus. The second column gives an expression for accessing the corresponding structural unit, assuming that corpus is a variable holding the corpus as a whole, and the third column gives the type of the resulting object:

corpus/               corpus                        Corpus
    langs             corpus.langs                  LanguageTable
    roms/             corpus.roms                   RomRepository
        *romname*     corpus.roms[*romname*]        Rom
        ...
    *langid*/         corpus[*langid*]              Language
        lexicon       corpus[*langid*].lexicon      Lexicon
        index         corpus[*langid*].index        TokenIndex
        toc           corpus[*langid*].toc          MetadataTable
        txt/          corpus[*langid*].txt          TextTable
            *txtid*   corpus[*langid*].txt[*txtid*] Text
            ...
    ...

Recall from the description of the SLF format that the individual files (‘langs’, ‘lexicon’, ‘index’, ‘toc’, and each of the roms and simple texts) are called corpus items. The contents of the items suffice to reconstruct the entire corpus.

Corpus

One loads a corpus using the Corpus constructor. Let us first create a temp directory to work in:

>>> from tempfile import TemporaryDirectory
>>> tmp = TemporaryDirectory()

And let us create a corpus by copying an example:

>>> from selkie.data import ex
>>> from shutil import copytree
>>> from os.path import join
>>> corpus_filename = join(tmp.name, 'corpus')
>>> bool(copytree(ex('corp25.slf'), corpus_filename))
True

Opening the corpus:

>>> from selkie.corpus import Corpus
>>> corpus = Corpus(corpus_filename)

A corpus behaves like a dict whose keys are language IDs:

>>> list(corpus)
['deu']
>>> corpus['deu']
<Language deu German>

The methods __iter__(), __len__(), keys(), items(), and values() are also available and work as one would expect. One can use the method new() to add a new language:

>>> corpus.new('oji', 'Ojibwe')
<Language oji Ojibwe>
>>> list(corpus)
['deu', 'oji']

And one can delete a language using del:

>>> del corpus['oji']
>>> list(corpus)
['deu']

Language table

As indicated above, the corpus has a langs member, which is the list of languages:

>>> print(corpus.langs)
deu German

One may equally treat corpus.langs as a dict containing languages. In fact, dict method calls placed on the corpus are simply dispatched to corpus.langs (including new() as an honorary “dict method call”).
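
The dispatch just described can be pictured with a small self-contained sketch. The classes ToyLanguageTable and ToyCorpus are hypothetical stand-ins written for illustration only; they are not Selkie's implementation, and they keep everything in memory rather than writing to disk:

```python
# Hypothetical stand-ins for illustration; not Selkie's actual classes.
class ToyLanguageTable:
    def __init__(self):
        self._langs = {}                      # language ID -> full name

    def new(self, langid, fullname):
        if langid in self._langs:
            raise KeyError(f'language already exists: {langid}')
        self._langs[langid] = fullname
        return fullname

    def __getitem__(self, langid): return self._langs[langid]
    def __delitem__(self, langid): del self._langs[langid]
    def __iter__(self): return iter(self._langs)
    def __len__(self): return len(self._langs)


class ToyCorpus:
    def __init__(self):
        self.langs = ToyLanguageTable()

    # Every dict-style call is forwarded to the language table.
    def new(self, langid, fullname): return self.langs.new(langid, fullname)
    def __getitem__(self, langid): return self.langs[langid]
    def __delitem__(self, langid): del self.langs[langid]
    def __iter__(self): return iter(self.langs)
    def __len__(self): return len(self.langs)


toy = ToyCorpus()
toy.new('deu', 'German')
toy.new('oji', 'Ojibwe')
del toy['oji']
print(list(toy), list(toy.langs))             # both views list the same keys
```

Because the corpus holds no language state of its own in this sketch, the two views cannot drift apart.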

Language

A language has an ID and a full name:

>>> deu = corpus['deu']
>>> deu.langid()
'deu'
>>> deu.fullname()
'German'

Alternatively, the properties listed in the ‘langs’ file can be accessed by treating the language as a dict:

>>> deu['id']
'deu'
>>> deu['name']
'German'

Similarly, those properties may be modified, and the change is automatically written to disk:

>>> deu['name'] = 'Deutsch (German)'
>>> deu['name']
'Deutsch (German)'

(However, the key ‘id’ cannot be modified.)

Corresponding to the files in the language directory, a language has attributes lexicon, index, toc, and txt:

>>> deu.lexicon
<Lexicon /deu/lexicon>
>>> deu.index
<TokenIndex /deu/index>
>>> deu.toc
<Toc /deu/toc>
>>> deu.txt
<TextTable deu>

Table of Contents

A table of contents (‘toc’) is a table that maps text IDs to metadata:

>>> list(deu.toc)
['1', '2', '3']
>>> deu.toc['1']
<TextMetadata deu 1>

The toc prints out as a listing of IDs, types, and titles:

>>> print(deu.toc)
1 story Eine kleine Geschichte
2 page  p1
3 page  p2

One can add new texts to the toc:

>>> deu.toc.new('4', ti='Der Taucher', ty='story')
<TextMetadata deu 4>
>>> print(deu.toc)
1 story Eine kleine Geschichte
2 page  p1
3 page  p2
4 story Der Taucher

Text metadata behaves like a dict:

>>> meta = deu.toc['1']
>>> meta['ti']
'Eine kleine Geschichte'
>>> print(meta)
id 1
ty story
ti Eine kleine Geschichte
ch 2 3

Text table

The ‘txt’ member has the same keys as the TOC (namely, text IDs), but the values are text objects instead of metadata objects:

>>> list(deu.txt)
['1', '2', '3', '4']
>>> deu.txt['1']
<Text 1>
>>> deu.txt['2']
<Text 2>

The same metadata dict that one accesses through ‘toc’ can also be accessed from the text itself:

>>> t1 = deu.txt['1']
>>> t1.metadata()
<TextMetadata deu 1>

Incidentally, the inverse method, from metadata to text, is also available:

>>> meta.text()
<Text 1>

The text has convenience methods to access most of the metadata items:

>>> t1.textid()
'1'
>>> t1.text_type()
'story'
>>> t1.author()
''
>>> t1.title()
'Eine kleine Geschichte'

However, one cannot access metadata properties using square brackets on a text. Square brackets applied to an aggregate text return its children, and square brackets applied to a simple text return its sentences.

Hierarchical structure

Texts form a hierarchical structure, represented by the children() and parent() methods of Text. One obtains the roots of the hierarchy from the language:

>>> roots = deu.get_roots()
>>> roots
[<Text 1>, <Text 4>]

From there, one follows children() and parent() links:

>>> roots[0].children()
[<Text 2>, <Text 3>]
>>> t2 = _[0]
>>> t2.parent()
<Text 1>

One can also use the method walk() to iterate over all descendants of a text (including itself).
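
As a rough illustration of such a traversal, the sketch below uses a hypothetical ToyText class (not Selkie's Text) and assumes a depth-first, parent-before-children order. The order that Selkie's walk() actually uses is not stated here, so treat that as an assumption:

```python
# Hypothetical stand-in for illustration; not Selkie's Text class.
class ToyText:
    def __init__(self, textid, children=()):
        self.textid = textid
        self._children = list(children)
        self._parent = None
        for child in self._children:
            child._parent = self

    def children(self): return self._children
    def parent(self): return self._parent

    def walk(self):
        # Yield this text first, then each subtree in turn (assumed order).
        yield self
        for child in self._children:
            yield from child.walk()


page1, page2 = ToyText('2'), ToyText('3')
story = ToyText('1', [page1, page2])
walk_ids = [t.textid for t in story.walk()]
print(walk_ids)
```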

A text has methods that characterize its intuitive level in the hierarchy. The largest aggregates are collections, which are distinguished by having text type ‘collection’. The largest non-collections are documents. And the leaves of the hierarchy are simple texts. Texts have methods to test those properties, namely is_collection(), is_document(), and is_simple_text(), and languages have methods to fetch the corresponding texts:

>>> deu.get_collections()
[]
>>> deu.get_documents()
[<Text 1>, <Text 4>]
>>> deu.get_simple_texts()
[<Text 2>, <Text 3>, <Text 4>]

Sentences and words

A simple text behaves like a list of sentences:

>>> t3 = deu.txt['3']
>>> list(t3)
[<Sentence 3.1 eines Tages begegnete der Schuster ...>, <Sentence 3.2 Ende>]

(Incidentally, if one accesses an aggregate like a list, the list elements are the children.)

A sentence behaves like a list of words:

>>> sent = t3[0]
>>> list(sent)
['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler']
>>> sent[0]
'eines'

In addition, a sentence has methods for accessing a list of timestamps:

>>> sent.timestamps()
[(0, '1.4958'), (2, '1.9394'), (5, '2.7833'), (7, '3.3269')]

One can alternatively obtain a list of spans, which are triples consisting of start time, end time, and a list of words:

>>> for span in sent.spans():
...     print(span)
...
('1.4958', '1.9394', ['eines', 'Tages'])
('1.9394', '2.7833', ['begegnete', 'der', 'Schuster'])
('2.7833', '3.3269', ['einen', 'Bettler'])
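
The relationship between the two views can be sketched in a few lines. The function spans_from_timestamps below is a hypothetical helper, not part of Selkie, and it assumes that each timestamp is a pair of a word offset and a time, with consecutive timestamps delimiting one span:

```python
def spans_from_timestamps(words, timestamps):
    """Pair up consecutive (offset, time) timestamps into spans.

    Hypothetical helper for illustration; assumes each offset marks the
    position in the word list at which the given time falls.
    """
    spans = []
    for (i, start), (j, end) in zip(timestamps, timestamps[1:]):
        spans.append((start, end, words[i:j]))
    return spans


words = ['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler']
timestamps = [(0, '1.4958'), (2, '1.9394'), (5, '2.7833'), (7, '3.3269')]
spans = spans_from_timestamps(words, timestamps)
for span in spans:
    print(span)
```

Under that assumption, the sketch reproduces the three spans shown above from the four timestamps.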

Finally, if the sentence has a translation, the method gloss() returns it:

>>> sent.gloss()
'one day the cobbler met a beggar'

The words in a sentence appear to be strings, and are strings, but they are more precisely instances of a specialization of str called Token. They have some additional methods that str lacks. In particular, each token has a location, which consists of the text ID, the sentence number, and the position of the word within the sentence:

>>> token = sent[4]
>>> token
'Schuster'
>>> token.loc()
<Loc 3.1.5>
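
For readers unfamiliar with specializing str, the following sketch shows the general technique. ToyToken is a hypothetical class, not Selkie's Token, and it represents the location as a plain dotted string rather than a Loc object:

```python
# Hypothetical stand-in for illustration; not Selkie's Token class.
class ToyToken(str):
    def __new__(cls, form, textid, sentno, wordno):
        # str is immutable, so the string value must be fixed in __new__.
        self = super().__new__(cls, form)
        self._loc = (textid, sentno, wordno)
        return self

    def loc(self):
        # Render the location in dotted form, e.g. '3.1.5'.
        return '.'.join(str(part) for part in self._loc)


tok = ToyToken('Schuster', '3', 1, 5)
print(tok, tok.loc(), isinstance(tok, str))
```

The instance compares equal to the plain string and supports every str method, while also carrying the extra location data.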

A token also has a link to its lexical entry:

>>> type(token.entry())
<class 'selkie.corpus.core.Lexent'>

For convenience, one can access all methods of the lexical entry directly from the token. We return to that point below, after discussing lexical entries.

Lexicon

In addition to accessing lexical entries via tokens, one can access them from the Lexicon itself. The Lexicon behaves like a dict whose keys are forms:

>>> list(deu.lexicon)
['begegnete', 'Bettler', 'der', 'einen', 'eines', 'Schuster', 'Tag', 'Tages']
>>> tages = deu.lexicon['Tages']
>>> print(tages.table())
id   Tages
pp   Tag .gen

An entry, like a token, is a specialization of str:

>>> tages
'Tages'

But it has additional methods, like table() in the example above. In particular, it has methods for accessing lexical attributes:

>>> tages.form()
'Tages'
>>> tages.parts()
['Tag', '.gen']

The method form() actually just returns the lexent itself. The value of parts() is also a list of lexents, not merely a list of strings. For example:

>>> tag = tages.parts()[0]
>>> tag
'Tag'
>>> tag.gloss()
'day'

The lexical attributes were listed in the discussion of SLF. The method names are:

  • form

  • formtype

  • cat

  • parts

  • gloss

  • canonical

  • orthographic

The values of these lexical attributes can be set using the method set():

>>> tages.set(cat='N', gloss="day's")
>>> print(tages.table())
id   Tages
pp   Tag .gen
c    N
g    day's

In addition, there are two (automatically-generated) inverse relations:

  • partof — inverse of parts

  • variants — inverse of canonical

For example:

>>> sorted(tag.partof())
['Tag', 'Tages']

By default, the returned set includes not only forms that the input is an immediate part of, but the reflexive-transitive closure of that relation. One can suppress the closure by specifying closure=False:

>>> tag.partof(closure=False)
{'Tages'}
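
To make the notion of reflexive-transitive closure concrete, here is a minimal sketch over a plain dict. The function toy_partof and the entry ‘Tagesanbruch’ are hypothetical and serve only to give the closure a second step to follow; this is not how Selkie computes the relation:

```python
def toy_partof(form, parts, closure=True):
    """Invert a form -> parts mapping (hypothetical illustration)."""
    # Build the immediate inverse relation: part -> forms containing it.
    inverse = {}
    for whole, pieces in parts.items():
        for piece in pieces:
            inverse.setdefault(piece, set()).add(whole)
    if not closure:
        return set(inverse.get(form, ()))
    # Reflexive: start from the form itself. Transitive: keep following links.
    result = {form}
    todo = [form]
    while todo:
        current = todo.pop()
        for whole in inverse.get(current, ()):
            if whole not in result:
                result.add(whole)
                todo.append(whole)
    return result


# 'Tagesanbruch' is an invented entry, not part of the example corpus.
toy_parts = {'Tages': ['Tag', '.gen'], 'Tagesanbruch': ['Tages', 'Anbruch']}
print(sorted(toy_partof('Tag', toy_parts)))
print(toy_partof('Tag', toy_parts, closure=False))
```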

Finally, a lexical entry has a method sentences() that accesses the token index to find all sentences in which this form occurs:

>>> tages.sentences()
{<Sentence 3.1 eines Tages begegnete der Schuster ...>}

The returned set includes not only sentences in which the form appears as an element, but also sentences in which the form appears as a part of an element. For example, the word “Tag” never appears as an independent word in our sentences, but it does appear as a part of the word “Tages”:

>>> tag.sentences()
{<Sentence 3.1 eines Tages begegnete der Schuster ...>}

One can restrict the return value just to sentences in which the form appears as an element by specifying recurse=False:

>>> tag.sentences(recurse=False)
set()

(Note: sentences() either calls partof() or not; it is not possible to specify that it should call partof(closure=False).)

One can break down the operation of sentences() into three steps. The method locations() returns a set of locations:

>>> tages.locations()
{<Loc 3.1.2>}

(The method locations() also accepts recurse=False as an option.) The method deref() of Language can then be used to go from the locations to the tokens whose location is specified:

>>> tokens = list(deu.deref(tages.locations()))
>>> tokens
['Tages']

Then one can get the sentences that the tokens belong to:

>>> tokens[0].sentence()
<Sentence 3.1 eines Tages begegnete der Schuster ...>
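
The three steps can be mimicked with ordinary dicts and sets. Everything in the sketch below (toy_sentences, toy_index, and the three functions) is a hypothetical stand-in; it assumes 1-based sentence and word numbers, as the printed locations suggest, and it ignores the recursion through partof():

```python
# Hypothetical in-memory data; keys are (text ID, sentence number).
toy_sentences = {
    ('3', 1): ['eines', 'Tages', 'begegnete', 'der',
               'Schuster', 'einen', 'Bettler'],
    ('3', 2): ['Ende'],
}

# Token index: form -> set of (text ID, sentence number, word number).
toy_index = {}
for (textid, sentno), sent_words in toy_sentences.items():
    for wordno, word in enumerate(sent_words, start=1):
        toy_index.setdefault(word, set()).add((textid, sentno, wordno))


def toy_locations(form):
    # Step 1: look up where the form occurs.
    return toy_index.get(form, set())


def toy_deref(locs):
    # Step 2: go from locations to the words found there.
    for textid, sentno, wordno in sorted(locs):
        yield toy_sentences[(textid, sentno)][wordno - 1]


def toy_sentences_of(form):
    # Step 3: keep only the sentence part of each location.
    return {(textid, sentno) for textid, sentno, _ in toy_locations(form)}


print(toy_locations('Tages'))
print(list(toy_deref(toy_locations('Tages'))))
print(toy_sentences_of('Tages'))
```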

Recall that we earlier set token to be one of the words of a sentence. Tokens have a method entry() that returns their lexent, though the token and lexent are not visibly different:

>>> token.entry()
'Schuster'
>>> type(token)
<class 'selkie.corpus.core.Token'>
>>> type(token.entry())
<class 'selkie.corpus.core.Lexent'>

For convenience, tokens have all the same methods as lexents. The token versions simply dispatch to the lexent:

>>> token.gloss()
'cobbler'
>>> print(token.table())
id   Schuster
g    cobbler

Thus, for practical purposes, one can think of a token simply as a lexent with additional methods loc() and sentence().
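
One common way to obtain this kind of dispatch in Python is to forward unknown attribute lookups to the wrapped object. The sketch below uses hypothetical classes ToyLexent and ToyTok; whether Selkie uses __getattr__ or defines each forwarding method explicitly is not stated here, so this is only one possible realization:

```python
# Hypothetical stand-ins for illustration; not Selkie's Lexent and Token.
class ToyLexent(str):
    def __new__(cls, form, gloss=''):
        self = super().__new__(cls, form)
        self._gloss = gloss
        return self

    def gloss(self):
        return self._gloss


class ToyTok(str):
    def __new__(cls, lexent, loc):
        self = super().__new__(cls, lexent)
        self._entry = lexent
        self._loc = loc
        return self

    def entry(self): return self._entry
    def loc(self): return self._loc

    def __getattr__(self, name):
        # Invoked only when normal lookup fails; forward to the lexent.
        if name.startswith('_'):
            raise AttributeError(name)    # avoid recursing on private names
        return getattr(self._entry, name)


schuster = ToyLexent('Schuster', gloss='cobbler')
toy_token = ToyTok(schuster, '3.1.5')
print(toy_token, toy_token.gloss(), toy_token.loc())
```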