Programmatic Interface to Corpus
Selkie provides a programmatic interface to SLF corpora. In the listing below, the first column gives the directory structure of an SLF corpus, the second column gives an expression for accessing the structural unit in question (assuming that the variable corpus contains the corpus as a whole), and the third column gives the type of the resulting object:
corpus/              corpus                          Corpus
    langs            corpus.langs                    LanguageTable
    roms/            corpus.roms                     RomRepository
        *romname*    corpus.roms[*romname*]          Rom
        ...
    *langid*/        corpus[*langid*]                Language
        lexicon      corpus[*langid*].lexicon        Lexicon
        index        corpus[*langid*].index          TokenIndex
        toc          corpus[*langid*].toc            MetadataTable
        txt/         corpus[*langid*].txt            TextTable
            *txtid*  corpus[*langid*].txt[*txtid*]   Text
            ...
    ...
Recall from the description of the SLF format that the individual files (‘langs’, ‘lexicon’, ‘index’, ‘toc’, and each of the roms and simple texts) are called corpus items. The contents of the items suffice to reconstruct the entire corpus.
Corpus
One loads a corpus using the Corpus constructor. Let us first create a temporary directory to work in:
>>> from tempfile import TemporaryDirectory
>>> tmp = TemporaryDirectory()
And let us create a corpus by copying an example:
>>> from selkie.data import ex
>>> from shutil import copytree
>>> from os.path import join
>>> corpus_filename = join(tmp.name, 'corpus')
>>> bool(copytree(ex('corp25.slf'), corpus_filename))
True
Opening the corpus:
>>> from selkie.corpus import Corpus
>>> corpus = Corpus(corpus_filename)
A corpus behaves like a dict whose keys are language IDs:
>>> list(corpus)
['deu']
>>> corpus['deu']
<Language deu German>
The methods __iter__(), __len__(), keys(), items(), and values() are
also available and work as one would expect.
One can use the method new() to add a new language:
>>> corpus.new('oji', 'Ojibwe')
<Language oji Ojibwe>
>>> list(corpus)
['deu', 'oji']
And one can delete a language using del:
>>> del corpus['oji']
>>> list(corpus)
['deu']
Language table
As indicated above, the corpus has a
langs member, which is the list of languages:
>>> print(corpus.langs)
deu German
One may equally treat corpus.langs as a dict containing languages. In
fact, dict method calls placed on the corpus are simply dispatched to
corpus.langs (including new() as an honorary “dict method call”).
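The dispatch just described can be pictured with a small delegation sketch. The classes LanguageTableSketch and CorpusSketch are hypothetical stand-ins, not Selkie's own code, and language values are plain strings here rather than Language objects; the sketch only shows a container forwarding dict-style calls to a table that it holds.

```python
class LanguageTableSketch:
    """Hypothetical stand-in for a language table: maps language IDs to names."""

    def __init__(self):
        self._langs = {}

    def new(self, langid, name):
        # Refuse duplicates so that new() never silently overwrites an entry
        # (an assumption; the behavior of the real new() is not documented here).
        if langid in self._langs:
            raise KeyError(f'language already exists: {langid}')
        self._langs[langid] = name
        return name

    def __getitem__(self, langid):
        return self._langs[langid]

    def __delitem__(self, langid):
        del self._langs[langid]

    def __iter__(self):
        return iter(self._langs)

    def __len__(self):
        return len(self._langs)


class CorpusSketch:
    """Hypothetical corpus that forwards every dict-style call to its langs table."""

    def __init__(self):
        self.langs = LanguageTableSketch()

    def new(self, langid, name):
        return self.langs.new(langid, name)

    def __getitem__(self, langid):
        return self.langs[langid]

    def __delitem__(self, langid):
        del self.langs[langid]

    def __iter__(self):
        return iter(self.langs)

    def __len__(self):
        return len(self.langs)


corpus_sketch = CorpusSketch()
corpus_sketch.new('deu', 'German')
corpus_sketch.new('oji', 'Ojibwe')
del corpus_sketch['oji']
```

Because every call is forwarded, the corpus and its table always agree: after the deletion, both list only 'deu'.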
Language
A language has an ID and a full name:
>>> deu = corpus['deu']
>>> deu.langid()
'deu'
>>> deu.fullname()
'German'
Alternatively, the properties listed in the ‘langs’ file can be accessed by treating the language as a dict:
>>> deu['id']
'deu'
>>> deu['name']
'German'
Similarly, those properties may be modified, and the change is automatically written to disk:
>>> deu['name'] = 'Deutsch (German)'
>>> deu['name']
'Deutsch (German)'
(However, the key ‘id’ cannot be modified.)
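A write-through record with a protected key might look roughly like the following. This is a sketch under loud assumptions: the class PersistentProps is hypothetical, JSON stands in for the real ‘langs’ file format, and raising KeyError for ‘id’ is a guess, since the text does not say what Selkie does when one tries to modify it.

```python
import json
import os
import tempfile


class PersistentProps:
    """Hypothetical dict-like record that rewrites its file on every change."""

    def __init__(self, path, props):
        self._path = path
        self._props = dict(props)
        self._save()

    def _save(self):
        # JSON is only a stand-in for the real on-disk format.
        with open(self._path, 'w', encoding='utf-8') as f:
            json.dump(self._props, f, ensure_ascii=False)

    def __getitem__(self, key):
        return self._props[key]

    def __setitem__(self, key, value):
        if key == 'id':
            # Assumed behavior: reject the change before touching the file.
            raise KeyError("the key 'id' cannot be modified")
        self._props[key] = value
        self._save()


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'langs.json')
lang = PersistentProps(path, {'id': 'deu', 'name': 'German'})
lang['name'] = 'Deutsch (German)'

# Read the file back to confirm that the change reached the disk.
with open(path, encoding='utf-8') as f:
    on_disk = json.load(f)
```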
Corresponding to the files in the language directory, a language has
attributes lexicon, index, toc, and txt:
>>> deu.lexicon
<Lexicon /deu/lexicon>
>>> deu.index
<TokenIndex /deu/index>
>>> deu.toc
<Toc /deu/toc>
>>> deu.txt
<TextTable deu>
Table of Contents
A table of contents (‘toc’) is a table that maps text IDs to metadata:
>>> list(deu.toc)
['1', '2', '3']
>>> deu.toc['1']
<TextMetadata deu 1>
The toc prints out as a listing of IDs, types, and titles:
>>> print(deu.toc)
1 story Eine kleine Geschichte
2 page p1
3 page p2
One can add new texts to the toc:
>>> deu.toc.new('4', ti='Der Taucher', ty='story')
<TextMetadata deu 4>
>>> print(deu.toc)
1 story Eine kleine Geschichte
2 page p1
3 page p2
4 story Der Taucher
Text metadata behaves like a dict:
>>> meta = deu.toc['1']
>>> meta['ti']
'Eine kleine Geschichte'
>>> print(meta)
id 1
ty story
ti Eine kleine Geschichte
ch 2 3
Text table
The ‘txt’ member has the same keys as the TOC (namely, text IDs), but the values are text objects instead of metadata objects:
>>> list(deu.txt)
['1', '2', '3', '4']
>>> deu.txt['1']
<Text 1>
>>> deu.txt['2']
<Text 2>
The same metadata dict that one accesses through ‘toc’ can also be accessed from the text itself:
>>> t1 = deu.txt['1']
>>> t1.metadata()
<TextMetadata deu 1>
Incidentally, the inverse method, from metadata to text, is also available:
>>> meta.text()
<Text 1>
The text has convenience methods to access most of the metadata items:
>>> t1.textid()
'1'
>>> t1.text_type()
'story'
>>> t1.author()
''
>>> t1.title()
'Eine kleine Geschichte'
However, one cannot access metadata properties using square brackets on a text. Square brackets applied to an aggregate text return its children, and square brackets applied to a simple text return its sentences.
Hierarchical structure
Texts form a hierarchical structure, represented by the children() and
parent() methods of Text. One obtains the roots of the hierarchy from
the language:
>>> roots = deu.get_roots()
>>> roots
[<Text 1>, <Text 4>]
From there, one follows children() and parent() links:
>>> roots[0].children()
[<Text 2>, <Text 3>]
>>> t2 = _[0]
>>> t2.parent()
<Text 1>
One can also use the method walk() to iterate over all descendants of
a text (including itself).
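A traversal of this kind can be sketched with a generator. The class TextNode is a hypothetical stand-in for Text, and the pre-order visiting sequence (a text before its descendants) is an assumption; the text above does not state the order in which walk() yields.

```python
class TextNode:
    """Hypothetical stand-in for a text: an ID plus a list of child nodes."""

    def __init__(self, textid, children=()):
        self.textid = textid
        self._children = list(children)
        self._parent = None
        for child in self._children:
            child._parent = self

    def children(self):
        return self._children

    def parent(self):
        return self._parent

    def walk(self):
        # Pre-order traversal: yield this node first, then each subtree in turn.
        yield self
        for child in self._children:
            yield from child.walk()


root = TextNode('1', [TextNode('2'), TextNode('3')])
visited = [node.textid for node in root.walk()]
```

Here visited contains the IDs of the root and both of its children, so a single loop over walk() replaces explicit recursion over children().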
A text has methods that characterize its intuitive level in the
hierarchy. The largest aggregates are collections, which are
distinguished by having text type ‘collection’. The largest
non-collections are documents. And the leaves of the hierarchy are
simple texts. Texts have methods to test those properties:
is_collection(), is_document(), is_simple_text(), and languages have
methods to fetch them:
>>> deu.get_collections()
[]
>>> deu.get_documents()
[<Text 1>, <Text 4>]
>>> deu.get_simple_texts()
[<Text 2>, <Text 3>, <Text 4>]
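The three levels can be made concrete with a small classifier over a toy table of contents. The function classify and its input layout are hypothetical. It assumes that a simple text is one without children and that a document is a non-collection whose parent is either absent or a collection; both are readings of the description above rather than confirmed Selkie logic.

```python
def classify(texts):
    """Split texts into collections, documents, and simple texts.

    `texts` maps a text ID to a (text_type, child_ids) pair.
    """
    # Invert the child lists to find each text's parent.
    parent = {c: t for t, (_, kids) in texts.items() for c in kids}

    collections = [t for t, (ty, _) in texts.items() if ty == 'collection']

    # A document is a largest non-collection: its parent, if any, is a collection.
    documents = [
        t for t, (ty, _) in texts.items()
        if ty != 'collection'
        and (t not in parent or texts[parent[t]][0] == 'collection')
    ]

    # A simple text is a leaf of the hierarchy.
    simple = [t for t, (_, kids) in texts.items() if not kids]
    return collections, documents, simple


# Toy data mirroring the example corpus: text 1 has children 2 and 3.
toy_toc = {
    '1': ('story', ['2', '3']),
    '2': ('page', []),
    '3': ('page', []),
    '4': ('story', []),
}
collections, documents, simple_texts = classify(toy_toc)
```

Note that text 4 lands in both the document list and the simple-text list, just as in the session above: a text with no parent and no children is at once a largest non-collection and a leaf.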
Sentences and words
A simple text behaves like a list of sentences:
>>> t3 = deu.txt['3']
>>> list(t3)
[<Sentence 3.1 eines Tages begegnete der Schuster ...>, <Sentence 3.2 Ende>]
(Incidentally, if one accesses an aggregate like a list, the list elements are the children.)
A sentence behaves like a list of words:
>>> sent = t3[0]
>>> list(sent)
['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler']
>>> sent[0]
'eines'
In addition, a sentence has methods for accessing a list of timestamps:
>>> sent.timestamps()
[(0, '1.4958'), (2, '1.9394'), (5, '2.7833'), (7, '3.3269')]
One can alternatively obtain a list of spans, which are triples consisting of start time, end time, and a list of words:
>>> for span in sent.spans():
... print(span)
...
('1.4958', '1.9394', ['eines', 'Tages'])
('1.9394', '2.7833', ['begegnete', 'der', 'Schuster'])
('2.7833', '3.3269', ['einen', 'Bettler'])
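The relationship between timestamps and spans can be reproduced in a few lines. The helper spans_from_timestamps is hypothetical; it assumes that each timestamp is a pair of word index and time and that the list is sorted by word index, as in the output above.

```python
def spans_from_timestamps(words, timestamps):
    """Pair up consecutive timestamps into (start, end, words) triples."""
    spans = []
    # Each adjacent pair of timestamps delimits one span of words.
    for (i, start), (j, end) in zip(timestamps, timestamps[1:]):
        spans.append((start, end, words[i:j]))
    return spans


words = ['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler']
timestamps = [(0, '1.4958'), (2, '1.9394'), (5, '2.7833'), (7, '3.3269')]
spans = spans_from_timestamps(words, timestamps)
```

With four timestamps there are three spans, and the final timestamp (index 7, one past the last word) serves only as the end time of the last span.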
Finally, if the sentence has a translation, the method gloss()
returns it:
>>> sent.gloss()
'one day the cobbler met a beggar'
The words in a sentence appear to be strings, and are strings, but they are more precisely instances of a specialization of str called Token. They have some additional methods that str lacks. In particular, each token has a location, which consists of the text ID, the sentence number, and the position of the word within the sentence:
>>> token = sent[4]
>>> token
'Schuster'
>>> token.loc()
<Loc 3.1.5>
A token also has a link to its lexical entry:
>>> type(token.entry())
<class 'selkie.corpus.core.Lexent'>
For convenience, one can access all methods of the lexical entry directly from the token. We return to that point below, after discussing lexical entries.
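For readers unfamiliar with specializing str, the following sketch shows the general technique. The class TokenSketch is hypothetical, and representing the location as a plain tuple is an assumption modeled on the three-part display `<Loc 3.1.5>`; Selkie's own Token and Loc classes may be organized differently.

```python
class TokenSketch(str):
    """Hypothetical str subclass that remembers where the word occurred."""

    def __new__(cls, form, textid, sentno, wordno):
        # str is immutable, so the string value must be fixed in __new__.
        self = super().__new__(cls, form)
        self._loc = (textid, sentno, wordno)
        return self

    def loc(self):
        return self._loc


tok = TokenSketch('Schuster', '3', 1, 5)
upper = tok.upper()   # ordinary str methods still work
```

The instance compares equal to the plain string 'Schuster' and can be used wherever a str is expected, while still carrying its extra method.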
Lexicon
In addition to accessing lexical entries via tokens, one can access them from the Lexicon itself. The Lexicon behaves like a dict whose keys are forms:
>>> list(deu.lexicon)
['begegnete', 'Bettler', 'der', 'einen', 'eines', 'Schuster', 'Tag', 'Tages']
>>> tages = deu.lexicon['Tages']
>>> print(tages.table())
id Tages
pp Tag .gen
An entry, like a token, is a specialization of str:
>>> tages
'Tages'
But it has additional methods, like table() in the example above.
In particular, it has methods for accessing lexical attributes:
>>> tages.form()
'Tages'
>>> tages.parts()
['Tag', '.gen']
The method form() actually just returns the lexent itself.
The value of parts() is also a list of lexents, not
merely a list of strings. For example:
>>> tag = tages.parts()[0]
>>> tag
'Tag'
>>> tag.gloss()
'day'
The lexical attributes were listed in the discussion of SLF. The method names are:
form
formtype
cat
parts
gloss
canonical
orthographic
The values of these lexical attributes can be set using the method
set():
>>> tages.set(cat='N', gloss="day's")
>>> print(tages.table())
id Tages
pp Tag .gen
c N
g day's
In addition, there are two (automatically-generated) inverse relations:
partof — inverse of parts
variants — inverse of canonical
For example:
>>> sorted(tag.partof())
['Tag', 'Tages']
By default, the returned set includes not only forms that the input is
an immediate part of, but the reflexive-transitive closure of that
relation. One can suppress the closure by specifying closure=False:
>>> tag.partof(closure=False)
{'Tages'}
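The difference between the two modes can be sketched over a plain dict of parts. The function below is a hypothetical reimplementation, not Selkie's code, and the entry 'Tagesanbruch' is invented purely to show the transitive step; it does not occur in the example corpus.

```python
def partof(form, parts, closure=True):
    """Return the set of forms that `form` is a part of.

    `parts` maps each form to the list of its immediate parts.
    """
    # Invert the parts relation: part -> forms that contain it directly.
    containers = {}
    for whole, pieces in parts.items():
        for piece in pieces:
            containers.setdefault(piece, set()).add(whole)

    if not closure:
        return set(containers.get(form, ()))

    result = {form}      # reflexive: the form counts as a part of itself
    frontier = [form]
    while frontier:      # transitive: keep climbing to larger forms
        current = frontier.pop()
        for whole in containers.get(current, ()):
            if whole not in result:
                result.add(whole)
                frontier.append(whole)
    return result


toy_parts = {
    'Tages': ['Tag', '.gen'],
    'Tagesanbruch': ['Tages', 'Anbruch'],   # hypothetical entry
}
with_closure = partof('Tag', toy_parts)
immediate = partof('Tag', toy_parts, closure=False)
```

With the closure, 'Tag' is reported as a part of itself, of 'Tages', and of the invented compound; without it, only the immediate container 'Tages' is returned.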
Finally, a lexical entry has a method sentences() that
accesses the token index to find all sentences in which this form occurs:
>>> tages.sentences()
{<Sentence 3.1 eines Tages begegnete der Schuster ...>}
The returned set includes not only sentences in which the form appears as an element, but also sentences in which the form appears as a part of an element. For example, the word “Tag” never appears as an independent word in our sentences, but it does appear as a part of the word “Tages”:
>>> tag.sentences()
{<Sentence 3.1 eines Tages begegnete der Schuster ...>}
One can restrict the return value just to sentences in which the form
appears as an element by specifying recurse=False:
>>> tag.sentences(recurse=False)
set()
(Note: sentences() either calls partof() or not; it is not
possible to specify that it should call partof(closure=False).)
One can break down the operation of sentences() into
three steps. The method locations() returns a set of
locations:
>>> tages.locations()
{<Loc 3.1.2>}
(The method locations() also accepts recurse=False as an
option.) The method deref() of Language can then be used to go
from the locations to the tokens whose location is specified:
>>> tokens = list(deu.deref(tages.locations()))
>>> tokens
['Tages']
Then one can get the sentences that the tokens belong to:
>>> tokens[0].sentence()
<Sentence 3.1 eines Tages begegnete der Schuster ...>
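The same three steps can be imitated with plain dicts, which may help in seeing what a token index contributes. Everything here is a hypothetical sketch: locations are (text ID, sentence number, word number) tuples with 1-based numbering, sentences are identified by (text ID, sentence number) pairs, and none of the helper names belong to Selkie.

```python
# Toy corpus: (text ID, sentence number) -> list of words.
toy_sentences = {
    ('3', 1): ['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler'],
    ('3', 2): ['Ende'],
}

# Build a token index mapping each form to the set of its locations.
toy_index = {}
for (textid, sentno), sentence_words in toy_sentences.items():
    for wordno, word in enumerate(sentence_words, start=1):
        toy_index.setdefault(word, set()).add((textid, sentno, wordno))


def locations(form):
    """Step 1: look the form up in the index."""
    return toy_index.get(form, set())


def deref(locs):
    """Step 2: go from locations to the words found at those locations."""
    for textid, sentno, wordno in sorted(locs):
        yield toy_sentences[(textid, sentno)][wordno - 1]


def sentences_of(form):
    """Step 3: collect the sentences that contain those locations."""
    return {(textid, sentno) for textid, sentno, _ in locations(form)}


tages_locs = locations('Tages')
tages_tokens = list(deref(tages_locs))
tages_sents = sentences_of('Tages')
```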
Recall that we earlier set token to be one of the words of a
sentence. Tokens have a method entry() that returns their lexent,
though the token and lexent are not visibly different:
>>> token.entry()
'Schuster'
>>> type(token)
<class 'selkie.corpus.core.Token'>
>>> type(token.entry())
<class 'selkie.corpus.core.Lexent'>
For convenience, tokens have all the same methods as lexents. The token versions simply dispatch to the lexent:
>>> token.gloss()
'cobbler'
>>> print(token.table())
id Schuster
g cobbler
Thus, for practical purposes, one can think of a token simply as a
lexent with additional methods loc() and sentence().
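One common way to obtain this kind of dispatch in Python is `__getattr__`, which is consulted only when ordinary attribute lookup fails. The sketch below uses that technique with hypothetical classes LexentSketch and DispatchingToken; whether Selkie uses `__getattr__` or defines each forwarding method explicitly is not stated here.

```python
class LexentSketch(str):
    """Hypothetical lexical entry: a str that also carries a gloss."""

    def __new__(cls, form, gloss=''):
        self = super().__new__(cls, form)
        self._gloss = gloss
        return self

    def gloss(self):
        return self._gloss


class DispatchingToken(str):
    """Hypothetical token that forwards unknown methods to its entry."""

    def __new__(cls, form, entry):
        self = super().__new__(cls, form)
        self._entry = entry
        return self

    def entry(self):
        return self._entry

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so str methods and the
        # token's own methods take priority over the entry's methods.
        if name == '_entry':
            # Guard against infinite recursion if _entry was never set.
            raise AttributeError(name)
        return getattr(self._entry, name)


schuster = DispatchingToken('Schuster', LexentSketch('Schuster', 'cobbler'))
token_gloss = schuster.gloss()   # forwarded to the entry
```

The token defines no gloss() of its own, yet the call succeeds because the lookup falls through to the entry.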