Programmatic Interface to Corpus
================================

Selkie provides a programmatic interface to SLF corpora.
The directory structure of an SLF corpus is given in the first column
of the following.  The second column gives an expression for accessing
the structural unit in question, assuming that *corpus* is a variable
containing the corpus as a whole, and the third column gives the type
of the object::

    corpus/              corpus                         Corpus
      langs              corpus.langs                   LanguageTable
      roms/              corpus.roms                    RomRepository
        *romname*        corpus.roms[*romname*]         Rom
        ...
      *langid*/          corpus[*langid*]               Language
        lexicon          corpus[*langid*].lexicon       Lexicon
        index            corpus[*langid*].index         TokenIndex
        toc              corpus[*langid*].toc           MetadataTable
        txt/             corpus[*langid*].txt           TextTable
          *txtid*        corpus[*langid*].txt[*txtid*]  Text
          ...
      ...

Recall from the description of the SLF format that the individual files
('langs', 'lexicon', 'index', 'toc', and each of the roms and simple
texts) are called corpus *items*.  The contents of the items suffice
to reconstruct the entire corpus.

Corpus
------

One loads a corpus using the Corpus constructor.  Let us first create
a temp directory to work in::

    >>> from tempfile import TemporaryDirectory
    >>> tmp = TemporaryDirectory()

And let us create a corpus by copying an example::

    >>> from selkie.data import ex
    >>> from shutil import copytree
    >>> from os.path import join
    >>> corpus_filename = join(tmp.name, 'corpus')
    >>> bool(copytree(ex('corp25.slf'), corpus_filename))
    True

Opening the corpus::

    >>> from selkie.corpus import Corpus
    >>> corpus = Corpus(corpus_filename)

A corpus behaves like a dict whose keys are language IDs::

    >>> list(corpus)
    ['deu']
    >>> corpus['deu']

The methods ``__iter__()``, ``__len__()``, ``keys()``, ``items()``,
and ``values()`` are also available and work as one would expect.

One can use the method ``new()`` to add a new language::

    >>> corpus.new('oji', 'Ojibwe')
    >>> list(corpus)
    ['deu', 'oji']

And one can delete a language using ``del``::

    >>> del corpus['oji']
    >>> list(corpus)
    ['deu']

Language table
--------------

As indicated above, the corpus has a ``langs`` member, which is
the list of languages::

    >>> print(corpus.langs)
    deu German

One may equally treat corpus.langs as a dict containing languages.  In
fact, dict method calls placed on the corpus are simply dispatched to
corpus.langs (including ``new()`` as an honorary "dict method call").

Language
--------

A language has an ID and a full name::

    >>> deu = corpus['deu']
    >>> deu.langid()
    'deu'
    >>> deu.fullname()
    'German'

Alternatively, the properties listed in the 'langs' file can be
accessed by treating the language as a dict::

    >>> deu['id']
    'deu'
    >>> deu['name']
    'German'

Similarly, those properties may be modified, and the change is
automatically written to disk::

    >>> deu['name'] = 'Deutsch (German)'
    >>> deu['name']
    'Deutsch (German)'

(However, the key 'id' cannot be modified.)

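
The write-through behavior described above can be illustrated with a
minimal, self-contained sketch.  It does not use Selkie: the class
``Record``, the tab-separated file layout, and the choice of
``ValueError`` for the read-only key are all assumptions made for
illustration, not Selkie's actual implementation or the SLF format.

```python
import os
import tempfile


class Record:
    """Hypothetical dict-like record that rewrites its file on every change."""

    def __init__(self, filename, props):
        self._filename = filename
        self._props = dict(props)
        self._save()

    def __getitem__(self, key):
        return self._props[key]

    def __setitem__(self, key, value):
        # Assumption: the identifying key is read-only, as with 'id' above.
        if key == 'id':
            raise ValueError("the key 'id' cannot be modified")
        self._props[key] = value
        self._save()  # every modification is written to disk immediately

    def _save(self):
        # Assumed layout: one "key<TAB>value" pair per line.
        with open(self._filename, 'w', encoding='utf-8') as f:
            for key, value in self._props.items():
                f.write(f'{key}\t{value}\n')


tmpdir = tempfile.TemporaryDirectory()
path = os.path.join(tmpdir.name, 'deu')
rec = Record(path, {'id': 'deu', 'name': 'German'})
rec['name'] = 'Deutsch (German)'

# Read the file back to confirm that the change reached the disk.
with open(path, encoding='utf-8') as f:
    on_disk = f.read()
print(on_disk, end='')
tmpdir.cleanup()
```

The point of the design is that the caller never issues an explicit
save: assignment is the only operation needed to keep the file current.
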
Corresponding to the files in the language directory, a language has
attributes ``lexicon``, ``index``, ``toc``, and ``txt``::

    >>> deu.lexicon
    >>> deu.index
    >>> deu.toc
    >>> deu.txt

Table of Contents
-----------------

A table of contents ('toc') is a table that maps text IDs to metadata::

    >>> list(deu.toc)
    ['1', '2', '3']
    >>> deu.toc['1']

The toc prints out as a listing of IDs, types, and titles::

    >>> print(deu.toc)
    1 story Eine kleine Geschichte
    2 page p1
    3 page p2

One can add new texts to the toc::

    >>> deu.toc.new('4', ti='Der Taucher', ty='story')
    >>> print(deu.toc)
    1 story Eine kleine Geschichte
    2 page p1
    3 page p2
    4 story Der Taucher

Text metadata behaves like a dict::

    >>> meta = deu.toc['1']
    >>> meta['ti']
    'Eine kleine Geschichte'
    >>> print(meta)
    id 1
    ty story
    ti Eine kleine Geschichte
    ch 2 3

Text table
----------

The 'txt' member has the same keys as the TOC (namely, text IDs), but
the values are text objects instead of metadata objects::

    >>> list(deu.txt)
    ['1', '2', '3', '4']
    >>> deu.txt['1']
    >>> deu.txt['2']

The same metadata dict that one accesses through 'toc' can also be
accessed from the text itself::

    >>> t1 = deu.txt['1']
    >>> t1.metadata()

Incidentally, the inverse method, from metadata to text, is also
available::

    >>> meta.text()

The text has convenience methods to access most of the metadata items::

    >>> t1.textid()
    '1'
    >>> t1.text_type()
    'story'
    >>> t1.author()
    ''
    >>> t1.title()
    'Eine kleine Geschichte'

However, one cannot access metadata properties using square brackets
on a text.  Square brackets applied to an aggregate text return its
children, and square brackets applied to a simple text return its
sentences.

Hierarchical structure
----------------------

Texts form a hierarchical structure, represented by the ``children()``
and ``parent()`` methods of Text.
One obtains the roots of the hierarchy from the language::

    >>> roots = deu.get_roots()
    >>> roots
    [, ]

From there, one follows ``children()`` and ``parent()`` links::

    >>> roots[0].children()
    [, ]
    >>> t2 = _[0]
    >>> t2.parent()

One can also use the method ``walk()`` to iterate over all descendants
of a text (including itself).

A text has methods that characterize its intuitive level in the
hierarchy.  The largest aggregates are *collections*, which are
distinguished by having text type 'collection'.  The largest
non-collections are *documents*.  And the leaves of the hierarchy are
simple texts.  Texts have methods to test those properties:
``is_collection()``, ``is_document()``, ``is_simple_text()``, and
languages have methods to fetch them::

    >>> deu.get_collections()
    []
    >>> deu.get_documents()
    [, ]
    >>> deu.get_simple_texts()
    [, , ]

Sentences and words
-------------------

A simple text behaves like a list of sentences::

    >>> t3 = deu.txt['3']
    >>> list(t3)
    [, ]

(Incidentally, if one accesses an aggregate like a list, the list
elements are the children.)

A sentence behaves like a list of words::

    >>> sent = t3[0]
    >>> list(sent)
    ['eines', 'Tages', 'begegnete', 'der', 'Schuster', 'einen', 'Bettler']
    >>> sent[0]
    'eines'

In addition, a sentence has a method for accessing a list of
timestamps, each of which pairs a word index with a time::

    >>> sent.timestamps()
    [(0, '1.4958'), (2, '1.9394'), (5, '2.7833'), (7, '3.3269')]

One can alternatively obtain a list of *spans*, which are triples
consisting of start time, end time, and a list of words::

    >>> for span in sent.spans():
    ...     print(span)
    ...
    ('1.4958', '1.9394', ['eines', 'Tages'])
    ('1.9394', '2.7833', ['begegnete', 'der', 'Schuster'])
    ('2.7833', '3.3269', ['einen', 'Bettler'])

Finally, if the sentence has a translation, the method ``gloss()``
returns it::

    >>> sent.gloss()
    'one day the cobbler met a beggar'

The words in a sentence appear to be strings, and *are* strings, but
they are more precisely instances of a specialization of str called
Token.
They have some additional methods that str lacks.  In particular,
each token has a location, which consists of the text ID and the
sentence number::

    >>> token = sent[4]
    >>> token
    'Schuster'
    >>> token.loc()

A token also has a link to its lexical entry::

    >>> type(token.entry())

For convenience, one can access all methods of the lexical entry
directly from the token.  We return to that point below, after
discussing lexical entries.

Lexicon
-------

In addition to accessing lexical entries via tokens, one can access
them from the Lexicon itself.  The Lexicon behaves like a dict whose
keys are forms::

    >>> list(deu.lexicon)
    ['begegnete', 'Bettler', 'der', 'einen', 'eines', 'Schuster', 'Tag', 'Tages']
    >>> tages = deu.lexicon['Tages']
    >>> print(tages.table())
    id Tages
    pp Tag .gen

An entry, like a token, is a specialization of str::

    >>> tages
    'Tages'

But it has additional methods, like ``table()`` in the example above.
In particular, it has methods for accessing lexical attributes::

    >>> tages.form()
    'Tages'
    >>> tages.parts()
    ['Tag', '.gen']

The method ``form()`` actually just returns the lexical entry (the
*lexent*) itself.  The value of ``parts()`` is also a list of lexents,
not merely a list of strings.  For example::

    >>> tag = tages.parts()[0]
    >>> tag
    'Tag'
    >>> tag.gloss()
    'day'

The lexical attributes were listed in the discussion of SLF.  The
method names are:

* form
* formtype
* cat
* parts
* gloss
* canonical
* orthographic

The values of these lexical attributes can be set using the method
``set()``::

    >>> tages.set(cat='N', gloss="day's")
    >>> print(tages.table())
    id Tages
    pp Tag .gen
    c N
    g day's

In addition, there are two (automatically-generated) inverse relations:

* partof -- inverse of parts
* variants -- inverse of canonical

For example::

    >>> sorted(tag.partof())
    ['Tag', 'Tages']

By default, the return set includes not only forms that the input is
an immediate part of, but the reflexive-transitive closure of that
relation.
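
To make "reflexive-transitive closure" concrete, here is a minimal
sketch that computes it from a plain dict.  The function
``partof_closure`` and the dict ``parts`` are hypothetical stand-ins
written for this illustration; they are not part of Selkie, and Selkie
may compute the relation differently.

```python
def partof_closure(form, parts):
    """Return every form that `form` is part of, directly or indirectly.

    `parts` maps a form to the list of its parts.  The result includes
    `form` itself (reflexive) and wholes of wholes (transitive).
    """
    # Invert the parts relation: part -> set of forms that contain it.
    partof = {}
    for whole, components in parts.items():
        for component in components:
            partof.setdefault(component, set()).add(whole)

    result = {form}          # reflexive: a form counts as part of itself
    todo = [form]
    while todo:              # transitive: keep following partof links
        current = todo.pop()
        for whole in partof.get(current, ()):
            if whole not in result:
                result.add(whole)
                todo.append(whole)
    return result


# Toy data mirroring the example lexicon: 'Tages' consists of 'Tag' + '.gen'.
parts = {'Tages': ['Tag', '.gen']}
print(sorted(partof_closure('Tag', parts)))   # ['Tag', 'Tages']
```
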
One can suppress the closure by specifying ``closure=False``::

    >>> tag.partof(closure=False)
    {'Tages'}

Finally, a lexical entry has a method ``sentences()`` that accesses
the token index to find all sentences in which this form occurs::

    >>> tages.sentences()
    {}

The returned set includes not only sentences in which the form appears
as an element, but also sentences in which the form appears as a part
of an element.  For example, the word "Tag" never appears as an
independent word in our sentences, but it does appear as a part of the
word "Tages"::

    >>> tag.sentences()
    {}

One can restrict the return value just to sentences in which the form
appears as an element by specifying ``recurse=False``::

    >>> tag.sentences(recurse=False)
    set()

(Note: ``sentences()`` either calls ``partof()`` or not; it is not
possible to specify that it should call ``partof(closure=False)``.)

One can break down the operation of ``sentences()`` into three steps.
The method ``locations()`` returns a set of locations::

    >>> tages.locations()
    {}

(The method ``locations()`` also accepts ``recurse=False`` as an
option.)  The method ``deref()`` of Language can then be used to go
from the locations to the tokens whose location is specified::

    >>> tokens = list(deu.deref(tages.locations()))
    >>> tokens
    ['Tages']

Then one can get the sentences that the tokens belong to::

    >>> tokens[0].sentence()

Recall that we earlier set ``token`` to be one of the words of a
sentence.  Tokens have a method ``entry()`` that returns their lexent,
though the token and lexent are not visibly different::

    >>> token.entry()
    'Schuster'
    >>> type(token)
    >>> type(token.entry())

For convenience, tokens have all the same methods as lexents.  The
token versions simply dispatch to the lexent::

    >>> token.gloss()
    'cobbler'
    >>> print(token.table())
    id Schuster
    g cobbler

Thus, for practical purposes, one can think of a token simply as a
lexent with additional methods ``loc()`` and ``sentence()``.
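
The relationship between tokens and lexents can be sketched in plain
Python.  The classes below are hypothetical stand-ins, not Selkie's
own ``Token`` and lexent classes: the constructor signatures, the
tuple used as a location, and the use of ``__getattr__`` for dispatch
are assumptions chosen to show one way a str specialization can
forward unknown methods to its lexical entry.

```python
class Lexent(str):
    """Hypothetical lexical entry: a str that carries lexical attributes."""

    def __new__(cls, form, gloss=''):
        self = super().__new__(cls, form)
        self._gloss = gloss
        return self

    def gloss(self):
        return self._gloss

    def table(self):
        rows = [('id', str(self))]
        if self._gloss:
            rows.append(('g', self._gloss))
        return '\n'.join(f'{key} {value}' for key, value in rows)


class Token(str):
    """Hypothetical token: a str with a location and a link to its lexent."""

    def __new__(cls, lexent, textid, sentno):
        self = super().__new__(cls, lexent)
        self._entry = lexent
        self._loc = (textid, sentno)   # assumed shape of a location
        return self

    def entry(self):
        return self._entry

    def loc(self):
        return self._loc

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, so str methods and the
        # token's own methods are unaffected.  Private names are never
        # forwarded, which avoids recursion if _entry is not yet set.
        if name.startswith('_'):
            raise AttributeError(name)
        return getattr(self._entry, name)


schuster = Lexent('Schuster', gloss='cobbler')
token = Token(schuster, '3', 0)
print(token.gloss())    # cobbler  (forwarded to the lexent)
print(token.table())
print(token.upper())    # SCHUSTER (ordinary str method)
```

Because both classes derive from ``str``, a token and its lexent print
identically and compare equal as strings, which matches the
observation that the two "are not visibly different".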