Application and corpus

Overview

The class CLD is defined in selkie.cld.core. The cld_app is an instance of SealApp that represents the CLD application. Its contents are represented by the CLDRequestHandler class, which overrides only two methods of RequestHandler:

  • open_file() - returns a Corpus

  • make_root() - returns a CorpusEditor

The cld_app is used as the application function in a CLDManager. It can be invoked as:

$ cld corpus.cld

Manually instantiating the corpus

The easiest way to get a Corpus instance is to use the CLDManager:

>>> from selkie.cld.toplevel import CLDManager
>>> mgr = CLDManager('/tmp/corpus.cld')
>>> corpus = mgr.corpus()
>>> corpus
<Corpus /tmp/corpus.cld>

Corpus and environment

OUT OF DATE

The Corpus class

A Corpus is a Structure with the following signature.

  • langs (LanguageList) — list of mono-lingual subcorpora

  • users (UserList) — a Collection with child type User

  • roms (Registry) — the central registry of romanizations

  • glab (GLabDirectory)

In addition, a corpus has a _meta member containing a PropList with general information, and, like all Files, an env member containing an Environment.

The Environment

The env member is inherited from File, but it is set by the Corpus, because the Corpus is the root of the disk hierarchy. See the section ‘Environment’ for general information on environments. Corpus specializes Database, which specializes EnvRoot.

When a Language is reached while descending the hierarchy, a new copy of the Environment is created that is specific to that language. The new copy is used by the Language and its descendants.

An Environment instance has the following members:

  • corpus — A backlink to the Corpus.

  • username — The authorized username, for the purpose of permissions

  • language — Set to the language, if within the scope of a language

  • parent — The original Environment, if this one belongs to a language.

All Environments provide the following methods:

  • for_language(lang) — Create a copy associated with the given language.

  • require_rom(name) — Returns the named Romanization. Signals an error if not found.

  • find_rom(name) — Returns the named Romanization, or None if not found.

Language-specific Environments provide the following methods:

  • get_text(id) — Returns the text that has the given ID.

  • default_orthography() — Returns the default orthography for this language.

  • orthographies() — Returns the list of available orthographies for this language.

  • romanization() — Returns the default orthography as a Romanization.

  • deref_parid(parid) — Returns the paragraph with the given paragraph ID.
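The copy-on-descent behavior and the difference between require_rom() and find_rom() can be illustrated with a small stand-in class. This is only a sketch: ToyEnvironment, its constructor arguments, and the choice of KeyError are assumptions made for illustration, and the real Environment in selkie.cld may be implemented differently.

```python
import copy


class ToyEnvironment:
    """Hypothetical stand-in for Environment; not the selkie.cld class."""

    def __init__(self, corpus, username, roms=None):
        self.corpus = corpus        # backlink to the corpus
        self.username = username    # authorized username
        self.language = None        # set only within the scope of a language
        self.parent = None          # original environment, for language copies
        self._roms = dict(roms or {})

    def for_language(self, lang):
        # Shallow copy, so the language environment shares the corpus
        # backlink and the romanization table with the original.
        env = copy.copy(self)
        env.language = lang
        env.parent = self
        return env

    def find_rom(self, name):
        # Lenient lookup: returns None if the name is unknown.
        return self._roms.get(name)

    def require_rom(self, name):
        # Strict lookup: signals an error if the name is unknown.
        # (KeyError is an assumption; the real error type is not documented here.)
        rom = self.find_rom(name)
        if rom is None:
            raise KeyError(f'no romanization named {name!r}')
        return rom


root = ToyEnvironment(corpus='corpus.cld', username='abney',
                      roms={'oji': 'placeholder romanization'})
lang_env = root.for_language('oji')
print(lang_env.parent is root, root.language, lang_env.language)
```

The original environment is left unchanged by for_language(), which is why descendants of a Language can use the copy without affecting the rest of the hierarchy.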

Example

An example of opening a corpus and accessing a couple of its members:

>>> from selkie.cld.corpus import Corpus
>>> corpus = Corpus('corpus.cld')
>>> corpus.media.filename()
'/Users/abney/git/cld/media'
>>> corpus.langs['oji']
<selkie.cld.language.Language object at 0x10ac41fd0>

User interface

Corpus UI

Metadata editor

Catalog of pages

The relevant modules all belong to selkie.cld.ui.

  • /home — CorpusEditor (corpus)

  • /langs — LanguageListEditor (language)

  • …/lang.xxx/home — LanguageEditor (language)

  • …/texts/home — TocEditor (toc)

  • …/page/edit — PageEditor (page)

  • …/audio/edit — AudioEditor (audio)

Organization by URL

The most natural starting point for examining code is often the URL that you use to reach a page. Each page is generated by a particular method of an HTML directory instance. The page connects to other pieces of source code: there may be JavaScript code associated with the page, which places callbacks to the HTML directory, and the HTML directory is generally associated with one or more disk objects.

The quickest way to determine the page and directory associated with a URL is to run a query in Python. For example, in the directory ~/git/cld, do:

>>> from selkie.cld.app import App
>>> app = App('test.cfg')
>>> app('/langs/lang.oji/texts/text.7/page/xscript/edit.0')
<HtmlPage Media 33>
>>> page = _

From the page, one can get the parent (and determine its class):

>>> page.__parent__
<selkie.cld.ui.media.Transcriber object at 0x10bde84a8>

One can also determine the name that was used to access the page:

>>> page.__file__.name
'edit.0'

The last directory component of the URL pathname (“xscript”, in our example) often determines a unique directory class. The following table lists the associations.

URL component    Directory class
/                CorpusEditor
users            GroupsEditor
langs            LanguageListEditor
lgsel            LanguageSelector
lang.*name*      LanguageEditor
texts            TocEditor
text.*id*        TextEditor
page             PageEditor
xscript          Transcriber
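The associations in the table can also be written down as a small lookup, which is handy when reading a URL without a running application. The following is a sketch only: DIRECTORY_CLASSES and directory_class_name() are hypothetical names, not part of selkie.cld, and the function assumes that the final URL component always names a page (such as home or edit.0) rather than a directory.

```python
from fnmatch import fnmatchcase

# (pattern for the last directory component, name of the directory class)
DIRECTORY_CLASSES = [
    ('users', 'GroupsEditor'),
    ('langs', 'LanguageListEditor'),
    ('lgsel', 'LanguageSelector'),
    ('lang.*', 'LanguageEditor'),
    ('texts', 'TocEditor'),
    ('text.*', 'TextEditor'),
    ('page', 'PageEditor'),
    ('xscript', 'Transcriber'),
]


def directory_class_name(url_path):
    """Return the class name for the last directory component of url_path."""
    components = [c for c in url_path.split('/') if c]
    directories = components[:-1]   # assumption: the last component is the page
    if not directories:
        return 'CorpusEditor'       # a page directly under the root
    last = directories[-1]
    for pattern, class_name in DIRECTORY_CLASSES:
        if fnmatchcase(last, pattern):
            return class_name
    return None                     # component not listed in the table


print(directory_class_name('/langs/lang.oji/texts/text.7/page/xscript/edit.0'))
```

For the example URL used earlier, the last directory component is xscript, so the lookup names Transcriber, which agrees with the class reported by page.__parent__.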

The following provides an overview of the interface pages. The names are classes within selkie.cld.ui, and the arguments are classes within selkie.cld.

  • CorpusEditor(corpus:Corpus)

    • corpora: CorpusListEditor(contents:CorpusList)

    • lang: LanguageEditor(lang:Language)

    • text: TextEditor

  • langs

    • LanguageListEditor(contents: LanguageList)

    • Search — langs/search

    • Ojibwa — lang.oji

  • lgsel

    • LanguageSelector(langlist: LanguageList)

  • lang.*l*

    • LanguageEditor(lang: Language)

    • Texts — texts

    • Lexicon — lexicon

  • texts, text.*i*

    • TextEditor(text: Text)

    • Redirect — toc, page, stub

  • toc

    • TocEditor(toc: Toc)

  • page

    • PageEditor(page: Page)

    • PlainTextPanel

    • click — igt

  • igt

    • IGTEditor

    • [IGTEditor, LexentViewer]

Corpus file format

The following table gives the corpus file format. The root type is ‘Corpus.’

All directories contain _children and _perm; they are not explicitly mentioned in the table.

All files are in tab-separated format.

Filename    Class            FD   Contents
_children   Children         F    name suffix
_config     Config           F    key value
_groups     GroupsFile       F    usr grp*[*(sp)grp*]*
_info       PropDict         F    key value
_meta       PropDict         F    key value
_perm       Permissions      F    mode role usr*[*(sp)usr*]
*.cl        ClipsFile        F    start end
*.cld       Corpus           D    _config _meta _groups
*.cld       Corpus           D    glab.gd langs.ll roms.reg users.ul
*.gd        GLabDirectory    D    user.gl*
*.gl        Library          D    n.gn*
*.gn        Notebook         F    (GLab notebook format)
*.lg        Language         D    _index _info lexicon.lx texts.toc
*.ll        LanguageList     D    lang.lg*
*.lx        Lexicon          F    form sno refs n=value
*.mf        MediaFile        F    usr/name.*suf*
*.mi        MediaIndex       F    name.*suf tid*
*.pd        PropDict         F    key value
*.pp        ParagraphFile    F    bool
*.reg       Registry         D    name.rom
*.rom       Romanization     F    ascii unicode
*.tf        TokenFile        F    sentno
*.tf        TokenFile        F    nxid n
*.tf        TokenFile        F    form sno lpunc rpunc
*.toc       Toc              D    id.txt*
*.txt       Text             D    _info orig.tf trans.tr?
*.txt       Text             D    _info media.mf xscript.xs? trans.tr?
*.txt       Text             D    _info toc.toc
*.xs        Transcription    D    clips.cl paras.pp transcript.tf
*.tr        Translation      F    trans
*.ul        UserList         D    name.usr*
*.usr       User             D    media.mi props.pd
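Because all files are tab-separated, they can be inspected with a few lines of Python even without loading the corpus. The sketch below reads a _groups file under one reading of the table entry: each line holds a user name, a tab, and then group names separated by single spaces. That reading, the function name parse_groups(), and the sample data are all assumptions for illustration; GroupsFile itself may behave differently.

```python
import io


def parse_groups(stream):
    """Map each user name to the list of groups on that user's line."""
    table = {}
    for line in stream:
        line = line.rstrip('\n')
        if not line:
            continue                       # skip blank lines
        user, _, groups = line.partition('\t')
        table[user] = groups.split(' ') if groups else []
    return table


# Invented sample contents; a real file would be opened from the corpus directory.
sample = io.StringIO('alice\teditors admins\nbob\teditors\ncarol\n')
groups = parse_groups(sample)
print(groups)
```

The same pattern (split each line on tabs, then split a field on spaces where the table shows "(sp)") should carry over to _perm, whose lines have three tab-separated fields instead of two.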

CLDManager