Panlex
******

Panlex2
-------

This is a replacement for the previous version.

Usage
.....

Usage::

   $ python -m seal.script.panlex2 COM ARG*

Some of the commands are actually multi-word commands, in
particular, all the commands beginning with "compile."

 * lang CODE
   — 
   CODE is an ISO 639-3 language code.
   Prints out information about all varieties of the language with the
   given code.  The printout includes the LVID code for each variety
   and the list of dictionaries (SIDs) for each variety.

 * lvid LVID
   — 
   Produces the same output as lang, but limited to a single
   variety.

 * dict SID
   — Prints out metadata for the dictionary whose "source ID" is SID.

 * compile varieties
   — Writes varieties.tab.  Needed for e.g. compile bilex.

 * compile bilex TGT [GLS]
   —
   TGT and GLS must be language variety IDs.
   If GLS is not given it defaults to 187 (English).
   Writes the file bilex-TLVID-GLVID.tab,
   which contains records of form *tgt_str gloss_str sids,*
   where *sids* is a space-separate list of source IDs.

Environment
...........

The following variables must be set in ~/.selkie:

 * data.panlex.zipfn
   — The pathname of the Panlex zip file.  It may begin with "~".

 * data.panlex.dirname
   — The toplevel directory in the zip file, e.g. "panlex-20190901-csv".

 * data.panlex.tgtdir
   — The directory in which to install compiled dictionaries, etc.
   It may begin with "~".


Overview
--------

Panlex is a relational database representing lexical information for
the world's languages.  The information is drawn typically from
bilingual dictionaries.  Accordingly, a dictionary is viewed as consisting of
lexical entries ("meanings"), each of which is the pairing of
an expression in the target language with an expression in the
glossing language, such as::

   boojoo[oji] hello[eng]

Generalizing, multiple target languages and
multiple glossing languages are allowed.  An example is a multilingual
dictionary of several related languages, glossed in both English and
French.  Viewed this way, there is actually little need to distinguish
between target language and glossing language: a lexical entry
is simply a set of synonymous expressions in multiple languages.

Panlex includes some additional lexical information, such as parts
of speech, properties, definitions, and semantic fields.  Definitions and semantic
fields are associated with lexical entries, but parts of speech and
properties are permitted to differ between a word and its gloss.  We should revise
the previous example to::

   boojoo[oji]/int hello[eng]/int

This lexical entry consists of two fields: boojoo[oj]/int
and hello[eng]/int.  A field is intrinsic to a lexical
entry.  Even if an apparently identical field occurs in a different
lexical entry, Panlex treats it as a distinct object.

Hence, the main data types are as follows.

 * An **expression** is a piece of text that is explicitly labeled with
   the language it is written in, like "boojoo[oji]."
   An expression is represented in the
   database by an **expression ID (exid).**
   The ex table associates an exid with a string and language variety.

 * A **field,** which Panlex calls a "denotation," contains
   an expression, has a part of
   speech ("word class"), and may have properties.  A field is represented by
   a **field ID (fid).**  The expression and lexical entry for
   a given fid are specified in the dn table.  The part of
   speech is given in the wc table.  The list of properties
   is given in the table md.

 * A **lexical entry,** which Panlex calls a "meaning," is represented by
   a **lexical-entry ID (lxid).**  I use the term *lexical entry*
   rather than *meaning*, because the object in question is dictionary-specific.
   No attempt is made to identify
   sameness of meaning across dictionaries.
   The association between lxid and dictionary is given in
   the mn table.
   A lxid may also be associated with a definition, in the df
   table, or with a semantic domain, in the dm table.

 * A **dictionary,** which Panlex calls a "source" or "approver,"
   consists of a list of lexical entries, plus metadata.
   A dictionary is represented by a **dictionary ID (did).**
   The association between did and lxids is given in the table mn,
   and dictionary metadata is given in the table ap.

 * A **language variety** may be documented in multiple
   dictionaries, and a dictionary may document multiple language varieties.
   A language variety is represented by a **language variety ID (lvid).**
   The Panlex code for a language variety is of form abc-123,
   consisting of a three-letter **iso code** for the language and a
   three-digit **variety code.**  The association between lvids
   and dids is given in the av table.  The iso code and
   variety code are given in the lv table.

Data tables
-----------

Data types
..........

The data-type specifications used in the data tables are as follows.
The most important are:

 * *exid* - Expression

 * *fid* - Field

 * *lxid* - Lexical entry

 * *did* - Dictionary

 * *lvid* - Language variety

Supporting data types are as follows.

 * *bool* - t or f.

 * *num* - A number.

 * *str* - A string.

 * *char* - A Unicode code point.

 * *date* - A date.

 * *url* - A URL.

 * *iso* - A 3-letter ISO language code.

 * *vc* - A 3-digit Panlex variety code.

 * *lic2* - A 2-letter license code.

 * *fm* - A file format (?)

Expressions
...........

Expressions are used not only for words in dictionaries but also for
parts of speech and dictionary names.
An expression is a word in a particular language variety.  It pairs a
string with a language-variety ID.

``ex``

 * ex (*exid*) — The expression.

 * lv (*lvid*) — Its language variety.

 * tt (*str*) — Its string.

 * td (*str*) — A "degraded text"
   version of the string.  Contains only lowercase
   letters and digits.

Fields
......


A field belongs to a particular lexical entry, and its contents is an
expression.

``dn``

 * dn (*fid*) — The field.

 * mn (*lxid*) — The lexical entry it
   belongs to.

 * ex (*exid*) — The contents.

A part of speech may be assigned to a field.

``wc``

 * wc (*num*) — An ID for the assignment?

 * dn (*fid*) — The field.

 * ex (*exid*) — The part of speech.

The wcex table is a convenience listing of the expressions
that are used as parts of speech.

``wcex``

 * ex (*exid*) — The part-of-speech expression.

 * tt (*str*) — The part-of-speech string.

A field may have properties (key-value pairs).  These are used
for declension classes, valency, etc.

``md``

 * md (*num*) — An ID for the assignment?

 * dn (*fid*) — The field.

 * vb (*str*) — The key.

 * vl (*str*) — The value.

Lexical entries
...............

A dictionary is a list of lexical entries.  Panlex calls them "meanings."

``mn``

 * mn (*lxid*) — The lexical entry.

 * ap (*did*) — The dictionary it belongs to.
   The table is sorted by this column.

The df table appears to represent definitions or explanations.
Not all dictionaries have them.

``df``

 * df (*num*) — The definition ID (?)

 * mn (*lxid*) — The lexical entry.

 * lv (*lvid*) — The language variety of the definition text.

 * tt (*str*) — The definition text.

The dm table appears to represent the semantic domain of an
entry.  Not all dictionaries include it.

``dm``

 * dm (*num*) — The semantic domain (?)

 * mn (*lxid*) — The lexical entry.

 * ex (*exid*) — An expression naming
   the semantic domain

An additional table, mi, also provides information about
lexical entries.  I have not been able to determine what it
represents.  The values in the tt
field are usually IDs of some sort, but occasionally English words.

``mi``

 * mn (*lxid*) — The lexical entry.

 * tt (?) — ?

Dictionaries
............

A dictionary contains a list of lexical entries (see above).
Metadata information is contained in the table ap.

``ap``

 * ap (*did*) — The dictionary ID.

 * dt (*date*) — Registration date.

 * tt (*str*) — A short identifier, e.g. eng-ciw:Weshki.

 * ur (*url*) — The URL.

 * bn (*str*) — ISBN, perhaps?

 * au (*str*) — Author.

 * ti (*str*) — Title.

 * pb (*str*) — Publisher.

 * yr (*str*) — Year of publication.

 * uq (*num*) — Quality?

 * ui (*did*) — Appears to be the same as ap.

 * ul (*str*) — Some kind of summary line.

 * li (*lic2*) — An IP license code.

 * ip (*str*) — An IP license statement.

 * co (*str*) — Company?

 * ad (*str*) — Email address

A dictionary documents one or more language varieties.

``av``

 * ap (*did*) — The dictionary.

 * lv (*lvid*) — A variety that it documents.

The apli table appears to map 2-letter license codes to
3-letter codes.  I don't know what the codes mean.

``apli``

 * id (*num*) — ID for the assignment (?)

 * li (*lic2*) — 2-letter code

 * pl (*?*) — 3-letter code

The table af appears to indicate the file format of the original
source for the dictionary.

``af``

 * ap (*did*) — The dictionary.

 * fm (*fm*) — The format.  Example values are html,
   html-curl, pdf-lock/encrypt, txt, txt-wb,
   xml, pdf-img, and db.

The fm table appears to contain information about "fm" codes.

``fm``

 * fm (*fm*) — Format ID?

 * tt (*str*) — Dictionary name??

 * md (*str*) — ?

The table aped appears to contain Panlex processing information
for dictionaries.

``aped``

 * ap (*did*) — The dictionary.

 * q  (*bool*) — ?

 * cx (*num*) — ?

 * im (*bool*) — ?

 * re (*bool*) — ?

 * ed (?) — ?

 * fp (?) — A code that seems to indicate the documented
   varieties and a one-word abbreviation of the title.  E.g., eng-ciw-Weshki.

 * etc (*str*) — Appears to be comments about what work
   needs to be done yet.

Language varieties
..................

Languages are identified by 3-digit ISO codes.  A language variety is
a specialization.  The varieties of a given language are numbered from
0: eng0, eng1, etc.  There is also a numeric ID for each
language variety.  For example, variety 187 is eng0.
<table class="display">

 * lv (*lvid*) — The language variety.

 * lc (*iso*) — Its ISO language code.

 * vc (*vc*) — Language-variety sequence number.  The varieties of a
   particular ISO-coded language are numbered sequentially from 0.

 * sy (*bool*) — ?

 * am (*bool*) — ?

 * ex (*exid*) — The name of the variety.  Names are usually given in
   the variety (e.g., the name for German is given as "Deutsch."
   But sometimes names are given in English.

Additional information about language varieties is given in tables
cp and cu.  I don't know what these tables contain,
possibly punctuation characters in the language.

``cp``

 * lv (*lvid*) — A language variety.

 * c0 (*char*) — A code point.

 * c1 (*char*) — A code point.

``cu``

 * lv (*lvid*) — A language variety.

 * c0 (*char*) — A code point.

 * c1 (*char*) — A code point.

 * loc (?) — ?

 * vb (?) — Values include pun, priv, aux,
   cit:fin:pri, cit:kom:pri.

Panlex executable
-----------------

Zip
...

One can examine the contents of the original zip file using the
zip command.  There are four subcommands:

 * list — List the filenames.

 * head *f* — Print the first 50 records of file *f*.

 * cat *f* — Print all the records of file *f*.

 * table *f* — The table is like the contents, except that, if
   there is a field labeled ex, two new columns are added: ex.tt
   and ex.lv.  The former contains the string contents of the
   expression and the latter is the language-variety code for the
   expression.  One may optionally provide an attribute *a* and value *v* to
   restrict the listing to records that have value *v* for attribute *a*.
   Nota bene: this command is generally *much* slower than cat.

Variety
.......

A language is a set of varieties::

   $ panlex variety deu
   lv | lc | vc | sy | am | ex | ex.tt | ex.lv
   157 | deu | 0 | t | t | 274 | Deutsch | 157
   1349 | deu | 1 | t | t | 18586881 | Masematte | 1349
   1845 | deu | 2 | t | t | 18586883 | Hessisch | 1845
   9097 | deu | 3 | t | t | 12660638 | doitS | 9097

These are all the language varieties corresponding to ISO code
"deu."  Language variety 157 is deu0, variety 1349 is deu1, and so
on.  I don't know what "sy" and "am" are.  The name of the variety
is given in the variety itself.  Specifically, an expression (ex) is
the pairing of a string (ex.tt) with an indiciation of which variety it is
written in (ex.lv).

To give another example, Ojibwe (oji) is a macrolanguage comprising
Severn Ojibwa (ojs), Eastern Ojibwa (ojg), Central Ojibwa (ojc),
Northwestern Ojibwa (ojb), Western Ojibwa (ojw), Chippewa (ciw),
Ottawa (otw), and Algonquin (alq)::

   $ panlex variety oji ojs ojg ojc ojb ojw ciw otw alq
   lv | lc | vc | sy | am | ex | ex.tt | ex.lv
   30 | ojb | 0 | t | t | 18592962 | Anishinaabemowin | 30
   536 | ciw | 0 | t | t | 18586345 | Anishinaabemowin | 536
   934 | otw | 0 | t | t | 18593131 | Daawaamwin | 934
   4069 | ojw | 0 | t | t | 18592975 | Nakaw?mowin | 4069
   5598 | ojs | 1 | t | t | 7505858 | ????? | 5598
   6930 | ojg | 0 | t | t | 18592966 | Nishnaabemwin | 6930
   6931 | ojc | 0 | t | t | 18592964 | Ojibwe | 6931
   6932 | ojs | 0 | t | t | 18592970 | Anishininiimowin | 6932
   6933 | ciw | 1 | t | t | 8150 | Central Minnesota Chippewa | 187
   7415 | ciw | 2 | t | t | 17070963 | Minnesota Ojibwe | 187
   9170 | alq | 1 | t | t | 241072 | ???????? | 9170
   19 | alq | 0 | t | t | 45808 | anicin?bemowin | 19

The question marks represent Unicode characters that Latex does not handle.
The information here does not appear to be entirely correct.  Panlex
labels a wordlist that Margaret and Howard produced as documenting
variety 536 (ciw0), which is Chippewa.  I would have thought that they
speak Eastern Ojibwa.

Dicts
.....

For each variety, there is a set of
dictionaries::

   $ panlex dicts 30 536 934 4069 5598 6930 6931 6932 6933 7415 9170 19
   128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
   153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
   611 | Astronomia Terminaro | 2474 | mul-Rapley
   2409 | Swadesh Lists | 207 | art-mul-SL
   2815 | Anishinaabemowinâ€“English | 131 | ciw-eng-Noori
   2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
   4091 | Lexique de la langue algonquine | 0 | alq-fra-Cuoq
   3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
   3779 | Ojibwe-English Wordlist | 0 | ciw-eng-Weshki
   4095 | Travels through the Canadas: Vocabulary of the Algonquin Tongue | 0 | alq-eng-Heriot
   4144 | The Ojibwe Peopleâ€™s Dictionary | 0 | eng-ciw-OPD

A dictionary may document more than one variety.

Dict
....

To see information about a dictionary::

   $ panlex dict 128
   ap | lv
   128 | 187
   128 | 536
   
   id 128
   dt 2007-12-11
   tt eng-ciw:Weshki
   ur http://www.freelang.net/dictionary/ojibwe.php
   bn
   au Weshki-ayaad; Charles Lippert; Guy T. Gambill
   ti Freelang Ojibwe-English dictionary
   pb Freelang
   yr 2010
   uq 5
   ui 128
   ul TG 122; FreeLang.English_Ojibwe.wb
   li co
   ip Every author exercises rights with respect to the part of a list that represents that personâ€™s own contribution.
   co Guy T. Gambill
   ad gambillgt1@yahoo.com

The first lines indicate which varieties the dictionary documents.  In
this case, they are 187 (English, eng0) and 536 (Chippewa, ciw0).

Bidicts
.......

To find out which dictionaries document a particular pair of
varieties::

   $ panlex bidicts 187 536
   128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
   153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
   611 | Astronomia Terminaro | 2474 | mul-Rapley
   2409 | Swadesh Lists | 207 | art-mul-SL
   2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
   3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
   4144 | The Ojibwe People's Dictionary | 0 | eng-ciw-OPD

The columns are: dictionary ID (ap.ap) title (ap.ti),
number of entries (count where mn.ap==ap), and short code (aped.fp).

Bidict
......

To extract a bidict::

   $ panlex bidict 128 536 187 | uniq > tmp.out

The result is ASCII sorted (case sensitive), in two-column format,
with a single tab character as column separator.  Let us think of the
first column as the target language and the second column as the
glossing language.  If a target-language word has multiple glosses,
they produce multiple lines in the file, all sharing the same
target-language word.  (Since the file is sorted, they form a
contiguous block.)  For example, the following occurs in the middle of
tmp.out::

   aabizh  cut seams open on
   aabizhiishin    perk up
   aabiziishin     come to
   aabiziishin     revive

For some reason, the dictionaries sometimes contain duplicate
entries - hence the "uniq" in the command line above.

Panlex module
-------------

Zip files
.........

Usage::

   f = open_zipfile()

The Panlex zip file is ~/src/cl/panlex-20140501-csv.zip.

Things you can do with a zip file::

   f.namelist()      # list of filenames
   f.printdir()      # print long listing
   s = f.read(name)  # one of the names from namelist

The entire file is read as a single string.

The list of Panlex files::

   >>> from panlex import open_zipfile
   >>> f = open_zipfile()
   >>> for nm in f.namelist():
   ...     print nm
   ...
   panlex-20140501-csv/
   panlex-20140501-csv/af.csv
   panlex-20140501-csv/mi.csv
   panlex-20140501-csv/aped.csv
   panlex-20140501-csv/df.csv
   panlex-20140501-csv/wc.csv
   panlex-20140501-csv/av.csv
   panlex-20140501-csv/lv.csv
   panlex-20140501-csv/fm.csv
   panlex-20140501-csv/ex.csv
   panlex-20140501-csv/dm.csv
   panlex-20140501-csv/cp.csv
   panlex-20140501-csv/md.csv
   panlex-20140501-csv/dn.csv
   panlex-20140501-csv/cu.csv
   panlex-20140501-csv/ap.csv
   panlex-20140501-csv/wcex.csv
   panlex-20140501-csv/mn.csv
   panlex-20140501-csv/apli.csv

Reading a file
..............

**Raw contents.**::

   s = raw_contents(fn)

The fn omits the directory name and the .csv suffix.  That
is, legitimate values are "af," "mi," etc.

**Reader**.::

   r = reader(fn)

Uses csv.reader to parse the csv format.
The return value is an iterator over records, each record being a list
of fields.  The first record contains the field names::

   >>> from panlex import reader
   >>> r = reader('af')
   >>> r.next()
   ['ap', 'fm']
   >>> r.next()
   ['1636', '24']

**Open file**.::

   (hdr, recs) = open_file(fn)

The header is the list of field names, and recs is an iterator
over the content records.

**Print headers.**
Prints the database schema: the names and headers of all the files::

   >>> from panlex import print_headers
   >>> print_headers()
   af: ap fm
   mi: mn tt
   aped: ap q cx im re ed fp etc
   df: df mn lv tt
   wc: wc dn ex
   av: ap lv
   lv: lv lc vc sy am ex
   fm: fm tt md
   ex: ex lv tt td
   dm: dm mn ex
   cp: lv c0 c1
   md: md dn vb vl
   dn: dn mn ex
   cu: lv c0 c1 loc vb
   ap: ap dt tt ur bn au ti pb yr uq ui ul li ip co ad
   wcex: ex tt
   mn: mn ap
   apli: id li pl

**Head and cat.**
The function head() prints the first *n* records.  The function
cat() dumps the contents readably.  cat(fn,'html')
produces HTML output.

Database tables
...............

**Where**.
Select records containing specified values in a specified field.
The return value is an iterator over records::

   >>> from panlex import where
   >>> for r in where('lv', 'lc', 'deu'):
   ...     print '|'.join(r)
   ...
   157|deu|0|t|t|274
   1349|deu|1|t|t|18586881
   1845|deu|2|t|t|18586883
   9097|deu|3|t|t|12660638

**Expand expressions.**::

   r = expand_expressions(recs, hdr)

Returns an iterator over records.  Two new columns are added: the
first contains the expression's string, and the second contains the
expression's variety.

Extracting dictionaries
.......................

**Dict entries.**
The function dict_entry_ids() returns an iterator over the entry IDs
(*lxids*) for a given dictionary or dictionaries::

   >>> from panlex import dict_entries
   >>> len(list(dict_entry_ids('128')))
   13741

The function dict_entry_table() returns a table whose keys are
meaning IDs, and whose values are list of pairs of form (*lvid, w*)
where $w$ is a word string::

   >>> from panlex import dict_entries
   >>> ents = dict_entry_table('128')
   >>> len(ents)
   13741
   >>> mns = list(ents)
   >>> mns[0]
   '2525999'
   >>> ents[mns[0]]
   [('187', 'consider'), ('536', 'naagadawaabam')]
   >>> ents[mns[1]]
   [('187', 'knock against'), ('536', 'bitaakoshkan')]

**Bilex pairs.**
The function bilex_pairs() returns an alphabetically sorted
list of word pairs representing the entries of the given dictionary::

   >>> from panlex import bilex_pairs
   >>> pairs = bilex_pairs('128','536','187')
   >>> pairs[0]
   ['Aabamadong', 'Fort Hope']
   >>> len(pairs)
   13739

Note that the pair of language IDs is not predictable from the
dictionary.  The dictionary may contain more than two languages, and
even if it only contains two, the dictionary does not specify their
order.

The database
------------

Zip file
........

The database dump is contained in a zip file.  The class ZipFile
is used to access it::

   >>> from seal.data.panlex import ZipFile
   >>> zf = ZipFile()

Methods are provided for listing the contents of the zip file::

   >>> zf.ls()
   File Name                                             Modified             Size
   panlex-20140501-csv/                           2014-05-01 03:02:18            0
   panlex-20140501-csv/af.csv                     2014-05-01 03:00:04        38522
   panlex-20140501-csv/mi.csv                     2014-05-01 03:02:00     33214449
   ...
   >>> list(zf.filenames())
   ['af', 'mi', 'aped', 'df', 'wc', 'av', 'lv', 'fm', 'ex', ..., 'apli']

The method print_headers() prints out, for each table, its name and field names.
It takes a minute or two to run::

   >>> zf.print_headers()
   af: ap fm
   mi: mn tt
   aped: ap q cx im re ed fp etc
   ...

To print the contents of the tables, the methods head and cat
are provided::

   >>> zf.head('wcex', 3)
   ex | tt
   3846607 | noun
   3846608 | verb
   >>> zf.cat('wcex')
   ex | tt
   3846607 | noun
   3846608 | verb
   3846609 | adjv
   ...

The method table returns a Table object containing the
contents of the table.  If the table contains an ex field,
two new fields named ex.tt and ex.lv are added to each
record.  This method can be slow to run.

Tables
......

A Table is a collection of records.  It
has the following members and methods.

 * header — A list of strings.

 * records — A list of records, each record being a list of strings.

 * where(*f*,*v*) — Returns a new Table containing the subset
   of records in which field *f* has value *v*.

 * dump() — Prints out the table.

 * grep(*f*,*v*) — Prints out the subtable for which field *f* has
   value *v*.

Parser
......

A Parser instance digests the information in the tables.

Compiler
........

The value of compile is a Compiler instance.  It is used
to create digested files.  If called with no arguments, it creates the
files 

Utility functions
.................

The function attribute_entries() iterates over the records for
a given subject type or a given subject-relation pair.  For example::

   >>> i = attribute_entries('expression', 'label')
   >>> i.next()
   (('expression', 'label', 'string'), '3990756' u'!')

The entries are of form *(t, v_1, v_2),* where *t* is of form
*(t_1, r, t_2)*.

**Collect variety languages.**
The function collect_variety_languages() iterates over the
variety-language records, and constructs a table indexed by variety ID
(an int), whose value is the variety's language.  E.g.::

   >>> vlangs = collect_variety_languages()
   >>> vlangs[187]
   'eng'

**Collect approvers.**
The function collect_approvers() returns a table indexed by
approver ID, in which the values are lists of form [lang, variety,
quality, title].

**Extracting bilexicons.**
A bilexicon is represented in Python by the class Bilex::

   >>> b = Bilex('spa','eng')

**Create raw.**
The first step is to create the raw bilexicon::

   >>> b.create_raw()

This takes about 25 minutes to run.  The output (in this example) is
the file spa-eng-raw.txt in the directory /cl/data/panlex/lex.

The create_raw() method starts by loading the variety-language table, which maps varieties
to their languages.

Then it goes through the expression-variety records, creating a table
of expressions.  The keys are expressions (ints) and the values are
lists of form [variety, label, degraded text].  An entry is created
only for expressions whose variety's language is one of the two
languages of interest.  Label and degraded
text are initially set to the empty string.

Next it goes through the expression-label and expression-degraded-text
records, filling in the other fields of the expression entries.

Next it creates a denotations table.  It
goes through the denotation-expression records.  If the expression has
an entry in the expressions table, then a new entry is created in the
denotations table.  The key is the denotation (an int), and the value
is a list of form [expression, part of speech, meaning].  Initially
only the expression is set.  Part of speech is initialized to the
empty string and meaning is initialized to 0.

Next it goes through the denotation-pos records and the
denotation-meaning records, filling in the remaining fields in the
denotation entries.

By that point, memory is pretty much full.  Output is written to
*lang1-*lang2*-raw.txt*.
We pass through the denotations table.  Each denotation entry contains
an expression ID, we use it to fetch the expression entry.  The
expression entry contains a variety ID; we use it to look up the
language.  Each denotation generates one line of output, of form::

   m lang v expr degraded pos d e

The single letters represent integer IDs: meaning (m), variety (v),
denotation (d), expression (e).  The denotation and expression IDs are
included only for debugging purposes.

**Sort raw.**
The method sort_raw() calls Unix sort to sort the raw
file by meaning, language, variety, and label.  The output is written
to *lang1-*lang2*-m1.txt*.  It takes a couple minutes
to run.

**Create m2.**
The method create_m2() adds approvers, and also filters out
monolingual meanings.  (I tried adding approvers when creating the raw
file, but Python runs out of memory)::

   >>> b.create_m2()

The method scans through the m1.txt file, collecting a table of
meanings.  For each block of meanings, note is kept of whether both
languages are seen.  If so, an entry is created in the meanings table,
and otherwise no entry is created.  The meanings table is indexed by
meaning ID, and the value is the approver ID (initialized to 0).

After creating the meanings table, the method passes through the
meaning-approver records and sets the values (approvers) for the
meanings.

Next it calls collect_approvers() to get the quality
information for each approver.

Finally, it passes a second time through the m1.txt file.  Each
time it encounters a new meaning, it looks in the meanings table to
see whether it should be kept or not.  If the meaning is a keeper, the
quality of the approver is looked up in the approvers table.  Each
line from m1.txt that is to be kept is copied to m2.txt,
and two new fields are added at the end: approver ID and quality.
Hence the lines in m2.txt are of form::

   m lang v expr degraded pos d e a q

where "a" is approver and "q" is quality (both are ints).

**Create sources.**
The method create_sources() extracts detailed information about
each of the approvers.  It writes the file *lang1*-*lang2*-sources.txt.
The line format is::

   a rel value

where "a" is the approver ID.  The relations (attributes) are:
lang, variety, regdate, label, creator,
isbn, lic_id, license, year, publ,
title, and url.  An empty line is inserted before each
block of records sharing a common value for "a."

**By word.**
The method by_word() creates a file containing lines of form::

   word-lang1 quality word-lang2

The method sort_by_word() then sorts that file.

It turns out that the quality scores for the approvers are not very
informative about whether the entries are actually good.  For example,
the top quality source (quality 7) for the Spanish word "a" includes
meanings "crazy," "missionary," and "physical" - completely
bogus.  A much better gauge appears to be the number of sources in
which the translation occurs.