Introduction

Overview

Panlex is a relational database representing lexical information for the world’s languages. The information is drawn typically from bilingual dictionaries.

Consider an illustrative (fake) entry:

moo

A kind of flatbread. eng:flatbread

Money. eng:money

To crush. eng:pancake

In panlex, each numbered subentry is a lexeme, and the dictionary is simply a collection of lexemes:

xyz:moo(n) eng:flatbread(n) - A kind of flatbread.

xyz:moo(n) eng:money(n) - Money.

xyz:moo(v) eng:pancake(v) - To crush.

A lexeme expressed using a particular word, such as “[3]xyz:moo(v)”, is a word sense.

A lexeme in Panlex typically contains two word senses, one in the target language and one in the glossing language. However, a lexeme in a multi-lingual dictionary contains one word sense for each language of the dictionary.

Definitions and semantic fields are associated with lexemes, but parts of speech and properties are intrinsic to word senses and may differ between a target word and its gloss.

In more detail, the main data types are as follows.

A dictionary (which Panlex calls a “source” or “approver”) consists of a list of lexemes, plus metadata. A dictionary is represented by a dictionary ID (DID). Dictionary metadata is given in the table ap.

A lexeme entry (which Panlex calls a “meaning”) is represented by a lexeme ID (LXID). I use the term lexeme rather than meaning because the object in question is dictionary-specific. No attempt is made to identify sameness of meaning across dictionaries. The association between LXID and dictionary is given in the mn table. An LXID may also be associated with a definition, in the df table, or with a semantic domain, in the dm table.

A sense (which Panlex calls a “denotation”) is a word used to express a particular meaning: that is, a word paired with a lexeme. A sense has a part of speech (“word class”), and may have properties. A sense is represented by a sense ID (sid). The word and lexeme for a given SID are specified in the dn table. The part of speech is given in the wc table. The list of properties is given in the table md.

An expression is a piece of text that is explicitly labeled with the language it is written in, like xyz:moo. An expression is represented in the database by an expression ID (EXID). The ex table associates an EXID with a string and a language variety.

A language variety may be documented in multiple dictionaries, and a dictionary may document multiple language varieties. A language variety is represented by a language variety ID (LVID). The Panlex code for a language variety is of form abc-123, consisting of a three-letter ISO code for the language and a three-digit variety code. The association between LVIDs and DIDs is given in the av table. The ISO code and variety code are given in the lv table.

Data types

The data-type specifications used in the data tables are as follows. The most important are:

lvid - Language variety

did - Dictionary

lxid - Lexeme

sid - Sense

exid - Expression

Supporting data types are as follows.

bool - t or f.
num - A number.
str - A string.
char - A Unicode code point.
date - A date.
url - A URL.
iso - A 3-letter ISO language code.
vc - A 3-digit Panlex variety code.
lic2 - A 2-letter license code.
fm - A file format (?)

Language varieties

Languages are identified by 3-digit ISO codes. A language variety is a specialization. The varieties of a given language are numbered from 0: eng0, eng1, etc. There is also a numeric ID for each language variety. For example, variety 187 is eng0.

Table ?
`lv`	lvid	The language variety
`lc`	iso	Its ISO language code
`vc`	vc	Language-variety sequence number (from 0)
`sy`	bool	?
`am`	bool	?
`ex`	exid	The name of the variety

Names are usually given in the variety (e.g., the name for German is given as “Deutsch.” But sometimes names are given in English.

Additional information about language varieties is given in tables cp and cu. I don’t know what these tables contain, possibly punctuation characters in the language.

Table `cp`
`lv`	lvid	A language variety
`c0`	char	A code point
`c1`	char	A code point

Table `cu`
`lv`	lvid	A language variety
`c0`	char	A code point
`c1`	char	A code point
`loc`	?	?
`vb`	?

Values for vb include pun, priv, aux, cit:fin:pri, cit:kom:pri.

Dictionaries

A dictionary contains a list of lexemes (see above). Metadata information is contained in the table ap.

Table `ap`
`ap`	did	The dictionary ID
`dt`	date	Registration date
`tt`	str	A short identifier, e.g. `eng-ciw:Weshki`
`ur`	url	The URL
`bn`	str	ISBN, perhaps?
`au`	str	Author
`ti`	str	Title
`pb`	str	Publisher
`yr`	str	Year of publication
`uq`	num	Quality?
`ui`	did	Appears to be the same as `ap`
`ul`	str	Some kind of summary line
`li`	lic2	An IP license code
`ip`	str	An IP license statement
`co`	str	Company?
`ad`	str	Email address

A dictionary documents one or more language varieties.

Table `av`
`ap`	did	The dictionary
`lv`	lvid	A variety that it documents

The apli table appears to map 2-letter license codes to 3-letter codes. I don’t know what the codes mean.

Table `apli`
`id`	num	ID for the assignment (?)
`li`	lic2	2-letter code
`pl`	?	3-letter code

The table af appears to indicate the file format of the original source for the dictionary.

Table `af`
`ap`	did	The dictionary
`fm`	fm	The format

Example values for format are html, html-curl, pdf-lock/encrypt, txt, txt-wb, xml, pdf-img, and db.

The fm table appears to contain information about “fm” codes.

Table `fm`
`fm`	fm	Format ID?
`tt`	str	Dictionary name??
`md`	str	?

The table aped appears to contain Panlex processing information for dictionaries.

Table `aped`
`ap`	did	The dictionary
`q`	bool	?
`cx`	num	?
`im`	bool	?
`re`	bool	?
`ed`	?	?
`fp`	?	Short name?
`etc`	str	What remains to be done?

The fp codes appear to indicate the documented varieties and a one-word abbreviation of the title. E.g., eng-ciw-Weshki.

Lexemes

A dictionary is a list of lexemes. Panlex calls them “meanings.”

Table `mn`
`mn`	lxid	The lexical entry
`ap`	did	The dictionary it belongs to

The df table appears to represent definitions or explanations. Not all dictionaries have them.

Table `df`
`df`	num	The definition ID (?)
`mn`	lxid	The lexical entry
`lv`	lvid	The language variety of the definition text
`tt`	str	The definition text

The dm table appears to represent the semantic domain of an entry. Not all dictionaries include it.

Table `dm`
`dm`	num	The semantic domain ID (?)
`mn`	lxid	The lexical entry
`ex`	exid	The name of the semantic domain

An additional table, mi, also provides information about lexemes. I have not been able to determine what it represents. The values in the tt field are usually IDs of some sort, but occasionally English words.

Table `mi`
`mn`	lxid	The lexical entry
`tt`	?	?

Senses

A sense combines a lexeme with an a word (expression).

Table `dn`
`dn`	sid	The sense
`mn`	lxid	The lexeme it belongs to
`ex`	exid	The contents

A part of speech may be assigned to a sense.

Table `wc`
`wc`	num	An ID for the assignment?
`dn`	sid	The sense
`ex`	exid	The part of speech

The wcex table is a convenience listing of the expressions that are used as parts of speech.

Table `wcex`
`ex`	exid	The part-of-speech expression
`tt`	str	The part-of-speech string

A sense may have properties (key-value pairs). These are used for declension classes, valency, etc.

Table `md`
`md`	num	An ID for the assignment?
`dn`	sid	The sense
`vb`	str	The key
`vl`	str	The value

Expressions

Expressions are used not only for words in dictionaries but also for parts of speech and dictionary names. An expression is a word in a particular language variety. It pairs a string with a language-variety ID.

Table `ex`
`ex`	exid	The expression
`lv`	lvid	Its language variety
`tt`	str	Its string
`td`	str	A “degraded text” version (lowercase letters + digits)