Wiktionary — `selkie.data.wiktionary`

Using the command line:

$ python -m selkie.data.wiktionary xlangs DUMPFILE LANGFILE
$ python -m selkie.data.wiktionary xdicts DUMPFILE TGTLANGFILE OUTDIR

Using the API:

>>> dump_fn = '~/scratch/ling780/enwiktionary-20230201-pages-articles.xml.bz2'
>>> from selkie.data.wiktionary import WiktDump
>>> wikt = WiktDump(dump_fn)
>>> arts = wikt.articles()
>>> art = next(arts)
>>> while ':' in art.title:
...     art = next(arts)

An example of a raw article:

>>> art.orig
{'title': 'dictionary',
 'ns': '0',
 ...
 'text': '{{also|Dictionary}}\n==English==\n{{was wotd|2022|December|12}}...'}

Parsed version:

>>> level_1_section = art.parsed()
>>> [k for (k,v) in level_1_section]
['__pre__', 'English']
>>> level_2_section = level_1_section[1][1]
>>> [k for (k,v) in level_2_section]
['__pre__',
 'Alternative forms',
 'Etymology',
 'Pronunciation',
 'Noun',
 'Verb',
 'Further reading',
 'Anagrams']

API

class WiktDump

Represents a Wiktionary dump file downloaded from https://dumps.wikimedia.org/enwiktionary.

__init__(dump_fn): Sets the members dump_fn, prefix, and parse. The dump_fn may start with ‘~’. It should end with ‘-pages-articles.xml.bz2’. The prefix is everything before that. The parse member is an instance of Parser.
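The filename handling described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual constructor: the helper name split_dump_filename and the constant SUFFIX are hypothetical, and raising ValueError on an unexpected suffix is a choice made here, not documented behavior.

```python
import os

# The suffix that dump filenames are expected to end with.
SUFFIX = '-pages-articles.xml.bz2'

def split_dump_filename(dump_fn):
    """Expand a leading '~' and return (expanded_path, prefix).

    Hypothetical helper: the prefix is everything before SUFFIX.
    """
    path = os.path.expanduser(dump_fn)
    if not path.endswith(SUFFIX):
        raise ValueError(f'expected a filename ending in {SUFFIX!r}: {dump_fn!r}')
    return path, path[:-len(SUFFIX)]
```

For a name such as 'enwiktionary-20230201-pages-articles.xml.bz2', the prefix would be 'enwiktionary-20230201'.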

raw_articles()

Opens the bz2 file. The first line is expected to be ‘<mediawiki …>’, and is discarded. Next there should be a ‘<siteinfo>’ element, which is also discarded.

After that, the function selkie.pyx.xml.lines_to_items is used to convert the input to XML parsed into dictionary format. (See lines_to_items.)

The elements in the iteration are expected to have tag ‘page’; otherwise an error is signalled. The return value is an iteration over elements, each element represented as a dict.
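The header-skipping step can be approximated with the standard library alone. The sketch below is an assumption about how such a step might look, not the code in raw_articles(): the function name lines_after_header is hypothetical, it assumes the <siteinfo> tags each sit on their own line, and it stops short of the XML-to-dict conversion that lines_to_items performs.

```python
import bz2

def lines_after_header(dump_path):
    """Yield the lines that follow the <mediawiki ...> line and <siteinfo> element.

    Hypothetical helper: assumes <siteinfo> and </siteinfo> each occupy a line.
    """
    with bz2.open(dump_path, 'rt', encoding='utf-8') as f:
        # The first line should open the <mediawiki> element; discard it.
        first = next(f, '')
        if not first.lstrip().startswith('<mediawiki'):
            raise ValueError('expected <mediawiki ...> on the first line')
        # The next line should open <siteinfo>; discard up to its closing tag.
        second = next(f, '')
        if second.strip() != '<siteinfo>':
            raise ValueError('expected <siteinfo> after <mediawiki ...>')
        for line in f:
            if line.strip() == '</siteinfo>':
                break
        # Everything that remains is the sequence of <page> elements.
        for line in f:
            yield line.rstrip('\n')
```

Because this is a generator, the compressed file is read incrementally, which matters for a dump that is far too large to hold in memory.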

articles(): Calls raw_articles() and calls WiktArticle(art) on each raw article. Returns an iteration over WiktArticles.

find(title): Iterates over the articles() and returns the first one whose title is title.
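A linear search of this kind might look like the sketch below. The Article tuple is a hypothetical stand-in for WiktArticle, and returning None when nothing matches is an assumption; the behavior of find() for a missing title is not stated here.

```python
from collections import namedtuple

# Hypothetical stand-in for WiktArticle; only the title attribute is used.
Article = namedtuple('Article', ['title'])

def find_first(articles, title):
    """Return the first article with the given title, or None (an assumption)."""
    for art in articles:
        if art.title == title:
            return art
    return None
```

Since the search stops at the first match, the cost depends on how far into the dump the article sits; a title near the end requires reading almost the whole file.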

extract_language_names(tgtfn): Writes the file tgtfn. Iterates over articles(), skipping any whose title contains a colon. In the remaining articles, every level-2 heading is a language name. These names are collected and written to tgtfn, one per line, with duplicates eliminated.
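To illustrate the idea, the sketch below collects level-2 headings directly from raw article text with a regular expression. This is a simplification and an assumption: the actual method may well work from the parsed articles instead, and the names LEVEL2 and language_names are hypothetical.

```python
import re

# Matches a line of the form ==Name== but not ===Name=== or deeper.
LEVEL2 = re.compile(r'^==[ \t]*([^=\n]+?)[ \t]*==[ \t]*$', re.MULTILINE)

def language_names(texts):
    """Return the distinct level-2 headings in texts, in first-seen order."""
    seen = set()
    names = []
    for text in texts:
        for name in LEVEL2.findall(text):
            if name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

Writing the result one name per line is then a matter of joining the list with newlines.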

extract_dicts(tgtlangs_fn, tgtdir): Reads the names of the target languages from the file tgtlangs_fn. Creates tgtdir and writes one file per language in that directory, using the language name as the filename. Each Wiktionary page is split into entries, one for each level-2 heading, and each Entry is pickled to the file for its language.
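Pickling many objects to one file, and reading them back, can be sketched as below. The assumptions are significant: entries are represented as plain (word, lang, items) tuples rather than Entry objects, the function names write_entries and read_entries are hypothetical, and a language file is created only once an entry for it is seen, which may differ from the real method.

```python
import os
import pickle

def write_entries(entries, tgtdir, tgtlangs):
    """Append each (word, lang, items) tuple to tgtdir/<lang> if lang is a target."""
    os.makedirs(tgtdir, exist_ok=True)
    files = {}
    try:
        for (word, lang, items) in entries:
            if lang not in tgtlangs:
                continue
            if lang not in files:
                files[lang] = open(os.path.join(tgtdir, lang), 'wb')
            # Successive dumps to one file form a stream of pickles.
            pickle.dump((word, lang, items), files[lang])
    finally:
        for f in files.values():
            f.close()

def read_entries(fn):
    """Yield every pickled object in fn, stopping at end of file."""
    with open(fn, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

A file written this way holds a sequence of independent pickles, so it has to be read back with repeated pickle.load calls rather than a single one.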

class WiktArticle

wikt: The WiktDump object.

orig: The original (raw) article, a dict.

title: The title. The value of orig[‘title’], or if that does not exist, the empty string.

markdown: The contents. The value of orig[‘revision’][‘text’], or if that does not exist, the empty string.

parsed(): The parsed version is computed the first time that parsed() is called. The WiktDump parser is called on the markdown. See Parser. The output is a ParsedArticle.

class ParsedArticle

orig: The original WiktArticle.

items: A list of pairs; the output of parsing. Values are either item lists (recursively) or strings or Markdown.

entries(): The toplevel items have level-2 headings as keys, which are language names. This wraps each value as an Entry and yields it. (But if the title contains a colon, the empty iteration is returned.)

class Entry

word: The lemma.

lang: The language name.

items: The contents.

class Parser

__call__(md)

The input is markdown. It is split into lines, and then recursively split wherever there are headers. The result is in recursive item format. An element is a list of items, and an item is a (key, value) pair, where the key is either a header or ‘__pre__’ (for material preceding the first header), and the value is either a Markdown object or an element (recursively).

The initial parse produces a list of items whose key is either ‘__pre__’ or a level-1 header. In the return from __call__(), that is reduced to an iteration over items whose key is either ‘__pre__’ or ‘__H1__’ (with the level-1 header as value) or a level-2 header. In principle, there might also be items whose key is ‘__md__’ (for stray markdown), though only if an article is ill-formed.

The final outcome is an iteration over level-2 items, which are pairs (level-2-header, level-2-section). A level-2-section, in turn, is a list of pairs (level-3-header, level-3-section), and so on.
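The recursive item format can be illustrated with a simplified splitter. This sketch is not the Parser class: it starts at level 2 rather than level 1, omits the '__H1__' and '__md__' cases, always emits a '__pre__' item, and stores plain strings where the real parser uses Markdown objects. The names HEADER, header_level, and split_sections are hypothetical.

```python
import re

# A header line is a title wrapped in equal numbers of '=' characters.
HEADER = re.compile(r'^(=+)\s*(.*?)\s*(=+)\s*$')

def header_level(line):
    """Return (level, title) if line is a header, else None."""
    m = HEADER.match(line)
    if m and len(m.group(1)) == len(m.group(3)):
        return len(m.group(1)), m.group(2)
    return None

def split_sections(lines, level=2):
    """Split lines at headers of the given level; recurse for deeper headers."""
    groups = []
    current = ['__pre__', []]   # material preceding the first header
    for line in lines:
        h = header_level(line)
        if h is not None and h[0] == level:
            groups.append(current)
            current = [h[1], []]
        else:
            current[1].append(line)
    groups.append(current)

    items = []
    for key, body in groups:
        # Recurse only if the body still contains a deeper header.
        deeper = any((h := header_level(l)) is not None and h[0] > level
                     for l in body)
        value = split_sections(body, level + 1) if deeper else '\n'.join(body)
        items.append((key, value))
    return items
```

Applied to the lines of a small article with ==English== and ==French== sections, the top-level keys would be '__pre__', 'English', and 'French', and the value under 'English' would itself be a list of items keyed by its level-3 headers.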
