Wiktionary — selkie.data.wiktionary
Using the command line:
$ python -m selkie.data.wiktionary xlangs DUMPFILE LANGFILE
$ python -m selkie.data.wiktionary xdicts DUMPFILE TGTLANGFILE OUTDIR
Using the API:
>>> dump_fn = '~/scratch/ling780/enwiktionary-20230201-pages-articles.xml.bz2'
>>> from selkie.data.wiktionary import WiktDump
>>> wikt = WiktDump(dump_fn)
>>> arts = wikt.articles()
>>> art = next(arts)
>>> while ':' in art.title:
...     art = next(arts)
An example of a raw article:
>>> art.orig
{'title': 'dictionary',
'ns': '0',
...
'text': '{{also|Dictionary}}\n==English==\n{{was wotd|2022|December|12}}...'}
Parsed version:
>>> level_1_section = art.parsed()
>>> [k for (k,v) in level_1_section]
['__pre__', 'English']
>>> level_2_section = level_1_section[1][1]
>>> [k for (k,v) in level_2_section]
['__pre__',
'Alternative forms',
'Etymology',
'Pronunciation',
'Noun',
'Verb',
'Further reading',
'Anagrams']
API
- class WiktDump
Represents a Wiktionary dump file from https://dumps.wikimedia.org/enwiktionary.
- __init__(dump_fn)
Sets the members dump_fn, prefix, and parse. The dump_fn may start with ‘~’. It should end with ‘-pages-articles.xml.bz2’. The prefix is everything before that. The parse member is an instance of Parser.
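The filename handling described for __init__ can be sketched as follows. This is a minimal illustration, not selkie's own code: the helper name split_dump_fn is hypothetical, and raising ValueError for a name with the wrong ending is an assumption (the text only says the name "should" end that way).

```python
import os

SUFFIX = '-pages-articles.xml.bz2'

def split_dump_fn(dump_fn):
    """Hypothetical helper: expand a leading '~' and return (path, prefix)."""
    path = os.path.expanduser(dump_fn)
    if not path.endswith(SUFFIX):
        # Assumption: a wrong ending is treated as an error.
        raise ValueError(f'expected a name ending in {SUFFIX!r}: {dump_fn!r}')
    # The prefix is everything before the fixed suffix.
    return path, path[:-len(SUFFIX)]

path, prefix = split_dump_fn('/tmp/enwiktionary-20230201-pages-articles.xml.bz2')
```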
- raw_articles()
Opens the bz2 file. The first line is expected to be ‘<mediawiki …>’, and is discarded. Next there should be a ‘<siteinfo>’ element, which is also discarded.
After that, the function selkie.pyx.xml.lines_to_items is used to convert the input to XML parsed into dictionary format. (See lines_to_items.) The elements in the iteration are expected to have tag ‘page’; otherwise an error is signalled. The return value is an iteration over elements, each element represented as a dict.
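The reading steps above can be approximated with the standard library alone. The sketch below is not selkie's implementation: it swaps in xml.etree.ElementTree for lines_to_items, and the names raw_pages and to_dict are hypothetical. It also assumes that each opening and closing tag of interest sits on its own line, as in the tiny sample built here.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def to_dict(elt):
    """Convert an element to nested dicts; leaf elements become strings."""
    children = list(elt)
    if not children:
        return elt.text or ''
    return {child.tag: to_dict(child) for child in children}

def raw_pages(stream):
    """Yield one dict per <page> element from a text-mode dump stream."""
    first = next(stream)
    if not first.lstrip().startswith('<mediawiki'):
        raise ValueError('expected <mediawiki ...> on the first line')
    for line in stream:                  # discard the <siteinfo> element
        if line.strip() == '</siteinfo>':
            break
    buf = []
    for line in stream:
        s = line.strip()
        if not buf:
            if s == '</mediawiki>':
                break
            if not s.startswith('<page'):
                raise ValueError(f'expected <page>, got {s!r}')
        buf.append(line)
        if s == '</page>':               # one complete page collected
            yield to_dict(ET.fromstring(''.join(buf)))
            buf = []

# A made-up miniature dump, compressed in memory for the demonstration.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">\n'
    '  <siteinfo>\n'
    '    <sitename>Wiktionary</sitename>\n'
    '  </siteinfo>\n'
    '  <page>\n'
    '    <title>dictionary</title>\n'
    '    <ns>0</ns>\n'
    '    <revision>\n'
    '      <text>==English==</text>\n'
    '    </revision>\n'
    '  </page>\n'
    '</mediawiki>\n'
)
data = bz2.compress(sample.encode('utf-8'))
with bz2.open(io.BytesIO(data), 'rt', encoding='utf-8') as f:
    pages = list(raw_pages(f))
```

Reading line by line keeps memory use bounded, which matters because the real dump is far too large to parse as a single XML document in memory.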
- articles()
Calls raw_articles() and calls WiktArticle(art) on each raw article. Returns an iteration over WiktArticles.
- find(title)
Iterates over articles() and returns the first one whose title is title.
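A first-match search of this kind can be written with next() over a generator expression. The sketch below uses a hypothetical stand-in class Art instead of WiktArticle, and returning None when nothing matches is an assumption; the text does not say what find() does in that case.

```python
from collections import namedtuple

# Hypothetical stand-in for WiktArticle; only the title member is modeled.
Art = namedtuple('Art', ['title'])

def find(articles, title):
    """Return the first article with the given title, or None (assumed)."""
    return next((art for art in articles if art.title == title), None)

arts = iter([Art('Template:example'), Art('dictionary'), Art('chat')])
hit = find(arts, 'dictionary')
miss = find(iter([Art('chat')]), 'dictionary')
```

Because the search is a linear scan, looking up a title near the end of a full dump means reading most of the file.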
- extract_language_names(tgtfn)
Writes tgtfn. Iterates over articles(), skipping any whose title contains a colon. In the remaining articles, all level-2 headings are language names. Extracts them and writes them to tgtfn, one per line, eliminating duplicates.
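The extraction logic can be sketched on stand-in data. Here each article is modeled as a hypothetical (title, level-2 headings) pair rather than a WiktArticle, and writing names in first-seen order is an assumption; the text specifies only that duplicates are eliminated.

```python
import os
import tempfile

def language_names(articles):
    """Yield each level-2 heading once, in first-seen order (assumed)."""
    seen = set()
    for title, headings in articles:
        if ':' in title:              # skip pages such as 'Template:example'
            continue
        for name in headings:
            if name not in seen:
                seen.add(name)
                yield name

# Made-up stand-in data: (title, level-2 headings) pairs.
sample = [
    ('dictionary', ['English']),
    ('Template:example', ['Not a language']),
    ('chat', ['English', 'French']),
]

tgtfn = os.path.join(tempfile.mkdtemp(), 'langs.txt')
with open(tgtfn, 'w', encoding='utf-8') as f:
    for name in language_names(sample):
        f.write(name + '\n')

with open(tgtfn, encoding='utf-8') as f:
    names = f.read().splitlines()
```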
- extract_dicts(tgtlangs_fn, tgtdir)
Reads the names of the target languages from the file tgtlangs_fn. Creates tgtdir and writes one file per language in that directory, named after the language. Each Wiktionary page is processed into entries, one for each level-2 heading, and each Entry is pickled to the corresponding language file.
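If each Entry is written with its own pickle.dump call onto the same language file (an assumption; the text does not spell out the file layout), the file holds a sequence of pickles that can be read back by calling pickle.load until EOFError. The sketch below uses plain dicts as stand-ins for Entry objects, and the helper names append_entry and load_entries are hypothetical.

```python
import os
import pickle
import tempfile

def append_entry(tgtdir, lang, entry):
    """Append one pickled entry to the file named after the language."""
    with open(os.path.join(tgtdir, lang), 'ab') as f:
        pickle.dump(entry, f)

def load_entries(fn):
    """Yield every pickled object in fn, in the order it was written."""
    with open(fn, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:          # no more pickles in the file
                return

tgtdir = tempfile.mkdtemp()
append_entry(tgtdir, 'English', {'title': 'dictionary'})
append_entry(tgtdir, 'English', {'title': 'chat'})
append_entry(tgtdir, 'French', {'title': 'chat'})

english = list(load_entries(os.path.join(tgtdir, 'English')))
```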
- class WiktArticle
- wikt
The WiktDump object.
- orig
The original (raw) article, a dict.
- title
The title. The value of orig[‘title’], or if that does not exist, the empty string.
- markdown
The contents. The value of orig[‘revision’][‘text’], or if that does not exist, the empty string.
- parsed()
The parsed version is computed the first time that parsed() is called. The WiktDump parser is called on the markdown. See Parser. The output is a ParsedArticle.
- class ParsedArticle
- orig
The original WiktArticle.
- items
A list of pairs; the output of parsing. Values are either item lists (recursively) or strings or Markdown.
- entries()
The toplevel items have level-2 headings as keys, which are language names. This wraps each value as an Entry and yields it. (But if the title contains a colon, the empty iteration is returned.)
- class Parser
- __call__(md)
The input is markdown. It is split into lines, and then recursively split wherever there are headers. The result is in recursive item format. An element is a list of items, and an item is a (key, value) pair, where the key is either a header or ‘__pre__’ (for material preceding the first header), and the value is either a Markdown object or an element (recursively).
The initial parse produces a list of items whose key is either ‘__pre__’ or a level-1 header. In the return from __call__(), that is reduced to an iteration over items whose key is either ‘__pre__’ or ‘__H1__’ (with the level-1 header as value) or a level-2 header. In principle, there might also be items whose key is ‘__md__’ (for stray markdown), though only if an article is ill-formed.
The final outcome is an iteration over level-2 items, which are pairs (level-2-header, level-2-section). A level-2-section, in turn, is a list of pairs (level-3-header, level-3-section), and so on.
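The recursive item format can be illustrated with a small header splitter. This is a simplified sketch, not the selkie Parser: it starts directly at level 2, uses plain strings where the real parser produces Markdown objects, omits empty ‘__pre__’ items, and does not model the ‘__H1__’ or ‘__md__’ keys. The function name parse and the regular expression are assumptions made for the illustration.

```python
import re

# A header line looks like '==English==' or '===Noun==='.
HEADER = re.compile(r'^(=+)\s*(.+?)\s*\1\s*$')

def parse(lines, level=2):
    """Return a list of (key, value) items, split at headers of `level`."""
    items = []
    key, body = '__pre__', []
    for line in lines + [None]:       # None is a sentinel that flushes the last section
        m = HEADER.match(line) if line is not None else None
        boundary = line is None or (m is not None and len(m.group(1)) == level)
        if not boundary:
            body.append(line)
            continue
        if key == '__pre__':
            text = '\n'.join(body).strip()
            if text:                  # material before the first header
                items.append((key, text))
        else:
            # Recurse: the section body is split at the next header level.
            items.append((key, parse(body, level + 1)))
        if m is not None:
            key, body = m.group(2), []
    return items

# A made-up miniature article.
text = '\n'.join([
    '{{also|Dictionary}}',
    '==English==',
    '{{was wotd|2022|December|12}}',
    '===Etymology===',
    'etymology text',
    '===Noun===',
    'noun text',
])
items = parse(text.split('\n'))
top_keys = [k for (k, v) in items]
english_keys = [k for (k, v) in items[1][1]]
```

In this sketch every section value is itself an item list, so a section without subheaders holds a single ‘__pre__’ item containing its text.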