Wiktionary — ``selkie.data.wiktionary``
=======================================

Using the command line::

   $ python -m selkie.data.wiktionary xlangs DUMPFILE LANGFILE
   $ python -m selkie.data.wiktionary xdicts DUMPFILE TGTLANGFILE OUTDIR

Using the API::

   >>> dump_fn = '~/scratch/ling780/enwiktionary-20230201-pages-articles.xml.bz2'
   >>> from selkie.data.wiktionary import WiktDump
   >>> wikt = WiktDump(dump_fn)
   >>> arts = wikt.articles()
   >>> art = next(arts)
   >>> while ':' in art.title:
   ...     art = next(arts)

An example of a raw article::

   >>> art.orig
   {'title': 'dictionary',
    'ns': '0',
    ...
    'text': '{{also|Dictionary}}\n==English==\n{{was wotd|2022|December|12}}...'}

Parsed version::

   >>> level_1_section = art.parsed()
   >>> [k for (k,v) in level_1_section]
   ['__pre__', 'English']
   >>> level_2_section = level_1_section[1][1]
   >>> [k for (k,v) in level_2_section]
   ['__pre__', 'Alternative forms', 'Etymology', 'Pronunciation', 'Noun', 'Verb', 'Further reading', 'Anagrams']

.. Loading entries from a language file:

API
---

.. py:class:: WiktDump

   Represents a Wiktionary dump file from
   ``https://dumps.wikimedia.org/enwiktionary``.

   .. py:method:: __init__(dump_fn)

      Sets the members ``dump_fn``, ``prefix``, and ``parse``.
      The *dump_fn* may start with '~'. It should end with
      '-pages-articles.xml.bz2'. The *prefix* is everything before
      that. The *parse* member is an instance of Parser.

   .. py:method:: raw_articles()

      Opens the bz2 file. The first line is expected to be '', and
      is discarded. Next there should be a '' element, which is also
      discarded. After that, the function
      ``selkie.pyx.xml.lines_to_items`` is used to convert the input
      to XML parsed into dictionary format. (See lines_to_items.)
      The elements in the iteration are expected to have tag 'page';
      otherwise an error is signalled. The return value is an
      iteration over elements, each element represented as a dict.

   .. py:method:: articles()

      Calls ``raw_articles()`` and calls ``WiktArticle(art)`` on each raw article.
      Returns an iteration over WiktArticles.

   .. py:method:: find(title)

      Iterates over the ``articles()`` and returns the first one whose
      title is *title*.

   .. py:method:: extract_language_names(tgtfn)

      Writes *tgtfn*. Iterates over the ``articles()``, skipping any
      whose title contains a colon. In the remaining articles, all
      level-2 headings are language names. Extracts them and writes
      them to *tgtfn*, one per line, eliminating duplicates.

   .. py:method:: extract_dicts(tgtlangs_fn, tgtdir)

      Reads the names of the target languages from the file
      *tgtlangs_fn*. Creates *tgtdir* and writes one file per language
      in that directory. The filename is the language name. It
      processes each Wiktionary page into **entries**, one for each
      level-2 heading. Each Entry is pickled to the corresponding
      language file.

.. py:class:: WiktArticle

   .. py:attribute:: wikt

      The WiktDump object.

   .. py:attribute:: orig

      The original (raw) article, a dict.

   .. py:attribute:: title

      The title. The value of ``orig['title']``, or if that does not
      exist, the empty string.

   .. py:attribute:: markdown

      The contents. The value of ``orig['revision']['text']``, or if
      that does not exist, the empty string.

   .. py:method:: parsed()

      The parsed version is computed the first time that ``parsed()``
      is called. The WiktDump parser is called on the markdown. See
      Parser. The output is a ParsedArticle.

.. py:class:: ParsedArticle

   .. py:attribute:: orig

      The original WiktArticle.

   .. py:attribute:: items

      A list of pairs; the output of parsing. Values are either item
      lists (recursively) or strings or Markdown.

   .. py:method:: entries()

      The toplevel items have level-2 headings as keys, which are
      language names. This wraps each value as an Entry and yields it.
      (But if the title contains a colon, the empty iteration is
      returned.)

.. py:class:: Entry

   .. py:attribute:: word

      The lemma.

   .. py:attribute:: lang

      The language name.

   .. py:attribute:: items

      The contents.

.. py:class:: Parser

   .. py:method:: __call__(md)

      The input is markdown.
      It is split into lines, and then recursively split wherever
      there are headers. The result is in **recursive item format**.
      An *element* is a list of *items*, and an *item* is a
      (key, value) pair, where the *key* is either a header or
      '__pre__' (for material preceding the first header), and the
      *value* is either a Markdown object or an *element*
      (recursively).

      The initial parse produces a list of items whose key is either
      '__pre__' or a level-1 header. In the return from
      ``__call__()``, that is reduced to an iteration over items whose
      key is either '__pre__' or '__H1__' (with the level-1 header as
      value) or a level-2 header. In principle, there might also be
      items whose key is '__md__' (for stray markdown), though only if
      an article is ill-formed.

      The final outcome is an iteration over **level-2 items**, which
      are pairs (level-2-header, level-2-section). A level-2-section,
      in turn, is a list of pairs (level-3-header, level-3-section),
      and so on.
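
The recursive item format described for ``Parser.__call__`` can be illustrated with a small standalone sketch. This is not the selkie implementation: ``header_info`` and ``split_sections`` are hypothetical helpers, plain strings stand in for Markdown objects, the sample text is made up, and the '__H1__' and '__md__' cases are left out. It only shows how splitting at headers of one level and recursing at the next level yields nested lists of (key, value) pairs.

```python
import re

# Matches a wikitext header line such as "==English==" or "===Noun===".
HEADER = re.compile(r'^(=+)\s*(.*?)\s*(=+)\s*$')

def header_info(line):
    """Return (level, title) if line is a balanced header, else None."""
    m = HEADER.match(line)
    if m and len(m.group(1)) == len(m.group(3)):
        return (len(m.group(1)), m.group(2))
    return None

def split_sections(lines, level=2):
    """Hypothetical helper: split lines at headers of the given level.

    Returns a list of (key, value) items. The key is '__pre__' for
    material before the first header, otherwise the header title.
    The value is a string for '__pre__' and a nested item list
    (split at level + 1) for a header.
    """
    # First pass: group the lines under the nearest preceding header.
    groups = [('__pre__', [])]
    for line in lines:
        info = header_info(line)
        if info is not None and info[0] == level:
            groups.append((info[1], []))
        else:
            groups[-1][1].append(line)
    # Second pass: recurse into each header's block.
    items = []
    for key, block in groups:
        if key == '__pre__':
            if block:
                items.append((key, '\n'.join(block)))
        else:
            items.append((key, split_sections(block, level + 1)))
    return items

# Made-up sample text, loosely modeled on the raw article shown earlier.
text = """{{also|Dictionary}}
==English==
{{was wotd|2022|December|12}}
===Etymology===
etymology text
===Noun===
{{en-noun}}
"""

items = split_sections(text.splitlines())
print([k for (k, v) in items])        # ['__pre__', 'English']
print([k for (k, v) in items[1][1]])  # ['__pre__', 'Etymology', 'Noun']
```

In this sketch a header that skips a level (for example a level-4 header directly under a level-2 header) simply stays in the '__pre__' text of the enclosing section; the real parser may treat such cases differently.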