.. automodule:: selkie.pyx.xml XML files — ``selkie.pyx.xml`` ============================== The module ``selkie.pyx.xml`` provides a convenient XML parser. The standard Python library provides XML parsing, but the facilities it provides generally signal an error when encountering ill-formed XML. Unfortunately, a great deal of XML on the web is ill-formed, and aborting is not a very graceful way of dealing with it. The XML parser in ``selkie.pyx.xml`` is designed to be very robust and lightweight. To read in the entire file as a tree:: from selkie.pyx.xml import load_xml tree = load_xml(fn) To read an iteration over trees:: from selkie.pyx.xml import lines_to_trees for tree in lines_to_trees(f): ... The input *f* should be an iteration over lines. It is parsed as an iteration over XML trees. To read in XML trees and convert them to dicts:: from selkie.pyx.xml import lines_to_dicts for (cat, tab) in lines_to_dicts(f): ... *cat* is the tag at the root of the tree, and *tab* is a dict in which each key-value pair comes from one of the children. The key is the child's tag, and the value is a string (if the child consists solely of #CDATA) or a dict obtained by recursively converting the child. XML tags -------- .. py:function:: iter_xml_tags(fn) The XML parser described in the previous section calls ``iter_xml_tags()`` to convert the file to a stream containing a mix of XML tags and character data. The function ``iter_xml_tags()`` is actually the constructor for a class. It has no methods beyond the standard iterator methods. For example: >>> from selkie.data import ex >>> from selkie.pyx.xml import iter_xml_tags >>> for elt in iter_xml_tags(ex('xml1')): ... print(repr(elt)) ... '\n' '\nA "little" ' 'example' '.\n' '\n' '\n' The elements are either of type ``Tag`` or of type ``str``. The XML standard requires quotes around attribute values, but the tag scanner does not insist on them. Entity references are expanded in character data as well as in attribute values. The function ``load_xml_tags()`` converts the iterator into a list: >>> from selkie.pyx.xml import load_xml_tags >>> tags = load_xml_tags(ex('xml1')) >>> tags[0] >>> len(tags) 12 Tags ---- A ``Tag`` instance has the following attributes. .. list-table:: * - ``type`` - One of ``"start"``, ``"end"``, or ``"empty"``. * - ``label`` - The label (category) of the tag. * - ``ftrs`` - A list of pairs (*att, value*). * - ``cpos`` - The character position in the plain text file. * - ``line_number`` - The line number in the original XML file. For example: >>> tag = tags[2] >>> tag >>> tag.type 'start' >>> tag.label 'body' >>> tag.ftrs [('foo', 'hi&bye'), ('bar', '16')] >>> tag.cpos 7 >>> tag.line_number 2 Note that the features are represented as a list of pairs. If multiple features have the same key, all will be present. Entities -------- The tag iterator has a method ``decode_xml_entities()`` that converts XML entities like "&" to characters ("&"). The codes are listed in ``TagIterator.entity_table``. For example: >>> from selkie.pyx.xml import TagIterator >>> TagIterator.entity_table['amp'] '&' One can override ``entity_table`` in a given instance of TagIterator to add new tags. If TagIterator is instantiated with ``entities=False``, entity replacement is suppressed. XML trees --------- .. py:function:: load_xml(fn) The main function. It reads an XML file and converts it to a tree. Optional keyword arguments are: * ``entities=False`` leaves all HTML entities as-is. * ``multiple=True`` causes the return value to be a list of trees. By default, only one tree is returned and it is an error if there is more than one in the file. To give an example, consider the file ``xml1``: >>> from selkie.pyx.io import contents >>> print(contents(ex('xml1')), end='') A "little" example. We read it as a tree: >>> from selkie.pyx.xml import load_xml >>> xml1 = load_xml(ex('xml1')) >>> print(xml1) 0 (html 1 ('#CDATA' '\n') 2 (body 3 ('#CDATA' '\nA "little" ') 4 (b 5 ('#CDATA' example)) 6 ('#CDATA' '.\n')) 7 ('#CDATA' '\n')) Each XML node is represented by a Tree instance. It has one nonstandard member, namely, ``ftrs``. * ``node.cat`` - a string, the tag label or ``'#CDATA'`` * ``node.word`` - a string, the contents of a CDATA node * ``node.ftrs`` - a list of (att, value) pairs * ``node.children`` - a list of Tree instances, or None The functions from ``selkie.nlp.tree`` can be useful with XML trees. For example, one can pick out subtrees using the function ``subtrees()``. It takes a second argument which is either a category or a predicate. >>> from selkie.nlp.tree import subtrees >>> subtrees(xml1, 'b') [] >>> subtrees(xml1, lambda x: x.cat == '#CDATA' and x.word != '\n') [, , ] The function ``subtree()`` is just like ``subtrees()``, except that it returns a tree; it signals an error if the specified tree does not exist or is not unique. >>> from selkie.nlp.tree import subtree >>> body = subtree(xml1, 'body') The function ``terminal_string()`` returns the string contents of a node. >>> from selkie.nlp.tree import terminal_string >>> terminal_string(body) '\nA "little" example .\n' The ``subtrees()`` function will not recurse inside any node that it returns. For example: >>> subtrees(xml1, lambda x: not x.word) [] To retrieve all nodes matching a given criterion, use the function ``nodes()`` and list comprehension: >>> from selkie.nlp.tree import nodes >>> [n for n in nodes(xml1) if not n.word] [, , ] It is worth noting that the ``iter_xml_trees()`` function handles ill-formed XML gracefully. For example: >>> print(contents(ex('bad.html')), end='')

This is an example with lots of missing end tags.

Name Rank SerialNo

Smith Corporal 1234567

Jones Private 7654321

Howard Major General 0000001

The missing end tags are inserted automatically: >>> bad = load_xml(ex('bad.html')) >>> rows = subtrees(bad, 'tr') >>> for row in rows: print(terminal_string(row)) ... Name Rank SerialNo Smith Corporal 1234567 Jones Private 7654321 Howard Major General 0000001 End tags are paired with the nearest matching start tag. If there is no start tag with the same label, the end tag is silently ignored. Within the region spanned by a matching tag pair, there may be unmatched start tags. They are dealt with in a right-to-left pass, as follows. First, every category has a "precedence" assigned to it. (The precedence is only meaningful for HTML categories; it is a fixed constant for any non-HTML categories.) If the precedence is zero (for categories ``br`` and ``img``), the unmatched start tag is converted to an empty tag. Otherwise, the "span" of the unmatched start tag is grown to cover as much material as possible, until it reaches the end of the region defined by the encompassing explicit tag pair, or until it encounters a node of equal or higher precedence. In other words, no element created from an unmatched start tag will contain a child whose precedence is equal to or greater than its own. To get the value for an attribute, use the function ``getvalue()``. >>> from selkie.pyx.xml import getvalue >>> getvalue(body, 'foo') 'hi&bye' Tidy ---- When there are unmatched start tags, ``iter_xml_trees()`` calls the function ``tidy()`` to decide where the end tags should go. It uses a table encoding intuitive operator precedences for HTML tags.

Name	Rank	SerialNo
Smith	Corporal	1234567
Jones	Private	7654321
Howard	Major General	0000001