.. automodule:: selkie.nlp.dep

Dependency conversion — ``selkie.nlp.dep``
==========================================

Overview
--------

The toplevel function for conversions among tree types is
``convert()``. It takes optional arguments giving the type of
``input`` and the type of ``output``. By default, ``input`` is
``'tree'`` and ``output`` is ``'efstemma'``::

   >>> from selkie.nlp.tree import parse_tree
   >>> t = parse_tree('(S (NP (Pron this))'
   ...                ' (VP (VBZ is)'
   ...                ' (NP (DT a) (NN test))))')
   ...
   >>> from selkie.nlp.dep import convert
   >>> print(convert(t))
   0 *root* _ _ _ _
   1 this Pron _ _ 2
   2 is VBZ _ root 0
   3 a DT _ _ 4
   4 test NN _ _ 2

To see how heads are assigned, one can specify ``'headed'`` output::

   >>> print(convert(t, output='headed'))
   0   (S
   1      (NP
   2         (Pron:head this))
   3      (VP:head
   4         (VBZ:head is)
   5         (NP
   6            (DT a)
   7            (NN:head test))))

Or if one prefers a dependency tree to a stemma::

   >>> print(convert(t, output='dep'))
   0   (VBZ:root
   1      (Pron this)
          is
   2      (NN
   3         (DT a)
             test))

The legal input and output types are:
``'tree'`` for an unheaded constituency tree,
``'headed'`` for a headed constituency tree,
``'dep'`` for a dependency tree,
``'stemma'`` for a ``Sentence`` possibly containing empty words, and
``'efstemma'`` for an ε-free stemma.

These reflect the steps of the conversion:
``mark_heads()`` converts an unheaded tree to a headed tree,
``dependency_tree()`` converts a headed tree to a dependency tree,
``stemma()`` converts a dependency tree to a stemma, and
``eliminate_epsilons()`` eliminates empty words.

All steps except the first are non-destructive. If given an unheaded
tree as input, ``convert()`` makes a copy before calling
``mark_heads()``, unless the keyword argument ``destructive=True`` is
provided.

The keyword arguments ``projections`` and ``reductions`` may
optionally be provided; they are passed directly to
``dependency_tree()``.
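
The correspondence between type names and conversion steps can be pictured as a simple chain. The following is only a sketch of that idea, not the code of ``convert()``: the helper ``steps_between()`` is hypothetical, and it assumes that conversion only ever runs forward along the chain.

```python
# Illustrative only: the type and step names come from the text above, but
# this helper is hypothetical and is not part of selkie.
TYPES = ['tree', 'headed', 'dep', 'stemma', 'efstemma']
STEPS = ['mark_heads', 'dependency_tree', 'stemma', 'eliminate_epsilons']


def steps_between(input='tree', output='efstemma'):
    """Return the names of the steps leading from input to output."""
    i = TYPES.index(input)     # raises ValueError for an unknown type name
    j = TYPES.index(output)
    if j < i:
        # Assumption: the chain is only traversed in the forward direction.
        raise ValueError(f'cannot convert {input!r} to {output!r}')
    # STEPS[k] converts TYPES[k] to TYPES[k+1]
    return STEPS[i:j]


print(steps_between())                  # the default: all four steps
print(steps_between('headed', 'dep'))   # a single step
```

With the default arguments all four steps are listed, which matches the default behavior described above: an unheaded tree goes in and an ε-free stemma comes out.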

Usage
-----

The central function provided by ``selkie.nlp.dep`` is
``dependency_tree()``, which converts a headed phrase-structure tree
to a dependency tree. (It signals an error if it encounters a
headless node)::

   >>> h = parse_tree('''
   ... (S (NP:subj (Det the) (N:head dog))
   ...    (VP:head (V:head chased)
   ...             (NP:obj (Det a) (N:head cat)))
   ...    (Adv:mod quickly))
   ... ''')
   >>> from selkie.nlp.dep import dependency_tree
   >>> d = dependency_tree(h)
   >>> print(d)
   0   (V:root
   1      (N:subj
   2         (Det the)
             dog)
          chased
   3      (N:obj
   4         (Det a)
             cat)
   5      (Adv:mod quickly))

The function ``dependency_tree()`` takes two keyword arguments:
``projections`` and ``reductions``. They are passed directly to the
``tree()`` method of ``Projection``, which is discussed below.

Note that the dependency tree may contain empty nodes. The conversion
treats all terminal nodes alike, whether they have a string or
``None`` as their value for ``.word``.

Projections
-----------

The ``dependency_tree()`` function works by first converting the tree
to its *projections*, where a projection is defined as a list of
nodes, each being the head of the previous. There is one projection
for each leaf node. For example, in the tree *h* above,
"the" has projection (Det),
"dog" has projection (NP, N),
"chased" has projection (S, VP, V),
"a" has projection (Det),
"cat" has projection (NP, N), and
"quickly" has projection (Adv).

The left dependents of a projection are defined to be the
concatenation of the left dependents of the nodes it contains, from
outermost to innermost. The right dependents are defined to be the
concatenation of the right dependents of the nodes, from innermost to
outermost. For example, the only left dependent of (S, VP, V) is the
subject NP, and its right dependents are the object NP and the adverb.

The class ``Projection`` represents a projection.
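
The definition of a projection can be made concrete with a small, self-contained sketch. It does not use ``selkie``: the ``Node`` class and the ``projections()`` function here are hypothetical stand-ins, written only to reproduce the projections listed above for the tree *h*.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-in for a headed tree node; not selkie's node class.
@dataclass
class Node:
    cat: str
    word: Optional[str] = None
    children: List['Node'] = field(default_factory=list)
    head: bool = False      # True if this node is the head child of its parent


def projections(root):
    """Map each leaf word to the categories of its projection, outermost first."""
    result = {}

    def start(node):
        # node is the root or a non-head child, so it begins a new projection
        chain = []
        while True:
            chain.append(node.cat)
            if not node.children:           # reached the leaf
                break
            heads = [c for c in node.children if c.head]
            if len(heads) != 1:
                raise ValueError(f'headless node: {node.cat}')
            for child in node.children:
                if not child.head:
                    start(child)            # each dependent starts its own projection
            node = heads[0]                 # follow the head downward
        result[node.word] = chain

    start(root)
    return result


h = Node('S', children=[
    Node('NP', children=[Node('Det', 'the'), Node('N', 'dog', head=True)]),
    Node('VP', head=True, children=[
        Node('V', 'chased', head=True),
        Node('NP', children=[Node('Det', 'a'), Node('N', 'cat', head=True)])]),
    Node('Adv', 'quickly')])

for word, chain in projections(h).items():
    print(word, chain)
```

Keying the result by word is a simplification that only works because the words in this example are distinct.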
One creates a projection from a headed tree::

   >>> from selkie.nlp.dep import Projection
   >>> p = Projection(h)

This actually creates projections recursively for the entire tree.

**Nodes.**
The value of attribute ``nodes`` is the list of nodes that make up the
projection::

   >>> p.nodes
   [, , ]

**Ldeps, rdeps.**
The attributes ``ldeps`` and ``rdeps`` contain the left and right
dependents, converted to projections::

   >>> p.ldeps
   []
   >>> p.rdeps
   [, ]

**Lr, parent, headsib.**
Each non-root projection has values for ``lr``, ``parent``, and
``headsib``, representing the configuration in which the root node
occurs in the original tree. This configuration is called the
"reduction" represented by attaching the root of the projection to
its parent. For example, the projection for the subject NP occurs as
a left dependent in S, with head child VP. Accordingly::

   >>> sp = p.ldeps[0]
   >>> sp.lr
   'L'
   >>> sp.parent
   >>> sp.headsib

(For the root projection, all three attributes have the value ``None``.)

**Tree.**
The method ``tree()`` converts a projection into a dependency tree.
By default, the category of a projection is taken to be the part of
speech of the head node (that is, ``nodes[-1].cat``), and the role is
the role (if any) of the root node (that is, ``nodes[0].role``).
There are two boolean keyword arguments that can be used to select
alternative definitions of category and role.

If ``projections`` is true, then the category is the concatenation
of all categories in the projection. For example::

   >>> print(p.tree(projections=True))
   0   (S_VP_V:root
   1      (NP_N:subj
   2         (Det the)
             dog)
          chased
   3      (NP_N:obj
   4         (Det a)
             cat)
   5      (Adv:mod quickly))

If ``reductions`` is true, then the role is represented by a
``Reduction`` object, which prints out as the concatenation of
``lr``, ``nodes[0].cat``, ``parent.cat``, and ``headsib.cat``.
For example::

   >>> print(p.tree(reductions=True))
   0   (V:root
   1      (N:'L_NP:subj_S_VP'
   2         (Det:L_Det_NP_N the)
             dog)
          chased
   3      (N:'R_NP:obj_VP_V'
   4         (Det:L_Det_NP_N a)
             cat)
   5      (Adv:'R_Adv:mod_S_VP' quickly))

One can specify both ``projections`` and ``reductions``, if desired.

Reduction
---------

The class ``Reduction`` represents the configuration, in the original
headed phrase structure tree, in which a dependent occurs. It has
four attributes:
``lr`` may be "``L``" for a dependent that precedes its head sibling,
"``R``" for one that follows, or "``root``" for the root node;
``dep`` is the category of the dependent;
``parent`` is the category of the parent node; and
``head`` is the category of the head sibling.

Stemmas and governor arrays
---------------------------

A dependency stemma is represented by a ``Sentence`` instance, which
contains ``Word`` instances representing the individual words of the
sentence.

A ``Sentence`` may itself have an ``index()``, which is intended to
represent its position in a collection of sentences such as a
treebank. Otherwise, a ``Sentence`` is simply a list of ``Word``
instances. The word at position 0 is a pseudo-word representing the
root.

To create a sentence with a known number of words, use
``make_sentence()``::

   >>> from selkie.nlp.dep import make_sentence
   >>> s = make_sentence(4, index='test')
   >>> s[1].form = 'This'
   >>> s[2].form = 'is'
   >>> s[3].form = 'a'
   >>> s[4].form = 'test'
   >>> print(s)
   0 *root* _ _ _ _
   1 This _ _ _ 0
   2 is _ _ _ 0
   3 a _ _ _ 0
   4 test _ _ _ 0

Alternatively, one can create an empty sentence and add words one at
a time.
(Note that an "empty" sentence does contain a ``*root*``
pseudo-word)::

   >>> from selkie.nlp.dep import Sentence, Word
   >>> s = Sentence()
   >>> s.append(Word(form='hi'))
   >>> s.append(Word(form='there'))
   >>> print(s)
   0 *root* _ _ _ _
   1 hi _ _ _ 0
   2 there _ _ _ 0

One can copy an existing word by using the ``copy()`` method::

   >>> s[1].copy()

The copy is identical to the original except that its ``sent`` and
``index`` are both ``None``.

.. py:class:: selkie.nlp.dep.Sentence

   The methods of ``Sentence`` are as follows:

   .. py:method:: index()

      Returns the index of the sentence.

   .. py:method:: providence()

      Returns the index as a string, or ``None``.

   .. py:method:: __len__()

      Returns the number of words, including the root pseudo-word.

   .. py:method:: __iter__()

      Iterates over all words, including the root pseudo-word.

   .. py:method:: __getitem__(i)

      Returns the *i*-th word; the root pseudo-word is at 0.

   .. py:method:: words()

      Returns a list of word forms (strings), excluding the root
      pseudo-word.

   .. py:method:: nwords()

      Returns the number of words, excluding the root pseudo-word.

   .. py:method:: cmp(s, other)

      Sentences are compared by comparing words from left to right
      until a difference is found. The root pseudo-words are assumed
      identical, and are not included in the comparison.

   .. py:method:: append(w)

      Adds *w* (not a copy) to the list of words.

   .. py:method:: form(i)

      Returns the form of the *i*-th word.

   .. py:method:: cat(i)

      Returns the category of the *i*-th word.

   .. py:method:: cpos(i)

      Returns the coarse category of the *i*-th word. This signals
      an error if the sentence is not a CoNLL sentence.

   .. py:method:: lemma(i)

      Returns the lemma of the *i*-th word.

   .. py:method:: morph(i)

      Returns the morph of the *i*-th word.

   .. py:method:: govr(i)

      Returns the governor of the *i*-th word.

   .. py:method:: role(i)

      Returns the role of the *i*-th word.

   .. py:method:: column(c)

      Returns the column named *c*, which should be one of
      ``'form'``, ``'cat'``, ``'lemma'``, ``'morph'``, ``'govr'``, or
      ``'role'``.
      The column is a list of values, one for each word. It
      includes the root pseudo-word.

.. py:class:: selkie.nlp.dep.Word

   The members of ``Word`` are as follows:

   .. py:attribute:: index

      The position of the word in the sentence; the root pseudo-word
      has index 0.

   .. py:attribute:: form

      The printed form of the word.

   .. py:attribute:: cat

      The part of speech. In sentences read from a CoNLL-format
      file, the cat is a pair (*cpos, fpos*).

   .. py:attribute:: lemma

      The lemma, i.e., the key to use for lexical access.

   .. py:attribute:: morph

      Morphological information.

   .. py:attribute:: govr

      The index of the governor.

   .. py:attribute:: role

      The role with respect to the governor.

   The methods of ``Word`` are:

   .. py:method:: __lt__(other)

      Comparison is done by comparing attribute values in the order
      ``form``, ``cat``, ``lemma``, ``morph``, ``govr``, ``role``.
      The attribute ``index`` is intentionally omitted, with the
      consequence that words at different positions in the sentence
      may be equal. The attribute ``cpos`` is also omitted; it is
      assumed that ``cpos``, if present, is uniquely determined by
      ``cat``.

   .. py:method:: tagged_string()

      Returns "form.cat".

Conversion to ``Sentence`` (stemma)
-----------------------------------

A stemma is a list of ``Word`` objects, one for each word in the
sentence. The ``Word`` class represents a word as the dependent in a
dependency link.

The function ``stemma()`` converts a dependency tree into a stemma.
For example::

   >>> from selkie.nlp.dep import stemma
   >>> s = stemma(d)
   >>> print(s, end='')
   0 *root* _ _ _ _
   1 the Det _ _ 2
   2 dog N _ subj 3
   3 chased V _ root 0
   4 a Det _ _ 5
   5 cat N _ obj 3
   6 quickly Adv _ mod 3

The columns are: index, word, part of speech, lemma, role, and
governor. The value for governor is the index of the governor, not
the governor itself.
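
To make the relationship between the dependency tree and the stemma rows concrete, here is a minimal sketch of one way such a flattening could work. It does not use ``selkie``, and it is not claimed to be how ``stemma()`` is implemented: ``DepNode`` and ``flatten()`` are hypothetical names, and each row is reduced to (index, form, cat, role, govr).

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-in for a dependency-tree node; not selkie's class.
@dataclass
class DepNode:
    word: str
    cat: str
    role: Optional[str] = None
    ldeps: List['DepNode'] = field(default_factory=list)
    rdeps: List['DepNode'] = field(default_factory=list)


def flatten(root):
    """Return rows (index, form, cat, role, govr) in surface order."""
    rows = []

    def visit(node):
        left = [visit(d) for d in node.ldeps]     # left dependents come first
        index = len(rows) + 1                     # position 0 is left for the root pseudo-word
        rows.append([index, node.word, node.cat, node.role, 0])
        right = [visit(d) for d in node.rdeps]    # then right dependents
        for i in left + right:
            rows[i - 1][4] = index                # record this node as their governor
        return index

    visit(root)
    return [tuple(r) for r in rows]


d = DepNode('chased', 'V', 'root',
            ldeps=[DepNode('dog', 'N', 'subj', ldeps=[DepNode('the', 'Det')])],
            rdeps=[DepNode('cat', 'N', 'obj', ldeps=[DepNode('a', 'Det')]),
                   DepNode('quickly', 'Adv', 'mod')])

for row in flatten(d):
    print(*row)
```

The top node keeps governor 0, which is the index of the root pseudo-word, and the last field of each row is an index rather than a reference to the governing word.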

One can access a stemma like a list::

   >>> s[2]
   >>> s[2].role
   'subj'
   >>> s[2].govr
   3
   >>> s[3]

The length of the stemma is the number of words in the sentence plus
one for the root::

   >>> len(s)
   7

The element at index 0 is a pseudo-word representing the root of the
sentence::

   >>> s[0]

The method ``words()`` returns a list of word forms (strings)
excluding the root pseudo-word::

   >>> s.words()
   ['the', 'dog', 'chased', 'a', 'cat', 'quickly']

Governor array
--------------

A very compact representation of a dependency tree is the *governor
array*. This is simply a list of numbers representing, for each word,
the index of the governor of that word::

   >>> from selkie.nlp.dep import governor_array
   >>> governor_array(d)
   [2, 3, 0, 5, 3, 3]

The argument to ``governor_array()`` may be either a stemma or
something that can be converted to a stemma using the function
``stemma()``.

``DepLists``
------------

A ``DepLists`` object behaves as a list of lists. It is indexed by
word index *i*, and returns the list of indices of words dependent on
*i*. For example, in our example sentence ``s``, word 3 (*chased*)
has dependents 2 (*dog*), 5 (*cat*), and 6 (*quickly*)::

   >>> from selkie.nlp.dep import DepLists
   >>> deps = DepLists(s)
   >>> deps[3]
   [2, 5, 6]
   >>> len(deps)
   7

The ``DepLists`` object prints out readably::

   >>> print(deps)
   [0] *root*
       root: [3] chased
   [1] the
   [2] dog
       None: [1] the
   [3] chased
       subj: [2] dog
       obj: [5] cat
       mod: [6] quickly
   [4] a
   [5] cat
       None: [4] a
   [6] quickly

It contains a pointer to the original sentence, which can be used for
access to the identity of the dependents, etc.::

   >>> deps.sentence[2].form
   'dog'

Lemmatization
-------------

The ``Sentence`` method ``lemmatize()`` sets the ``lemma``, ``cpos``,
and ``morph`` attributes for each word. The value for ``lemma`` is
the lemmatized word, computed using the module
``selkie.nlp.stemmer``. The value for ``cpos`` is the part of speech
with inflection stripped.
The known inflected tags are ``'VBZ'``, ``'VBG'``, ``'VBN'``,
``'VBP'``, ``'VBD'``, ``'NN'``, ``'NNS'``, and the lemmatized
versions are ``'V'`` or ``'N'``. The value for ``morph`` is set to
one of: ``'3s'``, ``'ing'``, ``'en'``, ``'pl'``, ``'ed'``, ``'sg'``,
``'pl'``.

The method is destructive. It only works for English.

Eliminating epsilons
--------------------

The ``Sentence`` method ``eliminate_epsilons()`` eliminates empty
words (those whose form is ``None``).

It is possible for empty words to have dependents. Suppose word *w*
has governor *g*, which is empty. The new governor of *w* is defined
to be its lowest non-empty ancestor, where *ancestor* means the
transitive closure of *governor-of*::

   >>> h = parse_tree('''
   ... (VP (V:head thought)
   ...     (CP (C:head)
   ...         (S
   ...           (NP:subj (Name:head John))
   ...           (VP:head (V:head left)))))
   ... ''')
   >>> s = stemma(dependency_tree(h))
   >>> print(s)
   0 *root* _ _ _ _
   1 thought V _ root 0
   2 _ C _ _ 1
   3 John Name _ subj 4
   4 left V _ _ 2
   >>> print(s.eliminate_epsilons())
   0 *root* _ _ _ _
   1 thought V _ root 0
   2 John Name _ subj 3
   3 left V _ _ 1

CoNLL Format
------------

To get the raw contents of a file in CoNLL dependency format, use
``selkie.nlp.io.iter_record_blocks()``::

   >>> from selkie.nlp.io import iter_record_blocks
   >>> from selkie.data import ex
   >>> sent = next(iter_record_blocks(ex('depsent1')))
   >>> sent[0]
   ('1', 'This', 'this', '_', 'pron', '_', '2', 'subj', '_', '_')

The fields are: index, form, lemma, cpos, fpos, morph, head, rel,
phead, prel. The fields cpos, phead, and prel are considered "extra"
information: they are optional, whereas fpos, head, and rel are
obligatory. (Head and rel are obligatory, but need not be
projective; phead and prel are optional, but must be projective.)
Missing fields are represented with a single underscore character.

.. py:function:: iter_sentences(fn)

   The function ``iter_sentences()`` reads a CoNLL-format file as a
   sequence of ``selkie.nlp.dep.Sentence`` instances.
   It takes a filename as input, with an optional "``#proj``" or
   "``#std``" suffix. The function ``conll_sents()`` is a synonym.

   The mapping between the raw fields and the ``Sentence``
   attributes is done as follows. For each word, if both cpos and
   fpos are present, then the cat is fpos and ``cpos`` is added as an
   extra attribute. If only one is present, it becomes the cat. If
   the filename ends in ``#proj``, the phead and prel are used;
   otherwise, the head and rel are used. (The suffix "``#std``"
   selects head and rel, but that is also the default)::

      >>> from selkie.nlp.dep import iter_sentences
      >>> s = next(iter_sentences(ex('depsent1')))
      >>> print(s[1])
      >>> s[1].cat
      'pron'

.. py:function:: load_sentences(fn)

   Like ``iter_sentences()``, but returns a list rather than an
   iterator.

.. py:function:: save_sentences(sents, fn)

   Takes a list of sentences and a filename as input::

      >>> from tempfile import TemporaryDirectory
      >>> from os.path import join
      >>> from selkie.nlp.dep import save_sentences, load_sentences
      >>> with TemporaryDirectory() as dfn:
      ...     fn = join(dfn, 'sents')
      ...     save_sentences([s], fn)
      ...     sents = load_sentences(fn)
      ...     print(sents[0])
      ...
      0 *root* _ _ _ _
      1 This pron this subj 2
      2 is vb be mv 0
      3 a dt a det 4
      4 test n test prednom 2

   If one loads a sentence and then saves it, the result may differ
   from the original. Namely, if the original records contain cpos
   but not fpos, the cpos will show up in the fpos position in the
   saved file.

Universal postag mapping
------------------------

Das and Petrov (2011) [3145] introduced a set of universal
part-of-speech tags that were subsequently used in the McDonald et
al. delexicalized parsers. Petrov, Das & McDonald [3300] describe a
set of tag tables, which are installed in ``selkie.data`` as
``conll/2006/universal-pos-tags``.

.. py:function:: load_umap(fn)

   Loads a tag map from a file, returning a dict.
   (If given a relative pathname, it expands it relative to the
   ``universal-pos-tags`` directory)::

      >>> from selkie.nlp.dep import load_umap
      >>> map = load_umap('da-ddt.map')
      >>> map['VA']
      'VERB'

.. py:function:: apply_umap(tagmap, sent)

   Takes a map and a sentence in which the word ``cat`` values are
   (``cpos, fpos``) pairs, and changes the ``cat`` values to be
   ``tagmap[fpos]``.

.. py:function:: umapped_sents(fn, tagmap)

   Takes a filename and a map, and generates a sequence of sentences
   in which the map has been applied to the parts of speech. It
   takes an optional flag ``projective=True`` whose meaning is the
   same as for ``conll_sents()``.

   The following example assumes that one has downloaded the CoNLL
   2006 data and stored its location under the config key
   ``data.conll``::

      >>> from selkie import config
      >>> from os.path import expanduser, join
      >>> conll = expanduser(config['data']['conll'])
      >>> fn = join(conll, '2006', 'danish', 'ddt', 'train', 'danish_ddt_train.conll') # doctest: +SKIP
      >>> from selkie.nlp.dep import umapped_sents
      >>> s = next(umapped_sents(fn, map)) # doctest: +SKIP
      >>> s[1].form # doctest: +SKIP
      'Samme'
      >>> s[1].cat # doctest: +SKIP
      'ADJ'
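
The effect of applying a tag map can be illustrated without the CoNLL data. The sketch below is not the ``selkie`` implementation: ``parse_umap()`` and ``apply_map()`` are hypothetical helpers, the two-column layout of the map text is an assumption, and apart from ``VA`` (taken from the example above) the tags are made up.

```python
# Illustrative only: hypothetical helpers, not part of selkie.

def parse_umap(text):
    """Parse lines of the form '<fine tag> <universal tag>' into a dict."""
    tagmap = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:            # skip blank or malformed lines
            fine, universal = fields
            tagmap[fine] = universal
    return tagmap


def apply_map(tagmap, cats):
    """Replace each (cpos, fpos) pair by tagmap[fpos]."""
    return [tagmap[fpos] for (cpos, fpos) in cats]


# 'VA' -> 'VERB' is the entry shown above; 'XX' is a made-up tag.
tagmap = parse_umap('VA\tVERB\nXX\tNOUN\n')
print(apply_map(tagmap, [('V', 'VA'), ('N', 'XX')]))
```

Only the fine-grained tag is consulted in this sketch; the coarse tag of each pair is ignored, which follows the description of ``apply_umap()`` above.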