The Nivre parser
****************

Experiment
----------

Here is an example of running an experiment::

   $ cp /cl/examples/nivre-2007.ftrs ./
   $ cp /cl/examples/nivre.exp ./
   $ python -m selkie.dp.ml.experiment nivre.exp work

This creates the directory work as a subdirectory of the current
working directory.  All output is written in work, except the
main summary, which is written to stdout and also saved in the file
nivre.out as a sister to nivre.exp.  When this particular
experiment has completed, the file nivre.out should contain,
among other things::

   ...
   acc: 0.908801020408 correct= 9975 ntest= 10976
   ...
   LAS:    3912 4991 0.783810859547
   UAS:    4102 4991 0.821879382889
   LA:     4391 4991 0.879783610499
   NSents: 206

The file nivre.exp is called the **experiment file**.
Omitting the .exp extension gives the **experiment name**,
which in this case is nivre.  The directory in which the experiment
file resides (here, the current working directory) is called the
**experiment directory**.  The directory work is called the
**working directory**.  See selkie.dp.ml.experiment.
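
The relationship between these names can be sketched with pathlib.  The
helper experiment_paths below is hypothetical and is not part of selkie;
it only illustrates the naming conventions described above.

```python
from pathlib import Path

def experiment_paths(exp_file, work_dir):
    """Derive the names defined above from an experiment file path.

    Hypothetical helper for illustration only; not part of selkie.
    """
    exp_file = Path(exp_file)
    return {
        'experiment_file': exp_file,
        'experiment_name': exp_file.stem,              # file name without '.exp'
        'experiment_directory': exp_file.parent,       # where the .exp file resides
        'working_directory': Path(work_dir),           # where most output is written
        'summary_file': exp_file.with_suffix('.out'),  # sister of the .exp file
    }

paths = experiment_paths('nivre.exp', 'work')
print(paths['experiment_name'])   # nivre
print(paths['summary_file'])      # nivre.out
```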

Here is an example of an experiment file ::

   command selkie.dp.nivre
   dataset spa.orig
   features nivre-2007
   nulls True
   split.feature fpos.input.0
   split.cpt.s 0
   split.cpt.t 1
   split.cpt.d 2
   split.cpt.g 0.2
   split.cpt.c 0.5
   split.cpt.r 0
   split.cpt.e 1.0

The command is selkie.dp.nivre, which names the module.  The
function that runs the experiment is run_experiment within that
module.  The steps it goes through are the following.

 * Call save_experiment() and then
   load_experiment(), to make a persistent copy of the experiment
   and feature files in the working directory.  Loading the experiment
   also creates the feature function and loads the dataset.

 * Call the train() function to train an oracle, which is
   a classifier whose classes are parsing actions.
   The train() function converts training and testing sentences
   to instances, then uses them to train an oracle.

 * Load a Model from the working directory.

 * Call the model's accuracy() method to get the
   classification accuracy of the oracle on the test instances.

 * Call the model's evaluation() method to use the oracle to parse
   the test sentences, and determine the accuracy of the resulting
   parser as labeled and unlabeled attachment scores (LAS, UAS).
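
The scores reported by the evaluation step can be illustrated with a short
sketch.  It uses the standard definitions (UAS counts tokens with the
correct head, LA counts tokens with the correct label, LAS counts tokens
with both correct) and returns a count, a total, and a ratio, which matches
the layout of the LAS, UAS and LA lines in nivre.out.  The function name and
the data representation are assumptions made for illustration; whether
selkie excludes any tokens (punctuation, for example) is not stated here.

```python
def attachment_scores(gold, predicted):
    """Compute LAS, UAS and LA from per-token (head, label) pairs.

    Illustrative sketch only; not the selkie implementation.
    """
    if len(gold) != len(predicted):
        raise ValueError('gold and predicted must have the same length')
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted))  # head correct
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted))   # label correct
    las = sum(g == p for g, p in zip(gold, predicted))        # both correct

    def ratio(count):
        return count / n if n else 0.0

    return {
        'LAS': (las, n, ratio(las)),
        'UAS': (uas, n, ratio(uas)),
        'LA': (la, n, ratio(la)),
    }

# Toy data: four tokens, each a (head index, role label) pair.
gold = [(2, 'det'), (0, 'root'), (2, 'obj'), (3, 'mod')]
predicted = [(2, 'det'), (0, 'root'), (2, 'subj'), (2, 'mod')]
scores = attachment_scores(gold, predicted)
for name in ('LAS', 'UAS', 'LA'):
    print(name + ':', *scores[name])
```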

Output is passed through a tee so that it goes both to stdout
and to the file *expname*.out in the experiment directory.
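
A tee of this kind can be sketched as a small file-like class.  This is a
minimal illustration of the idea, not the code selkie uses; the class name
Tee is an assumption, and in-memory buffers stand in for stdout and for the
*expname*.out file so that the example has no side effects.

```python
import contextlib
import io

class Tee:
    """File-like object that copies everything written to it to several streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()

# In-memory buffers stand in for the real stdout and the output file.
console = io.StringIO()
summary = io.StringIO()
with contextlib.redirect_stdout(Tee(console, summary)):
    print('NSents: 206')
```

In real use the two streams would be the original sys.stdout and an open
file in the experiment directory.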

Dataset
.......

The dataset spa.orig is Spanish, original format.  To get a list of available
datasets::

   >>> from selkie.data import dep
   >>> sorted(dep.datasets)

See selkie.data.dep.datasets for details.

Features
........

The features are nivre-2007, which are found in the file
nivre-2007.ftrs residing in the experiment directory.  Here are
the contents of the feature file ::

   form input 0
   lemma input 0
   cpos input 0
   fpos input 0
   morph input 0
   form input 1
   fpos input 1
   fpos input 2
   fpos input 3
   role lc input 0
   form stack 0
   lemma stack 0
   cpos stack 0
   fpos stack 0
   morph stack 0
   role stack 0
   fpos stack 1
   form govr stack 0
   role lc stack 0
   role rc stack 0

The first line says that input[0].form is one feature.  The
last line says that stack[0].rc.role is one feature.
For more details, see selkie.dp.features.
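
The mapping from a line of the feature file to a feature expression can be
sketched as follows.  The function is hypothetical and is inferred only from
the two examples above: the first field is the attribute, the last two
fields are the list and the index, and any fields in between (lc, rc, govr)
are treated as a path from the indexed word.  How selkie orders several
intermediate fields is not stated here, so the sketch simply keeps them in
the order written.

```python
def feature_expression(spec):
    """Turn a feature-file line such as 'role rc stack 0' into 'stack[0].rc.role'.

    Hypothetical illustration; see selkie.dp.features for the real behavior.
    """
    fields = spec.split()
    if len(fields) < 3:
        raise ValueError('expected at least: attribute list index')
    attribute = fields[0]
    path = fields[1:-2]                  # e.g. ['rc'] or ['govr'], possibly empty
    source, index = fields[-2], int(fields[-1])
    return '.'.join(['%s[%d]' % (source, index)] + path + [attribute])

print(feature_expression('form input 0'))      # input[0].form
print(feature_expression('role rc stack 0'))   # stack[0].rc.role
```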

Nulls
.....

There are two ways that a feature may be null: either the feature
expression (e.g., input[0].form) raises an error when
evaluated, or it evaluates to a value that is boolean false.  If
nulls is True, then null values are kept and represented as null.
Otherwise, features with null values are omitted from the instance.
See selkie.dp.features.
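
The two cases can be illustrated with a small sketch.  Everything here is
an assumption made for illustration: features are modeled as plain
callables on a dict-based configuration, and the null value is written as
the string 'null'.  The actual representation in selkie may differ.

```python
def make_instance(features, config, nulls=True):
    """Evaluate each feature on a configuration, handling null values.

    Illustrative sketch only; not the selkie implementation.
    """
    instance = {}
    for name, expression in features.items():
        try:
            value = expression(config)
        except Exception:
            value = None               # an evaluation error counts as null
        if value:
            instance[name] = value
        elif nulls:
            instance[name] = 'null'    # keep the feature, marked as null
        # otherwise the feature is omitted from the instance
    return instance

# Toy configuration: one input word with an empty lemma, and an empty stack.
config = {'input': [{'form': 'casa', 'lemma': ''}], 'stack': []}
features = {
    'input[0].form': lambda c: c['input'][0]['form'],
    'input[0].lemma': lambda c: c['input'][0]['lemma'],   # boolean false
    'stack[0].form': lambda c: c['stack'][0]['form'],     # raises IndexError
}
with_nulls = make_instance(features, config, nulls=True)
without_nulls = make_instance(features, config, nulls=False)
```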

Split
.....

The parser calls selkie.dp.ml.split to do training and testing.  That
module splits the instances into sub-datasets and does SVM training on
each sub-dataset separately.  The value of split.feature is the
feature to use to split the dataset: each distinct value of the
feature names a separate sub-dataset.
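
The splitting step can be sketched as a simple grouping operation.  The
function below is a hypothetical illustration, not the code in
selkie.dp.ml.split; instances are modeled as (features, action) pairs, the
action names are invented, and instances that lack the split feature are
grouped under None, which is an assumption.

```python
from collections import defaultdict

def split_instances(instances, split_feature):
    """Group (features, action) pairs by the value of one feature."""
    subsets = defaultdict(list)
    for features, action in instances:
        subsets[features.get(split_feature)].append((features, action))
    return dict(subsets)

# Toy instances; the feature values and action names are illustrative.
instances = [
    ({'fpos.input.0': 'a', 'form.input.0': 'la'}, 'shift'),
    ({'fpos.input.0': 'c', 'form.input.0': 'y'}, 'left-arc'),
    ({'fpos.input.0': 'a', 'form.input.0': 'el'}, 'shift'),
]
subsets = split_instances(instances, 'fpos.input.0')
for value, subset in sorted(subsets.items()):
    print(value, len(subset))
```

A separate classifier would then be trained on each value of subsets.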

Split.cpt
.........

The split trainer calls a learner on each sub-dataset.  Here the
learner is hardcoded as selkie.dp.ml.libsvm.  The split.cpt
settings are parameters of the libsvm learner.  See selkie.dp.ml.libsvm.
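
As an illustration, the following sketch reads the key-value lines of an
experiment file and collects the split.cpt settings.  The function names
are hypothetical and the parsing rules (one setting per line, key and value
separated by whitespace) are inferred from the example file above.  The
final step assumes that each letter corresponds to the svm-train
command-line option of the same name in LIBSVM (-s, -t, -d, -g, -c, -r and
-e all exist there); that correspondence is an assumption, not something
stated in this document.

```python
EXPERIMENT_TEXT = """\
command selkie.dp.nivre
dataset spa.orig
features nivre-2007
nulls True
split.feature fpos.input.0
split.cpt.s 0
split.cpt.t 1
split.cpt.d 2
split.cpt.g 0.2
split.cpt.c 0.5
split.cpt.r 0
split.cpt.e 1.0
"""

def read_settings(text):
    """Parse 'key value' lines into a dict of strings, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        settings[key] = value
    return settings

def learner_options(settings, prefix='split.cpt.'):
    """Collect the settings under a prefix, with the prefix removed."""
    return {key[len(prefix):]: value
            for key, value in settings.items() if key.startswith(prefix)}

def as_flags(options):
    """Render options in svm-train style, e.g. '-s 0 -t 1' (assumed mapping)."""
    return ' '.join('-%s %s' % (name, value) for name, value in options.items())

settings = read_settings(EXPERIMENT_TEXT)
options = learner_options(settings)
print(as_flags(options))   # -s 0 -t 1 -d 2 -g 0.2 -c 0.5 -r 0 -e 1.0
```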

General usage
-------------

To train and use a parser, one first requires an experiment file.
Assume that ptb.exp has the following contents::

   command selkie.dp.nivre
   dataset ptb.umap
   features delex
   nulls True
   split.feature fpos.input.0
   split.cpt.s 0
   split.cpt.t 1
   split.cpt.d 2
   split.cpt.g 0.2
   split.cpt.c 0.5
   split.cpt.r 0
   split.cpt.e 1.0

Then one creates the model directory ptb.model by doing::

   >>> from selkie.dp import nivre
   >>> nivre.train('ptb')

Training also creates the directory ptb.work.  The work directory can be
used to evaluate parser accuracy, provided that the dataset
includes a test portion as well.

There are two separate functions for measuring accuracy.  Remember
that the parser uses an oracle.  For a given test sentence, the
correct parse translates into a sequence of parsing actions, each
taken from a particular configuration.  Each configuration corresponds
to a learning instance, and the correct action is its true label.
The accuracy() function reports the accuracy of the trained
oracle on the test instances::

   (missing example)

It gives the proportion of test instances for which the oracle
predicts the correct action.
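
The computation itself is simple and can be sketched as follows.  The
function name and the action names are hypothetical; only the idea, correct
predictions divided by the number of test instances, comes from the text
above.  In the nivre.out excerpt shown earlier, for example, 9975 correct
out of 10976 test instances gives an accuracy of about 0.9088.

```python
def oracle_accuracy(gold_actions, predicted_actions):
    """Return (accuracy, number correct, number of test instances)."""
    if len(gold_actions) != len(predicted_actions):
        raise ValueError('expected one prediction per test instance')
    ntest = len(gold_actions)
    correct = sum(g == p for g, p in zip(gold_actions, predicted_actions))
    acc = correct / ntest if ntest else 0.0
    return acc, correct, ntest

# Toy data: the true action for each test configuration and the prediction.
gold = ['shift', 'left-arc', 'shift', 'right-arc']
predicted = ['shift', 'left-arc', 'reduce', 'right-arc']
acc, correct, ntest = oracle_accuracy(gold, predicted)
print('acc:', acc, 'correct=', correct, 'ntest=', ntest)
```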

To train::

   >>> nivre.train('foo')

The file 'foo.exp' must exist.
Training writes many files, split by the part of speech of input[0].
The list of parts of speech occurring in training is written to
StatsTrainParts and those in test files are written to
StatsTestParts.  Training is only done for parts of speech that have
both training and testing files.
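
The selection of parts to train on amounts to a set intersection, as in the
sketch below.  The function is hypothetical and works on plain lists of
part names; the on-disk format of StatsTrainParts and StatsTestParts is not
described here, so reading those files is left out.

```python
def trainable_parts(train_parts, test_parts):
    """Return the parts of speech that have both training and testing data."""
    return sorted(set(train_parts) & set(test_parts))

# Illustrative part names; Fz stands for a part seen only in the test data.
print(trainable_parts(['Fa', 'Fc', 'Fd'], ['Fc', 'Fd', 'Fz']))   # ['Fc', 'Fd']
```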

To compute the accuracy of the predictions on the test files::

   >>> nivre.accuracy()
   Accuracy: 0.581359329446 correct= 6381 ntest= 10976
   Fa acc= 0.333333333333 correct= 1 ntest= 3
   Fc acc= 0.639606396064 correct= 520 ntest= 813
   Fd acc= 0.576923076923 correct= 15 ntest= 26
   ...

Options
-------

The train() function takes the following options:

 * features: the filename of a set of feature specifications.

 * split_ftr: the attribute to use for splitting
   up the training data.