Language Digitization
*********************

At first glance, Selkie is the combination of two very different
things: an "old school" natural language processing (NLP) pipeline
that is packaged as a conversational agent, and an
application for language documentation. But both are parts of a larger
research programme. (The NLP pipeline is "old school" in the sense
that it is manually written software and not a neural network.)

The aim is to support the learning of grammars as *comprehensible*
models of languages. The NLP pipeline is driven by a grammar
written in a comprehensible, and relatively lightweight,
format. And the goal of providing a language documentation tool is to
make it easier to create datasets from which grammars can be learned.

Ideally, one would like documentation for all the world's languages:
a *universal corpus*.
The computational-linguistic community has made
significant progress toward the development of such a corpus,
with the **Universal Dependencies (UD)** treebanks as the most prominent example.

The number of languages documented in the UD treebanks has been
increasing linearly since the project began, almost a decade ago. But
there are two problems: (1) A UD treebank is a narrowly syntactic
description; in many cases it does not even include translations.
(2) The rate of increase in documented languages is too slow: at the
current rate, it will be almost 400 years before the UD treebanks
include all the world's languages.

For the majority of the world's languages, it is difficult to obtain
electronic documentation. One thinks immediately of the difficulty of
obtaining audio recordings in the field, but there are reasons
to doubt that the problem is primarily one of obtaining recordings.
Bird has shown that relatively large bodies of audio recordings can
be collected in short periods of time [2880,3639,3486].
Anecdotally, two Ojibwe tribes that I have interacted with have made
audio-video recordings of immersion sessions for educational purposes,
and as a result have accumulated thousands of hours of language data
over the years.
Such observations suggest that, at least for languages that are not
yet moribund, the main difficulty is not the making of primary
recordings but rather the low throughput of the standard documentary
pipeline that leads from primary recordings to finished datasets.
The standard tools used in documentary linguistics are sophisticated
but place high demands on users; they typically emphasize finesse in
annotation over streamlining and ease of use. By contrast, the UD
approach has made progress because it emphasizes speed and simplicity
over sophistication of annotation. A UD syntax tree is starkly simpler
than a tree in a conventional treebank.

Selkie likewise adopts simplicity and light weight as key
design criteria.
In particular, I hope to enable speakers of the language to contribute
directly to the effort of documenting their language, and thereby
to increase the size and diversity of a universal corpus.

The question immediately arises of what might motivate a language
speaker to contribute to a universal linguistic corpus.
We cannot reasonably expect speakers of
low-resource languages to be motivated by the rather esoteric aims of
academic linguistic research. But speaker communities, particularly
those whose languages are faltering in transmission to the next
generation, often do have a strong interest in assembling linguistic
data for purposes of language instruction and preservation.

That provides another medium-term design goal:
to support a *mutually beneficial*
collaboration between computational linguistics and speaker
communities.  The intention is to create an application that is 
not only a platform for research in machine language learning, but
also a tool to assist in human language learning. If the
collaboration benefits the community, its members are more likely to
be willing to release at least some vetted portion of the data for
research use. Even in the absence of released data,
computational-linguistic goals are served by incorporating
language-learning algorithms and their evaluation into the platform.
One may view the latter approach as bringing the algorithms to the
data rather than the data to the algorithms.