Corpora and treebanks

The Brown corpus

The module selkie.data.brown behaves like an NLTK corpus, and indeed it dispatches to nltk.corpus.brown in most cases. However, it provides an alternative reduced tagset:

>>> from selkie.data import brown
>>> brown.tagged_words(tagset='base')[:3]
[('The', 'AT'), ('Fulton', 'NP'), ('County', 'NN')]

Contrast this with the default tagset:

>>> brown.tagged_words()[:3]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]

Functionality

NLTK provides the Brown corpus, though the version in selkie.data.brown is tweaked. The two basic functions are:

>>> brown.words()
>>> brown.tagged_words()

The latter takes an optional argument tagset. If absent or equal to “original,” the full Brown tags are returned. If equal to “base,” the prefix FW- and the suffixes (-NC, -TL, and -HL) are removed from the tags. There are a few places where -T occurs in the original as an error for -TL; these are also stripped.

One can also call the function brown.base() on a tag to strip its prefixes and suffixes, if any. In addition, the function brown.ispunct() indicates whether a tag is a punctuation tag or not, and brown.isproper() indicates whether a tag is a proper name tag or not.
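
The reduction from full tags to base tags described above can be sketched as follows. This is a hypothetical re-implementation for illustration only, not the code behind brown.base(); the name strip_brown_tag and the assumption that suffixes may stack (as in NN-TL-HL) are mine.

```python
# Illustrative sketch; NOT selkie's actual implementation of brown.base().
PREFIX = "FW-"
# "-T" is the erroneous variant of "-TL" mentioned in the text.
SUFFIXES = ("-NC", "-TL", "-HL", "-T")

def strip_brown_tag(tag):
    """Reduce a full Brown tag to its base form."""
    if tag.startswith(PREFIX):
        tag = tag[len(PREFIX):]
    # Assumption: suffixes can stack, so keep stripping until none remain.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if tag.endswith(suffix) and len(tag) > len(suffix):
                tag = tag[:-len(suffix)]
                stripped = True
    return tag

for full in ["AT", "NP-TL", "FW-NN-TL", "NN-TL-HL", "--"]:
    print(full, "->", strip_brown_tag(full))
```

Note that the punctuation tag -- passes through unchanged, since it does not end in any of the listed suffixes.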

Both brown.words() and brown.tagged_words() can be called with optional parameters categories or fileids, with the same interpretation as in NLTK.

The Brown tagset

There are 188 base tags, which break down as follows:

NIL (1 tag)
Compound tags (96 tags)
Simple tags (91 tags)
    Punctuation tags (9 tags)
    Proper-noun tags (4 tags)
    Regular word tags (78 tags)
        Tags for unique lexical items (21 tags)
        Closed-class tags (39 tags)
        Open-class tags (18 tags)

NIL. There are 157 tokens in the original that are tagged “NIL.” This appears to be simply a gap in the tagging. They are not removed from the output of stripped().

Compound tags. The compound tags are for contracted word pairs; four of them are actually contracted word triples. They exclude possessives, inasmuch as the possessive marker is not an independent word. The majority of contractions involve either the combination of a verb tag with *, which represents the contraction “n’t”; or the combination of a noun or pronoun tag with a verb or auxiliary tag. There are, however, a fair number of other cases as well. The simple tags making up compound tags all occur independently, with the exception of “PP,” which occurs in a compound tag but not standing alone. It is probable that this is an error for PPS or PPO, particularly since it occurs at the end of a long tag that may have gotten truncated.

Punctuation tags. Of the 91 simple tags, nine are punctuation tags:

' '' ( ) , -- . : ``

Proper-noun tags. Four tags represent proper nouns:

NP NP$ NPS NPS$

The tag NP includes titles such as “Mr.” and “Jr.,” as well as place names, month names, and the like. NPS includes words like “Republicans.”
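
The two tag classes just listed are small enough to test by set membership. The sketch below is only an illustration of that idea; the names is_punct_tag and is_proper_tag are hypothetical, and the actual brown.ispunct() and brown.isproper() may be implemented differently. It assumes base tags as input.

```python
# Illustrative sketch; tag sets copied from the lists above.
PUNCT_TAGS = {"'", "''", "(", ")", ",", "--", ".", ":", "``"}
PROPER_TAGS = {"NP", "NP$", "NPS", "NPS$"}

def is_punct_tag(tag):
    """True if the (base) tag is one of the nine punctuation tags."""
    return tag in PUNCT_TAGS

def is_proper_tag(tag):
    """True if the (base) tag is one of the four proper-noun tags."""
    return tag in PROPER_TAGS

# The first three base-tagged words shown earlier in this section.
tagged = [("The", "AT"), ("Fulton", "NP"), ("County", "NN")]
proper_words = [w for (w, t) in tagged if is_proper_tag(t)]
print(proper_words)
```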

Unique lexical items. There are 21 tags that represent unique lexical items. We ignore spelling variation, nonstandard dialect forms, and foreign words. A few of the possessive tags, namely DT$, JJ$, AP$, CD$, RB$, appear on only one word each, but those represent rare constructions or questionable tagging decisions, and are listed elsewhere.

*	not
ABX	both
BE	be
BED	were
BEDZ	was
BEG	being
BEM	am
BEN	been
BER	are
BEZ	is
DO	do
DOD	did
DOZ	does
EX	there
HV	have
HVD	had
HVG	having
HVN	had
HVZ	has
TO	to
WQL	how, however

Closed-class tags. There are 39 closed-class tags:

Conjunctions

CC	and, but, or, nor, either, yet, neither, plus, minus, though
CS	complementizers

Specifiers

ABL	such, quite, rather
ABN	all, half, many, nary
AP	other, many, more, same, …
AP$	other’s
AT	the, a(n), no, every
DTI	some, any
DTS	these, those
DTX	either, neither, one
DT	this, that, each, another
DT$	another’s
QLP	enough, indeed, still

Numbers

CD	cardinal numbers
CD$	1960’s, 1961’s
OD	ordinal numbers

Pronouns

PPS	he, it, she
PPSS	I, they, we, you
PPO	it, him, them, me, her, you, us
PP$	his, their, her, its, my, our, your
PP$$	his, mine, ours, yours, theirs, hers
PPL	himself, itself, myself, herself, yourself, oneself
PPLS	themselves, ourselves, yourselves
PN	one; (some-, no-, any-, every-) + (-thing, -body)
PN$	one’s, anyone’s, everybody’s, …
RN	here, then, afar

Interrogatives

WDT	which, what, whichever, whatever
WPS	who, that, whoever, what, whatsoever, whosoever
WPO	whom, that, what, who
WP$	whose, whosever
WRB	when, where, how, why, plus many variants

Other Closed Classes

MD	modals
NR	adverbial nouns: days of the week, cardinal directions, etc.
NRS	plural adverbial nouns
NR$	possessive adverbial nouns
QL	qualifiers (adverbs that modify quantifiers)
IN	prepositions
RP	particles
UH	interjections

Open-class tags. There are 18 open-class tags, of which two (JJ$ and RB$) appear to be the result of phrasal use of the possessive, and should probably be placed in the class of compound tags.

Nouns

NN	singular
NNS	plural
NN$	possessive
NNS$	possessive plural

Verbs

VBZ	third-person singular
VBD	past tense
VB	uninflected form
VBG	present participle
VBN	past participle

Adjectives

JJ	positive
JJR	comparative
JJS	intrinsically superlative
JJT	morphologically superlative
JJ$	Great’s

Adverbs

RB	adverb
RBR	comparative
RBT	superlative
RB$	else’s

The Penn Treebank

Another source of trees is the Penn treebank, represented by the module ptb. It contains functions to access the Penn Treebank and its parts.

One may specify in the Selkie configuration file the pathname for the contents of LDC99T42.

Fileids and categories

The treebank consists of 2312 files divided into 25 sections. There is a traditional division into train, test, dev train, dev test, and reserve test parts:

Division	Sections	Files
dev_train	00-01	0-198
train	02-21	199-2073
reserve_test	22	2074-2156
test	23	2157-2256
dev_test	24	2257-2311

The functions follow the conventions of the NLTK corpus readers. The function fileids() returns a list of file identifiers, which are actually numbers in the range [0,2312). One can also specify one or more categories. Category names are either WSJ section names, in the form ‘00’, ‘01’, up to ‘24’, or one of the following: ‘train’, ‘test’, ‘dev_train’, ‘dev_test’, ‘reserve_test’. One can get a list of the fileids in a given category, or the categories that a given file belongs to:

>>> from selkie.data import ptb
>>> len(ptb.fileids())
2312
>>> len(ptb.fileids(categories='train'))
1875
>>> ptb.fileids('dev_train')[-5:]
[194, 195, 196, 197, 198]
>>> ptb.categories(0)
['00', 'dev_train']
>>> ptb.categories(2311)
['24', 'dev_test']
>>> for c in sorted(ptb.categories()):
...     if c.islower():
...         print(c, len(ptb.fileids(c)))
...
dev_test 55
dev_train 199
reserve_test 83
test 100
train 1875
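
The division table above can be encoded directly as ranges of fileids, which makes the counts easy to verify. This is a standalone sketch: DIVISIONS and division_of are hypothetical names, not part of the ptb module.

```python
# Illustrative sketch; ranges copied from the Division/Sections/Files table.
DIVISIONS = {
    "dev_train": range(0, 199),
    "train": range(199, 2074),
    "reserve_test": range(2074, 2157),
    "test": range(2157, 2257),
    "dev_test": range(2257, 2312),
}

def division_of(fileid):
    """Return the name of the division that contains the given fileid."""
    for name, ids in DIVISIONS.items():
        if fileid in ids:
            return name
    raise ValueError("fileid out of range: %r" % (fileid,))

for name in sorted(DIVISIONS):
    print(name, len(DIVISIONS[name]))
```

The sizes printed by the loop match the counts in the session above, and they sum to the 2312 files of the treebank.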

Filenames

One can obtain the filename for a given fileid:

>>> ptb.orig_filename(199)[-15:]
'02/wsj_0200.mrg'

Reverse look-up is also possible:

>>> ptb.orig_to_fileid('0200')
199

The reverse look-up table is loaded the first time that orig_to_fileid() is called.

Trees

The method trees() returns a list of all the individual trees in the treebank or a slice of it:

>>> trees = ptb.trees(0)
>>> print(trees[0])
0   (
1      (S
2         (NP:SBJ
3            (NP
4               (NNP Pierre)
5               (NNP Vinken))
6            (, ,)
...
>>> len(ptb.trees(categories='dev_test'))
1346

There is also a function iter_trees() that returns an iterator over the trees rather than a list.

Empty nodes

In the original treebank, typical empty nodes look like this:

(NP-SBJ (-NONE- *-1) )
(SBAR (-NONE- 0)
   (S (-NONE- *T*-1) ))

We omit “-NONE-” and treat “*,” “0,” or “*T*” as the category. Such a node has no children, and its word is empty. For example:

>>> trees = ptb.trees(categories='dev_test')
>>> tree = trees[30]
>>> np = tree[18]
>>> print(np)
0   (NP:SBJ
1      (*T* &1))
>>> t = np.children[0]
>>> t.cat
'*T*'
>>> t.word
''
>>> tree = trees[86]
>>> s1 = tree[36]
>>> print(s1)
0    (SBAR
1       (0)
2       (S
3          (*T* &1)))
>>> s1.children[0].cat
'0'
>>> s = s1.children[1]
>>> s.children[0].cat
'*T*'
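
To make the treatment of empty nodes concrete, the following sketch splits the leaf text of an original “-NONE-” node into a category and an optional index. It is an assumption about how such a split could be done, not a description of the ptb module's internals; split_empty_label is a hypothetical name.

```python
import re

# Category, optionally followed by "-" and a numeric index, e.g. "*T*-1".
_LABEL = re.compile(r"^(.*?)(?:-(\d+))?$")

def split_empty_label(text):
    """Split e.g. '*T*-1' into ('*T*', 1) and '0' into ('0', None)."""
    m = _LABEL.match(text)
    cat, index = m.group(1), m.group(2)
    return cat, (int(index) if index is not None else None)

for text in ["*-1", "0", "*T*-1", "*U*"]:
    print(text, "->", split_empty_label(text))
```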

Methods

The module ptb is summarized in the following table. The arguments f and c are optional and can also be provided by keyword, as fileids and categories, respectively.

fileids(c) — The file IDs in categories c

categories(f) — The categories for fileids f

trees(f,c) — The trees in the given files/categories

words(f,c) — The words

sents(f,c) — Sentences (lists of words)

raw_sents(f,c) — Sentence strings

abspath(f) — The absolute pathname for the fileid

text_filename(f) — Pathname for the text file

orig_filename(f) — The original pathname

fileid_from_orig(o) — Convert original ID (4 digits)

text_files(f,c) — List of text filenames

orig_files(f,c) — List of original filenames

The function fileid_from_orig() takes an original file identifier. It strips a trailing file suffix, if any, and then ignores everything except the last four characters, which should be digits, such as “0904,” which represents file 04 in WSJ section 09. Accordingly, “parsed/mrg/wsj/09/wsj_0904.mrg,” “wsj_0904.mrg,” and simply “0904” are treated as synonymous.
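
The normalization just described (strip a suffix, keep the last four characters) can be sketched as below. This is an illustration under the stated rules, not the module's own code; normalize_orig_id and section_and_file are hypothetical names, and the final lookup from the four digits to a numeric fileid is not shown.

```python
import os.path

def normalize_orig_id(name):
    """Reduce an original file identifier to its four-digit form."""
    stem, _suffix = os.path.splitext(name)  # drop a trailing ".mrg", if any
    digits = stem[-4:]                      # ignore everything but the last four characters
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("not an original file identifier: %r" % (name,))
    return digits

def section_and_file(name):
    """Split the four digits into (WSJ section, file within section)."""
    digits = normalize_orig_id(name)
    return digits[:2], digits[2:]

for name in ["parsed/mrg/wsj/09/wsj_0904.mrg", "wsj_0904.mrg", "0904"]:
    print(name, "->", normalize_orig_id(name))
```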

Statistics

Bikel [2767] reports a number of statistics for the standard training slice (sections 02–21) of the Penn Treebank. We can compute our own statistics and compare, as follows. (Be warned, the calls that iterate over trees take on the order of minutes to return.)

Number of sentences. Bikel counts 39,832 sentences. Our count agrees:

>>> count(ptb.trees(categories='train'))
39832

Number of word tokens. Bikel counts 950,028 word tokens (not including null elements). Our count agrees:

>>> count(n for t in ptb.trees(categories='train')
...             for n in t.nodes()
...                 if n.isword())
950028
>>> count(ptb.words(categories='train'))
950028

Number of word types. Bikel counts 44,114 unique words (not including null elements). Our count is slightly higher. I do not know why there is a discrepancy:

>>> len(set(n.word for t in ptb.trees(categories='train')
...                    for n in t.nodes()
...                        if n.isword()))
44389
>>> len(set(ptb.words(categories='train')))
44389

Number of words with a count greater than 5. Bikel reports that 10,437 word types occur 6 times or more. Our count is again a little higher:

>>> count(w for w in wcts if wcts[w] >= 6)
10530
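
The table wcts used above is not defined in this section; presumably it maps each word type to its token count. A minimal sketch of building and querying such a table with collections.Counter is shown here on a toy word list. The helper name types_with_count_at_least is hypothetical, and on the real data the word list would come from something like ptb.words(categories='train').

```python
from collections import Counter

# Toy stand-in for the list of training words.
words = "the cat saw the dog and the dog saw the cat".split()

# Assumption: wcts maps each word type to its number of occurrences.
wcts = Counter(words)

def types_with_count_at_least(counts, k):
    """Number of word types that occur k times or more."""
    return sum(1 for w in counts if counts[w] >= k)

print(types_with_count_at_least(wcts, 2))
print(types_with_count_at_least(wcts, 4))
```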

Number of interior nodes. Bikel reports 904,748 brackets. Our count is quite a bit lower:

>>> count(n for t in ptb.trees(categories='train')
...             for n in t.nodes()
...                 if n.isinterior())
792794

Number of nonterminal categories. Bikel reports 28 basic nonterminals, excluding roles (“function tags,” in his terms) and indices. Including roles and indices, he reports 1184 full nonterminal labels. We count 27 basic nonterminals:

>>> ntcats = set(n.cat for t in ptb.trees(categories='train')
...                        for n in t.nodes()
...                            if n.isinterior())
>>> len(ntcats)
27
>>> sorted(ntcats)
[ADJP, ADVP, CONJP, FRAG, INTJ, LST, NAC, NP, NX, PP, PRN, PRT,
PRT|ADVP, QP, RRC, S, SBAR, SBARQ, SINV, SQ, UCP, VP, WHADJP, WHADVP,
WHNP, WHPP, X]

It is not clear what Bikel’s extra category is. Possibly he went beyond the training data.

Actually, we should probably replace “PRT|ADVP” with either PRT or ADVP. That would leave only 26 categories.

Number of terminal categories. Bikel reports 42 unique part-of-speech tags. We count 55:

>>> parts = set(n.cat for t in ptb.trees(categories='train')
...                       for n in t.nodes()
...                           if n.isleaf())
>>> len(parts)
55
>>> sorted(parts)
[#, $, '', *, *?*, *EXP*, *ICH*, *NOT*, *PPA*, *RNR*, *T*, *U*, ,,
  -LRB-, -RRB-, ., 0, :, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD,
  NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO,
  UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ``]

Eliminating empty leaves reduces the number of parts of speech to 45:

>>> parts = set(n.cat for t in ptb.trees(categories='train')
...                       for n in t.nodes()
...                           if n.isleaf() and not n.isempty())
>>> len(parts)
45
>>> sorted(parts)
[#, $, '', ,, -LRB-, -RRB-, ., :, CC, CD, DT, EX, FW, IN, JJ, JJR,
  JJS, LS, MD, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS,
  RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ``]

Number of roles. Bikel does not count roles separately. We can:

>>> roles = set(imap(Node.role, trn.nodes()))
>>> roles
set([TMP, DIR, PRP-CLR, SBJ-TTL, LOC-HLN, TPC, CLR-TPC, CLF,
CLF-TPC, PUT-TPC, PRD-TPC, NOM-TPC, LGS, PRP-TPC, PRD-TTL,
TPC-TMP, MNR, TPC-PRD, LOC-PRD-TPC, DIR-PRD, LOC-TMP, SBJ,
TMP-TPC, MNR-PRD, HLN, MNR-CLR, BNF, LOC-MNR, PRD-LOC-TPC,
LOC-CLR, TTL, NOM-SBJ, CLR-LOC, NOM, DIR-TPC, TPC-CLR, PRD-TMP,
CLR, TTL-PRD, TMP-CLR, TMP-HLN, LOC-TPC-PRD, PRP-PRD, LOC-TPC,
None, LOC-CLR-TPC, VOC, EXT, MNR-TMP, PRD, NOM-LGS, CLR-TMP,
TMP-PRD, ADV, DTV, NOM-PRD, TTL-SBJ, TPC-LOC-PRD, LOC-PRD,
PRD-LOC, ADV-TPC, CLR-MNR, DIR-CLR, PUT, TTL-TPC, PRP, LOC,
CLR-ADV, MNR-TPC])
>>> len(roles)
69

Categories

The categories occurring in the treebank can be divided into three groups: nonterminal categories, parts of speech, and empty categories.

Nonterminal categories label interior nodes, that is, nodes that have children. (In the treebank, no interior nodes are labeled with words.) There are 28 nonterminal categories, as follows.

ADJP — Adjective phrase

ADVP — Adverb phrase

ADVP|PRT — Indecision

CONJP — Conjunction phrase

FRAG — Fragment

INTJ — Interjection

LST — List enumerator

NAC — Not a constituent

NP — Noun phrase

NX — NP head fragment

PP — Prepositional phrase

PRN — Parenthetical

PRT — Particle

PRT|ADVP — Indecision

QP — Quantifier phrase

RRC — Reduced relative clause

S — Sentence

SBAR — Subordinate clause

SBARQ — Interrogative clause

SINV — Inverted sentence

SQ — Interrogative sentence

UCP — Unlike coord’d phrase

VP — Verb phrase

WHADJP — Wh adjective phrase

WHADVP — Wh adverb phrase

WHNP — Wh noun phrase

WHPP — Wh prepositional phrase

X — Unknown, unbracketable

Parts of speech label nodes that have words. There are 45 parts of speech, as follows.

# — Monetary sign

$ — U.S. dollars

'' — Close quotes

, — Comma

-LRB- — Left parenthesis

-RRB- — Right parenthesis

. — Period

: — Colon

CC — Coordinator

CD — Number

DT — Determiner

EX — Existential there

FW — Foreign word

IN — Preposition

JJ — Adjective

JJR — Comparative adjective

JJS — Superlative adjective

LS — List enumerator

MD — Modal

NN — Common noun

NNP — Proper noun

NNPS — Plural proper noun

NNS — Plural common noun

PDT — Predeterminer

POS — Possessive marker

PRP — Personal pronoun

PRP$ — Possessive pronoun

RB — Adverb

RBR — Comparative adverb

RBS — Superlative adverb

RP — Particle

SYM — Symbol

TO — Infinitival to

UH — Interjection

VB — Uninflected verb

VBD — Verb + ed

VBG — Verb + ing

VBN — Verb + ed/en

VBP — Non-third-person-singular present verb

VBZ — Verb + -s

WDT — Wh determiner

WP — Wh pronoun

WP$ — whose

WRB — Wh adverb

`` — Open quotes

Empty categories label empty leaf nodes, that is, nodes that have neither children nor words. There are 10 empty categories, listed in the following table.

* — PRO or trace of NP-movement; preterminal cat is NP

*?* — Ellipsis

*EXP* — Pseudo-attachment: extraposition

*ICH* — Pseudo-attachment: “interpret constituent here” (discontinuous dependency)

*NOT* — “Anti-placeholder” in template gapping

*PPA* — Pseudo-attachment: “permanent predictable ambiguity”

*RNR* — Pseudo-attachment: right-node raising

*T* — Trace of wh-movement

*U* — Unit

0 — Null complementizer

NX is generally used in coordinate structures. It may be used for N-bar coordination: “the [NX red book] and [NX yellow pencils].” It is also used in non-constituent coordination structures such as “20 thin [NX] and 10 fat [NX] [NX dogs],” where “dogs” is treated as a right-node raised node. It is also used for book/movie titles that have premodifiers.

Lists of the categories are found in the following variables:

>>> len(ptb.nonterminal_categories)
28
>>> len(ptb.parts_of_speech)
45
>>> len(ptb.empty_categories)
10

These lists were constructed using the function collect_categories(). It returns a list containing three sets: nonterminal categories, parts of speech, and empty categories. A category is defined to be nonterminal if it appears on a node with children, a part of speech if it appears on a node with a word, and an empty category otherwise. Note that the empty string is included as an extra nonterminal category: there are some nonterminal nodes (root nodes) without a category.
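
The classification rule stated above can be illustrated with a small standalone sketch. The Node class here is a stand-in defined only for this example (it is not selkie's tree class), and collect is a hypothetical name that mirrors the described behavior of collect_categories().

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Stand-in tree node: a category plus either a word or children."""
    cat: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

def walk(node):
    """Yield the node and all of its descendants, in preorder."""
    yield node
    for child in node.children:
        yield from walk(child)

def collect(trees):
    """Return [nonterminal categories, parts of speech, empty categories]."""
    nonterminals, parts, empties = set(), set(), set()
    for tree in trees:
        for node in walk(tree):
            if node.children:
                nonterminals.add(node.cat)   # has children
            elif node.word:
                parts.add(node.cat)          # has a word
            else:
                empties.add(node.cat)        # neither children nor word
    return [nonterminals, parts, empties]

# A root node without a category, as found at the top of treebank trees.
tree = Node("", children=[
    Node("S", children=[
        Node("NP", children=[Node("NNP", word="Pierre")]),
        Node("VP", children=[
            Node("VBZ", word="says"),
            Node("SBAR", children=[
                Node("0"),
                Node("S", children=[Node("*T*")]),
            ]),
        ]),
    ]),
])
nonterminals, parts, empties = collect([tree])
print(sorted(nonterminals), sorted(parts), sorted(empties))
```

The empty string appears among the nonterminals in this toy example for the same reason as in the real lists: the root node has children but no category.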

Roles

The roles that occur in the PTB are listed in the following table.

ADV — Adverbial (form vs function) — Used on NP or SBAR, but not ADVP or PP. Subsumes more-specific adverbial tags.

BNF — Benefactive (adverbial) — May be used on indirect object.

CLF — Cleft (misc) — It clefts. Marks the whole sentence; not actually a role.

CLR — Closely related (misc) — Intermediate between argument and modifier.

DIR — Direction (adverbial) — May be multiple: from, to.

DTV — Dative (grammatical role) — Only used if there is a double-object variant. Also ablative meaning: ask a question [of X]. But anything with for is BNF. Not used on indirect object!

EXT — Extent (adverbial) — Distance, amount. Not for obligatory complements, e.g. of weigh.

HLN — Headline (misc) — Marks the whole phrase; not actually a role.

LGS — Logical subject (grammatical role) — The NP in a passive by-phrase.

LOC — Locative (adverbial)

MNR — Manner (adverbial)

NOM — Nominal (form vs function) — Marks headless relatives behaving as substantives. Not actually a role. Co-occurs with SBJ and other argument roles.

PRD — Predicate (grammatical role) — Any predicate that is not a VP. Also, the so in do so.

PRP — Purpose or reason (adverbial)

PUT — Locative of put (grammatical role)

SBJ — Subject (grammatical role)

TMP — Temporal (adverbial)

TPC — Topicalized (grammatical role) — Only if there is a trace or resumptive pronoun after the subject.

TTL — Title (misc) — The title of a work, implies NOM. Marks the whole phrase; not actually a role.

VOC — Vocative (grammatical role)

Perseus Latin and Greek Treebanks

The module perseus contains small Latin and Greek treebanks from Project Perseus. The main method for these treebanks is stemmas(), which returns an iterator over the stemmas in the treebank. (Yes, “stemmata” is the correct plural, but it is rather pedantic, so we have anglicized it):

>>> from selkie.data import perseus
>>> stemmas = list(perseus.latin.stemmas())
>>> len(stemmas)
3473
>>> print(stemmas[0])
0 *root*  _         _       _    _
1 In      r-------- in1     AuxP 4
2 nova    a-p---na- novus1  ATR  7
3 fert    v3spia--- fero1   PRED 8
4 animus  n-s---mn- animus1 SBJ  2
5 mutatas t-prppfa- muto1   ATR  6
6 dicere  v--pna--- dico2   OBJ  2
7 formas  n-p---fa- forma1  OBJ  5
8 corpora n-p---na- corpus1 OBJ  0

Dependency treebanks

Accessing datasets

A dataset has a language and a version. Languages are specified as ISO 639-3 codes. There are currently five versions. The original CoNLL treebanks from the 2006 shared task have version orig. Datasets converted to the Das-Petrov universal tagset (DPU) have version umap. The Universal Dependency Treebank (UDT) with standard encoding has version uni, and the UDT with content-head encoding has version ch. The Penn Treebank (PTB) converted to dependencies using my adaptation of the Magerman-Collins (MC) rules has version dep; the same converted to the Das-Petrov tagset has version umap. The following table lists the currently available datasets.

Name	Lg	Ver	Description
arb.orig	arb	orig	CoNLL-2006 Arabic
arb.umap	arb	umap	CoNLL-2006 + DPU, Arabic
bul.orig	bul	orig	CoNLL-2006 Bulgarian
bul.umap	bul	umap	CoNLL-2006 + DPU, Bulgarian
ces.orig	ces	orig	CoNLL-2006 Czech
ces.umap	ces	umap	CoNLL-2006 + DPU, Czech
dan.orig	dan	orig	CoNLL-2006 Danish
dan.umap	dan	umap	CoNLL-2006 + DPU, Danish
deu.ch	deu	ch	UDT, content-head, German
deu.orig	deu	orig	CoNLL-2006 German
deu.umap	deu	umap	CoNLL-2006 + DPU, German
deu.uni	deu	uni	UDT, German
eng.dep	eng	dep	Penn Treebank, MC heads
eng.umap	eng	umap	Penn Treebank, MC heads + DPU
fin.ch	fin	ch	UDT, content-head, Finnish
fra.ch	fra	ch	UDT, content-head, French
fra.uni	fra	uni	UDT, French
ind.uni	ind	uni	UDT, Indonesian
ita.uni	ita	uni	UDT, Italian
jpn.uni	jpn	uni	UDT, Japanese
kor.uni	kor	uni	UDT, Korean
nld.orig	nld	orig	CoNLL-2006 Dutch
nld.umap	nld	umap	CoNLL-2006 + DPU, Dutch
por.orig	por	orig	CoNLL-2006 Portuguese
por.umap	por	umap	CoNLL-2006 + DPU, Portuguese
por.uni	por	uni	UDT, Portuguese
slv.orig	slv	orig	CoNLL-2006 Slovenian
slv.umap	slv	umap	CoNLL-2006 + DPU, Slovenian
spa.ch	spa	ch	UDT, content-head, Spanish
spa.orig	spa	orig	CoNLL-2006 Spanish
spa.umap	spa	umap	CoNLL-2006 + DPU, Spanish
spa.uni	spa	uni	UDT, Spanish
swe.ch	swe	ch	UDT, content-head, Swedish
swe.orig	swe	orig	CoNLL-2006 Swedish
swe.umap	swe	umap	CoNLL-2006 + DPU, Swedish
swe.uni	swe	uni	UDT, Swedish
tur.orig	tur	orig	CoNLL-2006 Turkish
tur.umap	tur	umap	CoNLL-2006 + DPU, Turkish

The name of a dataset is language-dot-version, for example dan.orig. The function dataset() gives access to a dataset by name:

>>> from selkie.data import dep
>>> dep.dataset('dan.orig')
<Dataset dan.orig>

The function datasets() gives access to sets of datasets. Language or version may be specified:

>>> dep.datasets(lang='dan')
[<Dataset dan.orig>, <Dataset dan.umap>]
>>> len(dep.datasets(version='orig'))
18
>>> len(dep.datasets())
52
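
Since a dataset name is simply language-dot-version, selection by language or version amounts to splitting the name. The sketch below illustrates this on a handful of names from the table; split_name and select are hypothetical helpers, not part of the dep module, which returns Dataset objects rather than strings.

```python
# A few dataset names taken from the table above.
NAMES = ["dan.orig", "dan.umap", "deu.ch", "deu.orig",
         "deu.umap", "deu.uni", "eng.dep", "eng.umap"]

def split_name(name):
    """Split 'dan.orig' into ('dan', 'orig')."""
    lang, version = name.split(".", 1)
    return lang, version

def select(names, lang=None, version=None):
    """Keep the names that match the given language and/or version."""
    selected = []
    for name in names:
        lg, ver = split_name(name)
        if lang is not None and lg != lang:
            continue
        if version is not None and ver != version:
            continue
        selected.append(name)
    return selected

print(select(NAMES, lang="dan"))
print(select(NAMES, version="umap"))
```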

Dataset instances

The class Dataset represents a treebank. There are two specializations, UMappedDataset and FilterDataset. Each dataset has a name, a description, a language represented as an ISO 639-3 code, and a version:

>>> ds = dep.dataset('dan.orig')
>>> ds.name
'dan.orig'
>>> ds.desc
'Danish, CoNLL-2006'
>>> ds.lang
'dan'
>>> ds.version
'orig'

Simple datasets also have a training file pathname, a test file pathname, and (sometimes) a dev file pathname. (To be precise, datasets in the uni and ch collections have a dev file pathname, but orig datasets do not.) The pathnames are also available for umapped datasets, but the files contain the original (unmapped) trees. Filter datasets do not have pathnames:

>>> ds.train[ds.train.find('conll'):]
'conll/2006/danish/ddt/train/danish_ddt_train.conll'
>>> ds.test[ds.test.find('conll'):]
'conll/2006/danish/ddt/test/danish_ddt_test.conll'
>>> ds.dev
>>>

Sentences

A dataset instance has a sents() method that generates sentences for a specified section of the treebank. All treebanks have ‘train’ and ‘test’ sections. In addition, uni and ch datasets have a ‘dev’ section, and the English datasets have ‘dev_train’, ‘dev_test’, and ‘reserve_test’ sections:

>>> sents = list(ds.sents('train'))
>>> len(sents[0])
14

A convenience function called sents() is also available to retrieve the sentences for a particular segment of a dataset directly:

>>> sents = list(dep.sents('dan.orig', 'train'))

A sentence can be viewed as a list of records. Word 0 is always the root pseudo-word. “Real” words start at position 1. The length of the sentence includes the root, so the last valid index is the length minus one:

>>> s = sents[0]
>>> s[0]
<Word 0 *root*>
>>> s[1]
<Word 1 Samme/AN:ROOT (/A.degree=po...) govr=0>
>>> s[13]
<Word 13 ./XP:pnct (/X) govr=1>

The Sentence and Word classes were discussed earlier. Each record is represented by a Word instance, with ten fields: i, form, lemma, cpos, cat, morph, govr, role, pgovr, and prole. The field cpos represents the coarse part of speech, and cat represents the fine part of speech. The fields pgovr and prole represent the word’s governor and role in the projective stemma. They may not be available. The fields govr and role are always available, but they are not guaranteed to be projective.

All fields except i, govr, and pgovr are string-valued. If not available, their value is the empty string. The values for i, govr, and pgovr are integers. If they are not available, their value is None. The fields i and govr are always available, except that word 0 has no govr.

The values for govr and pgovr can be used as an index into the sentence, with the value 0 representing the root.
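
Because governor values are indices into the sentence, one can follow them from any word up to the root. The following sketch works on a plain tuple of governors rather than on a Sentence object; path_to_root is a hypothetical helper, and the governor values are those of the four-word sentence “This is a test.”

```python
def path_to_root(govr, i):
    """Follow governor links from word i up to the root (index 0)."""
    path = [i]
    while path[-1] != 0:
        g = govr[path[-1]]
        if g is None:
            raise ValueError("word %d has no governor" % path[-1])
        path.append(g)
        if len(path) > len(govr):
            raise ValueError("cycle in governor links")
    return path

# Index 0 is the root pseudo-word, which has no governor.
govr = (None, 2, 0, 4, 2)   # This <- is, is <- root, a <- test, test <- is
print(path_to_root(govr, 3))
```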

One can get just a list of word forms (strings) using the method words(). This provides suitable input for a standard parser. The root pseudo-word is not included. The method nwords() returns the number of words excluding the root:

>>> ws = s.words()
>>> ws[:3]
['Samme', 'cifre', ',']
>>> len(ws)
13
>>> s.nwords()
13

Column-major view

A sentence provides separate methods for each of the word attributes, indexed by the word number, with 0 being the root pseudo-word:

>>> s.form(0)
'*root*'
>>> s.form(1)
'Samme'
>>> s.form(13)
'.'

The attributes are as listed above: form, lemma, cpos, cat, morph, govr, role, pgovr, and prole:

>>> s.form(2)
'cifre'
>>> s.lemma(2)
''
>>> s.cpos(2)
'N'
>>> s.cat(2)
'NC'
>>> s.morph(2)
'gender=neuter|number=plur|case=unmarked|def=indef'
>>> s.govr(2)
1
>>> s.role(2)
'nobj'

Word forms need not be ASCII:

>>> from selkie.cld.seal.misc import as_ascii
>>> as_ascii(s.form(12))
'v{e6}rtsnation'

Without as_ascii, the form would print as “værtsnation.”
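
Judging from the output above, the escaped form replaces each non-ASCII character with its hexadecimal code point in braces. The sketch below reproduces that format; it is inferred from the single example shown and is not the source of as_ascii, and escape_non_ascii is a hypothetical name.

```python
def escape_non_ascii(s):
    """Replace each non-ASCII character with {hex code point}."""
    return "".join(c if ord(c) < 128 else "{%x}" % ord(c) for c in s)

# "\u00e6" is the letter ae-ligature in the Danish word for "host nation".
print(escape_non_ascii("v\u00e6rtsnation"))
```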

One can fetch a column as a tuple using the method column():

>>> g = s.column('govr')
>>> g[:5]
(None, 0, 1, 1, 7)

Creating a sentence

If desired, one can create a Sentence as follows:

>>> from selkie.nlp.dep import Sentence, Word
>>> s = Sentence()
>>> s.append(Word(1, 'This', ('PRON', 'PRON'), 'this', '', 2, 'subj'))
>>> s.append(Word(2, 'is', ('VB', 'VB'), 'be', '', 0, 'mv'))
>>> s.append(Word(3, 'a', ('DT', 'DT'), 'a', '', 4, 'det'))
>>> s.append(Word(4, 'test', ('N', 'N'), 'test', '', 2, 'prednom'))

The numbers must be sequential from 1; they provide a quality check.

Dependency files

On disk, the training and test files are in CoNLL dependency format. The sents() method uses selkie.nlp.dep.conll_sents() to read them:

>>> from selkie.nlp.dep import conll_sents
>>> f = conll_sents(ds.train)
>>> s = next(f)
>>> len(s)
14

The file ‘depsent1’ provides an example of the file format:

1    This    this    pron    pron    _       2       subj    2       subj
2    is      is      vb      vb      _       0       mv      0       mv
3    a       a       dt      dt      _       4       det     4       det
4    test    test    n       n       _       2       prednom 2       prednom

Each sentence is (obligatorily) terminated by an empty line. Fields are separated by single tab characters. There are ten fields: id, form, lemma, cpos, fpos, morph, govr, role, pgovr, prole.
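
A reader for the format just described can be sketched in a few lines. This is a simplified illustration and not the code of conll_sents(): the name read_sentences is hypothetical, records are plain dictionaries rather than Word instances, and “_” is assumed to mark a missing value, following the usual CoNLL convention.

```python
FIELDS = ("id", "form", "lemma", "cpos", "fpos",
          "morph", "govr", "role", "pgovr", "prole")
INT_FIELDS = ("id", "govr", "pgovr")

def read_sentences(lines):
    """Yield one sentence (a list of record dicts) per blank-line-terminated block."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # an empty line ends the sentence
            if sent:
                yield sent
                sent = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError("expected %d fields: %r" % (len(FIELDS), line))
        rec = dict(zip(FIELDS, values))
        for key in FIELDS:
            missing = (rec[key] == "_")
            if key in INT_FIELDS:
                rec[key] = None if missing else int(rec[key])
            elif missing:
                rec[key] = ""             # unavailable string fields are empty
        sent.append(rec)
    if sent:                              # tolerate a missing final empty line
        yield sent

SAMPLE = (
    "1\tThis\tthis\tpron\tpron\t_\t2\tsubj\t2\tsubj\n"
    "2\tis\tis\tvb\tvb\t_\t0\tmv\t0\tmv\n"
    "3\ta\ta\tdt\tdt\t_\t4\tdet\t4\tdet\n"
    "4\ttest\ttest\tn\tn\t_\t2\tprednom\t2\tprednom\n"
    "\n"
)
sents = list(read_sentences(SAMPLE.splitlines(keepends=True)))
print(len(sents), [rec["form"] for rec in sents[0]])
```

Unlike a Sentence, the lists produced here do not contain the root pseudo-word at index 0, so a caller that wants to index by governor would need to prepend one.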

Universal POS Tags

The ‘umap’ versions of the treebanks are mapped from the ‘orig’ versions using the tag tables of Petrov, Das & McDonald [3300]. They are instances of UMappedDataset, which uses UMappedDepFile:

>>> ds = dep.dataset('dan.umap')
>>> s = next(ds.sents('train'))
>>> s[1].form
'Samme'
>>> s[1].cat
'ADJ'

BioNLP

The BioNLP dataset contains biomedical texts with annotations.