Files and directories ********************* The selkie.cld.db package provides a database implemented as a directory hierarchy whose files and subdirectories represent database objects. Persistent Objects ------------------ The File class .............. A File represents a persistent object, that is, one whose contents live on disk. Files support multiple users, including a permissions system and file locking and backup. The main complexity lies in synchronizing changes between the object in memory and the disk representation. One does not directly instantiate File, rather, one defines specializations of File. When defining a specialization, the most basic information that is required is how to **write** the contents of the object to a stream, and how to **read** the contents from a stream. The methods __write__ and __read__ implement those operations, and must be provided by the specialization. One also does not directly instantiate the specialization. Rather, a File is obtained from a persistent Directory, either by calling the Directory's **__getitem__** method to obtain an existing object or by calling its **new_child** method to create a new object. (The root object is created by calling **open_database,** discussed below.) Each child has a **name** (a string) that uniquely identifies it relative to the parent directory. When instantiating a child, the Directory determines which class to instantiate by determining the child's **typename.** The environment, which is accessible as the File member **env,** provides a one-one mapping between typenames (which are strings) and classes. The child's typename is determined as follows: * If the child already exists, it is stored with a record of its typename. * The Directory may have a **signature,** which is a table that maps child names to typenames. * If the child name is not listed in the signature, or there is no signature, the Directory may have a value for **childtype,** giving a default typename to use for children. * Otherwise, the user must specify the typename when calling **new_child.** A File's representation on disk is created by the method **__create__.** The new_child method immediately calls __create__. The function **create_database** is used to create root File. It calls open_database to instantiate the root file, and then it calls the file's **create** method (no underscores). The *create* method is only available for root Files. When a File is instantiated, its contents are not immediately loaded. The reason is that one should be able to list the contents of a directory without loading into memory the entire hierarchy, or even all the files in the directory. To break the recursion, Files are initially instantiated as unloaded stubs. The possible **states** that a File may be in are as follows: * **instantiated** - the object has been instantiated, but its contents have not yet been loaded from disk. * **loaded** - the object has been loaded from disk; its content members have been set. * **writing** - we are currently within the scope of a writer. * **modified** - the contents have been modified and a save is required. Saving will return the object to the loaded state. * **deleted** - the object has been deleted. Any attempt to access the content members results in an error. The following methods change the object's state: * **__create__** - This is a private method intended for the sole use of *new_child* and *create.* It creates an empty content file. If any of the ancestor directories are missing, it creates them, as well. Note that __create__ does *not* do permissions checks nor update the index. Those are handled by *new_child,* and unnecessary in the case of the *create* method. * **require_load** - Assures that the contents have been loaded, and that they are only loaded once. The first time require_load is called, it immediately calls __load__, and after that, it is a no-op. * **__load__** - Opens an input stream and calls __read__, which a specialization must implement. This is a private method intended for the sole use of require_load. * **writer** - Any modification to the contents should be done within the scope of a writer, which means, in the body of a *with self.writer()* statement. One need not actually modify the contents, but if one does modify them, one should also call *modified* to mark the File as being modified. When the body of *with self.writer()* is exited, if the File has been marked as modified, its __save__ method is called. * **__save__** - Opens an output stream and calls __write__, which a specialization must implement. This is a private method intended for the sole use of the writer. * **reparent** - Changes the File's parent. That is, the File is moved to a different directory. * **delete** - Deletes the File. The disk representation is deleted and the File is removed from the parent's list of children. The instance is flagged as deleted, so that any subsequent attempt to access its content signals an error. Some of the actions signal an error unless the user has proper permissions. In particular: * **__create__** - The user must have write permission for the File and Directory. * **__load__** - The user must have read permission. * **__save__** - The user must have write permission. * **reparent** - The user must have write permission for the File and for both the source and target Directories. * **delete** - The user must have write permission for the File and Directory. To change the permissions of a File or Directory, the user must have admin permission. Defining a specialization ......................... When defining a specialization of File, it is the specializer's responsibility to manage the contents. To assure that the object behaves as expected, one should adhere to the following discipline: * Contents reside in private members, and the only access to the contents or to any part of them is via the object's **access methods** (like __getitem__) and **update methods** (like __setitem__). * Contents and parts of contents are either immutable, or they are constructed from "virtual view" classes that dispatch to the access and update methods of the File. * Each access method definition begins with a call to require_load, assuring that the content members have been set. * Each update method definition is **protected** by being wrapped in *with self.writer().* The writer immediately calls require_load, so one can count on the content members existing. An update method is not obliged to modify the contents, but if it does, it should call *modified*. * The __read__ method is given an input stream, and it should set the content members. The first time it is called, the input stream will be empty; it should accommodate that case. It should not count on the content members being undefined: it is also used to re-initialize the File after a failed save. * The __init_contents__ method is called by __create__. The default implementation is a no-op, but specializations may override it to initialize the content members. It is an *update* method, meaning that it should either be protected with a call to *with self.writer()* or it should only call protected update methods. Since the writer calls require_load, the __read__ method actually gets called *before* __init_contents__ is called. * The __write__ method is given an output stream as argument. It should write the contents in the form expected by __read__. * One may specify that this File **requires** one or more other Files. Then any time a writer is created for this file, the required files are automatically added to it. (See section XX below.) Metadata ........ In addition to the usual contents, a File may contain **metadata.** For example, any specialization that sets __has_permissions__ to True will have a permissions metadata item. Metadata items are a special form of content, and are treated accordingly. They reside in private members; accessors should call require_load; and updaters should be protected by a writer. A metadata item should be a specialization of Metadata, not of File. Among other things, a metadata item is not independently loadable and cannot be moved or deleted. The File that the metadata item belongs to is called the **host.** There is an implementation issue that is handled by the Metadata class: the *require_load* and *writer* methods do not directly call __load__ and __save__ (which do not exist for Metadata), but rather dispatch to the corresponding methods of the host. A specialization of Metdata should have __read__ and __write__ methods, like a File. When the host's contents are read or written, calls are also placed to the __read__ or __write__ method of all of its metadata items. (There are two constraints, which are due to my laziness. The __write__ method must **not** write the line "##EOM", which is used as a separator between metadata sections. And the last line that is written **must** end in a newline.) To add metadata items, a specialization should set the class member **__metadata__** to be a tuple that extends the parent class's value. Elements are pairs (member, class). For example:: class MyObject (Directory): __metadata__ = Directory.__metadata__ + (('_foo', Foo),) Summary of members and methods of File ...................................... The following provides a summary of the members and methods of File. Some of the following have not yet been introduced, but will be introduced in later sections. Members: * __metadata__ — List of metadata member names. (A class variable.) * __has_permissions__ — Set to True to add permissions to the metadata. * env — The environment. * indexed — The list of typenames that are to be indexed. Discussed in section XX. Instantiation: * create_root_env() — Create the environment for the root File. Discussed in section XX. * create_env() — Create the environment for a non-root File that hosts an index. Discussed in section XX. Creation: * __create__() — Creates the disk representation. * create() — Create the disk representation for a root File. Signals an error if called for a File that is not root. Reading: * require_load() — Calls __load__, and assures that it is called only once. * __load__() — Load the contents from disk. Opens an input stream and calls __read__. * __read__(f) — Read the contents from an input stream. To be provided by the specialization. Writing: * writer() — Returns a Writer. Should be called in a with-statement. The File will be saved when the body of the 'with' exits, provided that it is modified. * modified() — Marks the File as modified. * __save__() — Save the contents to disk. Opens an output stream and calls __write__. * __write__(f) — Write the contents to an output stream. To be provided by the specialization. Hierarchy modification: * reparent(par,i) — Par is the new parent. The argument *i* is optional; it indicates the position among the new parent's children where this File is to be inserted. * delete() — Delete the File. The Directory class ................... A Directory behaves like a dict that maps names to child Files. I have already mentioned that the method __getitem__ accesses an existing child, and new_child adds a new child to the Directory. Directory is a specialization of File, so one Directory may be a sub-Directory of another. The absolute location of a File can be given as a **path,** which is a sequence of names. The method **follow** takes a path as argument. If the path contains only one name, it simply calls __getitem__ with that name. Otherwise, it calls __getitem__ with the first name to get a subdirectory, and passes the remaining names to the subdirectory's follow method. We have already encountered the methods that modify the directory hierarchy: * The parent directory's **new_child** method is used to create a file. * The file's own **reparent** method is used to move a file to a different location. * The file's **delete** method is used to delete the file and all its descendants. The set of children represents the Directory's primary contents. The children are read by the __read__ method and written by the __write__ method, and they are stored in the private member **_children.** Hence, specializations do not need to define __read__ and __write__. Summary of members and methods of Directory: * signature — A table that maps child names to typenames. * childtype — The default typename for children. * __getitem__(name) — Get the child with the given name. * new_child(name,suffix,cls,i) — Create a new child. All parameters are optional keyword arguments. * follow(path) — Find a particular descendant given a sequence of names. Database and Environment ........................ The module selkie.cld.db.core contains two functions for opening a database. An existing database is opened by calling **open_database**, and a new database is created by calling **create_database.** A database is really just a root File, that is, a File whose parent is None. The main thing that sets a database apart from any other File is that it creates an Environment for itself and its descendants. A root File also automatically includes a GroupsFile metadata item for use of the permissions system. The Environment is a dict-like object containing global information. It contains a pointer to the database, under the key 'root', and it contains the tables that map between typenames and classes. If there is an index, it also contains a pointer to the index. (Environments are discussed in more detail later.) Files other than the root are allowed to create an index, by having a list of typenames in the class variable __indexed__. A File with an index creates a fresh copy of its parent's environment, and sets 'index_root' to itself. If a File is neither a root nor indexed, it simply uses its parent's environment unmodified. The motivation for having indices is as follows. Organizing Files into a hierarchy, rather than using the usual relational representation, has the advantage that entire subhierarchies can be moved or deleted as a unit. It has the disadvantage that one requires a path, and not just a name, to find an object. Indices make it possible to access Files by **identifier** instead of path, where an identifier is a pair (typename, name). A simple identifier suffices for Files that are indexed in the root directory. When there are multiple indices, we must also include the index name in identifiers. A **global identifier** is a triple (indexname, typename, name). When writing specializations of Directory, one may also write specializations of Environment. To link the two, define the File method **create_env**. The default implementation instantiates and returns Environment, provided that the File is either the root or indexed. Otherwise, it returns None, which indicates that the parent environment should be used. To delete a database that one has opened, call its *delete* method. A function **delete_database** is also available that takes just a filename. Implementation issues --------------------- Filename ........ A Directory is implemented as a disk directory containing a distinguished file called *_children,* which contains the actual contents. The distinction between the disk directory and the file *_children* is hidden in the implementation. The distinction resides primarily in associating two different disk filenames with a Directory object. The method **_filename** returns the absolute pathname of the disk representation for the sake of relocating or deleting the File. In the case of a Directory, it returns the filename of the disk directory. The method **_contents_filename** returns the absolute pathname of the disk representation for the sake of loading and saving the File. In the case of a Directory, it returns the pathname of the *_children* file. In the case of a File that is not a Directory, _filename and _contents_filename are the same. **Relocatability of Files.** Because a File may be moved, we would like to minimize the number of things that need to be updated if a File's location in the hierarchy changes. There are three aspects to the issue: a File should be self-contained, a File should not cache context-dependent information, and external information that needs updating when the File moves or is deleted should be minimized. **Self-containedness.** All pieces of the File should be containined within the physical file that represents it. To give an example: we wish to be able to identify a file's type by inspection, and one way of doing that would be to place type information in the parent's list of children. However, doing so would reduce self-containedness: the type information would reside outside the child file and would need to be updated if the child moved. Instead, in the current implementation, the filename on disk combines the File's name and typename. This consideration also motivates the current implementation of metadata (including permissions), in which metadata is represented on disk as a section of the same file that contains the File contents. A less acceptable approach would be to store permission information in a sibling file. **No context-dependent cached information.** To give an example, it would be natural to cache a File's physical disk location, but that information would need to be updated if the File is moved. In the current implementation, the pathname is always computed rather than cached. One unavoidable form of context caching is the env member, which contains global information. The current approach is to insist that env be essentially immutable, and hence to restrict a File from being attached to a new parent whose value from env differs from the File's. **Minimizing location-dependent external records.** If the File has an indexed type, then information about the location of a file is stored in an index. That is hardly avoidable, and must be updated if the File is relocated. In the current implementation, that is the main case of external information that must updated, apart from the obvious modifications to the new and old parents. There are a couple of additional dependencies that arise in CLD. * The *user*.media PropDict maps suffixed names to text IDs. If the text is deleted, the PropDict needs to be updated. * A Lexicon contains lists of **references** to locations where tokens of a given lemma occur. If a TokenFile is deleted, references to it need to be deleted as well. To manage updates to external records, there are three methods of File that can be used to notify resources of relevant events: * **Created.** A file is created by the Directory's **new_child** method or by its own **create** method (in the case of a root File). A newly created file receives a *created* call, which creates an entry in the index, if appropriate, and may be wrapped by specializations. * **Moved.** A file is relocated by its own **reparent** method. The file and all its descendants receive a *moved* call, which updates the index entries and may be wrapped by specializations. * **Deleted.** A file is deleted by its own **delete** method. The file and all its descendants receive a *deleted* call, which deletes the index entries and may be wrapped by specializations. When an entire directory is moved or deleted, the *reparent* or *delete* method walks the subhierarchy and notifies each descendant of the change. **File format.** In the current implementation, metadata is stored in the same disk file as the File contents. For this reason, all Files are opened as text files using UTF-8 character encoding. Binary files such as audio and video are stored separately; they cannot directly provide the contents of a File. The input stream passed to __read__ is not a genuine stream, but rather a "pseudo-stream" (of type MetadataInputStream) that is merely an iterator over lines of text. One MetadataInputStream is created for each section of the file metadata header. **Initialization.** When a File is initialized, the Environment and Metadata items are instantiated. To keep things from becoming an unmanageable snarl, the sequence is strictly as follows: * File.__init__ is called first. It is always responsible for instantiating Environment and Metadata. One should never create the Environment first. * Metadata.__init__ stores the host's *env* in its own *env* member. For that reason, the Environment must be instantiated before instantiating the Metadata items. * However, Metadata.__init__ should not assume that all environment variables have been set. Some depend on other metadata items. The exact sequence is as follows: * Initialize members. * The arguments to File.__init__ are the parent directory, the name, and typename. If the File is the root file, all three must be None. * The members _parent, _name_, and _suffix are set from the arguments. The members _perm and _writer are set to None. The members _loaded and _modified are set to False. The member _metaitems is set to an empty list and _npermitems is set to 0. * Create the _metaitems list. * If __has_permissions__ is non-null or the File is the root, a Permissions item is put first on the list and _npermitems is advanced. * If the File is the root, a GroupsFile is added to the list and _npermitems is advanced. * If *indexed* is non-null, an Index is added to the list. * Any items on __metadata__ are added to the list. * Set up env. * The method create_env is called and the value is stored in *env.* If the value is None, parent.env is used. * If the File is a the root, the env keys 'root', 'log', and 'disk' are set; 'username' is set to '_root_' unless it already has a value. * If *indexed* is non-null, 'index_root' is made to point back to the File. * If there is no Permissions object on the _metaitems list, then _perm is set to an InheritedPermissions instance. * Each of the _metaitems is instantiated. One uses **open_database** to instantiate the root File. It takes the class of the root File and a filename as arguments. It sets the *env* variable 'filename' to the filename and updates *env* with any additional keyword arguments. If the disk representation does not already exist, one should use **create_database** instead of open_database. A specialization of Environment is instantiated within create_env, and keys are subsequently added to it. The __init__ method should not expect all keys to be present. The File initializer adds some keys, and open_database adds some more. When Environment is instantiated, the File already exists; it is called the *host.* Environment.__init__ accesses the host's *types* variable and *default_types* variable to set up type tables. An import paradox arises in setting the default_types in selkie.cld.db.env. The *env* module is imported by *file,* which is imported by *dir.* But the classes needed to set default_types reside in *file* and *dir.* The solution is to set selkie.cld.db.env.default_types in the __init__.py file. Checking permissions .................... Read checks are performed when one does require_load() - specifically, in File.__load__, just before calling the __read__ methods of the metadata items and the File itself. Write checks are performed when one does "with self.writer()" - specifically, in the Writer.lock_all method, which is called when the writer is entered, or when a File is added to an active Writer. The write check is done just before locking the file, which is the first step in writing it. Permissions are stored in the File member **_perm.** A File may have independent Permissions, or it may have InheritedPermissions that always just defer to the parent. The class variable __has_permissions__ determines whether it has independent permissions or not. If it does have independent permissions, '_perm' is added at the beginning of the __metadata__ list (in __init__). **Permission-system items** are metadata items of class Permissions or GroupsFile. They require special treatment. (1) Permission-system items require admin checks for writing. Those checks are placed in the *writer* method of PermItem (of which both Permissions and GroupsFile are specializations); the *writer* method is called by all methods that modify contents. One might expect the check to be placed in the *__write__* method, but that would be incorrect. The Permissions metadata item gets written any time one writes its host, and to do that, one only needs write permission, not admin permission. (2) Anyone should be allowed to examine permissions, even if they do not have read access to the Files that host the permission-system items. It is not enough to postpone the read check till after reading the permission-system items: one should not signal an error at all for reading permission-system items, only for attempting to read the File contents or other metadata files. To deal with that issue, permission-system items have a require_load method that does not simply dispatch to their host. Details ------- New child ......... The new_child method is the standard way to create a new file. There is also a method called **need_child** that returns an existing child, if available, and otherwise dispatches to new_child. (The child name is obligatory for need_child, but not for new_child.) The new_child method is called with some subset of the keyword arguments name, suffix, and cls. Its behavior is influenced by two class variables: **signature** and **childtype**. The signature maps child names to classes, and childtype provides a default class. Both are optional. There are four cases: * Neither class nor suffix are provided. If name is provided and has an entry in signature, the corresponding class is used. Otherwise, the childtype is used if available. Otherwise, an error is signalled. The suffix is determined from the class using env['suffixes']. If not found, an error is signalled. * Suffix is provided but not class. The table env['types'] is used to get the class. If there is no entry, an error is signalled. * Class is provided but not suffix. The table env['suffixes'] is used to get the suffix. If there is no entry, an error is signalled. * Both class and suffix are provided. A check is done to confirm that env['types'] maps the suffix to the given class, otherwise an error is signalled. If the name is not provided, a call is placed to the allocate_name method of the index table. If there is no index table, an error is signalled. New_child then does the following: * A permissions check is done; the user must have write permission for the directory. * The class is instantiated to obtain the child. * The parent's attach_child method is called. New_child accepts keyword argument i, which is passed on to attach_child. * The child's created method is called. If there is an index table, an entry is created for the child. The child has no children at this point, so recursion is not an issue. * The child's __create__ method is called, creating the disk-file. If the child is a regular file, an empty disk-file is created. If the child is a directory, an empty directory is created, then the metadata files are created. Specializations may wrap __create__ to create obligatory children by recursive calls to new_child. Writer ...... The File method *writer* dispatches to the function *writer* of selkie.cld.db.disk. The function accepts any number of Files as arguments. It tosses out any that are already protected, a File being protected if it has a value for *_writer.* If any unprotected Files remain, a Writer instance is created and each unprotected File is added to it. The Writer instance is returned, or a dummy writer, if there are no unprotected Files on the list. A Writer instance maintains a list of *files.* Adding a file consists of the following steps. It is possible to add additional Files to an existing Writer, so a check is first done to see if the File is protected; if so, no further action is taken. Otherwise, the File is appended to *files,* its *_writer* member is set to the Writer, and each of the Files that it requires is added to the Writer. Finally, if the Writer is already active, any new unlocked Files are locked. A Writer should be created in the context of a *with*-statement. Accordingly, it expects a call to __enter__, and later a call to __exit__. The file becomes **active** when __enter__ is called, and becomes inactive again when __exit__ is called. When a Writer becomes active, all Files on the *files* list are locked. (Any Files added while the Writer is active are locked when they are added.) Locking a File consists of the following steps. The Writer has a *locks* member that is None when the Writer is inactive, and a list when the Writer is active. To lock a File, a check is first done that it is writable, then its require_load method is called, and finally a Lock object is created for it and placed on the *locks* list. The Lock object locks the file on disk. While the Writer is active, each of the the locked files is said to be **under the control of the writer.** On __exit__, each of the files is saved. (Unless an error has occurred, in which case each of the files is reloaded.) The method __save__ opens a temporary file for writing, and passes the resulting output stream to __write__. If writing completes without error, the current version of the file becomes a backup and the temporary file is moved into its place. After saving, each of the locks is released. If an error occurs at any point - whether the error arises in the write-permission check, or in the body of the *with* statement, or during __write__ - then all of the files are restored to their saved state by calling *_reload,* which calls the __read__ method to restore the content members. For some file classes, every time a file of the class is placed under the control of a writer, there is a related file that should be put under the control of the same writer. For example, when editing a TokenFile, one also needs the associated Lexicon to be editable, so that one can intern new word forms. There is a File method **requires** that can be overridden to provide an iteration over the required files. (The default implementation returns the empty list.) When the File is added to the Writer, its *requires* method is called, and each File that is returned is also added to the Writer. Reparenting ........... Reparenting consists of the following steps: * Check that the user has write permission for both the old parent and the new parent. * Call the old parent's detach_child method to remove it from the list of children. * Move the child's file or directory on disk. * Change the child's internal _parent member. * Call the new parent's attach_child method to add it to the list of children. One may optionally specify the position with keyword argument i. * Call the child's moved method with old and new relpaths. That updates the cached relpath and, if the child is indexed, notifies the index of the old and new relpaths. It recurses to the child's descendants. Deletion ........ The delete method does the following: * Check that the user has write permission for the directory. * Call the parent's detach_child method to remove the child from its list of children. * Call the child's deleted method. If the object is indexed, this deletes it from the index. It also recurses to the child's descendants. * Finally, delete the disk-file or -directory. If it is a directory, the deletion is recursive. If it is a regular file, any file with the same suffixed name plus an added suffix (such as a backup) is also deleted. Reorder children ................ The Directory method reorder_children takes a list of child indices, and a target index. The *target child* is the child at the target index, or end-of-list. The children indicated by the indices are removed from the child list, then reinserted as a group, in the order given, just before the target child. This only involves the _children list; it does not call reparent. Reparent children ................. The Directory method reparent_children takes the same arguments: a list of child indices and a target index. The target index identifies the target child, which must be a real child and not end-of-list. The target child's **attachment_target** method is called to determine the new parent. The default implementation of attachment_target returns the child itself, but classes may override it. (The method need_child is useful here.) The new parent's attachment_target method must return the new parent itself; otherwise an error is signalled. To give a concrete example from CLD, a Text is initially created as a stub. Conceptually, one text attaches to another, but to be precise, a complex Text contains a Toc, and the child Text attaches to the Toc. Nonetheless, we may provide a Text as the target of reparent. The target Text's attachment_target method returns the Toc, creating it if necessary. The children indicated by the indices are processed in the order given. For each, child.reparent(newparent) is called. If the child originally preceded the new parent, it is attached at the beginning of the new parent's children, and if it followed the new parent, it is attached at the end. If multiple children are attached at beginning or end, they occur in the order their indices are listed. Delete children ............... The Directory method delete_children is called with a list of indices. The children at those indices are deleted by calling their delete method. A method **delete_child** is also provided as a convenience. It takes a single index. Example ------- A CLD corpus provides an example of a Database. First, let us create a CLDManager: >>> from selkie.cld.toplevel import CLDManager >>> mgr = CLDManager('/tmp/foo.cld') A string argument is interpreted as the application filename, which for CLD is the corpus. The corpus does not yet exist; we can create a corpus for testing using the 'create_test' command: >>> mgr('create_test') Create corpus '/tmp/foo.cld' ... A lot of output is generated as each file is created. Incidentally, *create_test* also creates a file called *media* in the same directory as the corpus (namely, /tmp). Now we can load the corpus: >>> corpus = mgr.corpus() >>> corpus A Corpus object is a specialization of Database, which is a specialization of Directory. It behaves like a dict in which the keys are names of children: >>> corpus['langs'] Like a dict, its length is the number of keys, and converting it to a list returns a list of keys: >>> len(corpus) 4 >>> list(corpus) ['langs', 'users', 'roms', 'glab'] In this particular case, all four children are subdirectories. If one recurses far enough, one reaches a file: >>> orig = corpus['langs']['deu']['texts']['1']['toc']['2']['orig'] >>> orig One can distinguish a file from a directory using the method is_directory(): >>> corpus.is_directory() True >>> orig.is_directory() False Instead of repeatedly accessing members, one may use the follow() method, which takes a pathname: >>> corpus.follow('/langs/deu/texts/1/toc/2/orig') Files may also behave like dicts or lists, depending on their contents. In this particular case, a TokenFile behaves like a list of TokenBlocks (essentially, tokenized sentences): >>> list(orig) [, ] One may also go up the hierarchy using the method parent(): >>> orig.parent()