disco-dop

HomePage: https://github.com/andreasvc/disco-dop/

Author: Andreas van Cranenburgh

Download: https://pypi.python.org/packages/source/d/disco-dop/disco-dop-0.4.tar.gz

=================
Discontinuous DOP
=================

.. image:: http://staff.science.uva.nl/~acranenb/disco-dop.png
   :align: right
   :alt: contrived discontinuous constituent for expository purposes.

The aim of this project is to parse discontinuous constituents in natural
language using Data-Oriented Parsing (DOP), with a focus on global world
domination. The grammar is extracted from a treebank of sentences annotated
with (discontinuous) phrase-structure trees. Concretely, this project provides
a statistical constituency parser with support for discontinuous constituents
and Data-Oriented Parsing. Discontinuous constituents are supported through the
grammar formalism Linear Context-Free Rewriting System (LCFRS), which is a
generalization of Probabilistic Context-Free Grammar (PCFG). Data-Oriented
Parsing allows re-use of arbitrary-sized fragments from previously seen
sentences using Tree-Substitution Grammar (TSG).

.. contents:: Contents of this README:
   :local:

Features
========
General statistical parsing:

- grammar formalisms: PCFG, PLCFRS
- extract treebank grammar: trees decomposed into productions, relative
  frequencies as probabilities
- exact *k*-best list of derivations
- coarse-to-fine pruning: posterior pruning (PCFG only),
  *k*-best coarse-to-fine
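
As an illustration of the treebank grammar extraction listed above, here is a minimal sketch in plain Python. It does not use the disco-dop API: the nested-tuple tree encoding and the function names ``productions`` and ``treebank_grammar`` are assumptions made for this example, and it covers only the continuous (PCFG) case, not LCFRS.

```python
from collections import Counter, defaultdict
from fractions import Fraction


# Assumed encoding: a tree is a tuple (label, child, ...); a leaf is a string.
def productions(tree):
    """Yield one (lhs, rhs) production for each internal node of the tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def treebank_grammar(trees):
    """Map each production to its relative frequency given its left-hand side."""
    counts = Counter(p for tree in trees for p in productions(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): Fraction(n, lhs_totals[lhs])
            for (lhs, rhs), n in counts.items()}


# Toy treebank, invented for illustration only.
toy_treebank = [
    ("S", ("NP", "Mary"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "John"), ("VP", ("V", "sees"), ("NP", "Mary"))),
]
grammar = treebank_grammar(toy_treebank)
```

In this toy treebank, ``NP -> Mary`` occurs in two of the three ``NP`` productions, so it receives probability 2/3.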

DOP specific (parsing with tree fragments):

- implementations: Goodman's DOP reduction, Double-DOP.
- estimators: relative frequency estimate (RFE), equal weights estimate (EWE).
- objective functions: most probable parse (MPP),
  most probable derivation (MPD), most probable shortest derivation (MPSD),
  most likely tree with shortest derivation (SL-DOP).
- marginalization: n-best derivations, sampled derivations.
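
The difference between the MPD and MPP objectives can be sketched as follows. In DOP several derivations may yield the same tree, so the MPP is approximated by summing the probabilities of the derivations in a list that share a tree. This is a simplified illustration, not the disco-dop implementation: the function name ``most_probable_parse``, the string encoding of trees, and the probabilities are assumptions made for the example.

```python
from collections import defaultdict


def most_probable_parse(derivations):
    """Sum derivation probabilities per tree; return the best tree and its sum.

    derivations: iterable of (tree, probability) pairs, where several
    derivations may yield the same tree.
    """
    tree_prob = defaultdict(float)
    for tree, prob in derivations:
        tree_prob[tree] += prob
    best = max(tree_prob, key=tree_prob.get)
    return best, tree_prob[best]


# Hypothetical n-best list of derivations with invented probabilities.
nbest = [
    ("(S (NP a) (VP b))", 0.30),  # the single most probable derivation
    ("(S (X a) (Y b))", 0.25),
    ("(S (X a) (Y b))", 0.20),
]
mpp_tree, mpp_prob = most_probable_parse(nbest)
mpd_tree = max(nbest, key=lambda pair: pair[1])[0]
```

Here the MPD selects the first tree, while the MPP selects the second, because its two derivations together carry more probability mass.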

Installation
============

Requirements:

- Python 2.7+/3   http://www.python.org (need headers, e.g. python-dev package)
- Cython 0.18+    http://www.cython.org
- GCC             http://gcc.gnu.org/
- Numpy 1.5+      http://numpy.org/

For example, to install these dependencies and the latest stable release on
an `Ubuntu <http://www.ubuntu.com>`_ system
using `pip <http://www.pip-installer.org>`_,
issue the following commands::

    sudo apt-get install build-essential python-dev python-numpy python-pip
    pip install --user Cython
    pip install --user disco-dop

To compile the latest development version on Ubuntu,
run the following sequence of commands::

    sudo apt-get install build-essential python-dev python-numpy python-pip git
    pip install cython --user
    git clone --depth 1 https://github.com/andreasvc/disco-dop.git
    cd disco-dop
    python setup.py install --user

(The ``--user`` option installs the packages to your home directory, which
does not require root privileges.)

If you do not run Linux, it is possible to run the code inside a virtual machine.
To do that, install `Virtualbox <https://www.virtualbox.org/wiki/Downloads>`_
and `Vagrant <http://docs.vagrantup.com/v2/installation/>`_,
and copy ``Vagrantfile`` from this repository to a new directory. Open a
command prompt (terminal) in this directory, and run the command
``vagrant up``. The virtual machine will boot and run a script to install the
above prerequisites automatically. The command ``vagrant ssh`` can then be used
to log in to the virtual machine (use ``vagrant halt`` to stop the virtual
machine).

Compilation requires the GCC compiler. To port the code to another compiler such
as Visual C, replace the compiler intrinsics in ``macros.h``, ``bit.pyx``, and
``bit.pxd`` with their equivalents for the compiler in question. This mainly
concerns operations to scan for bits in integers, for which these compiler
intrinsics provide the most efficient implementation on a given processor.
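
To make clear what a bit-scan operation computes, here is a portable reference sketch in Python. GCC offers built-ins such as ``__builtin_ctzl`` for this kind of operation. Which intrinsics the project actually uses is not stated here, so the function name ``next_set_bit`` and its exact semantics are assumptions for illustration. A port to another compiler would need equivalents that return the same results.

```python
def next_set_bit(vec, pos):
    """Return the index of the lowest set bit of vec at or above pos.

    Returns -1 if no such bit exists. vec is a non-negative integer
    interpreted as a bit vector, with bit 0 the least significant bit.
    """
    shifted = vec >> pos
    if shifted == 0:
        return -1
    lowest = shifted & -shifted  # isolate the lowest set bit
    return pos + lowest.bit_length() - 1


# Bits 3 and 5 are set in this example vector.
example = 0b101000
first = next_set_bit(example, 0)
second = next_set_bit(example, first + 1)
```

A compiler intrinsic typically performs this scan in a single machine instruction, which is why the portable version is slower but useful as a correctness reference.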

Usage
=====

Parser
------
To run an end-to-end experiment from grammar extraction to evaluation on a test
set, make a copy of the file ``sample.prm`` and edit its parameters.
The experiment can then be run with these parameters by executing::

    discodop runexp filename.prm

This will create a new directory with the base name of the parameter file, i.e.,
``filename/`` i