CNP - A ChiNese dependency Parser

Introduction

CNP is a highly accurate dependency parser for Chinese. This package includes:

Modifications of the MSTParser (http://sourceforge.net/projects/mstparser), such as:

support for Carreras et al. (2007)'s higher-order decoding
support for subtree features described in Chen et al. (2009)

In particular, CNP offers:

High accuracy, achieved with features based on subtrees extracted from auto-parsed data.
Support for labeled dependency parsing.

Changes
Download
Requirement and Compiling
Usage
License
Negligence Clause

Changes

Version 1: Initial release
Version 1.1: Experimental implementation of K-best output. [Release of this version is temporarily suspended because a bug was found (2012/3/30).]

Download

CNP_ALAGIN_V1.tar.gz (2.1MB): HTTP
CNP_ALAGIN_V1.1.tar.gz: HTTP [release of this version is temporarily suspended because a bug was found]

Requirement and Compiling

Requirements:

OS: Linux (tested on CentOS 5.2 or higher)
Memory: 7 GB or more
Java version 1.6.0_07 or higher

This package does not include the trained models needed to run the parser.
To build your own models (word segmenter, POS tagger, and parsing model), you also need:

g++ (tested with 4.5.0 or higher)
CRF++ 0.51 or higher (http://crfpp.sourceforge.net)
Perl 5.8.8 or higher
Penn2Malt (http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html)
Penn Chinese Treebank (CTB) 4.0 (LDC2004T05), 5.0 (LDC2005T01), or 6.0 (LDC2007T36)
Chinese Gigaword Corpus (LDC2009T14). CTB and the Chinese Gigaword Corpus are distributed by the Linguistic Data Consortium (LDC) (http://www.ldc.upenn.edu/) to LDC members.

See README.CNP.NewModel for details.

Compiling:
(Note that you may need to change the path to javac to match your environment.)
First, run the following command:
$ ./make.sh

Next, in the SegPOSCRF/Model/seg directory, compile charMake.cc as follows:
$ g++ -O2 -o charMake charMake.cc

Usage

Data format:

The input is in the CoNLL 2007 format. An example follows:
1 他 他 PN PN - 2 SUB - -
2 是 是 VC VC - 0 ROOT - -
3 一 一 CD CD - 4 AMOD - -
4 名 名 M M - 5 NMOD - -
5 学生 学生 NN NN - 2 PRD - -
6 。 。 PU PU - 2 P - -

(data for the next sentence)
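
As a rough illustration of how this layout can be read, the following Python sketch groups rows into sentences and checks that every head index points at a token in the same sentence (or at 0 for the root). It is not part of the CNP package: the names Token, read_sentences, and check_heads are hypothetical, and the code assumes whitespace-separated columns with a blank line between sentences.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    index: int     # column 1: token ID within the sentence
    form: str      # column 2: word form
    cpostag: str   # column 4: coarse POS tag
    head: int      # column 7: ID of the head token (0 = root)
    deprel: str    # column 8: dependency label


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL rows into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns: {raw!r}")
        current.append(
            Token(int(fields[0]), fields[1], fields[3], int(fields[6]), fields[7])
        )
    if current:
        sentences.append(current)
    return sentences


def check_heads(sentence: List[Token]) -> bool:
    """Return True if every head is 0 or the ID of a token in the sentence."""
    ids = {tok.index for tok in sentence}
    return all(tok.head == 0 or tok.head in ids for tok in sentence)
```

Because only the first eight columns are read, the same sketch also accepts the 8-column layout used for testing.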

When testing (i.e., parsing new sentences), set the 7th column (HEAD) to "0" and the 8th column (DEPREL) to "NULL", as follows:
1 他 _ PN _ _ 0 NULL
2 是 _ VC _ _ 0 NULL
3 一 _ CD _ _ 0 NULL
4 名 _ M _ _ 0 NULL
5 学生 _ NN _ _ 0 NULL
6 。 _ PU _ _ 0 NULL

More examples can be found under cntestbed/
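
If your segmenter and POS tagger produce (word, tag) pairs, rows in the test layout above could be generated along the following lines. This is a minimal sketch, not a tool shipped with CNP: the function names are hypothetical, and the tab separator is an assumption based on the usual CoNLL convention, so compare the result with the files under cntestbed/ before relying on it.

```python
from typing import Iterable, List, Sequence, Tuple

TaggedSentence = Sequence[Tuple[str, str]]


def to_test_rows(tagged: TaggedSentence) -> List[str]:
    """Format (word, tag) pairs as 8-column rows with HEAD=0 and DEPREL=NULL."""
    rows = []
    for index, (word, tag) in enumerate(tagged, start=1):
        # Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL
        rows.append("\t".join([str(index), word, "_", tag, "_", "_", "0", "NULL"]))
    return rows


def write_test_file(sentences: Iterable[TaggedSentence], path: str,
                    encoding: str = "utf-8") -> None:
    """Write sentences to path, separated by blank lines."""
    with open(path, "w", encoding=encoding) as out:
        for tagged in sentences:
            out.write("\n".join(to_test_rows(tagged)))
            out.write("\n\n")
```

For example, to_test_rows([("他", "PN"), ("是", "VC")]) yields two rows whose fields match the first two lines of the example above.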

Testing the trained model

(Note that you may need to change the path to java to match your environment.)

Usage:
$ ./testCNP.sh [model_name] [test_file] [output_file] [encoding] [stMark]

Parameters:

model_name: the filename of a trained model
test_file: the file containing sentences to parse in the CoNLL 2007 format
output_file: the file to which the parsing output for test_file is written
encoding: the character encoding of the data; GBK and UTF8 are currently supported
stMark: indicates which subtrees file to use

Example:
$ ./testCNP.sh ../CNP_MODEL_ALAGIN_V1.0_NEW/MSTModels/Model2.AutoPOS.GOLDSEG.CTB4.STLabel.GIGACRFFreq4 cntestbed/test.inputUTF8 cntestbed/test.out UTF8 GIGACRFFreq4
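
When the parser is called from another program, it may help to assemble the argument list in one place and reject unsupported encodings before starting Java. The sketch below is one way to do that from Python. The helper names are hypothetical, and it only wraps the command line documented above; it assumes testCNP.sh is run from the directory that contains it.

```python
import subprocess
from typing import List

# The two encodings listed in the parameter description above.
SUPPORTED_ENCODINGS = ("GBK", "UTF8")


def build_test_command(model_name: str, test_file: str, output_file: str,
                       encoding: str, st_mark: str,
                       script: str = "./testCNP.sh") -> List[str]:
    """Return the argument list for testCNP.sh in the documented order."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(
            f"unsupported encoding {encoding!r}; expected one of {SUPPORTED_ENCODINGS}"
        )
    return [script, model_name, test_file, output_file, encoding, st_mark]


def run_parser(model_name: str, test_file: str, output_file: str,
               encoding: str, st_mark: str) -> None:
    """Run the parser and raise CalledProcessError if the script fails."""
    command = build_test_command(model_name, test_file, output_file,
                                 encoding, st_mark)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a model path contains spaces.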

Notes:
If you want to parse raw sentences, first process them with a word segmenter and a POS tagger, and convert the results into the CoNLL 2007 format. If the models were trained on CTB data, the word segmentation and POS tags should follow the CTB guidelines, so use tools that output CTB POS tags.
To be self-contained, this package includes a simple word segmenter and POS tagger based on CRF++. They are in the "SegPosCRF/Model" directory (i.e., "segmenter.sh", "postagger.sh", and other helper programs). To run these scripts, put a word segmentation model (a trained CRF++ model) into SegPosCRF/Model/seg and a POS tagging model (a trained CRF++ model) into SegPOSCRF/Model/pos. These models can be built from LDC resources (see README.CNP.NewModel for instructions).

Then, you can use segmenter.sh as follows:

$ ./segmenter.sh [input] [output] [mask]

Parameters:

input: a file containing one sentence per line (in GBK encoding)
output: the file to which the segmented sentences are written
mask: a prefix for the temporary files generated in the "test" directory; any string will do

Then, run the POS tagger as follows:

$ ./postagger.sh [output of segmenter.sh] [output] [mask]

The output file can be used as input to the CNP system.
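
Since segmenter.sh expects GBK input with one sentence per line, text held in another encoding has to be converted first. The following Python sketch writes such an input file and lists the two preprocessing commands in the order described above. The function names are hypothetical, and reusing one mask string for both scripts is only an assumption made for brevity.

```python
from typing import Iterable, List


def write_segmenter_input(sentences: Iterable[str], path: str) -> None:
    """Write one sentence per line, encoded as GBK."""
    with open(path, "w", encoding="gbk", newline="\n") as out:
        for sentence in sentences:
            out.write(sentence.strip() + "\n")


def preprocessing_commands(raw_file: str, seg_file: str, pos_file: str,
                           mask: str) -> List[List[str]]:
    """Return the segmenter and POS tagger commands in execution order."""
    return [
        ["./segmenter.sh", raw_file, seg_file, mask],
        ["./postagger.sh", seg_file, pos_file, mask],
    ]
```

Each command list can be passed to subprocess.run(..., check=True) from the directory that contains the scripts. Characters that GBK cannot represent raise UnicodeEncodeError when the file is written, which surfaces encoding problems before the segmenter runs.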

The above segmenter and POS tagger can be used for both research and commercial purposes because they rely on CRF++, which can be used under the BSD license.
Alternatively, you may try tools from other researchers, such as the following (be careful about the terms of use of these tools):

Word Segmentation tool: Stanford Chinese Word Segmenter and BaseSeg.
POS tagger: BasePoS

Build your own CNP

To build your own parser, follow the steps in README.CNP.NewModel.

License

As the original MSTParser is licensed under the Common Public License Version 1.0 (or later), we also distribute our materials (i.e., modifications and additions to the original) under the same license. The file "LICENSE" is the original license file and "LICENSE.CNP" is the license file for our materials.

Negligence Clause

We do not guarantee the reliability of this tool's output and accept no responsibility for any loss or damage caused by its use. See LICENSE.CNP for details.

Reference

Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. Improving Dependency Parsing with Subtrees from Auto-Parsed Data. Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2-7, 2009.

Copyright (C) 2008-2011 Information Analysis Laboratory (previously Language Infrastructure Group)
National Institute of Information and Communications Technology
