CNP - A ChiNese dependency Parser
Introduction
CNP is a highly accurate dependency parser for Chinese. This package includes:
- Modifications of the MSTParser (http://sourceforge.net/projects/mstparser), such as:
  - support for Carreras et al. (2007)'s higher-order decoding
  - support for subtree features described in Chen et al. (2009)
In particular, CNP has the following features:
- High accuracy, owing to features based on subtrees extracted from auto-parsed data.
- Support for labeled dependency parsing.
Versions and downloads:
- Version 1: Initial release
- Version 1.1: Experimental implementation of K-best output [Release temporarily suspended because we found a bug; please wait for a fixed version (2012/3/30)]
- CNP_ALAGIN_V1.tar.gz (2.1MB): HTTP
- CNP_ALAGIN_V1.1.tar.gz: HTTP [release temporarily suspended because we found a bug]
Requirements:
- OS: Linux (tested on CentOS 5.2 or higher)
- Memory: 7 GB or more
- Java version 1.6.0_07 or higher
This package does not include trained models needed to run the parser.
To build your own models (the word segmenter, POS tagger, and parsing model), you also need:
- g++ (tested with 4.5.0 or higher)
- CRF++ 0.51 or higher (http://crfpp.sourceforge.net)
- Perl 5.8.8 or higher
- Penn2Malt (http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html)
- Penn Chinese Treebank (CTB) 4.0 (LDC2004T05), 5.0 (LDC2005T01) or 6.0 (LDC2007T36).
- Chinese Gigaword Corpus (LDC2009T14). CTB and the Chinese Gigaword Corpus are distributed by the Linguistic Data Consortium (LDC) (http://www.ldc.upenn.edu/) to LDC members.
See README.CNP.NewModel for details.
Compiling:
(Note that you may need to change the path to javac to match your environment)
First, type the following command:
$ ./make.sh
Next, in the SegPOSCRF/Model/seg directory, compile charMake.cc as follows:
$ g++ -O2 -o charMake charMake.cc
Data format:
The format of the input is the CoNLL 2007 format. An example is as follows:
1 他 他 PN PN - 2 SUB - -
2 是 是 VC VC - 0 ROOT - -
3 一 一 CD CD - 4 AMOD - -
4 名 名 M M - 5 NMOD - -
5 学生 学生 NN NN - 2 PRD - -
6 。 。 PU PU - 2 P - -
(data for the next sentence)
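As an illustration of how the columns above are laid out, here is a minimal Python sketch that reads data in this format and prints each dependency. It is not part of the CNP package. The names Token and read_conll are hypothetical, and the sketch assumes whitespace-separated columns with a blank line between sentences.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int   # column 1: token position, starting at 1
    form: str    # column 2: word form
    pos: str     # column 4: POS tag
    head: int    # column 7: index of the head token (0 = root)
    label: str   # column 8: dependency label


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence.

    Assumption: columns are separated by whitespace and sentences
    are separated by a blank line.
    """
    sentence: List[Token] = []
    for line in lines:
        cols = line.split()
        if not cols:              # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    if sentence:                  # the last sentence may lack a trailing blank line
        yield sentence


example = """\
1 他 他 PN PN - 2 SUB - -
2 是 是 VC VC - 0 ROOT - -
3 一 一 CD CD - 4 AMOD - -
4 名 名 M M - 5 NMOD - -
5 学生 学生 NN NN - 2 PRD - -
6 。 。 PU PU - 2 P - -
"""

sentences = list(read_conll(example.splitlines()))
for tok in sentences[0]:
    # Look up the head word; head index 0 denotes the artificial root
    head_form = "ROOT" if tok.head == 0 else sentences[0][tok.head - 1].form
    print(f"{tok.form} --{tok.label}--> {head_form}")
```

The same reader can be applied to the parser's output file to inspect the predicted heads and labels.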
When testing (i.e., analyzing new sentences), set the 7th column to "0" and the 8th column to "NULL", as follows:
1 他 _ PN _ _ 0 NULL
2 是 _ VC _ _ 0 NULL
3 一 _ CD _ _ 0 NULL
4 名 _ M _ _ 0 NULL
5 学生 _ NN _ _ 0 NULL
6 。 _ PU _ _ 0 NULL
More examples can be found under cntestbed/
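If your segmenter and POS tagger produce plain (word, tag) pairs, the test format shown above can be generated with a few lines of Python. The following is only a sketch: the function name to_cnp_test_input is hypothetical, and it assumes tab-separated columns (as in the CoNLL 2007 format) and a blank line after each sentence.

```python
from typing import Iterable, List, Tuple


def to_cnp_test_input(sentences: Iterable[List[Tuple[str, str]]]) -> str:
    """Format (word, POS tag) pairs as test input with placeholder heads."""
    blocks = []
    for sentence in sentences:
        rows = []
        for i, (word, pos) in enumerate(sentence, start=1):
            # Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL
            # HEAD is set to "0" and DEPREL to "NULL", as required for testing
            rows.append("\t".join([str(i), word, "_", pos, "_", "_", "0", "NULL"]))
        blocks.append("\n".join(rows))
    # Assumption: each sentence is followed by a blank line
    return "".join(block + "\n\n" for block in blocks)


tagged = [
    [("他", "PN"), ("是", "VC"), ("一", "CD"), ("名", "M"), ("学生", "NN"), ("。", "PU")],
]
conll_text = to_cnp_test_input(tagged)
print(conll_text, end="")
```

Write the returned string to a file in the encoding you will pass to the parser (GBK or UTF8).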
(Note that you may need to change the path to java to match your environment)
Usage:
$ ./testCNP.sh [model_name] [test_file] [output_file] [encoding] [stMark]
Parameters:
- model_name: the filename of a trained model
- test_file: the file containing sentences to parse in the CoNLL 2007 format
- output_file: the file to which the parsing output for test_file is written
- encoding: the character encoding of the data; GBK and UTF8 are currently supported
- stMark: indicates which subtree file to use (e.g., GIGACRFFreq4)
Example:
$ ./testCNP.sh ../CNP_MODEL_ALAGIN_V1.0_NEW/MSTModels/Model2.AutoPOS.GOLDSEG.CTB4.STLabel.GIGACRFFreq4 cntestbed/test.inputUTF8 cntestbed/test.out UTF8 GIGACRFFreq4
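When the parser is called from a larger pipeline, the argument order above is easy to get wrong. The sketch below assembles the command in Python and checks the encoding before anything is run. The helper names build_cnp_command and run_cnp and the cnp_dir parameter are hypothetical; only the script name and its five positional arguments come from the usage line above.

```python
import subprocess
from typing import List

# The two encodings listed under "Parameters" above
SUPPORTED_ENCODINGS = ("GBK", "UTF8")


def build_cnp_command(model_name: str, test_file: str, output_file: str,
                      encoding: str, st_mark: str) -> List[str]:
    """Return the argument list for testCNP.sh in the documented order."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(
            f"unsupported encoding {encoding!r}; expected one of {SUPPORTED_ENCODINGS}"
        )
    return ["./testCNP.sh", model_name, test_file, output_file, encoding, st_mark]


def run_cnp(command: List[str], cnp_dir: str = ".") -> None:
    """Run the command from cnp_dir (assumed to contain testCNP.sh).

    check=True raises CalledProcessError if the script exits with an error.
    """
    subprocess.run(command, cwd=cnp_dir, check=True)


command = build_cnp_command(
    "../CNP_MODEL_ALAGIN_V1.0_NEW/MSTModels/Model2.AutoPOS.GOLDSEG.CTB4.STLabel.GIGACRFFreq4",
    "cntestbed/test.inputUTF8",
    "cntestbed/test.out",
    "UTF8",
    "GIGACRFFreq4",
)
print(" ".join(command))   # inspect the command; call run_cnp(command) to execute it
```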
Notes:
If you want to parse raw sentences, first process them with a word segmenter and a POS tagger, and convert the results into the CoNLL 2007 format.
When the models are trained on CTB data, the word segmentation and POS tags should follow the CTB guidelines, so you should use tools that output CTB POS tags.
To be self-contained, this package includes a simple word segmenter and POS tagger built on CRF++.
These are in the "SegPosCRF/Model" directory (i.e., "segmenter.sh", "postagger.sh", and other helper programs).
To run these scripts, put a word segmentation model (a trained CRF++ model) into SegPosCRF/Model/seg
and a POS tagging model (a trained CRF++ model) into SegPOSCRF/Model/pos. These models can be built from LDC resources (see README.CNP.NewModel for instructions).
Then, you can use segmenter.sh as follows:
$ ./segmenter.sh [input] [output] [mask]
Parameters:
- input: a file containing one sentence per line (in GBK encoding).
- output: the file to which the segmented sentences are written.
- mask: a prefix for the temporary files generated in the "test" directory; any string will do.
Then, you can use the POS tagger as:
$ ./postagger.sh [output of segmenter.sh] [output] [mask]
The output file can be used as input to the CNP system.
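Because segmenter.sh expects GBK input with one sentence per line, text stored as UTF-8 has to be converted first. The following Python sketch shows one way to do this. The function name utf8_to_gbk and the file names are hypothetical, and dropping empty lines is an assumption made here to keep the one-sentence-per-line layout.

```python
import os
import tempfile


def utf8_to_gbk(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-8 text file as GBK, one sentence per line.

    Raises UnicodeEncodeError if a character has no GBK representation.
    """
    with open(src_path, encoding="utf-8") as src:
        # Assumption: surrounding whitespace and empty lines carry no meaning
        sentences = [line.strip() for line in src if line.strip()]
    with open(dst_path, "w", encoding="gbk", newline="\n") as dst:
        for sentence in sentences:
            dst.write(sentence + "\n")


# Small round trip in a temporary directory to show the effect
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "raw.utf8.txt")
    dst_path = os.path.join(tmp, "raw.gbk.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("他是一名学生。\n\n今天天气很好。\n")
    utf8_to_gbk(src_path, dst_path)
    with open(dst_path, "rb") as f:
        round_trip = f.read().decode("gbk")

print(round_trip, end="")
```

The converted file can then be passed as [input] to segmenter.sh.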
The above segmenter and POS tagger can be used for both research and commercial purposes
because they use CRF++, which can be used under the BSD license.
Alternatively, you may try tools by other researchers, such as the following:
(However, be careful about the terms of use of these tools)
To build a new parser, follow the steps in the file README.CNP.NewModel.
As the original MSTParser is licensed under Common Public License Version 1.0 (or later),
we also distribute our materials (i.e., modification and additions to the original)
under the same license.
The file "LICENSE" is the original license file and "LICENSE.CNP" is the license file
for our materials.
We do not guarantee the reliability of the outputs of this tool, and we accept no
responsibility for any loss or damage caused by the use of this tool.
See LICENSE.CNP for details.
- Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. Improving Dependency Parsing with Subtrees from Auto-Parsed Data. Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2-7, 2009.
Copyright (C) 2008-2011 Information Analysis Laboratory (previously Language Infrastructure Group)
National Institute of Information and Communications Technology