CNP - A ChiNese dependency Parser

Introduction

CNP is a highly accurate dependency parser for Chinese. This package includes:

In particular, CNP has the following features:

Changes

Download

Requirement and Compiling

Requirement:
This package does not include trained models needed to run the parser.
If you build your own models (word segmenter, POS tagger, and parsing model), you need additional resources; see README.CNP.NewModel for details.

Compiling:
(Note that you may need to change the path of javac according to your own environment)
First, type the following command:
$ ./make.sh

Next, in the SegPOSCRF/Model/seg directory, compile charMake.cc as follows:
$ g++ -O2 -o charMake charMake.cc

Usage

Data format:

The format of the input is the CoNLL 2007 format. An example is as follows:
1 他 他 PN PN - 2 SUB - -
2 是 是 VC VC - 0 ROOT - -
3 一 一 CD CD - 4 AMOD - -
4 名 名 M M - 5 NMOD - -
5 学生 学生 NN NN - 2 PRD - -
6 。 。 PU PU - 2 P - -

(data for the next sentence)
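
The layout above can be read back with a few lines of Python. This is a minimal sketch, not part of the package: the names Token and read_sentences are hypothetical, and it assumes that columns are separated by whitespace (spaces or tabs) and that sentences are separated by blank lines.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int   # column 1: token position, starting at 1
    form: str    # column 2: word form
    pos: str     # column 4: POS tag
    head: int    # column 7: position of the head token (0 = root)
    deprel: str  # column 8: dependency label


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines end a sentence."""
    sentence: List[Token] = []
    for line in lines:
        cols = line.split()  # assumption: whitespace-separated columns
        if not cols:
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    if sentence:
        yield sentence


example = """\
1 他 他 PN PN - 2 SUB - -
2 是 是 VC VC - 0 ROOT - -
3 一 一 CD CD - 4 AMOD - -
4 名 名 M M - 5 NMOD - -
5 学生 学生 NN NN - 2 PRD - -
6 。 。 PU PU - 2 P - -
"""

for sent in read_sentences(example.splitlines()):
    for tok in sent:
        # Look up the head word by its position; 0 means the sentence root.
        head = "ROOT" if tok.head == 0 else sent[tok.head - 1].form
        print(f"{tok.form} -> {head} ({tok.deprel})")
```

The same function can be applied to an open file object, since it only iterates over lines.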

When testing (i.e., analyzing new sentences), set the 7th column to "0" and the 8th column to "NULL", as follows.
1 他 _ PN _ _ 0 NULL
2 是 _ VC _ _ 0 NULL
3 一 _ CD _ _ 0 NULL
4 名 _ M _ _ 0 NULL
5 学生 _ NN _ _ 0 NULL
6 。 _ PU _ _ 0 NULL
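
If the segmented and POS-tagged words are already available as (word, POS) pairs, the test input shown above can be generated as follows. This is only a sketch: to_test_input is a hypothetical helper, not a script shipped with CNP, and the tab default for the column separator is an assumption (the example above is printed with spaces).

```python
from typing import Iterable, List, Tuple


def to_test_input(tagged: Iterable[Tuple[str, str]], sep: str = "\t") -> List[str]:
    """Format the (word, POS) pairs of one sentence as parser test input."""
    rows = []
    for i, (word, pos) in enumerate(tagged, start=1):
        # Columns: ID, word, "_", POS, "_", "_", head fixed to 0, label fixed to NULL
        rows.append(sep.join([str(i), word, "_", pos, "_", "_", "0", "NULL"]))
    return rows


sentence = [("他", "PN"), ("是", "VC"), ("一", "CD"),
            ("名", "M"), ("学生", "NN"), ("。", "PU")]

print("\n".join(to_test_input(sentence, sep=" ")))
print()  # a blank line separates this sentence from the next one
```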

More examples can be found under cntestbed/

Testing the trained model

(Note that you may need to change the path of java according to your own environment)

Usage:
$ ./testCNP.sh [model_name] [test_file] [output_file] [encoding] [stMARK]

Parameters:

Example:
$ ./testCNP.sh ../CNP_MODEL_ALAGIN_V1.0_NEW/MSTModels/Model2.AutoPOS.GOLDSEG.CTB4.STLabel.GIGACRFFreq4 cntestbed/test.inputUTF8 cntestbed/test.out UTF8 GIGACRFFreq4
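
When the parser is called from another program, the same command can be assembled in Python. The sketch below only mirrors the argument order of the usage line; build_test_command and run_parser are hypothetical names, and run_parser assumes it is started from the directory that contains testCNP.sh.

```python
import subprocess
from typing import List


def build_test_command(model_name: str, test_file: str, output_file: str,
                       encoding: str, st_mark: str,
                       script: str = "./testCNP.sh") -> List[str]:
    """Return the argument list in the order given in the usage line."""
    return [script, model_name, test_file, output_file, encoding, st_mark]


def run_parser(cmd: List[str]) -> None:
    """Run the parser script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = build_test_command(
    "../CNP_MODEL_ALAGIN_V1.0_NEW/MSTModels/"
    "Model2.AutoPOS.GOLDSEG.CTB4.STLabel.GIGACRFFreq4",
    "cntestbed/test.inputUTF8",
    "cntestbed/test.out",
    "UTF8",
    "GIGACRFFreq4",
)
print(" ".join(cmd))  # shows the command without running it
```

Calling run_parser(cmd) then executes the script; passing the arguments as a list avoids shell quoting problems with file names.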

Notes:
To parse raw sentences, first run a word segmenter and a POS tagger on them, then convert the results into the CoNLL 2007 format. If the models were trained on CTB data, the word segmentation and POS tags must follow the CTB guidelines, so use tools that output CTB POS tags.
To be self-contained, this package includes a simple word segmenter and POS tagger built on CRF++. They are in the "SegPosCRF/Model" directory (i.e., "segmenter.sh", "postagger.sh", and other helper programs). To run these scripts, put a word segmentation model (a trained CRF++ model) into SegPosCRF/Model/seg and a POS tagging model (a trained CRF++ model) into SegPOSCRF/Model/pos. These models can be built from LDC resources (see README.CNP.NewModel for instructions).

Then, you can use segmenter.sh as follows:

$ ./segmenter.sh [input] [output] [mask]

Parameters:

Then, you can use the POS tagger as follows:

$ ./postagger.sh [output of segmenter.sh] [output] [mask]

The output file can be used as input to the CNP system.

The above segmenter and POS tagger can be used for both research and commercial purposes because they rely on CRF++, which may be used under the BSD license.
Alternatively, you may try tools by other researchers, such as the following (be careful about the terms of use of these tools):

Build your own CNP

Please follow the steps in README.CNP.NewModel to build your new parser.

License

As the original MSTParser is licensed under the Common Public License Version 1.0 (or later), we also distribute our materials (i.e., modifications and additions to the original) under the same license. The file "LICENSE" is the original license file and "LICENSE.CNP" is the license file for our materials.

Negligence Clause

We do not guarantee the reliability of the outputs of this tool and accept no responsibility for any loss or damage caused by its use. See LICENSE.CNP for details.

Reference

Copyright (C) 2008-2011 Information Analysis Laboratory (previously Language Infrastructure Group)
National Institute of Information and Communications Technology