Opinion extraction tool

About Opinion extraction tool
What is new
Caution for use
Download
Requirements and compiling
Usage of Opinion extraction tool
Generating model data
Directories
References
Copyright & License

About Opinion extraction tool

This tool was developed by the Information Credibility project in the Knowledge Clustered Group, National Institute of Information and Communications Technology (NICT), Japan*1. It uses machine learning techniques to judge whether a given Japanese sentence expresses an opinion, evaluation, or proposal (henceforth, "opinions"). The tool outputs the following:

Linguistic expressions which represent opinions (opinion expression)
Semantic categories of opinion expressions (opinion type)
Polarities of the opinion expressions (positive/negative polarity)
Persons or organizations who assert the opinion expressions (opinion holders)

For example, the following sentence is analysed as below.

ほうれん草はビタミンが豊富だ。(Spinach is rich in vitamins.)
- Opinion expression: ビタミンが豊富だ (rich in vitamins)
- Opinion type: メリット (merit)
- Polarity: ポジティブ (positive)
- Opinion holder: 著者 (author)

In the case above, "rich in vitamins" is the opinion expression. Opinion expressions are classified by meaning into types such as emotion (emotional opinion expressions), evaluation (subjective but not emotional expressions), and merit (relatively objective opinion expressions). In the above example, the opinion type of "rich in vitamins" is "merit" because it is a relatively objective opinion expression (see Appendix 1 for details). The polarity indicates whether the opinion expression has a positive or a negative meaning (some opinion types have no polarity). The opinion holder is the person or organization that asserts the opinion expression; in this case, it is the author of the sentence. Details are given in the annotation manual included with this tool (spec.pdf). An opinion corpus was constructed by hand according to this specification and used to train the opinion extraction tool. Because the tool is trained with statistical machine learning, its output is not guaranteed to conform completely to the specification, although it tries to do so.

*1) This project was completed in March 2011.

What is new

2012/3/1 Version 1.2 is released.
2011/9/22 Version 1.1 is released.
- The tool for generating model data is attached.
2011/8/31 Version 1.0 is released.

Caution for use

This tool extracts and classifies information by using machine learning. Please note the following:

As a result of automated processing, this tool may assign the wrong polarity to particular individuals, organizations, or services. In an extreme case, a mistaken negative polarity could lead to a libel claim. When using this tool, please state clearly that its output is the result of automated processing.
Some expressions can be regarded as discriminatory when they are treated as negative, and the tool may classify these kinds of expressions as negative. Please be careful about such cases.

Bearing in mind the above cases, we do not guarantee the reliability of the outputs of this tool and accept no responsibility for any disadvantage or damage caused by its use. See Copyright & License.

Download

Opinion extraction tool (Version 1.2): extractopinion-1.2.tar.gz (10MB)

Requirements and Compiling

Requirements:

OS: Linux (tested on CentOS 5.5)
Memory: 4GB or larger

Required programs:

CRF++ (Version 0.54 or higher)
iconv (Version 2.5 or higher)
gawk (Version 4.0.0 or higher)
gcc (Version 4.2.1 or higher)
perl (Version 5.8.8 or higher)
- Text::CSV_XS module (for generating model data)
JUMAN (Version 6.0)
KNP (Version 3.01)
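
Before compiling, it can help to confirm that the required programs are on the PATH. The following Python sketch does this with `shutil.which`. The executable names (`crf_learn` for CRF++, `juman`, `knp`) are assumptions about a typical installation, and version numbers are not checked.

```python
import shutil

# Executable names are assumptions: CRF++ normally installs crf_learn and
# crf_test, and JUMAN/KNP normally install juman and knp. Adjust as needed.
REQUIRED_PROGRAMS = ["crf_learn", "iconv", "gawk", "gcc", "perl", "juman", "knp"]


def missing_programs(names=REQUIRED_PROGRAMS, which=shutil.which):
    """Return the programs from `names` that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required programs were found.")
```

The `which` parameter exists only so that the lookup can be replaced in tests; in normal use the default is sufficient.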

Installation:

Download the archive from the download link. The commands below use extractopinion-1.1; substitute the version number of the archive you downloaded.
Extract the downloaded file and move to the resulting directory
```
% tar zxvf extractopinion-1.1.tar.gz
% cd extractopinion-1.1/
```
Type the following commands to compile the program in svmtools/
```
% cd svmtools/
% make clean ; make
```

Type the following commands to compile the program in pol/
```
% cd ../pol/
% make clean ; make  
```

Usage of the opinion extraction tool

Input file

Prepare an input file. The input file must be a text file encoded in UTF-8 with one Japanese sentence per line. We use "sample.txt" as the input file name in the following explanation.

Example (sample.txt; "↵" denotes a line break, \n):

ほうれん草はビタミンが豊富だ。↵
京都は日本にある。↵
商品Aは良くない。↵
太郎は学校に行くべきだ。↵
道州制は国の一律の規制が解かれ地域経済の活性化が図られるので、商機が拡大すると考えられる。↵

English translation:

Spinach is rich in vitamins.↵
Kyoto is in Japan.↵
Product "A" is not good.↵
Taro should go to school.↵
The regional system increases the opportunity for trade because it eases regulations and activates the local economy.↵
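
An input file in the format described above can be produced with a few lines of Python. This is a minimal sketch: the function name `write_input_file` is hypothetical, and skipping empty sentences rests on the assumption that a blank line would otherwise be counted as a sentence.

```python
def write_input_file(path, sentences):
    """Write one sentence per line in UTF-8 with LF line endings.

    Empty sentences are skipped (assumption: a blank line would be counted
    as a sentence and shift the sentence IDs). Returns the number of lines
    written.
    """
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        if "\n" in sentence or "\r" in sentence:
            raise ValueError("a sentence must not contain a line break")
        lines.append(sentence)
    # newline="\n" keeps LF endings even on platforms that default to CRLF.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)


if __name__ == "__main__":
    write_input_file("sample.txt", ["ほうれん草はビタミンが豊富だ。", "京都は日本にある。"])
```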

Command

Execute the following command with the input file name as the first argument (see Appendix 2).

```
% cd extractopinion-1.1/
% ./extract.sh sample.txt
```
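
To call the tool from a program, the command above can be wrapped with Python's `subprocess` module. This is a sketch under two assumptions that this document does not state explicitly: the results are printed to standard output, and they are encoded in UTF-8 like the input. The function names are hypothetical.

```python
import subprocess


def build_command(input_file):
    """Return the argument list for running extract.sh on `input_file`."""
    return ["./extract.sh", input_file]


def run_extract(tool_dir, input_file):
    """Run extract.sh inside `tool_dir` and return its output as text.

    `input_file` is resolved relative to `tool_dir`, because the script is
    started there. Assumption: results go to standard output in UTF-8.
    """
    completed = subprocess.run(
        build_command(input_file),
        cwd=tool_dir,
        check=True,  # raise CalledProcessError on a non-zero exit status
        stdout=subprocess.PIPE,
    )
    return completed.stdout.decode("utf-8")


if __name__ == "__main__":
    print(run_extract("extractopinion-1.1", "sample.txt"))
```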

Output

The command processes each sentence of the input file and outputs the following fields:

<Document ID>	Input file name
<Sentence ID>	The ID of the sentence in the input file (integer)
<Opinion holder>	String (UTF-8)
<Opinion type and polarity>	String (UTF-8). The polarity is empty for opinion types without polarity (Deontic/Demand)
<Opinion expression>	String (UTF-8)

Example of output:

sample.txt   1   [著者]   メリット＋   ビタミンが豊富だ。
sample.txt   2   
sample.txt   3   [著者]   批評−       良くない。
sample.txt   4   [著者]   当為         学校に行くべきだ。
sample.txt   5   [著者]   メリット＋   地域経済の活性化が図られるので、
sample.txt   5   [著者]   メリット＋   商機が拡大すると考えられる。

English translation:

sample.txt   1   [Author]   Merit+        rich in vitamins.
sample.txt   2   
sample.txt   3   [Author]   Evaluation-   not good.
sample.txt   4   [Author]   Deontic       should go to school.
sample.txt   5   [Author]   Merit+        activates local economy,
sample.txt   5   [Author]   Merit+        increases the opportunity for trade.
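
The output lines shown above can be split into fields for further processing. The following Python sketch assumes that fields are separated by a tab or by two or more whitespace characters (the exact separator is not stated here) and that a line with fewer than five fields means no opinion was found in that sentence. The names `Opinion` and `parse_output_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Polarity marks: full-width plus, ASCII plus, minus sign, full-width
# hyphen-minus, ASCII hyphen. Which of these the tool emits is an assumption
# based on the sample output.
POLARITY_MARKS = {"\uff0b": "+", "+": "+", "\u2212": "-", "\uff0d": "-", "-": "-"}


class Opinion(NamedTuple):
    document_id: str
    sentence_id: int
    holder: Optional[str] = None
    opinion_type: Optional[str] = None
    polarity: Optional[str] = None
    expression: Optional[str] = None


def parse_output_line(line):
    """Parse one output line into an Opinion record."""
    # Assumed separator: a tab, or a run of two or more whitespace characters.
    fields = re.split(r"\s*\t\s*|\s{2,}", line.strip(), maxsplit=4)
    if len(fields) < 2:
        raise ValueError("expected at least a document ID and a sentence ID")
    document_id, sentence_id = fields[0], int(fields[1])
    if len(fields) < 5:
        # Only the IDs are present: no opinion was extracted.
        return Opinion(document_id, sentence_id)
    holder = fields[2].strip("[]")
    label = fields[3]
    polarity = POLARITY_MARKS.get(label[-1:])
    opinion_type = label[:-1] if polarity else label
    return Opinion(document_id, sentence_id, holder, opinion_type, polarity, fields[4])
```

Types without polarity, such as 当為 (Deontic), are returned with `polarity` set to `None`.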

Generating model data

You can generate your own model data from an annotated corpus for machine learning. Follow the procedure below.

Install the Text::CSV_XS module through CPAN.
```
% cpan
cpan> install Text::CSV_XS
```
Collect a list of words and their polarities and construct a polarity dictionary according to Appendix 4.
After constructing the dictionaries, put them into the directory "extractopinion-1.1/dic" (the default location of dictionaries). The location can be changed by editing the environment variable "dic" in conf.sh (Appendix 3).
Construct a corpus for machine learning by hand (see Appendix 5 for the format) and save the file in the directory "extractopinion-1.1/makemodel/csv/".
Execute the following command. The command "csv2tsv.sh" adds morphological information to the corpus and converts it into tsv format. JUMAN and KNP are required to run "csv2tsv.sh". The tsv files are written to "extractopinion-1.1/makemodel/tsv/".
```
% cd extractopinion-1.1/makemodel/csv/
% ./csv2tsv.sh
```
Execute the following command. The command "makemdl.sh" generates model data in the directory "extractopinion-1.1/makemodel/model/" from the tsv files.
```
% cd extractopinion-1.1/
% ./makemdl.sh
```
Specify the location of the model data by editing the environment variable "model" in conf.sh (Appendix 3) and copy the model data to the specified location.

Directories

readme.{html/utf.txt}	This document
spec.pdf	Opinion manual annotation specification *1
pol/	Polarity classification module *2
src/	Opinion holder extraction module
typ/	Opinion type classification module
xpr/	Opinion expression extraction module
svmtools/	SVM directory
lib/	Common functions
dic/	Polarity dictionary
	dictionary.dic	Polarity dictionary (see Appendix 4)
	reverse.dic	Reverse expression dictionary (see Appendix 4)
modeldata/sample/	Sample model data
	model.pol_mdl	Model data (polarity classification)
	model.src_crfmdl	Model data (Opinion holder extraction)
	model.src_ft	Model data (Opinion holder extraction)
	model.src_svmmdl	Model data (Opinion holder extraction)
	model.typ_ft	Model data (opinion type classification)
	model.typ_mdl	Model data (opinion type classification)
	model.xpr_mdl	Model data (opinion expression extraction)
sample.txt	Sample input file
extract.sh	Script for extracting opinion
_extract.sh	Internal script for extracting opinion
conf.sh	Configuration file (see Appendix 3)
makemdl.sh	Script for generating model data
_train.sh	Internal script for generating model data
makemodel/	Directory for generating model data
	model/	Model data
	csv/	Corpus for machine learning (see Appendix 5)
	tsv/	tsv files for the input of makemdl.sh.

*1) The opinion extraction tool does not always output the same result as the specification, because the specification was written for constructing an opinion corpus manually.
*2) Please note the copyright of some files in the directory pol/. See Copyright & License.

Copyright & License

This tool is free software. The copyright belongs to the National Institute of Information and Communications Technology, Japan. This tool can be used, modified, and redistributed under the BSD (Modified BSD License), LGPL (GNU Lesser General Public License), or GPL (GNU General Public License).
The following files in the directory pol/, which implement the L-BFGS optimization algorithm, were translated by hand into C from the original Fortran code.
- lbfgs.c
- lbfgs.h
The original Fortran code was written by Prof. Jorge Nocedal (http://users.eecs.northwestern.edu/~nocedal/lbfgs.html). The copyright of the original Fortran code belongs to Prof. Jorge Nocedal.

References

Nakagawa, T., Inui, K. and Kurohashi, S., Dependency tree-based sentiment classification using CRFs with hidden variables, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786–794 (2010).

Nakagawa, T., Inui, T., and Kurohashi, S., Sentiment Classification using Conditional Random Fields with Hidden Variables, IPSJ SIG Technical Report 2009-NL-192, pp. 1-7 (2009) (in Japanese).

Nakagawa, T., Kawada, T., Inui, K., and Kurohashi, S., Extracting Subjective and Objective Evaluative Expressions from the Web, In Proceedings of the Second International Symposium on Universal Communication (ISUC 2008), pp. 251-258 (2008).

Kawada, T., Nakagawa, T., Morii, R., Miyamori, H., Akamine, S., Inui, K., Kurohashi, S., and Kidawara, Y., Constructing evaluative information corpus on the Web, In Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pp. 524-527 (2008) (in Japanese).

Liu, D.C., and Nocedal, J., On the Limited Memory Method for Large Scale Optimization, Mathematical Programming B, 45, 3, pp. 503-528 (1989).

Nocedal, J., Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35, pp. 773-782 (1980).

Appendix

Appendix 1) Opinion type

(+: Positive, -: Negative)

Emotion+/Emotion-
- Subjective and emotional opinions
  - Example: I love Kyoto. (Emotion+)
Evaluation+/Evaluation-
- Subjective but not emotional opinions
  - Example: Kyoto is clean. (Evaluation+)
Merit+/Merit-
- Objective opinions, especially about advantages or disadvantages of something
  - Example: This card can be used every day. (Merit+)
Adopt+/Adopt-
- Acceptance or refusal of some act
  - Example: The company accepts the summer time. (Adopt+)
Event+/Event-
- Good/bad events or experiences
  - Example: I won the first prize. (Event+)
Deontic
- Duties and proposals (no polarity)
  - Example: Electronic money should be introduced. (Deontic)
Demand
- Requirements and demands (no polarity)
  - Example: I hope electronic money can be used at this shop. (Demand)
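
The opinion types listed in this appendix, and whether each carries a polarity, can be written down as a small lookup table. The following Python sketch uses the English type names from this appendix; the table layout and the helper `format_label` are hypothetical and are not part of the tool.

```python
# True means the opinion type carries a positive/negative polarity.
OPINION_TYPES = {
    "Emotion": True,
    "Evaluation": True,
    "Merit": True,
    "Adopt": True,
    "Event": True,
    "Deontic": False,
    "Demand": False,
}


def format_label(opinion_type, polarity=None):
    """Build a label such as "Merit+" or "Deontic", validating the polarity."""
    if opinion_type not in OPINION_TYPES:
        raise ValueError(f"unknown opinion type: {opinion_type}")
    if OPINION_TYPES[opinion_type]:
        if polarity not in ("+", "-"):
            raise ValueError(f"{opinion_type} needs a polarity of '+' or '-'")
        return opinion_type + polarity
    if polarity is not None:
        raise ValueError(f"{opinion_type} has no polarity")
    return opinion_type
```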

Appendix 2) Internal processing (extract.sh)

Creating tsv files for opinion extraction from the input file (lib/in2tsv.pl)
Opinion expression extraction (xpr/extract.sh)
Opinion holder extraction (src/extract.sh)
Opinion type classification (typ/extract.sh)
Polarity classification (pol/extract.sh)
Converting the tsv file used for opinion extraction into the output format (lib/tsv2out.pl)

Appendix 3) Environment variables in conf.sh

TMPDIR
- Directory for temporary files. Default value is /tmp
model
- Prefix of model data. Default value is extractopinion-1.1/modeldata/sample/model
dic
- Directory for polarity dictionary. Default value is extractopinion-1.1/dic

Appendix 4) Polarity dictionary

Format

Both dictionary.dic and reverse.dic have the same format. Each dictionary is a text file encoded in EUC-JP. Each line holds the following three tab-separated values.

Direction word
Polarity
Morphemes of the direction word, separated by tabs.

Example:

Direction word	Polarity	Morphemes
良い	+	良い
正統派	+	正統だ	派
やさしさ	+	やさしい	さ
内分泌攪乱化学物質	-	内分泌	攪乱	化学	物質
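
The dictionary format above can be read with a short Python sketch. It assumes that every line has at least three tab-separated fields and that the file contains no header line (the header in the example above is taken to be part of the table, not of the file). The names `DictionaryEntry`, `parse_dictionary_line`, and `load_dictionary` are hypothetical.

```python
from typing import List, NamedTuple


class DictionaryEntry(NamedTuple):
    word: str
    polarity: str
    morphemes: List[str]


def parse_dictionary_line(line):
    """Split one dictionary line into word, polarity, and morphemes."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected word, polarity, and at least one morpheme")
    # Everything after the second field is one morpheme per tab-separated value.
    return DictionaryEntry(fields[0], fields[1], fields[2:])


def load_dictionary(path):
    """Read a dictionary file; the dictionaries are EUC-JP text files."""
    with open(path, encoding="euc_jp") as f:
        return [parse_dictionary_line(line) for line in f if line.strip()]
```

The polarity value is kept as a plain string and is not validated, because the values used in reverse.dic are not described here.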

Reverse.dic

Reverse.dic is a set of words that reverse the polarity of the entire sentence. An example is "prevent" in the following sentence.

This medicine prevents cancer.

While "cancer" is a negative noun, the verb "prevent" reverses the polarity, and the whole sentence has positive polarity. Words of this kind are collected in reverse.dic.
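
The effect of a reversing word can be illustrated with a toy rule in Python. This is only an illustration of the idea and not the tool's method, which relies on a trained model. The rule itself, the function names, and the English entries standing in for the Japanese dictionaries are all assumptions.

```python
def flip(polarity):
    """Swap '+' and '-'."""
    return "-" if polarity == "+" else "+"


def toy_polarity(words, polarity_dic, reverse_words):
    """Return '+', '-', or None for a list of words.

    Toy rule (an assumption, not the tool's method): start from the polarity
    of the last word found in polarity_dic and flip it once for every word
    found in reverse_words.
    """
    polarity = None
    for word in words:
        if word in polarity_dic:
            polarity = polarity_dic[word]
    if polarity is None:
        return None
    for word in words:
        if word in reverse_words:
            polarity = flip(polarity)
    return polarity


# Hypothetical English entries standing in for the Japanese dictionaries.
print(toy_polarity(["medicine", "prevent", "cancer"], {"cancer": "-"}, {"prevent"}))  # prints +
```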

Location

Both dictionary.dic and reverse.dic are located in the directory "extractopinion-1.1/dic" by default. The location can be changed by editing the environment variable "dic" in conf.sh.

Appendix 5) Corpus for machine learning (csv)

The corpus for machine learning has the following format. Values are separated by commas, and the character encoding is Shift-JIS.

Sentence ID (arbitrary ID)
Sentence
Opinion holder
Opinion expression
Opinion type
Target of the opinion (the object of the opinion expression; this value can be ignored because this tool does not output it)

Example of an annotation for a given sentence "京都は美しい (Kyoto is beautiful)."

Sentence ID	Sentence	Opinion holder	Opinion expression	Opinion type	Target of the opinion
example-1	京都は美しい	[著者]	美しい	批評＋	京都
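
A corpus file in this format can be written with Python's `csv` module. This is a minimal sketch: the function name `write_corpus` is hypothetical, and it writes no header row, since this document does not say whether csv2tsv.sh expects one.

```python
import csv

# Column order as listed in this appendix.
FIELDS = ["sentence_id", "sentence", "holder", "expression", "opinion_type", "target"]


def write_corpus(path, rows):
    """Write annotation rows as comma-separated values in Shift-JIS.

    Each row is a sequence of six values in the order given by FIELDS.
    Assumption: no header row is written.
    """
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="shift_jis", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if len(row) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} values, got {len(row)}")
            writer.writerow(row)


if __name__ == "__main__":
    write_corpus(
        "corpus.csv",
        [["example-1", "京都は美しい", "[著者]", "美しい", "批評\uff0b", "京都"]],
    )
```

Characters that cannot be represented in Shift-JIS raise a `UnicodeEncodeError`, which makes encoding problems visible before the corpus is passed on.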