Opinion extraction tool

Contents

  1. About Opinion extraction tool
  2. What is new
  3. Caution for use
  4. Download
  5. Requirements and compiling
  6. Usage of Opinion extraction tool
  7. Generating model data
  8. Directories
  9. References
  10. Copyright & License

About Opinion extraction tool

This tool was developed by the Information Credibility project of the Knowledge Clustered Group, National Institute of Information and Communications Technology (NICT), Japan*1. Using machine learning techniques, it judges whether a given Japanese sentence expresses an opinion, an evaluation, or a proposal (henceforth collectively called "opinions"). The tool outputs the following:

  1. Linguistic expressions which represent opinions (opinion expression)
  2. Semantic categories of opinion expressions (opinion type)
  3. Polarities of the opinion expressions (positive/negative polarity)
  4. Persons or organizations who assert the opinion expressions (opinion holders)

For example, the following sentence is analysed as below.

In the case above, "rich in vitamins" is the opinion expression. Opinion expressions are classified by meaning into types such as emotion (emotional opinion expressions), evaluation (subjective but not emotional expressions), and merit (relatively objective opinion expressions). In the above example, the opinion type of "rich in vitamins" is "merit" because it is a relatively objective opinion expression (see Appendix 1 for details). The polarity indicates whether an opinion expression has a positive or a negative meaning (some opinion types have no polarity). The opinion holder is the person or organization that asserts the opinion expression; in this case, it is the author of the sentence.

The annotation manual included with this tool (spec.pdf) describes these notions in detail. An opinion corpus was constructed by hand according to this specification and used to train the opinion extraction tool. Because the tool is trained by statistical machine learning, its output is not guaranteed to conform completely to the specification, although it is designed to do so.

What is new

Caution for use

This tool extracts and classifies information by using machine learning. Please note the following:

Bearing in mind the above cases, we do not guarantee the reliability of the outputs of this tool and accept no responsibility for any disadvantage or damage caused by its use. See Copyright & License.

Download

Requirements and Compiling

Requirements:

Required programs:

Installation:

Usage of the opinion extraction tool

Input file

Prepare an input file. The input file must be encoded in UTF-8 and contain one Japanese sentence per line. The following explanation uses "sample.txt" as the input file name.

Example (sample.txt) ("↵" means line break: \n) :

ほうれん草はビタミンが豊富だ。↵
京都は日本にある。↵
商品Aは良くない。↵
太郎は学校に行くべきだ。↵
道州制は国の一律の規制が解かれ地域経済の活性化が図られるので、商機が拡大すると考えられる。↵


English translation:

Spinach is rich in vitamins.↵
Kyoto is in Japan.↵
Product "A" is not good.↵
Taro should go to school.↵
The Regional system increases the opportunity for trade because it eases regulations and activates local economy.↵
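
The input format described above (UTF-8, one sentence per line, "\n" line breaks) can be produced with a short script. The following Python sketch is only an illustration and is not part of the tool; the function name write_input_file is hypothetical.

```python
def write_input_file(path, sentences):
    """Write one sentence per line as UTF-8 with "\n" line breaks.

    Returns the number of sentences written. Empty entries are skipped.
    """
    cleaned = [s.strip() for s in sentences if s.strip()]
    for s in cleaned:
        # A sentence containing a line break would span two input lines.
        if "\n" in s or "\r" in s:
            raise ValueError("a sentence must not contain a line break")
    # newline="\n" prevents translation of "\n" to the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for s in cleaned:
            f.write(s + "\n")
    return len(cleaned)


if __name__ == "__main__":
    write_input_file("sample.txt", [
        "ほうれん草はビタミンが豊富だ。",
        "京都は日本にある。",
    ])
```

Sentence splitting itself is left to the caller, since the tool expects the text to be already divided into sentences.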

Command

Execute the following command with the input file name as the first argument (see Appendix 2).

% cd extractopinion-1.1/
% ./extract.sh sample.txt

Output

The command processes each sentence in the input file and outputs the following fields:

<Document ID>                 Input file name
<Sentence ID>                 The ID of the sentence in the input file (integer)
<Opinion holder>              String (UTF-8)
<Opinion type and polarity>   String (UTF-8); the polarity is empty for non-polarity types (Deontic/Demand)
<Opinion expression>          String (UTF-8)

Example of output:

sample.txt   1   [著者]   メリット+   ビタミンが豊富だ。
sample.txt   2
sample.txt   3   [著者]   批評−   良くない。
sample.txt   4   [著者]   当為   学校に行くべきだ。
sample.txt   5   [著者]   メリット+   地域経済の活性化が図られるので、
sample.txt   5   [著者]   メリット+   商機が拡大すると考えられる。

English translation:

sample.txt   1   [Author]   Merit+        rich in vitamins.
sample.txt   2
sample.txt   3   [Author]   Evaluation-   not good.
sample.txt   4   [Author]   Deontic       should go to school.
sample.txt   5   [Author]   Merit+        activates local economy
sample.txt   5   [Author]   Merit+        increases the opportunity for trade.
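
For downstream processing, each output line can be split into the five fields listed above. The following Python sketch is an illustration only. It assumes that fields are separated by whitespace (this document does not state the exact separator), that the opinion holder and the opinion type contain no whitespace, and that a line with only a document ID and a sentence ID means no opinion was found. The names Opinion and parse_output_line are hypothetical.

```python
from typing import NamedTuple, Optional


class Opinion(NamedTuple):
    document_id: str
    sentence_id: int
    holder: str
    opinion_type: str
    polarity: str  # "+", "-", or "" for non-polarity types
    expression: str


# Polarity marks seen in the examples, plus full-width variants (assumption).
_POLARITY_MARKS = {"+": "+", "-": "-", "\u2212": "-", "\uff0b": "+", "\uff0d": "-"}


def parse_output_line(line: str) -> Optional[Opinion]:
    """Parse one output line; return None if it carries no opinion."""
    # Split on whitespace at most four times so that the opinion
    # expression (last field) is kept intact even if it contains spaces.
    fields = line.split(None, 4)
    if len(fields) < 5:
        return None
    doc_id, sent_id, holder, type_pol, expression = fields
    polarity = _POLARITY_MARKS.get(type_pol[-1], "")
    opinion_type = type_pol[:-1] if polarity else type_pol
    return Opinion(doc_id, int(sent_id), holder, opinion_type,
                   polarity, expression.rstrip())


if __name__ == "__main__":
    print(parse_output_line("sample.txt 3 [著者] 批評− 良くない。"))
```

The polarity is normalized to an ASCII "+" or "-" so that later comparisons do not depend on which minus character appears in the output.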

Generating model data

It is possible to generate model data by constructing an annotated corpus for machine learning. For the generation, follow the procedure below.

  1. Install Text::CSV_XS module through CPAN.
    % cpan
    cpan> install Text::CSV_XS
  2. Collect a list of words and their polarities, and construct the polarity dictionaries according to Appendix 4.
  3. Put the dictionaries into the directory "extractopinion-1.1/dic" (the default dictionary location). The location can be changed by editing the environment variable "dic" in conf.sh (Appendix 3).
  4. Construct a corpus for machine learning by hand (see Appendix 5 for the format) and save it in the directory "extractopinion-1.1/makemodel/csv/".
  5. Execute the following command. The command "csv2tsv.sh" adds morphological information to the corpus and converts it into tsv format. JUMAN and KNP are required to run "csv2tsv.sh". The resulting tsv files are written to "extractopinion-1.1/makemodel/tsv/".
    % cd extractopinion-1.1/makemodel/csv/
    % ./csv2tsv.sh
  6. Execute the following command. The command "makemdl.sh" generates model data in the directory "extractopinion-1.1/makemodel/model/" from the tsv files.
    % cd extractopinion-1.1/
    % ./makemdl.sh
  7. Specify the location of the model data by editing the environment variable "model" in conf.sh (Appendix 3) and copy the model data to the specified location.

Directories

readme.{html/utf.txt}    This document
spec.pdf                 Opinion manual annotation specification *1
pol/                     Polarity classification module *2
src/                     Opinion holder extraction module
typ/                     Opinion type classification module
xpr/                     Opinion expression extraction module
svmtools/                SVM directory
lib/                     Common functions
dic/                     Polarity dictionary
  dictionary.dic         Polarity dictionary (see Appendix 4)
  reverse.dic            Reverse expression dictionary (see Appendix 4)
modeldata/sample/        Sample model data
  model.pol_mdl          Model data (polarity classification)
  model.src_crfmdl       Model data (opinion holder extraction)
  model.src_ft           Model data (opinion holder extraction)
  model.src_svmmdl       Model data (opinion holder extraction)
  model.typ_ft           Model data (opinion type classification)
  model.typ_mdl          Model data (opinion type classification)
  model.xpr_mdl          Model data (opinion expression extraction)
sample.txt               Sample input file
extract.sh               Script for extracting opinions
_extract.sh              Internal script for extracting opinions
conf.sh                  Configuration file (see Appendix 3)
makemdl.sh               Script for generating model data
_train.sh                Internal script for generating model data
makemodel/               Directory for generating model data
  model/                 Model data
  csv/                   Corpus for machine learning (see Appendix 5)
  tsv/                   tsv files used as input to makemdl.sh

Copyright & License

References

Nakagawa, T., Inui, K. and Kurohashi, S., Dependency tree-based sentiment classification using CRFs with hidden variables, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786–794 (2010).

Nakagawa, T., Inui, T., and Kurohashi, S., Sentiment Classification using Conditional Random Fields with Hidden Variables, IPSJ SIG Technical Report 2009-NL-192, pp. 1-7 (2009) (in Japanese).

Nakagawa, T., Kawada, T., Inui, K., and Kurohashi, S., Extracting Subjective and Objective Evaluative Expressions from the Web, In Proceedings of the Second International Symposium on Universal Communication (ISUC 2008), pp. 251-258 (2008).

Kawada, T., Nakagawa, T., Morii, R., Miyamori, H., Akamine, S., Inui, K., Kurohashi, S., and Kidawara, Y., Constructing evaluative information corpus on the Web, In Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pp. 524-527 (2008) (in Japanese).

Liu, D.C., and Nocedal, J., On the Limited Memory BFGS Method for Large Scale Optimization, Mathematical Programming B, 45, 3, pp. 503-528 (1989).

Nocedal, J., Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35, pp. 773-782 (1980).


Appendix

Appendix 1) Opinion type

(+: Positive, -: Negative)

Appendix 2) Internal processing (extract.sh)

Appendix 3) Environment variables in conf.sh

Appendix 4) Polarity dictionary

Format

Both dictionary.dic and reverse.dic have the same format. Each is a text file (EUC-JP) in which every line contains the following three tab-separated values.

Example:

Direction word        Polarity    Morphemes
良い                  +           良い
正統派                +           正統だ
やさしさ              +           やさしい
内分泌攪乱化学物質    -           内分泌 攪乱 化学 物質
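
A dictionary in this format can be read or checked with a few lines of code. The following Python sketch is an illustration only. It assumes that the file contains data lines only (the header row in the example above is taken to be a table caption, not part of the file) and that the morphemes in the third column are separated by spaces, as in the last example entry. The function names are hypothetical.

```python
def parse_dictionary_line(line):
    """Split one dictionary line into (word, polarity, morphemes)."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < 3:
        raise ValueError("expected three tab-separated values: %r" % line)
    word, polarity, morphemes = parts[0], parts[1], parts[2]
    # The third column may hold several space-separated morphemes.
    return word.strip(), polarity.strip(), morphemes.split()


def load_dictionary(path, encoding="euc_jp"):
    """Read a dictionary file (EUC-JP by default), skipping blank lines."""
    entries = []
    with open(path, encoding=encoding) as f:
        for line in f:
            if line.strip():
                entries.append(parse_dictionary_line(line))
    return entries


if __name__ == "__main__":
    print(parse_dictionary_line("やさしさ\t+\tやさしい\n"))
```

Running load_dictionary on a newly written dictionary before training is a simple way to catch lines with missing columns or a wrong encoding.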

Reverse.dic

The file reverse.dic is a set of words that reverse the polarity of the entire sentence, for example "prevent" in the following sentence.

While "cancer" is a negative noun, the verb "prevent" reverses the polarity, so the whole sentence has positive polarity. Words of this kind are collected in reverse.dic.
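
The idea of polarity reversal can be illustrated with a toy example. The following Python sketch is not the tool's algorithm: the tool uses a statistical classifier, and how it uses the dictionaries internally is not described here. The sketch simply assumes that each reverse word flips the polarity once; the function name and the English tokens are hypothetical.

```python
def phrase_polarity(tokens, polarity_dict, reverse_words):
    """Toy rule: take the first polar word, then flip once per reverse word."""
    polarity = ""
    reversals = 0
    for token in tokens:
        if token in reverse_words:
            reversals += 1
        elif token in polarity_dict and not polarity:
            polarity = polarity_dict[token]
    # An odd number of reverse words flips the polarity; an even number
    # leaves it unchanged. An empty polarity stays empty.
    if reversals % 2 == 1:
        polarity = {"+": "-", "-": "+"}.get(polarity, polarity)
    return polarity


if __name__ == "__main__":
    print(phrase_polarity(["prevent", "cancer"], {"cancer": "-"}, {"prevent"}))
```

With "cancer" listed as negative and "prevent" as a reverse word, the phrase is judged positive, which matches the explanation above.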

Location

Both dictionary.dic and reverse.dic are located in the directory "extractopinion-1.1/dic" by default. The location can be changed by editing the environment variable "dic" in conf.sh.

Appendix 5) Corpus for machine learning (csv)

The corpus for machine learning has the following format. Values are separated by commas, and the character encoding is Shift-JIS.

  • Example of an annotation for a given sentence "京都は美しい (Kyoto is beautiful)."
    Sentence ID   Sentence       Opinion holder   Opinion expression   Opinion type   Target of the opinion
    example-1     京都は美しい   [著者]           美しい               批評+          京都
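
A corpus file in this format can be written with a standard CSV library. The following Python sketch is an illustration only. It assumes six comma-separated columns in the order shown in the example and writes no header row, because this document does not state whether a header row is expected. The function name write_corpus and the output file name are hypothetical.

```python
import csv

# Column order taken from the example above.
COLUMNS = ["Sentence ID", "Sentence", "Opinion holder",
           "Opinion expression", "Opinion type", "Target of the opinion"]


def write_corpus(path, rows, encoding="shift_jis"):
    """Write annotation rows as comma-separated values in Shift-JIS."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if len(row) != len(COLUMNS):
                raise ValueError("expected %d values, got %d"
                                 % (len(COLUMNS), len(row)))
            writer.writerow(row)


if __name__ == "__main__":
    write_corpus("example.csv", [
        ["example-1", "京都は美しい", "[著者]", "美しい", "批評+", "京都"],
    ])
```

Using the csv module rather than joining strings by hand ensures that sentences containing commas or quotation marks are quoted correctly.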

    Copyright 2011 Information Analysis Laboratory
    National Institute of Information and Communications Technology