This tool was developed by the Information Credibility Project, Knowledge Clustered Group, National Institute of Information and Communications Technology (NICT), Japan*1. Using machine learning techniques, it judges whether a given Japanese sentence expresses an opinion, evaluation, or proposal (henceforth collectively called "opinions"). For each opinion it finds, the tool outputs the opinion expression, the opinion type, the polarity, and the opinion holder.
For example, the following sentence is analysed as below.
In the case above, "rich in vitamins" is the opinion expression. Opinion expressions are classified by meaning into one of the following types: emotion (emotional opinion expressions), evaluation (subjective but not emotional expressions), merit (relatively objective opinion expressions), etc. In the above example, the opinion type of "rich in vitamins" is "merit" because it is a relatively objective opinion expression (see Appendix 1 for details). The polarity indicates whether the opinion expression has a positive or a negative meaning (some opinion types have no polarity). The opinion holder is the person or organization that asserts the opinion expression. In this case, the opinion holder is the author of the sentence.

Detailed information is given in the annotation manual included with this tool (spec.pdf). An opinion corpus was constructed by hand according to this specification and used to train the opinion mining tool. Note that, because the tool is trained by statistical machine learning, its output is not guaranteed to conform completely to the specification, although it is designed to do so.
This tool extracts and classifies information by using machine learning. Please note the following:
Bearing the above in mind, we do not guarantee the reliability of the outputs of this tool and accept no responsibility for any disadvantage or damage caused by its use. See Copyright & License.
% tar zxvf extractopinion-1.1.tar.gz
% cd extractopinion-1.1/
% cd svmtools/
% make clean ; make
% cd ../pol/
% make clean ; make
Prepare an input file. The input file must be encoded in UTF-8 and contain one Japanese sentence per line. The following explanation uses "sample.txt" as the input file name.
Example (sample.txt) ("↵" means line break: \n) :
ほうれん草はビタミンが豊富だ。↵
京都は日本にある。↵
商品Aは良くない。↵
太郎は学校に行くべきだ。↵
道州制は国の一律の規制が解かれ地域経済の活性化が図られるので、商機が拡大すると考えられる。↵
English translation:
Spinach is rich in vitamins.↵
Kyoto is in Japan.↵
Product "A" is not good.↵
Taro should go to school.↵
The Regional system increases the opportunity for trade because it eases regulations and activates local economy.↵
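The input format described above (UTF-8, one sentence per line, each line ended by \n) can be produced with a short script. The following is a minimal Python sketch, not part of the tool; the function name write_input_file is hypothetical, and the sentences are the five sample sentences shown above.

```python
# The five sample sentences from the example above.
SENTENCES = [
    "ほうれん草はビタミンが豊富だ。",
    "京都は日本にある。",
    "商品Aは良くない。",
    "太郎は学校に行くべきだ。",
    "道州制は国の一律の規制が解かれ地域経済の活性化が図られるので、商機が拡大すると考えられる。",
]


def write_input_file(path, sentences):
    """Write one sentence per line, UTF-8 encoded, each ended by a line feed."""
    for sentence in sentences:
        if "\n" in sentence or "\r" in sentence:
            raise ValueError("a sentence must not contain a line break")
    data = "".join(sentence + "\n" for sentence in sentences)
    # newline="" keeps "\n" as-is instead of translating it on some platforms.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    write_input_file("sample.txt", SENTENCES)
```

Rejecting embedded line breaks keeps the one-sentence-per-line rule intact, since a stray line break would otherwise be read as two sentences.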
Execute the following command with the input file name as the first argument (see Appendix 2).
% cd extractopinion-1.1/
% ./extract.sh sample.txt
The command processes each sentence of the input file and outputs the following fields:
Field | Description |
---|---|
<Document ID> | Input file name |
<Sentence ID> | ID of the sentence in the input file (integer) |
<Opinion holder> | String (UTF-8) |
<Opinion type and polarity> | String (UTF-8). The polarity is empty for types that have no polarity (Deontic/Demand) |
<Opinion expression> | String (UTF-8) |
Example of output:
sample.txt 1 [著者] メリット+ ビタミンが豊富だ。
sample.txt 2
sample.txt 3 [著者] 批評− 良くない。
sample.txt 4 [著者] 当為 学校に行くべきだ。
sample.txt 5 [著者] メリット+ 地域経済の活性化が図られるので、
sample.txt 5 [著者] メリット+ 商機が拡大すると考えられる。
English translation:
sample.txt 1 [Author] Merit+ rich in vitamins.
sample.txt 2
sample.txt 3 [Author] Evaluation- not good.
sample.txt 4 [Author] Deontic should go to school.
sample.txt 5 [Author] Merit+ activates local economy
sample.txt 5 [Author] Merit+ increases the opportunity for trade.
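For downstream processing, each output line can be split back into its five fields. The sketch below is one possible way to do this in Python and is not part of the tool. It assumes that fields are separated by whitespace (tabs or spaces) and that the opinion holder contains no whitespace; neither is confirmed by this document. The names Opinion and parse_output_line are hypothetical.

```python
from typing import NamedTuple, Optional


class Opinion(NamedTuple):
    document_id: str
    sentence_id: int
    holder: Optional[str]        # None when the sentence has no opinion
    opinion_type: Optional[str]  # type label without the polarity sign
    polarity: Optional[str]      # "+", "-", or None for types without polarity
    expression: Optional[str]


def parse_output_line(line):
    """Parse one output line into an Opinion record."""
    # Assumption: whitespace-separated fields; the expression is the last
    # field and is kept whole even if it contains spaces.
    fields = line.strip().split(None, 4)
    if len(fields) == 2:
        # A sentence without any opinion, e.g. "sample.txt 2".
        return Opinion(fields[0], int(fields[1]), None, None, None, None)
    if len(fields) != 5:
        raise ValueError("unexpected number of fields: %r" % (line,))
    document_id, sentence_id, holder, type_and_polarity, expression = fields
    sign = type_and_polarity[-1]
    # The sample output uses both ASCII "-" and the minus sign U+2212.
    if sign in ("+", "-", "\u2212"):
        opinion_type = type_and_polarity[:-1]
        polarity = "+" if sign == "+" else "-"
    else:
        opinion_type, polarity = type_and_polarity, None
    return Opinion(document_id, int(sentence_id), holder,
                   opinion_type, polarity, expression)
```

With this sketch, a sentence that yields two opinions (such as sentence 5 above) simply produces two records with the same sentence ID.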
It is possible to generate model data by constructing an annotated corpus for machine learning. For the generation, follow the procedure below.
% cpan
cpan> install Text::CSV_XS
% cd extractopinion-1.1/makemodel/csv/
% ./csv2tsv.sh
% cd extractopinion-1.1/
% ./makemdl.sh
File or directory | Description | |
---|---|---|
readme.{html/utf.txt} | This document | |
spec.pdf | Opinion manual annotation specification *1 | |
pol/ | Polarity classification module *2 | |
src/ | Opinion holder extraction module | |
typ/ | Opinion type classification module | |
xpr/ | Opinion expression extraction module | |
svmtools/ | SVM directory | |
lib/ | Common functions | |
dic/ | Polarity dictionary | |
dictionary.dic | Polarity dictionary (see Appendix 4) | |
reverse.dic | Reverse expression dictionary (see Appendix 4) | |
modeldata/sample/ | Sample model data | |
model.pol_mdl | Model data (polarity classification) | |
model.src_crfmdl | Model data (Opinion holder extraction) | |
model.src_ft | Model data (Opinion holder extraction) | |
model.src_svmmdl | Model data (Opinion holder extraction) | |
model.typ_ft | Model data (opinion type classification) | |
model.typ_mdl | Model data (opinion type classification) | |
model.xpr_mdl | Model data (opinion expression extraction) | |
sample.txt | Sample input file | |
extract.sh | Script for extracting opinion | |
_extract.sh | Internal script for extracting opinion | |
conf.sh | Configuration file (see Appendix 3) | |
makemdl.sh | Script for generating model data | |
_train.sh | Internal script for generating model data | |
makemodel/ | Directory for generating model data | |
model/ | Model data | |
csv/ | Corpus for machine learning (see Appendix 5) | |
tsv/ | TSV files used as input to makemdl.sh | |
Nakagawa, T., Inui, K., and Kurohashi, S., Dependency tree-based sentiment classification using CRFs with hidden variables, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786-794 (2010).
Nakagawa, T., Inui, K., and Kurohashi, S., Sentiment Classification using Conditional Random Fields with Hidden Variables, IPSJ SIG Technical Report 2009-NL-192, pp. 1-7 (2009) (in Japanese).
Nakagawa, T., Kawada, T., Inui, K., and Kurohashi, S., Extracting Subjective and Objective Evaluative Expressions from the Web, In Proceedings of the Second International Symposium on Universal Communication (ISUC 2008), pp. 251-258 (2008).
Kawada, T., Nakagawa, T., Morii, R., Miyamori, H., Akamine, S., Inui, K., Kurohashi, S., and Kidawara, Y., Constructing evaluative information corpus on the Web, In Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pp. 524-527 (2008) (in Japanese).
Liu, D.C., and Nocedal, J., On the Limited Memory BFGS Method for Large Scale Optimization, Mathematical Programming B, 45, 3, pp. 503-528 (1989).
Nocedal, J., Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35, pp. 773-782 (1980).
(+: Positive, -: Negative)
Both dictionary.dic and reverse.dic have the same format. Each dictionary is a text file encoded in EUC-JP. Each entry consists of the following three tab-separated values.
Example:
Direction word | Polarity | Morphemes | |||
---|---|---|---|---|---|
良い | + | 良い | |||
正統派 | + | 正統だ | 派 | ||
やさしさ | + | やさしい | さ | ||
内分泌攪乱化学物質 | - | 内分泌 | 攪乱 | 化学 | 物質 |
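The dictionary format above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's own loader: it assumes one entry per line, and, because this document does not say how the morphemes inside the third value are delimited, it treats everything after the polarity (split on tabs and spaces) as the morpheme list. The names parse_dictionary_line and load_dictionary are hypothetical.

```python
def parse_dictionary_line(line):
    """Split one entry into (word, polarity, morphemes)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least three tab-separated values")
    word, polarity = fields[0], fields[1]
    # Assumption: morphemes may be separated by tabs or by spaces.
    morphemes = [m for field in fields[2:] for m in field.split()]
    return word, polarity, morphemes


def load_dictionary(path):
    """Read an EUC-JP dictionary file into {word: (polarity, morphemes)}."""
    entries = {}
    with open(path, encoding="euc_jp") as f:
        for line in f:
            if line.strip():  # skip blank lines
                word, polarity, morphemes = parse_dictionary_line(line)
                entries[word] = (polarity, morphemes)
    return entries
```

For example, load_dictionary("dic/dictionary.dic") would return a mapping keyed by the first value of each entry, provided the file follows the assumptions above.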
The file reverse.dic is a set of words that reverse the polarity of the entire sentence, for example "prevent" in the following sentence.
While "cancer" is a negative noun, the verb "prevent" reverses the polarity, so the whole sentence has positive polarity. Words of this kind are collected in reverse.dic.
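The effect of a reversing word can be illustrated with a toy rule. The sketch below is only an illustration of the idea and is not how the tool decides polarity (the tool uses a trained classifier). The function names, the "last polar word wins" rule, and the English word lists are all hypothetical.

```python
def flip(polarity):
    """Return the opposite polarity sign."""
    return {"+": "-", "-": "+"}[polarity]


def toy_sentence_polarity(words, polarity_words, reversing_words):
    """Return "+", "-", or None for a tokenised sentence (toy rule only)."""
    polarity = None
    for word in words:
        if word in polarity_words:
            polarity = polarity_words[word]  # toy rule: last polar word wins
    if polarity is None:
        return None
    # Each reversing word flips the polarity once.
    reversals = sum(1 for word in words if word in reversing_words)
    return flip(polarity) if reversals % 2 else polarity


if __name__ == "__main__":
    polarity_words = {"cancer": "-"}   # hypothetical dictionary.dic entry
    reversing_words = {"prevent"}      # hypothetical reverse.dic entry
    print(toy_sentence_polarity(["prevent", "cancer"],
                                polarity_words, reversing_words))
```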
By default, both dictionary.dic and reverse.dic are located in the directory "extractopinion-1.1/dic". The location can be changed by editing the variable "dic" in conf.sh.
The corpus for machine learning has the following format. Values are separated by commas and the character encoding is Shift-JIS.
Sentence ID | Sentence | Opinion holder | Opinion expression | Opinion type | Target of the opinion |
---|---|---|---|---|---|
example-1 | 京都は美しい | [著者] | 美しい | 批評+ | 京都 |
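A corpus file in this format can be checked before it is converted with csv2tsv.sh. The following Python sketch is an assumption-laden illustration, not part of the tool: it assumes the file has no header row and exactly the six columns shown above, and the names FIELDS and read_corpus are hypothetical.

```python
import csv
import io

# Column names chosen for this sketch, in the order of the table above.
FIELDS = ["sentence_id", "sentence", "holder",
          "expression", "opinion_type", "target"]


def read_corpus(raw_bytes):
    """Decode Shift-JIS CSV bytes into a list of dicts, one per row."""
    text = raw_bytes.decode("shift_jis")
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        if len(record) != len(FIELDS):
            raise ValueError("expected %d columns, got %d: %r"
                             % (len(FIELDS), len(record), record))
        rows.append(dict(zip(FIELDS, record)))
    return rows
```

Reading the file as bytes (for example with open(path, "rb").read()) and decoding explicitly makes an encoding mismatch fail loudly rather than silently producing garbled text.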
Copyright 2011 Information Analysis Laboratory
National Institute of Information and Communications Technology