Using the answerability-annotated reading comprehension dataset (解答可能性付き読解データセット) released by the Inui-Suzuki Laboratory at Tohoku University, we compared the two Japanese BERT models released by NICT against publicly available Japanese BERT models. Each of the dataset's 56,651 question-answer-document triples is annotated with a score indicating whether the question can be answered by reading the document; we extracted gold answers from the triples with a score of 2 or higher and treated all other triples as unanswerable, so the task is to identify the answer word sequence in the document for a given question. As in the reference, results are compared on two evaluation metrics: exact match (EM), the proportion of predictions that exactly match the gold answer, and F1, the average per-example F1 score computed from the recall and precision of the predicted word sequence against the gold answer. (Note, however, that details of the experimental setup, such as the train/dev/test split, do not necessarily match the reference.)
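For reference, the sketch below shows how SQuAD-style EM and token-level F1 can be computed for a single example; the scores reported in this notebook are produced by the run_squad.py evaluation script used later, not by this snippet.

from collections import Counter

def exact_match(prediction, gold):
    # 1 if the predicted string matches the gold answer exactly, else 0
    return int(prediction == gold)

def token_f1(prediction, gold):
    # F1 over the multiset of whitespace-separated tokens
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("東京 タワー", "東京 タワー"), token_f1("東京 タワー", "東京"))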
The six existing pre-trained BERT models used for comparison with the models released by NICT are those listed in the results table below.
The comparison experiments use Hugging Face's Transformers (https://github.com/huggingface/transformers); for the fine-tuning procedure and other details, see the experimental steps below. The results are summarized in the following table, which shows that the BERT models released by NICT obtain high EM and F1 scores compared with the six existing Japanese BERT models.
| Model | EM | F1 |
|---|---|---|
| NICT BERT 日本語 Pre-trained モデル (without BPE) | 76.42 | 77.75 |
| NICT BERT 日本語 Pre-trained モデル (with BPE) | 77.92 | 79.49 |
| BERT-Base, Multilingual Cased (Google AI Research) | 70.10 | 70.16 |
| BERT 日本語 Pretrained モデル, BASE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory) | 73.89 | 75.65 |
| BERT 日本語 Pretrained モデル, LARGE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory) | 75.79 | 77.49 |
| Pretrained Japanese BERT models, MeCab + WordPiece, WWM (Tohoku University Inui-Suzuki Laboratory) | 77.68 | 78.87 |
| BERT with SentencePiece for Japanese text (Yohei Kikuta) | 73.66 | 76.83 |
| hottoSNS-BERT (Hottolink, Inc.) | 61.14 | 64.93 |
The following programs must be installed in advance (a quick check is sketched after this list):
- MeCab (0.996) and the JUMAN dictionary (7.0-20130310)
  - The JUMAN dictionary must be installed with the --with-charset=utf-8 option
- Juman++ (v2.0.0-rc2)
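A quick check that these prerequisites are in place can look like the following sketch; it only locates MeCab's dictionary directory via mecab-config and confirms that jumanpp is on PATH, while the dataset-creation cell later performs stricter checks (UTF-8 charset, JUMAN dictionary presence).

import shutil
import subprocess

# MeCab's dictionary directory should contain a UTF-8 JUMAN dictionary
# (juman-utf8 or jumandic, depending on how it was installed).
dicdir = subprocess.run(["mecab-config", "--dicdir"], check=True,
                        stdout=subprocess.PIPE, text=True).stdout.rstrip()
print("MeCab dictionary directory:", dicdir)

# Juman++ must be on PATH for the pyknp-based tokenization used below.
assert shutil.which("jumanpp") is not None, "jumanpp not found on PATH"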
Running this notebook also requires several Python libraries. On a Linux system where NVIDIA CUDA Toolkit 10.0 and Anaconda are available, a Conda virtual environment can be created as follows (the environment is named pt131 here):
conda create -n pt131 python=3.7 pytorch=1.3.1 tensorflow=1.15.0 cudatoolkit=10.0 cudnn jupyter pandas tqdm requests boto3 filelock regex gcc_linux-64 gxx_linux-64
conda activate pt131
pip install --no-cache-dir transformers==2.4.1 mecab-python3 mojimoji pyknp
git clone -q https://github.com/NVIDIA/apex
cd apex
git reset --hard 3ae89c754d945e407a6674aa2006d5a0e35d540e
CUDA_HOME=/usr/local/cuda-10.0 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
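Once the environment is created, a minimal check such as the following can confirm that PyTorch sees a GPU, that the pinned Transformers version is installed, and that Apex imports cleanly; this is a convenience sketch, not part of the original procedure.

import torch
import transformers

print("PyTorch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)  # expected: 2.4.1

try:
    from apex import amp  # required by run_squad.py when --fp16 is passed
    print("Apex AMP available")
except ImportError:
    print("Apex not installed; remove the --fp16 options from the training command")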
%%bash
mkdir -p data
if [ \! -f data/all-v1.0.json.gz ]; then
wget -P data 'http://www.cl.ecei.tohoku.ac.jp/rcqa/data/all-v1.0.json.gz'
fi
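To get a feel for the data format, the sketch below reads the first record of all-v1.0.json.gz and prints the fields used in this notebook (qid, timestamp, question, answer, and the per-document text and answerability score).

import gzip
import json

# Peek at the first record of the RCQA dataset (illustrative only).
with gzip.open("data/all-v1.0.json.gz", "rt", encoding="utf-8") as fp:
    record = json.loads(next(fp))

print(record["qid"], record["timestamp"])
print("Q:", record["question"])
print("A:", record["answer"])
for document in record["documents"][:2]:
    print("score:", document["score"], "text:", document["text"][:40], "...")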
%%bash
mkdir -p src
if [ \! -f src/run_squad.py ]; then
wget -P src 'https://github.com/huggingface/transformers/raw/v2.4.1/examples/run_squad.py'
fi
%%bash
mkdir -p models
%%bash
# NICT BERT 日本語 Pre-trained モデル
ls models/NICT_BERT-base_JapaneseWikipedia_100K.zip
ls models/NICT_BERT-base_JapaneseWikipedia_32K_BPE.zip
# BERT-Base, Multilingual Cased (Google AI Research)
ls models/multi_cased_L-12_H-768_A-12.zip
# BERT 日本語 Pretrained モデル, BASE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory)
ls models/Japanese_L-12_H-768_A-12_E-30_BPE_WWM_transformers.zip
# BERT 日本語 Pretrained モデル, LARGE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory)
ls models/Japanese_L-24_H-1024_A-16_E-30_BPE_WWM_transformers.zip
# Pretrained Japanese BERT models, MeCab + WordPiece, Whole Word Masking (Tohoku University Inui-Suzuki Laboratory)
ls models/BERT-base_mecab-ipadic-bpe-32k_whole-word-mask.tar.xz
# BERT with SentencePiece for Japanese text (Yohei Kikuta)
ls models/bert-wiki-ja/wiki-ja.model \
models/bert-wiki-ja/wiki-ja.vocab \
models/bert-wiki-ja/model.ckpt-1400000.meta \
models/bert-wiki-ja/model.ckpt-1400000.index \
models/bert-wiki-ja/model.ckpt-1400000.data-00000-of-00001
# hottoSNS-BERT (Hottolink, Inc.)
ls models/hottoSNS-bert_20190311/bert_config.json \
models/hottoSNS-bert_20190311/tokenizer_spm_32K.model \
models/hottoSNS-bert_20190311/tokenizer_spm_32K.vocab.to.bert \
models/hottoSNS-bert_20190311/model.ckpt-1000000.meta \
models/hottoSNS-bert_20190311/model.ckpt-1000000.index \
models/hottoSNS-bert_20190311/model.ckpt-1000000.data-00000-of-00001
%%bash
# NICT BERT 日本語 Pre-trained モデル
if [ \! -d models/NICT_BERT-base_JapaneseWikipedia_100K ]; then
unzip models/NICT_BERT-base_JapaneseWikipedia_100K.zip -d models
fi
if [ \! -d models/NICT_BERT-base_JapaneseWikipedia_32K_BPE ]; then
unzip models/NICT_BERT-base_JapaneseWikipedia_32K_BPE.zip -d models
fi
# BERT-Base, Multilingual Cased (Google AI Research)
if [ \! -d models/multi_cased_L-12_H-768_A-12 ]; then
unzip models/multi_cased_L-12_H-768_A-12.zip -d models
fi
if [ \! -f models/multi_cased_L-12_H-768_A-12/pytorch_model.bin ]; then
transformers-cli convert \
--model_type bert \
--tf_checkpoint models/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
--config models/multi_cased_L-12_H-768_A-12/bert_config.json \
--pytorch_dump_output models/multi_cased_L-12_H-768_A-12/pytorch_model.bin
fi
if [ \! -L models/multi_cased_L-12_H-768_A-12/config.json ]; then
ln -s bert_config.json models/multi_cased_L-12_H-768_A-12/config.json
fi
# BERT 日本語 Pretrained モデル, BASE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory)
if [ \! -d models/Japanese_L-12_H-768_A-12_E-30_BPE_WWM_transformers ]; then
unzip models/Japanese_L-12_H-768_A-12_E-30_BPE_WWM_transformers.zip -d models
fi
# BERT 日本語 Pretrained モデル, LARGE WWM version (Kyoto University Kurohashi-Kawahara-Murawaki Laboratory)
if [ \! -d models/Japanese_L-24_H-1024_A-16_E-30_BPE_WWM_transformers ]; then
unzip models/Japanese_L-24_H-1024_A-16_E-30_BPE_WWM_transformers.zip -d models
fi
# Pretrained Japanese BERT models, MeCab + WordPiece, Whole Word Masking (Tohoku University Inui-Suzuki Laboratory)
if [ \! -d models/BERT-base_mecab-ipadic-bpe-32k_whole-word-mask ]; then
tar Jxf models/BERT-base_mecab-ipadic-bpe-32k_whole-word-mask.tar.xz -C models
fi
if [ \! -f models/BERT-base_mecab-ipadic-bpe-32k_whole-word-mask/tokenizer_config.json ]; then
echo '{"do_lower_case": false, "tokenize_chinese_chars": false}' \
> models/BERT-base_mecab-ipadic-bpe-32k_whole-word-mask/tokenizer_config.json
fi
# BERT with SentencePiece for Japanese text (Yohei Kikuta)
if [ \! -f models/bert-wiki-ja/config.json ]; then
echo '{"vocab_size": 32000}' > models/bert-wiki-ja/config.json
fi
if [ \! -f models/bert-wiki-ja/pytorch_model.bin ]; then
transformers-cli convert \
--model_type bert \
--tf_checkpoint models/bert-wiki-ja/model.ckpt-1400000 \
--config models/bert-wiki-ja/config.json \
--pytorch_dump_output models/bert-wiki-ja/pytorch_model.bin
fi
if [ \! -f models/bert-wiki-ja/special_tokens_map.json ]; then
echo '{"unk_token": "<unk>"}' > models/bert-wiki-ja/special_tokens_map.json
fi
if [ \! -f models/bert-wiki-ja/tokenizer_config.json ]; then
# set do_lower_case=False here and lower-case manually during dataset creation, since we don't want _run_strip_accents
echo '{"do_lower_case": false, "tokenize_chinese_chars": false}' > models/bert-wiki-ja/tokenizer_config.json
fi
if [ \! -f models/bert-wiki-ja/vocab.txt ]; then
cut -f1 models/bert-wiki-ja/wiki-ja.vocab > models/bert-wiki-ja/vocab.txt
fi
# hottoSNS-BERT (Hottolink, Inc.)
if [ \! -f models/hottoSNS-bert_20190311/pytorch_model.bin ]; then
transformers-cli convert \
--model_type bert \
--tf_checkpoint models/hottoSNS-bert_20190311/model.ckpt-1000000 \
--config models/hottoSNS-bert_20190311/bert_config.json \
--pytorch_dump_output models/hottoSNS-bert_20190311/pytorch_model.bin
fi
if [ \! -L models/hottoSNS-bert_20190311/config.json ]; then
ln -s bert_config.json models/hottoSNS-bert_20190311/config.json
fi
if [ \! -f models/hottoSNS-bert_20190311/special_tokens_map.json ]; then
echo '{"unk_token": "<unk>", "pad_token": "<pad>"}' > models/hottoSNS-bert_20190311/special_tokens_map.json
fi
if [ \! -f models/hottoSNS-bert_20190311/tokenizer_config.json ]; then
# set do_lower_case=False here and lower-case manually during dataset creation, since we don't want _run_strip_accents
echo '{"do_lower_case": false, "tokenize_chinese_chars": false}' > models/hottoSNS-bert_20190311/tokenizer_config.json
fi
if [ \! -L models/hottoSNS-bert_20190311/vocab.txt ]; then
ln -s tokenizer_spm_32K.vocab.to.bert models/hottoSNS-bert_20190311/vocab.txt
fi
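Before launching the long fine-tuning jobs, an optional load test like the one below can confirm that a converted model directory is readable with Transformers 2.4.1 (the directory name is one of those prepared above; any of the others can be substituted).

from transformers import BertConfig, BertModel, BertTokenizer

# Optional load test for one of the prepared model directories (illustrative only).
model_dir = "models/NICT_BERT-base_JapaneseWikipedia_32K_BPE"
config = BertConfig.from_pretrained(model_dir)
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir, config=config)
print("vocab size:", config.vocab_size,
      "parameters:", sum(p.numel() for p in model.parameters()))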
import gzip
import json
import os
import subprocess
import unicodedata
import MeCab
import mojimoji
import pyknp
import sentencepiece as spm
dicdir = subprocess.run(["mecab-config", "--dicdir"], check=True, stdout=subprocess.PIPE, text=True).stdout.rstrip()
jumandic_dir = ([d for d in [f"{dicdir}/juman-utf8", f"{dicdir}/jumandic"] if os.path.exists(d)] + [None])[0]
assert " " not in dicdir and jumandic_dir, "Please install mecab-jumandic"
tagger_jumandic = MeCab.Tagger(f"-Owakati -d{jumandic_dir}")
assert tagger_jumandic.dictionary_info().charset == "utf-8"
tagger_ipadic = MeCab.Tagger("-Owakati")
spm_bert_wiki_ja = spm.SentencePieceProcessor()
spm_bert_wiki_ja.Load("models/bert-wiki-ja/wiki-ja.model")
spm_hottoSNS_bert = spm.SentencePieceProcessor()
spm_hottoSNS_bert.Load("models/hottoSNS-bert_20190311/tokenizer_spm_32K.model")
jumanpp = pyknp.Juman("jumanpp")
# Load the RCQA data, keep only questions with at least one retrieved document,
# and split into train/dev/test by question timestamp.
dataset = []
with gzip.open("data/all-v1.0.json.gz", "rt", encoding="utf-8") as fp:
for line in fp:
data = json.loads(line)
if data["documents"]:
dataset.append(data)
train_dataset = [data for data in dataset if data["timestamp"] < "2009"]
dev_dataset = [data for data in dataset if "2009" <= data["timestamp"] < "2010"]
test_dataset = [data for data in dataset if "2010" <= data["timestamp"]]
os.makedirs("data/dataset_train", exist_ok=True)
os.makedirs("data/dataset_eval", exist_ok=True)
# Create SQuAD 2.0-format JSON files for each tokenization scheme (suffix = tokenizer name).
for suffix, tokenize_func in (
("text", lambda x: x),
("jumandic", lambda x: tagger_jumandic.parse(mojimoji.han_to_zen(x).replace("\u3000", " ")).rstrip("\n")),
("jumanpp", lambda x: " ".join(mrph.midasi for mrph in jumanpp.analysis(
mojimoji.han_to_zen(x).replace("\u3000", " ").replace("\n", " ")).mrph_list() if mrph.midasi != "\\ ")),
("ipadic", lambda x: tagger_ipadic.parse(unicodedata.normalize("NFKC", x)).rstrip("\n")),
("bert-wiki-ja", lambda x: " ".join(spm_bert_wiki_ja.EncodeAsPieces(x.lower()))),
("hottoSNS-bert", lambda x: " ".join(spm_hottoSNS_bert.EncodeAsPieces(x.lower())))):
for filename, datasplit in (
(f"data/dataset_train/train-v1.0.{suffix}.json", train_dataset),
(f"data/dataset_train/dev-v1.0.{suffix}.json", dev_dataset),
(f"data/dataset_eval/test-v1.0.{suffix}.json", test_dataset)):
entries = []
for data in datasplit:
for i, document in enumerate(data["documents"]):
q_id = "{}{:04d}".format(data["qid"], i + 1)
question = tokenize_func(data["question"])
answer = "".join(ch for ch in tokenize_func(data["answer"]) if not ch.isspace() and ch != "▁")
context = tokenize_func(document["text"])
is_impossible = document["score"] < 2
if not is_impossible:
context_strip, offsets = zip(*[(ch, ptr) for ptr, ch in enumerate(context) if not ch.isspace() and ch != "▁"])
idx = "".join(context_strip).index(answer)
answer_start, answer_end = offsets[idx], offsets[idx + len(answer) - 1]
answer = context[answer_start:answer_end + 1]
entries.append({"title": q_id, "paragraphs": [{"context": context, "qas": [
{"id": q_id, "question": question, "is_impossible": is_impossible,
"answers": [{"text": answer, "answer_start": answer_start}] if not is_impossible else []}]}]})
with open(filename, "w", encoding="utf-8") as fp:
json.dump({"data": entries}, fp)
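As an optional sanity check of the generated files, the sketch below loads one of the SQuAD 2.0-format training files (the jumandic variant is used here; any suffix works) and verifies that every recorded answer span actually occurs at its offset in the context.

import json

# Optional: sanity-check one generated SQuAD 2.0-format file (illustrative only).
with open("data/dataset_train/train-v1.0.jumandic.json", encoding="utf-8") as fp:
    squad = json.load(fp)

n_total = n_answerable = 0
for entry in squad["data"]:
    for paragraph in entry["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            n_total += 1
            if not qa["is_impossible"]:
                n_answerable += 1
                answer = qa["answers"][0]
                start = answer["answer_start"]
                assert context[start:start + len(answer["text"])] == answer["text"]

print(f"{n_answerable}/{n_total} questions answerable; all answer offsets consistent")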
%%bash
declare -A MODELS
MODELS=(
["NICT_BERT-base_JapaneseWikipedia_100K"]="jumandic"
["NICT_BERT-base_JapaneseWikipedia_32K_BPE"]="jumandic"
["multi_cased_L-12_H-768_A-12"]="text"
["Japanese_L-12_H-768_A-12_E-30_BPE_WWM_transformers"]="jumanpp"
["Japanese_L-24_H-1024_A-16_E-30_BPE_WWM_transformers"]="jumanpp"
["BERT-base_mecab-ipadic-bpe-32k_whole-word-mask"]="ipadic"
["bert-wiki-ja"]="bert-wiki-ja"
["hottoSNS-bert_20190311"]="hottoSNS-bert"
)
# Hyperparameters from Appendix A.3, Devlin et al., 2019
BATCHES=(16 32)
LRS=(5e-5 3e-5 2e-5)
EPOCHS=(2 3 4)
command='( \
python3 \
src/run_squad.py \
--model_type bert \
--model_name_or_path models/${MODEL} \
--output_dir expr/train_output_${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH} \
--data_dir data/dataset_train \
--train_file train-v1.0.${SUFFIX}.json \
--predict_file dev-v1.0.${SUFFIX}.json \
--overwrite_cache \
--version_2_with_negative \
--do_train \
--do_eval \
--gradient_accumulation_steps $((BATCH / 8)) \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 32 \
--learning_rate ${LR} \
--num_train_epochs ${EPOCH} \
--save_steps 10000 \
--fp16 \
--fp16_opt_level O2 \
&& \
python3 \
src/run_squad.py \
--model_type bert \
--model_name_or_path expr/train_output_${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH} \
--output_dir expr/eval_output_${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH} \
--data_dir data/dataset_eval \
--predict_file test-v1.0.${SUFFIX}.json \
--overwrite_cache \
--version_2_with_negative \
--do_eval \
--per_gpu_eval_batch_size 32 \
)'
mkdir -p expr
/bin/true > expr/jobs.txt
for MODEL in ${!MODELS[@]}; do
SUFFIX=${MODELS[$MODEL]}
for BATCH in ${BATCHES[@]}; do
for LR in ${LRS[@]}; do
for EPOCH in ${EPOCHS[@]}; do
export MODEL SUFFIX BATCH LR EPOCH
if /opt/pbs/bin/qstat -q kad > /dev/null 2>&1 ; then
# for NICT cluster environment
echo '(cd "${PBS_O_WORKDIR}" && '"${command}"')' | /opt/pbs/bin/qsub -q kad -kdoe -joe \
-v PATH,MODEL,SUFFIX,BATCH,LR,EPOCH \
-l select=1:ngpus=1:gpumodel=P100 -l walltime=16:00:00 \
-N ${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH} \
-o expr/${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH}.log \
>> expr/jobs.txt
else
# for general environment
script -c "${command}" expr/${MODEL}_batch${BATCH}_lr${LR}_epochs${EPOCH}.log
fi
done
done
done
done
if /opt/pbs/bin/qstat -q kad > /dev/null 2>&1; then
# for NICT cluster environment, wait for experiments
/opt/pbs/bin/qsub -q kad -Roe -z \
-W block=true -W depend=$(perl -0pe 's/^/afterok:/mg; s/\n/,/g; chop;' expr/jobs.txt) -- /bin/true
fi
import glob
from collections import namedtuple
PairedResult = namedtuple("PairedResult", ("filename", "dev", "test"))
Result = namedtuple("Result", (
"exact", "f1", "total", "HasAns_exact", "HasAns_f1", "HasAns_total", "NoAns_exact", "NoAns_f1", "NoAns_total",
"best_exact", "best_exact_thresh", "best_f1", "best_f1_thresh"))
best_results = {}
for model in ("NICT_BERT-base_JapaneseWikipedia_100K",
"NICT_BERT-base_JapaneseWikipedia_32K_BPE",
"multi_cased_L-12_H-768_A-12",
"Japanese_L-12_H-768_A-12_E-30_BPE_WWM_transformers",
"Japanese_L-24_H-1024_A-16_E-30_BPE_WWM_transformers",
"BERT-base_mecab-ipadic-bpe-32k_whole-word-mask",
"bert-wiki-ja",
"hottoSNS-bert_20190311"):
results = []
for filename in sorted(glob.glob(f"expr/{model}_*.log")):
with open(filename, "r", encoding="utf-8") as fp:
result = [Result(**eval(line.split(" - ", 3)[3].split(":", 1)[1]))
for line in fp if " - INFO - __main__ - Results: " in line]
results.append(PairedResult(filename, *result))
best_results[model] = max(results, key=lambda result: float(result.dev.exact))
import pandas as pd
pd.DataFrame(data=((model, result.dev.exact, result.dev.f1, result.test.exact, result.test.f1, result.filename)
for model, result in best_results.items()),
columns=("Model", "Dev EM", "Dev F1", "Test EM", "Test F1", "Log File")).style \
.highlight_max(axis=0) \
.hide_index() \
.set_properties(**{"text-align": "left"}) \
.set_table_styles([{"selector": "th", "props": [("text-align", "left")]}])
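If the summary should also live outside the notebook, the same data can be written to CSV (the output path here is arbitrary):

# Optional: save the summary table as CSV (output path is arbitrary).
summary = pd.DataFrame(
    data=((model, result.dev.exact, result.dev.f1, result.test.exact, result.test.f1, result.filename)
          for model, result in best_results.items()),
    columns=("Model", "Dev EM", "Dev F1", "Test EM", "Test F1", "Log File"))
summary.to_csv("expr/summary.csv", index=False)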