Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles
Overview
The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles aims mainly at supporting research and development relevant to high-performance multilingual machine translation, information extraction, and other language processing technologies. The National Institute of Information and Communications Technology (NICT) has created this corpus by manually translating Japanese Wikipedia articles (related to Kyoto) into English.
Unique Features
- A precise and large-scale corpus containing about 500,000 pairs of manually-translated sentences.
- Can be exploited for research and development of high-performance multilingual machine translation, information extraction, and so on.
- The three-step translation process (primary translation -> secondary translation to improve fluency -> final check for technical terms) has been clearly recorded.
- Shows how each translation was refined from one step to the next, which makes the corpus useful for research and development on translation aids and for error analysis of human translation.
- Translated articles concern Kyoto and other topics such as traditional Japanese culture, religion, and history.
- Can also be utilized for tourist information translation or to create glossaries for travel guides.
- The Japanese-English Bilingual Kyoto Lexicon is also available. This lexicon was created by extracting the Japanese-English word pairs from this corpus.
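Because every step of the three-step process is kept, revisions between steps can be compared directly. The following is a minimal sketch in Python and is not part of the corpus distribution: it uses the standard difflib module to list word-level edits between two translation steps. The helper name word_changes is hypothetical, and the two sentences are the primary and secondary translations of the first sentence in the Ryoan-ji Temple sample, with the line breaks joined.

```python
import difflib


def word_changes(before: str, after: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_words, new_words) for each edited span.

    Hypothetical helper: compares two translations word by word and
    keeps only the spans that differ.
    """
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Primary and secondary translations taken from the corpus sample.
primary = ("Ryoan-ji is a temple in the Myoshinji branch of the Rinzai sect, "
           "and is located in Ukyo-ku, Kyoto.")
secondary = ("Ryoan-ji is a temple that belongs to the Myoshinji school of the "
             "Rinzai sect, and is located in Ukyo-ku, Kyoto city.")

changes = word_changes(primary, secondary)
for op, old, new in changes:
    print(f"{op:8} {old!r} -> {new!r}")
```

Comparing at the word level rather than the character level keeps the reported edits close to what a reviser actually changed, such as replacing "branch" with "school".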
Contents
- 2010-10-27 Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles Version 1.0 released
- 2010-12-20 Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles Version 2.0 released
- 2011-01-13 Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles Version 2.01 released
- A bugfix release that fixes all known minor bugs and omissions found since the release of Version 2.0.
- NICT has created this corpus by translating Japanese Wikipedia articles into English and has made it publicly available under the conditions of Creative Commons Attribution-Share-Alike License 3.0. Users of the corpus are advised to read Wikipedia's copyright policy carefully to ensure proper usage.
- The content of the selected Wikipedia articles has been translated for this corpus. Users of the corpus are requested to exercise care when encountering any instances of defamation, discriminatory terms, or personal information that might be found within the corpus.
- NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.
- If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at kyoto-corpus[at]khn[dot]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
One Wikipedia article is stored as one XML file in this corpus, and the corpus contains 14,111 files in total.
A short excerpt from the corpus file titled Ryoan-ji Temple is shown below.
Each tag has different implications. For example:
- <j>Original Japanese sentence</j>
- <e type="trans" ver="1">Primary translation</e>
- <e type="trans" ver="2">Secondary translation</e>
- <e type="check" ver="1">Final translation</e>
- <cmt>Comment added by translators</cmt>
The meanings of all tags are shown in readme.pdf.
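As an illustration of how these tags can be read programmatically, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It is not an official tool. The function name read_unit and the STAGE_NAMES table are hypothetical, and the sketch assumes that each <cmt> annotates the <e> element immediately before it, which is how the title element of the Ryoan-ji Temple file is laid out. readme.pdf remains the authoritative description of the tags.

```python
import xml.etree.ElementTree as ET

# Title element copied from the Ryoan-ji Temple corpus file.
TITLE_XML = """<tit>
<j>龍安寺</j>
<e type="trans" ver="1">Ryoan-ji Temple</e>
<cmt></cmt>
<e type="trans" ver="2">Ryoan-ji Temple</e>
<cmt>修正なし</cmt>
<e type="check" ver="1">Ryoan-ji Temple</e>
<cmt>修正なし</cmt>
</tit>"""

# Mapping from (type, ver) attributes to the step names listed above.
STAGE_NAMES = {
    ("trans", "1"): "primary translation",
    ("trans", "2"): "secondary translation",
    ("check", "1"): "final translation",
}


def read_unit(unit):
    """Return the Japanese text and the list of translation steps.

    Assumption: a <cmt> belongs to the <e> element that precedes it.
    """
    japanese = (unit.findtext("j") or "").strip()
    stages = []
    for child in unit:
        if child.tag == "e":
            key = (child.get("type"), child.get("ver"))
            stages.append({
                "stage": STAGE_NAMES.get(key, "unknown"),
                # Collapse line breaks inside the translated text.
                "text": " ".join((child.text or "").split()),
                "comment": "",
            })
        elif child.tag == "cmt" and stages:
            stages[-1]["comment"] = (child.text or "").strip()
    return japanese, stages


japanese, stages = read_unit(ET.fromstring(TITLE_XML))
print(japanese)
for stage in stages:
    print(stage["stage"], "|", stage["text"], "|", stage["comment"])
```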
The section and paragraph breaks in this corpus are of limited accuracy, because their positions were identified automatically by detecting what appear to be section headers in untagged Wikipedia articles. The current section/paragraph breaks should therefore be treated only as a guide.
<?xml version="1.0" encoding="UTF-8"?>
<art orl="ja" trl="en">
<inf>jawiki-20080607-pages-articles.xml</inf>
<tit>
<j>龍安寺</j>
<e type="trans" ver="1">Ryoan-ji Temple</e>
<cmt></cmt>
<e type="trans" ver="2">Ryoan-ji Temple</e>
<cmt>修正なし</cmt>
<e type="check" ver="1">Ryoan-ji Temple</e>
<cmt>修正なし</cmt>
</tit>
<par id="1">
<sen id="1">
<j>龍安寺(りょうあんじ)は、京都府京都市右京区にある臨済宗妙心寺派の寺院。</j>
<e type="trans" ver="1">Ryoan-ji is a temple in the Myoshinji branch of the
Rinzai sect, and is located in Ukyo-ku, Kyoto.</e>
<cmt></cmt>
<e type="trans" ver="2">Ryoan-ji is a temple that belongs to the Myoshinji
school of the Rinzai sect, and is located in Ukyo-ku, Kyoto city.</e>
<cmt>妙心寺派の「派」はschoolの方がよく用いられている。「妙心寺派の」という表現は「妙心寺
派に属する」という意味である。「京都市」だけを訳出してあるので、cityを添えた。</cmt>
<e type="check" ver="1">A temple belonging to the Myoshinji school of the
Rinzai sect, Ryoan-ji Temple is located in Ukyo-ku, Kyoto city.</e>
<cmt>フィードバックに基づき翻訳を修正しました。</cmt>
</sen>
<sen id="2">
(omitted)
</par>
</art>
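For machine translation work, a common first step is to pull one Japanese-English pair out of each title and sentence. The sketch below shows one way to do this with Python's standard xml.etree.ElementTree module, under stated assumptions: the function names best_english and sentence_pairs are hypothetical, the embedded string is a shortened, well-formed version of the excerpt above, and the rule "prefer the final check, otherwise the highest-numbered translation" is a design choice of this sketch rather than something the corpus prescribes. Tags not shown in the excerpt are ignored; see readme.pdf for the full list.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the Ryoan-ji Temple excerpt.
ARTICLE = """<art orl="ja" trl="en">
<inf>jawiki-20080607-pages-articles.xml</inf>
<tit>
<j>龍安寺</j>
<e type="trans" ver="1">Ryoan-ji Temple</e>
<e type="trans" ver="2">Ryoan-ji Temple</e>
<e type="check" ver="1">Ryoan-ji Temple</e>
</tit>
<par id="1">
<sen id="1">
<j>龍安寺(りょうあんじ)は、京都府京都市右京区にある臨済宗妙心寺派の寺院。</j>
<e type="trans" ver="1">Ryoan-ji is a temple in the Myoshinji branch of the
Rinzai sect, and is located in Ukyo-ku, Kyoto.</e>
<e type="check" ver="1">A temple belonging to the Myoshinji school of the
Rinzai sect, Ryoan-ji Temple is located in Ukyo-ku, Kyoto city.</e>
</sen>
</par>
</art>"""


def best_english(unit):
    """Pick the final check if present, else the highest-numbered translation."""
    candidates = unit.findall("e")
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda e: (e.get("type") == "check", int(e.get("ver", "0"))),
    )
    # Collapse the line breaks that appear inside long sentences.
    return " ".join((best.text or "").split())


def sentence_pairs(root):
    """Return (Japanese, English) pairs for the title and every sentence."""
    pairs = []
    for unit in root.iter():  # document order, so the title comes first
        if unit.tag in ("tit", "sen"):
            japanese = (unit.findtext("j") or "").strip()
            english = best_english(unit)
            if japanese and english:
                pairs.append((japanese, english))
    return pairs


pairs = sentence_pairs(ET.fromstring(ARTICLE))
for japanese, english in pairs:
    print(japanese, "\t", english)
```

To process a file from the corpus instead of the embedded string, the root can be obtained with ET.parse(path).getroot(), where path is the location of one of the XML files on your system.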
The files have been divided into 15 categories: school, railway, family, building, Shinto, person name, geographical name, culture, road, Buddhism, literature, title, history, shrines and temples, and emperor.
Use and/or redistribution of the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles and the Japanese-English Bilingual Kyoto Lexicon is permitted under the conditions of Creative Commons Attribution-Share-Alike License 3.0. Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.
(*) The manner of attribution is specified as follows:
"The [data] used in this service contains English contents which is translated by the National Institute of Information and Communications Technology (NICT) from Japanese sentences on Wikipedia. Our use of this data is licensed by the Creative Commons Attribution-Share-Alike License 3.0. Please refer to http://creativecommons.org/licenses/by-sa/3.0/ or http://alaginrc.nict.go.jp/WikiCorpus/ for details."
Caution: the user may have to replace the reference to [data] in the above text by more appropriate expressions such as [translated sentences] or [definitions] where necessary.
The condition (*) was added on February 23, 2012. All data downloaded from that day on is required to comply with this condition.
MASTAR Project
Information Analysis Laboratory (previously Language Infrastructure Group) & Multilingual Translation Laboratory (previously Language Translation Group)
National Institute of Information and Communications Technology