Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles



Overview


The “Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles” is intended mainly to support research and development in high-performance multilingual machine translation, information extraction, and other language processing technologies. The National Institute of Information and Communications Technology (NICT) created this corpus by manually translating Japanese Wikipedia articles related to Kyoto into English.

Unique Features

  1. A precise, large-scale corpus containing about 500,000 pairs of manually translated sentences.

  2. The three-step translation process (primary translation -> secondary translation to improve fluency -> final check for technical terms) has been clearly recorded.

  3. Translated articles concern Kyoto and other topics such as traditional Japanese culture, religion, and history.

  4. The Japanese-English Bilingual Kyoto Lexicon is also available. This lexicon was created by extracting the Japanese-English word pairs from this corpus.

Contents


What's New


Precautions


Sample


One Wikipedia article is stored as one XML file in this corpus, and the corpus contains 14,111 files in total.

The following is a short quotation from a corpus file titled “Ryoan-ji Temple”.
Each tag has its own meaning; the meanings of all tags are given in readme.pdf.
This corpus is currently limited in the accuracy of its section and paragraph breaks, because their positions were identified automatically by detecting what appear to be section headers in untagged Wikipedia articles. The current section/paragraph breaks should therefore be treated only as a guide.

<?xml version="1.0" encoding="UTF-8"?>
<art orl="ja" trl="en">
<inf>jawiki-20080607-pages-articles.xml</inf>
<tit>
<j>龍安寺</j>
 <e type="trans" ver="1">Ryoan-ji Temple</e>
 <cmt></cmt>
 <e type="trans" ver="2">Ryoan-ji Temple</e>
 <cmt>修正なし</cmt>
 <e type="check" ver="1">Ryoan-ji Temple</e>
 <cmt>修正なし</cmt>
</tit>
<par id="1">
 <sen id="1">
  <j>龍安寺(りょうあんじ)は、京都府京都市右京区にある臨済宗妙心寺派の寺院。</j>
   <e type="trans" ver="1">Ryoan-ji is a temple in the Myoshinji branch of the
   Rinzai sect, and is located in Ukyo-ku, Kyoto.</e>
   <cmt></cmt>
   <e type="trans" ver="2">Ryoan-ji is a temple that belongs to the Myoshinji
   school of the Rinzai sect, and is located in Ukyo-ku, Kyoto city.</e>
   <cmt>妙心寺派の「派」はschoolの方がよく用いられている。「妙心寺派の」という表現は「妙心寺
   派に属する」という意味である。「京都市」だけを訳出してあるので、cityを添えた。</cmt>
   <e type="check" ver="1">A temple belonging to the Myoshinji school of the
   Rinzai sect, Ryoan-ji Temple is located in Ukyo-ku, Kyoto city.</e>
   <cmt>フィードバックに基づき翻訳を修正しました。</cmt>
 </sen>
 <sen id="2">

(omitted)

</par>
</art>
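
The quotation above suggests how sentence pairs might be pulled out of one article file. The following is a minimal sketch, not an official tool: it assumes that `<j>` holds the Japanese source, that each `<e>` holds one translation stage identified by its `type` and `ver` attributes, and that the final check is the preferred English version, falling back to earlier stages when it is empty. The names `STAGE_ORDER`, `normalize`, `best_english`, `sentence_pairs`, and `SAMPLE` are hypothetical, and `SAMPLE` is an abridged, well-formed miniature of the quotation; readme.pdf remains the authority on tag meanings.

```python
import xml.etree.ElementTree as ET

# Assumed preference order for the English text: final check first, then the
# fluency revision, then the primary translation.
STAGE_ORDER = [("check", "1"), ("trans", "2"), ("trans", "1")]


def normalize(text):
    """Collapse line breaks and repeated whitespace in an element's text."""
    return " ".join((text or "").split())


def best_english(node):
    """Return the latest-stage non-empty <e> text directly under node."""
    for e_type, ver in STAGE_ORDER:
        for e in node.findall("e"):
            if e.get("type") == e_type and e.get("ver") == ver:
                text = normalize(e.text)
                if text:
                    return text
    return ""


def sentence_pairs(root):
    """Yield (japanese, english) for each <sen> element in one article."""
    for sen in root.iter("sen"):
        japanese = (sen.findtext("j") or "").strip()
        yield japanese, best_english(sen)


# Abridged miniature of the quotation above, closed so that it parses.
SAMPLE = """<art orl="ja" trl="en">
<tit>
<j>龍安寺</j>
<e type="trans" ver="1">Ryoan-ji Temple</e>
<cmt></cmt>
<e type="check" ver="1">Ryoan-ji Temple</e>
<cmt>修正なし</cmt>
</tit>
<par id="1">
<sen id="1">
<j>龍安寺(りょうあんじ)は、京都府京都市右京区にある臨済宗妙心寺派の寺院。</j>
<e type="trans" ver="1">Ryoan-ji is a temple in the Myoshinji branch of the
Rinzai sect, and is located in Ukyo-ku, Kyoto.</e>
<e type="check" ver="1">A temple belonging to the Myoshinji school of the
Rinzai sect, Ryoan-ji Temple is located in Ukyo-ku, Kyoto city.</e>
</sen>
</par>
</art>"""

root = ET.fromstring(SAMPLE)
title = best_english(root.find("tit"))
pairs = list(sentence_pairs(root))

print(title)
for japanese, english in pairs:
    print(japanese)
    print(english)
```

For a file from the downloaded corpus, `ET.parse(path).getroot()` could replace `ET.fromstring(SAMPLE)`, where `path` is wherever the file was saved. Because the section/paragraph breaks are only a guide, the sketch iterates over every `<sen>` element regardless of its enclosing paragraph or section.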

Categories


The files are divided into 15 categories: school, railway, family, building, Shinto, person name, geographical name, culture, road, Buddhism, literature, title, history, shrines and temples, and emperor. A sample file is available for each category.

Download


License


Use and/or redistribution of the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles and the Japanese-English Bilingual Kyoto Lexicon is permitted under the conditions of Creative Commons Attribution-Share-Alike License 3.0. Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.
(*) The manner of attribution is specified as follows:
"The [data] used in this service contains English contents which is translated by the National Institute of Information and Communications Technology (NICT) from Japanese sentences on Wikipedia. Our use of this data is licensed by the Creative Commons Attribution-Share-Alike License 3.0. Please refer to http://creativecommons.org/licenses/by-sa/3.0/ or http://alaginrc.nict.go.jp/WikiCorpus/ for details."
Caution: where necessary, the user may have to replace [data] in the above text with a more appropriate expression such as [translated sentences] or [definitions].
The condition (*) was added on February 23, 2012. All data downloaded on or after that date must comply with this condition.


MASTAR Project
Information Analysis Laboratory (previously Language Infrastructure Group) & Multilingual Translation Laboratory (previously Language Translation Group)
National Institute of Information and Communications Technology