The NICT Japanese Learner English (JLE) Corpus
Overview
In 2004, the National Institute of Information and Communications Technology (NICT) created a learner corpus, The NICT JLE Corpus. The corpus data are transcripts of audio-recorded speech samples (1,281 samples, 1.2 million words, 300 hours in total) from an English oral proficiency interview test, the ACTFL-ALC SST (Standard Speaking Test).
Unique Features
- Proficiency Level
- The advantage of using the SST data as a source is that each speaker's data includes his or her proficiency level (one of 9 levels) based on the SST scoring method, which makes it easy to analyze and compare the characteristics of the interlanguage at each developmental stage. This is a strength of The NICT JLE Corpus that is rarely found in other learner corpora.
- Annotation
- The corpus contains two kinds of annotation: basic tags (for all files) and error tags (for 167 files). There are more than 30 basic tags, divided into four groups: tags for the structure of the interview, tags for the interviewee's profile, tags for speaker turns, and tags for utterance phenomena such as fillers, repetitions, self-corrections, and overlapping. Analyzing the errors learners produce is an efficient way of identifying their stages of development and of deciding the most appropriate teaching method for them. Designing a consistent and generic error tagset is quite difficult, because learners' errors extend across various linguistic areas, and it requires a robust error typology. We therefore designed the original error tagset only for learners' grammatical and lexical errors. The error tagset consists of 47 tags.
- Subcorpus
- We have also compiled a subcorpus of native English speakers for comparison. It was built by collecting speech data from native speakers in interviews similar to the SST, which allows the utterances of native speakers and Japanese learners to be compared. As stated above, we performed error tagging only for grammatical and lexical errors, so the subcorpus may cover what we are unable to examine solely by error tagging.
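The comparison across proficiency levels described above can be sketched in a few lines of Python. This is only an illustration, not part of the corpus distribution: it assumes that each transcript carries its level in an `<SST_level>` element and marks fillers with `<F>` inside interviewee (`<B>`) turns, as in the quotation from a corpus file further down this page. The function names `sst_level` and `fillers_by_level` are hypothetical.

```python
import re
from collections import defaultdict

# Patterns based on the tags visible in the corpus excerpt on this page.
SST_LEVEL = re.compile(r"<SST_level>\s*(\d+)\s*</SST_level>")
TURN_B = re.compile(r"<B>(.*?)</B>", re.DOTALL)   # interviewee turns
FILLER = re.compile(r"<F>.*?</F>", re.DOTALL)     # fillers such as "er"


def sst_level(text):
    """Return the SST level recorded in a transcript, or None if absent."""
    match = SST_LEVEL.search(text)
    return int(match.group(1)) if match else None


def fillers_by_level(transcripts):
    """Map each SST level to a list of per-transcript filler counts.

    Only fillers inside interviewee (<B>) turns are counted; transcripts
    without an <SST_level> value are skipped.
    """
    counts = defaultdict(list)
    for text in transcripts:
        level = sst_level(text)
        if level is None:
            continue
        n = sum(len(FILLER.findall(turn)) for turn in TURN_B.findall(text))
        counts[level].append(n)
    return dict(counts)
```

Given the contents of several corpus files as strings, `fillers_by_level` returns one list of counts per level, which can then be averaged or plotted. Filler counts are just one possible measure; the same grouping would work for any other tag.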
Contents
- 2012-10-17 The NICT JLE Corpus Version 4.1 released
- Users of the corpus are requested to exercise care when encountering any instances of defamation, discriminatory terms, or personal information that might be found within the corpus.
- NICT bears no responsibility for the contents of the corpus and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus.
- If any copyright infringement or other problems are found in the corpus, please contact us at JLE-Corpus[at]khn[dot]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
Each interview transcript is stored as a single TXT file, and the corpus contains 1,281 files in total.
A short quotation from a corpus file is shown below, following a summary of the tags it uses.
Each tag has a specific meaning. For example:
- <head version="1.3">interviewee's profile, such as sex, overseas experience, and proficiency level</head>
- <A>interviewer's utterance</A>
- <B>interviewee's utterance</B>
- <task>task activity assigned to an interviewee</task>
- <followup>followup talk after a task activity</followup>
- <F>filler</F>
- <R>repetition</R>
- <SC>self-correction</SC>
The meanings of all tags are given in the Tag List.
<head version="1.3">
<date>1999-12-16</date>
<sex>female</sex>
<age></age>
<country>Japan</country>
<overseas></overseas>
<category></category>
<step>1.5</step>
<TOEIC>765</TOEIC>
<TOEFL></TOEFL>
<other_tests></other_tests>
<SST_level>6</SST_level>
<SST_task2>restaurant</SST_task2>
<SST_task3>train_advanced</SST_task3>
<SST_task4>department store</SST_task4>
</head>
...
<stage2>
<task>
<A>I see. O K. Now, let me show you the first picture. Please describe this picture.</A>
<B>O K. <F>Er</F> <R>this is a</R> this is a <.></.> room in a hotel. And <.></.> <F>oh</F> sorry, it's not. Yeah, I think it's a restaurant. And there are three tables, <R>and</R> and there are three couples and <SC>two server</SC> two <R>waiter</R> waiter are serving. And <R>in the</R> in the middle of the restaurant, the couple is <F>er</F> drinking wine. And <F>err</F> the man is <.></.> testing the wine and saying something to the waiter. Maybe he is sommelier. And <R>he</R> he show the bottle to the man. I guess he is explaining something. And <F>er</F> the couple, <F>er</F> they dressed very nicely. <CO><R>And</R> <.></.> <F>mhmm</F> <R>and</R> <.></.> <R>and</R> <F>well</F> and</CO>. <.></.></B>
</task>
<followup>
<A>O K.</A>
<B>O K?</B>
<A>O K. Thank you very much. <F>Er</F> how do you spend time with your husband?</A>
<B><.></.> You mean, in our free time?</B>
<A><F>Mhmm</F>.</A>
<B><F>Er</F> like this? <.></.> <F>Well</F> <F>er</F> <R>I</R> I sometimes eating out with my husband. But we don't get dressed like this. <nvs>laughter</nvs> <..></..></B>
<A>Can you compare the restaurant you often go to to this picture?</A>
<B><nvs>laughter</nvs> It's very different from restaurant to we often go. We often go to a kind of family style restaurant <.></.> such as Denny's or Skylark. So I wish I could <SC>go like</SC> go to a nice restaurant like this.</B>
<A><F>Er</F> what is good about family-type restaurant?</A>
<B><F>Well</F> <SC>fir</SC> at first, it's very cheap and they served very quickly. And, <F>er</F> most of the cases, <F>er</F> that kind of restaurant is in suburb, so <SC>people are very</SC> <F>er</F> people can go there very easily. I think they are good point of family-type restaurant.</B>
</followup>
</stage2>
...
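As a rough illustration of how the tags in this quotation might be processed, the Python sketch below extracts the interviewee's turns and removes the disfluency markup. It is a minimal sketch under stated assumptions, not an official tool: the function names are hypothetical, the treatment of `<nvs>` as a non-verbal annotation is inferred from the excerpt alone, and the regular expressions assume that a tag is never nested inside another tag of the same name.

```python
import re

# Content of these tags is dropped entirely:
# F = filler, R = repetition, SC = self-correction (see the tag list above).
DISFLUENCY = re.compile(r"<(F|R|SC)>.*?</\1>", re.DOTALL)
# <nvs> appears to hold non-verbal annotations such as "laughter"
# (an assumption based on the excerpt); its content is dropped as well.
NVS = re.compile(r"<nvs>.*?</nvs>", re.DOTALL)
# Any remaining markup, e.g. <CO>, </CO>, <.></.>: the tag is removed
# but the text it encloses is kept.
ANY_TAG = re.compile(r"<[^>]*>")
TURN_B = re.compile(r"<B>(.*?)</B>", re.DOTALL)


def interviewee_turns(text):
    """Return the raw content of every interviewee (<B>) turn."""
    return TURN_B.findall(text)


def clean(utterance):
    """Strip fillers, repetitions, self-corrections, and all other tags."""
    utterance = NVS.sub(" ", utterance)
    utterance = DISFLUENCY.sub(" ", utterance)
    utterance = ANY_TAG.sub(" ", utterance)
    return " ".join(utterance.split())  # collapse runs of whitespace


# Two turns copied from the quotation above.
SAMPLE = (
    "<A><F>Mhmm</F>.</A>\n"
    "<B><F>Er</F> like this? <.></.> <F>Well</F> <F>er</F> <R>I</R> I sometimes "
    "eating out with my husband. But we don't get dressed like this. "
    "<nvs>laughter</nvs> <..></..></B>"
)

for turn in interviewee_turns(SAMPLE):
    print(clean(turn))
# prints: like this? I sometimes eating out with my husband. But we don't get dressed like this.
```

The learner's grammatical errors ("I sometimes eating out") are deliberately left untouched, since only the utterance-phenomena tags are removed. Whether to drop or keep repetitions and self-corrections depends on the analysis, so the set of tags in `DISFLUENCY` is a choice rather than a rule.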
Use and/or redistribution of The NICT JLE Corpus is permitted under the terms of the Creative Commons Attribution-ShareAlike 3.0 License. Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.
Information Analysis Laboratory (previously Language Infrastructure Group)
National Institute of Information and Communications Technology
Copyright 2004-2012 NICT All Rights Reserved.