JNAS instruct.txt

970618

ASJ Continuous Speech Corpus
------ Japanese Newspaper Article Sentences (JNAS) ------

1. Outline

This corpus consists of 16 CD-ROMs. It contains speech recordings, with their orthographic transcriptions, of 306 speakers (153 males and 153 females) reading excerpts from the Mainichi Newspaper and the ATR 503 PB-Sentences. All utterances and sentences are in Japanese.
We prepared 155 text sets, each consisting of about 100 sentences from the Mainichi Newspaper. As a general rule, each text set was read by one male and one female speaker. Every speaker also read one subset of the ATR 503 PB-Sentences (about 50 sentences per subset). In total, the corpus contains utterances of about 45,000 sentences, with each speaker reading about 150 sentences.
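The totals quoted above follow from simple arithmetic, which the sketch below reproduces. The per-speaker figures are the approximate values given in the text, not exact counts, so the result is only an estimate.

```python
# Approximate corpus size from the figures given in this outline.
SPEAKERS = 306                 # 153 males + 153 females
NEWSPAPER_PER_SPEAKER = 100    # about 100 sentences per text set
PB_PER_SPEAKER = 50            # about 50 sentences per PB subset

per_speaker = NEWSPAPER_PER_SPEAKER + PB_PER_SPEAKER
total = SPEAKERS * per_speaker

print(per_speaker)  # sentences read by each speaker
print(total)        # estimated number of utterances in the corpus
```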
Each utterance was recorded with two microphones: a head-set microphone (all recording sites used a Sennheiser HMD410/HMD25-1 or the equivalent) and a desk-top microphone, whose type differed from site to site (Sanken, Sony, and so on). The data from the two microphones were stored in separate files with a parallel directory structure on the CD-ROMs: eight of the discs (Vol.1 through Vol.8) contain the head-set microphone data and the other eight (Vol.9 through Vol.16) the desk-top microphone data.
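Because the two groups of discs share a parallel directory structure, finding the other microphone's recording of an utterance amounts to switching volumes. The sketch below assumes that Vol.k pairs with Vol.k+8; the text states only that the structures are parallel, so this pairing is an assumption to be checked against the discs. The function names are illustrative.

```python
def microphone(vol: int) -> str:
    """Return the microphone type stored on a given volume (1 through 16)."""
    if not 1 <= vol <= 16:
        raise ValueError("JNAS volumes are numbered 1 through 16")
    return "head-set" if vol <= 8 else "desk-top"


def paired_volume(vol: int) -> int:
    """Return the volume ASSUMED to hold the other microphone's data.

    Assumption: Vol.k (head-set) corresponds to Vol.k+8 (desk-top).
    """
    if not 1 <= vol <= 16:
        raise ValueError("JNAS volumes are numbered 1 through 16")
    return vol + 8 if vol <= 8 else vol - 8


print(microphone(3), paired_volume(3))
print(microphone(16), paired_volume(16))
```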
The speech waveforms were sampled at 16 kHz and quantized to 16 bits. They are stored in the compressed format described in Section 5.
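As a rough guide to how much space the data occupy once decompressed, the sampling parameters give a fixed data rate. The sketch below assumes one channel per file, which is consistent with the two microphones being stored in separate files but is not stated explicitly here.

```python
SAMPLE_RATE_HZ = 16_000    # 16 kHz sampling
BYTES_PER_SAMPLE = 2       # 16-bit quantization
CHANNELS = 1               # assumption: one microphone per file


def uncompressed_bytes(duration_s: float) -> int:
    """Size of the raw sample data for an utterance of the given duration."""
    return round(duration_s * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


print(uncompressed_bytes(1.0))   # bytes per second of speech
print(uncompressed_bytes(5.0))   # bytes for a five-second utterance
```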
The corpus includes orthographic transcriptions of the speech data and the bigram language models for the Mainichi Newspaper articles from which the prompting text was selected. These materials are contained in Vol.1 and Vol.9.
The Speech Database Committee of the Acoustical Society of Japan, established in July 1990, discussed the design and creation of this corpus, which was recorded in collaboration with 39 institutions. The recording and AD conversion characteristics, including low-pass filter characteristics, are not necessarily uniform across the recording sites.

2. Sentences of the Mainichi Newspaper Articles

The Large Vocabulary Continuous Speech Database Working Group of the Information Processing Society of Japan, established in November, 1995, selected 155 text sets for reading, using articles of the Mainichi Newspaper issued during 1991-1994.
A bigram language model was estimated from 45 months of articles, using morphological information taken from the RWCP text corpus (RWC-DB-TEXT-95-1), which was generated automatically with a morphological analyzer. The CMU SLP toolkit was used for the estimation. Sentences from three months of articles were classified into 30 categories based on the bigram model. Each category is characterized by the sentence length (2 types), the vocabulary size (5 types) and the perplexity (3 types), giving 2 x 5 x 3 = 30 categories.
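To illustrate how a bigram model assigns a perplexity to a sentence, here is a minimal sketch over lists of morphemes. It uses add-one smoothing for brevity; this is a stand-in and not the smoothing method of the toolkit used for the corpus, so the values it produces are not comparable with the thresholds quoted in this document. All function names are illustrative.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over morpheme lists with sentence markers."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Predictable tokens: every morpheme seen in training plus the end marker.
    vocab_size = len({t for s in sentences for t in s}) + 1
    return histories, bigrams, vocab_size


def perplexity(sentence, histories, bigrams, vocab_size):
    """Per-token perplexity of one sentence under the add-one bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tokens) - 1))


model = train_bigram([["a", "b"], ["a", "c"]])
print(perplexity(["a", "b"], *model))   # sentence seen in training
print(perplexity(["c", "b"], *model))   # unseen word order, higher perplexity
```

A sentence whose morpheme sequence is well predicted by the model receives a low perplexity, which is what the three perplexity classes are meant to separate.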
A statistically controlled text set consists of 90 sentences (SC-sentences) collected from the categories according to Table 1 and about 10 connected sentences taken from a few paragraphs. 150 text sets of controlled sentences (about 100 sentences each) were prepared. The five other text sets are made up of connected sentences chosen from some paragraphs.

      Table 1.  The number of sentences collected from each category

                     LENGTH = NORMAL                 LENGTH = LONG
              PERP=P_L  PERP=P_M  PERP=P_H    PERP=P_L PERP=P_M PERP=P_H
   VOC=MID        2         6         2           1        3        1
   VOC=MID+       2         6         2           1        3        1
   VOC=LAR        4        12         4           2        6        2
   VOC=LAR+       2         6         2           1        3        1
   VOC=LAR++      2         6         2           1        3        1


           VOC=MID:        5k voc. without an unknown word
           VOC=MID+:       5k voc. with one unknown word
      
                  LENGTH=NORMAL:  5-19 morphemes 
                  LENGTH=LONG:   20-39 morphemes
                  PERP=P_L:       0 <  perplexity < 40
                  PERP=P_M:      40 <= perplexity < 85
                  PERP=P_H:      85 <= perplexity < 400

           VOC=LAR:       20k voc. without an unknown word
           VOC=LAR+:      20k voc. with one unknown word
           VOC=LAR++:     20k voc. with two or more unknown words
      
                  LENGTH=NORMAL:  5-29 morphemes 
                  LENGTH=LONG:   30-39 morphemes
                  PERP=P_L:       0 <  perplexity < 70
                  PERP=P_M:      70 <= perplexity < 130
                  PERP=P_H:     130 <= perplexity < 400
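Table 1 and the thresholds above can be written down as data, which makes it easy to confirm that the quotas add up to the 90 SC-sentences per text set. The sketch below also maps a sentence to its length and perplexity classes once its vocabulary class is known. How a sentence is assigned to a vocabulary class is not modelled, since the rule is not given in full here; sentences that fall outside every range return None. The names QUOTA, BINS and classify are illustrative.

```python
# Table 1 quotas per vocabulary class, ordered (P_L, P_M, P_H).
QUOTA = {
    "MID":   {"NORMAL": (2, 6, 2),  "LONG": (1, 3, 1)},
    "MID+":  {"NORMAL": (2, 6, 2),  "LONG": (1, 3, 1)},
    "LAR":   {"NORMAL": (4, 12, 4), "LONG": (2, 6, 2)},
    "LAR+":  {"NORMAL": (2, 6, 2),  "LONG": (1, 3, 1)},
    "LAR++": {"NORMAL": (2, 6, 2),  "LONG": (1, 3, 1)},
}

# Thresholds differ between the 5k (MID, MID+) and 20k (LAR, LAR+, LAR++) classes.
BINS = {
    "MID": {"NORMAL": (5, 19), "LONG": (20, 39), "perp": (40, 85, 400)},
    "LAR": {"NORMAL": (5, 29), "LONG": (30, 39), "perp": (70, 130, 400)},
}


def classify(voc, n_morphemes, perplexity):
    """Return (length class, perplexity class), or None if out of range."""
    bins = BINS["MID" if voc.startswith("MID") else "LAR"]
    length = next((name for name in ("NORMAL", "LONG")
                   if bins[name][0] <= n_morphemes <= bins[name][1]), None)
    low, mid, high = bins["perp"]
    if length is None or not 0 < perplexity < high:
        return None
    perp = "P_L" if perplexity < low else "P_M" if perplexity < mid else "P_H"
    return length, perp


total = sum(sum(row) for voc in QUOTA.values() for row in voc.values())
print(total)                        # SC-sentences per text set
print(classify("LAR+", 30, 130))    # length and perplexity classes
```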

3. ATR 503 PB-Sentences

These PB-sentences were chosen by ATR Interpreting Telephony Research Laboratories. Entropy was calculated based on the clusters of two phonemes (120 CV's, 227 VC's and 55 VV's, making 402 clusters in all) and three phonemes (69 CVC's where C is an unvoiced consonant, 18 CVC's where C is a nasal consonant and 136 VCV's where C is a semivowel, making 223 clusters in all) on the assumption that they occur independently. 10,196 original sample sentences were extracted at random from newspapers, magazines, novels, letters, textbooks, etc. Of these, 503 PB sentences were chosen to maximize the entropy. They were sorted so that each set of 50 sentences is also phonetically balanced.
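The idea of choosing sentences to maximize entropy can be sketched with a greedy loop that repeatedly adds the candidate giving the highest entropy of the accumulated unit counts. Both the greedy strategy and the unit definition below (adjacent symbol pairs as a stand-in for the phoneme clusters) are assumptions made for illustration; the actual selection procedure is not described here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def units(sentence):
    """Hypothetical stand-in for phoneme clusters: adjacent symbol pairs."""
    return [a + b for a, b in zip(sentence, sentence[1:])]


def greedy_select(candidates, k):
    """Pick up to k sentences, each time maximizing the entropy of all units so far."""
    chosen, counts, pool = [], Counter(), list(candidates)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: entropy(counts + Counter(units(s))))
        chosen.append(best)
        counts.update(units(best))
        pool.remove(best)
    return chosen


print(greedy_select(["abcde", "aaaa", "fgh"], 2))
```

A sentence that only repeats one unit adds little entropy, so it is chosen last; this is the effect that makes the selected set phonetically balanced.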

4. Orthographic transcription

The corpus includes two kinds of transcriptions. One is the Japanese orthographic text with ruby, used as a prompting text for reading, in the TeX format. The other is the Kana or Romaji text that represents pronunciation of sentences. The Kana and Romaji texts were modified to correctly transcribe recorded utterances according to the check reports from the recording sites.

5. CD-ROM Format

The CD-ROMs are formatted according to the ISO 9660 standard. The speech waveforms were digitized with a 16 kHz sampling frequency and 16 bit quantization. They are stored with NIST SPHERE headers in a compressed format, using the "shorten" compression technique developed by Tony Robinson at Cambridge University/SoftSound Limited and implemented in the NIST SPHERE PACKAGE. The latest version of the package is available via anonymous ftp (URL=ftp://jaguar.ncsl.nist.gov/pub/sphere_x.x.tar.Z).
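The SPHERE header is plain ASCII and can be inspected without the package. The sketch below parses it, assuming the usual layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" fields terminated by "end_head". It does not decode the shorten-compressed samples, for which the SPHERE package is still needed. The header used in the demonstration is synthetic, and its field values are illustrative rather than copied from a corpus file.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")
    header_size = int(data[8:16].decode("ascii").strip())
    fields = {}
    text = data[16:header_size].decode("ascii", errors="replace")
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):      # skip blanks and comments
            continue
        name, type_code, value = line.split(" ", 2)
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:                                     # "-sN": string of N characters
            fields[name] = value
    return fields


# Synthetic header for demonstration only; real files carry more fields.
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"sample_n_bytes -i 2\n"
        b"channel_count -i 1\n"
        b"sample_coding -s3 pcm\n"
        b"end_head\n").ljust(1024)

print(parse_sphere_header(demo))
```

For a file on disc, reading the first 1024 bytes in binary mode and passing them to the function should be enough as long as the header size line reports 1024.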

6. Agreement form

June 1997

Shuichi Itahashi

Copyright (c) of speech data: Shuichi Itahashi (Edited by the Acoustical Society of Japan), 1997

Copyright (c) of text of the Mainichi Newspaper: The Mainichi Newspapers, 1991-1994

Copyright (c) of morphological analysis data: Real World Computing Partnership, 1996

Copyright (c) of ATR 503 PB-Sentences: ATR Interpreting Telephony Research Laboratories, 1988

ASJ Continuous Speech Corpus : Vols. 1 - 16
--- Japanese Newspaper Article Sentences (JNAS) ---

Editor: Speech Database Committee Acoustical Society of Japan

Publisher: Acoustical Society of Japan 2-7-7 Yoyogi, Shibuya, Tokyo Japan

Assisted by:
Chiba University
Doshisha University
Kyoto Institute of Technology
Kyoto University
Nagoya University
Nara Institute of Science and Technology
Osaka University
Ryukoku University
Shinshu University
Sizuoka University
Teikyo University of Science & Technology
Tohoku University
Toyohashi University of Technology
University of Electro-Communications
University of Tokyo
University of Tsukuba
Waseda University
Yamagata University
Yamanashi University
Electrotechnical Laboratory
ATR Interpreting Telecommunications Research Laboratories
Canon Inc.
Fujitsu Laboratories Ltd.
Furui Research Laboratory, NTT Human Interface Laboratories
Hitachi, Ltd.
Kokusai Denshin Denwa Co., Ltd.
Matsushita Research Institute Tokyo, Inc.
Meidensha Corporation
Mitsubishi Electric Corp.
NEC Corporation
NTT Basic Research Laboratories
NTT Data Corporation
Oki Electric Industry Co., Ltd.
Ricoh Co., Ltd.
Sanyo Electric Co., Ltd.
Sharp Corporation
Sony Corporation
Speech and Acoustics Laboratory, NTT Human Interface Laboratories
Toshiba Corporation

Acknowledgements:
The prompting texts and the bigram language models of the Mainichi Newspaper article sentences were prepared by the Large Vocabulary Continuous Speech Database Working Group, Special Interest Group of Spoken Language Processing, Information Processing Society of Japan.
We used the NIST SPHERE package for attaching a header to wave files and the 'shorten' compression technique for reducing the number of CD-ROMs. The NIST SPHERE package was implemented by the Spoken Natural Language processing group, National Institute of Standards and Technology, U.S.A. The 'shorten' compression technique was developed by Tony Robinson at Cambridge University and SoftSound Limited, UK.
We would like to thank all the groups and people above.

CD-ROMs produced by : Media Drive Corporation