970618
ASJ Continuous Speech Corpus
------ Japanese Newspaper Article Sentences (JNAS) ------
1. Outline
This corpus consists of 16 CD-ROMs. It contains speech recordings, with
their orthographic transcriptions, of 306 speakers (153 male and 153
female) reading excerpts from the Mainichi Newspaper and the ATR 503
PB-Sentences. All utterances and sentences are in Japanese.
We prepared 155 text sets. Each set consists of about 100 sentences
from the Mainichi Newspaper. As a general rule, each text set was read
by one male and one female speaker. Every speaker also read one subset of
the ATR 503 PB-Sentences (about 50 sentences per subset). As a whole, the
corpus therefore contains about 45,000 utterances, with each speaker
reading about 150 sentences.
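The totals above can be checked with a little arithmetic. The sketch below
uses only the approximate per-speaker figures quoted in this section; the
variable names are illustrative and the result is an estimate, not an exact
count of the files on the discs.

```python
# Rough size check using the approximate figures quoted in the text.
SPEAKERS = 306
NEWSPAPER_PER_SPEAKER = 100  # "about 100 sentences" per text set
PB_PER_SPEAKER = 50          # "about 50 sentences" per PB subset

per_speaker = NEWSPAPER_PER_SPEAKER + PB_PER_SPEAKER
total_utterances = SPEAKERS * per_speaker

print(per_speaker, total_utterances)
```

The estimate is consistent with the "about 45,000" figure given above.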
Each utterance was recorded with two microphones: a head-set
microphone (all recording sites used a Sennheiser HMD410/HMD25-1 or
equivalent) and a desk-top microphone whose type differed from site to
site (Sanken, Sony, and so on). The data from the two microphones are
stored in separate files with a parallel directory structure on the
CD-ROMs: eight of the discs (Vol.1 through Vol.8) contain the
head-set-microphone data and the other eight (Vol.9 through Vol.16) the
desk-top-microphone data.
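As a minimal sketch, the volume layout can be expressed as a lookup. The
function names are hypothetical. The pairing of Vol.n with Vol.n+8 is an
ASSUMPTION suggested by the parallel directory structure; it is not stated
explicitly in this document and should be verified against the discs.

```python
def microphone_for_volume(vol: int) -> str:
    """Return the microphone type recorded on a given volume (1..16)."""
    if 1 <= vol <= 8:
        return "head-set"
    if 9 <= vol <= 16:
        return "desk-top"
    raise ValueError(f"volume out of range: {vol}")


def counterpart_volume(vol: int) -> int:
    """Return the volume assumed to hold the other microphone's recordings.

    ASSUMPTION: Vol.n (head-set) pairs with Vol.n+8 (desk-top).
    """
    microphone_for_volume(vol)  # validates the range
    return vol + 8 if vol <= 8 else vol - 8


print(microphone_for_volume(3), counterpart_volume(3))
```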
The speech waves were sampled at 16 kHz and quantized to 16 bits.
They are stored in the compressed format described in Section 5.
The corpus includes orthographic transcriptions of the speech data and
the bigram language models for the Mainichi Newspaper articles from which
the prompting text was selected. These materials are contained in Vol.1
and Vol.9.
The Speech Database Committee of the Acoustical Society of Japan,
established in July 1990, has discussed the design and creation of this
corpus, which has been recorded in collaboration with 39 institutions.
The recording and A/D conversion characteristics, including low-pass
filter characteristics, are not necessarily uniform across sites.
2. Sentences of the Mainichi Newspaper Articles
The Large Vocabulary Continuous Speech Database Working Group of the
Information Processing Society of Japan, established in November, 1995,
selected 155 text sets for reading, using articles of the Mainichi Newspaper
issued during 1991-1994.
A bigram language model was estimated from 45 months of articles,
with morphological information taken from the RWCP text corpus
(RWC-DB-TEXT-95-1), which was generated automatically with a
morphological analyzer. The CMU SLP toolkit was used for the estimation.
Sentences from three months of articles were classified into 30
categories based on the bigram model. Each category is characterized by
the sentence length (2 types), the vocabulary size (5 types) and the
perplexity (3 types).
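For readers unfamiliar with the measure, the sketch below shows the
standard definition of sentence perplexity under a bigram model. It is not
the CMU toolkit's code: the function name and the `logprob` callback are
hypothetical, and this document does not say how sentence boundaries or
unknown words were handled when the categories were computed.

```python
import math
from typing import Callable, Sequence


def bigram_perplexity(tokens: Sequence[str],
                      logprob: Callable[[str, str], float]) -> float:
    """Perplexity of a morpheme sequence under a bigram model.

    tokens  : morphemes, including any boundary markers the model expects
    logprob : returns the natural log of P(current | previous)
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a bigram")
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(logprob(prev, cur) for prev, cur in pairs)
    # Perplexity is exp of the negative mean log-probability per bigram.
    return math.exp(-total / len(pairs))


# Toy model in which every bigram has probability 1/8.
uniform = lambda prev, cur: math.log(1 / 8)
print(bigram_perplexity(["<s>", "a", "b", "</s>"], uniform))
```

With a uniform model the perplexity equals the number of equally likely
choices, which is a convenient sanity check.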
A statistically controlled text set consists of 90 sentences
(SC-sentences) collected from the categories according to Table 1, plus
about 10 connected sentences taken from a few paragraphs. 150 text sets of
controlled sentences (about 100 sentences each) were prepared. The five
other text sets are made up of connected sentences chosen from some
paragraphs.
Table 1. The number of sentences collected from each category
                  LENGTH = NORMAL                LENGTH = LONG
            PERP=P_L  PERP=P_M  PERP=P_H  PERP=P_L  PERP=P_M  PERP=P_H
VOC=MID         2         6         2         1         3         1
VOC=MID+        2         6         2         1         3         1
VOC=LAR         4        12         4         2         6         2
VOC=LAR+        2         6         2         1         3         1
VOC=LAR++       2         6         2         1         3         1

VOC=MID:   5k voc. without an unknown word
VOC=MID+:  5k voc. with one unknown word
    LENGTH=NORMAL: 5-19 morphemes
    LENGTH=LONG:   20-39 morphemes
    PERP=P_L:      0 < perplexity < 40
    PERP=P_M:      40 <= perplexity < 85
    PERP=P_H:      85 <= perplexity < 400

VOC=LAR:   20k voc. without an unknown word
VOC=LAR+:  20k voc. with one unknown word
VOC=LAR++: 20k voc. with two or more unknown words
    LENGTH=NORMAL: 5-29 morphemes
    LENGTH=LONG:   30-39 morphemes
    PERP=P_L:      0 < perplexity < 70
    PERP=P_M:      70 <= perplexity < 130
    PERP=P_H:      130 <= perplexity < 400
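The category definitions can be sketched as a small classifier. The names
below are hypothetical. The sketch ASSUMES that the first set of length and
perplexity thresholds in the legend applies to the 5k vocabulary classes
(MID, MID+) and the second set to the 20k classes (LAR, LAR+, LAR++), which
is how the legend is ordered.

```python
# Thresholds copied from the legend of Table 1, keyed by vocabulary size.
# ASSUMPTION: first threshold set = 5k classes, second set = 20k classes.
THRESHOLDS = {
    "5k": {"long_from": 20, "p_m_from": 40.0, "p_h_from": 85.0},
    "20k": {"long_from": 30, "p_m_from": 70.0, "p_h_from": 130.0},
}
VOC_SIZE = {"MID": "5k", "MID+": "5k",
            "LAR": "20k", "LAR+": "20k", "LAR++": "20k"}

# Sentences drawn per category, in the column order of Table 1:
# NORMAL (P_L, P_M, P_H) followed by LONG (P_L, P_M, P_H).
TABLE_1 = {
    "MID": (2, 6, 2, 1, 3, 1),
    "MID+": (2, 6, 2, 1, 3, 1),
    "LAR": (4, 12, 4, 2, 6, 2),
    "LAR+": (2, 6, 2, 1, 3, 1),
    "LAR++": (2, 6, 2, 1, 3, 1),
}


def classify(voc: str, n_morphemes: int, perplexity: float):
    """Return (LENGTH, PERP) for a sentence, or None if out of range."""
    t = THRESHOLDS[VOC_SIZE[voc]]
    if not (5 <= n_morphemes <= 39) or not (0 < perplexity < 400):
        return None
    length = "LONG" if n_morphemes >= t["long_from"] else "NORMAL"
    if perplexity < t["p_m_from"]:
        perp = "P_L"
    elif perplexity < t["p_h_from"]:
        perp = "P_M"
    else:
        perp = "P_H"
    return length, perp


sc_sentences_per_set = sum(sum(row) for row in TABLE_1.values())
print(classify("MID", 12, 55.0), sc_sentences_per_set)
```

The row sums of Table 1 add up to the 90 SC-sentences per text set
mentioned above.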
3. ATR 503 PB-Sentences
These PB-sentences were chosen by ATR Interpreting Telephony Research
Laboratories. Entropy was calculated based on the clusters of two phonemes
(120 CV's, 227 VC's and 55 VV's, making 402 clusters in all) and three
phonemes (69 CVC's where C is an unvoiced consonant, 18 CVC's where C is
a nasal consonant and 136 VCV's where C is a semivowel, making 223
clusters in all) on the assumption that they occur independently.
10,196 original sample sentences were extracted at random from newspapers,
magazines, novels, letters, textbooks, etc. Of these, 503 PB sentences
were chosen to maximize the entropy. They were sorted so that each set of
50 sentences is also phonetically balanced.
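The idea of entropy-driven selection can be illustrated as follows. This is
only a sketch: the document does not describe ATR's actual selection
procedure, so the greedy strategy, the function names and the pooling of
all clusters into a single distribution are ASSUMPTIONS made for
illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def cluster_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of the cluster frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def greedy_select(sentences: Sequence[Sequence[str]], k: int) -> List[int]:
    """Pick k sentence indices, each time adding the sentence that gives
    the highest entropy of the pooled cluster counts (greedy heuristic)."""
    chosen: List[int] = []
    pooled: Counter = Counter()
    remaining = list(range(len(sentences)))
    for _ in range(min(k, len(sentences))):
        best = max(remaining,
                   key=lambda i: cluster_entropy(pooled + Counter(sentences[i])))
        chosen.append(best)
        pooled.update(sentences[best])
        remaining.remove(best)
    return chosen


# Toy input: each sentence is given as a list of phoneme clusters.
toy = [["ka", "ka"], ["ka", "ak"], ["ak", "ki"]]
print(greedy_select(toy, 2))
```

A greedy pass is not guaranteed to find the global maximum; it merely shows
how a sentence set with a flatter cluster distribution scores higher.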
4. Orthographic transcription
The corpus includes two kinds of transcriptions. One is the Japanese
orthographic text with ruby, used as a prompting text for reading, in the
TeX format. The other is the Kana or Romaji text that represents
pronunciation of sentences. The Kana and Romaji texts were modified to
correctly transcribe recorded utterances according to the check reports
from the recording sites.
5. CD-ROM Format
The CD-ROMs are formatted according to the ISO-9660 standard. The speech
waves were digitized with a 16 kHz sampling frequency and 16 bit
quantization. They are stored with NIST SPHERE headers in a compressed
format, using the "shorten" compression technique developed by
Tony Robinson at Cambridge University/SoftSound Limited and implemented
in the NIST SPHERE PACKAGE. The latest version of the package is available
via anonymous ftp (URL=ftp://jaguar.ncsl.nist.gov/pub/sphere_x.x.tar.Z).
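A SPHERE header is plain ASCII and can be inspected without the package.
The following is a minimal sketch of a header reader, assuming the usual
SPHERE layout ("NIST_1A", the header size, then "name -type value" lines up
to "end_head"). The function name is hypothetical, and the sample header in
the example is synthetic; the fields actually present on the discs may
differ. Decoding the shorten-compressed samples still requires the SPHERE
package and is not shown.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")
    # The second 8-byte line holds the total header size in bytes.
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed lines in this sketch
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields


# Synthetic header for illustration only (values are not from the discs).
sample = (b"NIST_1A\n   1024\n"
          b"channel_count -i 1\n"
          b"sample_rate -i 16000\n"
          b"sample_n_bytes -i 2\n"
          b"end_head\n").ljust(1024, b" ")
print(parse_sphere_header(sample))
```

To use it on a real file, read the first 1024 bytes in binary mode and
pass them to the function.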
June 1997
Shuichi Itahashi
Copyright (c) of speech data:
Shuichi Itahashi (Edited by the Acoustical Society of Japan), 1997
Copyright (c) of text of the Mainichi Newspaper:
The Mainichi Newspapers, 1991-1994
Copyright (c) of morphological analysis data:
Real World Computing Partnership, 1996
Copyright (c) of ATR 503 PB-Sentences:
ATR Interpreting Telephony Research Laboratories, 1988
ASJ Continuous Speech Corpus : Vols. 1 - 16
--- Japanese Newspaper Article Sentences (JNAS) ---
Editor: Speech Database Committee
Acoustical Society of Japan
Publisher: Acoustical Society of Japan
2-7-7 Yoyogi, Shibuya, Tokyo Japan
Assisted by:
Chiba University
Doshisha University
Kyoto Institute of Technology
Kyoto University
Nagoya University
Nara Institute of Science and Technology
Osaka University
Ryukoku University
Shinshu University
Sizuoka University
Teikyo University of Science & Technology
Tohoku University
Toyohashi University of Technology
University of Electro-Communications
University of Tokyo
University of Tsukuba
Waseda University
Yamagata University
Yamanashi University
Electrotechnical Laboratory
ATR Interpreting Telecommunications Research Laboratories
Canon Inc.
Fujitsu Laboratories Ltd.
Furui Research Laboratory, NTT Human Interface Laboratories
Hitachi, Ltd.
Kokusai Denshin Denwa Co., Ltd.
Matsushita Research Institute Tokyo, Inc.
Meidensha Corporation
Mitsubishi Electric Corp.
NEC Corporation
NTT Basic Research Laboratories
NTT Data Corporation
Oki Electric Industry Co., Ltd.
Ricoh Co., Ltd.
Sanyo Electric Co., Ltd.
Sharp Corporation
Sony Corporation
Speech and Acoustics Laboratory, NTT Human Interface Laboratories
Toshiba Corporation
Acknowledgements:
The prompting texts and the bigram language models of the Mainichi
Newspaper article sentences were prepared by the Large Vocabulary Continuous
Speech Database Working Group, Special Interest Group of Spoken Language
Processing, Information Processing Society of Japan.
We used the NIST SPHERE package for attaching a header to wave files
and the 'shorten' compression technique for reducing the number of CD-ROMs.
The NIST SPHERE package was implemented by the Spoken Natural Language
Processing Group, National Institute of Standards and Technology, U.S.A.
The 'shorten' compression technique was developed by Tony Robinson at
Cambridge University and SoftSound Limited, UK.
We would like to thank all the groups and people above.
CD-ROMs produced by : Media Drive Corporation