The NEMLAR Arabic Language Resources
The NEMLAR Arabic LRs comprise three resources: the NEMLAR Arabic
Written Corpus, the NEMLAR Arabic Broadcast News Speech Corpus and the
NEMLAR Arabic Speech Synthesis Corpus. These resources are owned and
copyrighted by the NEMLAR Consortium and are available through ELRA.
The NEMLAR Arabic Written Corpus consists of 500K words of Standard
Arabic text compiled from 13 different domains (political news,
political debate, Islamic text, common-word phrases, text from
Broadcast News, business, Arabic literature,
general news, interviews, scientific press, sports press,
dictionary-entry explanations and legal domain text). The aim is a
well-balanced corpus that represents the variety of syntactic, semantic
and pragmatic features of the modern Arabic language. The data span the
late 1990s to 2005. The corpus is provided in four versions: a) raw
text, b) fully vowelized text, c) text with Arabic lexical analysis,
and d) Arabic POS-tagged text.
The NEMLAR Arabic Broadcast News Speech Corpus consists of 40 hours of
transcribed Standard Arabic data (from 209 male and 50 female speakers)
recorded from four different radio stations. Each daily-broadcast
recording contains between 25 and 30 minutes of news and interviews.
Transcriptions follow the Transcriber conventions, with an additional
patch for Arabic. Transcriptions were thus produced in Arabic
characters, and their transliterations were generated automatically.
The character set used for the transliterations follows the ISO-8859
standard. The annotation levels cover orthographic transcription of
speech, including named entities; speakers and speaker turns; segment
markers; topic/story boundaries; background noises; change of
background; music/noise; and word boundaries.
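As an illustration of how speaker turns and segment markers can be read
back from such transcriptions, the sketch below parses a small
hand-written sample in the XML layout used by Transcriber (.trs files).
The sample content, the speaker names and the function name read_turns
are hypothetical and are not taken from the NEMLAR corpus. The sketch
also assumes one speaker per turn and ignores text that follows
non-Sync elements such as events or background markers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the Transcriber .trs layout.
# It is NOT an excerpt from the NEMLAR corpus.
SAMPLE_TRS = """<Trans audio_filename="example" version="1">
  <Speakers>
    <Speaker id="spk1" name="Anchor" type="male"/>
    <Speaker id="spk2" name="Reporter" type="female"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0.0" endTime="9.5">
      <Turn speaker="spk1" startTime="0.0" endTime="4.2">
        <Sync time="0.0"/>
        first segment
        <Sync time="2.1"/>
        second segment
      </Turn>
      <Turn speaker="spk2" startTime="4.2" endTime="9.5">
        <Sync time="4.2"/>
        third segment
      </Turn>
    </Section>
  </Episode>
</Trans>
"""


def read_turns(trs_text):
    """Return one dict per speaker turn with its timed text segments."""
    root = ET.fromstring(trs_text)
    # Map speaker ids to the names declared in the <Speakers> header.
    speakers = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    turns = []
    for turn in root.iter("Turn"):
        segments = []
        for child in turn:
            # The text of a segment is the tail that follows its <Sync/> marker.
            if child.tag == "Sync" and child.tail and child.tail.strip():
                segments.append((float(child.get("time")), child.tail.strip()))
        turns.append({
            "speaker": speakers.get(turn.get("speaker")),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "segments": segments,
        })
    return turns


turns = read_turns(SAMPLE_TRS)
for t in turns:
    print(t["speaker"], t["start"], t["end"], t["segments"])
```

Real transcription files would need further handling for overlapping
speech and for the event and background elements mentioned above.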
The NEMLAR Arabic Speech Synthesis Corpus was produced to help build
concatenative and parametric Arabic TTS systems. This corpus consists
of 10 hours of annotated recorded speech from native Arabic speakers (5
hours from a male speaker and 5 hours from a female speaker). All
speech data was recorded at 96 kHz, 24 bits, 2 channels (one from a
highly sensitive large-membrane microphone, and the other carrying the
electroglottograph (EGG) signal). The same prompt sheets were used for
the male and female recordings. They contained 33,200 words,
distributed as follows: a) 6,600 extracted from different domains of
the NEMLAR Arabic Broadcast News Speech Corpus; b) 16,500 selected from
different domains of the NEMLAR Arabic Written Corpus; c) 3,500
representing frequent Arabic phrases; and d) the remaining 6,600 aimed
at covering missing and rare diphones. The full
corpus comprises the following components: orthographic transcription,
prosodic transcription, phonetic transcription, phonetic segmentation
and pitch marks.
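The prompt-sheet figures above can be cross-checked with a few lines of
arithmetic. The sketch below sums the four word counts given in the
text and derives each source's share of the total. The dictionary keys
and the function name prompt_shares are illustrative labels only.

```python
# Word counts per prompt source, as stated in the corpus description.
# The key names are illustrative labels, not official identifiers.
PROMPT_SOURCES = {
    "broadcast_news_corpus": 6_600,
    "written_corpus": 16_500,
    "frequent_phrases": 3_500,
    "missing_and_rare_diphones": 6_600,
}


def prompt_shares(sources):
    """Return (total word count, {source: percentage share of the total})."""
    total = sum(sources.values())
    shares = {name: round(100 * count / total, 1)
              for name, count in sources.items()}
    return total, shares


total, shares = prompt_shares(PROMPT_SOURCES)
print(total)  # 33200, matching the stated size of the prompt sheets
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The four parts add up to the stated 33,200 words, with roughly half of
the prompts drawn from the Written Corpus.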