

Arabic Language Resources

The NEMLAR Arabic LRs comprise three resources: the NEMLAR Arabic Written Corpus, the NEMLAR Arabic Broadcast News Speech Corpus and the NEMLAR Arabic Speech Synthesis Corpus. These resources are owned and copyrighted by the NEMLAR Consortium and are available through ELRA.

The NEMLAR Arabic Written Corpus consists of 500K words of Standard Arabic text compiled from 13 different domains (political news, political debate, Islamic text, common-word phrases, text from Broadcast News, business, Arabic literature, general news, interviews, scientific press, sports press, dictionary-entry explanations and legal domain text). The aim is a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The data span the period from the late 1990s to 2005. The corpus is provided in 4 different versions: a) raw text, b) fully vowelized text, c) text with Arabic lexical analysis, and d) Arabic POS-tagged text.

The NEMLAR Arabic Broadcast News Speech Corpus consists of 40 hours of transcribed Standard Arabic data (from 209 male and 50 female speakers) recorded from four different radio stations. Each daily-broadcast recording contains between 25 and 30 minutes of news and interviews. Transcriptions follow Transcriber conventions with the additional patch for Arabic: they were produced in Arabic characters, and their transliterations were generated automatically. The character set used for the transliterations follows the ISO-8859 standard. The annotation covers orthographic transcription of speech, including named entities; speakers and speaker turns; segment markers; topic/story boundaries; background noises; change of background; music/noise; and word boundaries.

The NEMLAR Arabic Speech Synthesis Corpus was produced to help build concatenative and parametric Arabic TTS systems. This corpus consists of 10 hours of annotated recorded speech from native Arabic speakers (5 hours from a male speaker and 5 hours from a female speaker). All speech data was recorded at 96 kHz, 24 bits, 2 channels (one from a highly sensitive large-membrane microphone, and the other carrying the electroglottograph (EGG) signal). The same prompt sheets were used for both the male and the female recordings. They contained 33,200 words, distributed as follows: a) 6,600 were extracted from different domains of the NEMLAR Arabic Broadcast News Speech Corpus; b) 16,500 were selected from different domains of the NEMLAR Arabic Written Corpus; c) 3,500 represented frequent Arabic phrases, and d) the remaining 6,600 aimed to cover missing and rare diphones. The full corpus comprises the following components: orthographic transcription, prosodic transcription, phonetic transcription, phonetic segmentation and pitch marks.
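
As a rough illustration of what the recording format and prompt-sheet figures above imply, the following Python sketch estimates the raw audio size and the share of each prompt source. It assumes that all 10 hours are stored as uncompressed PCM at the stated format, with no file-header overhead; the actual delivery format and size of the corpus are not stated here and may differ. The function and variable names are illustrative only.

```python
# Rough estimate only: assumes uncompressed PCM at the stated format,
# with no file headers and no compression (an assumption, not a corpus spec).
SAMPLE_RATE_HZ = 96_000   # 96 kHz
BITS_PER_SAMPLE = 24
CHANNELS = 2              # microphone + EGG signal
HOURS = 10                # 5 h male + 5 h female


def raw_pcm_bytes(sample_rate_hz, bits_per_sample, channels, hours):
    """Return the size in bytes of uncompressed PCM audio."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * hours * 3600


total_bytes = raw_pcm_bytes(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS, HOURS)
print(f"Estimated raw audio size: {total_bytes / 1e9:.1f} GB")  # → 20.7 GB

# Prompt-sheet word counts as listed in the description above.
PROMPT_WORDS = {
    "broadcast news corpus": 6_600,
    "written corpus": 16_500,
    "frequent phrases": 3_500,
    "missing and rare diphones": 6_600,
}
prompt_total = sum(PROMPT_WORDS.values())  # 33,200 words in total

for source, count in PROMPT_WORDS.items():
    print(f"{source}: {count} words ({100 * count / prompt_total:.1f}%)")
```

The four categories add up to the stated 33,200 words, with roughly half of the prompts drawn from the written corpus.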

The project was supported by the European Commission's INCO-MED programme and ran from February 1st 2003 until July 31st 2005.
