Contact:
nemlar@hum.ku.dk

The NEMLAR Arabic Language Resources

The NEMLAR Arabic Language Resources comprise three main datasets: the Arabic Written Corpus, the Broadcast News Speech Corpus, and the Speech Synthesis Corpus.

Arabic Written Corpus
Contains approximately 500,000 words of Standard Arabic text collected from various domains such as news, literature, business, and scientific texts. The corpus is available in multiple formats including raw text, vowelized text, lexical analysis, and POS-tagged versions.

Broadcast News Speech Corpus
Includes 40 hours of transcribed Arabic speech from over 250 speakers, recorded from multiple radio stations. The dataset contains annotations for speakers, segments, and background sounds.

Speech Synthesis Corpus
Contains 10 hours of high-quality recorded speech from native Arabic speakers. The corpus is designed for building Arabic text-to-speech systems and includes phonetic and prosodic annotations.

These datasets were developed to support research in Arabic NLP, machine translation, and speech technologies.

The project was supported by the European Commission's INCO-MED programme and ran from February 2003 until July 2005.
