The NEMLAR Arabic Language Resources
The NEMLAR Arabic Language Resources comprise three main datasets:
the Arabic Written Corpus, the Broadcast News Speech Corpus, and the Speech Synthesis Corpus.
Arabic Written Corpus
This corpus contains approximately 500,000 words of Standard Arabic text drawn
from domains such as news, literature, business, and science.
It is available in multiple formats, including raw text,
vowelized text, lexical analysis, and POS-tagged versions.
Broadcast News Speech Corpus
This corpus includes 40 hours of transcribed Arabic speech from more than 250 speakers,
recorded from multiple radio stations.
The dataset contains annotations for speakers, segments, and background sounds.
Speech Synthesis Corpus
This corpus contains 10 hours of high-quality recorded speech from native Arabic speakers.
It is designed for building Arabic text-to-speech systems
and includes phonetic and prosodic annotations.
These datasets were developed to support research in Arabic NLP,
machine translation, and speech technologies.