Contact: nemlar@hum.ku.dk

Unannotated corpora

Name of Corpus | Provider | Size | Other information | Availability, cost, manip. |
---|---|---|---|---|
Al-Hayat Arabic Corpus | ELRA | 18,639,264 tokens | The tokens cover 42,591 articles within 7 domains | Price: 480-1440 1,2,1 R 1,3,1 C |
An-Nahar newspapers text corpus | ELRA | 24 million words | The words are found in 45,000 articles; Arabic from Lebanon | Price: 336-1008 1,2,1 R 1,3,1 C |
Dinar-MBC | Lyon2 | 10 million words | Lit., essays, press | 3 |
Fully diacritized/vowelized text corpus | RDI | 3 million words | Multi-domain balanced coverage: literature, business, science, sport, politics, etc. | 1,4,1 |
Arabic morphologically analyzed, PoS-tagged and vowelized corpus | RDI | 750K words | Multi-domain balanced coverage: literature, business, science, sport, politics, etc. | 1,4,1 |
Monolingual unannotated | Sakhr | 2 billion words | Classified on a coarse grained subject tree | 3 |
Fully diacritised monolingual Arabic corpus for Islamic domain | Sakhr | 80 million words | | 3 |
Le Monde Diplomatique | ELRA | 75,000 480,000 words | | Price: 46-69 per year |
AFP Corpus | ELRA | 450,000 documents | | Price: To be announced |
NEMLAR Written Corpus | ELRA | 500,000 words | | Price: 150-2000 |
ArabiCorpus | Brigham Young University | 1 million words | ||
Arabic Wikipedia articles | UPV (Y. Benajiba) | 11,000 articles | | Free |
Arabic Gigaword | LDC | 400 million words | | Price: $200-3000 |
General Scientific Arabic Corpus | University of Manchester | 1.6 million words | ||
Classical Arabic Corpus | University of Manchester | 5 million words | ||
Buckwalter Arabic corpus | Tim Buckwalter | 5 million words | ||
A Corpus of Contemporary Arabic (CCA Corpus) | University of Leeds (UK) | 1 M words | | Free to download |
Arabic Newswire Corpus | LDC | 80 M words | | Price: $600-1200 |
International corpus of Arabic (ICA) | Bibliotheca Alexandrina, Egypt | 100 M words | - | |
Khaleej-2004 corpus | Mourad Abbas | 3 M words, more than 5,000 articles | Articles taken from the online newspaper Akhbar Alkhaleej | Free for research use |
Watan-2004 corpus | Mourad Abbas | About 20,000 articles | Articles taken from the online newspaper Omani 2004 | Free for research use |