Contact: nemlar@hum.ku.dk

Parallel corpora

Name of Corpus | Provider | Size | Language | Other information | Availability, price, manip. |
---|---|---|---|---|---|
EGYPT Giza Toolkit Quran Parallel Corpus | CLSP/JHU | | ar-en | Free | |
Sentence aligned bilingual Arabic English corpus | Sakhr | 1.35 million sentences | ar-en, en-ar | 3 | |
CLARA (Corpus Linguae Arabicae) | Charles University | 37 million words | ar-cz | | |
Bilingual aligned corpus | ILC | | ar-it | | |
Arabic English Parallel News Part 1 (Ummah) | LDC | 2 million words | ar, en | Catalog no.: LDC2004T18 | $1500-3000 |
Arabic News Translation Text Part 1 | LDC | 441,000 words | ar, en | Catalog no.: LDC2004T17 | $1500-3000 |
Arabic Newswire English Translation Collection | LDC | 551,000 words | ar-en | Catalog no.: LDC2009T22 | $1500 |
Multiple Translation Arabic part 1 | LDC | 23,000 words | ar,en | Catalog no.: LDC2003T18 | $500-1000 |
Multiple Translation Arabic part 2 | LDC | 15,000 words | ar,en | Catalog no.: LDC2005T05 | $500-1000 |
TDT4 Multilanguage Text and Annotation | LDC | | ar, en, ch | Catalog no.: LDC2005T16 | $200-2000 |
TDT5 Multilanguage Text | LDC | | ar, en, ch | Catalog no.: LDC2006T18 | $200-750 |
GALE Phase 1 Arabic Blog Parallel Text | LDC | unknown | ar, en | Catalog no.: LDC2008T02 | $1500 |
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | LDC | 90,000 words | ar, en | Catalog no.: LDC2007T24 | $1500 |
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 | LDC | 56,000 words | ar, en | Catalog no.: LDC2008T09 | $1500 |
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | LDC | unknown | ar, en | Catalog no.: LDC2009T03 | $1500 |
ISI Arabic-English Automatically Extracted Parallel Text | LDC | 1.1 million sentence pairs | ar, en | Catalog no.: LDC2007T08 | $2000-4000 |
Holy Quran book | islamware | 78,500 words | ar, en, fr, de, es | Free | |
E-A Parallel Corpus | University of Kuwait | 2 million words | en-ar | | |
Multilingual Corpus | University of Manchester | 11.5 million words | ar, en | | |
STRAND En-Ar Parallel web pages (tool and corpus) | University of Maryland | 2190 URL pairs | en-ar | Free | |
Nijmegen Corpus | Nijmegen University | 2 million words | ar-dutch | 130? | |
OPUS KDE Open source products' manuals | OPUS e.g. EuroMatrix | 300,000 tokens | af, ar, az, be, bg, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fi, fr, ga, gl, he, hr, hu, id, is, it, ja, ko, ku, lt, lv, mi, mk, mt, nb, nl, nn, oc, pl, pt, ro, ru, sk, sl, sr, sv, ta, th, tr, uk, ven, vi, wa, xh, zu | Free | |
United Nations General Assembly Resolutions | Alexandre Rafalovitch, Robert Dale | Ar: 2,721,463 words | ar, en, fr, sp, ru, ch | Free for research purposes | |
Meedan Translation Memory | Meedan | 20 K sentence pairs | ar, en | | Open Database License (ODbL) |
Microsoft Terminology | Microsoft | 12 K expressions | ar, en, fr, gr, ch | | |
Arcade II - Evaluation Package | ELRA | 316,000 words | ar,fr | Le Monde Diplomatique aligned sentences | 150-1000 |
CESTA Evaluation Package | ELRA | 60,000 words | ar,fr | The two corpora from Le Monde Diplomatique and from the UNICEF, WHO and FHI websites translated from 1 to 4 times | 150-1000 |