Arabic Language Resources

During the last decade, the role of Language Resources (LR's) in the development of Human Language Technologies has shifted from a complementary one to a leading one.

It is now recognized that the investment needed to build the LR's required to train the engines powering a given language technology may well exceed what is needed to develop the engines themselves. The trend towards maximizing the reusability of LR's has therefore grown steadily stronger, so that LR's have become largely independent software components, which has in turn created a standalone market for LR's.

RDI has a remarkable history in building Arabic LR's of all kinds using sophisticated tools that bring efficiency together with quality.

Fassieh© is one clear example of RDI's tools for building large-scale Arabic LR's. It takes bulky raw Arabic text corpora and produces structured text corpora with the following types of annotations: morphological, part-of-speech (PoS) tagging, phonological, and lexical semantics.

While Fassieh© can produce these types of annotations in a fully automatic mode, it also allows guided manual revision in a fully graphical, interactive environment with several auxiliary tools such as status coloring and online lexical dictionaries.


RDI builds mega-scale written Arabic LR's, Arabic speech LR's, and repositories of textually labeled scanned pages of Arabic font-written text. These LR's are produced to train and evaluate RDI's Arabic NLP, Arabic digital speech, and Arabic font-written OCR systems. However, RDI also builds such LR's for other industrial and academic parties to train and test their own HLT applications.

Examples of LR's built by RDI for other parties are:

  • RDI participated, through the MEDAR project, in building aligned Arabic-English parallel corpora totaling tens of millions of words for baseline open-source Statistical Machine Translation systems.

  • A 750K-word balanced Arabic corpus annotated for the NEMLAR project via Fassieh© with all the types of annotations mentioned above, i.e. morphological, PoS tagging, phonological, and lexical semantics. This corpus has been manually revised in full. For detailed technical info, see the specifications document of these LR's.

  • A male and female speakers DB for Arabic speech synthesis, especially concatenative Text-to-Speech (TTS) systems. This LR, which was built to state-of-the-art standards and powers RDI's Arabic TTS engine, ArabTalk©, has been modernized and made publicly accessible through the Euro-Mediterranean NEMLAR project. For more info, see NEMLAR's LREC06 paper.

  • A 40-hour Broadcast News Speech Corpus (BNSC) in which 259 speakers alternate. This LR was built to state-of-the-art standards and made publicly accessible through the Euro-Mediterranean NEMLAR project. For more info, see NEMLAR's LREC06 paper.

  • A fully annotated ASR corpus of more than 1000 Egyptian speakers from 4 different regions across Egypt, covering Modern Standard Arabic, colloquial Egyptian Arabic, and Egyptian-accented English. This LR was built as part of the Orientel project.

  • The Egyptian part of the speech DB of the Arabic version of IBM's ViaVoice© dictation software.

RDI© - Research and Development International.
Since 1993 - All rights reserved.
www.rdi-eg.com