Arabic Language Resources
Over the last decade, the role of Language Resources (LR's) in the development of Human Language Technologies has shifted from a complementary one to a leading one.
It is now recognized that the investment needed to build the LR's required to train the
engine or engines powering a given language technology may well exceed what is needed to develop
the engines themselves. The trend towards maximizing the reusability of LR's has therefore grown steadily stronger, to the point that LR's have gained a considerable degree of independence as software components, which has in turn created a
standalone market for LR's.
RDI has a remarkable history of
building Arabic LR's of all kinds, using sophisticated tools that
combine efficiency with quality.
Fassieh© is one clear example of
RDI's tools for building large-scale Arabic LR's. It takes large
raw Arabic text corpora and produces structured text corpora
with the following types of annotation:
-
Arabic morphological analysis (with ArabMorpho© inside).
-
Arabic PoS tagging (with ArabTagger© inside).
-
Arabic diacritization/phonetic transcription (with ArabDiac© inside).
Fassieh© can produce these types of annotation in fully automatic mode,
and it also supports guided manual revision in a fully graphical,
interactive environment with several auxiliary tools, such as
status coloring and online lexical dictionaries. Click
here for a full-resolution screenshot of Fassieh©.
For detailed information on Fassieh©, click
here.
RDI builds mega-scale written
Arabic LR's, Arabic speech LR's, and repositories of textually
labeled scanned pages of font-written Arabic. These LR's are
produced to train and evaluate RDI's
Arabic NLP,
Arabic digital speech,
and Arabic font-written OCR
systems. RDI also builds such LR's for other
industrial and academic parties to train and test their own HLT
applications.
Examples of LR's built by RDI
for other parties are:
-
Through the NEMLAR project, RDI is currently
participating in building
aligned Arabic-English parallel corpora, tens of
millions of words in size, for open-source machine translation
systems.
-
A 700K-word balanced Arabic
corpus, annotated for the
NEMLAR project via Fassieh© with all the types of annotation mentioned above,
i.e. morphological, PoS tagging, phonological, and lexical
semantics. This corpus has been manually revised in full. For
detailed technical information, see the
specifications document of these LR's.
-
A male and female speaker database for Arabic speech synthesis, especially concatenative Text-to-Speech (TTS) systems. This LR, which was built to state-of-the-art standards and powers RDI's Arabic TTS engine,
ArabTalk©, has been modernized and made publicly accessible through the Euro-Mediterranean
NEMLAR project. For more information, see
NEMLAR's LREC06 paper.
-
A 40-hour Broadcast News Speech Corpus (BNSC) in which 259 speakers alternate. This LR was built to state-of-the-art standards and made publicly accessible through the Euro-Mediterranean
NEMLAR project. For more information, see
NEMLAR's LREC06 paper.
-
A fully annotated ASR corpus
of more than 1000 Egyptian speakers from 4 different regions
across Egypt, covering Modern Standard Arabic, Egyptian
slang, and English with an Egyptian accent. This LR was
built as part of the
Orientel project.
-
The Egyptian part of the
speech DB of the Arabic version of IBM's ViaVoice© dictation
software.