The orthography of many African languages includes diacritically
marked characters. Because these characters fall outside the basic Latin
character set, digital language resources often represent them as their
unmarked equivalents. This renders corpus
compilation more difficult, as these languages typically do not have
the benefit of large electronic dictionaries to perform diacritic
restoration.
This is a demonstration system for a diacritic restoration
method that automatically restores diacritics on the
basis of local graphemic context. It is based on memory-based
learning, a machine learning method. We have applied the
method to the African languages of Cilubà, Gĩkũyũ, Kĩkamba, Maa,
Sesotho sa Leboa, Tshivenḓa and Yoruba.
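
To make the idea of restoring diacritics from local graphemic context more concrete, the following is a minimal sketch, not the implementation behind this demonstration system. It assumes a plain nearest-neighbour classifier with a simple overlap metric over a fixed window of characters; the names `split_clusters`, `windows`, `overlap` and `MemoryBasedRestorer`, the window width of 2 and the toy training text are all hypothetical choices made for illustration.

```python
import unicodedata
from collections import Counter


def split_clusters(text):
    """Split text into clusters: a base character plus its combining marks."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


def windows(bases, width):
    """One window per character: `width` neighbours on each side, padded with '_'."""
    padded = ["_"] * width + list(bases) + ["_"] * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(bases))]


def overlap(a, b):
    """Number of window positions at which two windows agree."""
    return sum(x == y for x, y in zip(a, b))


class MemoryBasedRestorer:
    """Toy k-nearest-neighbour restorer (an illustrative assumption)."""

    def __init__(self, width=2, k=1):
        self.width = width
        self.k = k
        self.memory = {}  # focus character -> list of (window, marked label)

    def train(self, marked_text):
        # Store every training character with its unmarked context window.
        clusters = split_clusters(marked_text)
        bases = [c[0] for c in clusters]
        for window, cluster in zip(windows(bases, self.width), clusters):
            label = unicodedata.normalize("NFC", cluster)
            self.memory.setdefault(window[self.width], []).append((window, label))

    def restore(self, plain_text):
        # Input is assumed to contain no diacritics.
        out = []
        for window in windows(list(plain_text), self.width):
            focus = window[self.width]
            candidates = self.memory.get(focus)
            if not candidates:
                out.append(focus)  # unseen character: leave unchanged
                continue
            ranked = sorted(candidates, key=lambda inst: -overlap(window, inst[0]))
            votes = Counter(label for _, label in ranked[:self.k])
            out.append(votes.most_common(1)[0][0])
        return "".join(out)


restorer = MemoryBasedRestorer(width=2, k=1)
restorer.train("Cilubà Gĩkũyũ Kĩkamba Tshivenḓa")  # toy training text
print(restorer.restore("Gikuyu"))
```

Only stored examples that share the focus character are compared, so the sketch can add or withhold diacritics but never changes the base letter. A realistic system would need far more training text than this toy example and would likely use a more refined similarity metric.
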
You can find more information about this method in this paper.