How we’re making the tools to connect isiXhosa and isiZulu to the digital age

23 March 2018 | Story: Maria Keet | Read time: 7 min.
Researchers and scientists are investigating and creating language technologies, including spellcheckers for the Nguni languages isiZulu and isiXhosa.

We live in a world where around 7,000 languages are spoken, and one where information and communication technologies are becoming increasingly ubiquitous. This places increasing demands on more, and more advanced, Human Language Technologies (HLTs).

These technologies comprise computational methods, computer programmes and electronic devices that are specialised for analysing, producing or modifying texts and speech.

Engaging with a language like English is made easier thanks to the many tools that support you, such as spellcheckers in browsers and autocomplete for text messages. This is mainly because English has a relatively simple and well-investigated grammar, more data that software can learn from, and substantial funding to develop tools. The situation is somewhat to very different for most other languages in the world.

This is beginning to change. Profit-driven multinationals such as Google, Facebook and Microsoft, for instance, have invested in the development of HLTs for African languages as well.

Researchers and scientists, myself included, are also investigating and creating these technologies. This work has direct relevance for society: languages, and the identities and cultures intertwined with them, are a national resource for any country. In a country like South Africa, learning different languages can foster cohesion and inclusion.

Just learning a language, however, is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when the search engine algorithms can’t process the words properly anyway and so won’t return the results you’re looking for? Where are the spellcheckers to assist you in writing emails, school essays, or news articles?

That’s why we have been both laying theoretical foundations and creating proof-of-concept tools for several South African languages. This includes spellcheckers for isiZulu and isiXhosa and the generation of text, mainly in these languages, from structured input.

Using rules of the language to develop tools

Tool development for the Nguni group of languages – and isiZulu and isiXhosa in particular – wasn’t simply a case of copy-and-pasting tools from English. I had to develop novel algorithms that can handle the quite different grammar. I have also collaborated with linguists to figure out the details of each language.

For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) of the nouns, using the language’s characteristic noun class system. In English, syntax-based rules alone can do the job.
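To make this concrete, here is a minimal sketch (a toy of my own construction, using a simplified subset of noun classes and ignoring the phonological conditioning of real isiZulu morphology) of why the noun class, and not the spelling alone, determines the plural:

```python
# Toy sketch: a simplified isiZulu pluraliser. The noun class
# (semantic information) is required because the spelling alone
# is ambiguous. Real morphology involves phonological conditioning
# that this sketch ignores.

# singular noun class -> (singular prefix, plural prefix)
CLASS_PAIRS = {
    1: ("umu", "aba"),   # class 1 -> 2, e.g. umuntu -> abantu
    3: ("umu", "imi"),   # class 3 -> 4, e.g. umuthi -> imithi
    7: ("isi", "izi"),   # class 7 -> 8, e.g. isitsha -> izitsha
    9: ("in",  "izin"),  # class 9 -> 10, e.g. inja -> izinja
}

def pluralise(noun: str, noun_class: int) -> str:
    """Swap the singular class prefix for the plural one."""
    singular, plural = CLASS_PAIRS[noun_class]
    assert noun.startswith(singular), "noun/class mismatch"
    return plural + noun[len(singular):]

# Classes 1 and 3 share the prefix 'umu-', so syntax alone cannot
# choose between aba- and imi-; the noun class is needed:
print(pluralise("umuntu", 1))  # abantu ('people')
print(pluralise("umuthi", 3))  # imithi ('trees')
```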

Rule-based approaches are also preferred for morphological analysers, which split each word into its constituent parts, and for natural language generation. Natural language generation involves taking structured data, information or knowledge, such as the numbers in the columns in a spreadsheet, and creating readable text from them.

A simple way of realising that is to use templates where the software slots in the values given by the data or the logical theory. This is not possible for isiZulu, because the sentence constituents are context-dependent.
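The following sketch illustrates both halves of that point (the data, the template, and the concords are simplified examples of my own; only a small subset of the subject concords is shown):

```python
# Part 1: template-based generation -- slot data values into a fixed
# string. For English weather-style text this just works. (Toy data.)
row = {"city": "Cape Town", "temp_max": 24}
print("In {city} the maximum will be {temp_max} degrees.".format(**row))

# Part 2: the same trick breaks down for isiZulu, because sentence
# constituents are context-dependent: the verb carries a subject
# concord that must agree with the subject's noun class, so there is
# no single fixed verb form to put in the template.
SUBJECT_CONCORD = {1: "u", 2: "ba", 9: "i", 10: "zi"}  # small subset

def runs(noun: str, noun_class: int) -> str:
    # present tense of -gijima ('run'), long form with -ya-
    return f"{noun} {SUBJECT_CONCORD[noun_class]}yagijima"

print(runs("umuntu", 1))   # umuntu uyagijima   ('the person runs')
print(runs("abantu", 2))   # abantu bayagijima  ('the people run')
print(runs("izinja", 10))  # izinja ziyagijima  ('the dogs run')
```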

A grammar engine is needed to generate even the most basic sentences correctly. We have worked out the core aspects of the workflow in the engine, which is now being extended with more details of the verbs.

Using lots of text to develop tools

The rule-based approach is resource-intensive. This, in combination with the global hype around “Big Data”, has brought data-driven approaches to the fore.

The hope is that better quality tools may now be developed with less effort and that it will be easier to reuse those tools for related languages. This can work, provided one has a lot of good quality text, referred to as a corpus.

Such corpora are being developed, and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool computational resources. We investigated the effect of the corpus on the quality of an isiZulu spellchecker, which showed that a statistics-driven language model trained on old texts like the Bible does not transfer well to modern-day texts such as news items from the Isolezwe newspaper, nor vice versa.

The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation of isiZulu.

Its algorithms use trigrams and the probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, rather than a dictionary-based approach, which is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding them a small isiXhosa corpus: they achieved about 80% accuracy even without optimisations.
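A stripped-down sketch of that idea (an illustrative toy, not the published implementation; the padding, smoothing and scoring choices here are my own assumptions):

```python
import math
from collections import Counter

def trigrams(word: str):
    padded = f"##{word}#"  # pad so the word edges form trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words):
    """Count character trigrams over a corpus of words."""
    counts = Counter(t for w in words for t in trigrams(w))
    return counts, sum(counts.values())

def score(word, counts, total, alpha=1.0):
    """Average add-alpha smoothed log-probability of the word's
    trigrams; words scoring below a corpus-tuned threshold would be
    flagged as probable misspellings."""
    grams = trigrams(word)
    vocab = len(counts)
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab))
               for t in grams) / len(grams)

# Toy corpus; the real checker is trained on a large isiZulu corpus.
counts, total = train("abantu abahle bayafunda izincwadi ezinhle".split())
print(score("abantu", counts, total))  # higher: its trigrams were seen
print(score("qqqxyz", counts, total))  # lower: unseen trigram sequences
```

No dictionary of full words is needed, which is what makes this approach workable for agglutinating languages, where a single ‘word’ can pack in many morphemes.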

Data-driven approaches are also pursued in tools for finding information online, i.e., to develop search engines akin to a ‘Google for isiZulu’. Algorithms for data-driven machine translation, on the other hand, can easily be misled by the out-of-domain training data from which they have to learn the patterns.

Relevance for South Africa

This sort of natural language generation could be incredibly useful in South Africa. The country has 11 official languages, with English as the language of business. That has resulted in the other 10 being sidelined, in particular those that were already under-resourced.

This trend runs counter to citizens’ rights and the state’s obligations as outlined in the Constitution. These obligations go beyond just promoting language. Take, for instance, the right to have access to the public health system. One study showed that only 6% of patient-doctor consultations were held in the patient’s home language. The other 94% essentially didn’t receive the quality of care they deserved because of language barriers.

The sort of research I’m working on with my team can help. It could contribute, among other things, to realising technologies such as automatically generating patient discharge notes in one’s own language, text-based weather forecasts, and online language-learning exercises.

Maria Keet is Senior Lecturer in the Department of Computer Science at the University of Cape Town.

This article was published in The Conversation, a collaboration between editors and academics to provide informed news analysis and commentary. Its content is free to read and republish under Creative Commons; media who would like to republish this article should do so directly from its appearance on The Conversation, using the button in the right-hand column of the webpage. UCT academics who would like to write for The Conversation should register with them; you are also welcome to find out more from carolyn.newton@uct.ac.za.

A longer version of this article, along with figures, can be found on Maria Keet's blog.

For licensing information please visit the source website.
