Although increasing numbers of people are taking their multilingual practices into their online communication, search engine technology had not kept pace with this trend - until UCT-trained computer scientist Dr Mohammed Mustafa Ali came up with a solution.
Ali, who hails from Sudan and whose mother tongue is Arabic, became acutely aware that the technology driving internet search engines had not kept up with the online practice of continually switching between two languages - a habit that is particularly prevalent in the non-English-speaking world.
So during his doctoral studies he put his computer science skills to work and devised algorithms that, he argues, can create "language-aware" search engines, able to recognise searches made in multiple languages. These algorithms could also potentially enable search engines to return mixed-language results, opening up new worlds of information to non-native English speakers.
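To make the idea of recognising a multi-language search concrete, here is a minimal Python sketch that flags a query whose terms are written in more than one script. It is an illustration only, not Ali's algorithm: the function names are hypothetical, and script is used as a rough stand-in for language (Arabic script is also used for Persian and Urdu, Latin script for many languages, and the sketch cannot tell English from Afrikaans).

```python
import unicodedata


def script_of(term):
    """Guess a term's script from the Unicode name of its first letter."""
    for ch in term:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "arabic"
            if name.startswith("LATIN"):
                return "latin"
            break  # first letter belongs to some other script
    return "other"


def scripts_in_query(query):
    """Return the set of scripts found among the whitespace-separated terms."""
    return {script_of(term) for term in query.split()}


def is_mixed(query):
    """True when the query contains terms in both Arabic and Latin script."""
    return len(scripts_in_query(query) - {"other"}) > 1
```

Under these assumptions, `is_mixed("تعريف inflation")` is true, while `is_mixed("definition of inflation")` is false. A real system would need proper language identification rather than a script check.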
"The algorithms proposed in this thesis address the Web searching needs of such non-English speakers, who often need the most relevant information rather than just retrieving documents containing exactly their query terms," says Ali. "The proposed algorithms could empower and present a direction for future search engines, which should allow multilingual users (and their multilingual queries) to retrieve relevant information created by other multilingual users.
"Thus, it could have significant outcomes for languages with limited modern vocabulary, mostly those non-English ones, in developing countries."
Language-mixing, or code-switching, as it is known, is common in multilingual communities in which the barrier between cultures is porous, with people using more than one language in their everyday interactions.
"In such communities, natives are able to express some keywords in languages other than their native tongue or vice versa," Ali says. "From personal experience, the typical Arabic speaker speaks a mixture of tightly integrated words in both English and Arabic, and in various slang variants.
"Hong Kong speakers typically speak Cantonese with many English words. Capetonians speak English with many scattered Afrikaans words, and/or local slang."
Current search engines and traditional information retrieval (IR) systems handle multilingual queries poorly, in most cases failing to return the most relevant documents, explains Ali. He gives two reasons for this.
"First, the underlying assumption in IR is that users post queries in their native tongues. Second, most traditional IR systems depend primarily on similarity ranking methods that are based solely on monolingual computations and statistics, without taking into account the multilingual text in multilingual queries.
"Ignorance of this feature causes the most dominant documents on the ranked retrieval list to be those documents that contain exactly the same terms as in the multilingual query, regardless of its languages."
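The exact-match bias Ali describes can be reproduced with a toy example. The Python sketch below scores three equally relevant documents against a mixed Arabic-English query by counting shared terms, a deliberately crude stand-in for monolingual similarity ranking. The documents, the two-entry glossary and the function names are all made up for illustration, and the glossary-based variant is only one conceivable remedy, not necessarily the approach taken in the thesis.

```python
def tokens(text):
    """Lower-case and split on whitespace; returns a set of terms."""
    return set(text.lower().split())


def overlap_score(query_terms, doc):
    """Baseline: count query terms that appear verbatim in the document."""
    return len(query_terms & tokens(doc))


# Three hypothetical documents on the same topic ("definition of inflation
# in economics"): one mixed-language, one Arabic, one English.
DOCS = {
    "mixed": "تعريف inflation في الاقتصاد",
    "arabic": "تعريف التضخم في الاقتصاد",
    "english": "definition of inflation in economics",
}

# Hypothetical glossary mapping each query term to its cross-language equivalent.
GLOSSARY = {"inflation": "التضخم", "تعريف": "definition"}


def expanded_score(query_terms, doc):
    """Count query terms matched directly or through the glossary."""
    doc_tokens = tokens(doc)
    return sum(
        1
        for term in query_terms
        if term in doc_tokens or GLOSSARY.get(term) in doc_tokens
    )


query = tokens("تعريف inflation")
baseline = {name: overlap_score(query, text) for name, text in DOCS.items()}
expanded = {name: expanded_score(query, text) for name, text in DOCS.items()}
```

With the baseline score, only the mixed-language document matches both query terms, so it tops the list even though the Arabic and English documents cover the same topic. Once each term is also allowed to match its equivalent in the other language, all three documents score equally.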
Ali maintains that his algorithms will enable non-native English speakers to access large swathes of online knowledge previously difficult to find.
"With information globalisation and moving towards an international community, it becomes essential not to constrain non-English-speakers, such as Arabic users, to single languages," he says. "There are many problems introduced by the explicit handling of multiple languages, but the algorithms and experiments conducted demonstrate that these problems can be adequately resolved in an IR system.
"The evidence suggests that language-awareness and mixed-language solutions are feasible for IR systems, without diminishing the quality of results."
Story by Yusuf Omar.
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.