Chinese researchers say they’ve developed an AI text censor that's 91 pct accurate

Stephen Chen, South China Morning Post

Posted at Apr 14 2021 12:14 PM

A research team in China claims to have developed a text censor that can filter "harmful information" on the internet with unprecedented accuracy using artificial intelligence.

Traditional machine censors rely mainly on keywords to do this and struggle to achieve 70 per cent accuracy, while AI technology - which needs to be trained by humans - has taken that to about 80 per cent in recent years.

The team from Shenyang Ligong University and the Chinese Academy of Sciences say their AI technology does not need to be trained by humans and "outperforms other approaches" to achieve more than 91 per cent accuracy.

It would be particularly useful to "identify and filter sensitive information from online news media", lead researcher Li Shu and her colleagues wrote in a paper published in the Journal of Chinese Computer Systems on Monday.

China has more than 900 million internet users, more than any other country, and is building the world's largest 5G networks to boost communication speed. But the internet is tightly controlled: many sites are blocked, including Google, Facebook, Twitter and some foreign news outlets, and much content is banned on the sites that remain available.

Prohibited topics are wide-ranging - from pornography to cults, drug abuse, firearm use, terrorism and attacks on the Communist Party and its top leaders.

But identifying them is a challenge for computers. Chinese is one of the most complex languages in the world, with nearly 10,000 characters. And sensitive words - gun, for example - could get picked up in a non-sensitive context, triggering a false alarm, or illegal information could be posted online without the use of any sensitive words.
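
The two failure modes described above can be illustrated with a minimal sketch of a keyword-based filter. The keyword list, sample sentences and function name here are hypothetical and chosen only for illustration; the article does not describe any real word list or implementation.

```python
# Hypothetical one-word keyword list, used only to illustrate the idea.
SENSITIVE_KEYWORDS = {"gun"}


def keyword_flag(text: str) -> bool:
    """Flag text if any keyword appears as a substring, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SENSITIVE_KEYWORDS)


# Invented sample sentences, one for each failure mode.
samples = {
    # Keyword present in a harmless context: the filter raises a false alarm.
    "false alarm": "The museum displays a 19th-century gun collection.",
    # No keyword present, so the filter lets the text through.
    "missed": "Selling unlicensed firearms, message me privately.",
}

results = {label: keyword_flag(text) for label, text in samples.items()}
print(results)
```

A filter of this kind looks only at surface strings, which is why it cannot tell a museum listing from an illegal sale.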

The Chinese government and internet companies have instead relied on a huge army of censors to manually vet online content, but that approach is too costly and inefficient to keep pace with the growth of information on China's internet and social media.

Li, an associate professor of computer science at Shenyang Ligong University, said the technology developed by her team could keep up with the fast-evolving language used online in China, with a powerful dictionary containing not only sensitive words but their changing forms.

She said it could also read between the lines when searching for illegal content hidden in a different context, improving its ability to identify text written to bypass machine censors. Many internet users in China avoid using sensitive words and instead use homonyms or add hyphens between characters to confound the censors.

Part of the team's text censor technology came from Google, Li said. In 2018, Google released an open-source language model known as bidirectional encoder representations from transformers, or BERT, which it has since used to help its search engine better understand users' search terms. BERT can read a word in different contexts - such as "running a business" versus "running a marathon" - as a result of reading huge text databases including the entire Wikipedia site.

But BERT is not a censor by design and cannot process text longer than 512 tokens, the words and word fragments it reads as units. To make it work, Li's machine breaks a long text into segments, lets BERT read the shorter parts and uses another AI tool to combine the results and assess them using the most up-to-date dictionary.
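
The segment-and-combine idea can be sketched in a few lines. This is a rough illustration under stated assumptions, not the team's method: the scoring function below is a toy stand-in rather than BERT, and taking the highest segment score is an assumed combination rule, since the article says only that "another AI tool" merges the results. All names are hypothetical.

```python
from typing import Callable, List

MAX_TOKENS = 512  # BERT's input limit; real inputs also reserve slots for special tokens


def split_into_segments(tokens: List[str], max_len: int = MAX_TOKENS) -> List[List[str]]:
    """Cut a token list into consecutive segments of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def score_long_text(
    tokens: List[str],
    score_segment: Callable[[List[str]], float],
    max_len: int = MAX_TOKENS,
) -> float:
    """Score each segment separately, then combine the scores.

    Assumption: the combination step is a simple maximum over segments.
    """
    segments = split_into_segments(tokens, max_len)
    if not segments:
        return 0.0
    return max(score_segment(segment) for segment in segments)


def toy_scorer(segment: List[str]) -> float:
    """Stand-in for a model: the share of tokens equal to 'flagged'."""
    return segment.count("flagged") / len(segment)


# A made-up 1,100-token document, too long to score in one pass.
tokens = ["ok"] * 1000 + ["flagged"] * 100
segments = split_into_segments(tokens)
overall = score_long_text(tokens, toy_scorer)
print([len(segment) for segment in segments], overall)
```

Taking the maximum means a single high-scoring segment determines the score for the whole document; averaging the segment scores would be another possible rule, and would dilute short passages in long texts.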

Google did not respond to a request for comment.

China is investing heavily in artificial intelligence and the technology is increasingly becoming part of everyday life in China - from e-commerce to public spaces where surveillance cameras are equipped with facial recognition, to military uses.

Copyright (c) 2021. South China Morning Post Publishers Ltd. All rights reserved.