Paper summary: Measuring QQMail's automated email censorship in China (FOCI 2021)

Measuring QQMail’s automated email censorship in China
Jeffrey Knockel, Lotus Ruan
https://dl.acm.org/doi/10.1145/3473604.3474560
https://github.com/citizenlab/chat-censorship/tree/master/qqmail

The paper investigates automated keyword censorship done by QQMail, the most popular email provider in China. In common with, for example, chat platforms and online games, censorship of email messages in China is done not by the Great Firewall, but by the email provider. It happens that QQMail’s MX servers reveal whether an incoming email is censored in their SMTP reply: a censored message results in “550 Mail content denied”, while an uncensored message results in “250 OK” (if the recipient QQMail account exists) or “550 Mailbox unavailable or access denied” (if not). To test whether an email would be censored, send it to a nonexistent QQMail address and see which error message comes back. This test can be done externally and does not even require a QQMail account or control over a host in China.
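The reply-based test described above can be sketched as a small classifier. This is an illustration only, not the authors' measurement code: the names Verdict and classify_reply are hypothetical, the network side (actually delivering the message to a QQMail MX server) is left out, and the function matches on the reply texts quoted above rather than relying on the numeric code alone.

```python
from enum import Enum


class Verdict(Enum):
    CENSORED = "censored"
    NOT_CENSORED = "not censored"
    UNKNOWN = "unknown"


def classify_reply(code: int, text: str) -> Verdict:
    """Map one SMTP reply (numeric code plus text) to a censorship verdict.

    Hypothetical helper: the matched strings are the reply texts quoted in
    the summary, compared case-insensitively.
    """
    lowered = text.lower()
    if "mail content denied" in lowered:
        # The server rejected the message because of its content.
        return Verdict.CENSORED
    if code == 250 or "mailbox unavailable" in lowered:
        # Accepted, or rejected only because the recipient does not exist.
        return Verdict.NOT_CENSORED
    # Anything else (rate limiting, temporary failures, ...) is inconclusive.
    return Verdict.UNKNOWN
```

A probe sent to a nonexistent address would then be read as classify_reply(code, text) on whatever reply the server gives; any reply that matches neither text is treated as inconclusive rather than guessed at.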

QQMail’s censorship system evidently works by looking for combinations of certain keywords, all of which must be present for a message to be censored. For example, the keyword 习近平 (“Xi Jinping”) alone is not censored, nor is 复辟 (“restore monarchy”), but the combination 习近平 AND 复辟 results in a blocked message. However, it is more complicated than that. For some censored keyword combinations, there are “extenuating” keyword combinations which, if also present in the message, will disable the censorship that would otherwise occur. For example, a message containing 习近平 and 复辟 will not be censored if it additionally contains the keyword 群发机 (“group messaging machine”). The authors looked for minimal combinations of censored keywords using the component-aware binary splitting (CABS) algorithm, using text sourced from Chinese-language Twitter and from keyword lists that have been discovered in past research. The existence of extenuating keywords complicates CABS: it is necessary to iterate the algorithm until it converges. Having discovered a minimal censored combination (the ANDs), the authors looked for extenuating keywords (the AND NOTs) by again running CABS, this time with inverted logic: they appended other keywords to a known-censored combination and looked for cases of no censorship. It is necessary to iterate the inverted algorithm (which the authors call CABSE) as well, because of possible interference by other censored keywords.
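The AND / AND NOT logic above, and the idea of searching for a minimal censored combination by binary splitting, can be illustrated with a toy model. This is a simplified sketch and not the authors' CABS or CABSE: Rule, is_censored, and find_minimal_combination are hypothetical names, the oracle is a local function rather than a live SMTP probe, and the search assumes that adding keywords never turns a censored message into an uncensored one. That assumption is exactly what extenuating keywords break, which is why the real algorithm has to be iterated.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Rule:
    required: frozenset                               # the ANDs
    extenuating: list = field(default_factory=list)   # the AND NOTs (frozensets)


def is_censored(text: str, rules: Sequence[Rule]) -> bool:
    """True if some rule has every required keyword present in the text
    and none of its extenuating combinations fully present."""
    for rule in rules:
        triggered = all(k in text for k in rule.required)
        excused = any(all(k in text for k in ext) for ext in rule.extenuating)
        if triggered and not excused:
            return True
    return False


def find_minimal_combination(
    candidates: Sequence[str], oracle: Callable[[List[str]], bool]
) -> Optional[List[str]]:
    """Find one minimal subset of candidates for which oracle() is True.

    Assumes the oracle is monotone (no extenuating keywords among the
    candidates). Each round binary-searches for the shortest prefix of the
    pool that, together with the keywords found so far, is censored; the
    last keyword of that prefix must be part of the combination.
    """
    pool = list(candidates)
    if not oracle(pool):
        return None
    found: List[str] = []
    while not oracle(found):
        lo, hi = 1, len(pool)  # invariant: oracle(found + pool[:hi]) is True
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle(found + pool[:mid]):
                hi = mid
            else:
                lo = mid + 1
        found.append(pool[lo - 1])
        pool = pool[: lo - 1]
    return found


# The example from the text: 习近平 AND 复辟, unless 群发机 is also present.
RULES = [Rule(frozenset({"习近平", "复辟"}), [frozenset({"群发机"})])]


def oracle(keywords: List[str]) -> bool:
    return is_censored(" ".join(keywords), RULES)
```

Calling find_minimal_combination(["weather", "习近平", "news", "复辟", "music"], oracle) isolates the two-keyword combination using a logarithmic number of oracle calls per keyword found, instead of testing every subset.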

The tests uncovered 173 censored keyword combinations and 43 unique extenuating combinations. The set of extenuating keywords is different for each censored combination: the number of extenuations per censored combination ranges from 0 to 38. In topic, the censored combinations mostly have to do with the Xinjiang region, the Chinese Communist Party, or Xi Jinping. (Notably, they do not cover other topics known to be censored in China, such as gambling, pornography, and malware.) The puzzling thing, though, is that the extenuating keywords, which exempt a message from censorship, are themselves things that you might expect to be censored, including phrases like 法輪功 (“Falun Gong”) and 刘士辉 (“Liu Shihui”). The authors speculate about this counterintuitive behavior in Section 6, but find no fully satisfactory explanation. A further bit of weirdness is that QQMail does not censor incoming mail to a recipient that has previously sent mail to the sender, which is behavior more characteristic of a spam filter than a content filter.

Thanks to the authors for reviewing a draft of this summary.