ICLab: A Global, Longitudinal Internet Censorship Measurement Platform
Arian Akhavan Niaki, Shinyoung Cho, Zachary Weinberg, Nguyen Phong Hoang, Abbas Razaghpanah, Nicolas Christin, Phillipa Gill
The paper is a high-level description of the Internet censorship measurement platform ICLab and a retrospective of two years of operation. ICLab’s object of measurement is HTTP and HTTPS URLs. The system is designed to detect DNS manipulation, TCP injection, and HTTP block pages.
The ICLab platform mainly uses commercial VPNs as vantage points. Of its 281 vantages, 264 are VPNs, and 17 are volunteer-operated Raspberry Pis. The use of VPNs as vantages offers different tradeoffs compared to other methods of censorship measurement. Some of the benefits of VPNs are that they are comparatively reliable and easy to maintain, they permit access to low-level network details in packet captures, and there are fewer ethical considerations than there are with human volunteers. The drawbacks are that you can only do measurements where commercial VPN services exist, and VPN networks may not be censored in the same way as typical residential networks. Something to watch out for is that VPN endpoints that are advertised to be in a specific country may not actually be in that country; the authors deal with this by verifying geolocation using round-trip times and excluding endpoints that may not be where they claim. ICLab’s VPN-based approach may be contrasted with the remote-measurement techniques of Censored Planet, which have good geographic coverage but are more limited in what measurements they can do; and the volunteer-operated probes of OONI, which must take extra care not to put their human operators at risk or overtax their resources.
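To give a feel for how round-trip times can rule out a claimed location, here is a minimal sketch of a speed-of-light consistency check. It is not the authors' actual procedure, which this summary does not describe in detail. The function names, the landmark format, and the propagation speed of about 200 km per millisecond (roughly two thirds of the speed of light, a common approximation for fiber) are all my own assumptions.

```python
import math

# Assumption: signals in fiber cover roughly 200 km per millisecond
# (about 2/3 of the speed of light in vacuum).
FIBER_KM_PER_MS = 200.0
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def claim_is_plausible(claimed, landmarks):
    """Return False if any landmark's RTT is too short for the claimed location.

    claimed   -- (lat, lon) advertised for the VPN endpoint
    landmarks -- list of (lat, lon, rtt_ms): hosts at known locations and the
                 round-trip time measured between each host and the endpoint
                 (hypothetical input format)
    """
    for lat, lon, rtt_ms in landmarks:
        # The one-way trip takes at most half the RTT, which bounds the distance.
        max_km = (rtt_ms / 2.0) * FIBER_KM_PER_MS
        if great_circle_km(claimed[0], claimed[1], lat, lon) > max_km:
            return False
    return True
```

For example, an endpoint advertised in Tehran that answers a host in Frankfurt in 5 ms cannot be in Tehran, since 5 ms allows at most about 500 km one way. A check like this can only exclude claims; a slow RTT proves nothing, because queuing and indirect routes inflate delay.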
An ICLab measurement comprises a GET request to a URL, from start to finish and including all intermediate steps: DNS resolution, TCP connection, TLS certificates, HTTP request and response, and a traceroute. Measurement results are accompanied by a full packet capture. The measurement is compared to a measurement from one or more control sites and examined for evidence of DNS manipulation, TCP injection, or block pages. DNS manipulation is detected when the same name resolves to two different ASes in two separate queries; when a name resolves to different ASes in the vantage and a control; or when a name resolves at a control but experiences an error at the vantage. TCP injection is detected when there is a sequence number overlap that contains a RST, FIN, or block page; or when a URL is accessible at a control but receives a RST or ICMP error at the vantage. Block pages are detected by first looking for complete HTTP responses that fit in one TCP segment. The HTML of a potential block page is compared against about 300 handwritten regular expressions, and clustered with known block pages by structural and textual similarity. Another block page classification technique I quite like is the URL-to-country ratio: for a given page, divide the number of URLs that may yield it by the number of countries it has been seen in; higher scores are more likely to be block pages. The idea is that block pages tend to be specific to one country, and apply to a large number of URLs. The paper reports the discovery of previously unknown block pages.
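The URL-to-country ratio is simple enough to sketch in a few lines. This is only an illustration of the arithmetic described above, not ICLab's code. The input format is my own assumption: each observation is a (page_id, url, country) tuple, where page_id is some hypothetical identifier for a response page, such as a cluster label or a hash of the body.

```python
from collections import defaultdict


def url_to_country_ratio(observations):
    """Map each page to (number of distinct URLs) / (number of distinct countries).

    observations -- iterable of (page_id, url, country) tuples (assumed format)
    """
    urls = defaultdict(set)
    countries = defaultdict(set)
    for page_id, url, country in observations:
        urls[page_id].add(url)
        countries[page_id].add(country)
    # Every page_id seen has at least one country, so the division is safe.
    return {page: len(urls[page]) / len(countries[page]) for page in urls}
```

A page returned for many unrelated URLs in a single country scores high, while an ordinary page served for one URL in many countries scores low. The score only ranks candidates; where to set a threshold is a separate judgment that this sketch does not make.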
The authors give highlights of ICLab results from January 2017 to September 2018, a total of 54 million measurements of 45 thousand URLs in 62 countries. The countries that stand out as having the highest levels of URL censorship are Iran, South Korea, Saudi Arabia, India, and Kenya. A snapshot of the data supporting the paper is available at https://iclab.org/post/iclab_data/, which links to https://archive.org/details/@iclab and https://drive.google.com/drive/folders/1jjSNjIIB0EWDachhaheHePXjvLQeNHkr.
Thanks to Phillipa Gill for reviewing a draft of this summary.