How China Detects and Blocks Shadowsocks

gfw.report · December 29, 2019, 3:53pm

How China Detects and Blocks Shadowsocks

Authors: Anonymous, Anonymous, Anonymous, David Fifield, Amir Houmansadr

Date: Sunday, December 29, 2019

This report first appeared on GFW Report. We also maintain an up-to-date copy of the report on both net4people and ntc.party.

Shadowsocks is one of the most popular circumvention tools in China. Since May 2019, there have been numerous anecdotal reports of the blocking of Shadowsocks from Chinese users. This report contains preliminary results of research into how the Great Firewall of China (GFW) detects and blocks Shadowsocks and its variants. Using measurement experiments, we find that the GFW passively monitors the network for suspicious connections that may be Shadowsocks, then actively probes the corresponding servers to test whether its guess is correct. The blocking of Shadowsocks is likely controlled by human factors that increase the severity of blocking during politically sensitive times. We suggest a workaround—changing the sizes of network packets during the Shadowsocks handshake—that (for now) effectively mitigates active probing of Shadowsocks servers. We will continue collaborating with developers to make Shadowsocks and related tools more resistant to blocking.

Main Findings

The Great Firewall (GFW) has started to identify Shadowsocks servers using active probing. The GFW combines passive and active detection: first it monitors the network for connections that may be Shadowsocks, then sends its own probes to the server (as if it were another user) to confirm its guess. The GFW is known to use active probing against various circumvention tools, and now Shadowsocks is a member of that group as well.
The active probing system sends a variety of probe types. Some are based on replay of previously recorded, genuine Shadowsocks connections, while others bear no apparent relation to previous connections.
Just as in previous research, active probes come from diverse source IP addresses in China, making them hard to filter out. Also as in previous research, network side-channel evidence suggests that these thousands of apparent probers are not independent but are centrally controlled.
Only a small number of genuine client connections (more than 13) suffice to trigger active probing against a Shadowsocks server. The server will continue to be probed as long as legitimate clients attempt to connect to it. The first replay probes usually arrive within seconds of a genuine client connection.
Once active probing has identified a Shadowsocks server, the GFW may block it by dropping future packets sent by the server—either from a specific port or from all ports on the server’s IP address. Or a server may not be immediately blocked, despite being probed. The degree of blocking of Shadowsocks servers is likely controlled by some human factors during politically sensitive periods of time.
The firewall’s initial passive monitoring for suspicious connections is at least partially based on network packet sizes. Modifying packet sizes, for example by installing brdgrd on the Shadowsocks server, significantly mitigates active probing by disrupting the first step of classification.

How Do We Know This?

We set up our own Shadowsocks servers and connected to them from inside China, while capturing traffic on both sides for analysis. All experiments were conducted between July 5, 2019 and November 11, 2019. Most of the experiments were conducted since the reported large-scale blocking of Shadowsocks starting September 16, 2019.

In most of the experiments, we used shadowsocks-libev v3.3.1 as both client and server, since it is an actively maintained and representative Shadowsocks implementation. We believe the vulnerabilities we discovered applies to many Shadowsocks implementations and its variants, including OutlineVPN.

Unless explicitly specified, all clients and servers were used without any modification to their network functions, for example firewall rules. Shadowsocks can be configured with different encryption settings. We tested servers running both Stream ciphers and AEAD ciphers.

Details About Active Probes

Shadowsocks is an encrypted protocol, designed not to have any static patterns in packet contents. It has two main operating modes, both keyed by a master password: Stream (deprecated) and AEAD (recommended). Both modes are meant to require the client to know the master password before using the server; however in Stream mode the client is only weakly authenticated. Both modes are susceptible to replay of previously seen authenticated packets, unless separate measures to prevent replay are taken.

Probe payload types and censors’ intentions

We have observed 5 types of active probes:

Replay based:

Identical replay (of the first data-carrying packets in a legitimate connection);
Replay with byte 0 changed;
Replay with bytes 0–7 and 62–63 changed;

Seemingly random (not a replay of any genuine connection that we can identify):

Probes of length 7–50 bytes, accounting for around 70% of the random probes;
Probes of length exactly 221 bytes, accounting for around 30% of the random probes.

We suspect that the active probing system identifies Shadowsocks servers and its variants by comparing a server’s responses to several of these probes.

Shadowsocks-libev has a replay filter; however most other Shadowsocks implementations do not. The replay filter blocks only exact replay, not replay that has been modified, and is not by itself enough to prevent active probing from comparing the responses to several slightly different probes.

How many connections are required to trigger active probing?

It appears that a certain threshold of genuine simultaneous connections are required to trigger active probing. For example, in one experiment, as few as 13 connections were enough to trigger the active probing. Initial result also shows it may require a slightly more connections for the Shadowsocks servers using AEAD ciphers to get probed.

Relationship between genuine connections and active probings

We let a client make 16 connections to a Shadowsocks server every 5 minutes. Although our connections triggered a large number of active probes, the Shadowsocks server was never blocked, for reasons we do not fully understand.

The figure above shows that while legitimate clients attempt to connect to the server, it receives active probes; and when they stop trying to connect, the active probing mostly stops. The number of active probes sent per legitimate connection is variable and not 1:1.

Delay of replay attacks

The active probing system may save a genuine connection payload and replay it later, even in response to a separate, future connection. The figure below shows the variability of the delay between legitimate connections and the ensuing replay-based probes. Because one legitimate connection may cause many (up to 47 in one case) replay attacks, we present two different cases: the orange line is samples only the first replay-based probe for a particular legitimate connection; the blue line is samples all replay-based probes.

The result shows that more than 90% of the replayed probes were sent within an hour of the connection from the legitimate client. The minimum observed delay was 0.4 seconds, while the maximum was around 400 hours.

Origin of the probes

Throughout all the experiments we conducted so far, we have seen 35,477 active probes sent from 10,547 unique IP addresses which all belong to China.

Origin ASes. The two autonomous systems that account for most of the Shadowsocks probes, AS 4837 (CHINA169-BACKBONE CNCGROUP China169 Backbone,CN) and AS 4134 (CHINANET-BACKBONE No.31,Jin-rong Street, CN), are the same as have been documented in previous work.

Centralized Structures. Despite coming from thousands of unique IP addresses, it appears that all active probing behavior is centrally managed by only a small number of processes. The evidence for this observation comes from network side channels. The figure below shows the TCP timestamp value that is attached to the SYN segment of each probe. The TCP timestamp is a 32-bit counter that increases at a fixed rate. It is not an absolute timestamp, but is relative to however the TCP implementation was initialized when the operating system last booted. The figure shows that what at first seem to be thousands of independent probers actually share only a small number of linear TCP timestamp sequences. In this case there are at least nine different physical systems or processes, with one of the nine accounting for the great majority of probes. We say “at least” nine process because we can probably not distinguish two or more independent processes sharing a very close interception value. The slopes of the sequences represent a timestamp increment frequency of 250 Hz.

How Can We Circumvent the Blocking?

Detection of Shadowsocks proceeds in two steps:

Passive identification of suspected Shadowsocks connections.
Actively probing the server of suspected connections.

Therefore, to avoid blocking, you can (1) evade the passive detector, or (2) respond to active probes in a way that does not result in blocking. We will show how to do (1) by installing software that alters the sizes of packets.

Brdgrd is software that you can run on a Shadowsocks server that causes the client to break its Shadowsocks handshake into smaller packets. It was originally intended to disrupt the detection of Tor relays by forcing the GFW to do complicated TCP reassembly, but here we take advantage of brdgrd’s shaping of packet sizes from client to server. It seems that the GFW at least partially relies on packet sizes to passively detect Shadowsocks connections. Modifying packet sizes can significantly mitigate active probing by disrupting the first step in classification.

The figure shows a Shadowsocks server undergoing active probing, and then the probing going to zero within several hours of brdgrd being activated. As soon as we disabled brdgrd, active probing resumed. The second time we enabled brdgrd, the probes completely stopped for around 40 hours, but then a few more probes came.

Another experiment shows that brdgrd may be even more effective if used from the very beginning, before the server has been probed for the first time.

Brdgrd works by rewriting the server’s TCP window size to a rarely small value. Therefore it is likely possible to detect that brdgrd is being used. So while brdgrd can effectively reduce active probing for the time being, it cannot be regarded as a permanent solution to Shadowsocks blocking.

Unresolved Questions

While the fact that active probing happens is clear, it is still unclear to us how active probing affects the blocking of Shadowsocks servers. That is, we have 33 Shadowsocks servers located all over the world. While most of them experienced heavy active probing, only 3 of them were ever blocked. More interestingly, one of the servers that was blocked was used for only a very short period of time, and thus had not received as many probes as some other servers that did not get blocked.

We came up with three hypotheses, attempting to explain this interesting phenomenon:

The blocking of Shadowsocks servers is likely controlled by some human factors. That is, the GFW may maintain a list of highly suspected Shadowsocks servers and it depends on human factors whether known servers are blocked (or unblocked). This hypothesis would also partly explains why more blockings have been reported during politically sensitive periods of time.
Another hypothesis is that active probing was ineffective against the particular Shadowsocks implementations that we used for most of the experiments. Indeed, all three servers that got blocked were running a different implementation than others. This can be true if the GFW has been exploiting some unique server reactions that are characteristics of only a certain set of Shadowsocks implementations.
The third hypothesis is there exists some geolocation inconsistency in censorship. All three servers that got blocked were running in a datacenter different from others, and were connected from a different residential network. This can be true if the GFW pays special attention to address ranges belonging to certain known datacenters, and/or pays special attention to connections from residential networks.

Thanks

We want to thank these people for research and helpful discussion on this topic:

shadowsocks-libev developers
Outline developers
Eric Wustrow and many other researchers from CU Boulder

Contacts

We encourage you to share your questions, comments or evidence on our findings and hypotheses publicly or privately. Our private contact information can be found at the footer of GFW Report.

gfw.report · January 4, 2020, 4:17am

To be more specific:

Vinicius Fortuna and the Outline VPN developers at Jigsaw

fortuna · August 7, 2020, 11:07pm

For Outline, we’ve addressed the packet size issue without using brdgrd.

On the server-side, we now merge the salt and the initial data in one packet. This makes the size of the first server packet variable: https://github.com/Jigsaw-Code/outline-ss-server/pull/69. This change has been in production for about a month.

On the client-side, we merge the salt, SOCKS address and the initial data. This also makes the size of the first client packet variable: https://github.com/Jigsaw-Code/outline-ss-server/pull/73
The client change is available in version 1.4.0. The Android and iOS releases are still under way.

I did not measure, but I’d love to know if the new changes would prevent the active probing as seen by @gfw.report.

Also, since this report we’ve added replay protection to the Outline Server and it has been in production for a few months. It has short memory, but it should help with sets of probes that are sent within a few hours.