On Precisely Detecting Censorship Circumvention in Real-World Networks (NDSS 2024)

Ryan Wails, George Arnold Sullivan, Micah Sherr, Rob Jansen
https://www.robgjansen.com/publications/precisedetect-ndss2024.html

A primary problem faced by censors is the low base rate of circumvention connections. Traffic classifiers must have a very low rate of false positives, or else they will end up blocking more non-circumvention traffic than circumvention traffic. This paper takes a critical look at past research on detecting circumvention systems and argues that the techniques that have been proposed have too many false positives to be practical at the base rates seen in real networks. The authors create hand-tuned and deep learning–based classifiers that improve on the state of the art, but say that even these would have unrealistically many false positives in practice. To mitigate poor per-flow precision, they suggest host-based analysis: aggregating many flows to the same endpoint over time. They apply their classifiers to the two transports currently available in Tor Browser, obfs4 and Snowflake; in the case of Snowflake, they consider both the rendezvous and data transfer phases. Empirical measurements of classifier performance use two weeks of traffic from a university network, about 60 million flows.

The paper introduces a λ notation to quantify base rates: λ is the number of non-circumventing flows for every 1 circumventing flow. For example, λ = 1 means an equal mix, and λ = 100 means approximately 1% of flows are circumventing. The 2015 paper “Seeing through Network-Protocol Obfuscation” used a λ = 1 traffic mix in its evaluation. (That paper deserves credit for using real traffic captures, and for considering false positive rates when earlier work, like “The Parrot is Dead”, had not.) This paper claims that the results in “Seeing through…” are not as good as they seem: Table I shows how a precision that looks good at λ = 1 dwindles to uselessness at λ = 1000 (which is likely still an overestimate of the actual rate of occurrence of obfs4 in the wild). The authors prepare a new, hand-tuned classifier for obfs4 using expert knowledge. It scales to base rates of λ = 1000 but not further, and moreover it is brittle: a small tweak to the protocol (along the lines of GFW Report’s entropy-reduction patch for Shadowsocks) makes the hand-tuned classifier completely ineffective.
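To make the base-rate effect concrete, here is a minimal sketch of how precision depends on λ. It uses only the standard definition of precision, not any code from the paper. The function names and the 99% / 1% rates in the usage lines are illustrative assumptions, not figures from the paper's tables.

```python
def circumventing_fraction(lam: float) -> float:
    """Fraction of all flows that are circumventing when there are
    `lam` non-circumventing flows for every 1 circumventing flow."""
    return 1.0 / (1.0 + lam)


def precision(tpr: float, fpr: float, lam: float) -> float:
    """Precision of a flow classifier at base rate `lam`.

    tpr: true positive rate (fraction of circumventing flows flagged)
    fpr: false positive rate (fraction of other flows flagged)
    """
    true_pos = tpr           # expected true positives per circumventing flow
    false_pos = lam * fpr    # expected false positives per circumventing flow
    if true_pos + false_pos == 0:
        return 0.0           # nothing flagged at all; define precision as 0
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 99% true positive rate, 1% false positive rate.
for lam in (1, 100, 1000, 10000):
    print(lam, round(precision(0.99, 0.01, lam), 4))
```

With these assumed rates, the same classifier that is right about 99% of the time when it flags a flow at λ = 1 is wrong about most of the flows it flags once λ reaches the hundreds, because the false positives scale with λ while the true positives do not.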

They next try a classifier based on deep learning. Deep learning has been used for website fingerprinting and encrypted traffic analysis, but has not yet been applied to circumvention detection. Their best-performing classifier, using a convolutional neural network on packet size and direction features, outperforms the “Seeing through…” classifiers on obfs4 and generalizes better than the hand-tuned classifier. They test it against Snowflake as well: the Snowflake data channel is easier to detect than obfs4, but the rendezvous exchange is harder. But the precision of the deep learning classifier, too, drops to zero for all protocols at about λ = 10000. (The performance might be improved if packet payload or endpoint addresses were taken into account.)

Finally, the authors suggest a way to achieve good classification performance, even with flow classifiers that don’t scale to high base rates. The idea is to shift from classifying flows to classifying hosts. Instead of making an allow/block decision for each flow, keep a small amount of state for each remote host, aggregating observations over multiple flows. They show a simple way to convert a flow-based classifier into a host-based one: classify every flow as before, and when the flow is classified positive, increase a counter for the destination host. Label a host as a circumvention server when (1) it has received at least a certain number of flows, and (2) at least a certain percentage of those flows were positive. The number of flows needed is logarithmic in the error rate parameter: you only need to see a few dozen or a few hundred flows to bring the host-based false positive rate almost to zero. This threshold scheme is just one way of doing host-based analysis. Other techniques may do even better, and this is a good topic for future circumvention research and development. On the defense side, host-based analysis is mitigated in circumvention systems that do not rely on proxies at stable addresses, or that have a polymorphic protocol fingerprint.
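The threshold scheme described above can be sketched roughly as follows. This is an illustration of the idea as summarized here, not the authors' implementation: the class and method names are made up, and the thresholds in the usage lines (30 flows, 80% positive) are arbitrary placeholders rather than values from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HostStats:
    total: int = 0      # flows seen to this destination host
    positive: int = 0   # flows the per-flow classifier labeled positive


class HostClassifier:
    """Aggregates per-flow classifier outputs by destination host."""

    def __init__(self, min_flows: int, min_positive_fraction: float):
        self.min_flows = min_flows
        self.min_positive_fraction = min_positive_fraction
        self.hosts = defaultdict(HostStats)

    def observe(self, host: str, flow_is_positive: bool) -> None:
        """Record the per-flow classifier's verdict for one flow to `host`."""
        stats = self.hosts[host]
        stats.total += 1
        if flow_is_positive:
            stats.positive += 1

    def is_circumvention_server(self, host: str) -> bool:
        """True only if the host has enough flows AND enough were positive."""
        stats = self.hosts.get(host)
        if stats is None or stats.total < self.min_flows:
            return False  # condition (1) not met: too few observations
        return stats.positive / stats.total >= self.min_positive_fraction


# Usage with placeholder thresholds and made-up hosts.
clf = HostClassifier(min_flows=30, min_positive_fraction=0.8)
for i in range(50):
    clf.observe("198.51.100.7", flow_is_positive=(i % 10 != 0))  # 45 of 50 positive
    clf.observe("203.0.113.9", flow_is_positive=(i < 2))         # 2 of 50 positive
print(clf.is_circumvention_server("198.51.100.7"))
print(clf.is_circumvention_server("203.0.113.9"))
```

The per-host state is just two counters, so the memory cost is small. The minimum-flows condition is what keeps a host from being labeled on the strength of one or two false positives from the per-flow classifier.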

Great read, thanks!