CurveCP: Usable security for the Internet
Decongestion

The job of an Internet router is to forward packets from incoming network links (wires, radios, etc.) to outgoing network links. The router first stores each packet in an internal queue in case the outgoing link is busy. Congestion means that packets are continuing to arrive at the router more quickly than the outgoing link can handle them. The queue length then increases; if this persists then eventually the queue fills up and the router is forced to discard packets.
This web page discusses the congestion-control and congestion-avoidance mechanisms in TCP and in CurveCP. These mechanisms are handled by packet schedulers that decide when to transmit packets and when to retransmit packets. There are actually several different TCP packet schedulers in common use, including CUBIC (Linux), NewReno (FreeBSD), and Compound (Microsoft). CurveCP uses a new scheduler called Chicago that decongests routers, including routers suffering from "bufferbloat"; Chicago efficiently uses the available bandwidth while minimizing packet loss and reducing latency for interactive applications.
Congestion-induced inefficiency

Congestion might at first seem to be a minor issue. Data is not permanently lost when packets are lost: clients and servers send packets again and again until the packets are acknowledged. The outgoing link will continue transmitting data at full speed—obviously the best it can do—and eventually will transmit the lost data.
The primary problem is that each lost packet wastes time on the incoming link. A packet sent 10 times through this link, because it was lost the first 9 times, consumes 10 times as much space as it would otherwise have consumed, effectively reducing the bandwidth of the incoming link by a factor of 10—a huge efficiency problem. Sometimes this reduction means that the incoming link is overloaded, congesting the previous router and causing even more packets to be lost.
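The arithmetic here can be made concrete with a small calculation. This sketch assumes independent packet loss with probability p; the bandwidth figure is an invented illustration, not a number from this page.

```python
def expected_transmissions(p):
    # With independent loss probability p, each packet is sent on average
    # 1/(1-p) times before a copy gets through (geometric distribution).
    return 1.0 / (1.0 - p)

def effective_bandwidth(link_bandwidth, p):
    # Retransmitted copies consume link space, so only a fraction of the
    # incoming link's capacity carries data that actually gets through.
    return link_bandwidth / expected_transmissions(p)

# A packet lost 9 times and delivered on the 10th try consumes 10 times the
# link space; a 90% loss rate cuts useful bandwidth by a factor of 10:
print(expected_transmissions(0.9))       # approximately 10 transmissions
print(effective_bandwidth(100.0, 0.9))   # approximately 10 of 100 units
```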
A secondary problem is that increased queue lengths cause increased delays for packets in the queue. This is not a bandwidth problem but it is a latency problem. Users waiting for data (web pages, streaming video, etc.) frequently encounter long delays attributable directly to queue congestion, often several seconds or more.
These two problems naturally classify schedulers into three categories, according to whether they handle bandwidth well and whether they handle latency well: Bad+Bad schedulers waste bandwidth on retransmissions and keep queues full, creating long delays; Good+Bad schedulers use bandwidth efficiently but still keep queues full; Good+Good schedulers use bandwidth efficiently while keeping queues short.

There are many more TCP schedulers. Most of these schedulers are in the Good+Bad category. Good+Good is obviously more desirable; the Internet's continued use of Good+Bad is discussed below.
Unfairness

Further problems appear when two or more flows (active connections) are competing for bandwidth on the same link. Users expect each flow to promptly set a fair rate: half of the link bandwidth when there are two flows, or 1/N of the link bandwidth when there are N flows. This requires communication between the flows.
The Internet does not, in general, provide explicit communication between flows. Two flows instead communicate implicitly. Each flow causes delays (or, more clumsily, packet losses). Presumably this signal is visible to both flows, or at least has an equal chance of being seen by each flow. Each flow separately adjusts its rate in response to this signal. If this adjustment has the effect of bringing the rates closer together, and other adjustments do not have the effect of bringing the rates farther apart, then eventually the rates will converge.
For example, many TCP schedulers use an "AIMD" (additive-increase, multiplicative-decrease) adjustment mechanism that works as follows. While packets are being delivered, each flow adds a constant amount to its rate every round trip. When a flow sees a packet loss, it cuts its rate in half. The additive increases raise both rates by the same amount, leaving the gap between the rates unchanged; each multiplicative decrease takes more away from the faster flow than from the slower flow, halving the gap. Repeated cycles therefore drive the rates together.
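The AIMD convergence argument can be watched directly in a toy simulation of two flows sharing one link. The capacity and step sizes here are invented, and the loss signal is simplified to an instantaneous event seen by both flows at the same moment:

```python
# Toy model of two AIMD flows sharing one bottleneck link (illustrative
# numbers, not CurveCP code). Each flow adds 1 unit of rate per round trip;
# when the combined rate exceeds capacity, both flows see loss and halve.

CAPACITY = 100.0

def aimd_round(r1, r2):
    if r1 + r2 > CAPACITY:           # queue overflows: both flows see loss
        return r1 / 2.0, r2 / 2.0    # multiplicative decrease halves the gap
    return r1 + 1.0, r2 + 1.0        # additive increase preserves the gap

r1, r2 = 80.0, 5.0                   # start far from fair shares
for _ in range(1000):
    r1, r2 = aimd_round(r1, r2)

print(round(r1, 3), round(r2, 3))    # rates end up essentially equal
```

Each congestion cycle halves the difference between the two rates while leaving their long-run average near the link capacity, which is why the rates converge to fair shares.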
Here is a much worse example, the late-comer's advantage. TCP Vegas measures the minimum RTT that it sees, and adjusts its rate so that the RTT is somewhat larger, say min+delta. Here delta is not very large (large queue delays indicate congestion), but it is also not very small (empty queues indicate an idle network). Once TCP Vegas has found a rate that keeps the RTT stably at min+delta, it does not adjust the rate further; the RTT stays at min+delta. Now suppose that one Vegas flow starts using an empty link, and then later a second Vegas flow starts using the same link. The first Vegas flow sees the RTT of the empty link as the minimum RTT, and increases its rate so that the RTT is min+delta. The second Vegas flow then arrives, sees min+delta as the minimum RTT, and quickly pushes the RTT up to min+2delta. The first Vegas flow interprets the increased RTT as a sign of competition for the network, and reduces its rate so that the RTT drops below min+2delta; the second Vegas flow then increases its rate, pushing the RTT back up to min+2delta. This continues until the rate of the first Vegas flow has converged to essentially zero. The second Vegas flow ends up monopolizing the link.
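The late-comer's advantage can be reproduced in a toy fluid model. The proportional rate adjustment below is an illustrative stand-in for Vegas's real alpha/beta mechanism, and every constant (capacity, gains, arrival time) is invented; only the min-RTT-plus-delta targeting comes from the description above.

```python
# Toy fluid model of the TCP Vegas late-comer's advantage.
CAPACITY = 100.0    # packets/ms the bottleneck link drains (made up)
BASE_RTT = 10.0     # ms of propagation delay on the empty path
DELTA = 2.0         # ms of queueing delay each flow aims to maintain
DT = 0.01           # simulation step, ms
KP, KD = 1.0, 20.0  # feedback gains chosen for smooth convergence

class VegasFlow:
    def __init__(self):
        self.rate = 1.0
        self.min_rtt = None    # smallest RTT this flow has ever seen
        self.prev_rtt = None

    def step(self, rtt):
        if self.min_rtt is None or rtt < self.min_rtt:
            self.min_rtt = rtt
        drtt = 0.0 if self.prev_rtt is None else (rtt - self.prev_rtt) / DT
        target = self.min_rtt + DELTA   # aim for min+delta, as in the text
        self.rate = max(0.0, self.rate + DT * (KP * (target - rtt) - KD * drtt))
        self.prev_rtt = rtt

flow1, flow2 = VegasFlow(), VegasFlow()
queue = 0.0
for t in range(150000):
    flows = [flow1] if t < 50000 else [flow1, flow2]  # flow 2 arrives late
    total = sum(f.rate for f in flows)
    queue = max(0.0, queue + (total - CAPACITY) * DT)
    rtt = BASE_RTT + queue / CAPACITY   # queueing delay inflates the RTT
    for f in flows:
        f.step(rtt)

# Flow 1 measured the empty link (minimum 10 ms); flow 2 arrived when the
# RTT was already 12 ms, so it pushes the queue higher and starves flow 1.
print(round(flow1.rate, 1), round(flow2.rate, 1))
```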
Widely deployed Good+Bad schedulers such as NewReno and CUBIC avoid this type of problem by implicitly creating a synchronized congestion cycle. Each cycle begins with a moment of maximum congestion (i.e., maximum queue length), decreases down to a lower level of congestion, and then increases back up to maximum congestion, ending the cycle and starting the next cycle. This congestion cycle plays a critical, and underappreciated, role as a flow-communication mechanism: each flow recognizes maximum congestion at the same moment (through packet loss) and, at that moment, decreases rate multiplicatively—enough to bring the cumulative rate below the capacity of the bottleneck link, prompting the decrease in congestion. Each flow increases rate at all other times, eventually prompting the increase in congestion.
Like NewReno and CUBIC, but unlike Vegas, Chicago uses frequent additive increases and occasional multiplicative decreases to create a synchronized congestion cycle. Unlike NewReno and CUBIC, Chicago keeps track of long-term delay statistics, and explicitly recognizes cycles that merely vary in delay, rather than requiring each cycle to end with maximum congestion and packet loss. Chicago gradually pushes the delays down, creating short cycles where the top and bottom of the cycle are at very low levels of congestion, drastically reducing latency and eliminating typical congestion-induced packet loss.
Unfriendliness

Even more problems appear when two flows using different schedulers are competing for bandwidth on the same link.
Here is a bad example, extreme unfriendliness of one TCP scheduler towards another TCP scheduler. Suppose one flow uses a widely deployed Good+Bad scheduler such as TCP CUBIC, while the other flow uses TCP Vegas. The CUBIC flow will increase its rate until it causes packet loss, filling queues and creating delays. The Vegas flow will respond to the delays by reducing its rate, while the CUBIC flow is blind to the delays. Vegas obtains bandwidth for brief moments after the CUBIC rate decreases multiplicatively, but in general uses only a very small fraction of the link. To summarize, CUBIC is extremely unfriendly to Vegas. Other Good+Bad TCP schedulers, such as NewReno, are also extremely unfriendly to Vegas. Experiments with Vegas have almost uniformly found Vegas running at extremely low speeds whenever there is even a single Good+Bad competitor. (The only exceptions are simulations of old routers using very short queues.) This is obviously dissatisfying to users who try Vegas, and appears to be a major reason for the failure of Vegas to be deployed on the Internet.
(CUBIC, NewReno, etc. are extremely unfriendly to BitTorrent's µTP LEDBAT for similar reasons. This is advertised as a feature of BitTorrent: if a web browser begins downloading a large web page then BitTorrent will very quickly stop using the network. However, for exactly the same reason, users who try LEDBAT for web pages will find their web browsers waiting practically forever if any other connection is using CUBIC.)
The conventional wisdom is that users will be unhappy with a new scheduler if NewReno/CUBIC/etc. are extremely unfriendly to that scheduler or vice versa. On the other hand, small imbalances in network usage seem much less important to users. CUBIC is somewhat unfriendly towards NewReno, for example, and doesn't provide RTT fairness, but these problems seem to have generated very few complaints; each connection continues to receive a tolerable share of the bandwidth, even if not a fair share.
Chicago uses edge-triggered backoffs so that there are only a constant number of backoffs in a typical cycle. When a Good+Bad flow runs alongside a Chicago flow, the Good+Bad flow will not notice the delays it is creating, and will not notice Chicago backing off as a result of those delays; but Chicago will then see that the cycle is continuing, and will continue increasing its rate until the actual end of the cycle. Chicago, unlike Vegas, thus receives a tolerable share of the bandwidth.
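One way to read "edge-triggered" is that a backoff fires only when the delay crosses a congestion threshold upward, rather than on every measurement interval spent above the threshold. This sketch is a guess at that idea for illustration, not Chicago's actual algorithm:

```python
# Contrast between level-triggered and edge-triggered backoff counting.
# Both functions scan a series of delay measurements against a threshold.

def backoffs_level_triggered(delays, threshold):
    # Back off every interval the delay is high: a loss-blind competitor
    # keeps the delay high for the whole cycle, so backoffs pile up.
    return sum(1 for d in delays if d > threshold)

def backoffs_edge_triggered(delays, threshold):
    # Back off only when the delay crosses the threshold upward: a constant
    # number of backoffs per congestion cycle, however long the cycle lasts.
    count, above = 0, False
    for d in delays:
        if d > threshold and not above:
            count += 1
        above = d > threshold
    return count

# One long congestion cycle: delay rises, stays high, then drains.
cycle = [1, 2, 5, 9, 9, 9, 9, 9, 3, 1]
print(backoffs_level_triggered(cycle, 4))   # many backoffs in one cycle
print(backoffs_edge_triggered(cycle, 4))    # a single backoff per cycle
```

A level-triggered flow sharing a link with a loss-blind competitor backs off over and over while the competitor keeps the queue full, which is the Vegas failure mode; an edge-triggered flow gives up only a bounded amount per cycle.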
Hammering

The original TCP schedulers would begin a flow by sending every possible packet within the receiver's advertised window. This spike of traffic was often far more than a link could handle in one RTT; routers would leap from zero congestion to heavy congestion.
Modern TCP schedulers instead limit their initial transmissions (and new transmissions after some idle time) by the following algorithm. The sender transmits a single packet; then, after an RTT, two packets; then, after another RTT, four packets; then, after another RTT, eight packets; and so on. This pattern continues until a packet is lost; the scheduler then begins AIMD.
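The doubling schedule is easy to write down explicitly. This sketch also shows how quickly it passes a link's capacity; the capacity figure is an invented illustration:

```python
# Slow start's per-RTT transmission schedule: 1, 2, 4, 8, ... packets.

def slow_start_schedule(rtts):
    return [2 ** i for i in range(rtts)]

schedule = slow_start_schedule(10)
print(schedule)         # doubles every round trip
print(sum(schedule))    # 2**10 - 1 = 1023 packets over 10 round trips

# How soon does doubling blow past a 100-packet-per-RTT link (made-up number)?
link_capacity = 100
rounds = next(i for i, pkts in enumerate(schedule) if pkts > link_capacity)
print(rounds)           # index 7: the 8th round trip already sends 128 packets
```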
This algorithm is called slow start and is widely advertised as a gentle, safe way to discover the available bandwidth. However, slow start is actually quite dangerous: it can reach extremely high rates, far beyond the link capacity, placing huge spikes of traffic into router queues. If the sender continues transmitting data then the heavy congestion created by slow start will eventually produce packet loss, but if the sender stops before this then slow start will hammer the router queue almost as badly as the original TCP schedulers.
Chicago watches delays so that it can see when rate-doubling is beginning to create congestion.
False congestion alarms

Packets sent through wireless networks are often destroyed by radio interference. Most TCP schedulers incorrectly treat these packet losses as signs of congestion and back off multiplicatively. These schedulers make poor use of the available bandwidth: they cannot transmit more than c/sqrt(p) packets per RTT, where c is a constant depending on the scheduler and p is the packet-loss probability.
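The c/sqrt(p) ceiling can be evaluated for typical loss rates. The constant sqrt(3/2), about 1.22, used below is the value commonly quoted for Reno-style AIMD (the "Mathis formula"); it is an outside assumption, not a number from this page:

```python
import math

# Maximum sustainable rate, in packets per RTT, for a loss-driven scheduler
# facing packet-loss probability p, using the c/sqrt(p) bound from the text.

def max_rate_packets_per_rtt(p, c=math.sqrt(1.5)):
    return c / math.sqrt(p)

# Even 1% loss from radio interference, misread as congestion, caps a
# Reno-style flow near 12 packets per RTT regardless of link capacity:
for p in (0.0001, 0.01, 0.1):
    print(p, round(max_rate_packets_per_rtt(p), 1))
```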
Chicago keeps track of long-term loss statistics the same way that it keeps track of long-term delay statistics. The scheduler does not confuse persistent loss with congestion-induced loss. This allows Chicago to use lossy wireless networks with reasonable efficiency.
Old TCP schedulers also misinterpreted naturally occurring large delays, typically from slow modems or from delayed acknowledgments, as losses, triggering spurious timeouts. Newer TCP schedulers typically use Jacobson's algorithm to set a timeout that takes account of variance in delays, along with either timestamps or Karn's algorithm to prevent miscomputation of delays for retransmitted packets. Chicago uses similar algorithms, with explicit acknowledgment of message IDs to prevent miscomputation of delays.
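Jacobson's algorithm as standardized for TCP (RFC 6298) can be sketched as follows. This is the textbook computation, not CurveCP source, and it omits RFC 6298's clock-granularity and minimum-RTO clamps:

```python
# Jacobson-style retransmission-timeout estimation, per RFC 6298.

class RttEstimator:
    ALPHA, BETA = 1 / 8, 1 / 4    # smoothing gains from RFC 6298

    def __init__(self):
        self.srtt = None          # smoothed round-trip time
        self.rttvar = None        # smoothed round-trip-time variation

    def observe(self, rtt):
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar \
                          + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        # The timeout covers the mean plus four times the variation, so a
        # naturally large delay is absorbed rather than misread as a loss.
        return self.srtt + 4 * self.rttvar

est = RttEstimator()
for sample in (100.0, 120.0, 110.0, 300.0):   # ms; 300 is a delayed ack
    rto = est.observe(sample)
print(round(rto))   # the spike widens the variance, raising the timeout
```

The outlier sample inflates rttvar, so the timeout grows instead of firing; a fixed-mean estimator without the variance term would have declared the 300 ms sample a loss.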
Version

This is version 2011.02.12 of the decongestion.html web page.