TCP is UNreliable

Been too long between blogs…

“TCP Is Not Reliable” – what’s THAT mean?

Means: I can cause TCP to reliably fail in under 5 mins, on at least 2 different modern Linux variants and on modern hardware, both in our datacenter (no hypervisor) and on EC2.

What does “fail” mean?  Means the client will open a socket to the server, write a bunch of stuff and close the socket – with no errors of any sort.  All standard blocking calls.  The server will get no information of any sort that a connection was attempted.  Let me repeat that: neither client nor server get ANY errors of any kind, the client gets told he opened/wrote/closed a connection, and the server gets no connection attempt, nor any data, nor any errors.  It’s exactly “as if” the client’s open/write/close was thrown in the bit-bucket.

We’d been having these rare failures under heavy load where it was looking like a dropped RPC call.  H2O has its own RPC mechanism, built over the RUDP layer (see all the task-tracking code in the H2ONode class).  Integrating the two layers gives a lot of savings in network traffic; most small-data remote calls (e.g., nearly all the control logic) require exactly 1 UDP packet to start the call, and 1 UDP packet with the response.  For large-data calls (e.g., moving a 4Meg “chunk” of data between nodes) we use TCP – mostly for its flow-control & congestion-control.  Since TCP is also reliable, we bypassed the Reliability part of the RUDP.  If you look in the code, the AutoBuffer class lazily decides between UDP or TCP send styles, based on the amount of data to send.  The TCP stuff used to just open a socket, send the data & close.
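A minimal sketch of that lazy decision, with hypothetical names and an illustrative cutoff (the real logic and constants live in H2O’s AutoBuffer class): payloads that fit one datagram go over UDP, bulk data goes over TCP.

```java
// Hypothetical sketch of a UDP-vs-TCP send-style decision; the class,
// enum and cutoff are illustrative, not H2O's actual code or constants.
public class SendStyle {
  // Payloads that fit in roughly one Ethernet MTU go over UDP; anything
  // bigger goes over TCP for its flow-control & congestion-control.
  static final int UDP_CUTOFF = 1500;

  enum Transport { UDP, TCP }

  static Transport choose(int payloadBytes) {
    return payloadBytes <= UDP_CUTOFF ? Transport.UDP : Transport.TCP;
  }
}
```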

So as I was saying, we’d have these rare failures under heavy load that looked like a dropped TCP connection (it was hitting the same asserts as dropping a UDP packet, except we had dropped-UDP-packet recovery code in there, working forever).  Finally Kevin, our systems hacker, got a reliable setup (reliably failing?) – it was an H2O parse of a large CSV dataset into a 5-node cluster… then a 4-node cluster, then a 3-node cluster.  I kept adding asserts, and he kept shrinking the test setup, but still nothing seemed obvious – except that obviously during the parse we’d inhale a lot of data, ship it around our 3-node clusters with lots of TCP connections, and then *bang*, an assert would trip about missing some data.

Occam’s Razor dictated we look at the layers below the Java code – the JVM, the native code, the OS layers – but these are typically very opaque.  The network packets, however, are easily visible with Wireshark tools.  So we logged every packet.  It took another few days of hard work, but Kevin triumphantly presented me with a Wireshark log bracketing the Java failure… and there it was in the log: a broken TCP connection.  We stared harder.

In all these failures the common theme is that the receiver is very heavily loaded, with many hundreds of short-lived TCP connections being opened/read/closed every second from many other machines.  The sender sends a ‘SYN’ packet, requesting a connection.  The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.  The receiver, being much overloaded, is very slow.  Eventually the receiver returns a ‘SYN-ACK’ packet, acknowledging both the open and the data packet.  At this point the receiver’s JVM has not been told about the open connection; this work is all happening at the OS layer alone.  The sender, being done, sends a ‘FIN’, for which it does NOT wait for acknowledgement (all data has already been acknowledged).  The receiver, being heavily overloaded, eventually times out internally (probably waiting for the JVM to accept the open-call, with the JVM being too overloaded to get around to it) – and sends a RST (reset) packet back… wiping out the connection and the data.  The sender, however, has moved on – it already sent a FIN & closed the socket, so the RST is for a closed connection.  Net result: the sender sent, but the receiver reset the connection without informing either the JVM process or the sender.
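Stripped to its essence, the sending side of that sequence is nothing more than the standard blocking calls.  A minimal Java sketch of the naive pattern (host and port are placeholders):

```java
import java.io.OutputStream;
import java.net.Socket;

// The naive open/write/close pattern described above.  Every call can
// complete without error even when the receiver's OS later RSTs the
// connection and throws the data away before the server app sees it.
public class NaiveSender {
  static void send(String host, int port, byte[] payload) throws Exception {
    try (Socket s = new Socket(host, port)) { // SYN (+ optimistic data)
      OutputStream out = s.getOutputStream();
      out.write(payload);  // data may only have reached the local kernel
      out.flush();
    }                      // close() sends the FIN and returns at once --
                           // a later RST from the peer is never reported
  }
}
```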

Kevin crawled the Linux kernel code, looking at places where connections get reset.  There are too many to tell which exact path we triggered, but it is *possible* (not confirmed) that Linux decided it was the subject of a DDOS attack and started closing open-but-not-accepted TCP connections.  There are knobs in Linux you can tweak here, and we did – and could make the problem go away, or be much harder to reproduce.

With the bug root-caused in the OS, we started looking at our options for fixing the situation.  Asking our clients to either upgrade their kernels or use kernel-level network tweaks was not in the cards.  We ended up implementing two fixes: (1) we moved the TCP connection parts into the existing Reliability layer built over UDP.  Basically, we have an application-level timeout and acknowledgement for TCP connections, and will retry TCP connections as needed.  With this in place, the H2O crash goes away (although if the code triggers, we log it and use app-level congestion-delay logic).  And (2) we multiplex our TCP connections, so the rate of “open TCPs/sec” has dropped to 1 or 2 – and with this 2nd fix in place we never see the first issue.
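The heart of fix (1) can be sketched as follows, with hypothetical names (H2O’s real version lives in its RUDP reliability layer, with timeouts, retries and congestion-delay logic around it): the sender refuses to trust close() and instead blocks on a one-byte application-level acknowledgement written by the receiving application.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical sketch of an app-level acknowledgement over TCP.  The sender
// waits for one ACK byte written by the receiving *application*, which
// proves the data made it all the way up to user-space on the far side.
public class AckedSender {
  static final byte ACK = 0x55;   // arbitrary illustrative ACK value

  static void send(String host, int port, byte[] payload) throws Exception {
    try (Socket s = new Socket(host, port)) {
      OutputStream out = s.getOutputStream();
      out.write(payload);
      out.flush();
      s.shutdownOutput();            // half-close: send FIN, keep reading
      int reply = s.getInputStream().read();
      if (reply != ACK)              // a RST or EOF surfaces right here
        throw new IOException("no app-level ACK; caller should retry");
    }
  }
}
```

A failed send now throws instead of vanishing, so the caller can retry under its own timeout – which is the application-level reliability described above.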

At this point H2O’s RPC calls are rock-solid, even under extreme loads.


Found this decent article:

  • It’s a well known problem, in that many people trip over it, and get confused by it
  • The recommended solution is app-level protocol changes (send the expected length with the data; the receiver sends back an app-level ACK after reading all the expected data). This is frequently not possible (e.g., a legacy receiver).
  • Note that setting flags like SO_LINGER are not sufficient
  • There is a Linux-specific workaround (SIOCOUTQ)
  • The “Principle of Least Surprise” is violated: I, at least, am surprised when ‘write / close’ does not block on the ‘close’ until the kernel at the other end promises it can deliver the data to the app.  Probably the kernel would need to block the ‘close’ on this side until all the data has been moved into user-space on the other side – which might in turn be blocked by the receiver app’s slow read rate.
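The recommended app-level protocol change from the list above can be sketched from the receiver’s side (an illustrative framing, assuming a 4-byte big-endian length prefix and a 1-byte ACK; none of this is from a specific library):

```java
import java.io.DataInputStream;
import java.net.Socket;

// Receiver side of a length-prefix + app-level-ACK protocol: read a 4-byte
// length, read exactly that many payload bytes, and only then ACK -- so
// the sender knows the receiving *application* got everything.
public class FramedReceiver {
  static byte[] receive(Socket s) throws Exception {
    DataInputStream in = new DataInputStream(s.getInputStream());
    int len = in.readInt();        // 4-byte big-endian length prefix
    byte[] payload = new byte[len];
    in.readFully(payload);         // block until the whole frame arrives
    s.getOutputStream().write(1);  // app-level ACK byte
    s.getOutputStream().flush();
    return payload;
  }
}
```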



28 thoughts on “TCP is UNreliable”

  1. TCP is not unreliable but your client is. Look for “TCP half-duplex close sequence”. After writing your data, you need to call “socket.getInputStream().read()” before calling “socket.close()”, otherwise you will miss the “SocketException: connection reset” caused by the server’s timeout RST.

    • Found this decent article:
      – it’s a well known problem, in that many people trip over it, and get confused by it
      – the recommended solution is app-level protocol changes (send expected length with data, receiver sends back app-level ACK after reading all expected data). This is frequently not possible (i.e., legacy receiver).
      – Note that setting flags like SO_LINGER are not sufficient.
      – the “Principle of Least Surprise” is violated: I am surprised when ‘write / close’ does not block on the ‘close’ until the kernel at the other end promises it can deliver the data to the app, probably blocking the ‘close’ on this side until all the data has been moved into the user-space on that side – which might in turn be blocked by the receiver app’s slow read rate.

      • TCP is not the only reliable protocol available, nor is it a hammer to use for every nail, nor does it guarantee you a “fire and forget” kind of delivery. You are sending lots of separate messages, thus SCTP (Stream Control Transmission Protocol) might have been more appropriate. It allows you to deliver multiple messages in parallel, and defines a way to close the connection (“shutdown”) that guarantees all pending messages are sent.

        • I am limited to using protocols that are widely used, widely accepted and widely understood – or else very few will bother to diagnose any failures in a non-standard protocol – and hence will not put the code into production. I’m in the “red hat” business model; I need to get the code into paying production use. Anything other than TCP or UDP will kill that. 🙁

  2. No. The kernel is not unreliable. Nor is TCP. Your code is stupid. Your second fix is just a sensible optimisation that you could have done without all the other ‘fixes’ and it would have probably fixed your issue.

    Secondly if you had bothered to check HOW TCP RST packets are implemented in Linux you would understand why you were getting what you got.

    • No offense dude, but you are way off base. My “2nd fix” IS NOT A FIX – IT’S AN OPTIMIZATION – and so the code can be expected to break at any future date given a bad enough sequence of events. As for becoming a Linux kernel hacker – I’m happy to let others do that, and then ask them for their expertise (exactly what this blog post is about!).

  3. Well, there is an RFC for it, about as old as I am, that states you have to do your own high-level message + reply over TCP, exactly for these scenarios. Just have your server write at least one byte as a response and read that on the client side. (But I am on the slowest wifi ever, ironic, and cannot find its number; surely someone knows which one I mean.)

    Part of the problem is the APIs to drive TCP: when you say close, the kernel actually “writes” a FIN; ideally, you would wait and block on “reading” the FIN from the other side. That would make it reliable too.

    • That solution (write a 1-byte response & block for it) is exactly what I implemented. I must say, this required handshake is NOT common knowledge, or at least nobody I work with knew about it… nor do I see it commonly implemented in anybody’s code. All the code I look at (and that’s a *lot*) just does what I did: open/write/close – and assumes if the close “worked”, the data got sent. i.e., the close is expected to block until the data is acknowledged.

      • Keep in mind that TCP is an asynchronous thing, with four stops: 1 your side; 2 your kernel; 3 the other kernel; 4 the target. You want evidence that the target has received and understood your message, but you only talk to 2. Even doing the “read() == 0” trick (as I and reddit suggested) is not enough; it is evidence that 3 has understood your message, but 4 might still have crashed. And “blocking” just means you wait on 2 to accept your data, or to give you at least a byte of 3’s data; you don’t know how much data 4 intended to send you. If you want to block and wait on 4, you have to receive a complete high-level message from 4, and block on 2 until you have it. I wish this was more common knowledge.

    • Why not? Note that I didn’t do anything special at the JVM level; I just did an open/write/close sequence.
      (see comment below about forcing a read before close)

      • I read it this way: the receiver ack’d the first data but somehow the JVM never heard about it. It’s definitely not a protocol issue. A bug in this OS TCP stack? Unlikely, but possible. A bad behavior of the JVM under load? More probable 🙂

        • Note that this is unrelated to the JVM. We saw bad behavior at the network packet level… and then the (copious online) comments indicate it’s a known TCP protocol weakness. Or at least I claim that if the simplest and most obvious TCP usage can silently fail, then that’s a protocol failure. “Simplest”: open/write/close on one side, and open/read/close on the other… leads to silent data lossage.

  4. “The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.”

    So the sender sends a SYN followed by a data packet before receiving *anything*? How’s that supposed to work? Does the SYN packet have any options (e.g. TCP Fast Open) set?

    • No options that *I* set, although the JVM might have tried to set options. However, this appears to be the normal behavior on the Linux stack(s) we were using; we found a flag to turn it off – and throughput dropped in half – and the bug was harder to repro, but definitely was still there.

  5. Uh, TCP is unreliable in many ways. In particular an ACK does not acknowledge application receipt of the data. Your protocol needs an application-level acknowledgement of important things. This is why OpenSSH has ClientAlive and ServerAlive (using SSHv2 messages), Lustre has RPC receipt acks and even retransmits (ok, so they had to deal with infiniband first, but still), NFS also deals with retransmits over TCP.

    There are several different things that can conspire to lose data: the network, the TCP stack, the rest of the OS (e.g., because of the Linux OOM killer), the application, and even the storage backending it. TCP is not magic dust that makes applications reliable; you must make your application protocols reliable all by yourself. TCP is primarily about convenience: you get an octet stream and you don’t have to worry about handling congestion control, but that’s about it.

    • Yeah, in the end I ended up with exactly that: an octet stream with congestion control.


  6. I’m fairly certain your problem can be solved by using the shutdown syscall.

    $ man 2 shutdown

    You shouldn’t just close() a socket; you should call shutdown() to half-close, then when you’ve checked for errors you can call close(). I have no idea whether those calls are wrapped in Java – if not you’ll be a bit stuck in your case; but it wouldn’t be TCP’s fault.
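Java does wrap the half-close: Socket.shutdownOutput() sends the FIN while leaving the read side open. A sketch of the commenter’s suggestion (a hypothetical example, not H2O’s code): half-close after writing, then block reading until the peer’s FIN, so a reset surfaces as an exception instead of disappearing.

```java
import java.net.Socket;

// Half-close per `man 2 shutdown`, via Java's Socket.shutdownOutput():
// send our FIN, then block reading until the peer's FIN (read() == -1).
// A reset now shows up as a SocketException instead of being dropped.
public class HalfCloseSender {
  static void send(String host, int port, byte[] payload) throws Exception {
    try (Socket s = new Socket(host, port)) {
      s.getOutputStream().write(payload);
      s.getOutputStream().flush();
      s.shutdownOutput();                      // FIN: "no more data from me"
      while (s.getInputStream().read() != -1)  // drain until peer closes
        ;                                      // any RST throws right here
    }
  }
}
```

Note this still only proves the far kernel delivered the data; as another commenter points out, an app-level ACK is the stronger guarantee.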

    • Thanks. My accidental solution turns out to be the approved solution (read a byte from the sender-side that indicates an app-level handshake) – which should cover the case that shutdown() is covering.

  7. Very interesting post. Brought to light something I didn’t know.

    Never mind the other commenters; they’re a bunch of dullards leaping to conclusions, particularly since they’re post-hoc clever after having read your problem and solution.

    • yeah, been awhile since I had to moderate out a bunch of flame. Ah well, the ol’ asbestos suit still works.

  8. If everyone acknowledges that “just TCP” is insufficient, then why are there debates about using UDP vs TCP, saying that one is reliable and the other isn’t?
    In particular, is it easier creating a reliable app level protocol with TCP, or in the end do the same issues exist for both TCP and UDP?

    (or are the benefits more around things like congestion control, not reliability?)

    • The trick with “reliable” TCP is that it’s *typically* reliable, under not-crazy-load scenarios. UDP is much less reliable, although apparently it’s a matter of degree. i.e. TCP “mostly works”, and UDP “mostly does not work”. TCP does provide:
      – ordered de-dup’ed data delivery
      – congestion control
      – reliability up till you close – then the “last bit delivery” is suspect unless you take steps before the close (e.g. app-level handshake, or use ‘shutdown’ to do a half-close)
      – plus more notifications when things fail, but apparently not 100% notifications.


  9. Right. But here’s the thing I don’t understand. People say they make TCP get the final reliability behaviors they want/need at the client level, by “adding” extra stuff that they dream up.

    Now that can work, but it’s splitting responsibility for the reliable channel between two parties… you, and the behaviors of the various OSes (and hardware) you want to run on.

    So what I don’t understand is how people can make this work if they don’t know EXACTLY when the OS is allowed to, and can, cause a TCP reset. My theory is that people actually don’t know, and it changes (more of them) over time…

    i.e., 10 years ago, socket reliability was probably considered more; now protection against denial-of-service attacks might be more important, as long as apps can get what they need.

    For instance: in Ubuntu 12.04 LTS, what are the N things that cause a TCP reset? If you don’t know, how do you know how to protect yourself from them? What about MacOS? Same or different?

    I think the gentleman’s agreement is that the OS is not allowed to TCP-reset unless there’s “unnatural activity”… say heavy “load”. So the application person, rather than coding solely to a protocol, also has to code to an “allowed” load demand… and the OS agrees to a better level of reliability as long as you’re within that load demand (however “load” is defined).

    But even that begs the question: Do people understand what that is, or is it just that people try random things, figure out what seems to work and say “Hey you dummy, didn’t you know you were supposed to do this? and not do that?”

  10. Private Comment:

    Great to see you blogging technical stuff. I have been a big fan for years.


  11. Cliff, this does just underscore the fact that many people overlook: TCP is a solution based on certain technical realities. As with most modern low-level computing concepts, you cannot hide behind some vague abstraction of the truth; you must truly understand TCP to use it effectively. It seems to me that nothing here is really surprising given the way that TCP works – you cannot assume the network has zero latency, nor can you assume that packets will never be lost; thus, why assume that you can establish a connection, write to it, and then never have to ask the socket to know that your data will arrive? You *know* that you cannot know that the other side has received the data until you get an ACK. You *should* be shutting down the write side of your connection, and then waiting for a read result (i.e. ACK) before you assume that the other side received your data. *That* is the kind of reliability TCP gives you: if your transmission is acknowledged, then it is received; not “if you transmit it, it is received”. The better topic of this article would be “Using TCP without understanding it is a mistake” or perhaps “I am surprised that I have to understand how TCP works to use it”.

  12. Apart from what some people before me already wrote, I think they are right in that you should first close your side of the socket (shutdown), so you force a FIN to be sent. Then you wait for an ACK and you wait for a FIN request from the other side. After that point you can be sure that everything went well.

    What I don’t agree with is that you should implement some kind of arbitrary protocol. I personally would recommend using standard technologies and protocols. I had a short look at your code and would recommend implementing it as a Netty 4 handler; Netty 4 comes with a lot of stuff that you may find useful.

    In particular, you get a WebSocket implementation for free:

    Which might be exactly what you want: a small standard protocol on top of TCP that adds an additional FIN frame and that even browsers understand (currently all browsers support WebSockets), plus it bypasses firewalls and routes very well if run above SSL. There is nothing that prevents you from running it on top of UDP either; it should work very well above UDP. Using the Netty framework you can make the underlying protocol (UDP/TCP/Ethernet/…) transparent, and you may use the extension data to add a checksum for UDP, implement re-transmission and so on.

    Apart from that you get ByteBuf, which relies upon Unsafe and, if possible, falls back to normal ByteBuffers when Unsafe is not available.

    my 2 cents

    • No need for websockets; the cluster members are running in a trusted environment (with an expected-good network connection), not general-over-the-net stuff. The fix I have now is fast, lightweight & reliable – even under silly heavy loads (where “fast” means: I can get the “expected good” line speeds out of the network stack)

Comments are closed.