Wordfence Issues with Rackspace and a few other hosts now resolved
This has been a thorn in my side for some time now and I’m glad to report it’s 100% resolved. We had several reports from some of our customers using Rackspace and a few other hosts that Wordfence would intermittently not be able to connect to our scanning server.
I tried working with Rackspace a while ago to resolve this and, while Rackspace were very helpful and gave us a server in their cluster to play with, we didn’t have any luck finding the root cause of the issue.
We finally dug deep into this today, then found and fixed the root cause. The rest of this blog entry is technical, so if you’re not a network geek, you can stop reading now and know that Wordfence runs perfectly on Rackspace and many other hosts and no longer has connectivity issues to our scanning server.
When any TCP connection on the net is established, for example a connection to a web server, there is a three-way handshake. In each part of the exchange certain flags are set in the packet. It goes like this:
From client to server: SYN
From server to client: SYN-ACK
From client to server: ACK
Once this handshake has taken place, data transfer can occur. On analyzing traffic from one of the hosting providers we were having connectivity issues with, we noticed unacknowledged SYN packets coming in.
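From the client's side, a SYN that never gets its SYN-ACK shows up as a connection attempt that hangs and then times out. Here is a minimal sketch of how you might probe for that symptom from an affected host. The function name `probe` and its return strings are made up for illustration and are not part of Wordfence.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect and classify what happened."""
    try:
        # create_connection returns only after the handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No SYN-ACK arrived within the timeout: the SYN (or its reply)
        # was dropped somewhere along the way.
        return "timeout"
    except ConnectionRefusedError:
        # The remote end answered, but with a reset: nothing is listening.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Calling something like `probe("example.com", 443)` repeatedly from a server behind NAT and counting the "timeout" results gives a rough picture of how often the handshake fails. A single timeout proves little on its own, since packets get lost for many reasons.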
After researching possible causes in our kernel configuration we discovered that our scanning server had “tcp_tw_recycle” enabled. This causes problems when you have many different servers behind a single public IP address that is doing network address translation. Rackspace and a few other hosts use this configuration rather than have their WordPress servers connect directly to the Internet using their public IP addresses (as in the case of Linode for example).
For more info on this, here’s a quote from StackOverflow that describes the problem succinctly.
When you enable tcp_tw_recycle, the kernel becomes much more aggressive, and will make assumptions on the timestamps used by remote hosts. It will track the last timestamp used by each remote host having a connection in TIME_WAIT state, and allow to re-use a socket if the timestamp has correctly increased. However, if the timestamp used by the host changes (i.e. warps back in time), the SYN packet will be silently dropped, and the connection won’t establish (you will see an error similar to “connect timeout”). If you want to dive into kernel code, the definition of tcp_timewait_state_process might be a good starting point.
Now, timestamps should never go back in time; unless:
the host is rebooted (but then, by the time it comes back up, TIME_WAIT socket will probably have expired, so it will be a non issue);
the IP address is quickly reused by something else (TIME_WAIT connections will stay a bit, but other connections will probably be struck by TCP RST and that will free up some space);
network address translation (or a smarty-pants firewall) is involved in the middle of the connection.
In the latter case, you can have multiple hosts behind the same IP address, and therefore, different sequences of timestamps (or, said timestamps are randomized at each connection by the firewall). In that case, some hosts will be randomly unable to connect, because they are mapped to a port for which the TIME_WAIT bucket of the server has a newer timestamp. That’s why the docs tell you that “NAT devices or load balancers may start drop frames because of the setting”.
Some people recommend to leave tcp_tw_recycle alone, but enable tcp_tw_reuse and lower tcp_timewait_len. I concur :-)
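To make the failure mode concrete, here is a toy model of the per-source-address timestamp check described in the quote. It is a deliberately simplified sketch and not the kernel's actual logic: the class `RecyclingServer`, the timestamp values, and the documentation-range IP address are all invented for illustration.

```python
class RecyclingServer:
    """Toy model of a per-source-IP timestamp check; not kernel code."""

    def __init__(self):
        # Source IP -> newest TCP timestamp value seen from that IP.
        self.last_ts = {}

    def on_syn(self, src_ip, tsval):
        newest = self.last_ts.get(src_ip)
        if newest is not None and tsval < newest:
            # The timestamp appears to have gone back in time: drop silently.
            return "dropped"
        self.last_ts[src_ip] = tsval
        return "accepted"


NAT_IP = "203.0.113.10"  # illustrative address shared by every host behind the NAT

server = RecyclingServer()
results = [
    server.on_syn(NAT_IP, 900_000),  # host A, long uptime, large timestamp
    server.on_syn(NAT_IP, 5_000),    # host B, recently booted, small timestamp
    server.on_syn(NAT_IP, 900_100),  # host A again
]
print(results)  # ['accepted', 'dropped', 'accepted']
```

Host B does nothing wrong here. Its SYN is dropped only because the server sees both machines as one IP address and compares B's timestamp against A's. With one public address per host, each machine would get its own entry and both would connect.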
So in summary, if you have a busy server that is expecting lots of connections from multiple hosts behind a NAT router, you definitely want to set tcp_tw_recycle to 0 in your kernel configuration. The lines below also set tcp_tw_reuse to 0, though that setting only affects outgoing connections, so tcp_tw_recycle is the one that matters here. You can do this by editing /etc/sysctl.conf and adding:
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
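If you manage several servers, it can help to check the file rather than eyeball it. The following is a rough sketch that parses sysctl.conf-style text and reports whether tcp_tw_recycle is explicitly set to 0. The helper names `parse_sysctl` and `recycle_disabled` are hypothetical, and the sketch only looks at the text it is given, not at the running kernel's values.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style 'key = value' lines into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def recycle_disabled(text):
    """True if the text explicitly sets net.ipv4.tcp_tw_recycle to 0."""
    return parse_sysctl(text).get("net.ipv4.tcp_tw_recycle") == "0"


sample = """
# TIME_WAIT handling
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
"""
print(recycle_disabled(sample))  # True
```

To check a real machine you would pass in the contents of /etc/sysctl.conf. Editing the file does not change the running kernel by itself; running `sysctl -p` reloads it. As far as I know, Linux 4.12 and later removed tcp_tw_recycle entirely, so on those kernels the setting no longer exists.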
Comments
3:17 am
Only net.ipv4.tcp_tw_recycle needs to be disabled, tw_reuse works with NAT:ed clients.