Predeployment Network Testing: Troubleshooting Firewall Session Table Problems

Kristi Thiele
November 9, 2011

One of my favorite cartoons as a kid was Mighty Mouse. If you just started singing the theme song—“Here I come to save the day!”—then you’ll relate to how I felt recently when we were called in to save the day at a data center. A recently deployed next-generation firewall/intrusion prevention system (IPS) had crashed, and it was about to be sent back to the manufacturer if someone didn’t come rescue it.

My colleague already covered the back-story in “3 Hours Versus 3 Weeks: How Predeployment Network Testing Saves Time and Money.” In this post, we’ll take a technical approach to see how this next-generation security device got in trouble in the first place. More important, you’ll learn how to “save the day” next time your infrastructure is in trouble.

Maintenance on a Holiday Weekend

Anyone who’s maintained a network or data center infrastructure knows what it’s like to have maintenance scheduled over a holiday weekend. (Basically, you don’t get a holiday weekend.) In this case, all of the change control procedures had been followed, including doing a trial deployment of the firewall/IPS in the disaster recovery site the weekend prior.

Now it’s after 2 a.m. on a Sunday morning, and this newly purchased next-generation device has been deployed into the production network. The designated users participating in the deployment have verified connectivity from their perspective and reported that everything is working within their applications. Success is declared, and everyone packs up and goes home to enjoy the rest of the weekend.

But just like in a Mighty Mouse episode, there is always that unexpected danger lurking. As overnight scheduled jobs began to run and users working on the holiday fired up their Web browsers, the danger grew. Finally, the breaking point was reached: browser requests and scheduled jobs started timing out and failing. It quickly became obvious that something wasn’t working. The IT staff followed procedures by pulling the newly deployed equipment out of service and rolling back to the old gear. Users were able to resume working, and the IT staff re-started batch jobs manually.

Isolating What Went Wrong

As a team of engineers from the system integrator and all the vendors assembled with the customer, they quickly began to question the amount of testing that had been performed. A few Web users and a few batch jobs that Sunday morning should not have created a problem, because the firewall/IPS was rated for a throughput much higher than that. The engineers were able to quickly rule out routing as the issue . . . but what was the root cause? The only way to find out would be to create some type of simulation of the whole scenario. They worked on this for weeks to no avail, and then they contacted us.

Simulating Real World Conditions

Not knowing exactly what had happened, but drawing on my experience testing IPS devices and firewalls within enterprise networks, I knew that these types of devices act very predictably when their session tables get full: no more traffic is allowed. That sounded like what they had experienced at this site.

Some devices have fixed limits based on the model, and you can’t modify them in software. The limits are sometimes over 1,000,000 sessions, which would seem to be much more than you need for typical enterprise use. Others, which may be deployed as a software appliance or on standard hardware, have a configuration option for the session limit, often set at 25,000 out of the box for the software version. This means that you could increase the limit without changing hardware, but you may not know when you have increased the software setting beyond what the underlying hardware can handle.

The first test I performed (you can see the whole timeline here) was aimed at filling the session table of the firewall. I simply created a test that used the BreakingPoint Session Sender test component to open valid HTTP sessions at a constant rate and leave them open. Once the session table was full, no more sessions were successful. Below is a capture of that behavior:

[Figure: TCP connections during the firewall session table test]

At this point everyone (and I mean the 10 engineers assembled to see if the problem could be resolved) began to discuss traffic types active on that fateful morning and how many sessions each user would have. It took less than 3 minutes to fill the session table and all new requests failed. Was that even possible?
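A quick back-of-the-envelope calculation suggests that it is. The sketch below is only illustrative: it assumes the 25,000-session out-of-the-box limit mentioned earlier (the actual limit on this device is not stated) and a rough figure of 20 connections per page load. The function and constant names are hypothetical.

```python
def seconds_to_fill(session_limit, sessions_per_second):
    """Seconds until held-open sessions fill the table, assuming none expire."""
    if sessions_per_second <= 0:
        raise ValueError("sessions_per_second must be positive")
    return session_limit / sessions_per_second


def rate_to_fill(session_limit, seconds):
    """New sessions per second needed to fill the table within `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return session_limit / seconds


# Assumptions for illustration only; neither value is confirmed for this device.
SESSION_LIMIT = 25_000       # out-of-the-box software default mentioned earlier
CONNECTIONS_PER_PAGE = 20    # rough number of connections one page load opens

rate = rate_to_fill(SESSION_LIMIT, 3 * 60)
print(f"Sessions/sec needed to fill the table in 3 minutes: {rate:.0f}")
print(f"Equivalent page loads/sec: {rate / CONNECTIONS_PER_PAGE:.1f}")
```

Under those assumptions, about 139 new sessions per second, or roughly 7 page loads per second across the whole site, would be enough to fill the table in 3 minutes if nothing ever expired.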

Application Sessions: How Many Is “Normal”?

There is a simple way to answer that question: just look at the network statistics (netstat) data after going to a typical Web page and see how many sessions it needs.

[Figure: netstat output showing established TCP connections]

As you can see, there are more than 20 “established” connections to the 74.125.x.x web servers. When a browser makes a request to a URL, often the content is on multiple web servers so it is possible for each browser session to have more than 20 connections open. But is there something else going on too?
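If you want to repeat this check on your own machine, counting the connections by hand gets tedious. Here is a minimal sketch that tallies established TCP connections per remote address from `netstat -an` text. It assumes the "address:port" layout used on Linux and Windows, with the state in the last column; the function name and the sample data (using documentation-only addresses) are made up for illustration.

```python
from collections import Counter


def count_established(netstat_output):
    """Count ESTABLISHED TCP connections per remote IP address."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        # Assumes the state is the last column (plain `netstat -an`).
        if fields[-1].upper() != "ESTABLISHED":
            continue
        # The remote address is the column just before the state.
        host, _, _port = fields[-2].rpartition(":")
        counts[host] += 1
    return counts


# Made-up sample in the Windows `netstat -an` layout.
SAMPLE = """\
Proto  Local Address          Foreign Address        State
TCP    192.168.1.10:49701     203.0.113.1:80         ESTABLISHED
TCP    192.168.1.10:49702     203.0.113.1:80         ESTABLISHED
TCP    192.168.1.10:49703     203.0.113.2:443        ESTABLISHED
TCP    192.168.1.10:49704     203.0.113.2:443        TIME_WAIT
TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
"""

for host, n in count_established(SAMPLE).most_common():
    print(f"{host}: {n} established")
```

To use live data instead of the sample, you could pass in the text returned by `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`.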

As we looked more closely at the statistics on the firewall, it was curious that minutes after the tests completed, the session table was still showing full. TCP resets and timeouts should have been enforced and therefore should have expired the closed TCP sessions. I examined this issue with my next test, where I wanted to demonstrate how short-lived sessions that were opened and closed would expire faster than the sessions that were held open in the first test.

I changed the Session Sender test component to be an Application Simulator component and created a load profile that would start with 100 sessions per second and add an additional 100 sessions per second every 30 seconds. Once again the session table was filled and no more traffic was allowed, which was expected, but the sessions were still not expiring and leaving the table in a reasonable time. This allowed the customer (and actually the vendor who was also present) to discover a setting on the firewall that indicated sessions could be held in the session table for up to 3600 seconds.
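To see why that timeout matters so much, it helps to model the table. The following is a simplified sketch, not the BreakingPoint test itself: it replays the load profile above (100 sessions per second, plus 100 more every 30 seconds) against a table whose entries linger for a fixed timeout after they are created. The 25,000-entry limit is an assumption taken from the out-of-the-box default mentioned earlier, and the function name is hypothetical.

```python
from collections import deque


def time_to_fill(limit, timeout_s, start_rate=100, step=100,
                 step_every_s=30, max_s=7200):
    """Return the second at which the table first reaches `limit`, or None."""
    table = deque()   # batches of (expiry_time, entry_count), oldest first
    occupied = 0
    for t in range(max_s):
        # Remove batches whose entries have aged out of the table.
        while table and table[0][0] <= t:
            occupied -= table.popleft()[1]
        # Ramp: the rate steps up by `step` every `step_every_s` seconds.
        rate = start_rate + step * (t // step_every_s)
        accepted = min(rate, limit - occupied)
        if accepted > 0:
            table.append((t + timeout_s, accepted))
            occupied += accepted
        if occupied >= limit:
            return t + 1
    return None


# Assumed 25,000-entry limit; 3600 s is the timeout found on the firewall.
print("Seconds to fill with a 3600 s timeout:", time_to_fill(25_000, 3600))
```

In this model nothing expires during the ramp when entries are held for 3600 seconds, so the table fills in under two minutes no matter how quickly the sessions themselves close.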

Digging a bit deeper, we executed more tests to determine if the type of traffic made a difference. A test was run with only HTTP carrying random payloads of various sizes. Another was executed that had a combination of HTTP and HTTPS. And finally a test was executed with a 50-50 mix of active and passive FTP. The results were all very similar; therefore we could rule out the type of traffic as an issue. Everything pointed to two settings on the firewall: the number of concurrent sessions allowed and the session expiration timeout.

Making Changes – Test and Verify

We then focused the rest of our time on the firewall settings. The session limit was raised so the session table could reach 100,000 entries before it stopped allowing traffic, and the session expiration timeout was decreased to allow for a faster cleanup of the closed sessions. All the tests I discussed above were run again, and we showed you could have more sessions active before the session table was full. Additional changes were made based on our testing, including making sure the CPUs in the server hardware were being used in a balanced fashion so that no single CPU would get overloaded and cause other issues.
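The two settings work together. At a steady arrival rate, the number of entries sitting in the table settles at roughly the rate multiplied by how long each entry lingers, so the sustainable rate is the limit divided by the timeout. The sketch below works through that arithmetic. The 60-second timeout is purely hypothetical, since the new value is not given here, and the 25,000 starting limit is the out-of-the-box default mentioned earlier rather than a confirmed setting.

```python
def max_sustainable_rate(session_limit, timeout_s):
    """New sessions per second the table can absorb without ever filling."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return session_limit / timeout_s


def steady_state_entries(rate_per_s, timeout_s):
    """Entries held in the table once arrivals and expirations balance."""
    return rate_per_s * timeout_s


# (label, session limit, timeout in seconds); limits and the 60 s value
# are illustrative assumptions, not the customer's actual configuration.
scenarios = [
    ("assumed default limit, 3600 s timeout", 25_000, 3600),
    ("100,000 limit, 3600 s timeout", 100_000, 3600),
    ("100,000 limit, hypothetical 60 s timeout", 100_000, 60),
]

for label, limit, timeout_s in scenarios:
    rate = max_sustainable_rate(limit, timeout_s)
    print(f"{label}: about {rate:.1f} new sessions/sec sustainable")
```

Raising the limit alone helps only in proportion to the increase, while shortening the timeout raises the sustainable rate far more, which is consistent with the decision to change both.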

Below is the original result for concurrent connections:

[Figure: concurrent sessions with the original firewall settings]

After making the changes, here is the same graph:

[Figure: concurrent sessions after the firewall changes]

This improvement was proof enough that the system could go back into production a few days later.

Moral of the Story

Yes, just as with Mighty Mouse, the day was saved. By spending just a few hours doing some testing and analysis, we were able to uncover configuration issues that meant the difference between a successful deployment and a disaster in the eyes of management. But just think about the value of doing the same testing BEFORE going into production. Proactive testing of devices before roll-out of anything new (software or hardware) will pinpoint issues before they occur.


Related Content:

3 Hours Versus 3 Weeks: How Predeployment Network Testing Saves Time and Money

4 Product Bakeoff Pitfalls and How to Avoid Getting Played

A Six-Step Plan for Competitive Device Evaluations
