This weekend was rather eventful, and we learned a valuable lesson about viruses, security software, and professional scepticism in IT environments. I've briefly documented it below so you can learn from our mistakes.
Last Wednesday a virus was detected on a client's network. The anti-virus (AV) host intrusion prevention system (HIPS) was updated to block access to the URLs the virus was using to fetch its payload and other control instructions. However, by Friday the domain lookups[1] for these URLs had increased massively, so much so that they caused the internal firewalls to fail under the load of trying to inspect this traffic. Domain lookups were then blocked at the firewall, but the source of the lookups persisted. Network access was restored, though, and outwardly there was nothing wrong.
We now needed a sample of the virus. Without it we could not get a signature from our AV vendor and have the virus removed from the whole environment. We also needed to know what it did so we could advise on appropriate remediation (e.g. changing Facebook passwords). We tried all sorts of things. We isolated one machine and tried to work out what would make the lookups stop (our test for the existence of the virus). We replaced the local domain cache with static entries (in hosts and lmhosts files) so that the lookups would not traverse the network (and hence not cause future outages) while we looked for the virus. We ran multiple tools and scoured 'infected' hard drives. We even dumped running programs from RAM, fearing the virus had injected itself into a legitimate Windows process. We killed processes until the machine bluescreened or rebooted, and tried everything else we could think of to find a failure criterion that would identify a sample. Meanwhile, our AV vendor's highest-level escalation analysts were connecting in to our machines, we were uploading hundreds of potentially infected files to them, and a hard drive had been priority couriered to them. This continued 24 hours a day over the weekend, with IT personnel working 24-hour shifts until breaking point.
Eventually, on Sunday, it was discovered that there was no virus (or at least not one as widespread as we believed) and that our AV's HIPS had in fact been causing the domain lookup flood. Once the HIPS was told to block traffic to the virus's domains, it would perform a domain lookup to find where each domain currently resided and block that address. However, instead of caching the lookup's result, our AV would regularly look the domain up again. This is likely because of 'fast flux'[2] malware domains, which bounce their actual location all over the internet. The unintended consequence, however, was a DNS flood. Even then, DNS traffic alone should not be heavy enough to cause significant network latency; it was the flood in conjunction with the firewalls attempting to inspect and apply policy to this traffic that caused the failure. What's more, the domain lookups should have gone to local DNS resolvers and not the internal top-level DNS (which is behind the firewalls), but they were bypassing the local cache because of misconfigurations that had been advised against.
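To give a feel for why the missing cache mattered, here is a minimal sketch in Python. It is not how our AV vendor's HIPS is implemented (we never saw that code); the class name, the `simulate` helper, the domain `blocked.example` and the address 192.0.2.1 are all made up for illustration. It simply counts how many lookups reach the upstream resolver when a single machine checks one blocked domain once a second for ten minutes, with and without a short-lived cache.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCachingResolver:
    """Hypothetical wrapper: serve repeat lookups from memory until they expire."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve          # the real (upstream) lookup function
        self._ttl = ttl_seconds          # how long an answer is trusted
        self._clock = clock              # injectable so we can simulate time
        self._cache: Dict[str, Tuple[str, float]] = {}
        self.upstream_queries = 0        # lookups that actually hit the network

    def lookup(self, domain: str) -> str:
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and now < cached[1]:
            return cached[0]             # still fresh: no network traffic
        self.upstream_queries += 1
        address = self._resolve(domain)
        self._cache[domain] = (address, now + self._ttl)
        return address


def simulate(checks: int, interval: float, ttl: float) -> int:
    """Count upstream queries for `checks` lookups spaced `interval` seconds apart."""
    fake_now = [0.0]
    resolver = TtlCachingResolver(
        lambda domain: "192.0.2.1",      # stand-in answer, no real DNS involved
        ttl,
        clock=lambda: fake_now[0],
    )
    for _ in range(checks):
        resolver.lookup("blocked.example")
        fake_now[0] += interval
    return resolver.upstream_queries


print(simulate(600, 1.0, 0.0))   # no caching: every check goes upstream
print(simulate(600, 1.0, 60.0))  # 60-second cache: roughly one query a minute
```

Under these assumptions the uncached run sends 600 queries and the cached run sends 10. Multiply that difference by every blocked domain and every machine on the network and a flood becomes plausible. The trade-off is that a longer cache lifetime tracks a fast-flux domain less closely, which may be why the re-lookups were so frequent in the first place.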
In summary, our own security software caused an unintended consequence that behaved much like a highly complex virus. In future, I would recommend that at least two solid points of evidence for the existence of a virus be found before anyone attempts to recover a sample (usually an infection vector and behaviour; if you see the behaviour after infection but not before, then you have something). While, on the surface, I feel quite stupid, in reality not even the top-level analysts at our AV vendor spotted the behaviour of their own software. The truth is that IT environments are sufficiently complex that these sorts of mistakes are more likely than not. As security people, our job is to know this, and to make sure we can guide people to get enough evidence to know what is happening before they waste an entire weekend. This is why I want all of you to learn from our mistake :)
[1] The internet works on IP addresses, which are sets of four numbers. They are not very human readable, so the domain name system (DNS) was created to allow human-readable names, e.g. www.google.com, which a domain lookup would translate to something like 209.85.227.103.
[2] Virus writers have worked out that if they rely on one domain then it is fairly easy to get it shut down. So they usually make their viruses query anywhere from a couple to several thousand domains, and make those domains regularly change the IP address returned (see [1]), providing two layers of rapidly changing complexity and making them that much harder to stop.
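For the curious, the two footnotes can be sketched in a few lines of Python. `resolve_ipv4` performs an ordinary domain lookup using the standard library's `socket.gethostbyname_ex`; `distinct_addresses` is a hypothetical helper of my own that repeats the lookup and collects every address seen, which is a crude way to notice a domain whose answers keep changing. It is an illustration only, not a fast-flux detector we actually used.

```python
import socket
from typing import Callable, List, Set


def resolve_ipv4(domain: str) -> List[str]:
    """Return the IPv4 addresses the local resolver reports for `domain`."""
    _name, _aliases, addresses = socket.gethostbyname_ex(domain)
    return addresses


def distinct_addresses(domain: str, samples: int,
                       resolve: Callable[[str], List[str]] = resolve_ipv4) -> Set[str]:
    """Look `domain` up `samples` times and collect every address returned.

    In real use the lookups would need to be spaced out (and local caches
    taken into account); back-to-back queries mostly return the same answer.
    """
    seen: Set[str] = set()
    for _ in range(samples):
        seen.update(resolve(domain))
    return seen
```

Calling `resolve_ipv4("www.google.com")` on a connected machine returns a list of addresses like the one in [1]. A stable domain should yield a small, steady set from `distinct_addresses`, while a fast-flux domain sampled over hours would keep adding new ones. Note the irony: polling a domain like this is exactly the kind of repeated lookup that caused our flood, so do it sparingly.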