We acquired a company (Company X) nearly as large as ourselves, complete with a headquarters and data center in another state. Company X's data center functions were consolidated into the main data center at our company's headquarters, and as a result a lot of equipment became redundant. We were in the process of streamlining, but were still a long way from completing the project.
One day, unbeknownst to me and the rest of our company's tech staff, a couple of our IT VPs took the corporate jet to survey the situation at Company X. We found out later that the VPs wandered around for a couple of hours, made some notes, then got back on the jet and flew home in time to check in at the office before heading home for dinner.
While they were in the air, all sorts of things started happening -- bad things that rarely happened, because we maintained a very robust infrastructure that could handle multiple failures without significant loss of productivity.
First, email started backing up, then stopped being delivered altogether. The mail servers and their backups were being flooded with more email than they could handle. Then a few remote sites lost access to all resources, even though the network was up and stable. It took a few hours for users to notice they weren't getting outside email and for a ticket to be created.
Some suspected a denial-of-service or spam attack. We started looking at the email issue. It was instantly obvious that we were being inundated, but not by nefarious attackers. We were being bombarded by friendly fire -- by something at Company X. A monitoring server was sending emails to Company X tech employees who had left the company.
The emails were being sent to Company X's old mail server, which had no accounts for those users, so it bounced them to our corporate headquarters by default. Corporate HQ didn't have accounts for them either, so it sent them back to Company X's server, because that is where that domain primarily accepted email. The system halted this process after one loop, so it should not have been a problem; for one or two emails, it never would have been an issue.
Then we discovered the emails were being generated at a rate of several thousand a minute, and with a fat pipe between the two sites, the traffic ramped up very quickly. It soon overran even our biggest servers and caused all mail to queue up.
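To put the arithmetic in perspective, here's a minimal back-of-the-envelope sketch of how an alert storm like this backs up a mail queue. The generation and delivery rates below are assumptions for illustration only; all we actually knew was "several thousand a minute."

```python
# Back-of-the-envelope sketch of queue growth during the alert storm.
# All of these numbers are hypothetical; the story only says "several thousand a minute."

ALERTS_PER_MINUTE = 3000   # assumed alert generation rate on the Company X monitoring server
COPIES_PER_ALERT = 2       # original delivery attempt plus one bounce between the sites
DELIVERY_RATE = 1200       # assumed messages per minute the mail servers could actually process

def queued_after(minutes: int) -> int:
    """Rough count of messages stuck in the queue after `minutes` of the storm."""
    inbound_per_minute = ALERTS_PER_MINUTE * COPIES_PER_ALERT
    backlog_per_minute = max(0, inbound_per_minute - DELIVERY_RATE)
    return backlog_per_minute * minutes

if __name__ == "__main__":
    for m in (10, 60, 180):
        print(f"after {m:3d} minutes: ~{queued_after(m):,} messages queued")
```

Once inbound traffic exceeds what the servers can deliver, the backlog grows without bound until someone unplugs the source.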
We finally got in contact with a local Company X person, who pulled the plug on the monitoring server. It took another hour to clean out the mail queues and stop the looping of what mail remained in them. In all, tens of thousands of emails were finally deleted, and another few thousand pieces of valid email were delivered.
Next, we turned to the remote-site connectivity problems. We quickly determined that the Company X remote site's local DNS server had crashed and could no longer resolve names; a reboot fixed it. It also turned out the backup DNS server at Company X's HQ had gone down.
At first it all seemed an odd coincidence, until our boss mentioned that our company's IT VPs had been at Company X earlier in the day. Upon their return late that afternoon, they admitted to turning off "unused" servers that, they assumed, had no impact on production systems.
It all started to make sense: the VPs had turned off servers, the monitoring server then generated massive amounts of alarm traffic addressed to people who no longer existed in the mail system, and one of the powered-off machines was the secondary DNS server for the remote sites.
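For anyone who hasn't been bitten by this before: client-side DNS failover only saves you if the secondary is actually running. Here's a minimal sketch of the failure mode the remote sites hit, with made-up server addresses standing in for the real ones.

```python
# Minimal sketch of why losing the secondary mattered. Addresses and state are
# hypothetical; the point is that failover only works if the backup is actually up.

# Resolver order for a Company X remote site (made-up addresses).
RESOLVERS = {
    "10.20.0.53": False,   # local primary DNS -- crashed
    "10.1.0.53":  False,   # secondary at Company X HQ -- powered off as "unused"
}

def resolve(hostname: str) -> str:
    """Return the first resolver that is up, mimicking client-side DNS failover."""
    for server, is_up in RESOLVERS.items():
        if is_up:
            return f"{hostname} resolved via {server}"
    raise RuntimeError(f"{hostname}: all resolvers down -- site cut off from named resources")

# With only the primary down, clients quietly fail over to the secondary.
# With both down, every lookup at the site fails even though the network itself is fine.
```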
Our company's tech guys on the ground had been required to submit change notices, complete with VP approvals, for years. Without one, we couldn't change an IP address, power off a server, or do anything similar. But in this instance, the VPs did not follow their own protocol. Apparently, for mahogany row it was a case of "do as I say, not as I do."