webdevRefinery Forum: Half of internet goes down - webdevRefinery Forum


User is offline Kyek 

  • Founder of wdR
  • Group: Administrators
  • Posts: 5078
  • Joined: 20-February 10
  • LocationPhiladelphia, PA, USA
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 29 June 2012 - 11:19 PM (#1)

Half of internet goes down


So tonight there are some really huge storms on the East coast of the US. The thunder where I am is deafening, but half a million people in Virginia just a few hours' drive south of me are now without power. And right smack-dab in the middle of those people is Amazon's AWS East datacenter.

Casualty list so far:
Netflix
Pinterest
Instagram
Reddit (intermittently)
Heroku
Twilio
Some Github services
Woot
...my own damn company

It's always amazing when these things happen to see who all is using the same technology we use ;-)

Edit: Also, it's about damn time Amazon let customers sync services and databases between regions for more reliable failover.
0


User is offline Quinn 

  • More pew-pew, less QQ
  • Group: Members
  • Posts: 1307
  • Joined: 08-March 10
  • LocationPalmyra, PA, USA
  • Expertise:HTML,PHP,Javascript

Posted 29 June 2012 - 11:27 PM (#2)

I had a little bit of that thunder over here, but it was worse last night.

I've heard that Hershey was a few street lights last night. Tonight, the only big thunder clap that I heard almost spilled my drink.
<Imp> [F3ar 40]  [PWNbear 17]  [magik 15]  [dissident 10]  [mark 7]

Kyek, on 07 February 2011 - 07:11 AM, said:

Though anyone who thinks Europe is a country should be smacked in the face. By a train.
0


User is offline TheEmpty 

  • I say words in sequences.
  • Group: Members
  • Posts: 5154
  • Joined: 02-October 10
  • Expertise:HTML,CSS,PHP,Java,Javascript,Python,Ruby on Rails,SQL

Posted 30 June 2012 - 12:16 AM (#3)

Yes Heroku is killing me! Aww man, woot too!
Reserved.
0


User is offline Daniel15 

  • dan.cx
  • Group: Moderators
  • Posts: 3415
  • Joined: 17-April 10
  • LocationMelbourne, Australia
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 30 June 2012 - 02:58 AM (#4)

It's funny that "the cloud" was brought down by actual clouds. :P
Daniel15! :D

Repeat after me: jQuery is not JavaScript. It is not the answer to every JavaScript-related question. When you have to write some JavaScript, do not instantly react with "Oh, I'll do that with jQuery!"

5


User is offline ianonavy 

  • Group: Members
  • Posts: 685
  • Joined: 14-April 10
  • Expertise:HTML,CSS,Java,Javascript,Python

Posted 30 June 2012 - 03:03 AM (#5)

Daniel15, on 30 June 2012 - 02:58 AM, said:

It's funny that "the cloud" was brought down by actual clouds. :P

Posted Image

but I lol'd.
reputation += 1 if post.is_helpful else 0
0


User is offline TheMaster 

  • *-c0de master-*
  • Group: Members
  • Posts: 748
  • Joined: 24-May 10
  • LocationAustralia
  • Expertise:HTML,CSS,PHP,Java

Posted 30 June 2012 - 03:41 AM (#6)

Hahahaha! Nice, Daniel :P

Hmmm. Shouldn't datacentres have backup generators? Or like...UPS backups or something? Surely a major datacentre would have measures against.....thunderstorms?
0


User is offline derTechniker 

  • BadBoy™
  • Group: Members
  • Posts: 1210
  • Joined: 06-July 10
  • LocationAustria
  • Expertise:HTML,CSS,PHP,Javascript,SQL

Posted 30 June 2012 - 05:08 AM (#7)

Just some days ago I read an article on how a lot of those startup Web 2.0 companies rely on Amazon and how all of them would go down if Amazon screws up.
0


User is offline AwesomezGuy 

  • Certified Asshole™
  • Group: Members
  • Posts: 1245
  • Joined: 08-March 10
  • LocationIreland
  • Expertise:HTML,CSS,PHP,Javascript,SQL

Posted 30 June 2012 - 06:34 AM (#8)

Yet another reason to host your sites in Ireland. No major weather events. Some, but not much, copyright law. Cold most of the year, so cooling systems don't have to use much energy.
0


User is offline Daniel15 

  • dan.cx
  • Group: Moderators
  • Posts: 3415
  • Joined: 17-April 10
  • LocationMelbourne, Australia
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 30 June 2012 - 06:59 AM (#9)

Quote

Hmmm. Shouldn't datacentres have backup generators? Or like...UPS backups or something?

Backup generators and UPSes can't power the whole data centre for a long time (I think it's normally about enough power to safely shut down most servers). Only the most integral components (core routers and such) get lots of backup power.
0


User is offline TheMaster 

  • *-c0de master-*
  • Group: Members
  • Posts: 748
  • Joined: 24-May 10
  • LocationAustralia
  • Expertise:HTML,CSS,PHP,Java

Posted 30 June 2012 - 08:16 AM (#10)

Daniel15, on 30 June 2012 - 06:59 AM, said:

Backup generators and UPSes can't power the whole data centre for a long time (I think it's normally about enough power to safely shut down most servers). Only the most integral components (core routers and such) get lots of backup power.


Oh really? I thought it was something more like powering it for a few hours....

I guess that'd be an ENORMOUS amount of power though. Thousands of servers probably cost thousands each month to run....I'd hate to see that electricity bill....
0


User is offline NeilHanlon 

  • Group: Members
  • Posts: 884
  • Joined: 08-July 10
  • LocationRowley, Massachusetts
  • Expertise:HTML,CSS,PHP,Java,Graphics

Posted 30 June 2012 - 11:26 AM (#11)

Daniel15, on 30 June 2012 - 06:59 AM, said:

Backup generators and UPSes can't power the whole data centre for a long time (I think it's normally about enough power to safely shut down most servers). Only the most integral components (core routers and such) get lots of backup power.


This. At work, though not a data center, we host quite a few servers in one of the rooms and they're all on UPSes. The UPSes kick in when the power goes out, and give the systems time to shut down. We also have it on critical systems like all the computers we fix and stuff, so there isn't any hard drive failure, etc.
Thanks,
兄ニール

Website | Blog | @NeilHanlon | About.Me | Facebook | LinkedIn
0


User is offline Ruku 

  • I do Linux and that Internet thing.
  • Group: Members
  • Posts: 1351
  • Joined: 17-April 10
  • Location/root
  • Expertise:HTML,CSS,PHP,Javascript,Python,SQL

Posted 30 June 2012 - 11:47 AM (#12)

Lol @ all the idiot startups thinking that more than one server in the same damn datacentre counts as redundancy. No sympathy; they should have started running instances in different AZs as Amazon intended. Evidently nobody learned anything the last time this happened?

TheMaster, on 30 June 2012 - 08:16 AM, said:

Oh really? I thought it was something more like powering it for a few hours....

I guess that'd be an ENORMOUS amount of power though. Thousands of servers probably cost thousands each month to run....I'd hate to see that electricity bill....

Nope; a typical datacentre will have a UPS system that'll power the entire site just long enough for all the generators to fire (seconds/minutes), then the generators can run only as long as they have fuel (hours/minutes). Fuel's expensive!
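
For anyone who wants rough numbers on that UPS-then-generator handoff, here is a minimal Python sketch. Every figure in it (battery capacity, site load, fuel tank size, burn rate) is an invented assumption for illustration, not data from any real datacentre, and it ignores conversion losses and cooling.

```python
def ups_bridge_minutes(battery_kwh: float, site_load_kw: float) -> float:
    """Minutes the UPS can carry the full site load (idealised, no losses)."""
    return battery_kwh / site_load_kw * 60


def generator_hours(fuel_litres: float, burn_litres_per_hour: float) -> float:
    """Hours the generators can run before the fuel tank is empty."""
    return fuel_litres / burn_litres_per_hour


# Hypothetical site: 1 MW load, 250 kWh of batteries, 20,000 L of diesel,
# generators burning 270 L/h at that load. All values are assumptions.
SITE_LOAD_KW = 1000.0
bridge = ups_bridge_minutes(battery_kwh=250.0, site_load_kw=SITE_LOAD_KW)
runtime = generator_hours(fuel_litres=20000.0, burn_litres_per_hour=270.0)

print(f"UPS bridge: {bridge:.0f} minutes")
print(f"Generator runtime: {runtime:.1f} hours")
```

With these made-up inputs the batteries only need to cover minutes, while the fuel tank decides how many hours the site survives.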
0


User is online @Tom 

  • space
  • Group: Members
  • Posts: 704
  • Joined: 24-May 11
  • Locationspace
  • Expertise:Python

Posted 30 June 2012 - 12:24 PM (#13)

AwesomezGuy, on 30 June 2012 - 06:34 AM, said:

Yet another reason to host your sites in Ireland. No major weather events. Some, but not much, copyright law. Cold most of the year, so cooling systems don't have to use much energy.

But then you're prone to a mad leprechaun getting inside the servers and sprinkling lucky charms. That can cause huge issues to scalability.
ocelotapps.com
jr wdR comedian under ThatRailsGuy

arronhunt, on 30 June 2012 - 10:09 PM, said:

Sir you are the first person to make me piss myself laughing. Kudos.
2


User is offline Kyek 

  • Founder of wdR
  • Group: Administrators
  • Posts: 5078
  • Joined: 20-February 10
  • LocationPhiladelphia, PA, USA
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 30 June 2012 - 12:56 PM (#14)

Ruku, on 30 June 2012 - 11:47 AM, said:

Lol @ all the idiot startups thinking that more than one server in the same damn datacentre counts as redundancy. No sympathy; they should have started running instances in different AZs as Amazon intended. Evidently nobody learned anything the last time this happened?

It didn't matter. My company's stack is redundant in 3 zones, with exceptions for the ELBs (obviously) and an RDS instance that's set up with the managed multi-AZ failover.

I'm still working on my post-mortem, but it appears that the service responsible for detecting the failure of the primary RDS itself failed -- either that or the routing table wasn't reachable to actually re-point the hostname at the secondary, because the failover didn't happen. I had to work with support to manually re-point the hostname this morning. Even once that was done, there were connectivity issues between the AZs causing some of our services behind ELBs not to connect, so even though many of our servers were up and running, few of them could talk to each other.
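
As a rough illustration of the detect-then-re-point pattern described above, here is a minimal Python sketch. It is not how RDS works internally (that is not public in this thread); the hostnames, the in-memory `dns` dict, and the `failover_if_down` helper are all hypothetical stand-ins.

```python
from typing import Callable, Dict


def failover_if_down(
    dns: Dict[str, str],
    hostname: str,
    secondary: str,
    is_healthy: Callable[[str], bool],
    max_failures: int = 3,
) -> bool:
    """Re-point `hostname` at `secondary` after `max_failures` failed checks.

    Returns True if a failover happened. `dns` stands in for whatever
    record store a real system would update.
    """
    primary = dns[hostname]
    for _ in range(max_failures):
        if is_healthy(primary):
            return False  # primary answered, nothing to do
    dns[hostname] = secondary
    return True


# Hypothetical records and a health check that says only db-b is alive.
records = {"db.example.internal": "db-a.zone1"}
alive = {"db-b.zone2"}
did_failover = failover_if_down(
    records, "db.example.internal", "db-b.zone2", lambda host: host in alive
)
print(did_failover, records["db.example.internal"])
```

Both suspected failure modes map onto this sketch: if the monitor never runs, or the write to the record store cannot go through, the hostname stays pointed at the dead primary.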

It's a popular response to say "You can't blame Amazon, they made redundancy possible" and often times I agree with that sentiment. But in this case, the entire region's APIs went down, the failovers already in place didn't work, and once they restored power, it took 12 hours until they had the majority of services back up and they're still not finished. That is *insane*. Don't get me wrong, I still love the service, but clearly they need to revisit some of their failure plans.
0


User is offline Daniel15 

  • dan.cx
  • Group: Moderators
  • Posts: 3415
  • Joined: 17-April 10
  • LocationMelbourne, Australia
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 30 June 2012 - 09:30 PM (#15)

Quote

I guess that'd be an ENORMOUS amount of power though. Thousands of servers probably cost thousands each month to run....I'd hate to see that electricity bill....

It depends on what electricity costs in the area, as this varies a LOT. One server would be anywhere between $10-25 for an average server, and even more than that if it's very powerful (top-of-the-line processor, huge amount of hard drives, etc).
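
A quick back-of-the-envelope check of that range, as a Python sketch. The 150 W draw and the two electricity prices are assumptions picked for illustration, and cooling overhead is ignored.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_power_cost(watts: float, price_per_kwh: float) -> float:
    """Monthly electricity cost in dollars for a machine drawing `watts` 24/7."""
    kwh = watts / 1000 * HOURS_PER_MONTH
    return kwh * price_per_kwh


# Assumed: a 150 W server and two example electricity prices per kWh.
cheap = monthly_power_cost(150, 0.10)
pricey = monthly_power_cost(150, 0.22)
print(f"${cheap:.2f} to ${pricey:.2f} per month")
```

Under those assumptions the result stays inside the $10-25 range; a heavier machine or a pricier region pushes it up quickly.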

A lot of the time, getting a VPS is actually cheaper than using a home server.
0


User is offline arronhunt 

  • I'm a httpster
  • Group: Moderators
  • Posts: 3398
  • Joined: 09-March 10
  • LocationLos Angeles, CA
  • Expertise:HTML,CSS,Javascript,Graphics,Flash

Posted 30 June 2012 - 10:09 PM (#16)

itom07, on 30 June 2012 - 12:24 PM, said:

But then you're prone to a mad leprechaun getting inside the servers and sprinkling lucky charms. That can cause huge issues to scalability.


Sir you are the first person to make me piss myself laughing. Kudos.
DO NOT OPEN THIS

Spoiler
0


User is offline Fike 

  • Group: Members
  • Posts: 340
  • Joined: 26-October 10
  • LocationIreland
  • Expertise:PHP,Javascript,Python,SQL

Posted 01 July 2012 - 06:55 AM (#17)

AwesomezGuy, on 30 June 2012 - 06:34 AM, said:

Yet another reason to host your sites in Ireland. No major weather events. Some, but not much, copyright law. Cold most of the year, so cooling systems don't have to use much energy.


Wish I knew of some decent providers over here. Blacknight is the leader here I guess but for the money I have they're too expensive.
web developer :: HTML, CSS, JavaScript (node), Python, PHP, MySQL, Mongo.
server admin :: experience with debian (and debian based distros), Gentoo, FreeBSD, OpenBSD.
social :: @nixhead (Twitter), Fudge (IRC), Github (FionnK), Personal Blog.
0


User is offline Ruku 

  • I do Linux and that Internet thing.
  • Group: Members
  • Posts: 1351
  • Joined: 17-April 10
  • Location/root
  • Expertise:HTML,CSS,PHP,Javascript,Python,SQL

Posted 01 July 2012 - 01:38 PM (#18)

Just to be clear, my concerns were more angled towards healthcare providers hosting mission critical (as in life or death!) operations in single AZs. Hearing about all the foolish companies placing their customers' lives at risk purely out of selfishness and/or an insane lack of foresight was chilling.

Kyek, on 30 June 2012 - 12:56 PM, said:

It didn't matter. My company's stack is redundant in 3 zones, with exceptions for the ELBs (obviously) and an RDS instance that's set up with the managed multi-AZ failover.

I have positively no idea how Amazon's implementation works (I'm still not convinced that EC2 is cost-effective enough to migrate to), but surely you can just configure multiple A records on the domains which point to different load balancers? It's not particularly sophisticated or elegant, but it's a primitive way to ensure connectivity when a load balancer fails. I still see every individual IP address as a potential disaster zone.
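
The multiple-A-record idea depends on the client moving on to the next address when one fails. A minimal Python sketch of that client-side behaviour follows; the helper names are made up for illustration, and real browsers and resolvers differ in how (and whether) they retry.

```python
import socket
from typing import Iterable, Optional


def resolve_a_records(hostname: str, port: int) -> list:
    """Return every IPv4 address the resolver gives for `hostname`."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


def connect_first_available(
    addresses: Iterable[str], port: int, timeout: float = 2.0
) -> Optional[socket.socket]:
    """Try each address in turn and return the first socket that connects."""
    for address in addresses:
        try:
            return socket.create_connection((address, port), timeout=timeout)
        except OSError:
            continue  # this load balancer is unreachable, try the next one
    return None  # every address failed


# Typical use (hostname is a placeholder, not a real service):
#   addrs = resolve_a_records("www.example.com", 80)
#   conn = connect_first_available(addrs, 80)
```

The weak spot is the timeout: a dead address that silently drops packets costs the client the full timeout before it moves on, which is part of why this approach is primitive.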

Kyek, on 30 June 2012 - 12:56 PM, said:

I'm still working on my post-mortem, but it appears that the service responsible for detecting the failure of the primary RDS itself failed -- either that or the routing table wasn't reachable to actually re-point the hostname at the secondary, because the failover didn't happen. I had to work with support to manually re-point the hostname this morning. Even once that was done, there were connectivity issues between the AZs causing some of our services behind ELBs not to connect, so even though many of our servers were up and running, few of them could talk to each other.

Well that really sucks, and it may not exactly be something anyone can change, at least from a technical standpoint. The only way they could possibly avoid issues like this in the future would be to preempt emergency situations and make the changes before an outage actually occurs, which would give their provisioning systems the necessary time to make the changes. Of course, they could always provide reliable UPS and generator systems (like every other damn datacentre) to fail over onto when events like these occur, but I don't think we should really place blame until they've released their analysis.

Kyek, on 30 June 2012 - 12:56 PM, said:

It's a popular response to say "You can't blame Amazon, they made redundancy possible" and often times I agree with that sentiment. But in this case, the entire region's APIs went down, the failovers already in place didn't work, and once they restored power, it took 12 hours until they had the majority of services back up and they're still not finished. That is *insane*. Don't get me wrong, I still love the service, but clearly they need to revisit some of their failure plans.

I guess it's time to consider multiple hosting providers as well as availability zones? This is a huge argument for interoperability between cloud hosting services, and a phenomenal part of the reason our company would rather go about building its "cloud" on the likes of OpenStack and Eucalyptus than some proprietary platform spearheaded by such a huge private company. I love the premise, but their sheer scale of operations is somewhat intimidating :s
0


User is offline Kyek 

  • Founder of wdR
  • Group: Administrators
  • Posts: 5078
  • Joined: 20-February 10
  • LocationPhiladelphia, PA, USA
  • Expertise:HTML,CSS,PHP,Java,Javascript,Node.js,SQL

Posted 01 July 2012 - 09:58 PM (#19)

Ruku, on 01 July 2012 - 01:38 PM, said:

Just to be clear, my concerns were more angled towards healthcare providers hosting mission critical (as in life or death!) operations in single AZs. Hearing about all the foolish companies placing their customers' lives at risk purely out of selfishness and/or an insane lack of foresight was chilling.

Oh my, agreed. And screw multiple AZs -- if you're running literal life-or-death services and you're not in multiple regions, then your company should not exist.

Quote

I have positively no idea how Amazon's implementation works (I'm still not convinced that EC2 is cost-effective enough to migrate to), but surely you can just configure multiple A records on the domains which point to different load balancers? It's not particularly sophisticated or elegant, but it's a primitive way to ensure connectivity when a load balancer fails. I still see every individual IP address as a potential disaster zone.

That's just it -- that's super easy and even fairly cheap for things like webservers, especially when your failover endpoints use auto-scaling so you only need to pay for very minimal instances unless they're needed. But EC2 wasn't even our problem this time. Being stretched across multiple AZs, each of our services was durable enough to deal with it. I could have had us back online in half the time if it weren't for the fact that RDS (managed MySQL, ironically set up with failover) failed.

Quote

Well that really sucks, and it may not exactly be something anyone can change, at least from a technical standpoint. The only way they could possibly avoid issues like this in the future would be to preempt emergency situations and make the changes before an outage actually occurs, which would give their provisioning systems the necessary time to make the changes. Of course, they could always provide reliable UPS and generator systems (like every other damn datacentre) to fail over onto when events like these occur, but I don't think we should really place blame until they've released their analysis.

Agreed agreed agreed. I hope they release their postmortem.

Quote

I guess it's time to consider multiple hosting providers as well as availability zones? This is a huge argument for interoperability between cloud hosting services, and a phenomenal part of the reason our company would rather go about building its "cloud" on the likes of OpenStack and Eucalyptus than some proprietary platform spearheaded by such a huge private company. I love the premise, but their sheer scale of operations is somewhat intimidating :s

Really, we could stay 100% with Amazon and be fine. There are availability zones, which are physically separate datacenters within a region, and then the regions, of which there are 8 -- and 3 of those are in the US. So we could go multi-region and be fine. The issue is that, while Amazon has amazing support for spanning their services across AZs, there is absolutely zero support for spanning them between regions. Obviously for webservers this is a non-issue, as they're not interconnected and there's a million ways to scale them. But I'd have to stop using DynamoDB and RDS and SQS and SimpleDB and others, because awesome as they are, there's no multi-region sync on them.

To take a Dynamo-based DB multi-region, I'd be spending $20k on Riak and that's BEFORE server costs and data costs (since you pay WAN rates between regions). Taking MySQL multi-region is just a shit show, and I'd probably have to run MySQL Cluster somewhere. That means extra servers and an admin workload that I'd have to either hire or outsource for, and even then failure recovery is a joke. I'd be better off looking at something like Oracle for our relational needs, which is more software cost. SQS is something we'd probably have to build in-house, but even then we'd be running 3-4 more servers for durability and eating the associated cost.

What it boils down to is, at a certain point it's more cost-effective to hang up a "Please excuse our maintenance" sign for a few hours and cut our losses on the revenue for that period than it would be to undertake all that extra work and pay out all those software costs and server costs. All the while, this could be resolved almost entirely if AWS would support multi-region syncing/multi-region services. And every time they have an outage like this, there's more and more of an outcry for it -- so hopefully they'll come to their senses soon!
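
That trade-off can be written down as a simple break-even check. The Python sketch below uses placeholder numbers (outage hours per year, revenue per hour, yearly multi-region cost); none of them are real figures from this thread.

```python
def outage_loss(hours_down: float, revenue_per_hour: float) -> float:
    """Revenue lost while the maintenance sign is up."""
    return hours_down * revenue_per_hour


def cheaper_to_ride_it_out(
    expected_outage_hours_per_year: float,
    revenue_per_hour: float,
    multi_region_cost_per_year: float,
) -> bool:
    """True when eating the downtime costs less than going multi-region."""
    loss = outage_loss(expected_outage_hours_per_year, revenue_per_hour)
    return loss < multi_region_cost_per_year


# Placeholder numbers: 12 hours of outage a year, $500/hour in revenue,
# and $60,000/year of extra software, servers and admin time.
print(cheaper_to_ride_it_out(12, 500.0, 60000.0))
```

It leaves out the harder-to-price costs such as lost customer trust, so treat it as a starting point rather than a decision rule.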
0

