Introduction
My cloud journey was brief(almost two years total), but informative. At the time I didn’t realize how informative it would be for me more than a decade later(so for that, I am glad I experienced it, but wow was it annoying in the moment), I was more focused on how frustrating it was for me, as a person who had been managing mission critical internet-facing infrastructure for nearly eight years at that point(and fourteen years running internet facing infrastructure in general), to be using it at all. Biggest complaints at the time for me included lack of control and poor availability.
My journey spans three different companies starting in 2010 and ending in 2012. The first company(“BIZ1”) never ended up using cloud anything while I was there, I only had discussions with vendors and internal budgeting. BIZ1 went “out of biz” probably a decade ago now. The second company(“BIZ2”) was “born in the cloud”, a term I have used for a while to signify a company that has never had “on prem” anything, there was no “migration to the cloud”. Day 0 of everything was already cloud. That company also went “out of biz” about a decade ago. The third company(“BIZ3”) was “born in the cloud” for the purposes of their core application stack which they were developing from the ground up(at the time I joined they were using what I would call an “Application Service Provider”, I suppose you could call it SaaS as well but basically it was a semi custom app for their business, developed and operated by this third party. BIZ3 had outgrown the abilities of this third party and decided to build their own tech stack in house). Day 0 operations of that stack(for production/testing) were in the cloud. There was never a migration TO the cloud. But unlike the previous two companies there were migrations FROM the cloud in 2012. BIZ3 is still around, in a dramatically shrunken form. I am no longer there(was the last of the tech team to depart), several folks from BIZ3 went on together to the next company, I was/am among them. Though I am the only one to span BIZ2, BIZ3, and I guess you could call it BIZ4(where I am at now).
Cloud costs first impressions
In early 2010, I was at BIZ1, and one day my manager comes to me, and asks me if I can come up with a disaster recovery plan. I believe this was after we had a recent primary storage array outage, while that outage had some impact to direct customers(our entire back end was offline, took several days to fully recover), and no impact to users that hit our services from the internet it was still quite an ordeal. I do not recall being given any requirements, other than maybe come up with a couple of options.
I spent some time working on this, and came up with three options, I want to say the first option was roughly $450,000 worth of infrastructure for a bare bones DR setup, second option was around $650,000 which was a more viable setup, then there was a third probably priced around $800,000 for a premium setup. These DR systems had to remain online in order to replicate data to them, deploy apps to them and otherwise just make sure they are ready to take over if the time ever came. This was for a company that was entirely “on prem”, and the DR was for our back-end systems which were hosted at the Westin building in Seattle, total of about 20 racks for our back-end at the time(by no means were they close to being fully populated(and tons of older equipment), we had other active-active data center locations for front-end equipment that did not require DR, including one front end in Seattle not included in the 20 rack figure). My DR plans were leveraging the latest technology(honestly some of which wasn’t even due to be released for several months, that being the HP Opteron 6100 platforms, which I was so excited about at the time, also 10G networking throughout which was still pretty new at the time) and consolidating things into just two racks. I presented the options to my manager.
He came back to me and said something along the lines of “That looks nice, good job. But the VP of technology only budgeted $200,000 for this project”. My response was obvious confusion, what was this $200,000 based on? His response was he didn’t know what it was based on, if anything. Until this point I had never heard of anyone speak of disaster recovery so obviously I was never involved in any budgetary planning for such a scenario until now. Also note that this budget was set during the previous year(long before the outage that we had). I flat out told my manager it’s impossible to do a DR for BIZ1 for $200,000.
“Let’s throw it in the cloud” was about the words the VP of technology used when we were later having a brief discussion on the topic of DR. At that point I basically knew nothing about “cloud” other than the (now) common saying “it’s someone else’s computer” (which I don’t recall seeing people say that phrase yet as of early 2010, not even me). Nobody else really knew more than that at BIZ1, either.
Enterprise Cloud Provider
The most obvious answer initially was OK let’s go talk with a cloud provider that uses technology similar to what we use today. That initial provider was Terremark. They were one of the many VMware-based cloud providers at the time, they also leveraged 3PAR Utility Storage for their platform which I had been using for four years already(at two different companies), BIZ1 had a 3PAR T400 as their primary storage array(Side note to tech folks: HP had not yet acquired 3PAR). So this made sense to me from a technology standpoint anyway. 3PAR would often brag that something like 7 of the 10 largest cloud providers at the time used their platform.
(Side note, Terremark was actually acquired by Verizon barely a month after my engagement with them. They apparently took over operations of all of Verizon’s data centers. Years later, Terremark’s cloud operations were sold to IBM. I do want to call them out for building good data centers though, they had a fire in the electrical room of one of their data centers, and remained online throughout, zero impact to customers)
I engaged with Terremark, and throughout the entire process was always honest with them, I told them I have a solution here that I can build for $650,000 + hosting fees(which I think came to maybe $5-10k/mo). My management wants to find a lower cost alternative to my plan so we are talking with you.
They obviously didn’t get the message, you could say this was my first exposure to the “Cult of the Cloud”, in this case it was the cloud sales rep.
Their first proposal (check link for actual proposal) to us was a solid solution technically, but it had a slightly inflated price tag of $272,000 PER MONTH. My initial reaction was simply shock really. I remember telling the rep, “You do realize that I can build this for $650,000 and host it for $10,000/mo right?” Here comes the “cult” response: “But who’s going to manage it?”(I’ll never forget those words) Once again, my reaction was simply shock. Again, we were entirely on premises at this point and had more than 30 racks of equipment we were managing, how can you suggest we can’t manage my solution? I reminded him that my solution is not even two full racks of equipment, but he simply couldn’t understand. Or maybe he refused to understand, but his responses were always confusion, as in “Why wouldn’t we want someone else to run this for us?” So, that ended that discussion for the moment.
After checking the original emails which I still have, their first proposal came back with a second option as well but no PDF for that(originally I thought the second proposal came days later), just the email message. They said our 10G networking requirements aren’t easy to satisfy and that in fact they only had one facility at the time which was built out with 10G in mind that was in South America. HOWEVER, they were in the midst of building a new data center I think near Washington DC, which we could get in on. They said the new cost if we were to go with that solution was $120,000/mo. Which again didn’t seem viable if that was the end of the story. It wasn’t the end of the story. You see the first $272,000/mo proposal was really a good deal in that the “install fee” was only $100. The new proposal had a different install fee. What was it?
THREE MILLION DOLLARS.
Yes, you heard that right. I had told them on multiple occasions that I can build this for $650,000, and they are coming back to me again saying we can do the same for you it’s only going to cost THREE MILLION DOLLARS, plus over ONE HUNDRED TWENTY THOUSAND PER MONTH after that.
If we built it from scratch, on 10G Cisco UCS blades, you’d be looking somewhere in the neighborhood of:
Setup and equipment charges ~ 3 million
Monthly charges ~ 120,000
(By contrast I was looking at using quad processor 12 core HP Opteron 6100 blades(total 48 cores/server), which again weren’t due for release for a few months)
They weren’t even listening to me, if I were them I would not have even come back with ANY proposal at all, simply because they could not compete on price, even remotely. I mean if it was double the price that’s one thing but it was far more than double. The mindset of these people was so weird, it was like the Twilight Zone for computing. There is no option other than cloud, and we are the best for cloud. To this day more than fifteen years later I’m still confused…
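To put the three offers side by side, here is a minimal sketch that adds up the up-front and monthly figures quoted above. It assumes the upper end of my hosting estimate ($10,000/mo) and flat monthly pricing, and it ignores staff time, hardware refresh and any discounts, so treat it as rough arithmetic rather than a real TCO model. The names `Option` and `months_until_cheaper` are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    upfront: int  # one-time cost in USD
    monthly: int  # recurring cost in USD per month

    def cumulative(self, months: int) -> int:
        """Total spend after the given number of months."""
        return self.upfront + self.monthly * months


# Figures as quoted in the text; hosting uses the top of the $5-10k/mo range.
SELF_BUILT = Option("self-built DR", 650_000, 10_000)
CLOUD_FIRST = Option("first cloud proposal", 100, 272_000)
CLOUD_10G = Option("10G cloud proposal", 3_000_000, 120_000)


def months_until_cheaper(a: Option, b: Option, horizon: int = 120) -> Optional[int]:
    """First whole month at which option a has cost less in total than b."""
    for month in range(1, horizon + 1):
        if a.cumulative(month) < b.cumulative(month):
            return month
    return None


if __name__ == "__main__":
    for months in (12, 36):
        for opt in (SELF_BUILT, CLOUD_FIRST, CLOUD_10G):
            print(f"{months:>2} mo  {opt.name:<22} ${opt.cumulative(months):>12,}")
    print("self-built beats first proposal from month",
          months_until_cheaper(SELF_BUILT, CLOUD_FIRST))
```

Under those assumptions the self-built option is ahead of the $272,000/mo proposal by the third month, and it is never behind the three million dollar one.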
To Amazon Web Services…. ?
The VP comes back with “let’s throw it up at AWS”. At this point again I had near zero interest/knowledge/experience/etc dealing with AWS. BIZ1 was in the VERY early stages of moving their data analysis platform from their in house built stack to a Hadoop-based platform(apparently they ended up being one of the largest Hadoop players in the Seattle area initially that was after I left, at that point they didn’t even have a proof of concept done). Initial cost estimates for AWS were super high(don’t recall specifics). So the VP came back with an idea. Let’s ASSUME that we are on Hadoop and everything is working fine, let’s build a DR plan to handle our Hadoop stuff. The idea was that the AWS-portion of DR would be only temporary. Perhaps our primary data center burns to the ground for example, we run DR in AWS for a few months while we buy new equipment and build out a new site, then switch back to our stuff.
There were so many things that had to go right in that situation for it to make even a lick of sense. But I modeled it out anyway. I don’t recall most of the details, but we were going to use an S3-backed Hadoop platform. Zero effort went into actually figuring out how/if it would work, but the costs came back at something like $160,000/mo. VP assumed we’d run that for 2-3 months while we get new stuff online. So total cost in the area of $320,000 to $480,000 for those 2-3 months, then we’d have to add in the costs of buying new stuff as well. The discussion ended pretty quickly at this point after they double checked my math and came to the same conclusion I had established weeks prior before talking to any cloud provider.
It is impossible to do Disaster Recovery for $200,000.
Some time passes, maybe only a day, I don’t remember. But I do remember my manager coming to me and telling me to go ahead and start work on the $650,000 DR plan.
Result of the proposals
Nothing ever actually came of it in the end.
I was simultaneously working on another project, coming up with a specification for systems for our on prem Hadoop cluster. I had been working off and on(for a good 6-8 months) on a purpose built system from SGI(years later acquired by HPE) at the time leveraging their new “Cloud Rack” setup, which was a super efficient/scalable platform that removed things like fans, and power supplies from the servers and instead had those components in the rack. The cost for that was around $800,000. Once I gave THAT to my manager, he said something familiar to me. He said something along the lines of “Thanks! But the VP of technology has only budgeted $200,000 for this project as well”. I said “Based on what?” He didn’t know. Off we went on another quest which eventually had them revoke the budget for DR, only to give it to the Hadoop project. The CTO and VP didn’t trust me, thought I was being bought off by the vendors. I don’t know where they got that impression from. Simultaneously we were interviewing for another person to join the team. I think they only interviewed one person, I interviewed them myself. He was a decent mid range candidate but had zero mission critical experience(our SLAs with customers were 99.99% uptime, I don’t believe we ever breached that SLA while I was there). I thought he would be an OK fit on the team, not a bad person at all, not great, but not bad.
The VP had met a crappy Supermicro reseller at a conference who had a lower cost, I don’t recall the dollar value there. But I do recall a meeting where the CTO was literally going to newegg.com to look at the costs of components because he wanted to see what their margin was. I leaned over to the top developer at the company at the time and said “If this goes badly, I’m leaving the company”. Well, it went badly. They didn’t choose my solution. The whole situation was a shit show. I left the company soon after, and my manager(whom I liked) actually resigned from his position of Director of Ops and IT within days of my departure, reverting to just Director of IT, specifically because he had no confidence in the remaining team to make up for me being gone. The VP of technology later left as well(months later). I had a 1-2 hour discussion with the CTO prior to my last day, where he apologized to me for not trusting me and wished things had been different(ironically this was the second company in a row to apologize to me, the first was the previous company, who apologized to me years later, that is another story unrelated to cloud). The VP never communicated with me a single time once I gave my two week notice. I wanted to exit fast, and I did. The new guy was hired, but he didn’t start until after my last day. I offered to the VP saying hey I can leave today if you prefer, or I can wait the two weeks. He sent another person from the team to work with me to “transition”. I still think it is hysterical that he could not find it in him to say one word, one email, ANYTHING to me those last two weeks. I would have paid money to be a fly on the wall for that new “me” on the team those first few days. I later heard the new guy left for similar reasons as I did, and I was told specifically he really liked the 3PAR storage array they had, it was one of the highlights apparently.
Fast forward about a year, and I was having lunch with one of the lead developers on the Hadoop project and he said something along the lines of, to-date they had suffered a pretty consistent 30-40% hardware failure rate in their new systems, and due to the quorum requirements of Hadoop that had cut off something like 50-70% of their capacity for a solid year. One of the benefits of my proposal was NBD on site support, the CTO’s response to that at the time was “We’ll just hire some kid out of college”. I asked the developer if they ever hired such a person, and was told they did not. The new VP of technology was super pissed at this Supermicro vendor and was swearing they’d never do business with them again. At the same time he was trying to place blame on ME for some reason(for various things), and I just found the whole situation glorious. I was right, ABOUT EVERYTHING. AGAIN.
(Side note: had I been there I would have run all of the systems through burn-in tests before putting them in production. They obviously didn’t do that. I’m sure it’s not perfect but I have used a product(now obsolete probably but still worked last I used it) called Cerberus Test Suite(ctcs). I even used it two jobs later, I built a standard VM that ran ctcs automatically then I just cloned the VM and fired up as many as I needed to stress test the system. It obviously works fine on bare metal as well. I am mainly concerned with stressing the CPU and memory)
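For anyone who has not seen a burn-in test, the idea is simple: run deterministic CPU and memory work in a loop for hours and flag any result that differs from what it should be. The sketch below is NOT ctcs and is far gentler than a real stress tool; it is only a Python illustration of the concept, and the function names (`cpu_check`, `memory_check`, `burn_in`) are invented for this example. A real run would start one copy per core and leave it going for a day or more.

```python
import hashlib
import time


def cpu_check(rounds: int = 200_000) -> bool:
    """Run the same SHA-256 hash chain twice; healthy hardware gives equal digests."""
    def chain() -> bytes:
        digest = b"burn-in"
        for _ in range(rounds):
            digest = hashlib.sha256(digest).digest()
        return digest

    return chain() == chain()


def memory_check(size_bytes: int = 64 * 1024 * 1024) -> bool:
    """Fill a buffer with fixed bit patterns and verify every byte reads back."""
    for pattern in (0x55, 0xAA, 0x00, 0xFF):
        buf = bytearray([pattern]) * size_bytes
        if buf.count(pattern) != size_bytes:
            return False
    return True


def burn_in(duration_s: float,
            size_bytes: int = 64 * 1024 * 1024,
            rounds: int = 200_000) -> int:
    """Repeat both checks until duration_s has elapsed; return completed passes.

    Raises RuntimeError on the first mismatch, which is the signal to pull
    the machine before it goes anywhere near production.
    """
    deadline = time.monotonic() + duration_s
    passes = 0
    while True:
        if not (cpu_check(rounds) and memory_check(size_bytes)):
            raise RuntimeError(f"burn-in mismatch on pass {passes + 1}")
        passes += 1
        if time.monotonic() >= deadline:
            return passes


if __name__ == "__main__":
    print("passes completed:", burn_in(duration_s=60))
```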
Fast forward a bit later, and both of my remaining co-workers I had at the time while I was there asked to be laid off as the company was doing poorly, they wanted the severance. They got it. Company closed for good maybe 3 or 4 years after I left. This was AFTER they went super expensive and for a brief time became the largest Cisco UCS customer in the Seattle area.
I would not consider BIZ1 to be a fly by night operation, unlike BIZ2. BIZ1 was doing stuff at least as far back as late 2000, they were startup sized, but not super new by the time I joined them in 2008. Part of me misses that place, I still remember when they sent me the address to do the interview on site. I had to do a double take at first, then I realized they were literally ACROSS THE STREET from my apartment. I had co-workers parking further away than I LIVED in order to avoid paying the garage fees there.
Born in the cloud
(Sorry my memory is a bit fuzzy as far as when exact events took place so there is some jumping around, I remember the events, I just don’t perfectly remember the order in which they all occurred)
Onto BIZ2, which I started in September 2010, a company who was entirely in the cloud. They did not have a single server for anything. I was on the operations team, I don’t even recall them really having much of an IT department but whatever I didn’t care about that. I had zero AWS experience to this point and was just given the keys to the kingdom and off I went. There was at least one other person on the team in another part of the country but apparently he was being phased out in favor of me and other new hires(company was in the midst of moving from east coast to Seattle). He was very unresponsive and unhelpful. I suppose that is one reason they were phasing him out. There had already been high turnover on the team. My team grew to at least 3 or 4 people in the Seattle office that I can recall, and later added a manager of operations as well.
If the last company was a shit show to some degree this company was next level. Maybe I had heard the term before I don’t recall but I used the term “Death March” describing some of the situations there at least with the developers. Being worked so hard to meet impossible deadlines, high turnover, cutting corners, unstable app stack, endless problems, no monitoring, just pure chaos at every level. I don’t recall seeing the phrase “Move fast and break stuff” at this point of my career but they certainly did their best to do that, result was they actually seemed to end up moving slower since they spent more time repairing broken things.
(I recall one day specifically having a brief conversation with our CEO, telling her I thought it was very strange to be pushing the developers so hard to get this (feature/something) launched on this day(which was the same day as the Christmas party I think), and was surprised at her response, something along the lines of “I never set any such requirement I don’t know why that is happening”.)
I started trying to adjust to using AWS stuff, they used a 3rd party management layer to provision systems which also tied into DNS named Scalr. It was an interesting/novel approach(there was another player that did the same but was MUCH bigger(and 2000x more expensive), I don’t remember their name, but I do know Scalr was dirt cheap maybe not even $100/mo). I discovered our AWS bill was in the range of $400,000-$500,000/mo. WTF. How can this tiny shitbox company be paying that much for hosting. It was not long after that where I started realizing what IaaS actually is, how it is bad-by-design on so many levels. I took it upon myself to make a project to move them out of AWS entirely and save tons of money at the same time. I spent endless hours on it, fine tuning every detail. Initially just by myself, later got some management support. My hiring manager departed at around that time, and a new Director was hired who had previously worked for over a decade directly for Amazon.
We had several ex-Amazon people including our glorious CTO(yeah that is sarcasm there). The brother of our CEO was in fact the global leader of AWS at the time(and he is currently the CEO of Amazon). I actually went on a trip to his office along with my CTO(and maybe my hiring manager too, he may not have left yet) at one point to meet with this AWS leader and his “Chief Scientist” to discuss our experience with AWS. I thought I was quite polite and measured but I had a lot to unload as far as how poor our experience legitimately was. The response was mostly apologies and that they were working hard to fix and improve stuff. I don’t recall anything else other than my CTO later complaining to me harshly for being “too hard” on them (or something like that). I told him I was holding back BIG TIME. It was really hard to contain my frustration with my AWS experience at that time, but I believe I did. I don’t recall if this meeting happened before or after I had started my migration plans, I suspect it was before.
(Side note: at some point in the months that followed my new director fired off an email to AWS support, saying a polite version of “Hey, everyone at my company hates your services, we have non stop problems, and we must be doing something wrong. We spend a lot of money with your services, are located in Seattle not too far from you. We also have high level relationships as well with AWS. Would you be willing to come on site and tell us some things we are doing wrong so we can have a better experience?” Their response was something along the lines of “No, that’s not our model, you figure it out”. My director, who had over a decade of experience working for Amazon was quite floored at the response, I don’t doubt their support HAS improved in the years since but it is an experience that has stayed in my mind)
The Blog post and legal threat
At around the same time, maybe a bit after this meeting is when the reality of IaaS hit me like a truck. It was a huge realization that just sort of came to me overnight in a way that I could not recognize before. I had a tech blog at the time, so I felt compelled to write it out, as I literally could not sleep(likewise, when I made this website I had 3 sleepless nights in a row writing for over 10 hours a day), I had to get the words out. So I wrote a post titled “Amazon EC2: Not your father’s enterprise cloud” early on October 6, 2010. I sent a link to my manager because I was already involved with him on the move out plans, though I think I was doing 95% of the work(which I didn’t mind, I enjoyed it). I wanted his opinion on things. A bit later that morning(maybe as early as 8 or 9AM), there was some article on theregister.com regarding cloud stuff. I have no recollection what the article was about now, but I wrote a comment there linking to my blog post from that day. I didn’t think anything of it at the time. My manager replied back to me regarding my post saying he thought it was a great post, and was “very balanced”. To this day I have not changed one word on it, as I wanted to keep it the same for historical purposes. Fast forward another few hours and I get pulled into a meeting with my CTO, and maybe HR too (my hiring manager was NOT there as he was not based out of that office). They attacked me for the blog post, saying specifically I was leaking “confidential information” from the company and they were threatening legal action against me if I did not take the post down. I asked them what information am I leaking, there is nothing private there. Their only response was the two images I had on the blog post that I had included in a slide deck (THAT NOBODY HAD SEEN YET), that had ZERO company specific information, those slides were/are VERY GENERIC and apply to anyone. They didn’t agree and threatened me further. 
So, this wasn’t THAT important to me so I agreed I’ll take the post down, whatever.. (I kept the post down until they later went out of business and then I published it again maybe 4-5 years later, unchanged). The obvious conclusion I drew to the whole situation is someone(s) at AWS saw my blog post and it made it around the vine there, all the way up to the top, which then prompted calls to my employer to take it down. I never told anyone other than my manager and co-worker that morning about the post(at that point).
(Some more time passes)
Project Sedona
My new Director came on board and we got along pretty well(he later called me an “outstanding engineer” I think). I pitched him my project plans and he came around pretty quick. He became a very staunch supporter of my plans, which internally we called “Project Sedona”, as they were paranoid of mentioning moving out of AWS as making people upset(yes that is foreshadowing). Fast forward 5-6 months or so, working on this project was really the thing that kept me sane at the company. Ironically one of my co-workers was doing something similar(working on a project to keep himself sane), in his case he was working on a new software deployment system. That consumed the bulk of his time(nobody asked him to build such a system), so much so that I really had little interaction with him, he was in his own little world. To put some perspective on how much time I spent on this, as I still have all of the materials. The general presentation I made in OpenOffice which is really what most people saw consisted of fifteen slides, about half of which were filled mostly with pictures and graphs. The full extreme in depth end-to-end-every-possible-situation-examined slide deck that almost nobody saw was 170 slides. The proposal was for a single rack of equipment(fully populated).
(During this time one of the tasks I was assigned was to reduce our AWS bill, at peak I want to say we had over 500 servers in AWS, and at least 100, maybe 200 of them nobody had access to. Nobody knew what they did(if anything), and as a result, everyone was scared to turn them off in case it would break something. I managed to get the server count down to roughly 320 systems without breaking anything.)
At the end of the day, everyone was on board with “Project Sedona”. From my director, most/all of the developers, the whole Ops team, even the CTO and CEO were on board. Everyone was excited. All that was left was go to the Board of Directors to get the project approved and off we go. For a project which had a documented 6 and a half month ROI(and projected $4M savings over 3 years), what’s not to like about that?
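As a sanity check on those two headline numbers, here is a small worked example that backs out what they imply together. It rests on two assumptions that are NOT stated above: that the savings arrive evenly every month, and that the $4M figure is net of the up-front spend. The real proposal had far more detail than this, and `implied_figures` is just a name made up for the sketch.

```python
from typing import Tuple


def implied_figures(payback_months: float,
                    net_savings: float,
                    horizon_months: int) -> Tuple[float, float]:
    """Solve for (monthly_savings, upfront_cost).

    Assumes:
      upfront_cost = payback_months * monthly_savings
      net_savings  = horizon_months * monthly_savings - upfront_cost
    """
    if horizon_months <= payback_months:
        raise ValueError("horizon must be longer than the payback period")
    monthly_savings = net_savings / (horizon_months - payback_months)
    upfront_cost = payback_months * monthly_savings
    return monthly_savings, upfront_cost


if __name__ == "__main__":
    monthly, upfront = implied_figures(6.5, 4_000_000, 36)
    print(f"implied monthly savings: ${monthly:,.0f}")
    print(f"implied up-front spend:  ${upfront:,.0f}")
```

If either assumption is off (for example if the $4M was gross rather than net), the implied figures shift accordingly, so read the output as a ballpark only.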
Rejected
The board rejected my proposal. The only thing communicated to me was they wanted to “shelve” the project and revisit it in a year. I don’t recall the timeline specifically again, but I was pretty much done at that point. No reason to stick around at this place any longer. I left the company shortly after, maybe one or two weeks? Three weeks? I don’t know. But I am proud to say I was not alone. My director didn’t even come in for my last day at the office, and he resigned literally the first day I was gone(Note, two directors at two different companies resigned immediately following my departure!!). One of my co-workers that I was close to(not the guy working on the software deployment stuff) also resigned that week. My departure triggered a mass exodus of probably 20ish employees (given the total number of tech employees that was a LOT). Many of them came to BIZ3, where my original hiring manager from BIZ2 went, and he hired me at BIZ3 too! BIZ2 made sure to give me one more legal threat regarding my blog post on my last day before I left.
The CTO really wasn’t happy with me for whatever reason at that point, I’ll share one more story I was told. One of the things I did while I was there was sign them up to Dynect for external DNS I think it was. I’m always the responsible person and handled everything correctly. After I departed the CTO went to Dynect and tried to cancel the contract. He told them I was not authorized to enter the company into an agreement with them so that is grounds for terminating the contract. Dynect came back to them, with the signed contract, showing it was MY DIRECTOR WHO SIGNED FOR IT(I didn’t even bother to have the manager sign, went straight to the director who I was closer to anyway). Such a hilariously embarrassing situation for BIZ2 to be in!! I almost never sign for stuff even today unless it’s tiny dollar amounts that can be filed in an expense report. There’s no reason to. So they were “stuck” in that contract with Dynect (who was a perfectly fine provider, I was introduced to them at BIZ1, and continued using them even at BIZ3 until they were completely absorbed by Oracle cloud, I miss Dynect to this day in fact, specifically the user interface).
Fast forward to 2012, I was already at BIZ3 for some time along with the other ex-BIZ2 folks (and a lot of other people unrelated to BIZ2). Amazon had a huge outage in us-east-1 causing multi day downtime for many customers, including BIZ2. I was later told, the leadership of BIZ2 decided it was not acceptable to have that kind of outage so they worked to “fix” their application stack to be multi region in AWS. They worked on it for a month, maybe two, then gave up.
I recall having a call with their most senior ops person, who I believe was still based in Australia at the time working for a consulting company. This guy was/is brilliant, and very friendly as well. I ended up working with one of his co-workers(from the same consulting company, brilliant as well) at BIZ3, and am literally still working with that same guy 15 years later. Anyway, I recall asking him one question(maybe six months after I left), how do you deal with all the BS at that company? I’ll never forget his response: “Copious amounts of alcohol and swearing”.
Company went out of business a couple of years later. Years later still that Director that resigned the day after me tried to hire me on multiple occasions, by then he was some big shot director at Oracle Cloud. I liked him, but I liked my new position more and didn’t want to work for Oracle so I declined. Eventually he retired. One of my co-workers from BIZ2 is still at Oracle cloud in some high level technical role. He was/is a very smart guy, oh and that software deployment system he wrote? It was a disaster. He spent months working on it(with zero feedback from anyone) and I was told the other ops guy literally used it once, said it was too complicated and threw it out. That ops guy that went onto Oracle, we don’t operate the same way, I certainly hold respect for him for his technical skill(which I could only gauge based on limited information but I had a feeling he was very good at what he does), but I don’t think we would be good to work together again based on the vibes I got from him(he really didn’t work with anyone while I was at BIZ2, myself included).
Hired to move out of the cloud
BIZ3 ended up being my longest stint at any company in my life, over a decade of employment there with glowing reviews year after year. As mentioned, my hiring manager at BIZ3 was also my hiring manager at BIZ2. He hired me with the intention of leveraging my skills and knowledge to move them out of AWS quickly. They hadn’t even launched their production stack in AWS yet and my manager already knew they shouldn’t stay there long. I was hired in May 2011, and started working on plans to move out. Didn’t have a whole lot of info since the app stack wasn’t running yet, but I did what I could. During the initial months my manager worked closely with the outsourced ops person(whom I still work with today) on running scalability testing on the app stack to get an idea how much server capacity they need and the costs etc to run once we go live. I was not involved in any of it(nor did I want to be).
Launch day, turn the knobs to eleven
We launched in AWS in late September 2011, after several critical delays in the software readiness. Years later, the CEO admitted had they known how “not ready” the software was even at that point they would have postponed till early 2012. Early days for launch were chaos, literally immediately every benchmark, and scalability test that was run was declared invalid, and we were told to turn the knobs to eleven, max everything out to get better performance. Tons of bugs, tons of outages, it was pretty crazy. Though the level of technical skill and confidence(?) in the senior team at BIZ3 made the team at BIZ2 look like a bunch of herded cats maybe? I don’t know it was night and day difference. Good times, fun stories… though I’m sure the developers were going through hell. Meanwhile I continued to work on my AWS move out project, which got approval this time, I was later told it was very difficult to get approval, we were spending roughly $90,000/mo when we moved out of AWS, and nobody was really blinking at that(out of nothing other than ignorance I think, which is fine…), but the general response to the proposal was “Everyone is moving INTO the cloud, why do we want to move OUT?”. My manager put everything on the line to push the project through and fortunately we had a very nice and supportive CTO who backed us up.
Data center chosen, hardware deployed
My manager decided he wanted a data center on the east coast, for the sole reason of latency to Europe in the event we had to direct traffic (disaster recovery or something) from the Europe data center(not set up yet). I didn’t find this reason valid myself, mainly because if you are in that situation the last thing you should be worried about is latency, and you should just be happy that the damn thing is working at all hah. Anyway, I was as impressed with the facility as he was, it was by far the most sophisticated facility I have ever personally visited. I certainly had reservations about having our only facility be over two thousand miles away from me. I was used to being able to make regular trips to the data center to do stuff, and that would have to change.
Our initial build out was tiny, I said before we were spending around $90,000/mo in AWS in the month prior to cut over. I believe we had less than 100 servers in AWS at the time. What does that translate to in data center terms for us? A 64-sq foot cage and two oversized 47U racks that are roughly half populated each. One rack with servers, and network gear, the other rack with storage only. We had 8 HP DL385G7 servers for VM hosting, a single HP 3PAR F200 fibrechannel storage array, four ethernet switches, two fibrechannel switches, and two load balancers. I think that is it. Remote access was SSL VPN provided through the load balancers. There was no formal firewall(load balancers were the access control) initially.
I did basically everything(see this) from an infrastructure perspective including deciding on the equipment, racking & stacking, cabling(11 cables per server!), and all configurations through the stack – servers, storage, networking, hypervisors, operating systems. My consultant co-worker focused more on the “DevOps” angle of things(that division of labor continues today).
Cut over to data center
While in AWS we leveraged EC2, S3, RDS, and SQS. At least at the time(AFAIK still true now?) it was not possible to establish MySQL replication FROM RDS to anything OUTSIDE of RDS. So while we could deploy all of our apps and stuff at the data center ahead of time, the actual cut over would take a while, as we had to do a mysqldump of the data out of RDS, copy the dump over to the data center, then import it. It took several hours to do this. I recall going to the office in the early afternoon, maybe 5-7 hours before the cut over was to begin, and telling one of our senior developers something like “I think I’ll still be here at my desk 24 hours from now”(and I think that ended up being true). In advance of our cut over our new MySQL DBA ran several performance tests on MySQL in our new data center. His feedback was it was the fastest he had ever seen MySQL run, and he had some decent experience working with MySQL at one or two large companies, including Motorola I think. I was happy he was happy, but at the same time I felt his tests were invalid, and told him as much, specifically because whatever he was doing was not touching the storage array, so it had to be running almost entirely in memory. But whatever, he was super happy so that is fine.
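For anyone who has not done a dump-and-reload cut over, the three steps described above(dump out of RDS, copy to the data center, import) can be sketched roughly like this. This is only an illustration, not the commands actually used in 2012: the host names, database name, and file name are hypothetical, credentials are assumed to come from a MySQL option file, and the function only builds and prints the commands rather than running anything.

```python
import shlex


def build_cutover_steps(rds_host, dc_host, database, dump_file="cutover.sql"):
    """Return (where_to_run, argv) pairs for a dump / copy / import cut over.

    Nothing is executed here. All host, database, and file names passed in
    are hypothetical placeholders, not values from the original migration.
    """
    dump = [
        "mysqldump",
        f"--host={rds_host}",
        "--single-transaction",  # consistent InnoDB snapshot without table locks
        "--routines",            # include stored procedures and functions
        "--triggers",            # include triggers
        f"--result-file={dump_file}",
        database,
    ]
    # Ship the dump file to the data center database host.
    copy = ["scp", dump_file, f"{dc_host}:{dump_file}"]
    # Import on the data center host using the mysql client's "source" command.
    load = ["mysql", database, "-e", f"source {dump_file}"]
    return [
        ("host near RDS", dump),
        ("host near RDS", copy),
        ("data center DB host", load),
    ]


steps = build_cutover_steps("rds.example.internal", "db1.dc.example", "appdb")
for where, argv in steps:
    print(f"[{where}] {shlex.join(argv)}")
```

The application has to stay offline(or read-only) for the whole sequence, which is why the size of the dump, and not the deployment of the apps, set the length of the cut over window.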
(I was super hesitant to do the actual cut over on that day, as I had one outstanding, potentially serious hardware issue that was not yet resolved, though my manager REALLY didn’t want to push the date any further. The 10G network cards we had from HP/Qlogic had a manufacturing defect that could cause them to fail(a reboot would fix it). The problem was becoming more widely known at the time, and one of my blog readers, whom I came to know a little better, had the same NICs and had major issues, just on IBM hardware. I had specifically built my systems to handle NIC failures gracefully(long before I knew of this issue) with a total of four 10G network links (2 for VMs, and 2 for cluster communication, active/passive, plus 2 dedicated 1G active/passive links for host management). So if one 10G card went bad the other would still be fine(if all 10G links went down I could still manage the host over 1G, and storage was fibrechannel so VMs would not crash if both 10G cards died). The guy I was chatting with had his servers with just a single 2 port 10G NIC and that was it. I was already working with HP on the issue, but they were apparently hesitant to replace the cards. From what I heard from this other guy, the supply of new cards was super limited, so you really had to show you had this issue before they would replace them. From what I recall I only had maybe two real incidents(and none after launch). The first one is what alerted me to the situation, I don’t recall the details. But HP had me put the driver in debug mode or something, to get more data the next time it happened. When it happened again the server went offline entirely(both 10G cards went down), something about the driver being in debug mode meant the system couldn’t recover automatically. In the end I got replacement cards(a few months later) and there was never any impact to the systems after we went live, so it worked out.)
The cut over itself I think overall went pretty smoothly. I don’t recall most of it, but I do recall our DBA almost quitting his job over one aspect of it, or almost getting fired that first day either way. He said the MySQL query cache was terrible and he had disabled it. He was adamant he was not going to re-enable the cache. Performance at the time was horrible, everyone was complaining. We knew we had to have the query cache enabled, and we understood the design of the query cache was bad. But the platform we were using really REQUIRED the cache to operate correctly. We went back and forth with him over an hour or two perhaps and he eventually relented and enabled the cache, and performance skyrocketed. As with any big cut over I’m sure there were lots of other issues big and small to work through. But we got through it.
CTO Reaction
The CTO sent a company wide email about 24 hours after we went live in the new data center, basically declaring the move a success at that point. It included quotes such as:
- “The move provides a solid foundation for operational excellence, in which full visibility and underlying control will enable steady ongoing improvements to system performance and scalability. In day 1, it reduced the slowest (3+ sec) Commerce requests by 30%. In addition, it reduces costs by 50% and will pay for itself within the year.”
- “As a reference point, (Senior tech manager) noted that ‘I’m used to seeing this kind of project take many more weeks and require a team three times as big.’”
- “(My manager) advocated passionately for this project and drove it with a strong sense of urgency. Dissatisfied with the mediocre performance and high costs of a public cloud solution for a system of our size and complexity, (My manager) insisted to pursue a path to operational excellence.”
- (My Manager) challenged his teams to execute the project in parallel with other intense project work, effectively doing double-time for the past several weeks.
- (My Manager) drove hard for a Feb 22 launch date, stubbornly demanded focus by de-prioritizing less critical work, and found ways to hit the date without cutting corners or introducing risk.
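The “reduces costs by 50% and will pay for itself within the year” line can be turned into a rough number. The sketch below assumes the roughly $90,000/mo AWS bill mentioned earlier is the baseline, and reads the 50% as a straight halving of monthly run cost. The actual up-front hardware spend is not stated anywhere in the email, so all this computes is the ceiling implied by those two figures.

```python
def payback_ceiling(old_monthly_cost, savings_fraction, payback_months):
    """Return (monthly savings, largest up-front spend those savings repay in time)."""
    monthly_savings = old_monthly_cost * savings_fraction
    return monthly_savings, monthly_savings * payback_months


AWS_MONTHLY = 90_000   # approximate AWS bill in the month before cut over
SAVINGS = 0.50         # "reduces costs by 50%" (assumed to mean monthly run cost)
PAYBACK_MONTHS = 12    # "will pay for itself within the year"

monthly_savings, ceiling = payback_ceiling(AWS_MONTHLY, SAVINGS, PAYBACK_MONTHS)
print(f"monthly savings: ${monthly_savings:,.0f}")
print(f"implied ceiling on up-front spend: ${ceiling:,.0f}")
```

Under those assumptions the savings come to about $45,000/mo, so anything up to about $540,000 of up-front spend would be repaid within twelve months.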
Second AWS extraction
Later in 2012, we did another AWS migration. The first migration was for our North American customer base, and the second was for our European customer base(which was tiny by comparison). For tax purposes the company required the servers be physically located in Amsterdam, even though the initial launch of the EU stuff was in AWS us-east-1. I think they tried to find a cloud provider in Amsterdam and either could not find any or could not find a suitable one at the time. Not sure. But I was tasked with building out an even smaller footprint there, which I did, flying to Amsterdam mid year to set it up, probably half of one rack in a shared hosting space(no cage). I later came to hate that facility, and the employees that worked there, with a passion. We eventually moved out probably 6 or 7 years later(consolidating EU operations back to our US data center). But that launch was a success as well. I don’t really talk about it since it was more of a footnote to the main show, but wanted to mention it anyway.
Still used some AWS after extraction
I don’t really talk about it since it is so trivial, but wanted to mention to be clear, the app stack was designed to use SQS, as well as S3 for storing some data. They continued to leverage both services from the data center for a few years(until the app stack was replaced entirely), however the SQS and S3 portions of the bill were basically rounding errors, I’d be surprised if it was more than maybe $200-500/mo. They just didn’t get around to changing the design till they decided on a new app stack.
A Decade of Service Excellence
We continued to operate in that facility for over a decade following the move out with great success. The company briefly grew massively, our virtual server count eventually peaked at around 1,200 systems(this was when we had both app stacks running in parallel). It was pretty much all smooth sailing(relative to AWS), very few issues.
Despite the facility being located two thousand miles away, things worked out well. I did go on site several times, mostly for new equipment installs, but it was not a regular thing. I think my first on site visit after go live was close to two years later. I typically made one or two trips a year, sometimes skipping a year or two though. We did eventually expand to 4 racks in a 128 sq foot cage, and not long after added another 2 racks in another 64 sq foot cage, but that cage only lasted a couple of years before we shrank back down again.
Overall I’m absolutely convinced my setup provided better performance, availability, and flexibility during that decade plus than AWS’ us-east-1 did during that same time frame. During that decade we had management turnover as well, and there were several times where I found myself back at an “I want to move to the cloud, to save money” type of situation again. It was frustrating at those times having to try to justify things again and again. Which I can do. The problem is that the influence of this “Cult of the Cloud” really twists people’s minds up, especially vulnerable less technical minds, and I have seen them time and time again find it hard to accept the reality that IaaS public cloud is so expensive, even though cost has always been the easiest way to convince people to not use it. My biggest reasons for not using it include:
- Lack of control
  - In the data center I controlled everything from the power plugs on up. I controlled when something was restarted for updates(or not restarted). I knew basically every infrastructure change that happened, because I did 90% of them myself.
- Lack of insight
  - Full visibility and control into storage, networking, and servers, from individual CPU cores to individual disks in the SAN. I could identify and trace bottlenecks in moments to resolve them. In 2014, I added LogicMonitor which opened a whole new realm of visibility which I still maintain today.
- Poor availability
  - I went over a decade not having to rebuild a single system due to system failure.
  - Server lifetimes were measured in years, none of this “servers are cattle” crap. If one of my physical servers broke down, HP came on site to repair it. If a VM had some problem, it was easy to login to it to fix it.
  - Our datacenter ISP(Internap) had a 100% uptime SLA, and I had several years of experience using them at prior companies. From an IP standpoint I described them as the “#1 Datacenter ISP, there is no second place”. They had the best NOC in the world as well, staffed with very experienced and helpful people. They also had fancy unique BGP routing technology called MIRO, which I really wanted to leverage(handled transparently on their end).
- Lack of flexibility
  - Being able to use real actual load balancers like Citrix Netscaler (first time user at that point, prior to that I used F5 BigIP/LTM) with good features, good instrumentation, and the ability to assign dozens if not hundreds of IPs on them to route traffic for so many different things from one place, vs the absolute dog shit that is(even in 2025) AWS ELB by contrast(at one point while in the cloud we moved to a 3rd party load balancer(Zeus) with similar features to Netscaler, but the cloud limited it to a single IP, we had to rely on elastic IPs I think for fail over which were super slow, and it was far more expensive, I think I costed it out at about $60k/year for a pair of Zeus systems in AWS). Ironically I was told that at least for several years Amazon was a HUGE Citrix Netscaler customer, probably one of if not the biggest in the world.
  - Being able to provision virtual servers with any number of CPUs, memory and disk space. Later able to adjust all of those things in any increment up or down with minimal(in some cases no) impact. At one point a few years later I was told that Amazon was purchasing dozens if not hundreds of 3PAR storage arrays for internal use(I used 3PAR as well in my data center). Amazon was a HUGE user of HP XP storage arrays(Hitachi OEM) as well, at least in their earlier days. The XP arrays and the Hitachi versions they are OEM’d from were/are(?) widely considered to be the highest end, highest availability storage platform in the world(but Amazon got good pricing on them as you might expect).
  - Being able to spin up new servers for essentially $0 in most situations(since most systems don’t consume much resources on a consistent basis) really gives flexibility towards deployment models.
  - Being able to use thin provisioning on the storage system to over provision and grow stuff on demand without actually having to provision new stuff was very nice, technology that 3PAR was among the pioneers of more than a decade prior.
  - Being able to move live VMs from host to host, and automatically recover VMs from host failure(I remember the first host failure we had, the VMs were automatically restarted on another host and running by the time any alert had a chance to fire off).
  - Real static IPs, both internally and externally. These IPs never changed once.
- and on and on and on
For these reasons and more, operating in public cloud was a constant headache for me, never knowing when the next thing was going to fail, or the next weird hiccup, or strange bug, or whatever. There were so many annoying little issues, each one by itself not a deal breaker, but add them all up and it’s just one frustrating day after another.
- RDS (MySQL)
  - We had constant performance issues with RDS(this was before flash storage). I remember one phone call with AWS support regarding this and they commented on how many great IOPS we were getting(based on Cloudwatch data), which was a load of crap. I saved the screenshot from that day, you can see it here. Anyone with infrastructure background shouldn’t need more than 30 seconds to see the huge problems in those graphs.
  - Part of our app stack leveraged what I think were in memory tables in MySQL at the time, which didn’t play properly with replication, causing headaches with RDS(at times).
  - MySQL RDS replication would frequently break and in many cases the only recourse was to rebuild the slave instance entirely(we could not “repair” the broken replication because we had insufficient access rights. I think all we needed to do was “skip” one replication event to recover, but RDS didn’t allow us to do that at the time, unsure about now.)
  - At one point we had to stop using their RDS DNS addresses entirely and just use their IP addresses, which made it more difficult to fail over, but their DNS kept flaking out and not giving valid responses at random times, breaking shit.
- When we used ELB it was terrible as well, in part because one critical service we leveraged(that we sent data TO, and then they sent data BACK to us immediately after) cached DNS entries, which is a bad thing for ELB(and which the service provider wouldn’t fix).
- Server failures were alarmingly regular, as were forced reboots due to what I think were underlying security fixes. AWS would say you have to reboot before XYZ or we will reboot for you.
- I think this was only BIZ2, but we were a beta customer of what was called I think “Committed IOPS EBS”, or something like that, basically EBS that gave more performance depending on the size of disk or something. Had lots of issues, don’t remember specifics.
- Random shit just breaking for brief periods of time, signs that AWS was of course always messing with stuff in the background.
- All of these pains went away with the data center. Pains that nobody else in the organization feels/cares about because it’s on the Ops people to handle them. Which is why when I go against IaaS I always target the costs first, since that is easy for everyone to understand.
Fourteen years later
Where am I at today? Well I’ll give you just a hint. I can still login to my first HPE 3PAR 7450 All Flash Array, which was purchased in 2014(HPE did a case study at the time with me featured in it). That array has been running continuously since November 4, 2014 13:36:18 EST with only a single component failure(one SSD failed in Jan 2023). It still runs mission critical workloads today. The longest running SSDs on the system have a “wear level” of “84% life remaining” as of November 2025. My first 10G switches are still in service, and are approaching 5,100 days of service time, having first been powered on Dec 20, 2011. I do have replacements for them but haven’t had a chance to get on site in a couple of years, so that will almost certainly happen next year. My valid ID card for the data center has not changed since 2011.
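The service-time figures above are easy to sanity check with a little date arithmetic. A minimal sketch: the power-on dates come from the text, while the “as of” date of November 15, 2025 is just an assumed stand-in for “November 2025”.

```python
from datetime import date, timedelta


def days_in_service(powered_on, as_of):
    """Whole days between power-on and the as-of date."""
    return (as_of - powered_on).days


SWITCH_POWER_ON = date(2011, 12, 20)   # first 10G switches
ARRAY_POWER_ON = date(2014, 11, 4)     # 3PAR 7450 all flash array
AS_OF = date(2025, 11, 15)             # assumed date within "November 2025"

switch_days = days_in_service(SWITCH_POWER_ON, AS_OF)
array_days = days_in_service(ARRAY_POWER_ON, AS_OF)
milestone = SWITCH_POWER_ON + timedelta(days=5100)

print(f"switches: {switch_days} days in service")
print(f"array:    {array_days} days in service")
print(f"switches reach 5,100 days on {milestone.isoformat()}")
```

With that assumed date the switches come out at 5,079 days, and the 5,100 day mark lands on December 6, 2025, which lines up with “approaching 5,100 days”.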
I spent a bit over 8 hours writing the initial drafts of this page alone, this “Cult of the Cloud” thing is going to take more time than I expected. But at least on this topic I can sleep a bit better, knowing that I got the words out of my mind.