Enter the cloud – but how deep?

The new company is officially founded and has a name: DataMarket. You’ll have to stay tuned to hear about what we actually do, but the name provides a hint 😉

In the last couple of weeks I’ve spent quite some time thinking about the big picture of DataMarket’s technical setup, and it’s led me back to an old favorite subject: cloud computing.

Cloud computing perfectly fits DataMarket’s strategy of keeping the technical and business vision in-house while outsourcing everything else. In other words, focus on the things that will give us a competitive advantage, and leave everything else to those who know best how to do it. Setting up and managing hardware and operating systems is surely not what we’ll be best at, so that task is best left to someone else.

This could be accomplished simply by using a typical hosting or colocation service, but cloud computing gives us one important extra benefit: it allows us to scale to meet the wild success we’re expecting, yet stay economical if things grow in a somewhat milder manner.

That said, cloud computing will no doubt play a big role in DataMarket’s setup. But there are different flavors of “the cloud”. The next question is therefore – which flavor is right for this purpose?

I’ve been fortunate enough this year to hang out with people that are involved in two of the major cloud computing efforts out there: Force.com and Amazon Web Services (AWS). Additionally I’ve investigated Google AppEngine to some extent, and I saw a great presentation of Sun’s Network.com at Web 2.0 Expo.

These efforts can roughly be put in two categories:

  1. Infrastructure as a Service (IaaS): AWS and Network.com
  2. Platform as a Service (PaaS): Google AppEngine and Force.com

IaaS puts at your disposal the sheer power of thousands of servers that you can deploy, configure and set to crunch on whatever you throw at them. Granted, the two efforts mentioned above are quite different beasts and not really interchangeable for all tasks. Network.com is geared towards big calculation efforts, such as rendering of 3D movies or simulating nuclear chain reactions, whereas AWS is suitable for pretty much anything, but surely geared towards running web applications or web services of some sort.

PaaS gives you a more restricted set of tools to work with. They’ve selected the technologies you’re able to run and given you libraries to access underlying infrastructure assets. In the case of Force.com, your programming language is “Almost Java”, i.e. Java with proprietary limitations to what you’re able to do, but you also get a powerful API that allows you to make use of data and services available on Salesforce.com. AppEngine uses Python and allows you to run almost any Python library. In a similar fashion to Force.com, AppEngine gives you API access to many of Google’s great technological assets, such as the Datastore and the User service.

In short, the PaaS approach gives you less control over the technologies and details of how your application works, while it gives you things like scalability and data redundancy “for free”. The IaaS approach gives you a lot more control, but you have to think about lower levels of technical implementation than on PaaS. Example: on AWS you make an API call to fire up a new server instance and tell it what to do – on AppEngine, you don’t even know how many server instances you’re running, just that the platform will scale to make sure there are enough to meet your requirements.
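To make the difference concrete, here is a minimal sketch of the kind of scaling logic you end up owning on IaaS. Everything in it is an assumption made for illustration: `FakeCloud` and `launch_instance` are hypothetical stand-ins rather than the AWS API, and the capacity figure is invented. On a PaaS this code would simply not exist, because the platform makes the decision for you.

```python
import math


class FakeCloud:
    """Hypothetical stand-in for an IaaS provider; not a real AWS client."""

    def __init__(self):
        self.instances = []

    def launch_instance(self, role):
        # A real provider would boot a machine image here.
        instance_id = f"{role}-{len(self.instances) + 1}"
        self.instances.append(instance_id)
        return instance_id


def scale_to_load(cloud, requests_per_second, capacity_per_instance):
    """Launch instances until the expected load is covered.

    Only scales up; deciding when to shut instances down is yet
    another detail the application owner has to handle on IaaS.
    """
    needed = max(1, math.ceil(requests_per_second / capacity_per_instance))
    while len(cloud.instances) < needed:
        cloud.launch_instance("web")
    return len(cloud.instances)


if __name__ == "__main__":
    cloud = FakeCloud()
    print(scale_to_load(cloud, requests_per_second=250, capacity_per_instance=100))
```

The point is not the arithmetic but the ownership: with IaaS, monitoring load and reacting to it is your code and your responsibility.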

So, which flavor is the right one?

The thinking behind PaaS is the right one. It takes care of more of the technical details and leaves the developers to do what they’re best at – develop the software. However, there is a big catch. If you decide to go with one of these platforms, you’re pretty much stuck there forever. There is no (remotely simple) way to take your application off Force.com or AppEngine and start running it on another cloud or hosting service. You might find this OK, but what if the company you’re betting your future on becomes evil? Or changes its business strategy? Or doesn’t live up to your expectations? Or wants to acquire you? Or – worse yet – doesn’t want someone else to acquire you?

You don’t really have an alternative – and you’re betting your life’s work on this thing.

Sure. If I were writing a SalesForce application or something with a deep SalesForce integration, I’d go with Force.com. Likewise, if I were writing a pure web-only application, especially one with some connections to – say – Google Docs or Google Search, I’d be very tempted to use AppEngine. But neither is the case.

So the plan for DataMarket is to write an application that is ready to go on AWS (or another cloud computing service offering similar flexibility), but to stay as independent of their setup as possible. Not that I expect to have any reason to leave, but there is always the black swan, and when it appears you’d better have a plan. This line of thinking even makes me skeptical of using their – otherwise promising – SimpleDB, unless someone comes up with an abstraction layer that allows SimpleDB queries to be run against a traditional RDBMS or another data storage solution that can be set up outside the Amazon cloud.
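A rough sketch of what such an abstraction layer could look like, under stated assumptions: the `ItemStore` interface, both implementations and `register_dataset` are hypothetical names invented for this example, and nothing here talks to SimpleDB. The idea is only that application code depends on a small interface, so that a cloud-specific backend could later be swapped for a traditional database (SQLite here, from the Python standard library).

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class ItemStore(ABC):
    """Hypothetical storage interface that the application codes against."""

    @abstractmethod
    def put(self, item_id, attributes):
        """Store a dict of attributes under item_id."""

    @abstractmethod
    def get(self, item_id):
        """Return the stored dict, or None if item_id is unknown."""


class InMemoryStore(ItemStore):
    """Placeholder for a cloud-hosted backend; keeps items in a dict."""

    def __init__(self):
        self._items = {}

    def put(self, item_id, attributes):
        self._items[item_id] = dict(attributes)

    def get(self, item_id):
        item = self._items.get(item_id)
        return dict(item) if item is not None else None


class SqliteStore(ItemStore):
    """Same interface on top of a traditional relational database."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, item_id, attributes):
        # Attributes are serialized as JSON to mimic schemaless items.
        self._conn.execute(
            "INSERT OR REPLACE INTO items (id, body) VALUES (?, ?)",
            (item_id, json.dumps(attributes)),
        )
        self._conn.commit()

    def get(self, item_id):
        row = self._conn.execute(
            "SELECT body FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def register_dataset(store, dataset_id, title):
    """Application code: knows about ItemStore, not about any backend."""
    store.put(dataset_id, {"title": title})
```

The price is that the interface can only expose what every backend supports, so provider-specific query features are off the table unless each backend can emulate them.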

Yesterday, I raised this issue with Simone Brunozzi, AWS’ Evangelist for Europe. His – very political – answer was that this was a much requested feature and they “tended to implement those”, without giving any details or promises. I’ll be keeping an eye out for that one…

To sum up: when I’ve preached the merits of Software as a Service (think SalesForce or Google Docs) in the past, people have raised similar questions about trusting a 3rd party with their critical data or business services. To soothe them, I’ve used banks as an analogy: 150 years ago everybody knew that the best place to keep their money was under their mattress or, even better, in their personal safe. Gradually we learned that the money was actually safer in institutions that specialized in storing money, namely banks. What’s more, they paid you interest. The same is happening now with data. Until recently, everybody knew that the best place for their data was on their own equipment in the cleaning closet next to the CEO’s office. But now we have SaaS companies that specialize in storing our data, and they even pay interest in the form of valuable services that they provide on top of it.

So, your money is better off in the bank than under your mattress, and your data is better off at the SaaS provider than on your own equipment. Importantly, however, at the bank you have to trust that your money is always available for withdrawal at your convenience. In the same way, your data must be “available for withdrawal” at the SaaS provider, meaning that you can download it and use it in a meaningful way outside the service it was created on. That’s why data portability is such an important issue.
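In code terms, “available for withdrawal” boils down to a round trip through an open format. The sketch below is only an illustration with made-up records and hypothetical helper names (`export_csv`, `import_csv`); it uses CSV from the Python standard library, but any widely readable format would make the same point.

```python
import csv
import io


def export_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text that any tool can read."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def import_csv(text):
    """Load CSV text back into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


if __name__ == "__main__":
    # Made-up sample records, purely for illustration.
    records = [
        {"account": "alice", "balance": "100"},
        {"account": "bob", "balance": "250"},
    ]
    withdrawn = export_csv(records, ["account", "balance"])
    # The data survives outside the service it was created on.
    print(import_csv(withdrawn) == records)
```

A useful test of any provider is whether an export like this exists and whether what comes out is complete enough to rebuild the data elsewhere.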

So the verdict on the most suitable cloud service for DataMarket: my application is best off at AWS, as long as it’s written in a way that lets me “withdraw” it and set it up elsewhere. Using AppEngine or Force.com would be more like depositing my money in a bank that promptly changes all of it to a currency that nobody else accepts.

I doubt that such a bank would do well as a business!

10 comments

  1. Excellent post, looking forward to see where you’re heading with DataMarket 🙂

    You mentioned SimpleDB and abstraction layers. It so happens that someone implemented JPA for SimpleDB, I think it was called SimpleJPA, which implemented a subset of the JPA query standard. That might make you feel more platform neutral 🙂 Don’t know how far along/stable/reliable/… it is though.

  2. (Disclaimer: I work for Google on one of the teams running the “infrastructure” underlying, amongst other things, Appengine)

    Regarding portability, I’m pretty sure people have gotten the Appengine development environment to run on 3rd party servers, essentially allowing you to host your Appengine apps yourself.

    In fact, here’s an article about running Appengine apps on Amazon’s EC2: http://waxy.org/2008/04/exclusive_google_app_engine_ported_to_amazons_ec2/

    It’s a hack, but it arguably makes Appengine the only platform you can develop for which already has two live clouds that can run your apps. 😉

    I don’t know if Google plans to open source the SDK or not, but even if they don’t, odds are an open-source implementation of the stack will materialize sooner rather than later.

    So… I suspect you may be over-estimating the degree of actual lock-in. Appengine is cool. 🙂

  3. Heh, I should have read the whole article I linked before posting. It reports that the Appengine SDK is already open source and that Google is addressing your data lock-in concerns. So yeah. I like what my employer is doing in this space! 🙂

  4. I’ve been looking into cloud computing for a customer of mine, specifically Google AppEngine, and I like what I see so far.

    I have to admit that I haven’t gone deep into other clouds yet, but the scalability and ease of use of Google AppEngine impress me a lot.

    As Bjarni Rúnar points out, the data lock-in concerns are being addressed (in a way) and it is possible to run the environment (using a hack) in another cloud, making it very flexible as a PaaS.

    Anyway, I’m excited to hear more about Datamarket 🙂

  5. Thanks for the pointers guys.

    Don’t get me wrong, AppEngine is cool and considering the new info you point to, the actual lock-in may not be as bad as I thought. The AppEngine-on-EC2 is however still a hack, and as such, its future compatibility is clearly not a given.

    In my particular case I’ve found libraries for Java that are likely to save DataMarket years of work, and I have not been able to find similar things for Python. So, the freedom to select technologies that comes with IaaS in my case trumps the benefits of available PaaS offerings (mainly “effortless” scalability).

    Nevertheless: for many web applications, especially those that can benefit from hooking up with Google services or using the assets Google makes available via the APIs, AppEngine is a great platform and one to seriously consider.

  6. Choosing the correct platform is always a matter of weighing the pros and cons. In my case the Django framework and other available Python libraries do the trick 🙂

    Betting on Java is in my mind a pretty sure bet, as the language is mostly platform independent and is likely to be supported by more clouds in the future, easing the fears of lock-in and other platform provider issues. I think you’re on the right track with Datamarket.

  7. Like so many times before you’re spot-on, Hjalli. At least I hope you are because I agree with you 🙂 – in particular about Google and AWS. I’m somewhat more skeptical about force.com since they seem to be on a semi-proprietary island.
    Congrats about the Newco and the ultra-focused attitude and leaving the rest to the best (maybe I should get that registered if it’s not taken). We already miss you at the Mothership.

  8. Sveinn: I think Force.com makes sense if what you’re trying to do is closely connected with things in SalesForce, but not as a general platform.

    Focusing on core competitive advantage and “leaving the rest to the best” is a great way to explain DataMarket’s team/partner strategy. Even better than anything I’d come up with myself.

  9. IMHO, there is one more interesting route to portability: create a high-level DSL and express the needed business logic, etc. using that DSL. In this case, moving to another platform only requires reworking the language implementation, which often forms a relatively thin layer under the rest of the project’s functionality. With a proper DSL, it is even possible to switch from one platform-specific language to another.

    In addition to portability, using DSLs has other attractive features, such as increased productivity or the ability to analyze code automatically for parallelism, etc.

    Here is an example of applying the DSL approach to some real-world problems:

    http://research.microsoft.com/~simonpj/Papers/financial-contracts/contracts-icfp.ps.gz

    DSLs also have their risks, of course. The main one is that the DSL designer must choose the right set of primitives to avoid “abstraction leakage” for as long as possible.
