The debate about the validity of internal cloud implementations has raged on for some time now, with some claiming that cloud computing and wholly owned infrastructure don’t mix, and others pointing out that applying “on demand,” “at scale,” and “multitenant” to enterprise IT data centers offers unique advantages to those who have already made that investment. It has been difficult, however, to compare the two approaches objectively, until now.

The announcement on Thursday of Amazon’s new Hadoop-based Elastic MapReduce service, combined with the introduction of a commercial Hadoop distribution from start-up Cloudera, means that we finally have a reasonable means of watching which direction enterprise IT prefers. Let me explain.

Amazon’s service is a simplified, prepackaged Hadoop implementation that can be leveraged by anyone with an Amazon account. The Amazon Web Services blog describes it as follows:

Today we are rolling out Amazon Elastic MapReduce. Using Elastic MapReduce, you can create, run, monitor, and control Hadoop jobs with point-and-click ease. You don’t have to go out and buy scads of hardware. You don’t have to rack it, network it, or administer it. You don’t have to worry about running out of resources or sharing them with other members of your organization. You don’t have to monitor it, tune it, or spend time upgrading the system or application software on it. You can run world-scale jobs anytime you would like, while remaining focused on your results.

Note that I said jobs (plural), not job. Subject to the number of EC2 (Elastic Compute Cloud) instances you are allowed to run, you can start up any number of MapReduce jobs in parallel. You can always request an additional allocation of EC2 instances here.

Processing in Elastic MapReduce is centered around the concept of a Job Flow. Each Job Flow can contain one or more steps.
Each step inhales a bunch of data from Amazon S3, distributes it to a specified number of EC2 instances running Hadoop (spinning up the instances if necessary), does all of the work, and then writes the results back to S3. Each step must reference application-specific “mapper” and/or “reducer” code (Java JARs or scripting code for use via the Streaming model). We’ve also included the Aggregate Package with built-in support for a number of common operations such as Sum, Min, Max, Histogram, and Count. You can get a lot done before you even start to write code!

Cloudera, on the other hand, provides a Hadoop build that you can deploy wherever you wish:

Cloudera’s Distribution for Hadoop is based on the most recent stable version of Apache Hadoop. It includes some useful patches back-ported from future releases, as well as improvements we have developed for our support customers. Cloudera’s Distribution includes everything you need to configure and deploy Hadoop using standard Linux system administration tools.

Here’s what I’m thinking: enterprise IT is looking at an entirely new class of applications that take advantage of MapReduce to process very large sets of both structured and unstructured data for things like predictive analysis, sorting/sequencing, and data mining. Both commercial Hadoop offerings meet the demand for a platform to simplify the development and operation of these applications. The primary difference is the where, not so much the what. That is exactly what will make the competition between the two offerings so compelling to watch.

Let me break it down for you:

Will the requirement to own and operate hardware work against Cloudera? What makes the Amazon offering so groundbreaking (and it will prove to be historic,
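
For readers who have not seen the Streaming model mentioned in the Amazon description, here is a minimal sketch of what “mapper” and “reducer” scripting code can look like, using a word count as the example. It assumes the usual streaming convention of tab-separated key/value text records, and it simulates the map, sort, and reduce phases in a single Python process rather than on a cluster. The function names (`mapper`, `reducer`, `run_local`) and the sample input are hypothetical and chosen only for illustration; nothing here is taken from Amazon’s or Cloudera’s code.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record per word, as a streaming mapper script might."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each key. Input records must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines: Iterable[str]) -> List[str]:
    """Hypothetical local stand-in for one step: map, sort (the shuffle), reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Assumed sample input; on a real cluster the records would come from storage.
    sample = ["the cloud is the data center", "the data is big"]
    for record in run_local(sample):
        print(record)
```

In a real streaming job the mapper and reducer would typically be separate scripts reading standard input and writing standard output, with the framework handling the sort between them. A simple sum like this is also the kind of operation the Aggregate Package is described as covering; the hand-written reducer is here only to show the shape of the code.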



Excellent, detailed post. To your point about enterprise data centers, I think the “tipping point” in bringing the value of MapReduce to traditional companies will come from providing enterprise-class database features such as security, tight SQL integration, and more WITH MapReduce.
Aster Data offers the nCluster database on the Amazon cloud (as well as on-premises), combining the functionality of a database to manage the data with the power of SQL and MapReduce to query it. Customers can store, manage, query, and mine data in one platform, realizing the full benefits of Amazon’s Cloud, MapReduce, and standard SQL in a tightly coupled way.