Notes: Hadoop Summit '09
My notes on the Hadoop Summit '09 conference in Santa Clara this year. The conference was very energetic, and Hadoop's momentum is undeniable.
- The talk "The Growing Hadoop Community" was an eye-opening experience. Given the buzz around Hadoop, I had figured it was much more mature than it actually is.
- In 2007, there were only 3 companies listed on the "powered by" page. In 2008, Facebook was added, as well as ~20 other big names.
- ANSI-compliant SQL is coming for Pig in the next month or two (!)
- Oozie (job manager similar to Azkaban) is not impressive.
- Sun is coming out with an AWS clone (and cloned pricing model! ... )
- IBM is trying to be relevant with a strange project called M2 that uses Hadoop to crawl/visualize data sets.
- Cloudera is really trying to do a lot for the Hadoop community. They were very conspicuous at the conference.
- Cloudera + EBS AWS image is available (and apparently pretty fast compared to S3)
- Highlighted Sqoop (SQL -> Hadoop) app
- Big focus on binary packages for projects
- Three presentations: 2 in Administration track, 1 in Intro. They seem to recognize that friction is coming from ops teams. Attempting to show how to integrate it, or bypass ops (cloud hosting).
- There seems to be a war brewing between Hive and Pig. They really don't seem to get along too well.
- Yahoo! is releasing their own Hadoop binary packages (like Cloudera).
- No one really seemed to be able to give me a good answer as to what Cascading is and how it fits in with workflow/job managers. It's also GPL'd, which means a few people are not too happy with it.
- EC2 + S3 for Hadoop is really a great solution.
- EMR is interesting, but a bit limiting for our use case (it can only run workflows of Hadoop jobs). It is also hard to integrate with our staging (non-EMR) cluster.
- People are annoyed with the 1-hour block pricing: the minimum billing unit is one hour.
- Persisting data from EC2 can be done in two ways: (1) Save to S3 or (2) Save to EBS. When new nodes are brought up, they just pull data from S3/EBS, and begin running.
- Persisting data with Aster involves the extra step of loading the data onto a temporary node and running ncluster_loader to insert it into your cluster.
- ShareThis (Paco) was very prominent. They leverage EC2 for just about everything (including running Aster in the cloud).
- HBase is focusing on performance now, and seeing pretty big gains. I heard the question "Can I issue Pig/SQL queries on HBase?" 3 times, and also saw it on the Pig mailing list.
- The talk on future-proofing MapReduce was basically a Yahoo! VP explaining why it was a good thing that he broke all of our Hadoop jobs (the 0.20 changes).
- Greenplum was never mentioned at ScaleCamp or Hadoop Summit (that I heard). Aster was acknowledged by the Hive/Facebook team as another solution (although Netezza was as well ... ).
Oozie
Katta
Summit Papers/Presentations
Cascading blog
