Notes: Hadoop Summit '09
These are my notes from the Hadoop Summit '09 conference in Santa Clara this year. The conference was very energetic, and Hadoop's momentum is undeniable.
  • The talk "The Growing Hadoop Community" was an eye-opening experience. Given the buzz around Hadoop, I had figured it was much more mature than it actually is.
    • In 2007, there were only 3 companies listed on the "powered by" page. In 2008, Facebook was added, as well as ~20 other big names.
  • ANSI-compliant SQL is coming for Pig in the next month or two (!)
  • Oozie (job manager similar to Azkaban) is not impressive.
  • Sun is coming out with an AWS clone (and cloned pricing model! ... )
  • IBM is trying to be relevant with a strange project called M2 that uses Hadoop to crawl/visualize data sets.
  • Cloudera is really trying to do a lot for the Hadoop community. They were very conspicuous at the conference.
    • A Cloudera + EBS AWS image is available (and apparently pretty fast compared to S3).
    • They highlighted their Sqoop (SQL -> Hadoop) app.
    • Big focus on binary packages for projects.
    • Three presentations: 2 in the Administration track, 1 in Intro. They seem to recognize that friction is coming from ops teams, and are trying to show either how to integrate Hadoop with ops or how to bypass ops entirely (cloud hosting).
  • There seems to be a war brewing between Hive and Pig. They really don't seem to get along too well.
  • Yahoo! is releasing their own Hadoop binary packages (like Cloudera).
  • No one really seemed to be able to give me a good answer as to what Cascading is and how it fits in with workflow/job managers. It's also GPL'd, which means a few people are not too happy with it.
  • EC2 + S3 for Hadoop is really a great solution.
    • EMR (Elastic MapReduce) is interesting, but a bit limiting for our use case (it can only run workflows of Hadoop jobs). It is also hard to integrate with our staging (non-EMR) cluster.
    • People are annoyed with the one-hour block pricing: the minimum billing unit is a full hour.
    • Persisting data from EC2 can be done in two ways: (1) save to S3 or (2) save to EBS. When new nodes are brought up, they just pull data from S3/EBS and begin running.
    • Persisting data with Aster involves the extra step of loading data onto a temporary node and running ncluster_loader to insert the data into your cluster.
    • ShareThis (Paco) was very prominent. They leverage EC2 for just about everything (including running Aster in the cloud).
  • HBase is focusing on performance now, and seeing pretty big gains. I heard the question "Can I issue Pig/SQL queries on HBase?" 3 times, and also saw it on the Pig mailing list.
  • The talk on future-proofing MapReduce was basically a Yahoo! VP explaining why it was a good thing that he broke all of our Hadoop jobs (the 0.20 changes).
  • Greenplum was never mentioned at ScaleCamp or Hadoop Summit (that I heard). Aster was acknowledged by the Hive/Facebook team as another solution (although Netezza was as well ... ).
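To make the one-hour block pricing complaint from the EC2 notes above concrete, here is a minimal sketch of the arithmetic. It assumes each node is billed separately and that any started hour is charged in full. The function names and the hourly rate are hypothetical, made up for illustration; they are not actual AWS prices or APIs.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Round one node's runtime up to whole hours (assumption: a started hour is billed in full)."""
    if runtime_minutes <= 0:
        return 0
    return math.ceil(runtime_minutes / 60)


def job_cost(nodes: int, runtime_minutes: float, hourly_rate: float) -> float:
    """Cost of a job where every node runs for the same time and is billed on its own."""
    return nodes * billed_hours(runtime_minutes) * hourly_rate


if __name__ == "__main__":
    rate = 0.10  # hypothetical price per node-hour, not a real AWS rate
    # A 10-minute job on 20 nodes is billed as 20 full node-hours.
    print(job_cost(20, 10, rate))
    # Running one minute past the hour doubles the bill.
    print(job_cost(20, 60, rate), job_cost(20, 61, rate))
```

Under these assumptions, short jobs on many nodes pay for mostly idle time, which is presumably why the pricing came up so often.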
LINKS
Oozie
Katta
Summit Papers/Presentations
Cascading