Introduction
I've been playing around with Hadoop for the last fortnight to see how it performs with our weblog data processing jobs (Apache access logs). Right now we're using a blink-and-it-breaks system: a bunch of custom Perl scripts for log processing and MySQL for data storage, all running on a single server-class machine. It takes about 3-4 hours to extract, transform, and load (into MySQL) one day of weblogs (this includes identifying sessions based on a 30min timeout), and another 1-2 hours to aggregate data and generate various CSV reports.
So, a total of 6-7 hours, when it's not crashing.
With Hadoop + Hive I've brought the first phase down from 3-4 hours to under an hour! Anywhere between 15 and 45 minutes to copy log files into a raw Hive table (depending upon how much replication you've configured your cluster to have), and after that, 12 minutes to extract, transform, and load into a new table (including session identification).
I gave Cloudbase a shot before Hive. Cloudbase was *dead* easy to set up and get off the ground (I love their simple approach to UDFs & UDTs), but its performance was not as good as Hive's. On the other hand, I spent quite a lot of time trying to even import my custom log formats into Hive (I'm looking at you, GenericUDF). But once I got started, man, was I blown away! Take a look at my Twitter updates on Cloudbase and Hive comparisons.
Cluster setup: I managed to corner 4 desktop machines in the office for my experimental cluster. They are Core-2 Duo 2.4 GHz machines with 1GB of RAM and standard SATA hard disks. All four are connected to each other using a switch.
Setting up the Hive tables
Here's how I've set up my Hive tables:
The following table (raw) stores the raw, unprocessed log files. I've partitioned the table by date so that the ETL operations can be run on bite-sized chunks consisting of a day's worth of data.
create table raw(line string) partitioned by(dt string)
row format delimited fields terminated by '\t' lines terminated by '\n';
The following table (streamed_hits) contains the processed & filtered hits, partitioned by date. Within each partition the rows are clustered by Apache user-id (aid), and within each cluster they are again sorted by page view time (ts). According to the Hive Wiki, this clustering & sorting can improve efficiency in certain queries, although it does add time to the ETL phase. I'm really not sure what the buckets do, and 1,000 is just a number I pulled out of thin air.
create table streamed_hits(ip_address string, aid string, uid string,
ts string, method string, uri string,
response string, session_id string,
session_start string, pv_number int,
clickstream string)
partitioned by(dt string)
clustered by (aid) sorted by (aid, ts) into 1000 buckets
row format delimited fields terminated by '\t' lines terminated by '\n';
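One thing I've since gathered the buckets do buy you is efficient sampling: because the table is clustered into 1,000 buckets on aid, Hive can satisfy a TABLESAMPLE query by scanning just the chosen buckets. A hedged sketch (the table and bucket count come from the DDL above; the query is standard Hive TABLESAMPLE syntax, not something from my production reports):

```sql
-- Look at roughly 1/1000th of the users by reading a single bucket,
-- rather than sampling after a full table scan.
select aid, count(*) as page_views
from streamed_hits tablesample(bucket 1 out of 1000 on aid)
where dt = '2009-06-30'
group by aid;
```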
Importing the raw data
Here's how one imports data into the raw table. The import process is a simple HDFS file copy operation[1]. The more replication you have, the more nodes the data needs to be copied to, and hence the more time this operation takes.
load data local inpath '/weblogs/20090602-access.log'
into table raw partition(dt='2009-06-02');
load data local inpath '/weblogs/20090603-access.log'
into table raw partition(dt='2009-06-03');
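Each load creates a new date partition, which you can sanity-check before kicking off the ETL. Something like this (standard Hive commands, using the table from above):

```sql
-- List the date partitions loaded so far
show partitions raw;

-- Spot-check a few raw lines from one day, without touching other partitions
select line from raw where dt = '2009-06-02' limit 5;
```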
Processing the raw data (ETL)
Here's the Hive query for the ETL operation. I'm making sure that uninteresting rows are discarded upfront, using the WHERE clause in the inner query. I had earlier put this in the outer query, which was causing LOTS of uninteresting rows to go through the parsing logic. That (in hindsight) stupid placement of the WHERE clause was bumping the completion time of this query from 15 minutes up to 2+ hours!
I'm using a custom map script to read my log files, which are in a 'non-standard' format. There is another way to do this (UDFs & SerDes), but according to the discussion I had on the mailing list, these features are not fit for newbie consumption yet. The DISTRIBUTE BY & SORT BY is essential for my session & clickstream identification script to work (thanks to the Hive Wiki).
from
(from raw
select transform line using 'parse_logs.pl' as ip_address, aid, uid, ts, method, uri,
response, referer, user_agent, cookies, ptime
where lower(line) rlike '^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] "(.*?)" (\\d+) (\\d+|-) "(.*?)" ".*?(mozilla|msie|opera).*?".*'
and not line rlike '^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] "(\\S+) (/images.*?|/styles.*?|/javascripts.*?|/adserver.*?|.*?favicon.*?) (\\S+)".*'
and dt='2009-06-30'
distribute by aid sort by aid, ts asc) parsed
insert overwrite table streamed_hits partition(dt='2009-06-30')
select transform parsed.ip_address, parsed.aid, parsed.uid, parsed.ts, parsed.method, parsed.uri,
parsed.response, parsed.referer, parsed.user_agent, parsed.cookies, parsed.ptime
using 'identify_sessions_and_clickstream.pl' as ip_address, aid, uid, ts, method, uri,
response, session_id, session_start, pv_number, clickstream;
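Once streamed_hits is populated, the report-generation phase boils down to plain aggregate queries over the processed table. As a hedged sketch (these are the kind of metrics the CSV reports mentioned earlier would need, not the exact production queries):

```sql
-- Sessions and page views for one day's partition
select dt,
       count(distinct session_id) as sessions,
       count(*) as page_views
from streamed_hits
where dt = '2009-06-30'
group by dt;
```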
Custom map/reduce scripts used
Here's the parse_logs.pl Perl script. Please excuse my Perl - I'm not a Perl programmer; I was just trying to reuse the funky regular expressions that we had lying around in our current log processing system. You can plug any script that reads/writes standard input/output into Hive.
#!/usr/bin/perl
my %monthNum=(
"Jan" => 1,
"Feb" => 2,
"Mar" => 3,
"Apr" => 4,
"May" => 5,
"Jun" => 6,
"Jul" => 7,
"Aug" => 8,
"Sep" => 9,
"Oct" => 10,
"Nov" => 11,
"Dec" => 12,
);
while (defined($line = <STDIN>)) {
if (($host,$user,$apache,$rfc931,$method, $url, $ver, $status,$size,$referrer,$agent,$cookies,$ptime) = $line =~ m/^(\S+) (\S+) (\S+) \[(\S+ \S+)\] "(\S+) (.*?) HTTP\/([0-9\.]*)" (\d+) (\d+|-) "([^"]*)" "([^"]*)" "([^"]*)" (\d+|-)$/) {
# everything found -- nothing to be done
}elsif (($host,$user,$apache,$rfc931, $method, $url, $ver, $status,$size,$referrer,$agent,$cookies) = $line =~ m/^(\S+) (\S+) (\S+) \[(\S+ \S+)\] "(\S+) (.*?) HTTP\/([0-9\.]*)" (\d+) (\d+|-) "([^"]*)" "([^"]*)" "([^"]*)"$/) {
$ptime="";
}
$ts="";
# Converting date to yyyy-MM-dd hh:mm:ss
if (($day, $monthname, $year, $hour, $minute, $sec)= $rfc931 =~/^(\d{2})[\/-]([^\/-]+)[\/-](\d{4}):(\d{2}):(\d{2}):(\d{2})/) {
$month=$monthNum{$monthname};
$ts=sprintf("%04d-%02d-%02d %02d:%02d:%02d", $year, $month, $day, $hour, $minute, $sec);
}
$agent=lc($agent);
$user=lc($user);
print "$host\t$apache\t$user\t$ts\t$method\t$url\t$status\t$referrer\t$agent\t$cookies\t$ptime\n"
}
Here's the identify_sessions_and_clickstream.pl Perl script. Again, same disclaimer about my Perl as above. Also, please excuse the extremely naive way in which I'm trying to construct the clickstream. That's a work in progress.
This script depends on the fact that the incoming data is sorted on Apache user-id and timestamp (that's what the DISTRIBUTE BY & SORT BY in the Hive query achieve). Apart from identifying 60min sessions I'm cleaning up the URLs to make them more 'generic' (removing query strings, removing trip IDs etc.)
#!/usr/bin/perl
use Date::Parse;
$session_duration=60*60; # in seconds
$prev_apache=undef;
$prev_ts=undef;
$pv_number=undef;
$session_id=undef;
$session_num=undef;
$session_start=undef;
$clickstream='';
while (defined($line = <STDIN>)) {
chomp($line);
($ip_address, $apache, $user, $ts, $method, $url, $status, $referrer, $agent, $cookies, $ptime) = split(/\t/, $line);
# debug: print "$ip_address, $apache, $user, $ts, $method, $url, $status, $referrer, $agent, $cookies, $ptime\n";
$url =~ s/(.*?)\?.*/$1/i;
$url =~ s/^\/(trains|flights)\/itinerary\/.*?\/(.*?)/\/$1\/itinerary\/itinerary-id\/$2/;
$url =~ s/^\/(trains|flights)\/itinerary\/(\d+)$/\/$1\/itinerary\/itinerary-id/;
$url =~ s/^\/(activate|reactivate|reset)\/.*/\/$1/;
$url =~ s/^\/share.*/\/share/;
$url =~ s/^\/account\/trips\/.*/\/account\/trips\/trip-id/;
$url =~ s/^\/trains\/stations\/[0-9]*/\/trains\/stations\/numeric-id/;
$url =~ s/^\/trains\/stations\/[A-Za-z]*/\/trains\/stations\/alphanumeric-id/;
$url =~ s/^\/hotels\/info\/.*/\/hotels\/info\/hotel-id/;
$url =~ s/^\/places\/hotels\/.*\/images.*/\/places\/hotels\/images/;
$url =~ s/^\/places\/hotels\/images.*/\/places\/hotels\/images/;
$url =~ s/^\/newsletters\/images\/.*/\/newsletters\/images/;
$url =~ s/^\/index.*/\//;
$url =~ s/\/$//; # remove trailing slashes
if(!defined($prev_apache) || $prev_apache ne $apache){
if($apache eq '-' or $apache eq '') {
# TODO -- use IP address & user_agent to identify sessions?
$pv_number='';
$session_start='';
$session_id='';
$clickstream='';
} else {
$pv_number=1;
$session_start=$ts;
$session_num=1;
$session_id="$apache|$session_start";
$clickstream="$url"
}
} elsif($prev_apache eq $apache) {
if((str2time($ts)-str2time($prev_ts)) <= $session_duration) {
# same session -- bump the page view counter and extend the clickstream
$pv_number=$pv_number+1;
# cap the clickstream at 70 pages to keep the string manageable
if($pv_number<70) {
$clickstream="$clickstream|$url";
}
} else {
# gap longer than the session duration -- start a new session
$pv_number=1;
$session_start=$ts;
$session_num=$session_num+1;
$session_id="$apache|$session_start";
$clickstream="$url";
}
}
$prev_apache=$apache;
$prev_ts=$ts;
print "$ip_address\t$apache\t$user\t$ts\t$method\t$url\t$status\t$session_id\t$session_start\t$pv_number\t$clickstream\n";
}
[1] A small warning here, if your Hive session is running on one of your active nodes: because of HDFS's default block-placement policy (the first replica is written to the local node), your Hive node's disk is going to fill up first. This might lead to a disk space problem if your HDFS partition is being used by other processes as well. I ran into this and had to run the HDFS balancer, which, btw, is extremely slow (even after increasing the dfs.balance.bandwidthPerSec property)!
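For reference, the balancer throttle I bumped lives in hdfs-site.xml. A sketch (the 10 MB/s figure here is just an example value I'm picking for illustration; the default in that era was 1 MB/s):

```xml
<!-- hdfs-site.xml: let the balancer move blocks faster.
     Value is in bytes per second; 10485760 = 10 MB/s (example value;
     the default was 1 MB/s). Datanodes must be restarted for it to
     take effect, after which you run: hadoop balancer -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
```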