Yash Ranadive

Data Engineer at Lookout Mobile Security

How to quickly find the depth of the deepest file in a directory tree

If for whatever reason you want to find how deep a directory tree goes in your software projects, simply run this nifty little one-liner:

find . | grep -v "\.git" | awk '{print gsub(/\//,"")}' - | sort -rn | head -1

This first runs find on the current directory, filters out .git files, counts the number of “/” per line, sorts the counts numerically in descending order, and finally shows the greatest depth. (Note the numeric sort, -rn: a plain reverse sort is lexicographic, so a depth of 9 would wrongly beat a depth of 10.)
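
As a quick sanity check, here is the one-liner run against a throwaway directory tree (the path names below are made up):

    mkdir -p deep/a/b/c/d && cd deep
    find . | grep -v "\.git" | awk '{print gsub(/\//,"")}' - | sort -rn | head -1
    # prints 4: one "/" per directory level below the current directory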

View →


msck repair table for custom partition names

msck repair table is used to add partitions that exist in HDFS but not in the Hive metastore.
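
For reference, the command itself is a one-liner (table name is hypothetical):

    hive -e "MSCK REPAIR TABLE test;"   # scans the table's HDFS location and registers missing partitions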

However, it expects the partitioned field name to be included in the folder structure:
year=2015
|
|_month=3
|
|_day=5

Notice that each directory name is prefixed with the partition column name. This is necessary: msck repair table won’t work if you have data in the following directory structure:
2015
|
|_3
|
|_5

This is kind of a pain. The only solution is to use alter table add partition with an explicit location.

ALTER TABLE test ADD PARTITION (year=2015,month=03,day=05) location 'hdfs:///cool/folder/with/data';
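
If you have many such partitions to register, a small shell loop saves some typing. A minimal sketch, assuming the table and HDFS layout from the examples above (table name and paths are hypothetical):

    # Backfill the day partitions for March 2015 from directories
    # that lack the key=value prefix
    for day in $(seq 1 31); do
      hive -e "ALTER TABLE test ADD IF NOT EXISTS PARTITION (year=2015, month=3, day=$day)
               LOCATION 'hdfs:///cool/folder/with/data/2015/3/$day';"
    done

Starting a fresh hive session per partition is slow; for a large backfill you could concatenate the ALTER TABLE statements and run them in a single hive -e call.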

View →


Hive and Hadoop Command Snippet search

Why?

I’ve found myself looking up the “exact” syntax for DML / DDL in Hive countless times. Also, I tend to forget the list of date functions and their parameters. I would use a combination of Google Search and/or a cheat sheet for these. These don’t work very well for me, for multiple reasons (I’ll cover those in a separate post if enough people are interested). I wanted a no-frills snippet search tool but couldn’t find a good one for Hadoop and Hive. So I built my own.

Try it Out

If you’re interested, you can access the tool at www.greppage.com. The UI is very basic and I’d appreciate your feedback.

View →


First Experiences with Scalding

Recently, I’ve been evaluating Scalding to replace some parts of our ETL. Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. You specify your jobs in clean, expressive Scala syntax, and out comes MapReduce that runs on your Hadoop cluster.

Options for Cluster Processing

There are several options for running a custom processing task on a Hadoop cluster without actually writing Java MapReduce code. The major ones are Pig, Hive, Scalding, and Spark (I’m sure I’m missing some that you may think are significant). All the options except Spark work by letting you write in an expressive, easy-to-use DSL which later gets compiled to Java MapReduce. Spark has its own engine to run workloads over the cluster and is gaining massive popularity. However...

Continue reading →


Hive doesn’t like the carriage return character

Have you ever run into a situation where you count the number of rows for a table in a database, dump it to CSV, and then load it into Hive only to find that the number has changed? Well, you probably have carriage returns in your fields. Hive treats a carriage return like a newline, which means end of row. Here’s a link I found that describes it:

http://grokbase.com/t/hive/user/111v7jva3f/newlines-in-data

You have to manually clean the \r characters from the file. One option is the Unix command tr (translate characters):

cat yourfile | tr -d "\r" > newfile
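
To confirm that carriage returns really are the culprit, a quick check before and after cleaning (file names as in the example above):

    grep -c $'\r' yourfile            # number of lines containing a carriage return
    tr -d '\r' < yourfile > newfile
    grep -c $'\r' newfile             # should now print 0
    wc -l yourfile newfile            # newline counts are unchanged; only the \r bytes are gone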

View →


Few Thoughts about Learning

It is funny how we have so much information available to us but nobody teaches us how to learn. In college, I struggled with processing vast amounts of information. I would read an article/paper/concept and comprehend only some part of it. I’d later feel guilty for not knowing the rest. Looking back, my biggest mistake was learning Java by reading a book. I remember being haplessly confused and dumbfounded as I read books that contained lines and lines of programs. As time went by I became more open to the idea of partially understanding a text without worrying too much about wholly understanding it. With large amounts of information, the challenge is that you hit sentences containing concepts or words of which you have no prior understanding.

Students should be taught early in their lives how to read complex texts without having to worry about comprehending everything. This will...

Continue reading →


Removing Database Level Locks in HIVE

Recently we started noticing “CREATE TABLE” statements taking incredibly long to execute, after which they’d fail. A more detailed look into the issue revealed that we had upgraded Hive, and the new version, which now supports ACID, would lock the database by default even with ACID support turned off. So basically, while a SELECT or INSERT was running in Hive, Hive created a ZooKeeper SHARED lock on the entire database that contained those tables.

I did some digging through the code and found this:
https://github.com/apache/hive/blob/68bc618bf0b1fd3839c3c52c2103b58719b3cb81/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java#L166 Notice the lock on the whole database.
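
You can watch these locks being taken from another session with Hive’s SHOW LOCKS command (a minimal check, assuming the ZooKeeper lock manager described above is active):

    # While a SELECT or INSERT runs elsewhere:
    hive -e "SHOW LOCKS;"          # lists current locks, including the database-level SHARED lock
    hive -e "SHOW LOCKS mytable;"  # hypothetical table name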

To solve this problem, this link recommends turning off locking altogether:
http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035$3501e4f0$9f05aed0$@com%3E That was not...

Continue reading →


Creating Presentations with Reveal.js

Late last year, I gave a talk at the Sift Science office in San Francisco on “Hadoop at Lookout - how Lookout uses the Hadoop infrastructure to power internal analytics”. I used Reveal.js to present the talk in my browser! Reveal.js is an HTML presentation framework that uses JavaScript and plain HTML to create beautiful slides. I tried the free version on GitHub, where you create your slides by writing HTML or Markdown. There’s also an online editor, which I have not tried.

HTML Slides, What

You may ask, “Well, do we really need HTML slides when we have mature tools like PowerPoint or Keynote?” Having a presentation in HTML allows you to quickly share it with the world by simply hosting it. Desktop solutions like PowerPoint or Keynote require you to upload your file to an online folder or to a presentation viewer like SlideShare. Additionally, Reveal.js...

Continue reading →


Setting up Camus - LinkedIn’s Kafka to HDFS pipeline

A few days ago I started tinkering with Camus to evaluate its use for dumping raw data from Kafka => HDFS. This blog post will cover my experience and first impressions with setting up a Camus pipeline. Overall, I found Camus easy to build and deploy.

What is Camus

Camus is LinkedIn’s open source project that can dump raw/processed data from Kafka to HDFS. It does this via a MapReduce job which, when kicked off, can:

  • Manage its own offsets
  • Date-partition the data into folders in HDFS

The GitHub README has details on how this is achieved.

Building Camus

To build Camus:

  • Clone the Git Repo from https://github.com/linkedin/camus
  • You may want to change the version of the hadoop-client library in camus/pom.xml to match your Hadoop version
  • Build using mvn clean package; if the tests fail, use -DskipTests (see the sketch below)
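
Putting those steps together, a minimal sketch of the build (the Hadoop version shown is only an example):

    git clone https://github.com/linkedin/camus.git
    cd camus
    # Optionally edit pom.xml so hadoop-client matches your cluster, e.g. 2.6.0
    mvn clean package                # add -DskipTests if the tests fail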

How Camus does Date Partitioning

Camus achieves date partitioning by...

Continue reading →


Visualizing Metrics in Storm using StatsD & Graphite

Storm Metrics API

Jason Trost from Endgame has written a nice post on how to set up Storm to publish metrics using the Metrics API. Endgame has also open-sourced a Storm module, storm-metrics-statsd, that allows you to send messages to StatsD.

Build

If you use Maven, you can add the following snippets to your topology’s pom.xml to load the storm-metrics-statsd jar. Alternatively, you can clone the GitHub project and build it yourself (see the sketch after the snippet).

    <repository>
      <id>central-bintray</id>
      <url>http://dl.bintray.com/lookout/systems</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>

...
    <dependency>
      <groupId>com.timgroup</groupId>
      <artifactId>java-statsd-client</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>com.lookout</groupId>
      <artifactId>storm-metrics-statsd</artifactId>
...
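
If you prefer to build it yourself, a hedged sketch (the repository location is an assumption based on the project name):

    # Repository URL is assumed, not confirmed
    git clone https://github.com/endgameinc/storm-metrics-statsd.git
    cd storm-metrics-statsd
    mvn clean install    # installs the jar into your local Maven repository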

Continue reading →