Yash Ranadive - https://etl.svbtle.com

How to quickly find the depth of deepest file in a directory tree (2015-12-02)
<p>If for whatever reason you want to find how deep a directory tree goes in your software projects, simply run this nifty little one liner</p>
<pre><code class="prettyprint lang-unix">find . | grep -v "\.git" | awk '{print gsub(/\//,"")}' - | sort -r | head -1
</code></pre>
<p>This will first run a find on the current directory, filter out anything under .git, count the number of “/” characters per line, sort the counts in descending numeric order, and finally show the highest depth.</p>
msck repair table for custom partition names (2015-08-11)
<p>msck repair table is used to add partitions that exist in HDFS but not in the Hive metastore.</p>
<p>However, it expects the partitioned field name to be included in the folder structure:</p>
<pre><code class="prettyprint">year=2015
|
|_month=3
  |
  |_day=5
</code></pre>
<p>Notice that each folder name is prefixed with the partition column name. This is necessary; msck repair table won’t work if you have data in the following directory structure:</p>
<pre><code class="prettyprint">2015
|
|_3
  |
  |_5
</code></pre>
<p>This is kind of a pain. The only workaround is to use ALTER TABLE ... ADD PARTITION with an explicit LOCATION:</p>
<pre><code class="prettyprint">ALTER TABLE test ADD PARTITION (year=2015,month=03,day=05) location 'hdfs:///cool/folder/with/data';
</code></pre>
Hive and Hadoop Command Snippet search (2015-06-24)
<h2 id="why_2">Why? <a class="head_anchor" href="#why_2">#</a>
</h2>
<p>I’ve found myself looking up the “exact” syntax for DML / DDL in Hive countless times. Also, I tend to forget the list of date functions and their parameters. I would use a combination of Google Search and/or a cheat sheet for these. These don’t work very well for me, for multiple reasons (I’ll cover those in a separate post if enough people are interested). I wanted a no-frills snippet search tool but couldn’t find a good one for Hadoop and Hive. So I built my own. </p>
<h2 id="try-it-out_2">Try it Out <a class="head_anchor" href="#try-it-out_2">#</a>
</h2>
<p>If you’re interested you can access the tool at <a href="http://www.greppage.com/">www.greppage.com</a>. The UI is very basic and I’d appreciate your feedback.<a href="https://svbtleusercontent.com/ncycy10un3pzwg.png"><img src="https://svbtleusercontent.com/ncycy10un3pzwg_small.png" alt="Screen Shot 2015-06-26 at 7.28.29 AM.png"></a></p>
First Experiences with Scalding (2015-02-24)
<p>Recently, I’ve been evaluating Scalding to replace some parts of our ETL. Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. You specify your jobs in clean and expressive Scala syntax, and out come MapReduce jobs that run on your Hadoop cluster.</p>
<h1 id="options-for-cluster-processing_1">Options for Cluster Processing <a class="head_anchor" href="#options-for-cluster-processing_1">#</a>
</h1>
<p>There are several options for running a custom processing task on a Hadoop cluster without actually writing Java MapReduce code. The major ones are Pig, Hive, Scalding and Spark (I’m sure I’m missing some that you may think are significant). All of the options except Spark work by letting you write in an easy-to-use, expressive DSL which later gets compiled to Java MapReduce. Spark has its own engine to run workloads over the cluster and is gaining massive popularity. However, I’ve decided to give Spark a little more time to mature, although it already seems pretty strong and well supported as of this writing.</p>
<h1 id="scalding-is-used-by-big-companies_1">Scalding is used by Big Companies <a class="head_anchor" href="#scalding-is-used-by-big-companies_1">#</a>
</h1>
<p>Another reason why I’m particularly interested in Scalding is that it is being used at several large companies, e.g. Etsy and Twitter. Twitter runs most of its backend batch tasks using Scalding.</p>
<h1 id="getting-scalding_1">Getting Scalding <a class="head_anchor" href="#getting-scalding_1">#</a>
</h1>
<p>You can get Scalding by cloning and building <a href="https://github.com/twitter/scalding">https://github.com/twitter/scalding</a>.<br>
On the twitter/scalding GitHub page(s) the tutorial uses scald.rb to trigger jobs. Please don’t use it. The code is hideous and it will take you forever to make a simple change. I use the project here instead: <a href="https://github.com/Cascading/scalding-tutorial/">https://github.com/Cascading/scalding-tutorial/</a>. The advantage of the former (the twitter/scalding repo) is that you get a REPL to play with - which can be very useful. To kick off jobs from your local machine, you will have to make sure that you have a hadoop client installed. If you don’t want to do that, then you can always run in --local mode.</p>
<h1 id="simple-use-case_1">Simple Use Case <a class="head_anchor" href="#simple-use-case_1">#</a>
</h1>
<p>We had an issue where one of the HDFS folders of an external Hive JSON table contained bad / incomplete JSON. Any Hive query on the table would error out because of the bad JSON. </p>
<p>I decided to write a Scalding job that would look at each line of each file in the HDFS folder and find the offending JSON. I used a regex to check whether the line ended with a “}”. Not the best JSON check, but a good way to see how prevalent the problem was. I wrote this class in the tutorial dir.</p>
<p><strong>Note: This code uses the <a href="https://github.com/twitter/scalding/wiki/Fields-based-API-Reference">FieldsAPI</a> which is not typed. It is recommended to use the <a href="https://github.com/twitter/scalding/wiki/Type-safe-api-reference">Typed API</a></strong></p>
<pre><code class="prettyprint lang-scala">import com.twitter.scalding._
class FindBadJson(args: Args) extends Job(args) {
TextLine(args("input"))
.read
.filter ('line) { line: String => line.matches(".*[^}]$")}
.write(Tsv(args("output")))
}
</code></pre>
<p>Then from the scalding tutorial directory </p>
<pre><code class="prettyprint lang-shell">scalding-tutorial git:(wip-2.6) ✗ sbt assembly
scalding-tutorial git:(wip-2.6) ✗ yarn jar target/scalding-tutorial-0.11.2.jar FindBadJson --input hdfs:///user/yranadive/data/json --output hdfs:///user/yranadive/data/output --hdfs
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) ...ser/yranadive/data/output, with job id: job_1423699617785_0038, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:221)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
</code></pre>
<p>On further digging into the ResourceManager UI I found this:</p>
<pre><code class="prettyprint">Diagnostics:
MAP capability required is more than the supported max container capability in the cluster. Killing the Job. mapResourceReqt: 2048 maxContainerCapability:1222
Job received Kill while in RUNNING state.
</code></pre>
<p>Believable, since I was running this on a small QA cluster, which was probably resource starved. I changed mapreduce.map.memory.mb in yarn-site.xml to the cluster max of 1024 (tiny). The job now ran but threw an error. It looked like my client was not able to get updates from the YARN server about the status of the job.</p>
<pre><code class="prettyprint">ERROR hadoop.HadoopStepStats: unable to get remote counters, no cached values, throwing exception
No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
</code></pre>
<p>The MR job keeps chugging and succeeds. Aah..finally some data!!!</p>
<p>But wait! Since we didn’t specify a reducer, we have just as many output files as there were mappers. Bad MR…. Bad.. The output files are named like part-00001, part-00002, etc. Too many to go through. Time to declare a reducer:</p>
<pre><code class="prettyprint">import com.twitter.scalding._
class FindBadJson(args: Args) extends Job(args) {
TextLine(args("input"))
.read
.filter ('line) { line: String => line.matches(".*[^}]$")}
.groupAll { _.size }
.write(Tsv(args("output")))
}
</code></pre>
<p>And Voila! All offenders in one file!</p>
<h1 id="conclusion_1">Conclusion <a class="head_anchor" href="#conclusion_1">#</a>
</h1>
<p>Using Scalding was really easy. The fact that I was able to kick off an MR job that went across the cluster and did things with only 5 lines is pretty cool. However, I do find people being wary of functional programming languages and of using Scala. To them I can say that if you are only using the Scalding DSL, you are going to be fine for the most part, and you really won’t have to learn the nitty-gritty details of Scala. I’m going to update this space with more Scalding-related posts as I go through my journey.</p>
<p><strong>Note: this example uses the fields api which is not typed. It is recommended to use the Typed API.</strong></p>
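<p>For reference, here is a minimal, untested sketch of what the same job might look like in the Typed API. The class name FindBadJsonTyped is mine, and it assumes a Scalding version from around the time of this post; it just filters and writes, without the groupAll step:</p>
<pre><code class="prettyprint lang-scala">import com.twitter.scalding._

// Hypothetical Typed API version of the job above: read each line,
// keep the ones that don't end in "}", and write them out as a TSV.
class FindBadJsonTyped(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .filter { line => line.matches(".*[^}]$") }
    .write(TypedTsv[String](args("output")))
}
</code></pre>
<p>Because the pipe is typed as TypedPipe[String], the compiler catches field mismatches that the Fields API would only surface at runtime.</p>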
Hive doesn't like the carriage return character (2015-02-11)
<p>Have you ever run in to a situation where you count the number of rows for a table in a database, then dump it to CSV and load it into Hive, only to find that the number has changed? Well, you probably have carriage returns in your fields. Hive treats a carriage return like a newline, which means end of row. Here’s a link I found that describes it:</p>
<p><a href="http://grokbase.com/t/hive/user/111v7jva3f/newlines-in-data">http://grokbase.com/t/hive/user/111v7jva3f/newlines-in-data</a></p>
<p>You have to manually clean the \r characters from the file. One option is to use the unix tr command:</p>
<pre><code class="prettyprint lang-unix">cat yourfile | tr -d "\r" > newfile
</code></pre>
Few Thoughts about Learning (2015-02-11)
<p>It is funny how we have so much information available to us but nobody teaches us how to learn. In college, I struggled with processing vast amounts of information. I would read an article/paper/concept and comprehend only some part of it. I’d later feel guilty for not knowing the rest. Looking back, my biggest mistake was learning Java by reading a book. I remember being haplessly confused and dumbfounded as I read books that contained lines and lines of programs. As time went by I became more open to the idea of partially understanding a text without worrying too much about wholly understanding it. With large amounts of material, the challenge is that you hit sentences containing concepts or words of which you have no prior understanding. </p>
<p>Students should be taught early in life how to read complex texts without having to worry about comprehending everything. This would reduce a whole lot of anxiety around learning. </p>
<p>Another major obstacle to learning is long sentences. Nobody uses long sentences in business writing because they are hard to understand in one read. All texts should strive for the same.</p>
Removing Database Level Locks in HIVE (2015-01-28)
<p>Recently we started noticing “CREATE TABLE” statements taking incredibly long amounts of time to execute, after which they’d fail. A more detailed look into the issue revealed that we had upgraded Hive, and the new version, which now supports ACID, would lock the database by default even if ACID support was turned off. So basically, while a SELECT or INSERT was running in Hive, Hive created a Zookeeper SHARED lock on the entire database that contained those tables. </p>
<p>I did some digging through the code and found this:<br>
<a href="https://github.com/apache/hive/blob/68bc618bf0b1fd3839c3c52c2103b58719b3cb81/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java#L166">https://github.com/apache/hive/blob/68bc618bf0b1fd3839c3c52c2103b58719b3cb81/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java#L166</a> Notice the lock on the whole database.</p>
<p>To solve this problem, this link recommends turning off locking altogether:<br>
<a href="http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035%243501e4f0%249f05aed0%24@com%3E">http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035$3501e4f0$9f05aed0$@com%3E</a> That was not an option for us, as we were doing a full replace of the table and we wanted to make sure no one was reading from the table while we replaced its contents.</p>
<p>Another solution, although not an elegant one, is to remove the Zookeeper lock on the schema manually every so often. Here’s a script if you decide to go that route. If you have a better way of solving this issue, please tweet to me @yashranadive</p>
<pre><code class="prettyprint lang-ruby">require 'zookeeper'
require 'trollop'
# Removes zookeeper SHARED locks from a database
# usage ruby remove_zookeeper_locks.rb <name_of_hive_database>
# e.g. ruby remove_zookeeper_locks.rb default
opts = Trollop::options do
opt :hive_schema, "Hive Schema to unlock", :type => :string, :required => true # string --hive-schema <s>
opt :zookeeper_server, "Zookeeper server", :type => :string, :required => false # string --zookeeper-server <s>
end
logger = Logger.new(STDOUT)
zk = Zookeeper.new(opts[:zookeeper_server] || "default_zookeeper_server:2181")
hive_schema = opts[:hive_schema]
path = "/hive_zookeeper_namespace_hive/#{hive_schema}"
nodes = zk.get_children(:path => path)[:children] || []
logger.info( "ZK Nodes for #{path}: #{nodes}" )
locks = nodes.select { |i| i[/LOCK-SHARED/] }
if (locks.nil? || locks.empty?)
logger.info( "No SHARED Locks found on #{hive_schema}" )
else
locks.map do |lock_name|
zk.delete(:path => "/hive_zookeeper_namespace_hive/#{hive_schema}/#{lock_name}")
logger.info( "Removed lock /hive_zookeeper_namespace_hive/#{hive_schema}/#{lock_name}" )
end
end
</code></pre>
Creating Presentations with Reveal.js (2015-01-23)
<p>Late last year, I gave a talk at the Sift Science office in San Francisco on “Hadoop at Lookout - how Lookout uses the hadoop infrastructure to power internal analytics”. I used <a href="http://lab.hakim.se/reveal-js/#/">Reveal.js</a> to present the talk in my browser! Reveal.js is an HTML presentation framework that uses Javascript and plain HTML to create beautiful slides. I tried the <a href="https://github.com/hakimel/reveal.js">free version</a> on Github where you create your slides by writing HTML or Markdown. There’s also an <a href="http://slides.com/">online editor</a> which I have not tried.</p>
<h1 id="html-slides-what_1">HTML Slides, What? <a class="head_anchor" href="#html-slides-what_1">#</a>
</h1>
<p>You may ask, “Well, do we really need HTML slides when we have mature tools like Powerpoint or Keynote?” Having a presentation in HTML format allows you to quickly share it with the world by simply hosting it. Desktop solutions like Powerpoint or Keynote require you to upload your file to an online folder or to a presentation viewer like SlideShare. Additionally, Reveal.js allows you to write <a href="http://en.wikipedia.org/wiki/Markdown">Markdown</a> so you can throw something up real quick. Granted, it is not going to be as pixel perfect as a Keynote, but a lot of technical presentations really don’t need to be.</p>
<p>Another cool reason to use Reveal.js is the ability to hit ESC and go into preview mode. I’ve given presentations where, during Q&amp;A, I needed to go back to the beginning of the presentation. The preview mode is really awesome. Additionally, the content in your Reveal.js slides can be easily indexed by search engines.</p>
<h1 id="the-good-things_1">The Good Things <a class="head_anchor" href="#the-good-things_1">#</a>
</h1>
<p>I write my blogs using Markdown and I absolutely love it. With Markdown support, Reveal.js allows you to focus on your content and nothing else, which is great. Also, the code looks much cleaner. Here’s a one-page agenda slide:</p>
<pre><code class="prettyprint lang-html"><! Agenda ############################ -->
<section data-markdown>
#Agenda
- What we do @Lookout
- Analytics Architecture
- Event Ingestion Pipelines
- Storm
- Questions
</section>
</code></pre>
<p>Which results in something like this: <a href="https://svbtleusercontent.com/99ny82ilhhdlma.png"><img src="https://svbtleusercontent.com/99ny82ilhhdlma_small.png" alt="Screen Shot 2015-01-23 at 8.21.25 PM.png"></a></p>
<p>Notice the HTML comment with a string of “#” characters. I had to do this to visually separate the sections. Otherwise, searching for the one slide you want to update is a bit of a pain. </p>
<p>Adding images with Markdown is easy. Just shove them in a folder inside your dir and simply point to it:</p>
<pre><code class="prettyprint lang-html"><section data-markdown>
##Storm Parallelism
![](/assets/images/Storm_parallelism.png)
</section>
</code></pre>
<p>But resizing the image in Markdown wasn’t so easy.</p>
<h1 id="the-challenges_1">The Challenges <a class="head_anchor" href="#the-challenges_1">#</a>
</h1>
<p>All of this is great, but when the rubber meets the road, does Reveal.js really give its desktop cousins a strong fight? First, writing pure HTML or Markdown takes some getting used to. Not being able to see a preview the instant you make a change can be frustrating; you may find yourself constantly modifying your markdown file and Cmd+Tab-ing over to your Chrome window.</p>
<p>Also, with Reveal.js (at least up to version 2.6.2) you can’t include a code block in the Markdown. You have to use HTML.</p>
<pre><code class="prettyprint lang-html"><section>
<h2>Deployment</h2>
<p>Configuration is stored in shell scripts that launch topologies</p>
<pre><code data-trim>
storm jar /topolgoies/data-storm-0.0.3-SNAPSHOT.jar com.lookout.data.topology.KafkaToHdfsTopology \
-topologyname kafka-hdfs \
...
-D statsd.host=statsdhost
</code></pre>
</section>
</code></pre>
<p><a href="https://svbtleusercontent.com/pkib0va2ozgydq.png"><img src="https://svbtleusercontent.com/pkib0va2ozgydq_small.png" alt="Screen Shot 2015-01-23 at 8.32.40 PM.png"></a></p>
<p>Another issue is that if you are presenting using a clicker (which generally has only 2 buttons - next and back) then navigating is impossible. For every new Section Group you need to scroll down to reach the slides in that group.</p>
<p>Making diagrams, which I often do in technical presentations, can also be a bit of a pain. You have to create the diagram in an external tool and then export it to png to be visible in the slides.</p>
<h1 id="overall-impressions_1">Overall Impressions <a class="head_anchor" href="#overall-impressions_1">#</a>
</h1>
<p>I’m definitely going to use Reveal.js again for quick presos and when I want to convey an idea quickly. But, for more pixel-perfect and detailed presos with a lot of diagrams, I’m going to stick with the workhorses - Powerpoint and Keynote.</p>
Setting up Camus - LinkedIn's Kafka to HDFS pipeline (2014-12-21)
<p>A few days ago I started tinkering with Camus to evaluate its use for dumping raw data from Kafka=>HDFS. This blog post will cover my experience and first impressions with setting up a Camus pipeline. Overall, I found Camus easy to build and deploy. </p>
<h1 id="what-is-camus_1">What is Camus? <a class="head_anchor" href="#what-is-camus_1">#</a>
</h1>
<p>Camus is LinkedIn’s open source project that can dump raw/processed data from Kafka to HDFS. It does this via a map-reduce job which, when kicked off, can:</p>
<ul>
<li>Manage its own offsets </li>
<li>Date partition the data in to folders in HDFS</li>
</ul>
<p>The <a href="https://github.com/linkedin/camus">github readme</a> has details on how this is achieved.</p>
<h1 id="building-camus_1">Building Camus <a class="head_anchor" href="#building-camus_1">#</a>
</h1>
<p>To build Camus:</p>
<ul>
<li>Clone the Git Repo from <a href="https://github.com/linkedin/camus">https://github.com/linkedin/camus</a>
</li>
<li> You may want to change the version of the hadoop-client library in camus/pom.xml to match your hadoop version</li>
<li> Build using mvn clean package; if the tests fail, use -DskipTests</li>
</ul>
<h1 id="how-camus-does-date-partitioning_1">How Camus does Date Partitioning <a class="head_anchor" href="#how-camus-does-date-partitioning_1">#</a>
</h1>
<p>Camus achieves date partitioning by introspecting the message and extracting the timestamp field from it. It uses this date to determine the folder in HDFS where it lands the message. Camus creates folders in HDFS partitioned by the date of the message. E.g. a JSON message with a timestamp of “2014-12-19T01:00:59Z” will land in the folder 2014/12/19. You may choose not to introspect the message and instead assign a timestamp = currentTime(). In this case the data will land in a folder that represents the time the job was run. You may want to do this if you are just trying out Camus and don’t want to invest in writing a custom decoder for a specific message format.</p>
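<p>To make the timestamp extraction concrete, here is a rough, untested sketch of a decoder that pulls an ISO-8601 timestamp out of a JSON message. The class name, the "timestamp" field name, and the use of Jackson for JSON parsing are my own assumptions for illustration; it follows the same MessageDecoder pattern as the decoders shown later in this post, and falls back to the current time if the field is missing or unparseable.</p>
<pre><code class="prettyprint lang-java">package com.linkedin.camus.etl.kafka.coders;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

import java.text.SimpleDateFormat;
import java.util.Properties;
import java.util.TimeZone;

/**
 * Hypothetical decoder that extracts a "timestamp" field from a JSON payload
 * so Camus can date-partition the message, falling back to the current time.
 */
public class JsonTimestampMessageDecoder extends MessageDecoder<byte[], String> {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  @Override
  public void init(Properties props, String topicName) {
    this.props = props;
    this.topicName = topicName;
  }

  @Override
  public CamusWrapper<String> decode(byte[] payload) {
    String payloadString = new String(payload);
    long timestamp;
    try {
      // Assumes messages look like {"timestamp": "2014-12-19T01:00:59Z", ...}
      JsonNode json = MAPPER.readTree(payloadString);
      SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      timestamp = format.parse(json.get("timestamp").asText()).getTime();
    } catch (Exception e) {
      // Bad or missing timestamp: fall back to the fetch time, like the stock decoders
      timestamp = System.currentTimeMillis();
    }
    return new CamusWrapper<String>(payloadString, timestamp);
  }
}
</code></pre>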
<h1 id="setting-up-what-you-will-need_1">Setting up - What you will need <a class="head_anchor" href="#setting-up-what-you-will-need_1">#</a>
</h1>
<p>There are two touch-points where you will have to possibly write your own code. </p>
<ol>
<li>Reading Messages from Kafka - You will write a class that extends com.linkedin.camus.coders.MessageDecoder and tells Camus what the timestamp of the message is.</li>
<li>Writing to HDFS - You will write a class that extends com.linkedin.camus.etl.RecordWriterProvider and tells Camus what payload should be written to HDFS.</li>
</ol>
<h1 id="writing-you-own-decoder_1">Writing you own decoder <a class="head_anchor" href="#writing-you-own-decoder_1">#</a>
</h1>
<p>Now, to actually run the jar you will need to create a message decoder or use one of the supplied classes, e.g. the KafkaAvroMessageDecoder or JSONStringMessageDecoder class. This will obviously depend on what kind of data you are reading from Kafka. The following is an example of a string message decoder which reads string messages from Kafka and writes them to HDFS: </p>
<p>camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/StringMessageDecoder.java</p>
<pre><code class="prettyprint lang-java">package com.linkedin.camus.etl.kafka.coders;
import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;
import org.apache.log4j.Logger;
import java.util.Properties;
/**
* MessageDecoder class that will convert the payload into a String object,
* System.currentTimeMillis() will be used to set CamusWrapper's
* timestamp property
* This MessageDecoder returns a CamusWrapper that works with Strings payloads,
*/
public class StringMessageDecoder extends MessageDecoder<byte[], String> {
private static final Logger log = Logger.getLogger(StringMessageDecoder.class);
@Override
public void init(Properties props, String topicName) {
this.props = props;
this.topicName = topicName;
}
@Override
public CamusWrapper<String> decode(byte[] payload) {
long timestamp = 0;
String payloadString;
payloadString = new String(payload);
timestamp = System.currentTimeMillis();
return new CamusWrapper<String>(payloadString, timestamp);
}
}
</code></pre>
<p>You can also use a ByteArray decoder if you are reading binary data.<br>
UPDATE: Before you do this, make sure you can read the data back from HDFS. For example, if you want to write protobufs, then perhaps it is wiser to first convert them to a format such as Parquet. Parquet will allow you to project columns off arbitrarily nested data. </p>
<p>camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/ByteArrayMessageDecoder.java</p>
<pre><code class="prettyprint lang-java">package com.linkedin.camus.etl.kafka.coders;
import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;
import org.apache.log4j.Logger;
import java.util.Properties;
/**
* MessageDecoder class that will convert the payload into a ByteArray object,
* System.currentTimeMillis() will be used to set CamusWrapper's
* timestamp property
* This MessageDecoder returns a CamusWrapper that works with ByteArray payloads,
*/
public class ByteArrayMessageDecoder extends MessageDecoder<byte[], byte[]> {
private static final Logger log = Logger.getLogger(ByteArrayMessageDecoder.class);
@Override
public void init(Properties props, String topicName) {
this.props = props;
this.topicName = topicName;
}
@Override
public CamusWrapper<byte[]> decode(byte[] payload) {
//Push the raw payload and add the current time
return new CamusWrapper<byte[]>(payload, System.currentTimeMillis());
}
}
</code></pre>
<p>The date for each message in this case is the time the Camus job is run (or more specifically, the time the message is fetched from Kafka). </p>
<h1 id="writing-your-own-recordwriter_1">Writing your own RecordWriter <a class="head_anchor" href="#writing-your-own-recordwriter_1">#</a>
</h1>
<p>The RecordWriterProvider interface has methods which tell Camus what payload will be written to HDFS. Here you can tell Camus which record terminator to use - you may wish to choose the string "\n" or a null byte (byte)0x0. You can also specify whether you want the output compressed, which is a good thing to do if you are using HDFS as a backup of your Kafka topics.</p>
<p>camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/ByteArrayRecordWriterProvider.java</p>
<pre><code class="prettyprint lang-java">package com.linkedin.camus.etl.kafka.common;
import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.etl.IEtlKey;
import com.linkedin.camus.etl.RecordWriterProvider;
import com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
/**
* Provides a RecordWriter that uses FSDataOutputStream to write
* a Byte record as bytes to HDFS without any reformatting or compression.
*
* Null byte is used as record delimiter unless a string is specified
*/
public class ByteArrayRecordWriterProvider implements RecordWriterProvider {
public static final String ETL_OUTPUT_RECORD_DELIMITER = "etl.output.record.delimiter";
public static final String DEFAULT_RECORD_DELIMITER = "null";
protected String recordDelimiter = null;
private String extension = "";
private boolean isCompressed = false;
private CompressionCodec codec = null;
public ByteArrayRecordWriterProvider(TaskAttemptContext context) {
Configuration conf = context.getConfiguration();
if (recordDelimiter == null) {
recordDelimiter = conf.get(ETL_OUTPUT_RECORD_DELIMITER, DEFAULT_RECORD_DELIMITER);
}
isCompressed = FileOutputFormat.getCompressOutput(context);
if (isCompressed) {
Class<? extends CompressionCodec> codecClass = null;
if ("snappy".equals(EtlMultiOutputFormat.getEtlOutputCodec(context))) {
codecClass = SnappyCodec.class;
} else if ("gzip".equals((EtlMultiOutputFormat.getEtlOutputCodec(context)))) {
codecClass = GzipCodec.class;
} else {
codecClass = DefaultCodec.class;
}
codec = ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
}
@Override
public String getFilenameExtension() {
return extension;
}
@Override
public RecordWriter<IEtlKey, CamusWrapper> getDataRecordWriter(TaskAttemptContext context, String fileName,
CamusWrapper camusWrapper, FileOutputCommitter committer) throws IOException, InterruptedException {
// If recordDelimiter hasn't been initialized, do so now
if (recordDelimiter == null) {
recordDelimiter = context.getConfiguration().get(ETL_OUTPUT_RECORD_DELIMITER, DEFAULT_RECORD_DELIMITER);
}
// Get the filename for this RecordWriter.
Path path =
new Path(committer.getWorkPath(), EtlMultiOutputFormat.getUniqueFile(context, fileName, getFilenameExtension()));
FileSystem fs = path.getFileSystem(context.getConfiguration());
if (!isCompressed) {
FSDataOutputStream fileOut = fs.create(path, false);
return new ByteRecordWriter(fileOut, recordDelimiter);
} else {
FSDataOutputStream fileOut = fs.create(path, false);
return new ByteRecordWriter(new DataOutputStream(codec.createOutputStream(fileOut)), recordDelimiter);
}
}
protected static class ByteRecordWriter extends RecordWriter<IEtlKey, CamusWrapper> {
private DataOutputStream out;
private byte[] recordDelimiter;
public ByteRecordWriter(DataOutputStream out, String recordDelimiterString) {
this.out = out;
this.recordDelimiter = recordDelimiterString.toUpperCase().equals("NULL") ?
new byte[] {(byte)0x0} :
recordDelimiterString.getBytes();
}
@Override
public void write(IEtlKey ignore, CamusWrapper value) throws IOException {
boolean nullValue = value == null;
if (!nullValue) {
ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
byte[] record = (byte[]) value.getRecord();
outBytes.write(record);
outBytes.write(recordDelimiter);
out.write(outBytes.toByteArray());
}
}
@Override
public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
out.close();
}
}
}
</code></pre>
<h1 id="run-camus_1">Run Camus <a class="head_anchor" href="#run-camus_1">#</a>
</h1>
<p>Great, you’re all set up to run Camus. Time to change the configs and point it at your Kafka cluster.</p>
<p>config.properties</p>
<pre><code class="prettyprint lang-properties">
# The job name.
camus.job.name=Camus Fetch
# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=/camus/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=/camus/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=/camus/exec/history
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.ByteArrayMessageDecoder
# The record writer for Hadoop
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.ByteArrayRecordWriterProvider
# max hadoop tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=10
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
# Max minutes for each mapper to pull messages (-1 means no limit)
kafka.max.pull.minutes.per.task=-1
# if whitelist has values, only whitelisted topic are pulled. Nothing on the blacklist is pulled
kafka.blacklist.topics=
kafka.whitelist.topics=mytopic
log4j.configuration=false
# Name of the client as seen by kafka
kafka.client.name=camus
# The Kafka brokers to connect to, format: kafka.brokers=host1:port,host2:port,host3:port
kafka.brokers=kafka.test.org:6667,kafka.test.org:6667
#Stops the mapper from getting inundated with Decoder exceptions for the same topic
#Default value is set to 10
max.decoder.exceptions.to.print=5
#Controls the submitting of counts to Kafka
#Default value set to true
post.tracking.counts.to.kafka=false
#monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka
# everything below this point can be ignored for the time being, will provide more documentation down the road
##########################
etl.run.tracking.post=false
kafka.monitor.tier=
etl.counts.path=
kafka.monitor.time.granularity=10
#etl.hourly=hourly
etl.daily=daily
# Should we ignore events that cannot be decoded (exception thrown by MessageDecoder)?
# `false` will fail the job, `true` will silently drop the event.
etl.ignore.schema.errors=false
# configure output compression for deflate or snappy. Defaults to deflate
mapred.output.compress=false
etl.output.codec=gzip
etl.deflate.level=6
#etl.output.codec=snappy
etl.default.timezone=America/Los_Angeles
etl.output.file.time.partition.mins=60
etl.keep.count.files=false
etl.execution.history.max.of.quota=.8
mapred.map.max.attempts=1
kafka.client.buffer.size=20971520
kafka.client.so.timeout=60000
</code></pre>
<p>Now, submit the shaded jar to hadoop. The shaded jar exists in the camus-example/target directory.</p>
<pre><code class="prettyprint">hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties
</code></pre>
<p>As the job runs it will store the temporary data in the camus/exec HDFS directory. Once the job completes the files will be moved to the topics directory in HDFS.</p>
<h1 id="output-compression_1">Output Compression <a class="head_anchor" href="#output-compression_1">#</a>
</h1>
<p>Camus supports the following output codecs:</p>
<ul>
<li>“snappy”</li>
<li>“gzip”</li>
<li>“deflate”</li>
</ul>
<p>To enable compression, set mapred.output.compress=true and <br>
etl.output.codec=gzip in the config file. The advantage of using gzip is that anyone can see the contents of a file very quickly by unzipping it with the ubiquitous gunzip from the command line. <a href="http://zlib.net/">Deflate</a> (which is the underlying compression for gzip) and <a href="https://github.com/kubo/snzip">snappy</a> will require you to download additional tools to uncompress the data.</p>
<h1 id="conclusion_1">Conclusion <a class="head_anchor" href="#conclusion_1">#</a>
</h1>
<p>The project is being actively committed to and new documentation is being added every day. My experience with Camus has been good so far. Another advantage of Camus is that it can be used for basic transformations on your data. I’m looking forward to working more with this project.</p>
Visualizing Metrics in Storm using StatsD & Graphite (2014-12-13)
<h1 id="storm-metrics-api_1">Storm Metrics API <a class="head_anchor" href="#storm-metrics-api_1">#</a>
</h1>
<p>Jason Trost from Endgame has <a href="https://www.endgame.com/blog/storm-metrics-how-to.html">written a nice post</a> on how to set up Storm to publish metrics using the Metrics API. Endgame has also <a href="https://github.com/endgameinc/storm-metrics-statsd">open sourced a module, storm-metrics-statsd</a>, for Storm that allows you to send metrics to StatsD. </p>
<h1 id="build_1">Build <a class="head_anchor" href="#build_1">#</a>
</h1>
<p>If you use maven, you can use the following snippets in your topology pom.xml to load the storm-metrics-statsd jar. Alternately, you can clone the <a href="https://github.com/endgameinc/storm-metrics-statsd">github project</a> and build it yourself.</p>
<pre><code class="prettyprint lang-xml"> <repository>
<id>central-bintray</id>
<url>http://dl.bintray.com/lookout/systems</url>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
...
<dependency>
<groupId>com.timgroup</groupId>
<artifactId>java-statsd-client</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>com.lookout</groupId>
<artifactId>storm-metrics-statsd</artifactId>
<version>1.0.0</version>
</dependency>
</code></pre>
<h1 id="registering-your-metric_1">Registering your Metric <a class="head_anchor" href="#registering-your-metric_1">#</a>
</h1>
<p>You can now register the metric you wish to track by adding the following in your topology class</p>
<pre><code class="prettyprint lang-java">// Configure the StatsdMetricConsumer
Map statsdConfig = new HashMap();
statsdConfig.put(StatsdMetricConsumer.STATSD_HOST, statsdHost);
statsdConfig.put(StatsdMetricConsumer.STATSD_PORT, 8125);
statsdConfig.put(StatsdMetricConsumer.STATSD_PREFIX,"data.storm.metrics");
topologyConfig.registerMetricsConsumer(StatsdMetricConsumer.class, statsdConfig, 2);
</code></pre>
<p>Now, let’s say you wish to track successes and errors for a particular message. In the bolt class, you can send increments to a counter like so:</p>
<pre><code class="prettyprint lang-java">// Metrics - Note: these must be declared as transient since they are not Serializable
transient CountMetric _successCountMetric;
transient CountMetric _errorCountMetric;
@Override
public void prepare(java.util.Map stormConf, TopologyContext context) {
// Metrics must be initialized and registered in the prepare() method for bolts, or the open() method for spouts. Otherwise, an Exception will be thrown
initMetrics(context);
}
private void initMetrics(TopologyContext context) {
_successCountMetric = new CountMetric();
_errorCountMetric = new CountMetric();
context.registerMetric("success_count", _successCountMetric, 1);
context.registerMetric("error_count", _errorCountMetric, 1);
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
try{
//some complex logic
//On success increment the success counter
_successCountMetric.incr();
} catch (Exception ex) {
//On Error incerment the error counter
_errorCountMetric.incr();
}
}
</code></pre>
<h1 id="accessing-your-graph-from-graphite_1">Accessing your graph from Graphite <a class="head_anchor" href="#accessing-your-graph-from-graphite_1">#</a>
</h1>
<p>storm-metrics-statsd sends data to statsd under the following namespace:<br>
topology_name.host_name.port_number.bolt_name.metric_name. The port number in this case is the supervisor worker that is running the bolt. You can find this in the Storm UI. In addition, the module will also send the internal storm metrics of the topology to statsd, e.g. metrics such as __ack-count, __transfer-count, etc.</p>
<p>A common requirement is to get the counts regardless of the supervisor host / port that is responsible for running the bolt. In such cases, you can use wildcard characters and the sumSeries function to get metrics across hosts/topologies, etc. Here’s an example graph link in graphite:</p>
<p>graphite.endlesspuppies.com/render/?colorList=red%2Cgreen&from=-60minutes&target=sumSeries(storm.metrics.topology.*.*.tsv.error_count)&target=sumSeries(storm.metrics.topology.*.*.tsv.success_count)</p>
<p><img src="http://s4.postimg.org/qd65lehot/graphite_metrics.jpg" alt=""><br>
If you wish to display graphs(s) auto-cyled on a TV/dashboard you can use <a href="https://gist.github.com/yash-ranadive/a10f55aafd1f26cd062f">this javascript</a>.</p>
<h1 id="a-word-of-caution_1">A Word of Caution <a class="head_anchor" href="#a-word-of-caution_1">#</a>
</h1>
<p>When the Storm topology sends data to StatsD, it is actually sending that data over <a href="http://en.wikipedia.org/wiki/User_Datagram_Protocol">UDP</a>. UDP is a connectionless protocol, i.e. there is no guarantee that a message sent over UDP will be received by the server at the other end. Depending on your network connection and on how much load your StatsD server is under, the server may drop a small or significant amount of your data. This results in unreliable dashboards. So be very careful not to overwork your StatsD boxes, and make sure the StatsD box is running close to your Storm topologies.</p>
<h1 id="debugging-topology-stats_1">Debugging Topology Stats <a class="head_anchor" href="#debugging-topology-stats_1">#</a>
</h1>
<p>Once your topology is all set up and running, you may find yourself wanting to know the exact data being sent to the StatsD server. Log in to a supervisor box and listen to outgoing UDP traffic on port 8125 (the default statsd port). You can achieve this using ngrep.</p>
<pre><code class="prettyprint">sudo ngrep -W byline -d en3 . udp port 8125 > /tmp/capture.txt
tail -f /tmp/capture.txt
</code></pre>