Yash Ranadive

Data Engineer at Lookout Mobile Security


Create Views over JSON Data in Hive

The beauty of storing raw JSON in Hive is that you can potentially create multiple tables on the same data using Hive views. Hive lets you query JSON data in a couple of different ways (json_tuple and get_json_object). get_json_object takes a JSON string and a JSONPath expression and extracts the matching value. Here’s an example:

event_type       event_data
user_registered  {ip_address: ""}
user_deleted     {ip_address: ""}
hive> CREATE VIEW my_view(type, value) AS
SELECT event_type, get_json_object(tbl.event_data, '$.ip_address')
FROM json_table tbl
WHERE event_type='some_type';

hive> select * from my_view;
type             value
user_registered
user_deleted
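Under the hood, get_json_object is doing JSONPath-style extraction. Here’s the equivalent logic sketched in plain Python, using made-up sample rows that mirror the hypothetical table above:

```python
# Plain-Python equivalent of get_json_object(event_data, '$.ip_address').
# The rows below are made-up sample data, not from the post.
import json

rows = [
    ("user_registered", '{"ip_address": "10.0.0.1"}'),
    ("user_deleted",    '{"ip_address": "10.0.0.2"}'),
]

for event_type, event_data in rows:
    # '$.ip_address' just means: parse the JSON, take the ip_address key
    ip = json.loads(event_data).get("ip_address")
    print(event_type, ip)
```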

Continue reading →

Hive Unlock All Partitions

Looks like there is no good way to unlock all partitions on a Hive table, which means you have to unlock each partition manually. A big pain in the butt.

This little nifty ruby snippet helps you get the unlock table statements that you can paste on your Hive CLI.

irb> require 'date'
irb> s = Date.new(2014, 3, 1)  # the start date of the partitions
irb> e = Date.new(2014, 5, 1)  # the end date of the partitions
irb> (s..e).each { |x| puts "unlock table your_table partition(dt='#{x}');" }
unlock table your_table partition(dt='2014-03-01');
unlock table your_table partition(dt='2014-03-02');
...
unlock table your_table partition(dt='2014-05-01');
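If you’d rather not fire up irb, the same statement generator is easy in Python (same hypothetical table name and date range):

```python
# Generate UNLOCK TABLE statements for a range of daily partitions
# (same hypothetical your_table and date range as the Ruby snippet).
from datetime import date, timedelta

s = date(2014, 3, 1)  # start date of the partitions
e = date(2014, 5, 1)  # end date of the partitions

for n in range((e - s).days + 1):
    d = s + timedelta(days=n)
    print(f"unlock table your_table partition(dt='{d}');")
```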

View →

Streaming data to Hadoop using Unix Pipes? Use Pipefail

If you pipe the output of a command to Hadoop streaming, you must know about the Unix pipefail option. To see what it does, try this out on your command line:

$> true | false
$> echo $?
1
$> false | true
$> echo $?
0

ZOMG WTF, why is that 0? The first command failed, so the exit status of the entire pipeline should be 1, no? By default, the return status of a pipeline is the return status of the last command. So if you have something like this:

$> mysql -u user -ppassword -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster

The exit code will be 0 even if the mysql command fails. You can force the return status of the pipeline to be non-zero if any command in the pipeline fails:

$> set -o pipefail; mysql -u user -ppassword -e "Select * from sometable" | hadoop dfs -put - /somefile/on/the/cluster
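You can check this behavior from a script, too. A small Python sketch (assuming bash is on the PATH; pipefail is a bash/ksh/zsh option, not plain sh):

```python
# Compare a pipeline's exit status with and without bash's pipefail option.
# Assumes bash is available on the PATH.
import subprocess

without = subprocess.run(["bash", "-c", "false | true"]).returncode
with_pf = subprocess.run(["bash", "-c", "set -o pipefail; false | true"]).returncode

print(without)  # 0 -- status of the last command only
print(with_pf)  # 1 -- any failing stage fails the whole pipeline
```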

Continue reading →

Data Storage Calculations for Storing Event Data

Do you deal with storing message/event data? Ever wondered how much space it will take over the course of time in your Hadoop cluster? How much space will that 500 JSON msgs/second pipeline take? Or maybe you plan to compress that data later?

Well, I’ve wondered that a lot, so I wrote a JavaScript app that does just that.
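For a rough sense of the arithmetic the app automates, here’s a back-of-the-envelope sketch; the message size, replication factor, and compression ratio below are illustrative assumptions, not numbers from the app:

```python
# Back-of-the-envelope storage estimate for an event pipeline.
# All inputs are assumed/illustrative values.
msgs_per_sec = 500
avg_msg_bytes = 1024      # assumed average JSON message size
replication = 3           # typical HDFS replication factor
compression_ratio = 0.25  # assumed compressed/raw ratio

raw_per_day = msgs_per_sec * avg_msg_bytes * 86400
stored_per_day = raw_per_day * replication * compression_ratio

print(f"raw:    {raw_per_day / 1024**3:.1f} GiB/day")     # raw:    41.2 GiB/day
print(f"stored: {stored_per_day / 1024**3:.1f} GiB/day")  # stored: 30.9 GiB/day
```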

View →

How to compress Data in Hadoop

Hadoop is awesome because it can scale very well. That means you can add new data nodes without having to worry about running out of space. Go nuts with the data! Pretty soon you will realize that’s not a sustainable strategy, at least not financially. It is important to have a storage/retention strategy. Old data needs to be deleted or, if nothing else, compressed as much as possible.

Here’s a simple way to compress a folder using Snappy codec via Hadoop Streaming.

hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.2.0-mr1-cdh5.0.0-beta-2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/your_user/path/to/large/directory \

Continue reading →

Two Very Useful Hive Lock Settings


The first property, hive.lock.numretries, is the number of times the client will attempt to get a lock before it gives up. The second, hive.lock.sleep.between.retries, is the time interval between retries for getting locks.

You will find yourself tweaking these to suit your needs. The default value for hive.lock.sleep.between.retries is very large, which could mean your query stays stuck waiting for a lock indefinitely if your table is constantly being hit with queries.

I generally use the following settings:

SET hive.lock.numretries=100;
SET hive.lock.sleep.between.retries=1;

You can set the above in your ~/.hiverc file, or set them whenever you open a new Hive CLI session. The following shows the description of the properties.

<description>The number of times you want to

Continue reading →

Two Very Useful Hive CLI settings

It is very helpful to set these in your .hiverc file. The Hive CLI reads the .hiverc file in your home directory to override defaults.

Two of the settings I find very important are:

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;

The first line prints column headers when you do a select * from a table. The second prints the current database in the prompt. Both are very helpful features which I think should be turned on by default. Oh well.

Beware: any changes in your .hiverc file will also affect the output of hive -e and hive -f commands. So if you have scripts that dump Hive tables using either of those methods, the headers will be printed in the output.

View →

Create a file of size x bytes

One of the common requirements I run across in moving data around is figuring out whether I’m doing it the fastest way possible. A good indicator of speed is how long it takes for a large file to get copied from one server to another.

If you’re building a (big ;)) data pipeline that transports data from one server to another, you’d better be close to that baseline. dd is a Unix utility that lets you create a file of a particular size for exactly this purpose:

dd if=/dev/zero of=/path/to/desired/big/file bs=1 count=1024

This will create a file exactly 1,024 bytes in size. dd writes count blocks of bs bytes each, and without an explicit bs it defaults to 512-byte blocks, so count=1024 alone would give you a 512 KiB file. For a big test file, bump the block size: bs=1M count=1024 gives a 1 GiB file.
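A quick sanity check that dd produced the size you asked for (a sketch using a throwaway temp file):

```python
# Create a small file with dd and verify its exact size.
# Assumes the dd utility is on the PATH; the path is a throwaway temp file.
import os
import subprocess
import tempfile

path = os.path.join(tempfile.mkdtemp(), "testfile")
# 4 blocks of 1024 bytes each -> a 4096-byte file
subprocess.run(
    ["dd", "if=/dev/zero", f"of={path}", "bs=1024", "count=4"],
    check=True, stderr=subprocess.DEVNULL,
)
print(os.path.getsize(path))  # 4096
```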

View →

MySQL Replication Slave Monitoring Script for Zenoss

To monitor a slave you need to check if

  1. Slave IO Thread is running (Alert when not running)
  2. Slave SQL Thread is running (Alert when not running)
  3. Seconds behind master (Alert when passes a certain threshold)

Zenoss expects the output of the script in format
property_name1:property_value1 property_name2:property_value2

Basically key value pairs separated by spaces. Here’s the python script:

#!/usr/bin/env python

import MySQLdb as mdb
import sys

host = sys.argv[1]
user = sys.argv[2]
password = 'some_password'

con = mdb.connect(host, user, password);
cur = con.cursor(mdb.cursors.DictCursor)
cur.execute('show slave status')
slave_status = cur.fetchone()
seconds_behind = slave_status["Seconds_Behind_Master"]
slave_sql_running = "1" if slave_status["Slave_SQL_Running"] == "Yes" else "0"
slave_io_running = "1" if slave_status["Slave_IO_Running"] == "Yes" else "0"
print "seconds:" +

Continue reading →

Basic Monitoring for Hadoop Data Nodes

Here’s a basic monitoring script to monitor the HDFS cluster disk space, Temp Dir space and number of data nodes up. This was plenty useful before we switched to Cloudera Manager.

#!/usr/bin/env ruby

# Checks Hadoop and alerts if there is a change in data nodes
require 'yaml'

error_flag = 0
report = `hdfs dfsadmin -report`

# Datanodes available: 17 (28 total, 11 dead)
datanodes_report = report[/Data.*/]
persistence_report = {}
old_persistence_report = {}
dfs_used_percent = report[/DFS Used%.*/].split(' ')[-1].to_i

datanodes_total = datanodes_report[/\d+\stotal/].split(' ')[0]
datanodes_dead = datanodes_report[/\d+\sdead/].split(' ')[0]
datanodes_available = datanodes_report[/available:\s\d+/].split(' ')[-1]

persistence_report['datanodes_total'] = datanodes_total
persistence_report['datanodes_dead'] = datanodes_dead

Continue reading →