Yash Ranadive

Data Engineer at Lookout Mobile Security


How to read ISO 8601

ISO 8601 is a format for expressing a date together with its timezone information. I used to get confused after looking at dates like “2014-10-07T16:11:24-07:00”. OK, so you can tell it is 7th October 2014 at 4:11 PM, and the -07:00 tells us the timezone, which is UTC - 7 hours - all good. On the surface it looked easy, but it was confusing for me.

Is it saying the time is 4:11 PM at UTC, with the record merely created in the UTC - 07:00 timezone, or is the time 4:11 PM in the timezone UTC - 07:00? This confused me for quite a while. Well, the answer is that it is the latter. Here’s an example in Ruby:

time = Time.now.iso8601
=> "2014-10-07T16:11:24-07:00"

This means 4:11 PM in the timezone that is UTC - 7 hours.

time = Time.now.utc.iso8601
=> "2014-10-07T23:15:08Z"

The Z (for Zulu) at the end indicates the timezone is UTC, and the time is 11:15 PM in that timezone.
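To convince yourself that both forms name the same instant, you can parse the offset form and render it in UTC - a quick sketch using Ruby's standard time library:

```ruby
require 'time'

# Parse the offset form: 16:11:24 on the local clock, at UTC - 7 hours
t = Time.iso8601("2014-10-07T16:11:24-07:00")

# The same instant rendered in UTC - the clock reads 7 hours later
t.utc.iso8601  # => "2014-10-07T23:11:24Z"
```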

Continue reading →

Boilerplate Maven Pom for generating jars with dependencies

I find myself searching for this over and over again. Maven can be a pain in the butt and Gradle is supposed to be a huge improvement, but every now and then you have to work with Maven. Here’s a useful gist which contains boilerplate to create a jar with dependencies:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">


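The key piece of such boilerplate (the gist is truncated here) is typically the maven-assembly-plugin with its predefined jar-with-dependencies descriptor - a sketch of that plugin section, not the full pom:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- Predefined descriptor that bundles all dependencies into one jar -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <!-- Build the fat jar as part of mvn package -->
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```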

Continue reading →

How to determine character encoding of files downloaded by gsutil

gsutil is Google’s tool to download reports/reviews/etc. from the Developer Console. Running gsutil ls -L on an object prints its metadata, and the Content-Type field reveals the character encoding:

$ gsutil ls -L gs://link/to/your/document.csv
    Creation time:      Mon, 04 Aug 2014 09:38:01 GMT
    Content-Encoding:       gzip
    Content-Length:     739977
    Content-Type:       text/csv; charset=utf-16le
    Hash (crc32c):      AAAAAA
    Hash (md5):     AAAAAAAAAAAAAAA
    ETag:           AAAAAAAAAAA
    Generation:     1234567081803000
    Metageneration:     1
    ACL:            ACCESS DENIED. Note: you need OWNER permission
                on the object to read its ACL.
TOTAL: 1 objects, 739977 bytes (722.63 KB)
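The Content-Type line above is the answer: this file is UTF-16LE. Knowing that, iconv can convert it to UTF-8 - a sketch, where document.csv stands in for the file gsutil cp downloaded:

```shell
# Convert the downloaded report from UTF-16LE (per Content-Type) to UTF-8
iconv -f UTF-16LE -t UTF-8 document.csv > document.utf8.csv
```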

View →

Best way to duplicate a partitioned table in Hive

A simple Google search for the above will land you here:

But, I believe a better way is:

  1. Create the new target table with the schema from the old table
  2. Use hadoop fs -cp to copy all the partitions from the source to the target table's location
  3. Run MSCK REPAIR TABLE table_name; on the target table so the metastore discovers the copied partitions
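The three steps above can be sketched as follows (database, table, and path names are hypothetical):

```sql
-- 1. Create the target table with the source table's schema
CREATE TABLE target_table LIKE source_table;

-- 2. From a shell, copy every partition's files to the target location:
--    hadoop fs -cp /user/hive/warehouse/db.db/source_table/* \
--                  /user/hive/warehouse/db.db/target_table/

-- 3. Tell the metastore to discover the copied partitions
MSCK REPAIR TABLE target_table;
```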

View →

How to calculate Modification Times of Hive Tables

If you use external tables in Hive, or feed data to Hive tables by methods other than Hive’s LOAD DATA, you will want to know how recent your data is.

Here’s a nifty little Ruby snippet that gets that using WebHDFS:

irb> require 'webhdfs'

irb> require 'date'

irb> client = WebHDFS::Client.new('hadoop-nn', 50070)

irb> fl = client.list('/user/hive/warehouse/database.db/tablename/')

irb> # modificationTime is epoch milliseconds, so use %Q (not %M, which is minutes)
irb> DateTime.strptime(fl.collect {|x| x['modificationTime']}.max.to_s, '%Q')
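WebHDFS reports modificationTime in milliseconds since the epoch, which is exactly what strptime's %Q directive expects - a quick check with a hypothetical timestamp:

```ruby
require 'date'

# 1412723484000 ms since the epoch is 2014-10-07 23:11:24 UTC
DateTime.strptime('1412723484000', '%Q').to_s
# => "2014-10-07T23:11:24+00:00"
```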

View →

Find Number of fields in a file

To find the number of fields in a TSV file just do the following:

First calculate the number of tabs:

$ head -1 /tmp/file.txt | grep -o $'\t' | wc -l

The number of fields is the number of tabs separating the fields plus 1. Here, 16 tabs means:

16 + 1 = 17 fields
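An equivalent one-liner with awk counts the fields directly (assuming the same /tmp/file.txt):

```shell
# NF is awk's built-in field count for the current line
head -1 /tmp/file.txt | awk -F'\t' '{print NF}'
```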

View →

Semantic Versioning

All developers should follow this convention when publishing releases.

From semver.org:

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
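For what it’s worth, RubyGems’ version class follows essentially this ordering (it spells pre-release labels with a dot rather than SemVer’s hyphen):

```ruby
# Numeric components compare numerically, so 1.10.0 > 1.9.2
Gem::Version.new('1.10.0') > Gem::Version.new('1.9.2')      # => true

# Pre-release versions sort before the release they precede
Gem::Version.new('2.0.0.pre') < Gem::Version.new('2.0.0')   # => true
```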

View →

Virtual Destinations in ActiveMQ not working

I find ActiveMQ’s documentation pathetic at times. Consider this document, which talks about ActiveMQ virtual destinations - you’d imagine they’d EXPLICITLY write about the most important thing - THEY ARE DISABLED BY DEFAULT and won’t be enabled unless you add the code snippet below. So if you’re one of the people who have tried pulling their hair out only to realize this - I feel you :)

To enable virtual destinations you need to include the following in activemq.xml:

        <virtualTopic name="VirtualTopic.>" prefix="Consumer.*."/> 
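For context, that element belongs inside the broker's destination interceptors - a sketch of the surrounding activemq.xml:

```xml
<broker xmlns="http://activemq.apache.org/schema/core">
  <destinationInterceptors>
    <virtualDestinationInterceptor>
      <virtualDestinations>
        <virtualTopic name="VirtualTopic.>" prefix="Consumer.*."/>
      </virtualDestinations>
    </virtualDestinationInterceptor>
  </destinationInterceptors>
</broker>
```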

Continue reading →

Accessing your gmail account from VPS using Mutt

I spent several frustrating hours trying to figure out what I was doing wrong after installing and configuring Mutt for Gmail.

$ mutt -s "Tester" bot.dude@gmail.com < /tmp/mail.txt 
msmtp: authentication failed (method PLAIN)
msmtp: server message: 534-5.7.14 <https://accounts.google.com/ContinueSignIn?sarp=1&scc=1&plt=AKgnsbu0B
msmtp: server message: 534-5.7.14 XGxSM-tObJBIhQ5VPFuixj8fKKomIPSnncNaOTJghPy0TpsfqG0KPA8loEuq5TE0QK-WGK
msmtp: server message: 534-5.7.14 C9Sq2rHnhzg_RGnMG4poE3uqs-U52pB_IYVcXSw2QUhPSwfAsaaYkAnPSGDsWmE7iA5HBz
msmtp: server message: 534-5.7.14 jyKXEhtr7lqURIk2wxoTty0AFQOL4ZKz19gDWNe0EYAYqrFCr1V0g4hDECdJZDSO8Te5rX
msmtp: server message: 534-5.7.14 4E3I7MA> Please log in via your web browser and then try again.
msmtp: server message: 534-5.7.14 Learn more at
msmtp: server message: 534 5.7.14 https://support.google.com/mail/bin/answer.py?answer=78754
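That 534 response is Google rejecting plain password authentication from msmtp. At the time, the usual fix was to generate an app-specific password in your Google account and use it in ~/.msmtprc instead of your regular one - a sketch, with placeholder account values:

```
account gmail
host smtp.gmail.com
port 587
tls on
auth on
user bot.dude@gmail.com
password your-app-specific-password
from bot.dude@gmail.com
```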

Continue reading →

Hot Swapping of Tables in Hive

Here are the steps to hot swap a table in Hive when you load its data manually. Hot swapping lets you refresh an entire table, or even change its definition, without downtime: the table stays available for querying throughout, and although the approach does lock the table, the lock is held only for a fraction of a second. Hive 0.11 lets you hold a lock on a table that you can then drop and re-create.

Here are the steps for the swap:

  1. Store the data in a staging location in HDFS
  2. Get an EXCLUSIVE LOCK on the Hive table (using LOCK TABLE tablename) - so queries have to wait until we finish the refresh
  3. Delete the underlying Hive folder (e.g. /user/hive/warehouse/user.db/tablename). We have to delete first because of https://issues.apache.org/jira/browse/HDFS-4142 and...

Continue reading →