Wednesday, January 4, 2017

Why hpdr?

Where I work, much of our big data lives in Hive tables with separate partitions for the different time units: YYYY, MM, DD, HH, and MIN.

That is, we don't have a single YYYYMMDDHHMIN partition.

When newly-arrived developers meet our Hive data, they almost inevitably make a series of mistakes, in a particular order. They start with something like this:

Since no partitions are specified, the query never finishes. It's pulling data from every partition going back several years.

Then, with partitions in hand, they try this:

But that's not right because there might be events in those partitions whose create_date is outside the date range. Like an event we don't want because it has a too-early create_date of 2016-11-14 23:59:59 but it slips into the YYYY=2016,MM=11,DD=15 partition due to time skew.

They almost get it right by using both the date range and the partition range:

But that's still not right because there might be events with a create_date's that fall within the date range but are not in those partitions. Like an event with a create_date of 2016-12-14 23:59:59 that slips into the unincluded YYYY=2016,MM=12,DD=15 partition.

They need to add some extra "slop", like one hour, on each end of the partition range so that they're finding all the events within the date range.

They end up with what I believe to be the "correct" query:

But look how unwieldy that partition range is. And look what happens the next month when we cross the year boundary:

Crafting these partition date ranges over and over is not something you want to be doing by hand. This is why hpdr.