oracle aide

December 14, 2013

Top-N with group by in Hive – without analytic functions or ranking UDFs

Filed under: Uncategorized — Tags: — oracleaide @ 2:31 am

Since Hive 0.11 and its analytic functions are not available for my current project I have to resort to simple remedies.

Such as a massive union of top-N queries:

select * from (

select * from (select * from ( select * from v_diff where group_id = 1 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 2 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 3 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 4 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 5 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 6 ) aa order by abs(diff) desc ) bb limit 100
union all
select * from (select * from ( select * from v_diff where group_id = 7 ) aa order by abs(diff) desc ) bb limit 100
union all

) uu;

 

Where v_diff is a simple view comparing two partitions in a single table:

desc v_diff;

group_id string

diff float

Contrary to my expectations – it took a while to complete the query.

According to job tracker, all the individual select statement went into a single queue.

A small change in the hive session settings made a difference. 

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

With parallel execution turned ON – Hadoop launched 16 queries in parallel and completed the whole select much faster. 

Something I would take for granted in an Oracle database with parallel execution. 

 

Blog at WordPress.com.