找了两道Hive有关的SQL练习题目,第一道是统计用户浏览量的数据。第二道是统计京东商铺浏览量的数据。
(一)统计用户浏览量
我们有如下的用户访问数据
userId |
visitDate |
visitCount |
u01 |
2017/1/21 |
5 |
u02 |
2017/1/23 |
6 |
u03 |
2017/1/22 |
8 |
u04 |
2017/1/20 |
3 |
u01 |
2017/1/23 |
6 |
u01 |
2017/2/21 |
8 |
u02 |
2017/1/23 |
6 |
u01 |
2017/2/22 |
4 |
要求使用SQL统计出每个用户的累积访问次数,如下表所示:
用户id |
月份 |
小计 |
累积 |
u01 |
2017-01 |
11 |
11 |
u01 |
2017-02 |
12 |
23 |
u02 |
2017-01 |
12 |
12 |
u03 |
2017-01 |
8 |
8 |
u04 |
2017-01 |
3 |
3 |
原始数据如下:
1 2 3 4 5 6 7 8
| u01 2017/1/21 5 u02 2017/1/23 6 u03 2017/1/22 8 u04 2017/1/20 3 u01 2017/1/23 6 u01 2017/2/21 8 u02 2017/1/23 6 u01 2017/2/22 4
|
建表语句
1 2 3 4 5
| create table action (userId string, visitDate string, visitCount int) row format delimited fields terminated by "\t";
|
答:
一、首先转换月份格式,统计每个月、每个用户的访问量
1 2 3 4 5 6 7 8
| SELECT userid id , date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon , sum(visitcount) views FROM action GROUP BY userid , visitdate;
|
执行结果:
1 2 3 4 5 6 7 8
| id mon views u01 2017-01 5 u01 2017-01 6 u01 2017-02 8 u01 2017-02 4 u02 2017-01 12 u03 2017-01 8 u04 2017-01 3
|
二、再按照月份和用户进行分组并统计访问量。
1 2 3 4 5 6 7 8 9 10 11
| SELECT id , mon , sum(visitcount) views FROM(SELECT userid id , date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon , visitcount FROM action) t1 GROUP BY id ,mon;
|
执行结果
1 2 3 4 5 6 7
| id mon views u01 2017-01 11 u01 2017-02 12 u02 2017-01 12 u03 2017-01 8 u04 2017-01 3
|
三、 对每个用户的数据进行累加,再次子查询
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| select userid, mon, views, (sum(views) over(partition by userid order by mon)) total FROM (SELECT userid , mon , sum(visitcount) views FROM(SELECT userid , date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon , visitcount FROM action) t1 GROUP BY userid ,mon)t2;
|
执行结果
1 2 3 4 5 6
| userid mon views total u01 2017-01 11 11 u01 2017-02 12 23 u02 2017-01 12 12 u03 2017-01 8 8 u04 2017-01 3 3
|
(二)统计网店访问量
有50W个京东店铺,每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为visit,访客的用户id为user_id,被访问的店铺名称为shop,请统计:
原始数据如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
| u1 a u2 b u1 b u1 a u3 c u4 b u1 a u2 c u5 b u4 b u6 c u2 c u1 b u2 a u2 a u3 a u5 a u5 a u5 a
|
建表:
1 2
| create table visit(user_id string,shop string) row format delimited fields terminated by '\t';
|
问:
1)每个店铺的UV(访客数)
2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
答:
1)
一、使用GROUP BY对用户进行去重
1 2 3 4 5 6 7 8
| SELECT shop, user_id FROM visit GROUP BY shop, user_id
|
执行结果:
1 2 3 4 5 6 7 8 9 10 11 12
| shop user_id a u1 a u2 a u3 a u5 b u1 b u2 b u4 b u5 c u2 c u3 c u6
|
二、对去重的结果进行用户数目统计
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| SELECT shop, COUNT(*) uv FROM (SELECT shop, user_id FROM visit GROUP BY shop, user_id) t1 GROUP BY shop;
|
执行结果
2)
一、计算每个人访问每个店铺的总次数
1 2 3 4 5 6 7 8 9
| SELECT shop, user_id, count(*) FROM visit GROUP BY shop, user_id;
|
二、针对同一店铺,对方问次数进行逆序排序,并计算rank值
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
| SELECT shop, user_id, ct, row_number() over(partition by shop order by ct desc ) rk FROM (SELECT shop, user_id, count(*) ct FROM visit GROUP BY shop, user_id) t1;
|
三、去店铺访问前三名的用户
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| SELECT shop, user_id, ct FROM (SELECT shop, user_id, ct, row_number() over(partition by shop order by ct desc ) rk FROM (SELECT shop, user_id, count(*) ct FROM visit GROUP BY shop, user_id) t1)t2 WHERE rk <= 3;
|
3
Exer4:
1 2 3 4
| CREATE TABLE user_age (dt STRING, //日期 user_id STRING, //用户 age INT) //年龄
|
1、所有用户的总数及平均年龄
1 2 3 4 5 6 7
| SELECT user_id, AVG(age) age FROM user_age GROUP BY user_id;t1
|
1 2 3 4 5 6 7 8 9 10 11 12
| SELECT COUNT(*) count1, AVG(aage) avg_total_age FROM (SELECT user_id, AVG(age) aage FROM user_age GROUP BY user_id)t1;t0
|
结果:
1 2
| count1 avg_total_age 3 27.0
|
2、活跃用户的总数及其平均年龄(2种方法)
① 用lead或者tag,在用Datediff找相邻一天的,然后去重就得出结果,不具备扩展性,实现多天活跃比较复杂。
② 等差数列
2.1 给每个用户的日期,同一天去重rank
1 2 3 4 5 6 7 8
| SELECT user_id, dt, ROW_Number() OVER(PARTITION BY user_id ORDER BY dt) rk FROM user_age GROUP BY user_id,dt;t2
|
2.2 用等差数列
1 2 3 4 5
| SELECT user_id, DATE_SUB(dt,rk) diff FROM t2;t3
|
2.3 求出差距
1 2 3 4 5
| SELECT user_id, COUNT(diff) OVER(PARTITION BY user_id,diff) count_diff FROM t3;t4
|
2.4 最大连续天数
1 2 3 4 5 6 7 8
| SELECT user_id FROM t4 WHERE count_diff >= 2 GROUP BY user_id;t5
|
2.5 去重并且得到age
1 2 3 4 5 6 7 8 9 10 11 12 13
| SELECT user_age.user_id, user_age.age FROM user_age,t5 WHERE user_age.user_id = t5.user_id;t6
SELECT COUNT(*) active_count, AVG(age) active_avg_age FROM t6; t7
|
3、合并
最终SQL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
| SELECT count_total, avg_total_age, active_count, active_avg_age FROM (SELECT COUNT(*) count_total, AVG(aage) avg_total_age FROM (SELECT user_id, AVG(age) aage FROM user_age GROUP BY user_id)t1)t0 , (SELECT COUNT(*) active_count, AVG(age) active_avg_age FROM (SELECT DISTINCT user_age.user_id, user_age.age FROM user_age, (SELECT user_id FROM (SELECT user_id, COUNT(diff) OVER(PARTITION BY user_id,diff) count_diff FROM (SELECT user_id, DATE_SUB(dt,rk) diff FROM (SELECT user_id, dt, ROW_Number() OVER(PARTITION BY user_id ORDER BY dt) rk FROM user_age GROUP BY user_id,dt)t2)t3)t4 WHERE count_diff >= 2 GROUP BY user_id)t5 WHERE user_age.user_id = t5.user_id)t6)t7;
|
结果
1 2
| count_total avg_total_age active_count active_avg_age 3 27.0 1 19.0
|