Fork me on GitHub

Hive基础学习-五之联系

第一题

  1. 准备数据
1
2
3
4
5
6
7
8
9
userId	visitDate	visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
u02 2017/1/23 6
u01 2017/2/22 4
  1. 要求使用SQL统计出每个用户的累积访问次数,如下表所示
1
2
3
4
5
6
用户id	月份	小计	累积
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
  1. 创建表
1
2
3
4
5
create table action
(userId string,
visitDate string,
visitCount int)
row format delimited fields terminated by "\t";
  1. 解答
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
select 
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action; //取名:t1

结果
userid visitdate visitcount
u01 2017-01 5
u02 2017-01 6
u03 2017-01 8
u04 2017-01 3
u01 2017-01 6
u01 2017-02 8
u02 2017-01 6
u01 2017-02 4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
select
userId,
mn,
sum(visitCount) sum_visitCount
from
(
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action
) t1
group by
userId,mn; //取名:t2

结果
userid mn _c2
u01 2017-01 11
u01 2017-02 12
u02 2017-01 12
u03 2017-01 8
u04 2017-01 3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
select
userId,
mn,
sum_visitCount,
sum(sum_visitCount) over(partition by userId order by mn)
from
(
select
userId,
mn,
sum(visitCount) sum_visitCount
from
(
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action
) t1
group by
userId,mn
) t2;

第二题

  1. 有50W个京东店铺,每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为visit,访客的用户id为user_id,被访问的店铺名称为shop,请统计:
  2. 准备数据
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
u1	a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
  1. 建表
1
create table visit(user_id string,shop string) row format delimited fields terminated by '\t';
  1. 求每个店铺的UV(访客数)
1
2
3
4
5
6
7
8
// 第一种方法,但生产环境一般不能这样,distinct会导致内从溢出
select
shop,
count(distinct user_id) uv
from
visit
group by
shop;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
1.去重
select
shop,
user_id
from
visit
group by
shop,user_id; // 取名:t1

2.计数
select
shop,
count(*) uv
from
(
select
shop,
user_id
from
visit
group by
shop,user_id
)t1
group by
shop;
生产环境中当使用distinct时,可利用group by来替换
  1. 每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
1. 计算每个人访问每个店铺的总次数
select
shop,
user_id,
count(*) ct
from
visit
group by
shop,user_id; // 取名:t1

2. 针对同一店铺,对访问次数进行逆序排序,并计算rank值
select
shop,
user_id,
ct,
row_number() over(partition by shop order by ct desc) rk
from
(
select
shop,
user_id,
count(*) ct
from
visit
group by
shop,user_id
)t1; // 取名:t2

3. 去店铺访问前三名的用户

select
shop,
user_id,
ct
from
(
select
shop,
user_id,
ct,
row_number() over(partition by shop order by ct desc) rk
from
(
select
shop,
user_id,
count(*) ct
from
visit
group by
shop,user_id
)t1
)t2
where
rk <= 3;