Spark Core: Top 10 Hot Categories

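Requirement: a category is a classification of products. Large e-commerce sites organize categories into multiple levels, but in this project there is only one level. Different companies may define "hot" differently; here we measure how hot a category is by its click, order, and payment counts.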

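Shoes: click count, order count, payment count

Clothes: click count, order count, payment count

Computers: click count, order count, payment count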

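In this project the requirement is refined as follows: rank by click count first, so a category with more clicks ranks higher; if the click counts are equal, compare order counts; if the order counts are also equal, compare payment counts.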
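The ranking rule above is a lexicographic comparison, which Scala's tuple ordering already provides. The following is a minimal sketch with plain Scala collections rather than Spark; the CategoryCount class, the RankingSketch object, and the sample numbers are hypothetical and only illustrate the tie-breaking.

```scala
// Hypothetical stand-in for a per-category result, reduced to what the ranking needs.
case class CategoryCount(categoryId: String, clickCount: Long, orderCount: Long, payCount: Long)

object RankingSketch {
  // Sort descending by (clicks, orders, payments). Tuples compare field by field,
  // so a later field only matters when all earlier fields are equal.
  def rank(items: Seq[CategoryCount]): Seq[CategoryCount] =
    items.sortBy(c => (c.clickCount, c.orderCount, c.payCount))(
      Ordering[(Long, Long, Long)].reverse)

  // Made-up numbers, chosen so that each tie-break is exercised.
  val sample: Seq[CategoryCount] = Seq(
    CategoryCount("shoes", 100, 20, 5),
    CategoryCount("clothes", 100, 20, 9),
    CategoryCount("computers", 100, 30, 1),
    CategoryCount("books", 120, 1, 1))
}
```

With this sample, RankingSketch.rank(RankingSketch.sample) puts books first (most clicks), then computers (same clicks as the rest, more orders), then clothes, then shoes (same clicks and orders, fewer payments).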

// User visit action table
case class UserVisitAction(date: String,               // date of the user action
                           user_id: Long,              // user ID
                           session_id: String,         // session ID
                           page_id: Long,              // ID of the page
                           action_time: String,        // timestamp of the action
                           search_keyword: String,     // keyword the user searched for
                           click_category_id: Long,    // ID of the clicked category
                           click_product_id: Long,     // ID of the clicked product
                           order_category_ids: String, // IDs of all categories in one order
                           order_product_ids: String,  // IDs of all products in one order
                           pay_category_ids: String,   // IDs of all categories in one payment
                           pay_product_ids: String,    // IDs of all products in one payment
                           city_id: Long)              // city ID

// Output result table
case class CategoryCountInfo(var categoryId: String, // category ID
                             var clickCount: Long,   // number of clicks
                             var orderCount: Long,   // number of orders
                             var payCount: Long)     // number of payments

Data (excerpt):

2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_6_2019-07-17 00:00:17_null_19_85_null_null_null_null_7
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_29_2019-07-17 00:00:19_null_12_36_null_null_null_null_5
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:00:28_null_-1_-1_null_null_15,1,20,6,4_15,88,75_9
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_11_2019-07-17 00:00:29_苹果_-1_-1_null_null_null_null_7
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:38_null_-1_-1_15,13,5,11,8_99,2_null_null_10
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:48_null_19_44_null_null_null_null_4
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_47_2019-07-17 00:00:54_null_14_79_null_null_null_null_2

Given the classes above, the plan is clearly to wrap the fields of each record in these two classes and then run the computation on them.
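To see how one record maps onto the fields, here is a small sketch that splits a single line from the excerpt above. The ParseSketch object and its value names are hypothetical; the field positions follow the declaration order of UserVisitAction.

```scala
object ParseSketch {
  // One line from the data excerpt: an order action (order_category_ids is set).
  val line: String =
    "2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:38_null_-1_-1_15,13,5,11,8_99,2_null_null_10"

  // The session ID contains hyphens but no underscores, so splitting on "_"
  // yields exactly the 13 fields of UserVisitAction.
  val fields: Array[String] = line.split("_")

  // Field 6 is click_category_id; -1 marks a record that is not a click.
  val clickCategoryId: Long = fields(6).toLong

  // Field 8 is order_category_ids; the literal string "null" marks "no order".
  val orderCategoryIds: Array[String] =
    if (fields(8) == "null") Array.empty[String] else fields(8).split(",")
}
```

Missing values arrive as the string "null" (or -1 for the numeric click fields), not as an actual null reference, which is why the comparisons are against those sentinel values.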

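Approach:

1. Split each line on _ to get the value of every field, and wrap the values in a UserVisitAction object.

2. Convert each UserVisitAction into CategoryCountInfo objects and flatten the result, so that an action yields one record per category it touches.

3. Aggregate the records by category ID, then sort them and output the top ten.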
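The three steps above can be tried out without a cluster. The sketch below uses plain Scala collections instead of an RDD and a (clicks, orders, payments) tuple instead of CategoryCountInfo; the TopCategoriesSketch object and its method names are hypothetical, and the four input lines are taken from the data excerpt.

```scala
object TopCategoriesSketch {
  // (clicks, orders, payments) for one category
  type Counts = (Long, Long, Long)

  // Steps 1 and 2: split a line on "_" and emit one (categoryId, counts) pair
  // per category the action touches.
  def explode(line: String): List[(String, Counts)] = {
    val f = line.split("_")
    if (f(6) != "-1") List((f(6), (1L, 0L, 0L)))                                   // click
    else if (f(8) != "null") f(8).split(",").toList.map(id => (id, (0L, 1L, 0L)))   // order
    else if (f(10) != "null") f(10).split(",").toList.map(id => (id, (0L, 0L, 1L))) // payment
    else Nil                                                                        // e.g. a search
  }

  // Step 3: sum the counters per category, sort descending, keep the first n.
  def topN(lines: Seq[String], n: Int): List[(String, Counts)] =
    lines.flatMap(explode)
      .groupBy(_._1)
      .map { case (id, pairs) =>
        val total = pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
        (id, total)
      }
      .toList
      .sortBy(_._2)(Ordering[Counts].reverse)
      .take(n)

  // Two clicks on category 19, one payment and one order (both include category 15).
  val sampleLines: Seq[String] = Seq(
    "2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_6_2019-07-17 00:00:17_null_19_85_null_null_null_null_7",
    "2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:00:28_null_-1_-1_null_null_15,1,20,6,4_15,88,75_9",
    "2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:38_null_-1_-1_15,13,5,11,8_99,2_null_null_10",
    "2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:48_null_19_44_null_null_null_null_4")
}
```

On these four lines, topN(sampleLines, 2) yields category 19 with counts (2, 0, 0) followed by category 15 with counts (0, 1, 1). Categories with identical counts have no defined relative order in this sketch.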

Code:

// Assumes sc is an existing SparkContext and that input02 holds the log data.
sc.textFile("input02")
  // Step 1: split each line on "_" and wrap the 13 fields in a UserVisitAction
  .map(line => {
    val datas: Array[String] = line.split("_")
    UserVisitAction(
      datas(0),
      datas(1).toLong,
      datas(2),
      datas(3).toLong,
      datas(4),
      datas(5),
      datas(6).toLong,
      datas(7).toLong,
      datas(8),
      datas(9),
      datas(10),
      datas(11),
      datas(12).toLong
    )
  })
  // Step 2: emit one CategoryCountInfo per category touched by the action.
  // The else-if chain treats each record as exactly one of click, order, or payment,
  // as in the sample data.
  .flatMap(a =>
    if (a.click_category_id != -1) {
      List(CategoryCountInfo(a.click_category_id.toString, 1, 0, 0))
    } else if (a.order_category_ids != "null") {
      a.order_category_ids.split(",").toList.map(id => CategoryCountInfo(id, 0, 1, 0))
    } else if (a.pay_category_ids != "null") {
      a.pay_category_ids.split(",").toList.map(id => CategoryCountInfo(id, 0, 0, 1))
    } else {
      Nil
    }
  )
  // Step 3: group by category ID and sum the three counters
  .groupBy(info => info.categoryId)
  .mapValues(infos => infos.reduce((a, b) =>
    // Build a new object rather than mutating a in place
    CategoryCountInfo(
      a.categoryId,
      a.clickCount + b.clickCount,
      a.orderCount + b.orderCount,
      a.payCount + b.payCount
    )
  ))
  .map(_._2)
  // Sort descending by (clicks, orders, payments) and print the first ten
  .sortBy(a => (a.clickCount, a.orderCount, a.payCount), ascending = false)
  .take(10)
  .foreach(println)

Output:

CategoryCountInfo(6,5912,1768,1197)
CategoryCountInfo(16,5928,1782,1233)
CategoryCountInfo(4,5961,1760,1271)
CategoryCountInfo(14,5964,1773,1171)
CategoryCountInfo(8,5974,1736,1238)
CategoryCountInfo(3,5975,1749,1192)
CategoryCountInfo(1,5976,1766,1191)
CategoryCountInfo(10,5991,1757,1174)
CategoryCountInfo(5,6011,1820,1132)
CategoryCountInfo(18,6024,1754,1197)

DAG Visualization

  • Copyrights © 2020-2021 ycfn97