Spark SQL: Analyzing Shanghai Mobike Bike-Sharing Data

Author: 张子
Published: April 23, 2022
Category: Big Data

1 Creating the DataFrame

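// col, when, hour, dayofweek and the other column functions used below come from this import.
import org.apache.spark.sql.functions._

// Read the CSV with a header row. No schema or inferSchema is given, so every column is loaded as a string.
val bikeDF = sqlContext.read
      .format("csv")
      .option("header", "true")
      .option("timestampFormat", "yyyy-MM-dd HH:mm")
      .load("src/main/scala/com/zhangz1/mobike_shanghai_sample_updated.csv")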
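Because the reader is given neither a schema nor inferSchema, start_time and end_time arrive as strings. One way to get typed timestamp columns is to_timestamp with the same pattern. The following is only a sketch: the local session, the names raw, typed and fmt, and the single sample row are illustrative assumptions, not part of the original project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.TimestampType

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bike-parse-sketch").getOrCreate()
import spark.implicits._

// Hypothetical row standing in for the CSV; all columns are strings, as after the CSV read.
val raw = Seq(("u1", "2016-08-01 08:10", "2016-08-01 08:25"))
  .toDF("userid", "start_time", "end_time")

// Parse both time columns with the pattern used throughout the post.
val fmt = "yyyy-MM-dd HH:mm"
val typed = raw
  .withColumn("start_time", to_timestamp(col("start_time"), fmt))
  .withColumn("end_time", to_timestamp(col("end_time"), fmt))

typed.printSchema()
```

With typed columns, later steps no longer depend on implicit string-to-timestamp casts.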

2 Displaying the data

bikeDF.show()

3 Counting rides for each day of the week (Monday through Sunday) by start time

Core SQL statement:

select dayofweek(start_time) as day,count(*) as count from bike group by day
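The SQL statements in this post query a table named bike, but the snippets do not show how that name is bound to the DataFrame. A minimal sketch of one way to do it, assuming a local SparkSession and a few hypothetical rows in place of the CSV (the names spark, sample and dayCounts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bike-sql-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows; timestamps are strings, as in the CSV read above.
val sample = Seq(
  ("u1", "2016-08-01 08:10", "2016-08-01 08:25"), // a Monday
  ("u2", "2016-08-01 18:40", "2016-08-01 19:05"), // a Monday
  ("u1", "2016-08-07 09:00", "2016-08-07 09:50")  // a Sunday
).toDF("userid", "start_time", "end_time")

// Bind the DataFrame to the table name used in the SQL statements.
sample.createOrReplaceTempView("bike")

// Same statement as above, with an ORDER BY added so the output order is stable.
val dayCounts = spark.sql(
  "select dayofweek(start_time) as day, count(*) as count from bike group by day order by day")
dayCounts.show()
```

In the real project the view would be created from bikeDF rather than from sample.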

Core code:

// dayofweek returns 1 for Sunday through 7 for Saturday; map each number to its name.
val day_count = bikeDF.select(dayofweek(col("start_time")).alias("day")).groupBy("day").count().withColumn("day",
      when(col("day") === 1, "Sunday")
        .when(col("day") === 2, "Monday")
        .when(col("day") === 3, "Tuesday")
        .when(col("day") === 4, "Wednesday")
        .when(col("day") === 5, "Thursday")
        .when(col("day") === 6, "Friday")
        .when(col("day") === 7, "Saturday")
        .otherwise("Unknown"))
day_count.show()

4 Counting rides for each hour of the day by start time, sorted in descending order

Core SQL statement:

select hour(start_time) as hour, count(*) as count from bike group by hour order by count desc

Core code:

// Label each hour as "H:00". concat yields null for a null hour, which coalesce turns into "Unknown".
val hour_count = bikeDF.select(hour(col("start_time")).alias("hour")).groupBy("hour").count().withColumn("hour",
      coalesce(concat(col("hour").cast("string"), lit(":00")), lit("Unknown")))
// Sort by ride count, busiest hour first, and show all 24 rows.
hour_count.sort(col("count").desc).show(24)

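5 Computing ride-duration statistics from the start and end times, stored in a riding_time column

Core SQL statement:

select *, (unix_timestamp(end_time, 'yyyy-MM-dd HH:mm') - unix_timestamp(start_time, 'yyyy-MM-dd HH:mm')) / 60 as riding_time from bike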

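Core code:

// Ride duration in minutes: the difference of the two Unix timestamps (in seconds) divided by 60.
val riding_time = bikeDF.withColumn("riding_time", (unix_timestamp(col("end_time"), "yyyy-MM-dd HH:mm") - unix_timestamp(col("start_time"), "yyyy-MM-dd HH:mm")) / 60).select("riding_time")
// summary() reports count, mean, stddev, min, the quartiles, and max.
riding_time.summary().show()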


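6 Counting rides in the morning peak, evening peak, and off-peak periods by start time

// Hours 7-8 count as the morning peak and hours 18-20 as the evening peak; everything else is off-peak.
val morn_even_peak = bikeDF.select(hour(col("start_time")).alias("hour")).withColumn("hour",
      when(col("hour") > 6 && col("hour") <= 8, "Morning peak")
        .when(col("hour") > 17 && col("hour") <= 20, "Evening peak")
        .otherwise("Off-peak")).groupBy("hour").count()
morn_even_peak.show()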
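The boundaries of the rule are easy to misread, since it mixes strict and non-strict comparisons. A plain-Scala mirror of the same conditions can help check which hours fall where. This is only an illustrative sketch; the function name peakLabel is hypothetical and not part of the original code.

```scala
// Plain-Scala mirror of the Spark expression above, for checking the hour boundaries.
def peakLabel(hour: Int): String =
  if (hour > 6 && hour <= 8) "Morning peak"          // hours 7 and 8
  else if (hour > 17 && hour <= 20) "Evening peak"   // hours 18, 19 and 20
  else "Off-peak"

// Print the label for every hour of the day.
(0 to 23).foreach(h => println(s"$h:00 -> ${peakLabel(h)}"))
```

Under this rule a ride starting at 6:59 or 17:59 is off-peak, because only the hour component of start_time is compared.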

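7 User segmentation (RFM model)

R value: the number of days between each user's most recent rental and September 1, 2016.

// datediff gives the days from each ride to 2016-09-01; the minimum per user belongs to the most recent ride.
val last_rent_day = bikeDF.select(col("userid").alias("r"), datediff(lit("2016-09-01 00:00"), col("start_time"))
      .alias("last_rent_day")).groupBy("r").agg(min("last_rent_day")
      .alias("r_value"))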

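F value: the total number of rentals made by each user.

// One row per ride, so counting rows per user gives the rental frequency.
val rent_count = bikeDF.select(col("userid").alias("f")).groupBy("f").agg(count("f")
      .alias("f_value"))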

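M value: each user's cumulative spending.

// Cost units per ride: duration in minutes / 30, plus 1 when the duration is not a multiple of 30.
// Note that the division is floating-point, so a 45-minute ride counts as 2.5 units.
val total_cost = bikeDF.select(col("userid").alias("m"), ((unix_timestamp(col("end_time"), "yyyy-MM-dd HH:mm") - unix_timestamp(col("start_time"), "yyyy-MM-dd HH:mm")) / 60)
      .alias("total_cost")).withColumn("total_cost",
      when(col("total_cost") % 30 =!= 0, col("total_cost") / 30 + 1)
        .otherwise(col("total_cost") / 30))
      // Sum the units per user and round to two decimal places.
      .groupBy("m").agg(sum("total_cost").alias("m_value")).select(col("m"), round(col("m_value"), 2).alias("m_value"))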
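The when/otherwise expression divides in floating point, so a 45-minute ride contributes 1.5 + 1 = 2.5 units rather than 2. If the intent is to charge one unit per started 30-minute block (an assumption; the post does not state the fare rule), ceil expresses that directly. A minimal sketch with hypothetical rows, where spark, rides, minutes and cost are illustrative names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{ceil, col, sum, unix_timestamp}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bike-cost-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rides, not taken from the dataset.
val rides = Seq(
  ("u1", "2016-08-01 08:00", "2016-08-01 08:45"), // 45 minutes
  ("u1", "2016-08-02 08:00", "2016-08-02 08:30"), // 30 minutes
  ("u2", "2016-08-01 09:00", "2016-08-01 09:10")  // 10 minutes
).toDF("userid", "start_time", "end_time")

// Ride duration in minutes, computed as in the post.
val fmt = "yyyy-MM-dd HH:mm"
val minutes = (unix_timestamp(col("end_time"), fmt) - unix_timestamp(col("start_time"), fmt)) / 60

// Assumed fare rule: one unit per started 30-minute block, i.e. the ceiling of minutes / 30.
val cost = rides
  .withColumn("units", ceil(minutes / 30))
  .groupBy("userid")
  .agg(sum("units").alias("m_value"))
  .orderBy("userid")

cost.show()
```

Switching to this rule would change the M values, so the score thresholds and averages used later would need to be recomputed.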

Merging the three results into a single DataFrame

// Join the R, F, and M results on the user id and keep one userid column.
val rfm = last_rent_day.join(rent_count, last_rent_day("r") === rent_count("f")).join(total_cost, last_rent_day("r") === total_cost("m"))
      .select(last_rent_day("r").alias("userid"), last_rent_day("r_value"), rent_count("f_value"), total_cost("m_value"))
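Since R, F, and M are all grouped by the same userid, the two joins could be avoided by computing the three values in a single aggregation. The following is a sketch of that alternative, not the author's method. It uses hypothetical rows, copies the post's cost formula as-is, and wraps the date arguments in to_date to make the conversion explicit.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, datediff, lit, min, round, sum, to_date, unix_timestamp, when}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bike-rfm-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rides, not taken from the dataset.
val rides = Seq(
  ("u1", "2016-08-25 08:00", "2016-08-25 08:45"), // 45 minutes
  ("u1", "2016-08-30 08:00", "2016-08-30 08:30"), // 30 minutes
  ("u2", "2016-08-20 09:00", "2016-08-20 09:10")  // 10 minutes
).toDF("userid", "start_time", "end_time")

val fmt = "yyyy-MM-dd HH:mm"
val minutes = (unix_timestamp(col("end_time"), fmt) - unix_timestamp(col("start_time"), fmt)) / 60
// Same per-ride cost expression as in the post (floating-point division).
val cost = when(minutes % 30 =!= 0, minutes / 30 + 1).otherwise(minutes / 30)

// One pass over the data: R = days since the latest ride, F = ride count, M = summed cost.
val rfm = rides.groupBy("userid").agg(
  min(datediff(to_date(lit("2016-09-01")), to_date(col("start_time")))).alias("r_value"),
  count(lit(1)).alias("f_value"),
  round(sum(cost), 2).alias("m_value")
).orderBy("userid")

rfm.show()
```

The result has the same columns as the joined DataFrame (userid, r_value, f_value, m_value), so the scoring step could consume it unchanged.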

Computing the RFM scores and comparing them with the RFM averages

// R score: the fewer days since the last rental, the higher the score
val rfm_score = rfm.withColumn("r_score",
      when(col("r_value") > 14, 1)
        .when(col("r_value") > 7, 2)
        .when(col("r_value") > 3, 3)
        .when(col("r_value") > 1, 4)
        .when(col("r_value") > 0, 5)
        .otherwise(1))
      // F score: the more rentals, the higher the score
      .withColumn("f_score",
        when(col("f_value") > 20, 5)
          .when(col("f_value") > 15, 4)
          .when(col("f_value") > 10, 3)
          .when(col("f_value") > 5, 2)
          .otherwise(1))
      // M score: the higher the cumulative spending, the higher the score
      .withColumn("m_score",
        when(col("m_value") > 100, 5)
          .when(col("m_value") > 60, 4)
          .when(col("m_value") > 30, 3)
          .when(col("m_value") > 10, 2)
          .otherwise(1))
      // Compare each score with its average (3.50, 1.64, 1.41): 1 if above, 0 otherwise.
      // Column names: "R是否大于均值" = "R above average" (likewise for F and M),
      // "RFM分数" = "RFM score", "用户类型" = "user type".
      .withColumn("R是否大于均值", when(col("r_score") > 3.50, 1).otherwise(0))
      .withColumn("F是否大于均值", when(col("f_score") > 1.64, 1).otherwise(0))
      .withColumn("M是否大于均值", when(col("m_score") > 1.41, 1).otherwise(0))
      // Pack the three flags into one three-digit code: R in the hundreds, F in the tens, M in the ones.
      .withColumn("RFM分数", col("R是否大于均值") * 100 + col("F是否大于均值") * 10 + col("M是否大于均值"))
      .withColumn("用户类型", when(col("RFM分数") === 111, "High-value user")
        .when(col("RFM分数") === 110, "Spending-potential user")
        .when(col("RFM分数") === 101, "Frequency-growth user")
        .when(col("RFM分数") === 100, "New user")
        .when(col("RFM分数") === 11, "High-value churn-risk user")
        .when(col("RFM分数") === 10, "Regular user")
        .when(col("RFM分数") === 1, "High-spend win-back user")
        .when(col("RFM分数") === 0, "Churned user"))
rfm_score.show()
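The thresholds 3.50, 1.64, and 1.41 are hard-coded and appear to be the mean scores from the author's run. They could instead be computed from the data, so that they stay correct if the dataset or the scoring bands change. A sketch of that idea with hypothetical scores; the column names r_above, f_above, m_above and rfm are illustrative, not the ones used in the post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, when}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bike-rfm-mean-sketch").getOrCreate()
import spark.implicits._

// Hypothetical score rows standing in for the r_score, f_score and m_score columns.
val scores = Seq(
  ("u1", 5, 1, 1),
  ("u2", 2, 3, 1),
  ("u3", 1, 1, 4)
).toDF("userid", "r_score", "f_score", "m_score")

// Compute the three mean scores in one pass.
val means = scores.agg(avg("r_score"), avg("f_score"), avg("m_score")).first()
val (rAvg, fAvg, mAvg) = (means.getDouble(0), means.getDouble(1), means.getDouble(2))

// Flag each score as above (1) or not above (0) its mean, then pack the flags into one code.
val flagged = scores
  .withColumn("r_above", when(col("r_score") > rAvg, 1).otherwise(0))
  .withColumn("f_above", when(col("f_score") > fAvg, 1).otherwise(0))
  .withColumn("m_above", when(col("m_score") > mAvg, 1).otherwise(0))
  .withColumn("rfm", col("r_above") * 100 + col("f_above") * 10 + col("m_above"))

flagged.orderBy("userid").show()
```

The resulting rfm code can then be mapped to user types exactly as in the chain above.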

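Dataset download: mobike_shanghai_sample_updated.zip (https://www.znzzi.com/usr/uploads/2022/04/3646884287.zip)

The complete source code is available on GitHub: https://github.com/z1zhang/SparkSQL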

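Copyright: 张子
Article link: https://www.znzzi.com/articles/274
All original articles are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share and adapt them, provided that you credit the source and do not use them for commercial purposes.

Last modified: December 23, 2023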

