• Hadoop Website Analyst • Data Processing Platform • Data Analysis & Preprocessing • Mapper Execution Program • Reducer Execution Program • Processing the Full Dataset on the Hadoop Cluster • Reference

Hadoop Website Analyst

The task is to analyze website access data and, grouped by date, rank each day's URLs by visit count in descending order.

Data Processing Platform

The analysis uses Hadoop 3.4.1 deployed in pseudo-distributed mode inside a Docker container.

Data Analysis & Preprocessing

Opening the Comma-Separated Values (CSV) dataset, a short excerpt looks like this:
```csv
5813192,1688,109495,A8469148F4D543D74ACED3F4A2A115EC,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
5813194,1688,109495,A8469148F4D543D74ACED3F4A2A115EC,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106
5813195,1688,109126,B02879C663DFBB315CD8E357C460F8B1,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106
5813174,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,/tzjingsai/1628.jhtml
5813175,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,http://www.tipdm.org/tzjingsai/1628.jhtml
5813176,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_106
5813148,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:48,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105
5813149,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:48,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106
5813147,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:47,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106
5813116,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,/tzjingsai/1628.jhtml
5813117,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,http://www.tipdm.org/tzjingsai/1628.jhtml
5813119,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_106
5813094,1692,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,https://www.tipdm.org/bdrace/news/20200908/1692.html
5813095,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,/tzjingsai/1628.jhtml
5813096,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,http://www.tipdm.org/tzjingsai/1628.jhtml
5813099,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_104
5813048,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106
5813049,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
5813061,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106
5813034,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105
5813036,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106
5813038,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105
5813010,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:21,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_104
5813012,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:21,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106
```
A few preliminary observations can be drawn from the excerpt:

• The date is in column 5.
• The URL is in column 6.
• Some URLs contain a bdrace/ directory and some do not; this is unrelated to whether the scheme is HTTP or HTTPS.
• The dataset contains incomplete URLs; those pages are all implemented as JHTML.
• Column 2 corresponds to the name of the .html page.
• Entries pointing to the same .html page have identical values in columns 2–4, which are presumably identifiers of some kind.
• Column 1 appears to be a record number.

To rank website visits by date, we mainly need columns 5 and 6, i.e. the date and the URL. To keep the results faithful, the URLs are left untouched apart from handling missing data and stripping the query string and fragment; the analysis otherwise uses the raw data as-is. All programs run via the Hadoop Streaming tool, which decouples the data-processing framework from the programming language; the Mapper Execution (hereafter Mapper) and Reducer Execution (hereafter Reducer) below are both implemented in Python:

Mapper Execution Program

```python
#!/usr/bin/env python3
import sys
import re

for line in sys.stdin:
    fields = line.strip().split(',')
    try:
        # Extract the date portion of column 5
        time = fields[4].split(' ')[0]
        # The shuffle sorts keys lexicographically, so zero-pad the date
        time = '/'.join(f'{s:0>2}' for s in time.split('/'))
        url = fields[5]
    except IndexError:
        # Skip incomplete records that are missing the date or URL column
        continue
    # Strip the query and fragment -- the dataset surprisingly uses a full-width '#'?
    url = re.sub(r'(?<=html).*$', '', url)
    print(f'{time}\t{url}')
```
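As a side note, the query-and-fragment stripping could also be done with the standard library's urllib.parse instead of a regex. The sketch below is a hypothetical alternative, not part of the project. One caveat: urlsplit only recognizes the ASCII '#' as a fragment delimiter, so the full-width hash that appears in this dataset would slip through, which is one reason to simply truncate everything after 'html' as the Mapper does.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query_fragment(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

# Works for absolute and scheme-less URLs alike
print(strip_query_fragment(
    'https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106'))
# → https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
print(strip_query_fragment('/tzjingsai/1628.jhtml'))
# → /tzjingsai/1628.jhtml
```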
The Mapper reads CSV records from standard input and emits data to standard output in the form `date<TAB>URL`. Testing the Mapper program:
cat test.csv | ./mapexe.py
An excerpt of the output:
```text
2020/12/21	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
2020/12/21	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
2020/12/21	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
2020/12/21	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/12/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
2020/01/21	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html
```
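The zero-padding step in the Mapper exists because the Hadoop shuffle compares keys as plain strings. A small self-contained illustration, using made-up dates, of why unpadded dates sort incorrectly:

```python
raw = ['2020/12/31', '2020/2/1', '2020/11/5']

# String comparison puts '2020/2/...' after '2020/12/...': wrong chronologically
print(sorted(raw))
# → ['2020/11/5', '2020/12/31', '2020/2/1']

# Zero-padding each field restores chronological order under string sorting
padded = ['/'.join(f'{s:0>2}' for s in d.split('/')) for d in raw]
print(sorted(padded))
# → ['2020/02/01', '2020/11/05', '2020/12/31']
```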

Reducer Execution Program

The Reducer Execution program (hereafter Reducer) reads the date-sorted tab-separated values (TSV) from standard input, appends each value (the URL, as delivered by the Hadoop shuffle) to a per-date list, and then ranks each list's URLs by visit count. Finally it emits `date (ascending)<TAB>URL and its visit count (descending)` to standard output.
```python
#!/usr/bin/env python3
import sys
from collections import Counter, defaultdict

# Collect the shuffled (date, URL) pairs into per-date lists
url_list = defaultdict(list)
for line in sys.stdin:
    k, v = line.strip().split('\t', 1)
    url_list[k].append(v)

# The input is already sorted by date, so iteration order is ascending;
# within each date, rank the URLs by visit count in descending order
for time, urls in url_list.items():
    for url, count in Counter(urls).most_common():
        print(time, url, count, sep='\t')
```
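Counter.most_common does the per-day ranking here; a quick standalone check of its behavior (counts descend, and since Python 3.7 tied items keep their first-insertion order, because most_common uses a stable sort over an insertion-ordered dict):

```python
from collections import Counter

# Made-up paths, just to show the ranking behavior
c = Counter(['/a.html', '/b.html', '/a.html', '/c.html', '/b.html'])
print(c.most_common())
# → [('/a.html', 2), ('/b.html', 2), ('/c.html', 1)]
```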
A simulated test of the Reducer:
cat test.csv | ./mapexe.py | sort | ./reducexe.py
An excerpt of the output:
```text
2020/12/31	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html	124
2020/12/31	https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html	117
2020/12/31	https://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html	20
2020/12/31	https://www.tipdm.org/bdrace/wq1jszx/20201222/1731.html	11
2020/12/31	/tzjingsai/1628.jhtml	9
2020/12/31	http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html	9
2020/12/31	http://www.tipdm.org/tzjingsai/1628.jhtml	8
2020/12/31	https://www.tipdm.org/bdrace/jsgsmcm/20190402/1564.html	8
2020/12/31	https://www.tipdm.org/bdrace/jn3jszx/20201223/1732.html	7
2020/12/31	https://www.tipdm.org/bdrace/tabyxlw/20201202/1727.html	4
2020/12/31	http://www.tipdm.org/bdrace/jljingsai/20190809/1605.html	3
2020/12/31	/tj/661.jhtml	2
2020/12/31	http://www.tipdm.org/bdrace/jsgsmcm/20190402/1564.html	2
2020/12/31	http://www.tipdm.org/tj/1266.jhtml	2
2020/12/31	http://www.tipdm.org/tj/535.jhtml	2
2020/12/31	http://www.tipdm.org/tj/578.jhtml	2
2020/12/31	http://www.tipdm.org/ts/661.jhtml	2
```
As the excerpt shows, the program walks the dates in ascending order and ranks each day's website visits in descending order.
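The whole pipeline can also be simulated in-process, which is handy for unit testing before touching the cluster. The sketch below uses synthetic records with the same column layout as the dataset (IDs, hashes, and URLs are made up), and sorted() stands in for the Hadoop shuffle:

```python
import re
from collections import Counter, defaultdict

# Synthetic records mimicking the dataset's column layout
records = [
    '1,1688,1,AAAA,2020/12/31 11:57,https://example.org/a.html?cName=ral_106',
    '2,1688,1,AAAA,2020/12/31 11:58,https://example.org/a.html',
    '3,1628,2,BBBB,2020/12/31 11:59,https://example.org/b.html',
    '4,1628,2,BBBB,2020/1/2 09:00,https://example.org/b.html',
]

def map_record(rec: str) -> str:
    """Mirror of the Mapper: emit 'zero-padded date<TAB>URL without query/fragment'."""
    cols = rec.split(',')
    date = '/'.join(f'{s:0>2}' for s in cols[4].split(' ')[0].split('/'))
    url = re.sub(r'(?<=html).*$', '', cols[5])
    return f'{date}\t{url}'

# sorted() plays the role of the shuffle's key sort
pairs = sorted(map_record(r) for r in records)

# Mirror of the Reducer: group by date, then rank URLs per day
per_day = defaultdict(list)
for pair in pairs:
    day, url = pair.split('\t', 1)
    per_day[day].append(url)

for day, urls in per_day.items():
    for url, count in Counter(urls).most_common():
        print(day, url, count, sep='\t')
```

This prints the 2020/01/02 entry first, then 2020/12/31's two URLs with counts 2 and 1, matching the shape of the real Reducer output.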

Processing the Full Dataset on the Hadoop Cluster

The project directory structure is as follows:
```text
.
├── init.sh
└── site_visitors
    ├── mapexe.py
    ├── reducexe.py
    ├── site_visitors.sh
    ├── test.csv
    └── visitors.csv
```
The initialization script init.sh below uploads the data file and generates the Hadoop launch script in the target directory:
```bash
#!/bin/bash
project_name=site_visitors
job_file=visitors.csv

cat <<-SCRIPT > ${project_name}/${project_name}.sh
#!/bin/bash -x
# create folder
hdfs dfs -mkdir -p input_${project_name}
hdfs dfs -put -f ${job_file}
# start job
mapred streaming \
    -input ${job_file} \
    -output output_${project_name} \
    -mapper mapexe.py \
    -reducer reducexe.py \
    -file mapexe.py \
    -file reducexe.py
SCRIPT

# transmit file
sudo docker cp ${project_name} hadoop_single_node:/home/singlenode/
sudo docker exec hadoop_single_node sudo chown -R singlenode:singlenode ${project_name}
sudo docker exec hadoop_single_node sudo chmod -R +x ${project_name}/*.py ${project_name}/*.sh
```
Start the Docker container, change into the target directory, and run the script to create the Hadoop job. Partial output:
```text
singlenode@singlenode:~/site_visitors$ ls
mapexe.py  reducexe.py  site_visitors.sh  test.csv  visitors.csv
singlenode@singlenode:~/site_visitors$ ./site_visitors.sh
+ hdfs dfs -mkdir -p input_site_visitors
+ hdfs dfs -put -f visitors.csv
+ mapred streaming -input visitors.csv -output output_site_visitors -mapper mapexe.py -reducer reducexe.py -file mapexe.py -file reducexe.py
2024-12-10 05:10:21,662 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapexe.py, reducexe.py] [] /tmp/streamjob14897470953513239277.jar tmpDir=null
2024-12-10 05:10:22,763 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-12-10 05:10:22,900 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-12-10 05:10:22,901 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2024-12-10 05:10:22,918 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2024-12-10 05:10:23,175 INFO mapred.FileInputFormat: Total input files to process : 1
2024-12-10 05:10:23,251 INFO mapreduce.JobSubmitter: number of splits:1
2024-12-10 05:10:23,447 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1924643391_0001
2024-12-10 05:10:23,448 INFO mapreduce.JobSubmitter: Executing with tokens: []

[...]
```
Inspecting the job's output files on HDFS shows that the result is consistent with the local test run, but was produced by distributed computation:
```text
singlenode@singlenode:~/site_visitors$ hdfs dfs -ls output_site_visitors
Found 2 items
-rw-r--r--   3 singlenode supergroup          0 2024-12-10 05:10 output_site_visitors/_SUCCESS
-rw-r--r--   3 singlenode supergroup     992232 2024-12-10 05:10 output_site_visitors/part-00000
singlenode@singlenode:~/site_visitors$ hdfs dfs -head output_site_visitors/part-00000
2020/05/13	http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html	413
2020/05/13	/tzjingsai/1628.jhtml	61
2020/05/13	http://www.tipdm.org/tzjingsai/1628.jhtml	57
2020/05/13	https://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html	39
2020/05/13	http://www.tipdm.org/bdrace/tzbstysj/20200228/1637.html	31
2020/05/13	http://www.tipdm.org/tj/1615.jhtml	26
2020/05/13	http://www.tipdm.org/bdrace/tzjingsai/20181226/1544.html	25
2020/05/13	http://www.tipdm.org/bdrace/tzbszjs/20200203/1632.html	23
2020/05/13	http://www.tipdm.org/tj/661.jhtml	20
2020/05/13	http://www.tipdm.org/bdrace/tzqhjmd/20190604/1583.html	15
2020/05/13	http://www.tipdm.org/bdrace/jljingsai/20190809/1605.html	13
2020/05/13	/tj/1615.jhtml	13
2020/05/13	http://www.tipdm.org/bdrace/tzbstysj/20200410/1640.html	12
2020/05/13	http://www.tipdm.org/bdrace/tzbjszx/20200203/1639.html	12
2020/05/13	http://www.tipdm.org/ts/661.jhtml	12
2020/05/13	/tj/661.jhtml	10
2020/05/13	http://www.tipdm.org/bdrace/jljingsai/20181008/1488.html	6
```
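As a final sanity check, the per-day descending order can be verified programmatically. The helper below is a hypothetical utility, not part of the project; it would be fed lines from part-00000 (here, two lines copied from the output above):

```python
from collections import defaultdict

def counts_descending_per_day(lines):
    """Return True if, within every date, the visit counts never increase."""
    per_day = defaultdict(list)
    for line in lines:
        date, _url, count = line.rstrip('\n').split('\t')
        per_day[date].append(int(count))
    return all(v == sorted(v, reverse=True) for v in per_day.values())

sample = [
    '2020/05/13\thttp://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html\t413',
    '2020/05/13\t/tzjingsai/1628.jhtml\t61',
]
print(counts_descending_per_day(sample))
# → True
```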

Reference

• Apache Hadoop 3.4.1 – Hadoop: Setting up a Single Node Cluster
• Apache Hadoop 3.4.1 – HDFS Commands Guide
• Apache Hadoop MapReduce Streaming – Hadoop Streaming
Create: Thu Dec 12 21:49:50 2024 Last Modified: Thu Dec 12 21:49:50 2024