Data sources: business systems, web crawlers, and files
graph LR
Sources[Data sources]-->Sync[Scheduled sync];
Sync-->Hive[Hive data warehouse];
Scheduled sync tool: DataX
https://github.com/alibaba/DataX
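Initialize your personal development environment
Create a development directory. By convention, everyone develops their code under their own directory (replace lijinquan with your own username).

    mkdir -p /zhiyun/lijinquan
    cd /zhiyun/lijinquan
    mkdir data jobs sql shell python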
Storage conventions and requirements
Store one snapshot of the historical data per month, and retain at least 24 monthly versions.
graph LR
Source[Source table]-->|DataX|Hive-ODS;
Hive-ODS-->2025-05;
Hive-ODS-->2025-06;
Hive-ODS-->|overwritten within the current month|2025-07;
Hive-ODS-->2025-08;
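The retention rule can be sketched in shell as follows. This is a minimal sketch rather than part of the original workflow: the helper names `months_ago` and `drop_expired_sql` are hypothetical, the table name is the one used in the Hive section of this note, and it assumes that "24 months" means the current month plus the 23 months before it. The functions only print a HiveQL statement; nothing is executed.

```shell
# Hypothetical helpers for the 24-month retention rule.

# months_ago YYYY-MM N -> prints the month that lies N months before YYYY-MM.
# The body runs in a subshell, so its variables do not leak to the caller.
months_ago() (
  ym="$1"; n="$2"
  y="${ym%-*}"; m="${ym#*-}"
  m="${m#0}"   # strip a leading zero so "08" is not read as an octal number
  total=$(( y * 12 + m - 1 - n ))
  printf '%04d-%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
)

# drop_expired_sql YYYY-MM -> prints a HiveQL statement that drops every
# partition older than the 24-month window ending at YYYY-MM.
drop_expired_sql() {
  oldest_kept="$(months_ago "$1" 23)"
  echo "alter table ods_yaoyinjin.nmpa drop if exists partition (dt < '${oldest_kept}');"
}
```

For example, `drop_expired_sql "$(date +%Y-%m)"` prints the statement for the current month. Because the table is external, dropping a partition normally removes only the metadata, so the files under the partition directory may need separate cleanup.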
Writing the job configuration file

    {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                },
                "errorLimit": {
                    "record": 0,
                    "percentage": 0.02
                }
            },
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                            "username": "jd",
                            "password": "123456",
                            "column": [
                                "*"
                            ],
                            "connection": [
                                {
                                    "table": [
                                        "yyj_nmpa"
                                    ],
                                    "jdbcUrl": [
                                        "jdbc:mysql://192.168.8.8:3306/jd"
                                    ]
                                }
                            ]
                        }
                    },
                    "writer": {
                        "name": "hdfswriter",
                        "parameter": {
                            "defaultFS": "hdfs://hdp:8020",
                            "fileType": "orc",
                            "path": "/zhiyun/yaoyinjin/tmp/nmpa",
                            "fileName": "nmpa.data",
                            "column": [
                                {"name": "id", "type": "int"},
                                {"name": "link", "type": "string"},
                                {"name": "title", "type": "string"},
                                {"name": "index_id", "type": "string"},
                                {"name": "categories", "type": "string"},
                                {"name": "date_", "type": "string"},
                                {"name": "article", "type": "string"}
                            ],
                            "writeMode": "truncate",
                            "fieldDelimiter": "\t"
                        }
                    }
                }
            ]
        }
    }
Extract the data
Note: the HDFS target directory must be created before the extraction runs. HDFS web UI: http://192.168.8.67:9870/

    hadoop fs -mkdir -p /zhiyun/yaoyinjin/tmp/nmpa
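With the target directory in place, the extraction itself is started through DataX's `datax.py` launcher. The following is a minimal sketch under stated assumptions: the install location `/opt/datax`, the job file name `jobs/nmpa.json`, and the helper name `run_datax` are all hypothetical and need to be adjusted to your environment.

```shell
# Assumed DataX install location; override by exporting DATAX_HOME.
DATAX_HOME="${DATAX_HOME:-/opt/datax}"

# run_datax JOB_FILE -> runs the DataX job described by JOB_FILE.
# With DRY_RUN=1 it only prints the command instead of running it.
run_datax() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "python ${DATAX_HOME}/bin/datax.py $1"
  else
    python "${DATAX_HOME}/bin/datax.py" "$1"
  fi
}
```

A dry run such as `DRY_RUN=1 run_datax jobs/nmpa.json` is a cheap way to check the paths before the real extraction.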




Create the Hive table

    create database if not exists ods_yaoyinjin location "/zhiyun/yaoyinjin/ods";

    create external table if not exists ods_yaoyinjin.nmpa(
        id int,
        link string,
        title string,
        index_id string,
        categories string,
        date_ string,
        article string
    )
    partitioned by (dt string)
    row format delimited fields terminated by "\t"
    lines terminated by "\n"
    stored as orc
    location "/zhiyun/yaoyinjin/ods/nmpa";

Load the data into the current month's partition

    load data inpath "/zhiyun/yaoyinjin/tmp/nmpa/*" overwrite into table ods_yaoyinjin.nmpa partition(dt="2025-07");
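Since the partition value changes every month, the load statement can be generated rather than hand-edited. This is a minimal sketch: the helper name `load_sql` is hypothetical, and it assumes the `hive` command-line client is available on the machine that runs it.

```shell
# load_sql YYYY-MM -> prints the HiveQL that moves the staged files into
# the partition for that month, overwriting whatever the partition held.
load_sql() {
  echo "load data inpath '/zhiyun/yaoyinjin/tmp/nmpa/*' overwrite into table ods_yaoyinjin.nmpa partition(dt='$1');"
}

# Example (not executed here): load into the partition for the current month.
#   hive -e "$(load_sql "$(date +%Y-%m)")"
```

Re-running the load within the same month overwrites that month's partition, which matches the "current month is overwritten" rule in the storage requirements.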

Verify the data

    select count(1) from ods_yaoyinjin.nmpa;
    select * from ods_yaoyinjin.nmpa limit 1;


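Automated scheduling
Requirement: run this workflow automatically at 4:00 AM every day.
Scheduling platform: DolphinScheduler
Linux scheduled tasks: cron
http://192.168.8.67:12345/dolphinscheduler/ui/login
Username and password: admin / dolphinscheduler123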
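For the cron option, the 4:00 AM requirement maps to a single crontab entry. The sketch below is illustrative only: the wrapper script `nmpa_sync.sh`, the log file path, and the helper name `install_cron` are hypothetical, and the wrapper is assumed to run the extract and load steps in order.

```shell
# Hypothetical wrapper script and log file.
SYNC_SCRIPT="/zhiyun/yaoyinjin/shell/nmpa_sync.sh"
SYNC_LOG="/zhiyun/yaoyinjin/data/nmpa_sync.log"

# Cron fields: minute hour day-of-month month day-of-week.
# "0 4 * * *" means 04:00 every day.
CRON_ENTRY="0 4 * * * /bin/bash ${SYNC_SCRIPT} >> ${SYNC_LOG} 2>&1"

# install_cron -> rewrites the current user's crontab: any existing line that
# mentions the wrapper script is removed, then the entry above is appended.
install_cron() {
  { crontab -l 2>/dev/null | grep -vF "${SYNC_SCRIPT}"; echo "${CRON_ENTRY}"; } | crontab -
}
```

Running `install_cron` once registers the job, and `crontab -l` shows the result. In DolphinScheduler the same schedule would be configured in the workflow's timing settings instead.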
