Data sources: business systems, web crawlers, and files

graph LR
  DataSources[Data sources]-->Sync[Scheduled sync];
  Sync-->Hive[Hive data warehouse];

Scheduled sync tool: DataX
https://github.com/alibaba/DataX

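Initializing your personal development environment
Create the development directories. By convention, everyone develops their code under their own directory.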

# Create the personal directory
mkdir -p /zhiyun/lijinquan
cd /zhiyun/lijinquan
# Create the five working directories
# data   - data files
# jobs   - DataX job configuration files
# sql    - SQL scripts
# shell  - shell scripts
# python - Python scripts
mkdir data jobs sql shell python

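Storage conventions and requirements

Store one snapshot of the data per month and keep at least 24 monthly versions. Reruns within the current month overwrite that month's partition.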

graph LR
   Source[Source table]-->|DataX|ODS[Hive ODS];
   ODS-->P1[2025-05];
   ODS-->P2[2025-06];
   ODS-->P3["2025-07 (current month, overwritten)"];
   ODS-->P4[2025-08];
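
The retention rule can be made concrete with a couple of date helpers. The following is a minimal sketch, assuming GNU `date` and `yyyy-MM` partition values as in the diagram; `RETAIN_MONTHS`, `partition_for`, `oldest_kept_partition`, and `is_expired` are hypothetical names, not part of the original setup.

```bash
#!/usr/bin/env bash
# Sketch: derive the monthly partition value and the 24-month retention cutoff.
# Requires GNU date (for the -d option). All names here are hypothetical.
RETAIN_MONTHS=24

# Print the yyyy-MM partition value for a given date (default: today).
partition_for() {
    date -d "${1:-today}" +%Y-%m
}

# Print the oldest partition that must still be kept when $1 (yyyy-MM) is the current month.
oldest_kept_partition() {
    date -d "${1}-01 $((RETAIN_MONTHS - 1)) months ago" +%Y-%m
}

# Succeed when partition $1 is older than the retention window ending at month $2.
# yyyy-MM strings sort chronologically, so a string comparison is enough.
is_expired() {
    local part="$1" current="$2"
    [[ "$part" < "$(oldest_kept_partition "$current")" ]]
}
```

With `2025-07` as the current month, the oldest partition to keep is `2023-08`, so `2023-07` and anything earlier would be eligible for removal.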

Writing the DataX job configuration file

{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "jd",
                        "password": "123456",
                        "column": [
                            "*"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "yyj_nmpa"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://192.168.8.8:3306/jd"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://hdp:8020",
                        "fileType": "orc",
                        "path": "/zhiyun/yaoyinjin/tmp/nmpa",
                        "fileName": "nmpa.data",
                        "column": [
                            {"name": "id", "type": "int"},
                            {"name": "link", "type": "string"},
                            {"name": "title", "type": "string"},
                            {"name": "index_id", "type": "string"},
                            {"name": "categories", "type": "string"},
                            {"name": "date_", "type": "string"},
                            {"name": "article", "type": "string"}
                        ],
                        "writeMode": "truncate",
                        "fieldDelimiter": "\t"
                    }
                }
            }
        ]
    }
}
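
A malformed job file only fails once DataX starts, so a quick syntax check beforehand can save a round trip. The sketch below assumes `python3` is on the PATH and uses Python's built-in `json.tool` module; `validate_job` is a hypothetical helper name. It only checks JSON syntax, not whether the reader and writer parameters are valid for DataX.

```bash
#!/usr/bin/env bash
# Sketch: check that a job file is syntactically valid JSON before handing it to DataX.
# Assumes python3 is available; validate_job is a hypothetical helper.
validate_job() {
    # json.tool exits with a non-zero status when the file cannot be parsed.
    python3 -m json.tool "$1" > /dev/null
}
```

For example, `validate_job jobs/nmpa.json && echo "job file OK"`, where `jobs/nmpa.json` is an assumed file name.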

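Extracting the data

Note: the HDFS target directory must exist before the extraction runs. You can check it in the HDFS web UI at http://192.168.8.67:9870/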

hadoop fs -mkdir -p /zhiyun/yaoyinjin/tmp/nmpa
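
The command that actually runs the extraction is not shown above. As a rough sketch, the two steps could be wrapped in one script. The DataX install location (`DATAX_HOME`, defaulting here to `/opt/datax`), the job file path, and the `DRY_RUN` switch are all assumptions for illustration; adjust them to your environment.

```bash
#!/usr/bin/env bash
# Sketch: prepare the HDFS staging directory, then run the DataX job.
# DATAX_HOME, JOB_FILE and DRY_RUN are assumptions, not taken from the original setup.
DATAX_HOME="${DATAX_HOME:-/opt/datax}"
JOB_FILE="${JOB_FILE:-/zhiyun/lijinquan/jobs/nmpa.json}"
STAGING="/zhiyun/yaoyinjin/tmp/nmpa"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        echo "$*"
    else
        "$@"
    fi
}

extract() {
    # With -p, mkdir succeeds even when the directory already exists.
    run hadoop fs -mkdir -p "$STAGING" || return 1
    # datax.py takes the job file as its argument.
    run python "$DATAX_HOME/bin/datax.py" "$JOB_FILE"
}
```

Calling `DRY_RUN=1 extract` prints the two commands without touching the cluster, which is a cheap way to review paths before the first real run.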

Creating the Hive table

-- Create the ODS database
create database if not exists ods_yaoyinjin location "/zhiyun/yaoyinjin/ods";
-- Create the partitioned NMPA table
-- ODS tables should be external tables:
-- dropping a managed (internal) table deletes its data, dropping an external table does not
-- The column names follow the DataX writer config (date_, not data_)
-- ORC is a binary columnar format, so no field or line delimiters are declared
create external table if not exists ods_yaoyinjin.nmpa(
    id int,
    link string,
    title string,
    index_id string,
    categories string,
    date_ string,
    article string
) partitioned by (dt string)
stored as orc
location "/zhiyun/yaoyinjin/ods/nmpa";

Loading the data into the current month's partition

load data inpath "/zhiyun/yaoyinjin/tmp/nmpa/*" overwrite into table ods_yaoyinjin.nmpa partition(dt="2025-07");
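
The statement above hard-codes `2025-07`. For a recurring run the partition value has to follow the calendar, which could be sketched as below. The table and staging path come from the earlier steps; `load_sql`, `load_month`, and the `DRY_RUN` switch are hypothetical names, and the Hive CLI (`hive -e`) is assumed to be available on the host.

```bash
#!/usr/bin/env bash
# Sketch: load the staged files into the partition for the current month.
# load_sql, load_month and DRY_RUN are hypothetical; hive is assumed to be on PATH.
TABLE="ods_yaoyinjin.nmpa"
STAGING="/zhiyun/yaoyinjin/tmp/nmpa"

# Build the LOAD DATA statement for a yyyy-MM partition value.
load_sql() {
    local dt="$1"
    printf 'load data inpath "%s/*" overwrite into table %s partition(dt="%s");\n' \
        "$STAGING" "$TABLE" "$dt"
}

# Load into the given month, defaulting to the current one.
load_month() {
    local dt="${1:-$(date +%Y-%m)}"
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        load_sql "$dt"            # only show the statement
    else
        hive -e "$(load_sql "$dt")"
    fi
}
```

Because the statement uses `overwrite`, rerunning it within the same month replaces that month's partition, which matches the one-version-per-month rule.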

Verifying the data

select count(1) from ods_yaoyinjin.nmpa;
select * from ods_yaoyinjin.nmpa limit 1;
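
In an unattended run nobody reads the query output, so the row count needs to become a pass/fail signal. A minimal sketch, assuming the table created earlier (`ods_yaoyinjin.nmpa`) and the Hive CLI; `has_rows` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Sketch: turn a row count into a pass/fail check. has_rows is a hypothetical helper.
# Succeeds only when the argument is a positive integer.
has_rows() {
    [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 > 0 ))
}

# Example wiring (not executed here): count one partition with the Hive CLI in silent mode.
#   count="$(hive -S -e 'select count(1) from ods_yaoyinjin.nmpa where dt="2025-07"')"
#   has_rows "$count" || { echo "partition 2025-07 is empty" >&2; exit 1; }
```

Rejecting non-numeric input as well as zero means a failed query, which prints nothing, is also treated as a failed check.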

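Automated scheduling

Requirement: run this pipeline automatically every day at 4:00 a.m.
Scheduling platform: DolphinScheduler
Linux scheduled tasks: cron
DolphinScheduler web UI: http://192.168.8.67:12345/dolphinscheduler/ui/login
Username / password: admin / dolphinscheduler123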
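
For the cron option, the 4:00 a.m. requirement maps to the schedule fields `0 4 * * *`. The sketch below builds such an entry. The wrapper script `sync_nmpa.sh` and the log file path are hypothetical and stand in for whatever script chains the extract, load, and verify steps.

```bash
#!/usr/bin/env bash
# Sketch: build the crontab entry for a daily 04:00 run.
# The script and log paths are hypothetical.
SCRIPT="/zhiyun/lijinquan/shell/sync_nmpa.sh"
LOG="/zhiyun/lijinquan/data/sync_nmpa.log"

# Fields: minute hour day-of-month month day-of-week command
cron_entry() {
    printf '0 4 * * * %s >> %s 2>&1\n' "$SCRIPT" "$LOG"
}

cron_entry
# To install it, append the entry to the current user's crontab:
#   (crontab -l 2>/dev/null; cron_entry) | crontab -
```

Redirecting both stdout and stderr to a log file keeps a record of each nightly run, since cron itself shows no output.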