## What is Hive

Hive is a data warehouse tool built on top of Hadoop. In more detail:

- **Features**: Hive maps structured data files onto database tables and provides full SQL query support, translating SQL statements into MapReduce jobs for execution. Users who know SQL can query data conveniently, while developers familiar with MapReduce can plug in custom mappers and reducers for complex analysis work.
- **Advantages**: a low learning curve — simple MapReduce statistics can be expressed with SQL-like statements, with no need to develop a dedicated MapReduce application, which makes Hive well suited to data warehouse analytics.
- **Use cases**: commonly used for analysis where latency is not critical. Because Hive stores data on Hadoop's HDFS and computes with MapReduce, it can handle very large datasets and is widely used for statistics over massive structured logs.
- **Differences from a database**:
  - A database typically serves online applications and supports updating or deleting individual rows. It uses schema-on-write: loading data is slow, but queries are fast.
  - Hive does not support row-level operations, transactions, or indexes. It uses schema-on-read, which suits unstructured data or data whose schema is unknown at load time, and it is geared toward batch processing and analysis of massive datasets.
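To make the schema-on-read point concrete, a minimal sketch (table and file names are illustrative):

```sql
-- schema-on-read: loading is essentially a file move; nothing is validated at load time
create table t_demo (id int, name string)
row format delimited fields terminated by ',';

-- this succeeds even if the file's contents don't match the declared schema
load data local inpath '/tmp/any_file.csv' into table t_demo;

-- fields that can't be parsed simply come back as NULL at query time
select * from t_demo;
```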
## Configure the MySQL yum repository (online installation)

```bash
wget https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
yum localinstall mysql57-community-release-el7-11.noarch.rpm
rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
vim /etc/yum.repos.d/mysql-community.repo
# change baseurl to https://mirrors.cloud.tencent.com/mysql/yum/mysql-5.7-community-el7-x86_64/
yum install -y mysql-community-server
```
## Install MySQL (local installation)

```bash
# install from the tarball package
cd /usr/local
tar -zxvf mysql-5.7.22-linux-glibc2.12-x86_64.tar
mv mysql-5.7.22-linux-glibc2.12-x86_64 mysql-5.7.22
ln -s mysql-5.7.22 mysql
groupadd mysql
useradd -g mysql mysql
cd /usr/local/mysql
mkdir data
chown -R mysql:mysql .
./bin/mysqld --initialize --user=mysql --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data

# give everything under mysql/ except data/ back to root
chown -R root .
# the mysql user only needs to own the files under data/
chown -R mysql data

# 5. copy the startup script
cp support-files/mysql.server /etc/init.d/mysqld
chmod 755 /etc/init.d/mysqld
cp bin/my_print_defaults /usr/bin/

# 6. edit the startup script
vi /etc/init.d/mysqld
# set:
#   basedir=/usr/local/mysql-5.7.22/
#   datadir=/usr/local/mysql-5.7.22/data
#   port=3306

# add mysql to PATH so the mysql command works from anywhere
vi ~/.bash_profile
# append the line below, then press ESC and :wq to save
export PATH=$PATH:/usr/local/mysql-5.7.22/bin
# apply immediately
source ~/.bash_profile

# 7. edit the MySQL configuration
vi /etc/my.cnf
# contents:
#   [mysqld]
#   basedir = /usr/local/mysql
#   datadir = /usr/local/mysql/data
#   socket = /tmp/mysql.sock
#   user = mysql
#   tmpdir = /tmp
#   symbolic-links=0
#   [mysqld_safe]
#   log-error = /usr/local/mysql/data/error.log
#   pid-file = /usr/local/mysql/data/mysql.pid
#   !includedir /etc/my.cnf.d

# 8. start MySQL
service mysqld start
# if startup fails, delete everything under /usr/local/mysql-5.7.22/data, rerun
# ./bin/mysqld --initialize --user=mysql --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data
# and start again

# 9. log in, change the initial password, and fix remote-access privileges
mysql -uroot -p
```

```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'root';
use mysql;
UPDATE user SET host='%' WHERE user='root';
flush privileges;
```

```bash
# start on boot
chkconfig --add mysqld
chkconfig mysqld on
chkconfig --list mysqld
# mysqld  0:off 1:off 2:on 3:on 4:on 5:on 6:off
```
## Configure MySQL

```bash
systemctl start mysqld
systemctl enable mysqld
systemctl daemon-reload

# disable password strength validation
vim /etc/my.cnf
# add: validate_password=OFF
systemctl restart mysqld

# find the generated temporary password and log in
grep 'temporary password' /var/log/mysqld.log
mysql -uroot -p
```

```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'root';
use mysql;
UPDATE user SET host='%' WHERE user='root';
flush privileges;

-- database that will hold the Hive metastore
create database metastore DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
```
## Install Hive

```bash
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
vim /etc/profile.d/my_env.sh
# add:
export HIVE_HOME=/opt/module/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

source /etc/profile
```
## Basic Hive configuration

```bash
# move the conflicting slf4j binding out of the way
mv $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.jar $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.bak
# copy the MySQL JDBC driver into Hive's lib directory
cp /opt/software/mysql-connector-java-5.1.27-bin.jar $HIVE_HOME/lib
```
## Configure hive-site.xml

Go to the /opt/module/apache-hive-3.1.2-bin/conf/ directory and create a new file named hive-site.xml:
```xml
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop100:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.exec.mode.local.auto</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.map.child.java.opts</name>
        <value>-Xmx2048m</value>
    </property>
</configuration>
```
## Initialize the Hive metastore schema

```bash
schematool -initSchema -dbType mysql -verbose
```
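If initialization succeeds, the `metastore` database created earlier should now hold Hive's internal tables. A quick sanity check from the MySQL client:

```sql
use metastore;
show tables;        -- expect names such as DBS, TBLS, COLUMNS_V2, PARTITIONS
select * from DBS;  -- the 'default' warehouse database should already be registered
```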
## Tune MapReduce

```bash
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
```

Add the following configuration:

```xml
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560M</value>
</property>
```
## Configure Beeline

Configure core-site.xml so that any node can access Hadoop:

```xml
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
```
### Start hiveserver2

```bash
hiveserver2
# or keep it running in the background
nohup hiveserver2 &
```
### Login commands

```bash
beeline -u jdbc:hive2://localhost:10000 -n root -p 123456   # the password can be anything
beeline -u jdbc:hive2://localhost:10000 -n root -p 123456 -e 'show tables;'
```
## Logging in from DBeaver

- Obtain hadoop-common-3.1.3.jar
- Obtain hive-jdbc-3.1.2-standalone.jar
- Add a Hive database connection
## Hive data statements

### Metadata inspection statements

```sql
show databases;
show databases like 'db_hive*';
desc database db_hive;
desc database extended db_hive;
show tables;
desc dept;
desc extended emp;
desc formatted emp;
show partitions emp;
```
## Creating a database

```sql
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path];
```
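A usage sketch of the syntax above (database name and path are illustrative):

```sql
create database if not exists db_hive
comment 'demo database'
location '/user/hive/warehouse/db_hive.db';
```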
## Creating a table

```sql
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
 [SORTED BY (col_name [ASC | DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT DELIMITED
 [FIELDS TERMINATED BY char]
 [COLLECTION ITEMS TERMINATED BY char]
 [MAP KEYS TERMINATED BY char]
 [LINES TERMINATED BY char]]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
```
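As an illustration of these clauses, a sketch of an external, partitioned, tab-delimited table (table name and location are illustrative, not from the original):

```sql
create external table if not exists dept_ext(
    deptno int comment 'department number',
    dname  string,
    loc    string
)
comment 'external department table'
partitioned by (day string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/dept_ext';
```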
## Loading data

```sql
-- load from the local filesystem or HDFS
load data [local] inpath '/path/to/data' [overwrite] into table student
[partition (partcol1=val1, ...)];

-- or put the file directly into the table's HDFS directory
dfs -put /opt/module/hive/datas/student.txt /user/atguigu/hive;

-- insert literal rows
insert into table student_par values (1,'wangwu'),(2,'zhaoliu');

-- insert the result of a query
insert overwrite table student_par select id, name from student;
```
## Exporting data

```sql
-- write query results to a local directory
insert overwrite local directory '/path/to/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
dql_command;

-- copy the table's file out of HDFS
dfs -get /user/hive/warehouse/student/student.txt /opt/module/datas/export/student3.txt;

-- redirect hive CLI output into a file
bin/hive -e 'select * from default.student;' > /opt/module/hive/datas/export/student4.txt;

-- export the table's data and metadata to HDFS
export table default.student to '/user/hive/warehouse/export/student';
```
## The SELECT statement

```sql
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
 | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
```
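A worked query combining several of these clauses, using the `emp` table from the metadata examples above (the `sal` column is an assumption):

```sql
select deptno, count(*) as cnt
from emp
where sal > 1000
group by deptno
order by cnt desc
limit 10;
```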
## Hive complex data types

### array : `array<element_type>`

- all elements share one data type
- elements are stored in order
- indexed by an integer subscript
```sql
-- access an element by index
select a_score[0] from student2;
-- build an array literal
select array(val1, val2) from student;
```
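Arrays pair naturally with `explode`, which turns one array into one row per element. A small sketch against the same `student2` table:

```sql
select name, score
from student2
lateral view explode(a_score) t as score;
```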
### struct : `struct<name:type, name:type>`

- a predefined number of fields
- a predefined field order
- field names are predefined
- fields may have different data types
```sql
-- access a field by name
select s_score.chinese from student2;
-- build a struct literal with named fields
select named_struct('key1', value1, 'key2', value2) from student;
```
### map : `map<key_type, value_type>`

- a standard dictionary type
- keys are user-defined
- key and value data types may differ
- no limit on the number of entries
```sql
-- access a value by key
select m_score['语文'] from student2;
-- build a map literal
select map('key1', value1, 'key2', value2) from student;
```
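For reference, here is one way the `student2` table used throughout these examples might be declared. The exact column layout is an assumption, but it shows how the three complex types and their delimiters fit together:

```sql
create table student2(
    name    string,
    a_score array<int>,
    s_score struct<chinese:int, math:int>,
    m_score map<string, int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
```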
## Hive built-in functions

```sql
show functions;
desc function upper;
desc function extended upper;
```
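A few calls to common built-ins, for orientation (all three are standard Hive functions):

```sql
select upper('hive');             -- HIVE
select concat_ws('-', 'a', 'b');  -- a-b
select nvl(null, 0);              -- 0
```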
## Connecting to Hive from Python

### Install Python on Linux

```bash
wget https://www.python.org/ftp/python/3.9.10/Python-3.9.10.tgz
tar -zxvf Python-3.9.10.tgz -C /opt/module/
yum install -y openssl-devel libffi-devel bzip2-devel gcc gcc-c++ wget
cd /opt/module/Python-3.9.10
./configure --enable-optimizations
make altinstall
```
### Configure environment variables

```bash
vim /etc/profile.d/my_env.sh
# add:
PATH=$PATH:/opt/python39/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin

source /etc/profile
cp libpython3.9.a /usr/lib64/
ln -s /opt/module/Python-3.9.10/python /usr/bin/python3
```
### Install the PyHive library

```bash
pip3 install -i https://mirrors.aliyun.com/pypi/simple/ thrift
pip3 install -i https://mirrors.aliyun.com/pypi/simple/ thrift-sasl
pip3 install -i https://mirrors.aliyun.com/pypi/simple/ PyHive
pip3 install -i https://mirrors.aliyun.com/pypi/simple/ PyM
```

```python
from pyhive import hive

conn = hive.connect(host='localhost', port=10000, username='root', database='db_hive')
cursor = conn.cursor()
sql = 'show tables'
cursor.execute(sql)
print(cursor.fetchall())
```
## Partitioned tables

### Partitioning use cases

### Oracle partition types

- range partitioning (range)
- list partitioning (list)
- hash partitioning (hash)
- composite partitioning (subpartition)
### Oracle partitioning: range

```sql
CREATE TABLE ORDER_ACTIVITIES (
    ORDER_ID     NUMBER(7) NOT NULL,
    ORDER_DATE   DATE,
    TOTAL_AMOUNT NUMBER,
    CUSTOMER_ID  NUMBER(7),
    PAID         CHAR(1)
)
PARTITION BY RANGE (ORDER_DATE) (
    PARTITION ORD_ACT_PART01 VALUES LESS THAN (TO_DATE('01-MAY-2003','DD-MON-YYYY')) TABLESPACE ORD_TS01,
    PARTITION ORD_ACT_PART02 VALUES LESS THAN (TO_DATE('01-JUN-2003','DD-MON-YYYY')) TABLESPACE ORD_TS02,
    PARTITION ORD_ACT_PART03 VALUES LESS THAN (MAXVALUE) TABLESPACE ORD_TS03
);
```
### Oracle partitioning: list

```sql
CREATE TABLE PROBLEMS (
    PROBLEM_ID  NUMBER(7) NOT NULL PRIMARY KEY,
    CUSTOMER_ID NUMBER(7) NOT NULL,
    STATUS      VARCHAR2(20)
)
PARTITION BY LIST (STATUS) (
    PARTITION PROB_ACTIVE   VALUES ('ACTIVE') TABLESPACE PROB_TS01,
    PARTITION PROB_INACTIVE VALUES ('INACTIVE','UNKNOWN') TABLESPACE PROB_TS02
);
```
### Oracle partitioning: hash

```sql
CREATE TABLE HASH_TABLE (
    COL NUMBER(8),
    INF VARCHAR2(100)
)
PARTITION BY HASH (COL) (
    PARTITION PART01 TABLESPACE HASH_TS01,
    PARTITION PART02 TABLESPACE HASH_TS02,
    PARTITION PART03 TABLESPACE HASH_TS03
);
```
### Oracle partitioning: composite

```sql
CREATE TABLE SALES (
    PRODUCT_ID VARCHAR2(5),
    SALES_DATE DATE,
    SALES_COST NUMBER(10),
    STATUS     VARCHAR2(20)
)
PARTITION BY RANGE (SALES_DATE) SUBPARTITION BY LIST (STATUS) (
    PARTITION P1 VALUES LESS THAN (TO_DATE('2003-01-01','YYYY-MM-DD')) TABLESPACE rptfact2009 (
        SUBPARTITION P1SUB1 VALUES ('ACTIVE') TABLESPACE rptfact2009,
        SUBPARTITION P1SUB2 VALUES ('INACTIVE') TABLESPACE rptfact2009
    ),
    PARTITION P2 VALUES LESS THAN (TO_DATE('2003-03-01','YYYY-MM-DD')) TABLESPACE rptfact2009 (
        SUBPARTITION P2SUB1 VALUES ('ACTIVE') TABLESPACE rptfact2009,
        SUBPARTITION P2SUB2 VALUES ('INACTIVE') TABLESPACE rptfact2009
    )
);
```
### Oracle partitioning: partition maintenance

```sql
-- add a partition
ALTER TABLE SALES ADD PARTITION P3 VALUES LESS THAN (TO_DATE('2003-06-01','YYYY-MM-DD'));
-- add a subpartition to a partition
ALTER TABLE SALES MODIFY PARTITION P3 ADD SUBPARTITION P3SUB1 VALUES ('COMPLETE');
-- drop a partition
ALTER TABLE SALES DROP PARTITION P3;
-- drop a subpartition
ALTER TABLE SALES DROP SUBPARTITION P4SUB1;
-- exchange a partition with a non-partitioned table
ALTER TABLE table_name EXCHANGE PARTITION partition_name WITH TABLE nonpartition_name;
```
### Hive partitioning: creating a partitioned table

In Hadoop, a Hive partitioned table is stored under a specific directory structure: each partition maps to its own subdirectory, whose name carries the value of the partition column, and the data files live inside the corresponding partition directory.

```sql
create table dept_partition(
    deptno int,
    dname  string,
    loc    string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
```
### Hive partitioning: partition operations

```sql
-- load into a specific partition
load data local inpath '/opt/module/hive/datas/dept_20200401.log'
into table dept_partition partition(day='20200401');

-- insert into a partition from a query
insert into table log_list_6 partition(dat='20221231')
select * from log_list_tmp;

-- multi-insert: write several partitions in one pass over the source
from student
insert overwrite table student partition(month='201707')
select id, name where month='201707'
insert overwrite table student partition(month='201706')
select id, name where month='201706';

-- list partitions
show partitions dept_partition;

-- add partitions
alter table dept_partition add partition(day='20200404');
alter table dept_partition add partition(day='20200405') partition(day='20200406');

-- drop a partition
alter table dept_partition drop partition(day='20200406');

-- inspect table structure
desc formatted dept_partition;

-- relocate or rename a partition
ALTER TABLE table_name PARTITION (dt='2008-08-08') SET LOCATION "new location";
ALTER TABLE table_name PARTITION (dt='2008-08-08') RENAME TO PARTITION (dt='20080808');
```
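One payoff of this layout is partition pruning: filtering on the partition column restricts the scan to that partition's directory instead of the whole table, for example:

```sql
select * from dept_partition where day = '20200401';
```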
### Example: partitioned supermarket table

```sql
create table supermarket_p (
    id        string,
    ord_id    string comment 'order ID',
    ord_date  string comment 'order date',
    exch_date string comment 'ship date',
    exch_type string comment 'ship mode',
    cust_id   string comment 'customer ID',
    cust_name string comment 'customer name',
    d_type    string comment 'segment',
    city      string comment 'city',
    prov      string comment 'province/region',
    country   string comment 'country',
    area      string comment 'region',
    pro_id    string comment 'product ID',
    type1     string comment 'category',
    type2     string comment 'sub-category',
    pro_name  string comment 'product name',
    sales     float comment 'sales',
    count1    int comment 'quantity',
    discount  float comment 'discount',
    profit    float comment 'profit'
)
partitioned by (c_type1 string)
row format delimited fields terminated by '\t';
```
### Dynamic partition settings

```sql
-- enable dynamic partitioning (on by default)
set hive.exec.dynamic.partition=true;
-- non-strict mode: allow every partition column to be filled dynamically
set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions across all MR nodes (default 1000)
set hive.exec.max.dynamic.partitions=1000;
-- maximum dynamic partitions per MR node (set to at least the number of distinct partition values)
set hive.exec.max.dynamic.partitions.pernode=100;
-- maximum number of HDFS files the whole MR job may create (default 100000)
set hive.exec.max.created.files=100000;
-- whether to raise an error when an empty partition is generated
set hive.error.on.empty.partition=false;
-- enable regex column specifications such as `(dt|hr)?+.+`
set hive.support.quoted.identifiers=none;
```
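With these settings in place, a dynamic-partition insert lets the partition value come from the data itself. A minimal sketch reusing `dept_partition` from above (the staging table `dept_staging` is hypothetical):

```sql
set hive.exec.dynamic.partition.mode=nonstrict;

-- the last column of the select feeds the dynamic partition column `day`
insert into table dept_partition partition(day)
select deptno, dname, loc, day
from dept_staging;
```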
## Bucketed tables

Partitioning offers a convenient way to isolate data and optimize queries, but not every dataset can be partitioned sensibly. For a table or a partition, Hive can organize the data further into buckets — a finer-grained division of the data.

Bucketing is another technique for decomposing a dataset into more manageable pieces.

Partitioning targets the data's storage path; bucketing targets the data files themselves.
### Notes on bucketed tables

**Bucketing strategy**: Hive hashes the value of the bucketing column and takes the remainder modulo the number of buckets to decide which bucket each record lands in.

- ==Set the number of reducers to -1 so the job decides how many it needs, or set it to at least the number of buckets==
- ==Load data into bucketed tables from HDFS, to avoid local-file-not-found problems==
- ==Do not use local mode==
### Creating a bucketed table

```sql
create table stu_bucket(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

-- let the job pick the reducer count itself (or set it to at least the bucket count, 4 here)
set mapreduce.job.reduces=-1;

-- load from HDFS, not the local filesystem
load data inpath '/student.txt' into table stu_bucket;
```
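Bucketing also enables efficient sampling: `tablesample` reads a subset of buckets instead of scanning the whole table.

```sql
-- read the first of the 4 buckets, sampling on the bucketing column
select * from stu_bucket tablesample(bucket 1 out of 4 on id);
```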
## Hive sort keywords

### Sorting statements
```sql
-- order by: one global sort, executed by a single reducer
select * from student2 order by id;

-- sort by: sort within each reducer only
select * from student2 sort by class_name desc;

set mapreduce.job.reduces=15;

-- distribute by: pick the reducer by column, then sort within each reducer
select * from student2 distribute by class_name sort by id desc;

insert overwrite local directory '/root/student2/'
row format delimited fields terminated by '\t'
select * from student2_b distribute by sex sort by chinese desc;

-- cluster by: distribute by + sort by on the same column (ascending only)
select * from student2 cluster by class_name;
```
## Cleaning logs with awk

```bash
cat 2021-05-20.log | \
awk -F "\"-\"" '{split($1, arr, " ");
    split(substr(arr[4],2), dd, ":");
    split(dd[1], ee, "/");
    print arr[1]"\t"ee[1]"-"ee[2]"-"ee[3]" "dd[2]":"dd[3]":"dd[4]"\t"arr[7]"\t"$2}' | \
awk -F "\t" '{"date -d \""$2"\" +%Y%m%d%H%M%S" | getline d; print $1"\t"d"\t"$3"\t"$4}' | \
awk -F "\t" '{print $1"\t"$2"\t"$3"\t"(index($4,"Windows")?"Windows":(index($4,"Linux")?"Linux":"Mac"))"\t"(index($4,"Chrome")?"Chrome":(index($4,"Version")?"Safari":(index($4,"Firefox")?"Firefox":"Opera")))}' \
    > new_2021-05-20.log
```