Hive Common Faults: A Multi-Case FAQ Handbook -- Project Summary (Handbook 1)

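Architecture overview

Name	Description
HiveServer	Multiple HiveServer instances can be deployed in a cluster for load sharing. HiveServer provides the Hive database service to external clients: it compiles the HQL statements that users submit and translates them into the corresponding Yarn jobs or HDFS operations, which perform the data extraction, transformation, and analysis.
MetaStore	Multiple MetaStore instances can be deployed in a cluster for load sharing. MetaStore provides the Hive metadata service and is responsible for reading, writing, maintaining, and modifying the structure and property information of Hive tables. It exposes a Thrift interface through which MetaStore clients such as HiveServer, Impala, and WebHCat access and operate on the metadata.
WebHCat	Multiple WebHCat instances can be deployed in a cluster for load sharing. WebHCat provides a REST interface through which Hive commands can be run and MapReduce jobs submitted.
Hive client	Includes the interactive command line Beeline, the JDBC driver for JDBC applications, the Python driver for Python applications, and the HCatalog JAR packages for MapReduce.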

[1] Common faults related to parameters and configuration:

The set command reports "Cannot modify xxx at runtime"

Symptom

Running the set command fails with the following error:

0: jdbc:hive2://xxx.xxx.xxx.xxx:21066/> set mapred.job.queue.name=QueueA;
Error: Error while processing statement: Cannot modify mapred.job.queue.name at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1)

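Solution

Option 1:

Log in to the cluster Manager page and choose "Cluster > Services > Hive > Configurations > All Configurations > Hive > Security".
Add the parameter you want to modify to the configuration item hive.security.authorization.sqlstd.confwhitelist.
Click Save and restart HiveServer for the change to take effect.

Option 2:

Log in to the cluster Manager page and choose "Cluster > Services > Hive > Configurations > All Configurations > Hive > Security".
Find the option hive.security.whitelist.switch, set it to OFF, then click Save and restart.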
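
For Option 1 it can help to check ahead of time whether a parameter name would pass the whitelist. The sketch below assumes the whitelist is a single regular expression that must match the whole parameter name, which is how Apache Hive documents hive.security.authorization.sqlstd.confwhitelist. The whitelist value used here is hypothetical, and Python's regex dialect is only an approximation of the Java regex that Hive actually evaluates.

```python
import re


def is_allowed(param, whitelist):
    """Return True if the whole parameter name matches the whitelist regex."""
    return re.fullmatch(whitelist, param) is not None


# Hypothetical whitelist value; the real value on a cluster will differ.
whitelist = r"hive\.exec\..*|mapred\.job\.queue\.name"

for name in ("mapred.job.queue.name", "mapred.reduce.tasks"):
    status = "allowed" if is_allowed(name, whitelist) else "blocked"
    print(name, "->", status)
```

A name reported as blocked would have to be added to the whitelist before set can change it in a session.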

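How to specify a queue when submitting a Hive job

Solution

Set the following parameter before running the statement:

set mapred.job.queue.name=QueueA;
select count(*) from rc;

After the job is submitted, the Yarn page shows that it went to queue QueueA. (Note: queue names are case-sensitive. Writing queueA or Queuea does not work.)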

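How to specify the output file compression format when loading data into a table

Solution

For a global setting, that is, compression for all tables, configure the following on the Manager page: hive.exec.compress.output=true. This must be set to true; otherwise the options below do not take effect.

For a session-level setting, run the following before the command:

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Hive currently supports the following compression codecs:

org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.Lz4Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec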
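
The session-level settings above can be generated for any of the listed codecs. This is a minimal sketch: the short names (snappy, gzip, and so on) and the helper name compression_settings are made up for this example, while the class names and the two properties come from the text.

```python
# Codec classes listed in the text, keyed by short names chosen for this sketch.
CODECS = {
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    "deflate": "org.apache.hadoop.io.compress.DeflateCodec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
}


def compression_settings(short_name):
    """Build the two session-level set statements for the chosen codec."""
    codec = CODECS[short_name.lower()]  # KeyError for a codec not in the list
    return [
        "set hive.exec.compress.output=true;",
        f"set mapred.output.compression.codec={codec};",
    ]


for statement in compression_settings("gzip"):
    print(statement)
```

The first statement is always emitted because the codec setting has no effect unless hive.exec.compress.output is true.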

desc output is truncated when the table description is too long

Solution

Set the maxWidth parameter to a large value when starting beeline (2000 in this example):

[[email protected] logs]# beeline --maxWidth=2000
scan complete in 3ms
Connecting to ……
Beeline version 1.1.0 by Apache Hive

Further reading: beeline -help lists many settings that control client-side display:

 -u <database url>               the JDBC URL to connect to
 -n <username>                   the username to connect as
 -p <password>                   the password to connect as
 -d <driver class>               the driver class to use
 -i <init file>                  script file for initialization
 -e <query>                      query that should be executed
 -f <exec file>                  script file that should be executed
 --hiveconf property=value       Use value for given property
 --color=[true/false]            control whether color is used for display
 --showHeader=[true/false]       show column names in query results
 --headerInterval=ROWS;          the interval between which heades are displayed
 --fastConnect=[true/false]      skip building table/column list for tab-completion
 --autoCommit=[true/false]       enable/disable automatic transaction commit
 --verbose=[true/false]          show verbose error messages and debug info
 --showWarnings=[true/false]     display connection warnings
 --showNestedErrs=[true/false]   display nested errors
 --numberFormat=[pattern]        format numbers using DecimalFormat pattern
 --force=[true/false]            continue running script even after errors
 --maxWidth=MAXWIDTH             the maximum width of the terminal
 --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
 --silent=[true/false]           be more silent
 --autosave=[true/false]         automatically save preferences
 --outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv]  format mode for result display
                                 Note that csv, and tsv are deprecated - use csv2, tsv2 instead
 --truncateTable=[true/false]    truncate table column when it exceeds length
 --delimiterForDSV=DELIMITER     specify the delimiter for delimiter-separated values output format (default: |)
 --isolation=LEVEL               set the transaction isolation level
 --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
 --socketTimeOut=n               socket connection timeout interval, in second. The default value is 300.

Data inserted after adding a column to a partitioned table shows as NULL

Symptom

Run the following commands:

create table test_table(
  col1 string,
  col2 string
)
PARTITIONED BY(p1 string)
STORED AS orc tblproperties('orc.compress'='SNAPPY');
alter table test_table add partition(p1='a');
insert into test_table partition(p1='a') select col1,col2 from temp_table;
alter table test_table add columns(col3 string);
insert into test_table partition(p1='a') select col1,col2,col3 from temp_table;

At this point, select * from test_table where p1='a' shows col3 as NULL in every row.

alter table test_table add partition(p1='b');
insert into test_table partition(p1='b') select col1,col2,col3 from temp_table;

In contrast, select * from test_table where p1='b' shows non-NULL values in col3. The partition created before the column was added keeps its old column list, while the partition created afterwards picks up the new one.

Solution

Add the cascade keyword when adding the column, so that the change is also applied to the existing partitions:

alter table test_table add columns(col3 string) cascade;
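
The behaviour above can be pictured with a toy model in which the table and every partition each keep their own column list. This is only an illustration of the idea, not how Hive is implemented; the class ToyTable and its methods are invented for this sketch.

```python
class ToyTable:
    """Toy model: the table and each partition keep their own column list."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.partition_columns = {}

    def add_partition(self, name):
        # A new partition starts from the table's current column list.
        self.partition_columns[name] = list(self.columns)

    def add_columns(self, new_columns, cascade=False):
        self.columns.extend(new_columns)
        if cascade:
            # With cascade, every existing partition is updated as well.
            for columns in self.partition_columns.values():
                columns.extend(new_columns)

    def read_row(self, partition, stored_row):
        # Columns unknown to the partition's own metadata read back as None.
        known = self.partition_columns[partition]
        return {c: (stored_row.get(c) if c in known else None) for c in self.columns}


row = {"col1": "x", "col2": "y", "col3": "z"}

plain = ToyTable(["col1", "col2"])
plain.add_partition("p1=a")
plain.add_columns(["col3"])           # no cascade: partition p1=a is not updated
plain.add_partition("p1=b")
print(plain.read_row("p1=a", row))    # col3 is None
print(plain.read_row("p1=b", row))    # col3 is 'z'

cascaded = ToyTable(["col1", "col2"])
cascaded.add_partition("p1=a")
cascaded.add_columns(["col3"], cascade=True)
print(cascaded.read_row("p1=a", row))  # col3 is 'z'
```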

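How to enable Hive on Spark and submit jobs to a specific queue

Solution

Set the following parameters before running the statement:

set hive.execution.engine = spark;
set spark.yarn.queue = testQueue;

After the job is submitted, the Yarn page shows that it went to queue testQueue. (Note: queue names are case-sensitive. Writing testqueue or TestQueue does not work.)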
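
The queue property depends on the execution engine, and the queue name must match exactly. The sketch below collects both points in one place. The helper names and the engine label "mr" are invented for this example, and the list of known queues is assumed to be supplied by the caller; only the property names come from the text.

```python
# Queue property per execution engine, as used in the examples in the text.
QUEUE_PROPERTY = {
    "mr": "mapred.job.queue.name",
    "spark": "spark.yarn.queue",
}


def check_queue(queue, known_queues):
    """Reject a queue name that does not match a known queue exactly."""
    if queue in known_queues:
        return queue
    hint = [q for q in known_queues if q.lower() == queue.lower()]
    raise ValueError(f"unknown queue {queue!r}; queue names are case-sensitive, did you mean {hint}?")


def queue_settings(engine, queue):
    """Build the set statements that route a session's jobs to a queue."""
    statements = []
    if engine == "spark":
        statements.append("set hive.execution.engine=spark;")
    statements.append(f"set {QUEUE_PROPERTY[engine]}={queue};")
    return statements


queue = check_queue("testQueue", ["default", "testQueue"])
for statement in queue_settings("spark", queue):
    print(statement)
```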

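How to set Spark application parameters for Hive on Spark

Solution

When Hive runs SQL through the Spark engine, Spark application parameters can be set with the set command before the SQL statement is executed.
- The memoryOverhead parameters are in MB by default. Do not include a unit in the set command, or an error is reported.
- Other parameters can be set in the same way.
- The settings apply to the current session only.

The Spark memory parameters are as follows.

-- executor memory size
set spark.executor.memory = 1g;
-- driver memory size
set spark.driver.memory = 1g;
-- executor overhead memory size, in MB
set spark.yarn.executor.memoryOverhead = 2048;
-- driver overhead memory size, in MB
set spark.yarn.driver.memoryOverhead = 1024;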
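
To see what these four values add up to, the sketch below computes the memory requested per container as heap plus overhead. It assumes the usual Spark on YARN accounting and ignores any rounding that the YARN scheduler applies to the request; the function names are invented for this example.

```python
def parse_mb(value):
    """Convert '1g', '512m' or a plain number (taken as MB) into megabytes."""
    text = str(value).strip().lower()
    if text.endswith("g"):
        return int(text[:-1]) * 1024
    if text.endswith("m"):
        return int(text[:-1])
    return int(text)


def container_request_mb(memory, overhead_mb):
    """Memory requested for one container: heap plus overhead."""
    return parse_mb(memory) + int(overhead_mb)


print(container_request_mb("1g", 2048))  # executor: 1024 + 2048 = 3072
print(container_request_mb("1g", 1024))  # driver: 1024 + 1024 = 2048
```

With the values from the text, the overhead is larger than the executor heap itself, which is worth keeping in mind when sizing queues.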

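How to set the number of map and reduce tasks

Solution

Controlling the number of reducers (the value below is the default; -1 means no fixed number is set, so Hive estimates it):

set mapred.reduce.tasks=-1;

Setting the amount of data each reducer processes (the default is about 256 MB):

set hive.exec.reducers.bytes.per.reducer=256000000;

Controlling the number of mappers. The number of map tasks cannot be set directly; it is controlled through the amount of data each map task loads (the value below is the default, about 256 MB):

set mapreduce.input.fileinputformat.split.maxsize=256000000;

Note: set these parameters for individual sessions only. The settings in the examples apply to the current session only.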
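
The effect of these two sizes can be estimated before running a job. The sketch below uses the common estimate of input size divided by bytes per reducer, clamped between 1 and a maximum. The cap of 1009 is an assumption taken from the Apache Hive default for hive.exec.reducers.max and may differ on a given cluster. The split estimate is rougher still, since real split counts also depend on file boundaries and the input format.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000, max_reducers=1009):
    """Estimate the reducer count when mapred.reduce.tasks is -1."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


def estimate_splits(input_bytes, split_maxsize=256_000_000):
    """Rough estimate of map tasks from the total input size and the split size."""
    return max(1, math.ceil(input_bytes / split_maxsize))


ten_gb = 10 * 10**9
print(estimate_reducers(ten_gb))                 # 40
print(estimate_reducers(ten_gb, 128_000_000))    # 79
print(estimate_splits(ten_gb, 128_000_000))      # 79
```

Halving either size roughly doubles the corresponding task count, which is the lever the set commands above provide.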

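Handling out-of-memory problems in MapReduce jobs

Symptom

A running MapReduce job hits out-of-memory problems of various kinds, for example java heap space, out of memory, or Full GC entries in the AM log.

Solution

First determine whether the overflow is in the map stage, the reduce stage, or the AM. In the task ID shown in the error message, "m" stands for map and "r" stands for reduce.

For the AM, check the GC output in the AM log. If it shows Full GC, the AM is running out of memory.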
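
Reading the stage off an error line can be automated. This sketch assumes the error message contains a Hadoop task attempt ID of the form attempt_<timestamp>_<job>_<m|r>_<task>_<attempt>; the sample log line is made up for illustration.

```python
import re

# Task attempt IDs carry "m" for map tasks and "r" for reduce tasks.
ATTEMPT_ID = re.compile(r"attempt_\d+_\d+_([mr])_\d+_\d+")


def failed_stage(log_line):
    """Return 'map' or 'reduce' for a line with a task attempt ID, else None."""
    match = ATTEMPT_ID.search(log_line)
    if match is None:
        return None
    return "map" if match.group(1) == "m" else "reduce"


# Hypothetical log line, not taken from a real cluster.
sample = "Error: Java heap space, attempt_1700000000000_0001_m_000003_0"
print(failed_stage(sample))  # map
```

A result of None means the line carries no task attempt ID, in which case the AM log is the next place to look.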

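The memory parameters for map, reduce, and the AM are listed below.

AM memory. Memory required by the AM:

set yarn.app.mapreduce.am.resource.mb=1536;

Maximum JVM memory for the AM:

set yarn.app.mapreduce.am.command-opts=-Xmx1024m;

Reduce memory. Memory required by each Reduce Task:

set mapreduce.reduce.memory.mb=4096;

Maximum JVM memory for each Reduce Task:

set mapreduce.reduce.java.opts=-Xmx3276M;

Map memory. Memory required by each Map Task:

set mapreduce.map.memory.mb=4096;

Maximum JVM memory for each Map Task:

set mapreduce.map.java.opts=-Xmx3276M;

The values above are all defaults; increase them as the workload requires, for example by doubling. These settings take effect at session level only. To set a parameter globally, write out the complete JVM option string rather than only the -Xmx part. For example, the full AM JVM options (which can be retrieved by running set yarn.app.mapreduce.am.command-opts; in the beeline command line) are:

yarn.app.mapreduce.am.command-opts=-Xmx1024m -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -verbose:gc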
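
When doubling the task memory, the container size and the JVM heap need to move together. The sketch below derives -Xmx as a fraction of the container size. The 0.8 ratio is inferred from the 4096 and 3276 pair in the text and is an assumption, not a rule stated by the source; the helper name is invented for this example.

```python
def task_memory_settings(task, container_mb, ratio=0.8):
    """Build matching container and heap settings for map or reduce tasks."""
    if task not in ("map", "reduce"):
        raise ValueError("task must be 'map' or 'reduce'")
    heap_mb = int(container_mb * ratio)  # leave headroom for non-heap memory
    return [
        f"set mapreduce.{task}.memory.mb={container_mb};",
        f"set mapreduce.{task}.java.opts=-Xmx{heap_mb}M;",
    ]


# The values from the text, then the same settings doubled.
for statement in task_memory_settings("map", 4096):
    print(statement)
for statement in task_memory_settings("map", 8192):
    print(statement)
```

The AM values in the text (1536 and 1024) use a different ratio, so pass ratio explicitly if the same helper is adapted for them.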
