本文介绍使用External Catalog读取LAS数据。通过创建 LAS Catalog,您不需要执行数据导入就可以直接查询 LAS里的数据。
说明
目前读取LAS数据功能需要开白使用,如需操作请通过 提工单 的方式,联系火山引擎技术支持人员。
LAS(Lakehouse Analytics Service) 是一种数据湖解决方案,目标是简化和加速数据湖的建设和管理过程。通过提供的统一元数据管理、数据访问控制、元数据发现和数据集成等关键功能,帮助组织打破数据孤岛,加强湖仓体系建设,更好地组织、共享、访问和分析大规模企业数据。详细说明可参考LAS 官方文档。
如下以数据存储在TOS为例。
CREATE EXTERNAL CATALOG <catalog_name> [COMMENT <comment>] PROPERTIES ( "type" = "hive", "hive.metastore.type" = "las", "hive.metastore.uris" = "thrift://<LAS_HMS_IP>:48869", -- 可选:指定使用 LAS 下哪个用户自定义 catalog;不填则默认为 hive "hive.metastore.catalog.default" = "<las_catalog_name>", -- LAS 鉴权 "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "<las_region>", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", -- 仅 HNS 桶必填,FNS 桶可省略 "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", -- TOS(S3 协议)访问配置 "aws.s3.enable_ssl" = "false", "aws.s3.region" = "<tos_region>", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "<tos_s3_endpoint>", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" );
参数 | 必选 | 说明 |
|---|---|---|
catalog_name | 是 | 外挂到LAS Catalog的catalog名称,命名要求如下:
|
comment | 否 | 外挂到LAS Catalog的catalog描述。 |
hive.metastore.type | 是 | 固定为 |
hive.metastore.uris | 是 | LAS Hive Metastore 的 thrift 地址。格式为 |
hive.metastore.catalog.default | 否 | 用于 LAS 多 Catalog 场景下指定要使用哪个 LAS Catalog;缺省值为 |
las.hive.metastore.service | 是 | 固定为 |
las.hive.metastore.region | 是 | LAS Catalog 实例地域。例如 |
las.hive.metastore.access.key | 是 | 访问LAS元数据的AK。需要有LAS Catalog相关库表权限。具体可以参考LAS官网快速入门和权限管理给某个子用户授权,之后给该子用户创建ak/sk即可。该步骤请参考访问密钥。 |
las.hive.metastore.secret.key | 是 | 访问LAS元数据的SK。需要有LAS Catalog相关库表权限。具体可以参考LAS官网快速入门和权限管理给某个子用户授权,之后给该子用户创建ak/sk即可。该步骤可以参考访问密钥。 |
las.tos.endpoint | HNS 桶必填 / FNS 桶可选 | Proton 访问 TOS 用的 endpoint,例如 |
aws.s3.region | 是 | TOS桶实例所在的地区。参考地域和访问域名(Endpoint),一般和上面las.hive.metastore.region为一个地址。 |
aws.s3.endpoint | 是 | TOS桶实例所在的地区,参考TOS相关文档。注意这里需要填写s3 Endpoint。另外如果tos和lasformation是同一region建议用内网地址。 |
aws.s3.access_key | 是 | TOS使用的AK。 |
aws.s3.secret_key | 是 | TOS使用的SK。 |
aws.s3.enable_ssl | 否 | 是否启用 SSL,默认 |
aws.s3.use_instance_profile | 否 | 是否使用实例 Profile,默认 |
aws.s3.use_aws_sdk_default_behavior | 否 | 是否使用 AWS SDK 默认行为,默认 |
StarRocks 通过 PaimonConnector 配合 LAS 客户端,支持把 LAS 作为 Paimon 表的 Hive Metastore。
CREATE EXTERNAL CATALOG <catalog_name> PROPERTIES ( -- 让 StarRocks 选用 PaimonConnector "type" = "paimon", -- Paimon 的 Metastore 类型固定为 hive(不要写成 paimon) "paimon.catalog.type" = "hive", -- LAS Hive Metastore 的 thrift 地址 "hive.metastore.uris" = "thrift://<LAS_HMS_IP>:48869", -- 触发走 LAS 客户端的分支 "hive.metastore.type" = "las", -- 可选:指定使用 LAS 下哪个用户自定义 catalog;不填默认 hive "hive.metastore.catalog.default" = "<las_catalog_name>", -- Paimon 仓库路径(TOS) "paimon.catalog.warehouse" = "tos://<bucket>/<prefix>/", -- LAS 鉴权 "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "<las_region>", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", -- HNS 桶必填 "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", -- TOS(S3 协议)访问配置(走 Paimon 的 s3 FileIO) "aws.s3.enable_ssl" = "false", "aws.s3.region" = "<tos_region>", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" );
参数 | 必选 | 说明 |
|---|---|---|
| 是 | 固定为 |
| 是 | 固定为 |
| 是 | Paimon 仓库路径,必须为对象存储路径(例如 |
其他参数:其余 LAS / TOS / S3 相关参数与 Hive Catalog 参数说明 一致。 | ||
CREATE EXTERNAL CATALOG <catalog_name> PROPERTIES ( "type" = "iceberg", -- Iceberg 接入 LAS 的关键开关 "iceberg.catalog.type" = "las", -- LAS Hive Metastore 的 thrift 地址 "hive.metastore.uris" = "thrift://<LAS_HMS_IP>:48869", "hive.metastore.type" = "las", -- 可选:指定使用 LAS 下哪个用户自定义 catalog "hive.metastore.catalog.default" = "<las_catalog_name>", -- Iceberg 仓库路径(TOS) "iceberg.catalog.warehouse" = "tos://<bucket>/<prefix>/", -- LAS 鉴权 "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "<las_region>", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", -- HNS 桶必填 "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", -- TOS(S3 协议)访问配置 "aws.s3.enable_ssl" = "false", "aws.s3.region" = "<tos_region>", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" );
参数 | 必选 | 说明 |
|---|---|---|
| 是 | 固定为 |
| 是 | 固定为 |
| 是 | Iceberg 仓库路径,必须为对象存储路径(例如 |
其他参数:其余 LAS / TOS / S3 相关参数与 Hive Catalog 参数说明 一致。 | ||
通过DROP CATALOG可以删除指定的外挂到LAS Catalog的catalog。
DROP CATALOG <catalog_name>
创建好外挂到LAS Catalog的catalog后,我们可以通过它访问las数据。
设置当前catalog。假设我们创建的catalog_name为las。
set catalog las;
切换数据库。如下以sr_demo数据库为例,用户需将sr_demo替换为实际数据库。
USE sr_demo;
查看表结构。如下以sr_demo_test表为例,用户需将sr_demo_test替换为实际数据表。
DESC sr_demo_test; SHOW CREATE TABLE sr_demo_test;
查询表中数据。如下以sr_demo_test表为例,用户需将sr_demo_test替换为实际数据表。
SELECT * FROM sr_demo_test LIMIT 10;
导入数据到内表。
假设有一个 OLAP 表,在库olap_db下,表名为 olap_tbl。您可以这样来转换该表中的数据,并把数据导入到 StarRocks 中:
INSERT INTO default_catalog.olap_db.olap_tbl SELECT * FROM las.sr_demo.sr_demo_test;
删除Catalog。
DROP CATALOG las;
和开源Hive catalog逻辑一致,目前StarRocks会缓存元数据,参考StarRocks文档。如果遇到需要手动刷新缓存的场景,可以执行如下语句。
refresh external table <table_name>;
下面给出 几个典型组合下的"Spark 写 + StarRocks 读"完整 SQL,覆盖 (Paimon / Hive / Iceberg) × (HNS / FNS) × (用户自定义 catalog / 默认 catalog / Filesystem 模式) 的常见组合。
示例中的
tos://hns-test-xxx/test1/、<your_access_key>、<your_secret_key>均为占位符,请替换为您的实际取值。
Spark 侧(构造数据)
-- 默认参数 SET spark.sql.catalog.spark_paimon_catalog=org.apache.paimon.spark.SparkCatalog; SET spark.sql.catalog.spark_paimon_catalog.metastore=hive; SET spark.sql.storeAssignmentPolicy=ansi; -- 自定义 warehouse SET spark.sql.catalog.spark_paimon_catalog.warehouse=tos://hns-test-xxx/test1/; USE spark_paimon_catalog; -- 指定 LAS 自定义 catalog SET spark.hadoop.hive.metastore.catalog.default=my_hns_catalog; CREATE DATABASE IF NOT EXISTS test_paimon_db; CREATE TABLE IF NOT EXISTS test_paimon_db.test_tb2(id INT, name STRING); INSERT INTO TABLE test_paimon_db.test_tb2 VALUES(1, 'paimon'),(2, 'paimon');
StarRocks 侧(查询)
DROP CATALOG IF EXISTS my_hns_catalog; CREATE EXTERNAL CATALOG my_hns_catalog PROPERTIES ( "type" = "paimon", "paimon.catalog.type" = "hive", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "hive.metastore.type" = "las", "hive.metastore.catalog.default" = "my_hns_catalog", "paimon.catalog.warehouse" = "tos://hns-test-xxx/test1/", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.enable_ssl" = "false", "aws.s3.region" = "cn-beijing", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG my_hns_catalog; SHOW DATABASES; USE test_paimon_db; SELECT * FROM test_tb2;
无需 Hive Metastore,直接以 Paimon Filesystem Catalog 形式访问 HNS 桶。
DROP CATALOG IF EXISTS paimon_fs_catalog; CREATE EXTERNAL CATALOG paimon_fs_catalog PROPERTIES ( "type" = "paimon", "paimon.catalog.type" = "filesystem", "paimon.catalog.warehouse" = "tos://hns-test-xxx/test1/", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.region" = "cn-beijing", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG paimon_fs_catalog; SHOW DATABASES; USE test_paimon_db; SHOW TABLES; SELECT * FROM test_tb;
hive) 与 11.1 用法基本一致,区别在于 hive.metastore.catalog.default 取默认值 hive,并由 Spark 一侧把数据写入 LAS 默认 catalog。
Spark 侧(构造数据)
SET spark.sql.catalog.spark_paimon_catalog=org.apache.paimon.spark.SparkCatalog; SET spark.sql.catalog.spark_paimon_catalog.metastore=hive; SET spark.sql.storeAssignmentPolicy=ansi; SET spark.sql.catalog.spark_paimon_catalog.warehouse=tos://hns-test-xxx/test1/; USE spark_paimon_catalog; CREATE DATABASE IF NOT EXISTS demo_paimon_default; CREATE TABLE IF NOT EXISTS demo_paimon_default.t2(id INT, name STRING); INSERT INTO TABLE demo_paimon_default.t2 VALUES(1, 'hehe'),(777, 'haha');
StarRocks 侧(查询)
DROP CATALOG IF EXISTS paimon_default_catalog; CREATE EXTERNAL CATALOG paimon_default_catalog PROPERTIES ( "type" = "paimon", "paimon.catalog.type" = "hive", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "hive.metastore.type" = "las", "hive.metastore.catalog.default" = "hive", "paimon.catalog.warehouse" = "tos://hns-test-xxx/test1/", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.enable_ssl" = "false", "aws.s3.region" = "cn-beijing", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG paimon_default_catalog; USE demo_paimon_default; SELECT * FROM t2;
非 LAS、非 TOS 的对照场景:访问 EMR 集群自建 Hive Metastore 上的 HDFS 表。
DROP CATALOG IF EXISTS hive_hdfs_catalog; CREATE EXTERNAL CATALOG hive_hdfs_catalog PROPERTIES ( "type" = "hive", "hive.metastore.uris" = "thrift://<emr-master>:9083", "hadoop.username" = "doris", "fs.defaultFS" = "hdfs://<emr-master>:8320" ); SET CATALOG hive_hdfs_catalog; USE hive_demo_db; SELECT * FROM hive_demo_t1;
Spark 侧(构造数据)
SET spark.hadoop.hive.metastore.catalog.default=my_hns_catalog; CREATE DATABASE IF NOT EXISTS hive_lll_fns_1; CREATE TABLE IF NOT EXISTS hive_lll_fns_1.hive_lll_fns_table_1(id INT, name STRING); INSERT INTO TABLE hive_lll_fns_1.hive_lll_fns_table_1 VALUES(1, 'hehe'),(2, 'haha'),(3, 'mary');
StarRocks 侧(查询)
DROP CATALOG IF EXISTS hive_hns_user_catalog; CREATE EXTERNAL CATALOG hive_hns_user_catalog PROPERTIES ( "type" = "hive", "hive.metastore.type" = "las", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "hive.metastore.catalog.default" = "my_hns_catalog", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.enable_ssl" = "true", "aws.s3.region" = "cn-beijing", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG hive_hns_user_catalog; USE hive_lll_fns_1; SELECT * FROM hive_lll_fns_table_1;
FNS 桶可以不传 las.tos.endpoint,此时走 s3a;如果同时传了 las.tos.endpoint,则统一走 Proton。
DROP CATALOG IF EXISTS hive_fns_default_catalog; CREATE EXTERNAL CATALOG hive_fns_default_catalog PROPERTIES ( "type" = "hive", "hive.metastore.type" = "las", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", -- 可选:FNS 桶若希望统一走 Proton 通路则保留下面这一行;若希望走 s3a 则删除 "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.enable_ssl" = "false", "aws.s3.region" = "cn-beijing", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG hive_fns_default_catalog; USE demo_hive_db; SELECT * FROM demo_hive_t;
非 LAS 的 Iceberg 接入对照示例:
DROP CATALOG IF EXISTS iceberg_hms_catalog; CREATE EXTERNAL CATALOG iceberg_hms_catalog PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hive", "hive.metastore.uris" = "thrift://<emr-master>:9083", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.region" = "cn-beijing", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG iceberg_hms_catalog; USE iceberg_demo; SELECT * FROM demo_table;
Spark 侧(构造数据)
SET spark.sql.catalog.spark_iceberg_catalog=org.apache.iceberg.spark.SparkCatalog; SET spark.sql.catalog.spark_iceberg_catalog.type=hive; SET spark.sql.storeAssignmentPolicy=ansi; USE spark_iceberg_catalog; SET spark.hadoop.hive.metastore.catalog.default=my_hns_catalog; CREATE DATABASE IF NOT EXISTS test_iceberg_db; CREATE TABLE IF NOT EXISTS test_iceberg_db.test_tb(id INT, name STRING); INSERT INTO TABLE test_iceberg_db.test_tb VALUES(1, 'hehe'),(2, 'haha');
StarRocks 侧(查询)
DROP CATALOG IF EXISTS iceberg_hns_user_catalog; CREATE EXTERNAL CATALOG iceberg_hns_user_catalog PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "las", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "hive.metastore.type" = "las", "hive.metastore.catalog.default" = "my_hns_catalog", "iceberg.catalog.warehouse" = "tos://hns-test-xxx/test1/", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", "las.tos.endpoint" = "tos-cn-beijing.ivolces.com", "aws.s3.enable_ssl" = "false", "aws.s3.region" = "cn-beijing", "aws.s3.use_instance_profile" = "false", "aws.s3.use_aws_sdk_default_behavior"= "false", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG iceberg_hns_user_catalog; USE test_iceberg_db; SELECT * FROM test_tb;
DROP CATALOG IF EXISTS iceberg_fns_default_catalog; CREATE EXTERNAL CATALOG iceberg_fns_default_catalog PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "las", "hive.metastore.uris" = "thrift://lakeformation.las.cn-beijing.ivolces.com:48869", "iceberg.catalog.warehouse" = "tos://<bucket>/<prefix>/", "las.hive.metastore.service" = "catalog_service", "las.hive.metastore.region" = "cn-beijing", "las.hive.metastore.access.key"= "<your_access_key>", "las.hive.metastore.secret.key"= "<your_secret_key>", "aws.s3.enable_ssl" = "false", "aws.s3.region" = "cn-beijing", "aws.s3.endpoint" = "tos-s3-cn-beijing.ivolces.com", "aws.s3.access_key" = "<your_access_key>", "aws.s3.secret_key" = "<your_secret_key>" ); SET CATALOG iceberg_fns_default_catalog; USE iceberg_demo_db; SELECT COUNT(*) FROM iceberg_demo_db.demo_table; SELECT * FROM iceberg_demo_db.demo_table LIMIT 10;