Data catalogs can be extremely useful for understanding your data at a glance. Exposing your data catalog through a REST API makes it widely available for applications and users to pull from; more importantly, you can also manage the catalog by plugging it into other frameworks. Today we’ll review how to use a REST catalog for Apache Iceberg.
Since the current Apache Iceberg release (1.5.1) does not ship a standalone REST catalog server, we will use Gravitino, an open-source metadata lake and data catalog that provides a standard REST catalog service for Apache Iceberg. Throughout this process, you can also refer to Gravitino’s documentation. They have also released some helpful blog posts, such as “Gravitino: the unified metadata lake” and “Gravitino: Next-Gen REST Catalog for Iceberg, and Why You Need It”.
Today, however, we will go over how to leverage Gravitino’s Iceberg REST catalog service instead of building one from scratch.
1. Getting the Gravitino package
You can either download the latest binary package from the Gravitino GitHub releases, or use the following commands to check out and build from source locally:
```
git clone git@github.com:datastrato/gravitino.git
# git clones the repository into a local gravitino directory
cd gravitino
# Build the package; this may take a few minutes, so check the output for a successful build
./gradlew clean assembleDistribution -x test
ls distribution/gravitino-0.5.0-bin.tar.gz
```
To do the initial installation, you first clone the repository using git and then build the package using Gradle, a popular JVM build tool. Although everything you need is in the repository, make sure you have a supported Java version installed (JDK 8, 11, or 17). For more details on building from source, refer to the build doc.
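A quick sanity check before running Gradle (assuming java is on your PATH):

```
# Confirm a supported JDK (8, 11, or 17) is active before building
java -version
```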
After building, decompress the package:
```
cd distribution
tar xfz gravitino-0.5.0-bin.tar.gz
cd gravitino-0.5.0-bin
```
2. Launching the Iceberg REST catalog
```
./bin/gravitino.sh start
# Log dir doesn't exist, create /Users/user/Downloads/gravitino/distribution/gravitino-0.5.0-bin/logs
# Gravitino Server start success!
# Gravitino Server is running[PID:38047]

# Check the server process started; you should see a GravitinoServer process
jps | grep GravitinoServer

# Check the interface works as expected; the response should look like `{"defaults":{},"overrides":{}}`
curl http://127.0.0.1:9001/iceberg/v1/config
```
Running the Gravitino script starts the Gravitino server. If no errors are returned, you can assume the Iceberg REST catalog service is up and listening on local port 9001. If you want, you can add more configuration in conf/gravitino.conf. There are two log locations you can check: logs/gravitino-server.log and logs/gravitino-server.out. Some critical configuration items are listed below; you can also refer to the Iceberg REST catalog service document for details.
Configuration item | Description |
---|---|
gravitino.auxService.names | Must include iceberg-rest to start the Iceberg REST service. |
gravitino.auxService.iceberg-rest.catalog-backend | memory is the default and is mainly used for testing; use hive or jdbc for production. |
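For reference, here is a minimal sketch of what enabling the service looks like in conf/gravitino.conf, assuming the key = value format the file uses; the values shown are illustrative, so check the Iceberg REST catalog service document for the full set of options:

```
# Enable the Iceberg REST auxiliary service
gravitino.auxService.names = iceberg-rest
# memory is for testing only; switch to hive or jdbc for production
gravitino.auxService.iceberg-rest.catalog-backend = memory
```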
3. Using the Iceberg REST catalog
Using Spark as an example, we can configure the Spark catalog options to use the Gravitino Iceberg REST catalog under the name “rest”. This allows us to query the Iceberg catalog directly with Spark SQL; the same can be done with Trino, for instance (see the sketch at the end of this post).
- Run the spark-sql shell using the following command and set the following configurations. Note that in the example below we are using Spark 3.4, Scala 2.12, and Iceberg 1.3.1; you may need to adjust the Iceberg Spark runtime jar (provided by the Iceberg project) to match the versions in your environment.
```
./bin/spark-sql -v \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.rest.type=rest \
  --conf spark.sql.catalog.rest.uri=http://127.0.0.1:9001/iceberg/
```
For convenience, you can also place these settings in conf/spark-defaults.conf. This avoids specifying the configs each time you run Spark and keeps them consistent across your cluster. Here is an example of what this file may look like:
```
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.rest org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type rest
spark.sql.catalog.rest.uri http://127.0.0.1:9001/iceberg/
```
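With those defaults in place, launching the shell no longer requires the --conf flags; the --packages flag (or an equivalent jar already on the classpath) is still needed:

```
# The catalog settings are now picked up from conf/spark-defaults.conf
./bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
```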
- Explore your new catalog with Spark SQL:
```
USE rest;
CREATE DATABASE IF NOT EXISTS dml;
CREATE TABLE dml.test (id bigint COMMENT 'unique id') USING iceberg;
DESCRIBE TABLE EXTENDED dml.test;
INSERT INTO dml.test VALUES (1), (2);
SELECT * FROM dml.test;
```
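Because the catalog is served over HTTP, you can confirm the same state outside of Spark. A quick sketch using the standard Iceberg REST endpoints (the v1 paths under the /iceberg prefix configured above):

```
# List namespaces; dml should now appear in the response
curl http://127.0.0.1:9001/iceberg/v1/namespaces

# List tables in the dml namespace; test should be listed
curl http://127.0.0.1:9001/iceberg/v1/namespaces/dml/tables
```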
As these queries (and the REST responses) show, Spark is now working against the Iceberg REST catalog service. For more detail, check the Gravitino Iceberg REST catalog service document, or contact the developers on their community Slack channel.
Supported versions
We have verified the following versions in our testbed and in some community users’ deployments. If you have different versions of compute engines and want to verify the build, let the developers know by filing an issue on GitHub or messaging them on Slack.
Engine | Supported versions |
---|---|
Spark | 3.0 and above |
Flink | 1.13 and above |
Trino | 405 and above |
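As mentioned above, engines other than Spark can point at the same service. As a rough sketch for Trino, a catalog properties file might look like the following; the property names are taken from Trino’s Iceberg connector documentation and should be verified against your Trino version:

```
# etc/catalog/rest.properties (illustrative; verify property names against your Trino version)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://127.0.0.1:9001/iceberg/
```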