March 2026
Every production lakehouse eventually develops the same set of problems. Tables accumulate small files until scan performance degrades. Row-level deletes pile up as Iceberg V2 delete debt until reads slow to a crawl. Custom functions scatter across notebooks and engine-specific scripts with no ownership, no versioning, and no governance. Lightweight operational queries (freshness checks, pipeline monitors, SLA watchers) keep expensive query engines running around the clock even when nobody is doing real analytics.
These aren't edge cases. They are the normal operating cost of running a heterogeneous data platform at scale. And until now, solving them has meant custom Spark jobs, manual inspection, or third-party observability products bolted on top of a catalog that was never designed to help.
Apache Gravitino 1.2 changes that posture. This release moves Gravitino from a passive metadata catalog into an operational layer for the lakehouse. Three capabilities define the shift:
- Table Maintenance Service for proactive partition-level table health
- UDF management so transformation logic becomes a governed catalog asset
- Server-side scan planning so lightweight workloads like AI queries can run in Python without a query engine
## Proactive Table Health: The Table Maintenance Service
The most expensive moment in lakehouse operations is usually the one nobody planned for: a Spark job that ran fine last week suddenly takes three times as long because a partition has accumulated 800 small files from a week of streaming ingest. Or a dashboard query that started returning slowly because six months of row-level updates have built up a delete file backlog that every read has to merge at query time.
The real problem isn't that tables degrade; it's that degradation is invisible until it hurts, and the tools available to fix it operate at the table level rather than the partition level. Running a blanket OPTIMIZE across a twelve-partition table to fix two sick partitions is expensive and imprecise. You don't need to compact June if June is healthy. You need to compact January and March, and you need to compact them in the right order.
Gravitino 1.2 introduces the Table Maintenance Service (TMS), a statistics-driven recommendation engine that operates at partition granularity. TMS collects per-partition statistics, evaluates them against a configurable policy, scores and ranks candidates by urgency, and dispatches compaction jobs only for the partitions that need it most.
### How It Works
TMS separates two distinct degradation signals that require different treatment:
File fragmentation: streaming ingest and frequent small writes produce files that deviate far from the optimal 128MB target size. The signal is a high mean squared deviation from that target across data files within a partition — the larger the value, the more fragmented the partition.
Delete file count: row-level updates and deletes in Iceberg V2 append delete files rather than rewriting data files. Over time, queries must merge these delete files against data files at read time. The signal is a growing delete file count per partition.
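The fragmentation signal can be sketched as a few lines of Python. The function name and the 128MB target follow the description above; the exact statistic TMS computes may differ:

```python
def datafile_size_mse(file_sizes_mb, target_mb=128):
    """Mean squared deviation of data file sizes from the target size.

    Larger values mean a more fragmented partition. Illustrative only.
    """
    return sum((s - target_mb) ** 2 for s in file_sizes_mb) / len(file_sizes_mb)

# A well-compacted partition scores low...
print(datafile_size_mse([120, 130, 128]))
# ...while a partition full of 4MB streaming files scores high.
print(datafile_size_mse([4] * 800))
```

A partition of near-target files stays well under any sensible threshold, while hundreds of tiny streaming files push the value into the tens of thousands.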
A TMS policy captures the trigger logic and scoring logic as governed expressions:
```
trigger-expr : custom-datafile_size_mse > 5000 || custom-delete_file_count > 0
score-expr   : custom-datafile_size_mse + custom-delete_file_count * 1000
```
The 1000x weight on delete files is a deliberate policy decision: delete debt compounds at query time because every read must merge data and delete files, whereas small files degrade performance more gradually. Your data platform team sets this weight. You can tune it without touching any code.
### A Concrete Example
Consider a table with twelve monthly partitions in different states of health:
| Partition | Condition | File Fragmentation (MSD) | Delete Files | Triggered | Score |
|---|---|---|---|---|---|
| 2024-06 | Healthy, well compacted | 180 | 0 | No | n/a |
| 2024-01 | Small file accumulation from streaming ingest | 28,000 | 0 | Yes | 28,000 |
| 2024-03 | Delete debt from row-level updates | 3,200 | 18 | Yes | 21,200 |
TMS evaluates each partition against the trigger expression. June passes quietly: a mean squared deviation of 180 is well below threshold, no delete files. January and March both trigger. The recommender scores them: January at 28,000 (pure small file problem), March at 21,200 (lower file variance but eighteen delete files weighted at 1,000 each). January goes first. The healthy partition never gets touched.
The shift TMS enables is fundamental: not "compact this table" but "compact these specific partitions, in this order, because here is the evidence." The policy is a governed catalog object, auditable, version-controlled, and shareable across teams. The statistics live alongside your table metadata. The decisions are reproducible.
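The trigger-and-rank flow above can be sketched in a few lines of Python. Field names mirror the policy expressions; this is an illustration of the logic, not Gravitino's implementation:

```python
def evaluate_partition(stats, mse_threshold=5000, delete_weight=1000):
    # Mirrors trigger-expr: mse > 5000 || delete_file_count > 0
    triggered = (stats["datafile_size_mse"] > mse_threshold
                 or stats["delete_file_count"] > 0)
    # Mirrors score-expr: mse + delete_file_count * 1000
    score = stats["datafile_size_mse"] + stats["delete_file_count"] * delete_weight
    return triggered, score

partitions = {
    "2024-06": {"datafile_size_mse": 180, "delete_file_count": 0},
    "2024-01": {"datafile_size_mse": 28_000, "delete_file_count": 0},
    "2024-03": {"datafile_size_mse": 3_200, "delete_file_count": 18},
}

# Keep only triggered partitions, ranked by score, most urgent first.
ranked = sorted(
    ((name, score) for name, stats in partitions.items()
     for triggered, score in [evaluate_partition(stats)] if triggered),
    key=lambda item: -item[1],
)
print(ranked)  # [('2024-01', 28000), ('2024-03', 21200)]
```

June never enters the ranking; January and March come out in exactly the order the table above predicts.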
Neither Databricks OPTIMIZE nor Snowflake Auto-Clustering operates with this level of precision. Both treat the table as the unit of maintenance. TMS treats the partition as the unit of maintenance, which is the right granularity for production lakehouse operations.
TMS in 1.2 ships as a CLI tool (`gravitino-optimizer.sh`) designed to run on a schedule you control. The two core commands are `update-statistics`, which collects fresh per-partition metrics, and `submit-strategy-jobs`, which evaluates your policy and dispatches compaction for partitions that triggered. You wire the cadence into whatever scheduler your platform already uses — cron, Airflow, or otherwise. The partition scoring, prioritization, and job dispatch are all handled by Gravitino once invoked.
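A scheduler wrapper for the two-step cadence can be as small as the sketch below. Only the two subcommand names come from the text; any additional CLI options (connection details, table selection) are omitted, and the script path is an assumption:

```python
import subprocess

def run_tms_cycle(optimizer="./gravitino-optimizer.sh", dry_run=False):
    # The cadence from the text: refresh per-partition statistics,
    # then evaluate the policy and dispatch compaction jobs.
    commands = [
        [optimizer, "update-statistics"],
        [optimizer, "submit-strategy-jobs"],
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands

# Invoke from cron/Airflow at whatever cadence your platform needs.
print(run_tms_cycle(dry_run=True))
```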
A fully managed TMS service, with built-in scheduling intervals, health checks, and observability hooks, is planned for 1.3.
## UDF Management: Governing the Logic, Not Just the Data
Most data catalogs govern data assets: tables, schemas, views. The functions that transform that data are a different story. Today, user-defined functions live in ad hoc scripts, get duplicated across teams, drift out of sync between Spark and Trino implementations, and are almost never subject to the lifecycle controls applied to the tables they operate on. An ML engineer writes a text normalization function, puts it in a notebook, and six months later nobody knows where it is, which version is current, or who owns it.
Gravitino 1.2 introduces end-to-end UDF management, a complete stack from REST API to Spark FunctionCatalog integration, Python client, and Web UI. Functions become first-class catalog objects: governed by the same RBAC model as your tables, discoverable through the same catalog interface, and auditable through the same access trail.
### Python UDFs for AI Pipelines
Register a Python preprocessing function directly in the catalog:
```python
import requests

r = requests.post(
    f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs/{CATALOG}/schemas/{SCHEMA}/functions",
    json={
        "name": "normalize_text",
        "functionType": "SCALAR",
        "deterministic": True,
        "comment": "Normalize raw text for ML feature extraction. Owner: data-eng team.",
        "definitions": [
            {
                "parameters": [{"name": "text", "dataType": "string"}],
                "returnType": "string",
                "impls": [
                    {
                        "language": "PYTHON",
                        "runtime": "SPARK",
                        "handler": "normalize_text",
                        "codeBlock": normalize_code
                    }
                ]
            }
        ]
    }
)
```
A downstream AI pipeline (a different team, a different notebook) discovers and retrieves the function through the same catalog API it uses to find tables:
```python
# Discover what functions are available
r = requests.get(
    f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs/{CATALOG}/schemas/{SCHEMA}/functions",
    params={"details": "true"}
)

# Retrieve the function and execute it locally — code came from the governed catalog
fn = r.json()["function"]
exec(fn["definitions"][0]["impls"][0]["codeBlock"], globals())
result = normalize_text(" Hello, World! ")
# → "hello world"
```
No Slack message. No shared S3 path. No pip package to keep in sync. The function is where the data is, in the catalog, with an owner, a description, access controls, and an audit trail.
### SQL UDFs Across Engines
The same governance model applies to SQL UDFs. Register a function once with implementations for multiple engines:
```json
{
  "name": "revenue_bucket",
  "comment": "Classify revenue into reporting tiers. Consistent across Spark and Trino.",
  "definitions": [
    {
      "parameters": [{"name": "amount", "dataType": "double"}],
      "returnType": "string",
      "impls": [
        {
          "language": "SQL",
          "runtime": "SPARK",
          "sql": "CASE WHEN amount < 1000 THEN 'small' WHEN amount < 10000 THEN 'medium' ELSE 'large' END"
        },
        {
          "language": "SQL",
          "runtime": "TRINO",
          "sql": "CASE WHEN amount < 1000 THEN 'small' WHEN amount < 10000 THEN 'medium' ELSE 'large' END"
        }
      ]
    }
  ]
}
```
One governed definition. Two runtime implementations. Each engine gets the right version when it looks up the function. Drift between your Spark and Trino implementations becomes a policy violation rather than a slow-moving inconsistency nobody notices until a dashboard breaks.
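From a client's point of view, resolving the right implementation amounts to matching on the runtime field. A sketch against the JSON shape above, not Gravitino's lookup code:

```python
def resolve_sql_impl(function_def, runtime):
    # Pick the SQL body registered for the requesting engine.
    for definition in function_def["definitions"]:
        for impl in definition["impls"]:
            if impl["language"] == "SQL" and impl["runtime"] == runtime:
                return impl["sql"]
    raise LookupError(f"no SQL impl of {function_def['name']!r} for {runtime}")

revenue_bucket = {
    "name": "revenue_bucket",
    "definitions": [{
        "parameters": [{"name": "amount", "dataType": "double"}],
        "returnType": "string",
        "impls": [
            {"language": "SQL", "runtime": "SPARK",
             "sql": "CASE WHEN amount < 1000 THEN 'small' "
                    "WHEN amount < 10000 THEN 'medium' ELSE 'large' END"},
            {"language": "SQL", "runtime": "TRINO",
             "sql": "CASE WHEN amount < 1000 THEN 'small' "
                    "WHEN amount < 10000 THEN 'medium' ELSE 'large' END"},
        ],
    }],
}

print(resolve_sql_impl(revenue_bucket, "TRINO"))
```

An engine asking for a runtime with no registered implementation gets an explicit error rather than silently falling back to another dialect.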
The full 1.2 UDF stack includes: Java API, REST interface, relational storage backend, Java and Python client support, Spark FunctionCatalog integration, Web UI for visual browsing and creation, and complete OpenAPI documentation.
## Server-Side Scan Planning: Shut the Engine Down
Query engines in enterprise data platforms rarely run 24/7 because of complex analytics. They run 24/7 because something small always keeps them alive: a freshness check, a pipeline monitor, an SLA watcher that needs to know whether the morning partition arrived. None of these workloads require compute. They just need to know which files to look at. But because that metadata work happens client-side today, the engine has to be running to do it.
Gravitino 1.2 is the first open-source Iceberg REST Catalog implementation to ship the /scan endpoint, delivering server-side scan planning that evaluates partition filters on the server and returns a ready-made file list to the client. No engine required.
### Engine Shutdown, in Practice
The NYC Yellow Taxi dataset: 40 million rows across 12 monthly partitions on S3. A query for all long-haul trips in June 2024 — a single partition, with a distance filter pushed down to the Parquet reader:
```python
files = grav_table.scan(
    row_filter="tpep_pickup_datetime >= '2024-06-01T00:00:00' AND tpep_pickup_datetime < '2024-07-01T00:00:00'",
    selected_fields=("VendorID", "tpep_pickup_datetime", "fare_amount", "trip_distance"),
).plan_files()
```
Gravitino evaluates the partition filter server-side and returns in 0.90 seconds:
```
Scan plan returned in 0.90 seconds
Total partitions in table : 12 months (Jan–Dec 2024)
Partitions scanned        : 1 (June only)
Files returned            : 3
Rows in partition         : 3,539,170
```
A PyArrow client reads those three files in parallel from S3, applies the distance filter with Parquet predicate pushdown, and returns 281,403 long-haul trips in 2.59 seconds. Total end-to-end: 3.5 seconds. No Spark. No Trino. No Databricks cluster sitting idle on the clock.
The 0.90 seconds is Gravitino's contribution: server-side partition pruning that hands the client exactly the files it needs, skipping eleven months of data it never asked for. When the engine is needed for real analytical work, it starts with that same ready-made file list rather than spending billable time on scan planning itself.
| | Without Gravitino /scan | With Gravitino /scan |
|---|---|---|
| Engine required | Yes | No |
| Partitions evaluated | 12 | 1 |
| Rows considered | ~40 million | ~3.5 million |
| Query time | Engine startup + execution | 3.5 seconds, no engine |
| Engine billable time | Full query duration | Zero |
### Continuous Data Observability at Zero Cost
Server-side scan planning enables a second capability that often goes unnoticed: continuous monitoring of lakehouse data with no query engine and effectively no compute cost.
Gravitino runs as persistent infrastructure. A lightweight Python monitor, a few lines of PyIceberg, can call /scan every few minutes against your latest partition and check: did new files appear? Did record counts change unexpectedly? Did a partition that should have data arrive empty?
```python
from pyiceberg.expressions import GreaterThanOrEqual

# Run every 5 minutes — no engine, no cost
files = table.scan(
    row_filter=GreaterThanOrEqual("event_date", yesterday)
).plan_files()

if len(files) == 0:
    alert("Morning partition did not arrive by SLA window")
else:
    row_count = sum(f.file.record_count for f in files)
    if row_count < expected_minimum:
        alert(f"Volume anomaly: {row_count:,} rows, expected > {expected_minimum:,}")
```
Late data detection, volume anomalies, pipeline failures, schema drift: all detectable continuously, against live Iceberg metadata, at effectively zero infrastructure cost. Commercial data observability products charge significant license fees for exactly this capability. Gravitino delivers it as a side effect of the catalog already being present.
## Broader Ecosystem Additions
### Iceberg View Management and Authorization
Views have long been a governance blind spot. Gravitino 1.2 manages Iceberg views as first-class entities with view-level permissions and full authorization support for IRC view operations, with fine-grained access control applied consistently to both the view and the underlying tables it references.
Gravitino implements the INVOKER security model for view authorization: permissions are evaluated at query time against the identity of the user running the query, not the identity of the view's creator. Every access to a view and its underlying tables is checked in real time, against real privileges, for the real user making the request.
The difference between INVOKER and DEFINER is not just technical preference; it is an architectural choice with real governance consequences. Many popular platforms default to the DEFINER model, where a view inherits the privileges of whoever created it. Definer-model views are convenient to set up but create a subtle governance problem: a user with no direct access to a table can query it freely through a view created by someone who does. Access control becomes a function of who built the view rather than who is querying it, making it difficult to audit and harder to enforce consistently.
With Gravitino's INVOKER model, what you see in your access policy is what actually happens at query time. No privilege laundering through view ownership chains.
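In effect, the check at query time looks like the sketch below. The entity names and grant shape are hypothetical; this illustrates the INVOKER model, not Gravitino's authorization code:

```python
def authorize_view_select(invoker, view, grants):
    """INVOKER model: the querying user needs SELECT on the view AND on
    every table it references. The creator's privileges are irrelevant."""
    required = [view["name"], *view["referenced_tables"]]
    return all((invoker, obj, "SELECT") in grants for obj in required)

view = {"name": "sales_summary",
        "referenced_tables": ["orders", "customers"],
        "created_by": "alice"}

grants = {("alice", "orders", "SELECT"), ("alice", "customers", "SELECT"),
          ("alice", "sales_summary", "SELECT"),
          ("bob", "sales_summary", "SELECT")}  # bob lacks table access

print(authorize_view_select("alice", view, grants))  # True
print(authorize_view_select("bob", view, grants))    # False: no DEFINER fallback
```

Under a DEFINER model, bob's query would succeed because alice created the view; under INVOKER, it is denied until bob is granted access to the underlying tables himself.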
### Delta Lake External Table Support
Delta Lake is one of the most widely deployed open table formats. Gravitino 1.2 supports registering and managing external Delta tables through the generic lakehouse catalog, providing unified governance for Delta alongside Iceberg, Hudi, and Paimon without data migration. Organizations running Delta workloads can bring them under Gravitino's open catalog architecture without dependence on proprietary catalog services.
### ClickHouse Catalog
Teams running real-time analytics on ClickHouse can now govern it alongside their lakehouse, with one metadata layer for both streaming and batch workloads. The ClickHouse catalog includes full DDL support with distributed and partitioned cluster modes.
## What's Under the Hood
A release is more than headline features. Gravitino 1.2 includes substantial performance and reliability improvements across the core:
IRC performance: load table performance under high concurrency with authorization and credential distribution enabled is significantly improved. The IRC internal catalog getter now uses internal interfaces instead of HTTP when using a dynamic configuration provider, reducing round-trip overhead.
Authorization: Jcasbin internal maps converted to caches, reducing authorization latency under high concurrency. Batch authorization now preloads table metadata, reducing per-request overhead. Authorization overhead when the plugin is empty has been eliminated entirely.
Batch Metadata API: batch get operations now supported for catalogs, schemas, topics, metalakes, jobs, job templates, policies, and tags, with fewer round trips for clients that need metadata at scale.
Iceberg upgraded to 1.10.1: the IRC layer runs on the latest stable Iceberg release.
Web UI v2: rebuilt to support ClickHouse catalog management, UDF management, and improved views for tags, policies, and task templates.
## Get Started
Apache Gravitino 1.2 is available now.
- Documentation: gravitino.apache.org
- GitHub: github.com/apache/gravitino
- Community: Join the conversation on the Apache Gravitino mailing list and Slack
Thanks to everyone who contributed to the 1.2 release: code, reviews, testing, issue triage, design, and feedback. The full contributor list is in the release notes.
Apache Gravitino™ and the Gravitino logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
