The Open Source Lakehouse for the AI Era

One control plane.
Every catalog, every client.
Human or agent.

Ship Iceberg REST on any cloud. Federate Hudi, Delta, Lance, and filesets. One authorization path for every client, human or agent. Built on Apache Gravitino.

Request a demo Get started

Apache TLP Top-Level Project · Apache 2.0

10+ engines Spark · Trino · Flink · Iceberg

AI-native MCP · GVFS · model catalog

catalog.yaml Iceberg REST

type: iceberg-rest
uri:  https://lakehouse.your-co.net
warehouse: s3://prod/warehouse
credential: vended  # per-principal
access:
  row-filter: region = $user.region
  mask: [pii.email, pii.ssn]

Clients & engines

Spark Trino Flink Ray Daft

BI tools AI agents

Apache Gravitino

one catalog · one governance · one credential path

Catalog RBAC FGAC Tags Credential vending Lineage Audit

Sources, federated in place

Iceberg Hive Glue Relational Kafka Files Models

The Problem

The Lakehouse Without a Catalog

Your data is scattered across formats like Iceberg, every cloud, and now vectors and models, each with its own catalog, governance, and credentials. Every AI agent you ship is one more identity these systems weren't designed to govern.

Sprawl

Catalog sprawl across clouds and formats

Hive Metastore in one region, Glue in another, Polaris in a third. Lance datasets and ML models tracked in spreadsheets, if at all. No shared namespace, no shared identity, no consistent audit.

Drift

Governance fragmented per engine

Trino enforces one row-filter syntax, Spark another, BigQuery a third. Tabular data has rules. Files, vectors, and model artifacts often have none. Policies drift, audits diverge.

Exposure

Credentials baked into pipelines and handed to agents

Long-lived keys hardcoded into jobs. Service accounts shared across teams. AI agents handed broad credentials because nobody's wired up principal-scoped tokens yet. No clean revocation path when anything goes away.

The Solution

The Catalog of Catalogs for the Open Iceberg Lakehouse Era

Datastrato is built on Apache Gravitino, an Apache Top-Level Project and the federated metadata catalog for modern data and AI workloads. One namespace for everything your engines query and everything your AI agents need to find, understand, and use safely.

Open standard

Iceberg-native, every engine

Standards-compliant Iceberg REST Catalog service, in-tree, not a proprietary fork. Spark, Flink, Trino, Dremio, and BigQuery all hit one catalog with one FQTN namespace. Iceberg is the open standard your stack already speaks.

Federation

Federate, don't migrate, across every format

Register existing Hive Metastore, Glue, Polaris, or remote Gravitino catalogs without copying metadata. The same federation model covers Lance datasets, vector indexes, feature stores, and ML model registries. Decisions stay with the data owner.

Governance

Governed access for humans and AI agents

Row filtering and column masking through the Iceberg spec. Tag-based policies travel with the data. Short-lived credentials get vended per principal, whether that's an analyst, a batch job, or an AI agent calling through ADP.

Trusted by

Governed by Default

One policy path. Humans and agents alike.

Every query, training run, and agent call resolves through the same catalog. Row filters, column masks, and short-lived credentials are applied before a single byte is returned, whether the caller is an analyst or an autonomous agent.

row-level filters column masking credential vending

agent-session — gravitino

# RAG agent asks for customer data through the Gravitino MCP server
→ caller        agent:rag-bot · via MCP
→ resolve       catalog: prod.customers
→ row-filter    region = 'EU'
→ column-mask   email → sha256(email)
→ credential    vended · ttl 15m · scoped to agent:rag-bot
# engine reads Iceberg with the vended credential → governed result
 region │ email
 ───────┼─────────────────────
 EU     │ 9f2a…c71d
 EU     │ 4b8e…0a22
 EU     │ 1c93…7f5a

How It Works

One Catalog. Every Engine. Every Format.

REST catalog

Iceberg REST Catalog

Standards-compliant IRC service. Hierarchical namespaces and three-level FQTN (catalog.namespace.table). Pluggable JDBC backend on PostgreSQL or MySQL for low-latency, HA-ready persistence.

Multimodal

Multimodal metadata: tables, files, vectors, models

The same catalog covers Iceberg tables, files, Lance datasets, vector indexes, feature stores, and ML models. Tables for analytics. Lance for multimodal training. Vectors for retrieval. Models for serving.

Storage

Multi-cloud, multi-format storage

Native S3, GCS, and ADLS support with per-table backend dispatch via MultiSchemeFileIO. Same catalog across cloud and on-prem object stores. No engine changes when storage moves.

Federation

Catalog federation, no metadata copy

IRC-to-IRC registration across clouds and on-prem. No replication lag, no second source of truth. Each remote catalog keeps its own IAM, RBAC, vending, and audit log.

Access control

Fine-grained access control

Row filtering and column masking through the Iceberg spec (PR #13879). Tag-based policies attach to data, not tables, so classification travels with the asset. Identical evaluation on every compliant engine.

Credentials

Credential vending for humans and agents

Short-lived credentials minted per principal and per asset, native on AWS, GCP, and Azure. Analysts, batch jobs, and AI agents all get capability-scoped tokens through the same path. Same audit trail, same revocation, no long-lived keys baked into pipelines.

The Reference Architecture

Storing data is solved. Governing how every engine, user, and agent uses it is not.

Your data is scattered across formats, clouds, and now vectors and models, each with its own catalog, governance, and credentials. Datastrato is the unified metadata context layer that connects them and enforces governance at the point of access, for every consumer and agent that touches your data.

Consumers — analytics, applications & AI

Analytics & BI

query in place

Data products

governed APIs

Notebooks & ML

model registry

AI agents

MCP · governed

requests in

Apache Gravitino — the metadata control plane

description becomes enforcement, applied at the read, before any bytes are returned

Federated catalog

Semantic layer · OSI

RBAC · FGAC

Classification & tags

Credential vending

Lineage & audit

registered in place

Approved sources — registered in place, no copies

Iceberg

via IRC

Lakehouse

Delta · Hudi · Paimon

Hive · Glue

metastore

Relational

Oracle · PG · MSSQL

Kafka

topics

Files & Models

S3 · GCS · Lance

Define once

Access, classification, lineage and audit applied across every engine.

Federate, don't copy

Governed for AI

Agents reach data through the same policy path as humans and engines.

Trustworthiness

The Lakehouse Your Auditors Will Sign Off On.

Regulated multi-cloud platforms, multimodal AI on the lakehouse, AI agents acting on production data — Datastrato gives you one catalog and one governance model across all of it.

Compliance

Regulated, multi-cloud lakehouse

Govern PII and PHI across S3, GCS, and ADLS, with audit events streamed to Splunk, Datadog, or Sentinel. Identical enforcement on Spark, Trino, Flink, Dremio, and BigQuery.

AI / ML

Multimodal lakehouse for AI and ML

Tables, files, Lance datasets, vector indexes, feature stores, and ML models in one namespace. Training jobs find the right version, and lineage runs from raw file to deployed model.

Agents

AI agent data access at scale

Vend capability-scoped, just-in-time credentials to agents through ADP. They discover what's available, ask what they're allowed to use, and work inside the same policy and audit framework as everyone else.

Migration

Hybrid lakehouse migration and consolidation

Federate Hive Metastore, Glue, and Polaris under one namespace, before or instead of migrating. Cross-catalog CTAS and zero-copy register-table for promotion. Same Gravitino on OpenShift, EKS, GKE, and AKS, with air-gap support.

Why Datastrato

Open Foundation. Enterprise Distribution.

Apache Gravitino is the open metadata lake. Datastrato Enterprise is what you deploy in production: hardened, certified, supported, and compliance-ready, with the open core unchanged.

Open source

Apache Top-Level Project

Apache 2.0 license, vendor-neutral governance, community-driven roadmap. No license-change risk, no proprietary fork.

Hardened

Hardened, certified distribution

Signed container images and a full-stack Helm chart pinned per release. Deploys as one unit on EKS, GKE, AKS, OpenShift, or Kubernetes 1.28+.

Identity

Enterprise identity and audit

LDAP, Active Directory, and SCIM for people; federated identity with delegation chains for AI agents. SIEM streaming plus KMS and HashiCorp Vault integration.

Compliance

Compliance posture

SOC 2 Type II, ISO 27001 readiness, HIPAA BAA, and GDPR alignment. NDA-gated access to audit reports and control documentation.

Support

Standard and Premium support

Dedicated Slack Connect channel, 1-hour P0 response (24/7 on Premium), a named Technical Account Manager, and extended LTS windows beyond the standard 18 months.

Ready to Federate Your Lakehouse?

Whether your next workload is a Trino query, a Lance training run, or an AI agent asking for data, Datastrato gives you one catalog and one governance plane to handle it. Without copying a byte of metadata.

Request a demo Read the docs

Subscribe to our newsletter

Stay up to date with all things Datastrato and the Gravitino Ecosystem

One control plane.Every catalog, every client.Human or agent.

The Lakehouse Without a Catalog

Catalog sprawl across clouds and formats

Governance fragmented per engine

Credentials baked into pipelines and handed to agents

The Catalog of Catalogs for the Open Iceberg Lakehouse Era

Iceberg-native, every engine

Federate, don't migrate, across every format

Governed access for humans and AI agents

One policy path. Humans and agents alike.

One Catalog. Every Engine. Every Format.

Iceberg REST Catalog

Multimodal metadata: tables, files, vectors, models

Multi-cloud, multi-format storage

Catalog federation, no metadata copy

Fine-grained access control

Credential vending for humans and agents

Storing data is solved. Governing how every engine, user, and agent uses it is not.

The Lakehouse Your Auditors Will Sign Off On.

Regulated, multi-cloud lakehouse

Multimodal lakehouse for AI and ML

AI agent data access at scale

Hybrid lakehouse migration and consolidation

Open Foundation. Enterprise Distribution.

Apache Top-Level Project

Hardened, certified distribution

Enterprise identity and audit

Compliance posture

Standard and Premium support

Ready to Federate Your Lakehouse?

Subscribe to our newsletter

One control plane.
Every catalog, every client.
Human or agent.