The Open Source Lakehouse for the AI Era

One control plane.
Every catalog, every client.
Human or agent.

Ship Iceberg REST on any cloud and any existing Hive. Federate Hudi, Delta, Lance, and filesets. One authorization path for every client, human or agent. Built on Apache Gravitino.

Apache TLP: Top-Level Project · Apache 2.0
10+ engines: Spark · Trino · Flink · Iceberg
AI-native: MCP · GVFS · model catalog
catalog.yaml (Iceberg REST)
type: iceberg-rest
uri: https://lakehouse.your-co.net
warehouse: s3://prod/warehouse
credential: vended  # per-principal
access:
  row-filter: region = $user.region
  mask: [pii.email, pii.ssn]
Clients & engines: Spark · Trino · Flink · Ray · Daft · BI tools · AI agents
Apache Gravitino (one catalog · one governance · one credential path): Catalog · RBAC · FGAC · Tags · Credential vending · Lineage · Audit
Sources, federated in place: Iceberg · Hive · Glue · Relational · Kafka · Files · Models
The Problem

The Lakehouse Without a Catalog

Your data is scattered across formats like Iceberg, every cloud, and now vectors and models, each with its own catalog, governance, and credentials. Every AI agent you ship is one more identity these systems weren't designed to govern.

Sprawl

Catalog sprawl across clouds and formats

Hive Metastore in one region, Glue in another, Polaris in a third. Lance datasets and ML models tracked in spreadsheets, if at all. No shared namespace, no shared identity, no consistent audit.

Drift

Governance fragmented per engine

Trino enforces one row-filter syntax, Spark another, BigQuery a third. Tabular data has rules. Files, vectors, and model artifacts often have none. Policies drift, audits diverge.

Exposure

Credentials baked into pipelines and handed to agents

Long-lived keys hardcoded into jobs. Service accounts shared across teams. AI agents handed broad credentials because nobody's wired up principal-scoped tokens yet. No clean revocation path when anything goes away.

The Solution

The Catalog of Catalogs for the Open Iceberg Lakehouse Era

Datastrato is built on Apache Gravitino, an Apache Top-Level Project and the federated metadata catalog for modern data and AI workloads. One namespace for everything your engines query and everything your AI agents need to find, understand, and use safely.

Open standard

Iceberg-native, every engine

Standards-compliant Iceberg REST Catalog service, in-tree, not a proprietary fork. Spark, Flink, Trino, Dremio, and BigQuery all hit one catalog with one FQTN namespace. Iceberg is the open standard your stack already speaks.

Federation

Federate, don't migrate, across every format

Register existing Hive Metastore, Glue, Polaris, or remote Gravitino catalogs without copying metadata. The same federation model covers Lance datasets, vector indexes, feature stores, and ML model registries. Decisions stay with the data owner.

Governance

Governed access for humans and AI agents

Row filtering and column masking through the Iceberg spec. Tag-based policies travel with the data. Short-lived credentials get vended per principal, whether that's an analyst, a batch job, or an AI agent calling through ADP.

Trusted by
OpenAI, Uber, Pinterest, Roku, Intel, Tencent
Governed by Default

One policy path. Humans and agents alike.

Every query, training run, and agent call resolves through the same catalog. Row filters, column masks, and short-lived credentials are applied before a single byte is returned, whether the caller is an analyst or an autonomous agent.

row-level filters · column masking · credential vending
agent-session — gravitino
# RAG agent asks for customer data through the Gravitino MCP server
caller agent:rag-bot · via MCP
resolve catalog: prod.customers
row-filter region = 'EU'
column-mask email → sha256(email)
credential vended · ttl 15m · scoped to agent:rag-bot
# engine reads Iceberg with the vended credential → governed result
region │ email
───────┼─────────────────────
EU │ 9f2a…c71d
EU │ 4b8e…0a22
EU │ 1c93…7f5a
How It Works

One Catalog. Every Engine. Every Format.

REST catalog

Iceberg REST Catalog

Standards-compliant IRC service. Hierarchical namespaces and three-level FQTN (catalog.namespace.table). Pluggable JDBC backend on PostgreSQL or MySQL for low-latency, HA-ready persistence.

Multimodal

Multimodal metadata: tables, files, vectors, models

The same catalog covers Iceberg tables, files, Lance datasets, vector indexes, feature stores, and ML models. Tables for analytics. Lance for multimodal training. Vectors for retrieval. Models for serving.

Storage

Multi-cloud, multi-format storage

Native S3, GCS, and ADLS support with per-table backend dispatch via MultiSchemeFileIO. Same catalog across cloud and on-prem object stores. No engine changes when storage moves.

Federation

Catalog federation, no metadata copy

IRC-to-IRC registration across clouds and on-prem. No replication lag, no second source of truth. Each remote catalog keeps its own IAM, RBAC, vending, and audit log.

Access control

Fine-grained access control

Row filtering and column masking through the Iceberg spec (PR #13879). Tag-based policies attach to data, not tables, so classification travels with the asset. Identical evaluation on every compliant engine.

Credentials

Credential vending for humans and agents

Short-lived credentials minted per principal and per asset, native on AWS, GCP, and Azure. Analysts, batch jobs, and AI agents all get capability-scoped tokens through the same path. Same audit trail, same revocation, no long-lived keys baked into pipelines.

The Reference Architecture

Storing data is solved. Governing how every engine, user, and agent uses it is not.

Your data is scattered across formats, clouds, and now vectors and models, each with its own catalog, governance, and credentials. Datastrato is the unified metadata context layer that connects them and enforces governance at the point of access, for every consumer and agent that touches your data.

Consumers (analytics, applications & AI): Analytics & BI (query in place); Data products (governed APIs); Notebooks & ML (model registry); AI agents (MCP · governed)

Apache Gravitino, the metadata control plane (receives requests from consumers): description becomes enforcement, applied at the read, before any bytes are returned. Components: Federated catalog; Semantic layer · OSI; RBAC · FGAC; Classification & tags; Credential vending; Lineage & audit

Approved sources (registered in place, no copies): Iceberg (via IRC); Lakehouse (Delta · Hudi · Paimon); Hive · Glue (metastore); Relational (Oracle · PG · MSSQL); Kafka (topics); Files & Models (S3 · GCS · Lance)

Define once: Access, classification, lineage and audit applied across every engine.
Federate, don't copy: Register sources in place; ownership and decisions stay with the data.
Governed for AI: Agents reach data through the same policy path as humans and engines.
Trustworthiness

The Lakehouse Your Auditors Will Sign Off On.

Regulated multi-cloud platforms, multimodal AI on the lakehouse, AI agents acting on production data — Datastrato gives you one catalog and one governance model across all of it.

Compliance

Regulated, multi-cloud lakehouse

Govern PII and PHI across S3, GCS, and ADLS, with audit events streamed to Splunk, Datadog, or Sentinel. Identical enforcement on Spark, Trino, Flink, Dremio, and BigQuery.

AI / ML

Multimodal lakehouse for AI and ML

Tables, files, Lance datasets, vector indexes, feature stores, and ML models in one namespace. Training jobs find the right version, and lineage runs from raw file to deployed model.

Agents

AI agent data access at scale

Vend capability-scoped, just-in-time credentials to agents through ADP. They discover what's available, ask what they're allowed to use, and work inside the same policy and audit framework as everyone else.

Migration

Hybrid lakehouse migration and consolidation

Federate Hive Metastore, Glue, and Polaris under one namespace, before or instead of migrating. Cross-catalog CTAS and zero-copy register-table for promotion. Same Gravitino on OpenShift, EKS, GKE, and AKS, with air-gap support.

Why Datastrato

Open Foundation. Enterprise Distribution.

Apache Gravitino is the open metadata lake. Datastrato Enterprise is what you deploy in production: hardened, certified, supported, and compliance-ready, with the open core unchanged.

Open source

Apache Top-Level Project

Apache 2.0 license, vendor-neutral governance, community-driven roadmap. No license-change risk, no proprietary fork.

Hardened

Hardened, certified distribution

Signed container images and a full-stack Helm chart pinned per release. Deploys as one unit on EKS, GKE, AKS, OpenShift, or Kubernetes 1.28+.

Identity

Enterprise identity and audit

LDAP, Active Directory, and SCIM for people; federated identity with delegation chains for AI agents. SIEM streaming plus KMS and HashiCorp Vault integration.

Compliance

Compliance posture

SOC 2 Type II, ISO 27001 readiness, HIPAA BAA, and GDPR alignment. NDA-gated access to audit reports and control documentation.

Support

Standard and Premium support

Dedicated Slack Connect channel, 1-hour P0 response (24/7 on Premium), a named Technical Account Manager, and extended LTS windows beyond the standard 18 months.

Ready to Federate Your Lakehouse?

Whether your next workload is a Trino query, a Lance training run, or an AI agent asking for data, Datastrato gives you one catalog and one governance plane to handle it. Without copying a byte of metadata.
