One catalog.
Every cloud,
every engine.

Datastrato is the federated metadata catalog built on Apache Gravitino and the Iceberg REST Catalog. Ship Iceberg REST on any cloud; federate Hive, Delta, Lance, and filesets; and govern every client, human or agent, through one authorization path.


Apache TLP

Top-Level Project, Apache 2.0

10+ engines

Spark · Trino · Flink · Iceberg

AI-native

MCP · GVFS · model catalog

Clients & engines

Spark

Trino

Flink

Ray

Daft

BI tools

AI agents

Apache Gravitino

one catalog · one governance · one credential path

catalog · RBAC · FGAC · tags · credential vending · lineage · audit

Sources, federated in place

Iceberg

Hive · Glue

Relational

Kafka

Files

Models

The reference architecture

Storing data is solved. Governing how every engine, user, and agent uses it is not.

Your data is scattered across formats, clouds, and now vectors and models, each with its own catalog, governance, and credentials. Datastrato is the unified metadata context layer that connects them and enforces governance at the point of access, for every consumer and agent that touches your data.

Consumers — analytics, applications & AI

Analytics & BI: query in place

Data products: governed APIs

Notebooks & ML: model registry

AI agents: MCP · governed

Apache Gravitino — the metadata control plane

Description becomes enforcement, applied at read time, before any bytes are returned.

Federated catalog · Semantic layer (OSI) · RBAC and FGAC · Classification & tags · Credential vending · Lineage & audit

Approved sources — registered in place, no copies

Iceberg: via IRC

Lakehouse: Delta · Hudi · Paimon

Hive · Glue: metastore

Relational: Oracle · PG · MSSQL

Kafka: topics

Files & Models: S3 · GCS · Lance

Define once

Access, classification, lineage and audit applied across every engine.

Federate, don't copy

Governed for AI

Agents reach data through the same policy path as humans and engines.

Capability 01 — Federated Catalog

One catalog over every source

Register Iceberg, Hive, Glue, relational databases, Kafka, filesets, and models in place. No copies, no migration. One fully qualified table name (FQTN) namespace and one governance layer across them all.

Federate, don't migrate

Register existing Hive Metastore, Glue, Polaris, or remote Gravitino catalogs without copying metadata. Decisions stay with the data owner.

One FQTN namespace

A three-level catalog.namespace.table name resolves identically across Spark, Trino, Flink, Dremio, and BigQuery.
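
To make the three-level name concrete, here is a minimal client-side sketch that splits an FQTN into its parts. The `Fqtn` class and `parse_fqtn` function are hypothetical names chosen for illustration; they are not part of the Gravitino or Datastrato API, and real identifiers may need quoting rules this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fqtn:
    """Hypothetical holder for a catalog.namespace.table name."""
    catalog: str
    namespace: str
    table: str

    def __str__(self) -> str:
        return f"{self.catalog}.{self.namespace}.{self.table}"


def parse_fqtn(name: str) -> Fqtn:
    """Split a three-level name; reject anything with missing or extra parts."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.namespace.table, got {name!r}")
    return Fqtn(*parts)
```

Because the same string is passed unchanged to every engine, a helper like this only needs to exist once, on whichever side builds the query.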

A catalog of catalogs

IRC-to-IRC registration across clouds and on-prem. No replication lag, no second source of truth. Each remote catalog keeps its own IAM and audit.

Multimodal by design

Tables, files, streams, vectors, and ML models in one model. A single governed surface for analytics and AI.

Sources Datastrato federates today

Relational: Oracle · PG · MySQL · MSSQL

Lakehouse: Iceberg · Delta · Hudi · Paimon

Metastore: Hive · Glue (catalog federation)

Streaming: Apache Kafka (topics · schema registry)

Files: S3 · GCS · ADLS · HDFS via GVFS

AI assets: Models · Lance (vectors · registry)

catalog · classification · lineage · access control · credential vending

Capability 02 — Iceberg Lakehouse Platform

Standards-compliant Iceberg REST, built for scale

A standards-compliant Iceberg REST Catalog service that runs in-tree, not as a proprietary fork, plus the table management, caching, and maintenance that keep a production lakehouse fast and available.

Iceberg-native, every engine

Spark, Flink, Trino, and Dremio all hit one catalog with one namespace. Iceberg is the open standard your stack already speaks.

IRC metadata cache

4× faster reads under concurrent load. Load-table and plan avoid backend round-trips, and every entry is validated against the backend before it is served.
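
One way to read "validated before it is served" is a cache that performs a cheap version check and only falls back to a full metadata load when the version has moved. The sketch below illustrates that idea under stated assumptions: `ValidatedMetadataCache`, `current_version`, and `load_metadata` are hypothetical names, and the actual IRC cache may validate entries differently.

```python
from typing import Callable, Dict, Tuple


class ValidatedMetadataCache:
    """Hypothetical cache: serve a cached entry only if its version is still current."""

    def __init__(
        self,
        current_version: Callable[[str], str],   # assumed cheap pointer lookup
        load_metadata: Callable[[str], dict],    # assumed expensive full load
    ) -> None:
        self._current_version = current_version
        self._load_metadata = load_metadata
        self._entries: Dict[str, Tuple[str, dict]] = {}
        self.full_loads = 0  # counts expensive loads, for illustration

    def load_table(self, table: str) -> dict:
        version = self._current_version(table)
        cached = self._entries.get(table)
        if cached is not None and cached[0] == version:
            return cached[1]  # hit: version unchanged since it was cached
        # Miss or stale entry: do the full load and remember the version.
        metadata = self._load_metadata(table)
        self.full_loads += 1
        self._entries[table] = (version, metadata)
        return metadata
```

A production cache would also need to handle a commit landing between the version check and the full load; this sketch leaves that out.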

Iceberg maintenance framework

Per-table policy for compaction, snapshot expiry, and orphan cleanup, submitted through a pluggable JobExecutor SPI to Airflow, K8s, EMR, or Databricks.

HA-ready persistence

Pluggable JDBC backend on PostgreSQL or MySQL for low-latency, highly-available metadata, with DR for the catalog.

just normal SQL — any engine, any cloud

-- one catalog, one FQTN namespace
CREATE TABLE catalog.sales.orders (
  id BIGINT, amount DECIMAL, region STRING
);

-- alternate backend, same catalog
CREATE TABLE catalog.sales.events (id BIGINT)
LOCATION 's3://bucket/events/';

-- query in place, no extracts
SELECT region, sum(amount)
FROM catalog.sales.orders GROUP BY region;

4×

faster reads measured on AWS under concurrent load. The IRC cache serves hot metadata from memory without bypassing the access check or serving a stale commit.

Capability 03 — Unified Governance

Define once, enforce everywhere

Role-based access, fine-grained controls, and classification evaluated at the catalog on every request, for every engine, client, and AI agent, with one audit trail.

Role-based access control

Privileges scope to object types and inherit down the tree. DENY overrides ALLOW; ownership and roles are checked together, and every decision is evaluated and audited server-side.
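
The two rules above, inheritance down the tree and DENY overriding ALLOW, can be sketched in a few lines. This is an illustrative model only: the `Grants` mapping and the `ancestors` and `is_allowed` helpers are hypothetical, dotted paths stand in for the object tree, and ownership and role resolution are left out.

```python
from typing import Dict, List, Tuple

# Hypothetical grant table: (object path, privilege) -> "ALLOW" or "DENY".
Grants = Dict[Tuple[str, str], str]


def ancestors(path: str) -> List[str]:
    """Return the object and every parent above it, root first."""
    parts = path.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_allowed(grants: Grants, path: str, privilege: str) -> bool:
    """Collect effects inherited along the path; any DENY wins over ALLOW."""
    effects = [
        grants[(p, privilege)]
        for p in ancestors(path)
        if (p, privilege) in grants
    ]
    if "DENY" in effects:
        return False
    return "ALLOW" in effects  # no grant at all means no access
```

With an ALLOW on `catalog` and a DENY on `catalog.sales.customers`, a SELECT on `catalog.sales.orders` passes while a SELECT on `catalog.sales.customers` is refused.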

Fine-grained access control

Row filtering and column masking through the Iceberg spec. Nine spec-defined mask actions produce identical output on every compliant engine.

Tags travel with the data

Classify once; policies attach to tags, not tables. Tags inherit down the tree, and FGAC rules key on them for blast-radius analysis.

Identity from your IdP

OIDC sign-in and SCIM provisioning of users and groups; new members inherit access immediately, with optional push-down to Apache Ranger.

Classify once · inherited down the tree

Catalog · domain=sales · SELECT
Schema · orders · inherited
Table · customers · + pii · masked

tag-based FGAC policy

{
  "policyType": "system_fgac_row_filter",
  "content": {
    "tag": "pii-region-restricted",
    "rowFilter": { "column": "region",
      "operator": "eq", "value": "home_region" }
  }
}
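
To show what a policy of this shape does to a result set, the sketch below applies the row filter to in-memory rows. It rests on assumptions: `apply_row_filter` is a hypothetical helper, only the `eq` operator is handled, and `home_region` is compared as a literal string, although in a real deployment it may resolve to an attribute of the caller.

```python
import json
from typing import Any, Dict, Iterable, List

# Same policy document as above, parsed from JSON.
POLICY = json.loads("""
{
  "policyType": "system_fgac_row_filter",
  "content": {
    "tag": "pii-region-restricted",
    "rowFilter": { "column": "region", "operator": "eq", "value": "home_region" }
  }
}
""")


def apply_row_filter(
    policy: Dict[str, Any], rows: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Keep only rows whose filtered column equals the policy value."""
    rule = policy["content"]["rowFilter"]
    if rule["operator"] != "eq":
        raise NotImplementedError(f"operator {rule['operator']!r} not sketched")
    return [row for row in rows if row.get(rule["column"]) == rule["value"]]
```

In the product the filter is enforced by the engine at read time; this client-side version only illustrates the semantics.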

Capability 04 — Semantic Layer (OSI)

One definition for every metric

A governed registry for Open Semantic Interchange (OSI) models. Define a metric once and serve it to any producer or consumer, secured with the same RBAC, tags, and lineage as the tables beneath it.

An open standard

OSI is a vendor-neutral, Apache-2.0 spec for datasets, metrics, dimensions, relationships, and context. The logic isn't locked in a BI tool.

One number, no drift

Define the metric once; dbt, AtScale, Tableau, Sigma, notebooks, and agents all read the same logic. One consistent set of metrics everywhere.

Governed execution

The metric compiles to SQL and runs on governed tables, enforced on the read. Definition and enforcement in one place.

Agent-ready

ai_context grounds LLMs with intent and synonyms, so an agent asking over MCP gets governed meaning, not raw tables.

ecommerce_analytics.yaml · OSI v1.0

semantic_model:
  name: ecommerce_analytics
  ai_context:           # natural-language grounding
    instructions: "Revenue = settled order amount"
  datasets:
    - name: orders
      source: sales.public.orders
  metrics:             # defined once, spans datasets
    - name: settled_revenue
      expression:
        dialect: ANSI_SQL
        expression: SUM(orders.amount)
          FILTER (WHERE status='settled')
      ai_context: { synonyms: ["net revenue"] }

Author once · Any OSI tool reads it · Execute under governance

Capability 05 — Multi-Cloud & Credentials

Every cloud, one catalog — no long-lived keys

Native S3, GCS, and ADLS within a single catalog, with short-lived credentials minted per principal at access time. Engines never see long-lived keys, and authorization is evaluated at the authoritative metalake.

One catalog, every cloud

MultiSchemeFileIO routes by URI scheme: a default warehouse plus per-scheme backends, fully IRC-compliant. No engine changes when storage moves.
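
Routing by URI scheme can be pictured as a lookup with a fallback. The sketch below is not the MultiSchemeFileIO implementation; `SchemeRouter` and the backend labels are hypothetical, and it only shows how a default warehouse plus per-scheme overrides might resolve a location.

```python
from typing import Dict
from urllib.parse import urlparse


class SchemeRouter:
    """Hypothetical router: pick a storage backend from a location's URI scheme."""

    def __init__(self, default_backend: str, per_scheme: Dict[str, str]) -> None:
        self._default = default_backend
        self._per_scheme = dict(per_scheme)

    def backend_for(self, location: str) -> str:
        # urlparse yields an empty scheme for bare paths, which falls
        # through to the default warehouse backend.
        scheme = urlparse(location).scheme
        return self._per_scheme.get(scheme, self._default)


router = SchemeRouter(
    default_backend="s3-warehouse",  # illustrative label
    per_scheme={"gs": "gcs-backend", "abfs": "adls-backend"},
)
```

Since the decision depends only on the location string, moving a table to a different store changes its URI but not the engine configuration.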

Credential vending

Short-lived, scoped STS tokens minted at access time on AWS, GCP, and Azure. Supports a zero-trust posture, with no keys baked into pipelines.
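
The two properties that matter here, a short lifetime and a narrow scope, can be modeled without any cloud SDK. In the sketch below, `ScopedCredential` and `vend` are hypothetical names, the 900-second lifetime is an arbitrary example value, and the random token stands in for whatever the cloud provider's token service would actually return.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredential:
    """Hypothetical vended credential: one principal, one prefix, one expiry."""
    principal: str
    prefix: str
    token: str
    expires_at: float  # seconds since the epoch

    def permits(self, location: str, now: float) -> bool:
        """Valid only before expiry and only under the granted prefix."""
        return now < self.expires_at and location.startswith(self.prefix)


def vend(principal: str, prefix: str, now: float, ttl_seconds: float = 900.0) -> ScopedCredential:
    """Mint a fresh credential at access time; nothing long-lived is stored."""
    return ScopedCredential(
        principal=principal,
        prefix=prefix,
        token=secrets.token_urlsafe(16),  # placeholder for a real STS token
        expires_at=now + ttl_seconds,
    )
```

Passing `now` explicitly keeps the expiry logic easy to test; a caller would normally supply `time.time()`.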

Identity passthrough

A JWT identity flows caller → Gravitino → IRC → object-store credential, evaluated at the authoritative metalake.

Federation across boundaries

Remote IRCs are referenced by name and URI. Each keeps its own IAM, RBAC, vending, and audit. Nothing is shared except the request shape.

Caller — Spark · Trino · Flink

User JWT: identity, no keys

Gravitino catalog

default warehouse + per-scheme overrides · authz at the metalake

Scoped, short-lived tokens vended per request

s3-token: s3://

gcs-token: gs://

adls-token: abfs://

Capability 06 — AI & Agents

A governed surface for AI agents

Agents reach governed metadata and unstructured data through one standardized, policy-enforced surface: MCP for metadata, GVFS for files. Agents are held to the same RBAC, FGAC, and audit as every human and engine.

MCP server

The metadata API exposed as MCP tools, running locally or remotely. Connects to Claude Desktop and Cursor out of the box.

Same identity and policy

RBAC, FGAC, masking, and audit all apply. Agents are governed exactly like humans and engines, so an agent gets no broader access than its identity allows.

GVFS for unstructured data

One virtual path masks the real location across S3, GCS, ADLS, and HDFS, vending a scoped credential per request, with the same identity and audit.

Discover and act

List and inspect catalogs, tables, and columns; read and apply tags, all under one policy-enforced surface.

MCP tool surface · exposed in 1.0.0

# LLM agent (Claude · Cursor) → MCP server
get_list_of_catalogs()
get_list_of_tables(schema)
get_table_columns_by_fqn(fqn)
list_objects_by_tag("pii")
associate_tag_to_column(col, tag)

# one virtual path for unstructured data
gvfs://fileset/{catalog}/{schema}/{fileset}/path
   ↳ resolves location · vends scoped credential
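
The virtual path layout shown above can be unpacked with the standard library. This is a minimal sketch assuming the `gvfs://fileset/{catalog}/{schema}/{fileset}/path` shape exactly as written; `GvfsPath` and `parse_gvfs_path` are hypothetical helpers, not part of the GVFS client, and resolving the real location or vending a credential is out of scope.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class GvfsPath:
    """Hypothetical parsed form of a gvfs://fileset/... virtual path."""
    catalog: str
    schema: str
    fileset: str
    sub_path: str  # path inside the fileset; empty if none was given


def parse_gvfs_path(uri: str) -> GvfsPath:
    parsed = urlparse(uri)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a gvfs fileset path: {uri!r}")
    # Split into catalog, schema, fileset, and the remaining sub-path.
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected catalog/schema/fileset in {uri!r}")
    sub_path = parts[3] if len(parts) == 4 else ""
    return GvfsPath(parts[0], parts[1], parts[2], sub_path)
```

The caller never sees a bucket name or storage scheme; those stay behind the catalog, which is what lets the physical location change without touching client code.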

Agents are gated by the same governance as everyone else. Every request carries identity, object, and operation to an access check, so humans, engines, and agents reach data through one governed path.

Ready to federate your lakehouse?

Whether your next workload is a Trino query, a Lance training run, or an AI agent asking for data, Datastrato gives you one catalog and one governance plane to handle it, without copying a byte of metadata.

