The open metadata lake for the AI era

One control plane.
Every catalog, every client.

Datastrato is the federated metadata catalog built on Apache Gravitino and the Iceberg REST Catalog. Ship Iceberg REST on any cloud, federate Hive, Delta, Lance, and filesets, and govern one authorization path for every client, human or agent.

Apache TLP
Top-Level Project, Apache 2.0
10+ engines
Spark · Trino · Flink · Iceberg
AI-native
MCP · GVFS · model catalog
Clients & engines
Spark
Trino
Flink
BI tools
AI agents
Apache Gravitino
one catalog · one governance · one credential path
catalog · RBAC · FGAC · tags · credential vending · lineage · audit
Sources, federated in place
Iceberg
Hive · Glue
Relational
Kafka
Files
Models
Trusted by
OpenAI · Uber · Pinterest · Roku · Intel · Tencent
The reference architecture

Storing data is solved. Governing how every engine, user, and agent uses it is not.

Your data is scattered across formats, clouds, and now vectors and models, each with its own catalog, governance, and credentials. Datastrato is the unified metadata context layer that connects them and enforces governance at the point of access, for every consumer and agent that touches your data.

Consumers — analytics, applications & AI
Analytics & BI: query in place
Data products: governed APIs
Notebooks & ML: model registry
AI agents: MCP · governed
Apache Gravitino — the metadata control plane
description becomes enforcement, applied at the read, before any bytes are returned
Federated catalog · Semantic layer (OSI) · RBAC · FGAC · Classification & tags · Credential vending · Lineage & audit
Approved sources — registered in place, no copies
Iceberg: via IRC
Lakehouse: Delta · Hudi · Paimon
Hive · Glue: metastore
Relational: Oracle · PG · MySQL
Kafka: topics
Files & Models: S3 · GCS · Lance
Define once
Access, classification, lineage and audit applied across every engine.
Federate, don't copy
Register sources in place; ownership and decisions stay with the data.
Governed for AI
Agents reach data through the same policy path as humans and engines.
Capability 01 — Federated Catalog

One catalog over every source

Register Iceberg, Hive, Glue, relational databases, Kafka, filesets, and models in place. No copies, no migration. One fully qualified table name (FQTN) namespace and one governance layer across them all.

Federate, don't migrate

Register existing Hive Metastore, Glue, Polaris, or remote Gravitino catalogs without copying metadata. Decisions stay with the data owner.

One FQTN namespace

A three-level catalog.namespace.table name resolves the same across Spark, Trino, Flink, Dremio, and BigQuery.

A catalog of catalogs

IRC-to-IRC registration across clouds and on-prem. No replication lag, no second source of truth. Each remote catalog keeps its own IAM and audit.

Multimodal by design

Tables, files, streams, vectors, and ML models in one model. A single governed surface for analytics and AI.

Sources Datastrato federates today
Relational: Oracle · PG · MySQL · SQL Server · Doris
Lakehouse: Iceberg · Delta · Hudi · Paimon
Metastore: Hive · Glue (catalog federation)
Streaming: Apache Kafka (topics · schema registry)
Files: S3 · GCS · ADLS · HDFS via GVFS
AI assets: Models · Lance (vectors · registry)
catalog · classification · lineage · access control · credential vending
Capability 02 — Iceberg Lakehouse Platform

Standards-compliant Iceberg REST, built for scale

A standards-compliant Iceberg REST Catalog service that runs in-tree, not as a proprietary fork, plus the table management, caching, and maintenance that keep a production lakehouse fast and available.

Iceberg-native, every engine

Spark, Flink, Trino, and Dremio all hit one catalog with one namespace. Iceberg is the open standard your stack already speaks.

IRC metadata cache

4× faster reads under concurrent load. Load-table and plan avoid backend round-trips, and every entry is validated against the backend before it is served.
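One way to read "validated before it is served" is a validate-on-read cache: the full table metadata stays in memory, and each hit is checked against a cheap version pointer from the backend. The Python sketch below illustrates that general pattern only. `MetadataCache`, `current_version`, and `load_metadata` are hypothetical names, and the source does not describe how Gravitino actually validates entries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class MetadataCache:
    """Validate-on-read cache (illustrative; not Gravitino's implementation)."""

    current_version: Callable[[str], str]  # cheap backend lookup (assumed)
    load_metadata: Callable[[str], dict]   # expensive backend load (assumed)

    def __post_init__(self) -> None:
        # table name -> (version the entry was cached under, metadata)
        self._entries: Dict[str, Tuple[str, dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, table: str) -> dict:
        # Always ask the backend which version is current before serving.
        version = self.current_version(table)
        cached = self._entries.get(table)
        if cached is not None and cached[0] == version:
            self.hits += 1
            return cached[1]
        # Missing or stale entry: reload and remember the version we checked.
        self.misses += 1
        metadata = self.load_metadata(table)
        self._entries[table] = (version, metadata)
        return metadata
```

Because the version is read before the load, a commit that lands in between only causes one extra reload on the next read; it cannot cause a stale entry to be served.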

Iceberg maintenance framework

Per-table policy for compaction, snapshot expiry, and orphan cleanup, submitted through a pluggable JobExecutor SPI to Airflow, K8s, EMR, or Databricks.

HA-ready persistence

Pluggable JDBC backend on PostgreSQL or MySQL for low-latency, highly-available metadata, with DR for the catalog.

just normal SQL — any engine, any cloud
-- one catalog, one FQTN namespace
CREATE TABLE catalog.sales.orders (
  id BIGINT, amount DECIMAL, region STRING
);

-- alternate backend, same catalog
CREATE TABLE catalog.sales.events (id BIGINT)
LOCATION 's3://bucket/events/';

-- query in place, no extracts
SELECT region, sum(amount)
FROM catalog.sales.orders GROUP BY region;
4× faster reads, measured on AWS under concurrent load. The IRC cache serves hot metadata from memory, but it never bypasses the access check and never serves metadata that predates the latest commit.
Capability 03 — Unified Governance

Define once, enforce everywhere

Role-based access, fine-grained controls, and classification evaluated at the catalog on every request, for every engine, client, and AI agent, with one audit trail.

Role-based access control

Privileges scope to object types and inherit down the tree. DENY overrides ALLOW, ownership and roles are checked together, and every decision is evaluated and audited server-side.
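The two rules named here, inheritance down the tree and DENY over ALLOW, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the server's evaluator: the `Grant` type, the dotted object paths, and `is_allowed` are hypothetical, and ownership checks are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Grant:
    role: str
    obj: str        # dotted path such as "catalog.sales" (assumed naming)
    privilege: str  # e.g. "SELECT"
    effect: str     # "ALLOW" or "DENY"


def ancestors(obj: str) -> List[str]:
    """Return the object and every parent: a.b.c -> [a.b.c, a.b, a]."""
    parts = obj.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_allowed(grants: Iterable[Grant], roles: Set[str],
               obj: str, privilege: str) -> bool:
    # A grant applies if it sits on the object or on any of its parents.
    scope = set(ancestors(obj))
    matching = [g for g in grants
                if g.role in roles
                and g.privilege == privilege
                and g.obj in scope]
    # DENY overrides ALLOW, regardless of which role or level it comes from.
    if any(g.effect == "DENY" for g in matching):
        return False
    # With no matching ALLOW, access is refused by default.
    return any(g.effect == "ALLOW" for g in matching)
```

A SELECT grant on `catalog.sales` therefore covers `catalog.sales.orders`, while a single DENY on `catalog.sales.customers` blocks that table even for a principal who also holds the broader ALLOW.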

Fine-grained access control

Row filtering and column masking through the Iceberg spec. Nine spec-defined mask actions produce identical output on every compliant engine.

Tags travel with the data

Classify once; policies attach to tags, not tables. Tags inherit down the tree, and FGAC rules key on them for blast-radius analysis.

Identity from your IdP

OIDC sign-in and SCIM provisioning of users and groups; new members inherit access at once, with optional push-down to Apache Ranger.

Classify once · inherited down the tree
Catalog · domain=sales · SELECT
Schema · orders · inherited
Table · customers · + pii · masked
tag-based FGAC policy
{
  "policyType": "system_fgac_row_filter",
  "content": {
    "tag": "pii-region-restricted",
    "rowFilter": { "column": "region",
      "operator": "eq", "value": "home_region" }
  }
}
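To make the policy concrete, here is a small Python sketch that applies it to in-memory rows. It is illustrative only: `apply_row_filter` is a hypothetical helper, and the sketch assumes that `"value": "home_region"` names an attribute of the calling principal rather than a literal string, which the policy itself does not state.

```python
import json
from typing import Dict, Iterable, List, Set

POLICY = json.loads("""
{
  "policyType": "system_fgac_row_filter",
  "content": {
    "tag": "pii-region-restricted",
    "rowFilter": { "column": "region",
      "operator": "eq", "value": "home_region" }
  }
}
""")


def apply_row_filter(policy: dict, table_tags: Set[str],
                     principal: Dict[str, str],
                     rows: Iterable[dict]) -> List[dict]:
    content = policy["content"]
    # The policy attaches to a tag, so untagged tables pass through unchanged.
    if content["tag"] not in table_tags:
        return list(rows)
    rule = content["rowFilter"]
    if rule["operator"] != "eq":
        raise ValueError(f"unsupported operator: {rule['operator']}")
    # Assumption: "value" is the name of a principal attribute, not a literal.
    expected = principal[rule["value"]]
    return [row for row in rows if row[rule["column"]] == expected]
```

A caller whose `home_region` is `emea` would see only the `emea` rows of any table tagged `pii-region-restricted`; retagging a table changes what is filtered without editing the policy.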
Capability 04 — Semantic Layer (OSI)

One definition for every metric

A governed registry for Open Semantic Interchange (OSI) models. Define a metric once and serve it to any producer or consumer, secured with the same RBAC, tags, and lineage as the tables beneath it.

An open standard

OSI is a vendor-neutral, Apache-2.0 spec for datasets, metrics, dimensions, relationships, and context. The logic isn't locked in a BI tool.

One number, no drift

Define the metric once; dbt, AtScale, Tableau, Sigma, notebooks, and agents all read the same logic. One consistent set of metrics everywhere.

Governed execution

The metric compiles to SQL and runs on governed tables, enforced on the read. Definition and enforcement in one place.

Agent-ready

ai_context grounds LLMs with intent and synonyms, so an agent asking over MCP gets governed meaning, not raw tables.

ecommerce_analytics.yaml · OSI v1.0
semantic_model:
  name: ecommerce_analytics
  ai_context:           # natural-language grounding
    instructions: "Revenue = settled order amount"
  datasets:
    - name: orders
      source: sales.public.orders
  metrics:             # defined once, spans datasets
    - name: settled_revenue
      expression:
        dialect: ANSI_SQL
        expression: SUM(orders.amount)
          FILTER (WHERE status='settled')
      ai_context: { synonyms: ["net revenue"] }
Author once · Any OSI tool reads it · Execute under governance
Capability 05 — Multi-Cloud & Credentials

Every cloud, one catalog — no long-lived keys

Native S3, GCS, and ADLS within a single catalog, with short-lived credentials minted per principal at access time. Engines never see long-lived keys, and authorization is evaluated at the authoritative metalake.

One catalog, every cloud

MultiSchemeFileIO routes by URI scheme: a default warehouse plus per-scheme backends, fully IRC-compliant. No engine changes when storage moves.
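Routing by URI scheme is simple to picture. The Python sketch below shows the idea of a default backend plus per-scheme overrides. `SchemeRouter` and the backend names are hypothetical stand-ins; this is not the MultiSchemeFileIO class or its configuration.

```python
from typing import Dict
from urllib.parse import urlparse


class SchemeRouter:
    """Pick a storage backend from a location's URI scheme (illustrative)."""

    def __init__(self, default_backend: str, per_scheme: Dict[str, str]) -> None:
        self._default = default_backend
        self._per_scheme = {k.lower(): v for k, v in per_scheme.items()}

    def backend_for(self, location: str) -> str:
        # urlparse returns an empty scheme for plain paths, which then
        # falls through to the default warehouse backend.
        scheme = urlparse(location).scheme.lower()
        return self._per_scheme.get(scheme, self._default)


router = SchemeRouter("s3-default", {"gs": "gcs-backend", "abfs": "adls-backend"})
```

With this shape, moving a table from `s3://` to `gs://` changes only which backend the catalog selects; the engine keeps issuing the same request.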

Credential vending

Short-lived, scoped STS tokens minted at access time on AWS, GCP, and Azure. Supports a zero-trust posture, with no keys baked into pipelines.
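The two properties that matter, a short lifetime and a narrow scope, can be shown with an in-process stand-in. The Python sketch below is not a cloud STS call: `ScopedToken`, `vend_token`, and the 15-minute default lifetime are assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ScopedToken:
    principal: str
    prefix: str           # storage prefix the token is limited to
    expires_at: datetime
    value: str

    def permits(self, location: str, now: datetime) -> bool:
        # Valid only before expiry and only under the granted prefix.
        return now < self.expires_at and location.startswith(self.prefix)


def vend_token(principal: str, prefix: str,
               now: Optional[datetime] = None,
               ttl: timedelta = timedelta(minutes=15)) -> ScopedToken:
    now = now or datetime.now(timezone.utc)
    # Normalise to a trailing slash so "events" cannot match "events-private".
    scoped_prefix = prefix.rstrip("/") + "/"
    return ScopedToken(principal, scoped_prefix, now + ttl,
                       secrets.token_urlsafe(16))
```

Nothing long-lived reaches the engine in this model: once `expires_at` passes, the caller has to come back through the catalog, and through its access check, to get another token.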

Identity passthrough

A JWT identity flows caller → Gravitino → IRC → object-store credential, evaluated at the authoritative metalake.

Federation across boundaries

Remote IRCs are referenced by name and URI. Each keeps its own IAM, RBAC, vending, and audit. Nothing is shared except the request shape.

Caller — Spark · Trino · Flink
User JWT: identity, no keys
Gravitino catalog
default warehouse + per-scheme overrides · authz at the metalake
Scoped, short-lived tokens vended per request
s3-token · s3://
gcs-token · gs://
adls-token · abfs://
Capability 06 — AI & Agents

A governed surface for AI agents

Agents reach governed metadata and unstructured data through one standardized, policy-enforced surface: MCP for metadata, GVFS for files. That surface is held to the same RBAC, FGAC, and audit as every human and engine.

MCP server

The metadata API exposed as MCP tools, running local or remote. Connects to Claude Desktop and Cursor out of the box.

Same identity and policy

RBAC, FGAC, masking, and audit all apply. Agents are governed exactly like humans and engines, so an agent never holds broader access than its identity has been granted.

GVFS for unstructured data

One virtual path masks the real location across S3, GCS, ADLS, and HDFS, vending a scoped credential per request, with the same identity and audit.
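A virtual path of the form `gvfs://fileset/{catalog}/{schema}/{fileset}/path` has to be split into its parts and mapped to a real storage location. The Python sketch below shows that resolution step only, under stated assumptions: `resolve_gvfs` and the `locations` lookup table are hypothetical, and credential vending is left out.

```python
from typing import Dict, Tuple
from urllib.parse import urlparse


def resolve_gvfs(path: str,
                 locations: Dict[Tuple[str, str, str], str]) -> str:
    """Map a gvfs fileset path to its backing storage location (illustrative)."""
    parsed = urlparse(path)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a gvfs fileset path: {path}")
    # Split into catalog, schema, fileset, and the remaining sub-path.
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected catalog/schema/fileset in: {path}")
    catalog, schema, fileset = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    # The caller never sees this mapping; only the virtual path is public.
    base = locations[(catalog, schema, fileset)].rstrip("/")
    return f"{base}/{sub_path}" if sub_path else base
```

Because clients only hold the virtual path, the fileset's entry in `locations` can move from one store to another without any change on the caller's side.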

Discover and act

List and inspect catalogs, tables, and columns; read and apply tags, all under one policy-enforced surface.

MCP tool surface · exposed in 1.0.0
# LLM agent (Claude · Cursor) → MCP server
get_list_of_catalogs()
get_list_of_tables(schema)
get_table_columns_by_fqn(fqn)
list_objects_by_tag("pii")
associate_tag_to_column(col, tag)

# one virtual path for unstructured data
gvfs://fileset/{catalog}/{schema}/{fileset}/path
   ↳ resolves location · vends scoped credential
Agents are gated by the same governance as everyone else. Every request carries identity, object, and operation to an access check, so humans, engines, and agents reach data through one governed path.

Ready to federate your lakehouse?

Whether your next workload is a Trino query, a Lance training run, or an AI agent asking for data, Datastrato gives you one catalog and one governance plane to handle it, without copying a byte of metadata.