One control plane.
Every catalog, every client.
Datastrato is the federated metadata catalog built on Apache Gravitino and the Iceberg REST Catalog. Ship Iceberg REST on any cloud; federate Hive, Delta, Lance, and filesets; and enforce one authorization path for every client, human or agent.
Storing data is solved. Governing how every engine, user, and agent uses it is not.
Your data is scattered across formats, clouds, and now vectors and models, each with its own catalog, governance, and credentials. Datastrato is the unified metadata context layer that connects them and enforces governance at the point of access, for every consumer and agent that touches your data.
One catalog over every source
Register Iceberg, Hive, Glue, relational databases, Kafka, filesets, and models in place. No copies, no migration. One FQTN namespace and one governance layer across them all.
Federate, don't migrate
Register existing Hive Metastore, Glue, Polaris, or remote Gravitino catalogs without copying metadata. Access decisions stay with the data owner.
One FQTN namespace
A three-level catalog.namespace.table name resolves to the same table across Spark, Trino, Flink, Dremio, and BigQuery.
A catalog of catalogs
IRC-to-IRC (Iceberg REST Catalog) registration across clouds and on-prem. No replication lag, no second source of truth. Each remote catalog keeps its own IAM and audit.
Multimodal by design
Tables, files, streams, vectors, and ML models in one model. A single governed surface for analytics and AI.
Standards-compliant Iceberg REST, built for scale
A standards-compliant Iceberg REST Catalog service that runs in-tree, not as a proprietary fork, plus the table management, caching, and maintenance that keep a production lakehouse fast and available.
Iceberg-native, every engine
Spark, Flink, Trino, and Dremio all hit one catalog with one namespace. Iceberg is the open standard your stack already speaks.
IRC metadata cache
4× faster reads under concurrent load. Load-table and plan requests avoid backend round-trips, and every entry is validated against the backend before it is served.
Iceberg maintenance framework
Per-table policy for compaction, snapshot expiry, and orphan cleanup, submitted through a pluggable JobExecutor SPI to Airflow, K8s, EMR, or Databricks.
HA-ready persistence
Pluggable JDBC backend on PostgreSQL or MySQL for low-latency, highly available metadata, with disaster recovery (DR) for the catalog.
-- one catalog, one FQTN namespace
CREATE TABLE catalog.sales.orders (
  id BIGINT, amount DECIMAL, region STRING
);
-- alternate backend, same catalog
CREATE TABLE catalog.sales.events (id BIGINT)
  LOCATION 's3://bucket/events/';
-- query in place, no extracts
SELECT region, SUM(amount)
FROM catalog.sales.orders GROUP BY region;
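The validate-before-serve behavior described for the IRC metadata cache can be pictured with a small sketch. This is not Gravitino's implementation: the `Backend` class, its `current_version` and `load_table` methods, and the version-token comparison are all hypothetical. It only shows how a cache might skip the expensive full load while still checking freshness, on the assumption that the freshness check is much cheaper than the load.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical metadata store: a cheap version lookup and an expensive full load."""
    tables: dict = field(default_factory=dict)  # name -> (version, metadata)
    full_loads: int = 0                         # counts expensive loads, for illustration

    def current_version(self, name: str) -> int:
        return self.tables[name][0]

    def load_table(self, name: str) -> dict:
        self.full_loads += 1
        return self.tables[name][1]


class ValidatingCache:
    """Serve cached metadata only if its version still matches the backend."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.entries: dict = {}  # name -> (version, metadata)

    def load_table(self, name: str) -> dict:
        # Cheap freshness check against the backend (assumed, not a real API).
        version = self.backend.current_version(name)
        cached = self.entries.get(name)
        if cached is not None and cached[0] == version:
            return cached[1]  # validated hit: no full load
        # Miss or stale entry: do the full load and remember the version.
        metadata = self.backend.load_table(name)
        self.entries[name] = (version, metadata)
        return metadata
```

In this sketch a repeated read of an unchanged table costs one version lookup, and a commit that bumps the version forces exactly one reload.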
Define once, enforce everywhere
Role-based access, fine-grained controls, and classification evaluated at the catalog on every request, for every engine, client, and AI agent, with one audit trail.
Role-based access control
Privileges scope to object types and inherit down the tree. DENY overrides ALLOW; ownership and roles are checked together, and every decision is evaluated and audited server-side.
Fine-grained access control
Row filtering and column masking through the Iceberg spec. Nine spec-defined mask actions produce identical output on every compliant engine.
Tags travel with the data
Classify once; policies attach to tags, not tables. Tags inherit down the tree, and fine-grained access control (FGAC) rules key on them, which also supports blast-radius analysis.
Identity from your IdP
OIDC sign-in and SCIM provisioning of users and groups; new group members inherit access immediately, with optional push-down to Apache Ranger.
{
  "policyType": "system_fgac_row_filter",
  "content": {
    "tag": "pii-region-restricted",
    "rowFilter": {
      "column": "region",
      "operator": "eq",
      "value": "home_region"
    }
  }
}
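The rule that privileges inherit down the tree and DENY overrides ALLOW can be sketched in a few lines. Everything here is an illustrative assumption: the `Grants` shape, the dotted object paths, and the `ancestors` and `is_allowed` helpers are hypothetical names, and ownership checks are left out. It is a picture of the evaluation order, not the catalog's actual authorizer.

```python
from typing import Dict, List, Tuple

# (role, object path, privilege) -> "ALLOW" or "DENY"; a hypothetical grant table.
Grants = Dict[Tuple[str, str, str], str]


def ancestors(path: str) -> List[str]:
    """Return the object and each parent, e.g. 'c.s.t' -> ['c.s.t', 'c.s', 'c']."""
    parts = path.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_allowed(grants: Grants, roles: List[str], path: str, privilege: str) -> bool:
    """Allow only if some grant on the path or a parent says ALLOW and none says DENY."""
    effects = {
        grants[(role, scope, privilege)]
        for role in roles
        for scope in ancestors(path)
        if (role, scope, privilege) in grants
    }
    # No matching grant means no access; a single DENY anywhere wins.
    return "ALLOW" in effects and "DENY" not in effects
```

With a grant of SELECT on `catalog.sales` to one role, every table under that schema is readable by that role, until any of the caller's roles carries a DENY on the table or one of its parents.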
One definition for every metric
A governed registry for Open Semantic Interchange (OSI) models. Define a metric once and serve it to any producer or consumer, secured with the same RBAC, tags, and lineage as the tables beneath it.
An open standard
OSI is a vendor-neutral, Apache-2.0 spec for datasets, metrics, dimensions, relationships, and context. The logic isn't locked in a BI tool.
One number, no drift
Define the metric once; dbt, AtScale, Tableau, Sigma, notebooks, and agents all read the same logic. One consistent set of metrics everywhere.
Governed execution
The metric compiles to SQL and runs on governed tables, with policies enforced at read time. Definition and enforcement in one place.
Agent-ready
ai_context grounds LLMs with intent and synonyms, so an agent asking over MCP gets governed meaning, not raw tables.
semantic_model:
  name: ecommerce_analytics
  ai_context:                # natural-language grounding
    instructions: "Revenue = settled order amount"
  datasets:
    - name: orders
      source: sales.public.orders
  metrics:                   # defined once, spans datasets
    - name: settled_revenue
      expression:
        dialect: ANSI_SQL
        expression: SUM(orders.amount)
          FILTER (WHERE status='settled')
      ai_context: { synonyms: ["net revenue"] }
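To make "the metric compiles to SQL" concrete, here is a minimal sketch that renders the `settled_revenue` metric above as a query. The `compile_metric` function is hypothetical and is not part of OSI or Datastrato. It takes the model as plain Python dicts (no YAML parser), assumes a metric references exactly one dataset by its name, and does not show policy enforcement, which would happen when the governed tables are read.

```python
def compile_metric(metric: dict, datasets: list) -> str:
    """Render one metric as a SELECT over the single dataset it references."""
    # Collapse the line break that the YAML layout puts inside the expression.
    expression = " ".join(metric["expression"]["expression"].split())
    # Naive assumption: a dataset is "used" if "<name>." appears in the expression.
    used = [d for d in datasets if f'{d["name"]}.' in expression]
    if len(used) != 1:
        raise ValueError("this sketch handles single-dataset metrics only")
    dataset = used[0]
    # Alias the physical source with the dataset name so the expression resolves.
    return (
        f'SELECT {expression} AS {metric["name"]} '
        f'FROM {dataset["source"]} AS {dataset["name"]}'
    )
```

A real compiler would also resolve relationships between datasets and translate between SQL dialects; the point here is only that one stored definition yields one query text for every consumer.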
Every cloud, one catalog — no long-lived keys
Native S3, GCS, and ADLS within a single catalog, with short-lived credentials minted per principal at access time. Engines never see long-lived keys, and authorization is evaluated at the authoritative metalake.
One catalog, every cloud
MultiSchemeFileIO routes by URI scheme: a default warehouse plus per-scheme backends, fully IRC-compliant. No engine changes when storage moves.
Credential vending
Short-lived, scoped credentials, such as AWS STS tokens, minted at access time on AWS, GCP, and Azure. Supports a zero-trust posture, with no keys baked into pipelines.
Identity passthrough
A JWT identity flows caller → Gravitino → IRC → object-store credential, evaluated at the authoritative metalake.
Federation across boundaries
Remote IRCs are referenced by name and URI. Each keeps its own IAM, RBAC, vending, and audit. Nothing is shared except the request shape.
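Routing by URI scheme, as described for MultiSchemeFileIO, comes down to a lookup keyed on the part of the location before `://`. The sketch below is not the MultiSchemeFileIO API: `pick_backend`, the backend labels, and the example locations are hypothetical, and it only illustrates the "default warehouse plus per-scheme backends" idea.

```python
from urllib.parse import urlparse


def pick_backend(location: str, backends: dict, default: str) -> str:
    """Choose a storage backend from the location's URI scheme.

    `backends` maps a scheme such as "s3" or "gs" to a backend label;
    anything without a registered scheme falls back to `default`.
    """
    scheme = urlparse(location).scheme
    return backends.get(scheme, default)
```

Because the choice depends only on the table's location, moving a table from one object store to another changes which backend serves it without any change on the engine side.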
A governed surface for AI agents
Agents reach governed metadata and unstructured data through one standardized, policy-enforced surface: MCP for metadata, GVFS for files. That surface is held to the same RBAC, FGAC, and audit as every human and engine.
MCP server
The metadata API exposed as MCP tools, running locally or remotely. Connects to Claude Desktop and Cursor out of the box.
Same identity and policy
RBAC, FGAC, masking, and audit all apply. Agents are governed exactly like humans and engines, so an agent gets no broader access than its identity allows.
GVFS for unstructured data
One virtual path masks the real location across S3, GCS, ADLS, and HDFS, vending a scoped credential per request, with the same identity and audit.
Discover and act
List and inspect catalogs, tables, and columns; read and apply tags, all under one policy-enforced surface.
# LLM agent (Claude · Cursor) → MCP server
get_list_of_catalogs()
get_list_of_tables(schema)
get_table_columns_by_fqn(fqn)
list_objects_by_tag("pii")
associate_tag_to_column(col, tag)
# one virtual path for unstructured data
gvfs://fileset/{catalog}/{schema}/{fileset}/path
↳ resolves location · vends scoped credential
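The virtual path layout shown above, `gvfs://fileset/{catalog}/{schema}/{fileset}/path`, can be split into its governed parts before any storage lookup. This is a sketch under assumptions: `FilesetPath` and `parse_gvfs` are hypothetical names, not the GVFS client API, and resolving the real location or vending a credential is out of scope.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class FilesetPath(NamedTuple):
    catalog: str
    schema: str
    fileset: str
    sub_path: str  # path inside the fileset; empty for the fileset root


def parse_gvfs(uri: str) -> FilesetPath:
    """Split a gvfs://fileset/... URI into catalog, schema, fileset, and sub-path."""
    parsed = urlparse(uri)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a gvfs fileset URI: {uri}")
    # At most three splits: everything after the fileset name is the sub-path.
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected catalog/schema/fileset in: {uri}")
    sub_path = parts[3] if len(parts) == 4 else ""
    return FilesetPath(parts[0], parts[1], parts[2], sub_path)
```

The first three components name the governed object that policy and audit attach to; only the trailing sub-path refers to bytes in the underlying store.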
Ready to federate your lakehouse?
Whether your next workload is a Trino query, a Lance training run, or an AI agent asking for data, Datastrato gives you one catalog and one governance plane to handle it, without copying a byte of metadata.