How to use the playground
Playground introduction
Playground is a complete Gravitino Docker runtime environment with Hive
, Hdfs
, Trino
, MySQL
, PostgreSQL
, and Gravitino
server.
Depending on your network, the startup may take 3-5 minutes.
Once the playground environment has started, you can open http://localhost:8090 to access the Gravitino Web UI.
Prerequisite
You should install git and docker-compose.
Start playground
git clone git@github.com:datastrato/gravitino-playground.git
cd gravitino-playground
./launch-playground.sh
Experience Gravitino with Trino SQL
- Login to Gravitino playground Trino Docker container using the following command.
docker exec -it playground-trino bash
- Open Trino CLI in the container.
trino@d2bbfccc7432:/$ trino
Example
Simple queries
Use simple queries to test in the Trino CLI.
SHOW CATALOGS;
CREATE SCHEMA "metalake_demo.catalog_hive".db1
WITH (location = 'hdfs://hive:9000/user/hive/warehouse/db1.db');
SHOW CREATE SCHEMA "metalake_demo.catalog_hive".db1;
CREATE TABLE "metalake_demo.catalog_hive".db1.table_001
(
name varchar,
salary varchar
)
WITH (
format = 'TEXTFILE'
);
INSERT INTO "metalake_demo.catalog_hive".db1.table_001 (name, salary) VALUES ('sam', '11');
SELECT * FROM "metalake_demo.catalog_hive".db1.table_001;
SHOW SCHEMAS from "metalake_demo.catalog_hive";
DESCRIBE "metalake_demo.catalog_hive".db1.table_001;
SHOW TABLES from "metalake_demo.catalog_hive".db1;
Cross-catalog queries
In companies, there may be different departments using different data stacks. In this example, HR department uses Apache Hive to store its data. Sales department uses PostgreSQL to store its data. This example has generated some data for two departments. You can query some interesting results with Gravitino.
If you want to know which employee has the largest sales amount. You can run the SQL.
WITH totalsales AS (
SELECT
employee_id,
SUM(total_amount) AS sales_amount
FROM "metalake_demo.catalog_hive".sales.sales
GROUP BY
employee_id
), rankedemployees AS (
SELECT
employee_id,
sales_amount,
RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
FROM totalsales
)
SELECT
e.employee_id,
given_name,
family_name,
job_title,
sales_amount
FROM rankedemployees AS r
JOIN "metalake_demo.catalog_postgres".hr.employees AS e
ON r.employee_id = e.employee_id
WHERE
sales_rank = 1;
If you want to know top 10 customers who bought the most by state. You run the SQL.
WITH customersales AS (
SELECT
"metalake_demo.catalog_hive".sales.customers.customer_id,
customer_name,
customer_email,
location AS state,
SUM(total_amount) AS total_spent
FROM "metalake_demo.catalog_hive".sales.sales
JOIN "metalake_demo.catalog_hive".sales.customers
ON "metalake_demo.catalog_hive".sales.sales.customer_id = "metalake_demo.catalog_hive".sales.customers.customer_id
JOIN "metalake_demo.catalog_hive".sales.stores
ON "metalake_demo.catalog_hive".sales.sales.store_id = "metalake_demo.catalog_hive".sales.stores.store_id
GROUP BY
"metalake_demo.catalog_hive".sales.customers.customer_id,
customer_name,
customer_email,
location
), rankedcustomersales AS (
SELECT
customer_id,
customer_name,
customer_email,
state,
total_spent,
RANK() OVER (PARTITION BY state ORDER BY total_spent DESC) AS customer_rank
FROM customersales
)
SELECT
customer_id,
customer_name,
customer_email,
state,
total_spent
FROM rankedcustomersales
WHERE
customer_rank <= 10
ORDER BY
state,
customer_rank;
If you want to know that employees average performance rating and total sales. You run the SQL.
set session allow_pushdown_into_connectors=false;
WITH employeeperformance AS (
SELECT
employee_id,
AVG(rating) AS average_rating
FROM "metalake_demo.catalog_postgres".hr.employee_performance
GROUP BY
employee_id
), employeesales AS (
SELECT
employee_id,
SUM(total_amount) AS total_sales
FROM "metalake_demo.catalog_hive".sales.sales
GROUP BY
employee_id
)
SELECT
e.employee_id,
average_rating,
total_sales
FROM employeeperformance AS e
JOIN employeesales AS s
ON e.employee_id = s.employee_id;