aio-starrocks-best-practices

From plugin aio-starrocks

/plugin install aio-starrocks@aiocean-plugins

StarRocks Best Practices

Official best practices from docs.starrocks.io/docs/best_practices/.

Core design philosophy: "Designing for efficiency does more than improve query speed — it decreases costs by reducing storage, CPU, and object storage API costs."

/aio-starrocks-query-tuning — Query performance tuning, EXPLAIN plans, operator metrics, hints
/starrocks — Query syntax, cluster connections, Grafana integration (external skill, not part of this plugin)
/starrocks-expert — General table types, data loading, query optimization (external skill, not part of this plugin)

1. Partitioning

Partitioning enables coarse-grain data pruning via partition elimination AND metadata-only lifecycle operations (TTL, GDPR deletes, tenant isolation).

Partition Key Selection

Time-first — If 80%+ of queries filter by time, lead with date_trunc('day', dt)
Tenant isolation — Include tenant_id when managing data per-tenant
Retention alignment — Include columns you'll purge via DROP PARTITION
Composite keys — Creates #tenants x #days partitions — keep total below ~100K to avoid FE memory strain

Granularity Decision

Granularity	Best For	Advantage	Limitation
Daily	BI/reporting	365 partitions/year; simple TTL	Coarse for sub-day queries
Hourly	IoT/burst workloads	Hot-spot isolation	8,700 partitions/year
Weekly/Monthly	Historical archive	Minimal metadata	Less precise pruning

Sizing Rule

Each partition <= 100GB
<= 20K tablets per partition (across replicas)
Total partitions < 100K (FE memory constraint)
Tablets per BE < 200K

DDL Templates

Single-tenant clickstream:

CREATE TABLE click_stream (
  user_id BIGINT, event_time DATETIME, url STRING, ...)
DUPLICATE KEY(user_id, event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS xxx;

Multi-tenant SaaS (recommended for sales-engine pattern):

CREATE TABLE metrics (
  tenant_id INT, dt DATETIME, metric_name STRING, v DOUBLE)
PRIMARY KEY(tenant_id, dt, metric_name)
PARTITION BY date_trunc('DAY', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS xxx;

Large-tenant composite (when single tenant > 100GB/partition):

CREATE TABLE activity (
  tenant_id INT, dt DATETIME, id BIGINT, ....)
DUPLICATE KEY(dt, id)
PARTITION BY tenant_id, date_trunc('MONTH', dt)
DISTRIBUTED BY HASH(id) BUCKETS xxx;

Partitioning vs Bucketing

Partitions = lifecycle management tools (TTL, DROP PARTITION, GDPR). Enable query-time partition pruning — skip entire data blocks.
Buckets = parallelism levers. Distribute data within partitions for parallel scan/ingest.

2. Table Clustering (Sort Keys)

"A thoughtful sort-key is the highest-leverage physical-design knob in StarRocks."

Why Sort Keys Matter

Sort keys deliver compounding benefits across write, storage, and read:

I/O elimination — Segment and page pruning via min/max metadata skips irrelevant data blocks
Point lookups — Sparse prefix index enables millisecond queries on leading sort columns
Sorted aggregation — Streaming aggregation (2-3x faster) when GROUP BY aligns with sort key
Compression & caching — Sorted data improves encoding efficiency and CPU cache locality

Sort Key Selection Playbook

Decision hierarchy:

Equality columns first — High-cardinality columns with frequent = / IN filters
Range columns second — Timestamps or numeric ranges for temporal/value windows
Aggregation helpers third — GROUP BY columns that enable sorted aggregation

Configuration Rules

Rule	Guidance
Width	3-5 columns max. Wider keys degrade ingest and exhaust the 36-byte prefix-index limit
Cardinality order	Low-cardinality before high-cardinality enhances compression
String columns	Long strings consume prefix-index bytes, blocking subsequent columns from indexing

Reference Templates

Scenario	Partition	Sort Key	Why
B2C Orders	`date_trunc('day', order_ts)`	`(user_id, order_ts)`	User-first filters; then temporal ranges
IoT Telemetry	`date_trunc('day', ts)`	`(device_id, ts)`	Device time-series dominates
SaaS Multi-Tenant	`tenant_id`	`(dt, event_id)`	Tenant isolation; dashboard time clustering
Dimension Lookup	none	`(dim_id)`	Point lookups only

DDL Example

CREATE TABLE telemetry (
  device_id VARCHAR,
  ts DATETIME,
  value DOUBLE
)
ENGINE=OLAP
PRIMARY KEY(device_id, ts)
PARTITION BY date_trunc('day', ts)
DISTRIBUTED BY HASH(device_id) BUCKETS 16
ORDER BY (device_id, ts);

Anti-Patterns

Placing long string columns at the sort-key head (wastes prefix-index bytes)
Overly wide sort keys (>5 columns)
Misaligning partition and sort keys (defeating logical pruning order)

3. Bucketing (Distribution Strategy)

Quick Decision Framework

Scenario	Choice	Rationale
Stable join/filter keys, high cardinality	Hash	Enables pruning, colocated joins, local aggregation
Write-heavy logs/events; multi-tenant	Random	Prevents skew, uniform throughput, elastic growth
Aggregate/Primary Key tables	Hash	Only option for these table types
Duplicate Key tables needing elasticity	Random	Auto-splits when `bucket_size` set

Hash Bucketing

DISTRIBUTED BY HASH(column1, column2) BUCKETS n
PROPERTIES ("colocate_with" = "group_name")

Key requirements:

Key must be stable, evenly distributed, high-cardinality
Cardinality rule: >= 1000x the number of BE nodes to prevent skew
Tablet sizing: Target 1-10 GB per tablet initially
Tablets > 10GB = compaction efficiency degradation

Query optimizations enabled:

-- Tablet pruning: single tablet accessed
SELECT sum(amount) FROM sales WHERE customer_id = 123;

-- Local aggregation: no shuffle phase
SELECT customer_id, sum(amount) FROM sales GROUP BY customer_id;

-- Colocated join: no network shuffle between BEs
SELECT c.region, sum(s.amount)
FROM sales s JOIN customers c USING (customer_id)
WHERE s.sale_date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY c.region;

Colocated join setup (matching key + bucket count required):

CREATE TABLE sales (
  sale_id BIGINT, customer_id INT, sale_date DATE, amount DECIMAL(10,2))
DISTRIBUTED BY HASH(customer_id) BUCKETS 48
PARTITION BY date_trunc('DAY', sale_date)
PROPERTIES ("colocate_with" = "group1");

CREATE TABLE customers (
  customer_id INT, region VARCHAR(32), status TINYINT)
DISTRIBUTED BY HASH(customer_id) BUCKETS 48
PROPERTIES ("colocate_with" = "group1");

Random Bucketing

DISTRIBUTED BY RANDOM
PROPERTIES ("bucket_size" = "1GB")  -- Enables auto-split (v3.2+)

Round-robin row assignment (no hash key)
Auto tablet splitting when partition grows (requires bucket_size)
Growth-only — no shrinking
Limitation: Duplicate Key tables only
Trade-off: No bucket pruning; every query scans all tablets in a partition; no colocated joins

Operational Maintenance

Random: Always set bucket_size (e.g., 1GB) for auto-split
Hash: Monitor tablet size; re-shard before tablets exceed 5-10 GB (ALTER TABLE ... BUCKETS n)
Both: Watch for metadata bloat with excessive tablet counts

Anti-Patterns

Low-cardinality hash keys — Creates hot tablets and imbalanced writes
Undersizing initial buckets — Hampers ingestion parallelism and compaction
Random bucketing for dimensional joins — Eliminates locality optimizations
Ignoring bucket_size in Random mode — Tablets never split; metadata grows unbounded

4. Primary Key Table Tuning

Primary Key Index Types

Type	Recommendation	Notes
Full in-memory	NOT recommended	Excessive memory waste
Local disk persistent	Standard option	Good for shared-nothing
Cloud-native persistent	Recommended for shared-data	Avoids disk capacity constraints, eliminates index rebuilds after rebalancing

Key Design Principles

Focus on uniqueness requirements during import/updates, NOT query acceleration
Minimize column count and size (default max: 128 bytes)
Use ORDER BY clause separately for query optimization via sort keys

Resource Consumption Formula

Storage: (key_size + 8 bytes) x row_count x 50%
Memory: min(l0_max_mem_usage x tablet_count, update_memory_limit_percent x BE_memory)

Memory Management

Monitor: http://be_ip:be_http_port/mem_tracker?type=update

Reduce import memory overhead:

l0_max_mem_usage = <value < 104857600>    # Default 104857600 (100MB)
skip_pk_preload = true
transaction_apply_worker_count = <cpu_cores - n>
transaction_publish_version_worker_count = <cpu_cores - n>

Trade-off: Reduced memory increases I/O; fewer worker threads slow ingestion

Performance Balance

Goal	Configuration
High freshness + low query latency	Increase `compact_threads`; decrease `update_compaction_per_tablet_min_interval_seconds`
Good freshness, limited resources	Raise `lake_ingest_slowdown_threshold` and `lake_compaction_score_upper_bound`
Good latency, limited resources	Reduce write frequency; batch larger data loads

Monitoring

Shared-data: SHOW PROC '/transactions/{db}/running' for compaction slowdown messages
Shared-nothing: Monitor tablet_max_versions threshold before ingestion failures

5. Authentication & Authorization

Three-Layer Access Control

Identity Authentication — "I am who I claim to be" (user verification)
Access Authentication — Group/role-based login eligibility to the cluster
Operation Authorization — Query execution and data access permissions

Authentication Methods

Method	User Storage	Setup	Best For
Native User	In-cluster	Manual creation	Small user bases
Security Integration	External system	Configuration-driven	Large enterprises

Configuration Examples

Native user with external auth (LDAP):

CREATE USER <username> IDENTIFIED WITH authentication_ldap_simple
AS 'uid=tom,ou=company,dc=example,dc=com';

Security integration (LDAP):

CREATE SECURITY INTEGRATION <name> PROPERTIES (
    "type" = "authentication_ldap_simple",
    "authentication_ldap_simple_server_host" = "",
    "authentication_ldap_simple_server_port" = "",
    "authentication_ldap_simple_bind_base_dn" = "",
    "authentication_ldap_simple_user_search_attr" = ""
);

ADMIN SET FRONTEND CONFIG (
    "authentication_chain" = "<security_integration_name>"
);

Group provider + role grants:

CREATE GROUP PROVIDER <name> PROPERTIES (
    "type" = "ldap",
    "ldap_conn_url" = "",
    "ldap_bind_root_dn" = "",
    "ldap_bind_base_dn" = ""
);

GRANT <role> TO EXTERNAL GROUP <group_name>;

Solution Selection

Full external control — Security Integration + Apache Ranger
Minimal setup, native control — Security Integration + Internal RBAC
Legacy — Native users with GRANT statements

Supported protocols: LDAP, OIDC, OAuth 2.0, JWT, native passwords

Critical: User IDs and group names must match across auth, group provider, and authorization systems. Mismatches cause permission failures.

6. Audit Log & Resource Groups

Core principle: Use data-driven resource allocation by analyzing starrocks_audit_db__.starrocks_audit_tbl__ rather than guesswork.

CPU Resource Allocation

Analyze per-user CPU consumption, allocate proportionally:

-- Aggregate cpuCostNs per user, last 30 days
SELECT user, SUM(cpuCostNs) / 1e9 AS cpu_seconds
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE state IN ('EOF', 'OK')
  AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user
ORDER BY cpu_seconds DESC;

Configuration:

exclusive_cpu_cores — Cannot exceed single BE core count; sum across all groups <= BE total
cpu_weight — For soft-isolation groups; determines relative share on remaining cores

Rule of thumb: If a user is 16% of CPU on a 64-core BE, allocate ~11 cores.

Memory Management

-- Peak single-query memory per user
SELECT user, MAX(memCostBytes) / (1024 * 1024) AS peak_mem_mb
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE state IN ('EOF', 'OK')
  AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user;

big_query_mem_limit — Set high to avoid false-positive termination of legitimate large queries
mem_limit — Set high (e.g., 0.9 for 90%)
Per-BE usage ~= total_max_mem_mb / number_of_BEs

Concurrency Control

-- Peak concurrent queries per user per minute
SELECT user, DATE_FORMAT(timestamp, '%Y-%m-%d %H:%i') AS minute_bucket,
       COUNT(*) AS concurrent_queries
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE queryType = 'query' AND state IN ('EOF', 'OK')
  AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user, minute_bucket
ORDER BY concurrent_queries DESC;

concurrency_limit — Set to 1.5x observed peak for headroom
For extreme spikes: enable Query Queues for load smoothing
Minute-level analysis may underestimate per-second spikes

Materialized View Resource Isolation

Prevent async MV refreshes from degrading interactive queries:

CREATE RESOURCE GROUP rg_mv (
    user = 'mv_user',
    query_type IN ('insert', 'select')
)
WITH (
    'cpu_weight' = '32',
    'mem_limit' = '0.9',
    'concurrency_limit' = '10',
    'spill_mem_limit_threshold' = '0.5'
);

-- Assign to MV at creation
CREATE MATERIALIZED VIEW ... PROPERTIES ('resource_group' = 'rg_mv');

-- Or existing MV
ALTER MATERIALIZED VIEW ... SET ("resource_group" = "rg_mv");

Anti-Patterns

Relying on guesswork instead of audit log analysis
Setting exclusive_cpu_cores sum to exceed available BE cores
Using low concurrency_limit without headroom buffer
Applying low mem_limit that terminates legitimate queries
Allowing MV refreshes to share resources with interactive workloads

Decision Matrix — Table Design by Use Case

Decision	Single-Tenant Fact	Multi-Tenant SaaS	IoT/Events	Dimension Lookup
Partition	`date_trunc('day', ts)`	`date_trunc('DAY', dt)`	`date_trunc('day', ts)`	None
Sort Key	`(user_id, ts)`	`(tenant_id, dt)`	`(device_id, ts)`	`(dim_id)`
Distribution	HASH(user_id)	HASH(tenant_id)	RANDOM w/ bucket_size	HASH(dim_id)
Bucket Type	Hash (colocate joins)	Hash (tenant filter pruning)	Random (prevent skew)	Hash
PK Index	Local disk persistent	Cloud-native (elastic)	Cloud-native	Local disk
Achievement	Query on user; streaming agg	Tenant isolation; fast TTL	Write throughput; device clustering	Millisecond lookups

/00. Claude Plugins

Folders

Articles

/03. aio-starrocks-best-practices

StarRocks Best Practices