/03. aio-starrocks-query-tuning

PLUGINS

aio-starrocks-query-tuning

From plugin aio-starrocks · v1.0.2 · Install: /plugin install aio-starrocks@aiocean-plugins

StarRocks Query Tuning

Complete reference from docs.starrocks.io/docs/best_practices/query_tuning/.

  • /aio-starrocks-best-practices — Table design: partitioning, bucketing, sort keys, PK tuning
  • /starrocks — Query syntax, cluster connections, Grafana integration
  • /starrocks-expert — General table types, data loading, materialized views

1. Tuning Methodology — Top-Down Diagnostic

Five-step process, always in order:

  1. Identify — Use monitoring, query history, audit logs to detect slow queries or resource bottlenecks
  2. Collect & AnalyzeEXPLAIN / EXPLAIN ANALYZE + Query Profile for detailed metrics
  3. Locate Root Cause — Pinpoint problematic operators: join order, missing indexes, data distribution, SQL inefficiencies
  4. Apply Tuning — SQL rewriting, schema optimization, query hints, session variables
  5. Validate & Iterate — Rerun, compare metrics, review plans, continue optimizing

2. Query Plans — EXPLAIN Variants

Commands

CommandWhat It ShowsWhen to Use
EXPLAIN LOGICALSimplified logical planQuick overview
EXPLAINBasic physical planStandard analysis
EXPLAIN VERBOSEDetailed physical plan with extensive infoDeep investigation
EXPLAIN COSTSPhysical plan + cost estimates per operationCost comparison
EXPLAIN ANALYZEExecutes query + actual runtime statisticsProduction diagnosis
-- Quick check
EXPLAIN SELECT shop_id, SUM(net_sales) FROM sales WHERE shop_id = 123 GROUP BY shop_id;

-- Deep investigation
EXPLAIN VERBOSE SELECT ...;

-- Actual execution stats (runs the query!)
EXPLAIN ANALYZE SELECT ...;

Plan Hierarchy

Plans are structured in 3 levels — read bottom-up starting from scan nodes:

  1. Fragment — Top-level work units distributed to BEs. Spawn FragmentInstances.
  2. Pipeline — Chain of operators within a fragment. Concurrent PipelineDrivers.
  3. Operator — Atomic execution steps: scan, join, aggregate, sort, exchange, etc.

What to Look For

  • Total runtime and memory/CPU ratios
  • Filter pushdown — are predicates pushed to scan level?
  • Data skew — uneven row counts across fragments
  • Join strategy — Broadcast vs Shuffle vs Colocate vs Bucket Shuffle
  • Exchange nodes — data movement costs between BEs
  • Aggregation/sorting — are they expensive relative to total?

Execution Phases

  1. Planning (FE) — Parse → Analyze → Optimize → Generate plan
  2. Scheduling (FE) — Distribute plan to BEs
  3. Execution (BE) — Pipeline engine processes the plan

3. Query Profile — Enabling & Accessing

Enable Profiling

-- Per session
SET enable_profile = true;

-- Global
SET GLOBAL enable_profile = true;

Slow Query Profiling (production-safe)

Avoid overhead by only profiling slow queries:

-- Only profile queries > 30 seconds
SET GLOBAL big_query_profile_threshold = '30s';

-- Supports: ms, s, m
SET GLOBAL big_query_profile_threshold = '500ms';

Runtime Profile (v3.1+, for long-running queries)

Collects data at fixed intervals during execution:

-- Default: 10 seconds. Adjust:
SET runtime_profile_report_interval = 30;

Configuration Reference

ParameterScopeDefaultPurpose
enable_profileSessionfalseActivate profiling
pipeline_profile_levelSession11=merged metrics, 2=retain structure
runtime_profile_report_intervalSession10Seconds between runtime reports
big_query_profile_thresholdSession0sDuration threshold for auto-profiling
enable_statistics_collect_profileFE DynamicfalseProfile statistics collection queries

Accessing Profiles

-- Get last query ID
SELECT last_query_id();

-- List recent queries
SHOW PROFILELIST;

-- Detailed profile for specific query
SELECT get_query_profile('<query_id>');

-- Full analysis with bottleneck highlighting
ANALYZE PROFILE FROM '<query_id>';

Web UI: http://<fe_ip>:<fe_http_port> -> Queries -> Finished Queries


4. Text-Based Profile Analysis

ANALYZE PROFILE

-- List all queries (finished, failed, or running 10+ seconds)
SHOW PROFILELIST;

-- Detailed analysis
ANALYZE PROFILE FROM '<query_id>';

Summary section shows:

  • QueryID, version, status, total time
  • Memory usage
  • Top 10 CPU consuming nodes
  • Top 10 memory consuming nodes
  • Non-default session variables

Fragments section shows per-node:

  • Time, memory, cost estimates, output rows
  • Red highlighting: nodes exceeding 30% of total time
  • Pink highlighting: nodes in 15-30% range

EXPLAIN ANALYZE

-- Executes and profiles simultaneously
EXPLAIN ANALYZE SELECT ...;

-- Also works for INSERT (aborts transaction to prevent data changes)
EXPLAIN ANALYZE INSERT INTO ... SELECT ...;

Limitation: INSERT ANALYZE only supported for default catalog tables.

Runtime Profile Indicators

  • ? (not started)
  • (executing)
  • (completed)

Progress: operators finished / total operators and per-operator rows processed / total rows

Tip: Use MyCLI instead of mysql client for proper ANSI color rendering.


5. Tuning Recipes — Symptom-to-Fix Playbook

Fast Diagnosis Workflow

  1. Initial scan — Check execution overview:
    • Memory usage > 80%?
    • Spill bytes > 1GB?
  2. Identify bottleneck — Sort operators by time percentage, find the slowest
  3. Match signature — Confirm the specific bottleneck type, then apply fix

Recipe: Scan Bottleneck

Symptoms:

  • High BytesRead, IOTaskExecTime — cold/slow storage
  • Low PushdownPredicates, high ExprFilterRows — missing filter pushdown
  • Elevated IOTaskWaitTime — thread-pool saturation
  • Uneven tablet row counts — data skew
  • Many small segments — segment fragmentation

Fixes:

  • Enable Data Cache for cold storage
  • Simplify predicates to enable pushdown (avoid functions on partition/sort columns)
  • Add bloom filter or bitmap indexes
  • Rebalance bucketing (increase buckets or change hash key)
  • Trigger manual compaction: ALTER TABLE ... COMPACT;

Recipe: Aggregation Bottleneck

Symptoms:

  • High-cardinality GROUP BY causing hash table bloat
  • Shuffle skew across fragments
  • State-heavy functions (HLL, BITMAP, COUNT DISTINCT)
  • Degraded partial aggregation (PassThroughRowCount high in auto-mode)

Fixes:

  • Enable sorted streaming aggregation (align GROUP BY with sort key)
  • Create roll-up materialized views for common aggregations
  • Cast wide keys to integers
  • Pre-compute sketches (HLL, BITMAP) at ingestion

Recipe: Join Bottleneck

Symptoms:

  • Oversized build side exceeding memory
  • Cache-inefficient probe operations
  • Shuffle skew on join keys
  • Accidental broadcast of large tables
  • Missing runtime filters

Fixes:

  • Swap probe/build tables (smaller table on build side)
  • Pre-filter data before join
  • Enable hash spilling (SET enable_spill = true;)
  • Adjust broadcast threshold or force shuffle via hint
  • Check runtime filter effectiveness in profile

Recipe: Network Exchange Bottleneck

Symptoms:

  • NetworkTime > 30% of total with large BytesSent
  • Receiver backlog (thread pool constraints)

Fixes:

-- Enable network compression
SET transmission_compression_type = 'zstd';
  • Reduce data volume before exchange (filter earlier, pre-aggregate)
  • Check for unnecessary shuffles — use colocate joins where possible

Recipe: Sort/Merge/Window Bottleneck

Symptoms:

  • Spilling when MaxBufferedBytes > 2GB
  • High merge time relative to total

Fixes:

  • Add LIMIT clause when possible
  • Pre-aggregate data before sorting
  • Increase sort_spill_threshold
  • Align window PARTITION BY with table sort key

Memory Quick Reference

ThresholdMetricAction
> 80% BE memoryQueryPeakMemoryUsagePerNodeLower exec_mem_limit or add RAM
SpillBytes > 0QuerySpillBytesUpgrade to SR 3.2+ or increase memory

Post-Mortem Template

Document every tuning: symptom -> root cause -> fix applied -> quantified outcome


6. Schema Tuning

Table Type Selection

TypeWhen to UseKey Property
DUPLICATE KEYRaw data logging, no pre-aggregationAllows duplicate rows
AGGREGATE KEYPre-aggregated analytics (SUM, MIN, MAX, REPLACE)Aggregates on load
UNIQUE KEYFrequently updated datasetsNew overwrites old
PRIMARY KEYReal-time updates with ACID semanticsStrongest uniqueness guarantee

Flat Table vs Star Schema

ApproachProsCons
Flat (denormalized)Extreme query concurrency, lowest latencyExpensive dimension maintenance, high storage, sorting overhead during load
Star schemaFlexible multi-table queries, easier maintenanceJoin overhead at query time

Rule: Use flat tables for extreme concurrency/latency requirements. Star schema for flexibility.

Colocate Tables

CREATE TABLE ... PROPERTIES ("colocate_with" = "group_name");

Groups tables by bucketing column for local joins without network transfer. Matching key + bucket count required.

Partition Strategy

Time-based RANGE partitions provide:

  • Clear hot/cold data distinction
  • Tiered storage optimization (SSD + SATA with storage_cooldown_time)
  • Efficient partition-based deletion

Bucket Strategy

  • Use high-cardinality columns to prevent skew
  • Target: 100MB-1GB compressed per bucket
  • Always explicitly specify columns — avoid random bucketing for analytical tables

Index Optimization

Sparse Index (Prefix Index)

  • Granularity: 1024 rows
  • Fixed prefix size: 36 bytes
  • Place high-frequency filter fields first in schema
  • Critical: VARCHAR field truncates the index — always place VARCHAR last in sparse index

Bloom Filter Index

PROPERTIES ("bloom_filter_columns" = "column1, column2")
  • Best for high-cardinality columns
  • Enables placing VARCHAR fields earlier when needed (compensates for sparse index limitation)

Bitmap Index

CREATE INDEX idx_status ON table(status) USING BITMAP;
  • Best for low-cardinality columns (gender, city, status)
  • Applicable to: Duplicate Key tables and key columns of Aggregate/Unique Key tables

Materialized Views (Rollups)

Use cases:

  • Aggregate specific column combinations different from base table sort key
  • Optimize prefix index coverage for different query patterns
  • Reorder columns to match common WHERE clause patterns

Schema Change Types

TypeOperationData Impact
SortedDrop columns, reorder dataFull data rewrite
DirectModify column typesData transformation, no reorder
LinkedAdd columnsStructure-only, no data transformation

Anti-pattern: Minimize sorted schema changes through careful initial design.


7. Query Hints

SET_VAR — Session Variable Hints

Override session variables for a single query:

SELECT /*+ SET_VAR(key=value [, key=value]) */ ...

Common uses:

-- Force streaming preaggregation
SELECT /*+ SET_VAR(streaming_preaggregation_mode='force_streaming') */
  shop_id, SUM(net_sales) FROM sales GROUP BY shop_id;

-- Set query timeout
SELECT /*+ SET_VAR(query_timeout=60) */ ...;

-- Enable spill
SELECT /*+ SET_VAR(enable_spill=true) */ ...;

Limitation: SET_VAR in CTE's SELECT clause does NOT take effect.

SET_USER_VARIABLE — Cache Subquery Results (v3.2.4+)

Avoid repeated scalar subquery execution:

SELECT /*+ SET_USER_VARIABLE(@threshold = (SELECT AVG(amount) FROM orders)) */
  * FROM orders WHERE amount > @threshold;

Limitation: Cannot be used in CREATE MATERIALIZED VIEW or CREATE VIEW.

Join Hints

Force specific join strategies:

-- Force shuffle join (avoid accidental broadcast of large table)
SELECT * FROM large_table a JOIN [SHUFFLE] medium_table b ON a.id = b.id;

-- Force broadcast (small dimension table)
SELECT * FROM fact_table a JOIN [BROADCAST] dim_table b ON a.dim_id = b.id;

-- Force bucket shuffle (when bucketing key matches join key)
SELECT * FROM sales a JOIN [BUCKET] customers b ON a.customer_id = b.customer_id;

-- Force colocate join (pre-distributed colocated tables)
SELECT * FROM sales a JOIN [COLOCATE] customers b ON a.customer_id = b.customer_id;

-- Preserve original join order (disable reorder)
SELECT * FROM a JOIN [UNREORDER] b ON a.id = b.id;

Critical: When a Join hint is used, the optimizer does NOT perform Join Reorder. You take full control.

Verify hint effectiveness:

EXPLAIN SELECT * FROM a JOIN [SHUFFLE] b ON a.id = b.id;
-- Check DistributionMode in output

8. Operator Metrics Reference

Summary-Level Metrics

MetricWhat It Tells You
Total durationEnd-to-end query time
Query StateSuccess/failure/running
Default DB, SQL, Session varsQuery context

Planner Metrics

Covers parsing, analyzing, transforming, optimizing phases.

Concern threshold: Planner time > 10ms warrants investigation. Common causes:

  • Complex queries with many joins
  • Numerous materialized views to evaluate
  • External table metadata fetching

Execution Overview Metrics

CategoryKey MetricNormal Threshold
MemoryPeak consumption< 80% BE capacity
CPUCumulative CPU timeRelative to query complexity
NetworkExchange network timeLow relative to total
ScanIO time aggregatedDepends on data volume
Disk SpillSpillBytes< 1GB
ScheduleSchedule time< 1s for simple queries

Pipeline-Level Metrics

Core relationship:

DriverTotalTime = ActiveTime + PendingTime + ScheduleTime
MetricMeaning
ActiveTimeActual operator execution time
PendingTimeBlocking time (InputEmpty, OutputFull, PreconditionBlock, PendingFinish)
ScheduleTimeQueue-to-execution wait

Diagnosis: If PendingTime dominates, check which sub-reason:

  • InputEmpty — upstream operator is slow
  • OutputFull — downstream operator is blocked
  • PreconditionBlock — waiting for dependency (e.g., build side of hash join)
  • PendingFinish — waiting for other pipelines to complete

OLAP Scan Operator Metrics

MetricMeaningTuning Signal
Table / RollupWhich table/MV is scannedVerify correct MV selection
TabletCountNumber of tablets scannedHigh = missing partition pruning
BytesReadTotal bytes readHigh = missing filter pushdown
CompressedBytesReadCompressed bytes from storageStorage I/O indicator
RowsReadRows after filteringCompare with RawRowsRead
RawRowsReadRows before filteringHigh ratio to RowsRead = filter not pushed down
CachedPagesNumPages from cacheLow = cold data, enable Data Cache
ReadPagesNumTotal pages readBaseline for cache hit ratio
ScanTimeTotal scan durationPrimary scan bottleneck metric
IOTaskExecTimeI/O execution timeHigh = slow storage
IOTaskWaitTimeI/O queue wait timeHigh = thread-pool saturation
PeakIOTasksMax concurrent I/O tasksThread pool capacity
PeakChunkBufferSizeMax chunk bufferMemory pressure indicator

Key ratios:

  • RawRowsRead / RowsRead — Filter efficiency. High ratio = predicates not pushed down
  • CachedPagesNum / ReadPagesNum — Cache hit ratio. Low = enable Data Cache

Connector Scan Operator (External Tables)

Same metrics as OLAP Scan, plus:

  • DataSourceType — Identifies source (Iceberg, Hive, Hudi, Delta)

Exchange Sink Metrics

MetricMeaning
Serialization timeEncoding overhead
Hash timeShuffle key computation
Compression metricsNetwork compression efficiency
RPC countsNumber of network calls
Network bandwidthData transfer rate
ThroughputMessages per second

Passthrough optimization: When data is colocated, "short-circuit logic" skips network transfer entirely.

Exchange Source Metrics

MetricMeaning
Decompression timeDecoding overhead
Deserialization timeMessage parsing
Lock waiting timeContention indicator

Bottleneck patterns:

  • Broadcast joins with suboptimal plans — large table broadcast
  • Shuffle aggregation/join with large tables — excessive network transfer

Aggregate Operator Metrics

MetricMeaningTuning Signal
Hash table sizeNumber of groupsHigh = high-cardinality GROUP BY
Hash table memoryMemory consumedApproaching limits = consider MV
PassThroughRowCountRows in streaming modeHigh = auto-mode degraded to streaming, partial agg ineffective
Result construction timeFinal result buildUsually not bottleneck

Join Operator Metrics

MetricMeaningTuning Signal
DistributionModeBROADCAST / PARTITIONED / COLOCATEVerify expected strategy
JoinTypeINNER / LEFT / RIGHT / etc.
Hash table bucket statsDistribution qualitySkew = hot buckets
Build phase timeHash table constructionHigh = large build side
Probe phase timeHash table lookupsHigh = cache-inefficient
Conjunct evaluation timeJoin predicate computeHigh = complex predicates
Runtime filter constructionFilter build timeMissing = add hint

Sort Operator Metrics

MetricMeaningTuning Signal
SortTypeFull sort vs top-Ntop-N is much cheaper
MaxBufferedBytesPeak memory> 2GB = spilling likely
MaxBufferedRowsPeak rows buffered
Sorted run countNumber of sorted runsHigh = many merge passes
Building / Merging / Sorting / Output timesStage breakdownIdentify dominant stage

Window Function Operator Metrics

MetricMeaning
ProcessModeMaterializing/Streaming x Cumulative/RemovableCumulative/ByDefinition
Partition/peer group boundary searchesPartitioning overhead
Peak buffered rowsMemory pressure
Unused row removal countEfficiency of streaming mode

Merge Operator Stages

Execution progresses through: Init -> Prepare -> Process -> SplitChunk -> FetchChunk -> Pending -> Finished

Each stage has per-stage counts and times. Late materialization buffering metrics available.

OlapTableSink (INSERT) Metrics

MetricMeaningTuning Signal
RowsReadInput rows
RowsFilteredRejected rowsHigh = data quality issue
RowsReturnedSuccessfully written
PushChunkNum per nodeChunks sent to each BELarge differences = data skew
RpcClientSideTime vs RpcServerSideTimeNetwork overheadClient >> Server = enable compression

Project Operator

Computes expressions (calculations, casts, etc.). If expensive expressions exist, this can take significant time. Check:

  • Expression computation time
  • Common sub-expression timing

LocalExchange Operator

Types: Passthrough, Partition, or Broadcast. Metrics:

  • Peak memory, buffer size, chunk counts, per-chunk metrics