Query Optimizer
TurboLynx uses ORCA, a Cascades-framework cost-based optimizer originally developed at Pivotal for Greenplum/HAWQ, adapted for graph queries.
Pipeline
Cypher String
│
▼
ANTLR4 Parser → RegularQuery AST
│
▼
TurboLynx Binder → BoundRegularQuery
(type resolution, catalog lookup, property binding)
│
▼
Cypher2OrcaConverter → ORCA Logical Plan (CExpression tree)
│
▼
ORCA Optimizer → Physical Plan
(cost-based join ordering, operator selection, GEM)
│
▼
Pipeline Executor
Binder
The TurboLynx Binder (src/binder/binder.cpp) resolves Cypher AST nodes against the graph catalog:
- Maps node/rel variables to partitions and graphlets (via catalog OIDs)
- Resolves property names to
key_idvalues - Populates
BoundNodeExpressionandBoundRelExpressionwith all property columns needed for execution - Synthesizes inline node/rel property filters (
{k:v}→ equality predicates)
Cypher2OrcaConverter
The converter (src/converter/cypher2orca_converter.cpp) translates BoundRegularQuery into an ORCA CExpression logical tree:
- Each label scan becomes a
CLogicalGetwrapped in aCLogicalProjectColumnar - Multi-graphlet labels use
CLogicalUnionAllto combine per-graphlet scans - Edge traversals are modeled as joins:
CLogicalInnerJoin(index join) orCLogicalNAryJoin - WHERE predicates become
CLogicalSelect - Projections and aggregates map to
CLogicalProjectColumnarandCLogicalGbAgg
ORCA Overview
ORCA uses the Cascades optimization framework:
- Memo — A compact representation of the space of equivalent logical plans (expression groups)
- Exploration — Transformation rules enumerate alternative logical plans
- Implementation — Logical operators are mapped to physical operators
- Optimization — Cost-based selection of the best physical plan given statistics
GEM Algorithm
GEM (Graph-aware Enumeration via Memo) is an ORCA extension specific to TurboLynx.
When a label scan expands to a UNION ALL across multiple graphlets, a naive optimizer treats the entire union as a single relation. GEM instead pushes joins below the UNION ALL boundary — each graphlet group gets its own join subtree with its own cost-optimal join order. This is especially effective on schemaless graphs where graphlets have very different row counts and column distributions.
Join Strategies
| Strategy | Trigger | Description |
|---|---|---|
| Index join | Selective edge traversal | Uses forward/backward adjacency list; fast for low-fanout patterns |
| Hash join | Large intermediate results | General-purpose; ORCA default fallback |
| Merge join | Ordered inputs | Can be disabled with --disable-merge-join |
Statistics
ORCA uses column-level statistics for cost estimation. Statistics are stored as histograms (numeric, string, validity) and are built with the analyze command:
Without statistics, ORCA falls back to heuristic cardinality estimates.
Shell Flags
# Run a single query
./tools/turbolynx --workspace /path/to/db --query "MATCH (n:Person) RETURN n.firstName LIMIT 10;"
# Force join strategy
./tools/turbolynx --workspace /path/to/db --index-join-only --query "..."
./tools/turbolynx --workspace /path/to/db --hash-join-only --query "..."
# Disable merge join
./tools/turbolynx --workspace /path/to/db --disable-merge-join --query "..."
# Print ORCA internal trace
./tools/turbolynx --workspace /path/to/db --debug-orca --query "..."
# Print selected physical plan without executing
./tools/turbolynx --workspace /path/to/db --explain --query "..."
# Compile only (no execution)
./tools/turbolynx --workspace /path/to/db --compile-only --query "..."