Query Optimizer

TurboLynx uses ORCA, a Cascades-framework cost-based optimizer originally developed at Pivotal for Greenplum/HAWQ, adapted for graph queries.

Pipeline

Cypher String
     │
     ▼
  ANTLR4 Parser  →  RegularQuery AST
     │
     ▼
  TurboLynx Binder  →  BoundRegularQuery
     (type resolution, catalog lookup, property binding)
     │
     ▼
  Cypher2OrcaConverter  →  ORCA Logical Plan (CExpression tree)
     │
     ▼
  ORCA Optimizer  →  Physical Plan
     (cost-based join ordering, operator selection, GEM)
     │
     ▼
  Pipeline Executor

Binder

The TurboLynx Binder (src/binder/binder.cpp) resolves Cypher AST nodes against the graph catalog:

Maps node/rel variables to partitions and graphlets (via catalog OIDs)
Resolves property names to key_id values
Populates BoundNodeExpression and BoundRelExpression with all property columns needed for execution
Synthesizes inline node/rel property filters ({k:v} → equality predicates)

Cypher2OrcaConverter

The converter (src/converter/cypher2orca_converter.cpp) translates BoundRegularQuery into an ORCA CExpression logical tree:

Each label scan becomes a CLogicalGet wrapped in a CLogicalProjectColumnar
Multi-graphlet labels use CLogicalUnionAll to combine per-graphlet scans
Edge traversals are modeled as joins: CLogicalInnerJoin (index join) or CLogicalNAryJoin
WHERE predicates become CLogicalSelect
Projections and aggregates map to CLogicalProjectColumnar and CLogicalGbAgg

ORCA Overview

ORCA uses the Cascades optimization framework:

Memo — A compact representation of the space of equivalent logical plans (expression groups)
Exploration — Transformation rules enumerate alternative logical plans
Implementation — Logical operators are mapped to physical operators
Optimization — Cost-based selection of the best physical plan given statistics

GEM Algorithm

GEM (Graph-aware Enumeration via Memo) is an ORCA extension specific to TurboLynx.

When a label scan expands to a UNION ALL across multiple graphlets, a naive optimizer treats the entire union as a single relation. GEM instead pushes joins below the UNION ALL boundary — each graphlet group gets its own join subtree with its own cost-optimal join order. This is especially effective on schemaless graphs where graphlets have very different row counts and column distributions.

Join Strategies

Strategy	Trigger	Description
Index join	Selective edge traversal	Uses forward/backward adjacency list; fast for low-fanout patterns
Hash join	Large intermediate results	General-purpose; ORCA default fallback
Merge join	Ordered inputs	Can be disabled with `--disable-merge-join`

Statistics

ORCA uses column-level statistics for cost estimation. Statistics are stored as histograms (numeric, string, validity) and are built with the analyze command:

./tools/turbolynx --workspace /path/to/db
TurboLynx >> .analyze
TurboLynx >> .exit

Without statistics, ORCA falls back to heuristic cardinality estimates.

Shell Flags

# Run a single query
./tools/turbolynx --workspace /path/to/db --query "MATCH (n:Person) RETURN n.firstName LIMIT 10;"

# Force join strategy
./tools/turbolynx --workspace /path/to/db --index-join-only --query "..."
./tools/turbolynx --workspace /path/to/db --hash-join-only  --query "..."

# Disable merge join
./tools/turbolynx --workspace /path/to/db --disable-merge-join --query "..."

# Print ORCA internal trace
./tools/turbolynx --workspace /path/to/db --debug-orca --query "..."

# Print selected physical plan without executing
./tools/turbolynx --workspace /path/to/db --explain --query "..."

# Compile only (no execution)
./tools/turbolynx --workspace /path/to/db --compile-only --query "..."