CTEs In PySpark Are Not Effective

Source: Why I Stopped Using CTEs Everywhere in SQL by Namratha Shenoy

The Problem with CTEs

CTEs (Common Table Expressions) appear elegant and modular, but in most SQL engines including Spark SQL, they create significant performance issues because they are not physically materialized. They’re only logical aliases, meaning the query inside a CTE is recomputed every time it’s referenced.

Why This Hurts Performance

When you reference the same CTE multiple times in a query, Spark re-runs that subquery repeatedly, leading to:

Increased computation overhead
Redundant data scanning
Higher shuffle and I/O costs

Example of problematic pattern:

WITH active_users AS (
  SELECT user_id
  FROM events
  WHERE event_type = 'login'
)
SELECT COUNT(*) FROM active_users;           -- computed once
SELECT AVG(session_time) FROM active_users;  -- computed again!

The active_users CTE executes twice - once for each reference.

Better Alternatives

1. Temporary Views - Logical Breakpoints

Temp views provide the same modularity as CTEs but are reusable across multiple queries in the same session:

CREATE OR REPLACE TEMP VIEW active_users AS
SELECT user_id
FROM events
WHERE event_type = 'login';

Characteristics:

Logical only (not physically stored)
Reusable across queries in same Spark session
Perfect for breaking large transformations into steps
Not cached unless explicitly cached

2. Temporary Tables - Physical Snapshots

For intermediate results that need to persist physically:

CREATE OR REPLACE TEMPORARY TABLE active_users AS
SELECT user_id
FROM events
WHERE event_type = 'login';

Use when you need:

Materialized snapshot to avoid recomputation
Ability to inspect intermediate data
Debugging or comparing runs

3. Cache Tables - For High Reuse

When using the same data repeatedly:

CACHE TABLE active_users;

Spark stores it in memory (or disk), making subsequent queries significantly faster by skipping recomputation entirely.

Handling Data Changes

Important consideration: If underlying data changes frequently, cached views hold a snapshot and won’t update automatically, leading to stale data.

Solutions:

Manual refresh:

UNCACHE TABLE active_users;
CACHE TABLE active_users;

Lazy cache with refresh:

CACHE LAZY TABLE active_users;
REFRESH TABLE active_users;

Don’t cache if data changes too frequently - Spark’s optimizer handles recomputation efficiently

Decision Matrix

Use Case	Recommended Approach	Reason
Logic clarity & readability	CTEs	Simple, single-use transformations
Reusable intermediate layers	Temp Views	Multiple references, same session
Physical inspection needed	Temp Tables	Debugging, materialized results
Stable, frequently accessed data	Cache Tables	Performance optimization
Frequently changing data	No caching	Avoid stale data issues

Key Takeaway

CTEs are excellent for readability but not for performance in distributed systems. Understanding what gets recomputed, cached, or materialized can save hours of processing time in production pipelines.

Rule of thumb: Use CTEs for logic clarity, not for performance optimization.

Rituraj's Garden

Graph View

Explorer