geomermaids · PostGIS spatial joins, translated to DuckDB
I've been struggling quite a bit with spatial joins in DuckDB. Recipes that felt natural after 25 years of PostGIS suddenly hit wall after wall when I tried to apply them to the Duck. Not that it doesn't work, but the Duck has to be treated a certain way, since he's not as geo-aware as PostGIS can be. So I put together a small cookbook to help you avoid the same pitfalls.
PostGIS is twenty years mature: GiST is woven into the PostgreSQL planner, so the database
thinks spatially as a first-class citizen: you write the join, the index is used,
you never think about it. DuckDB's spatial extension is young and moving fast,
but it is still an extension bolted onto a columnar core: the spatial index is not woven into
join planning the same way, and there are sharp caveats. This page takes the spatial joins
you already write in PostGIS and translates each one to DuckDB, query plan beside
query plan, against the same FEMA data, so you can see exactly where the two agree
and where DuckDB makes you drive by hand. The results are identical on both engines
throughout; only the plans differ.
The two tables
Every example joins two tables: flood_zones, US-wide flood-hazard
polygons published by FEMA, each with a flood risk score (0 = minimal,
25 = 0.2%-annual zone, 35 = 1%-annual zone) and reference_points, a small
table of points we want to score (5 sitting inside real zones, 2 in open country). PostGIS
holds ~5.9M polygons; the DuckDB copy is the same data subdivided into ~14.9M finer polygons,
stored as a .ddb (~76 GB with its R-tree). Same data, fair fight.
First divergence: the index you forget vs the index you choose
In PostGIS you build a GiST index once and never think about it again. The planner reaches for it in every query shape, single-row or join, containment or distance:
-- PostGIS: build once, the planner uses it everywhere, forever
CREATE INDEX flood_zones_geom_idx ON flood_zones USING GIST (geometry);
DuckDB has an R-tree too, but two things are different, and they are the whole story of this
page. First, you must create it explicitly. Second, DuckDB gives you two unrelated
spatial execution paths rather than one planner-integrated index: an
RTREE_INDEX_SCAN that only fires for a constant geometry, and a
SPATIAL_JOIN operator that builds a throwaway R-tree at run time and streams one
whole side. Which one you get depends on exactly how you phrase the query, and the optimizer
will not rescue a bad phrasing the way PostgreSQL does.
-- DuckDB: same idea, but only consulted in some query shapes (see below)
CREATE INDEX flood_zones_geom_idx ON flood_zones USING RTREE (geometry); 1. Point-in-polygon: which zone is each point in?
The workshop's first spatial join ("what neighborhood is this station in?",
PostGIS Intro · Spatial Joins)
is a containment join, written as a bare ST_Contains / ST_Intersects in the ON clause. You write the same idiomatic SQL on both engines; nothing in the query mentions an index. The whole difference lives in the plan underneath, and
the bbox prefilter you never type is PostGIS's doing, visible only in EXPLAIN.
PostGIS
SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
ON ST_Intersects(f.geometry, r.geom)
GROUP BY r.id; GroupAggregate (rows=7)
└─ Nested Loop (rows=7)
├─ Seq Scan reference_points r (rows=7)
└─ Index Scan flood_zones_geom_idx (×7)
Index Cond: geometry && r.geom
Filter: ST_Intersects(…)
Rows Removed by Filter: 3 DuckDB
SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
ON ST_Intersects(f.geometry, r.geom)
GROUP BY r.id; PERFECT_HASH_GROUP_BY
└─ SPATIAL_JOIN Join Type: INNER
│ ST_Intersects(f.geometry, r.geom)
├─ SEQ_SCAN flood_zones ~14.9M (streamed)
└─ SEQ_SCAN reference_points 7 (→ R-tree) PostGIS ~23 ms vs DuckDB ~155 s (same answer, 7 points (~6,800× slower). PostGIS: 7 indexed probes. DuckDB: scans all ~14.9M polygons.)
Identical results:
id | PostGIS | DuckDB
----+---------+--------
1 | 35 | 35 ← inside a 1% (high-risk) zone
2 | 35 | 35
3 | 35 | 35
4 | 25 | 25 ← inside a 0.2% zone
5 | 25 | 25
101 | 0 | 0 ← open country
102 | 0 | 0
Read the two plans. You wrote ST_Intersects on both. PostGIS's planner turned it into a GiST-driven Nested Loop from the small side: a
Nested Loop that, for each of the 7 points, does a GiST Index Scan
on the 5.9M-polygon table. The bbox prefilter is the planner's, shown as Index Cond: geometry && r.geom (you never typed &&); it then rechecks
the exact geometry as a Filter (Rows Removed by Filter: 3,
the loose bbox, then the tight recheck). Seven index probes, ~23 ms.
DuckDB's SPATIAL_JOIN builds its R-tree on the right input (the 7
points) and streams all ~14.9M polygons through it. It never touches the
persistent R-tree in a join. For 7 probe points that is a full-table scan,
~155 s.
The shape of the difference matters more than the one number, and it runs through this whole
page: PostGIS's cost grows with the number of points (one indexed probe
each), while DuckDB's SPATIAL_JOIN cost is roughly flat: it
scans the big table once whether you give it 7 points or 7 million. At 7 points that flat
~155 s scan is a catastrophe next to 23 ms; at millions of points it becomes the
winning plan (§6). The rest of the sections are variations on this.
Does join order matter? No. Writing it the other way around
(FROM reference_points r JOIN flood_zones f …) produces a byte-identical plan and
the same ~155 s: DuckDB's optimizer reorders the SPATIAL_JOIN to build its
R-tree on the smaller table regardless of how you write it, exactly as PostGIS's planner
ignores written order. (When both sides are large and similar, the build-side choice does
start to matter; see the cheat-sheet.)
2. The one query shape that does use DuckDB's index
So why could PostGIS probe its index 7 times while DuckDB scanned everything? Not a quirk; it's structural. PostgreSQL's planner treats GiST as an access path it can use anywhere, including "for each row of the join, probe the index" (the nested-loop-index-scan you just saw). DuckDB's optimizer does not plug the persistent R-tree into joins at all. That R-tree fires in exactly one situation: when you compare the geometry column to a single constant geometry. Watch it work: this is the one shape where the two engines do the same thing:
PostGIS · one point
SELECT score FROM flood_zones
WHERE ST_Intersects(geometry,
ST_SetSRID(ST_MakePoint(-88.61,33.50),4326)); Index Scan flood_zones_geom_idx (rows=1)
Index Cond: geometry && <point>
Filter: ST_Intersects(geometry, <point>) DuckDB · one point
SELECT score FROM flood_zones
WHERE ST_Intersects(geometry,
ST_GeomFromText('POINT(-88.61 33.50)')); PROJECTION
└─ FILTER ST_Intersects(geometry, <point>)
└─ RTREE_INDEX_SCAN Index: flood_zones_geom_idx PostGIS ~1.1 ms vs DuckDB ~0.2 ms (a tie. One constant geometry is the single shape where both reach for the index.)
That's the key: DuckDB's index works perfectly for one constant point at a time. What it won't do is what PostGIS did in §1: drive that index from the rows of a join. So the fork PostgreSQL hides becomes yours to make:
- A handful of points? Don't write a join. Run one constant-geometry
query per point (each an
RTREE_INDEX_SCAN, sub-millisecond) from your application loop, orUNION ALLa few. You get PostGIS-like speed because each is the indexed lookup above. - Millions of points? Use the
SPATIAL_JOIN(§6). Its flat full-table scan stops mattering once the probe set is huge: that one scan is amortised across all of them.
PostGIS makes this choice for you (per-probe index for small inputs, a different plan for large ones); in DuckDB it's a decision you own, and getting it wrong costs orders of magnitude with no warning.
3. Within a distance: ST_DWithin
The workshop's distance join, "census blocks within 200 m of a subway line",
ON ST_DWithin(census.geom, subways.geom, 200). In PostGIS ST_DWithin
expands the search box (ST_Expand) and rides the GiST index, no extra syntax from you. In DuckDB
ST_DWithin only accelerates inside the join operator, and only after you
reproject both sides to a metric CRS (EPSG:5070), because ST_DWithin's threshold is in the
geometry's own units, degrees in EPSG:4326. The same is true in PostGIS: on lon/lat data the
degree threshold above is only approximate (0.0015° is ~140-170 m at this latitude); for an
exact metre cutoff you cast ::geography or reproject, exactly as DuckDB must. The
points here sit well inside or well outside the zones, so the small threshold difference does
not change which zone each falls in, and both engines return the same rows.
PostGIS
SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
ON ST_DWithin(f.geometry, r.geom, 0.0015) -- degrees: ~140-170 m here
GROUP BY r.id; Nested Loop
├─ Seq Scan reference_points r
└─ Index Scan flood_zones_geom_idx (rows=3 ×7)
Index Cond: geometry && ST_Expand(r.geom, 0.0015)
Filter: ST_DWithin(…)
Rows Removed by Filter: 2 DuckDB
WITH z AS (SELECT score,
ST_Transform(geometry,'EPSG:4326','EPSG:5070',always_xy:=true) g FROM flood_zones),
p AS (SELECT id,
ST_Transform(geom,'EPSG:4326','EPSG:5070',always_xy:=true) g FROM reference_points)
SELECT p.id, MAX(z.score) AS score
FROM z JOIN p ON ST_DWithin(z.g, p.g, 125) -- exact metres
GROUP BY p.id; PERFECT_HASH_GROUP_BY
└─ SPATIAL_JOIN Join Type: INNER
│ ST_DWithin(z.g, p.g, 125)
├─ SEQ_SCAN flood_zones (reprojected 5070)
└─ SEQ_SCAN reference_points (reprojected) PostGIS ~36 ms vs DuckDB ~154 s (same answer, 7 points (~4,200×). Same constant-vs-linear story as §1.)
PostGIS rides the same GiST index it always uses, just with an expanded search box. DuckDB
gets its SPATIAL_JOIN, but only because the join condition is a
single spatial predicate on metric geometries. Phrase it as a single-row
WHERE ST_DWithin(geometry, <point>, …) instead of a join and DuckDB drops to
a SEQ_SCAN (no index): ST_DWithin simply does not trigger the R-tree
outside the join operator. PostGIS uses the index in either shape.
4. The trap that proves the thesis: refining in a WHERE
Here is a query every PostGIS user writes: filter cheaply with the index, then refine with the exact distance. Run the same query pattern on both engines and watch what happens.
PostGIS · folds both predicates
SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r ON ST_DWithin(f.geometry, r.geom, 0.0015)
WHERE ST_Distance(f.geometry, r.geom) <= 0.0015 -- exact refine
GROUP BY r.id; Nested Loop
└─ Index Scan flood_zones_geom_idx (rows=3 ×7)
Index Cond: geometry && ST_Expand(r.geom, 0.0015)
Filter: ST_DWithin(…)
AND ST_Distance(…) <= 0.0015
Rows Removed by Filter: 2 ← still indexed, still fast DuckDB · gives up the operator
SELECT r.id, MAX(z.score) AS score
FROM z JOIN p ON ST_DWithin(z.g, p.g, 125)
WHERE ST_Distance(z.g, p.g) <= 125 -- second spatial predicate
GROUP BY r.id; PERFECT_HASH_GROUP_BY
└─ BLOCKWISE_NL_JOIN Join Type: INNER ← nested loop!
│ (R-tree dropped: 2 spatial predicates)
├─ SEQ_SCAN flood_zones ~14.9M
└─ SEQ_SCAN reference_points PostGIS ~41 ms vs DuckDB ~160 s (same pattern. The extra predicate barely changes PostGIS; it makes DuckDB abandon the operator.)
This is the maturity gap in one query. PostgreSQL folds both spatial predicates into
one indexed nested loop: the && ST_Expand still drives the GiST
scan, the two functions ride along as filters. DuckDB sees a second spatial predicate and
abandons the SPATIAL_JOIN entirely, falling back to
BLOCKWISE_NL_JOIN, a row-by-row nested loop over 14.9M rows. The rule:
exactly one spatial predicate per join in DuckDB; a non-spatial
WHERE (e.g. score > 0) is fine, a second spatial one is not. And
you do not need the refine: DuckDB's metric ST_DWithin is already exact (no
planar/geography split to recheck). Drop the WHERE and the operator returns.
5. Nearest neighbour: the <-> operator has no equivalent
"The single nearest zone." PostGIS has an indexed KNN operator: GiST streams rows in distance
order, so LIMIT 1 stops after one. DuckDB has neither the operator nor the
early-stop.
PostGIS · indexed KNN
SELECT score FROM flood_zones
ORDER BY geometry <-> ST_MakePoint(-88.61,33.50)
LIMIT 1; Limit (rows=1)
└─ Index Scan flood_zones_geom_idx (rows=1)
Order By: geometry <-> <point> ← GiST, distance-ordered DuckDB · full scan, no early stop
SELECT score FROM flood_zones
ORDER BY ST_Distance(geometry, ST_GeomFromText('POINT(-88.61 33.50)'))
LIMIT 1; TOP_N (Order By #0 ASC)
└─ SEQ_SCAN flood_zones ~14.9M
(ST_Distance computed for every row;
LIMIT does not short-circuit; issue #20113) PostGIS ~0.25 ms vs DuckDB ~149 s (same answer (~600,000×). PostGIS stops at the first index row; DuckDB computes the distance for all ~14.9M.)
No <-> KNN operator exists on DuckDB's R-tree, and worse,
ORDER BY ST_Distance(...) LIMIT 1 evaluates the distance for every
row regardless of the LIMIT, a reproduced bug,
duckdb/duckdb#20113.
The workaround is bound-then-rank: narrow with the indexable
ST_Intersects(geometry, ST_Buffer(<point>, r)) first (an
RTREE_INDEX_SCAN returning a handful of rows), then
ORDER BY ST_Distance(...) LIMIT 1 over that tiny set.
6. Where DuckDB wins: big against big
5.4M points × ~9.9M polygons: DuckDB ~96 s vs PostGIS = 5.4M indexed probes (now the flat scan wins. DuckDB's ~constant cost finally beats PostGIS's per-point linear cost.)
The thesis is not "PostGIS always wins". It is "PostGIS is integrated and forgiving; DuckDB
is powerful but you must drive it." There is one regime where DuckDB is decisively the right
tool: when both sides are large. Streaming one big side once and probing a run-time
R-tree is exactly what you want when the probe set is also huge. Scoring 5.4M points
against the ~9.9M risk-bearing polygons (the real batch enrichment this came from)
runs as a single SPATIAL_JOIN in ~96 s on one machine. The
same workload in PostGIS would be 5.4M nested-loop index probes; the columnar, set-based
SPATIAL_JOIN wins handily. (The mechanics are in DuckDB's own
spatial joins announcement.)
Few points → PostGIS-style per-probe index. Millions of points → DuckDB's join. Knowing which
is which is the whole skill.
7. A reader's recipe: clip to one region by inlining the geometry
Hagen Hübel, a reader from Berlin, wrote in with a case this page had not
covered: clip a big GeoParquet of points down to a single country
polygon. The PostGIS reflex is a
containment join, JOIN region r ON ST_Contains(r.geom, ST_Point(lon, lat)), or
pulling the clip box from a CTE. On DuckDB that reflex is slow for two reasons at once: it
plans as the §1 SPATIAL_JOIN and streams the whole file, and a
bbox computed at run time (from a CTE or subquery) never reaches the Parquet reader's
statistics, so nothing prunes and ST_Contains fires on every row.
Hagen's fix, which holds up under test: resolve the region once, then inline it as literal constants, a literal bbox and a literal polygon. No join, no subquery, and both problems vanish at the same time.
-- Stage 1: resolve the region to plain values (run once, cheap)
SELECT ST_AsHEXWKB(geom) AS wkb,
ST_XMin(geom) AS xmin, ST_YMin(geom) AS ymin,
ST_XMax(geom) AS xmax, ST_YMax(geom) AS ymax
FROM regions WHERE name = 'Canada';
-- Stage 2: splice the literals straight in (no join, no CTE)
SELECT count(*)
FROM read_parquet('points/*.parquet')
WHERE lon BETWEEN -141.0 AND -52.3 -- literal bbox: prunes row groups + pre-filters
AND lat BETWEEN 41.7 AND 83.3
AND ST_Contains(ST_GeomFromHEXWKB('0106...'), -- literal polygon: exact clip
ST_Point(lon, lat)); Same clip, two ways. The whole difference lives in the plan:
DuckDB · the join reflex
SELECT count(*)
FROM read_parquet('points/*.parquet') p
JOIN regions r ON ST_Contains(r.geom, ST_Point(p.lon,p.lat))
WHERE r.name = 'Canada'; SPATIAL_JOIN Join Type: INNER
├─ SEQ_SCAN read_parquet (every row group)
└─ SEQ_SCAN regions 1 (→ R-tree on the one polygon)
(runtime bbox: no row-group pruning;
ST_Contains on every point) DuckDB · inlined literals
SELECT count(*)
FROM read_parquet('points/*.parquet')
WHERE lon BETWEEN -141.0 AND -52.3
AND lat BETWEEN 41.7 AND 83.3
AND ST_Contains(ST_GeomFromHEXWKB('0106...'),
ST_Point(lon, lat)); FILTER ST_Contains(<const poly>, ST_Point(lon,lat))
└─ PARQUET_SCAN points/*.parquet
Filters: lon>=-141.0 AND lon<=-52.3
AND lat>=41.7 AND lat<=83.3
(literal bounds prune row groups by stats;
ST_Contains only on the survivors) 50M global points → Canada (GeoParquet): join ~41 s · CTE bbox ~41 s · inlined literals ~2.3 s (shuffled) → ~0.6 s (sorted) (same 1.77M rows. ~17× from the literals, another ~4× once the file prunes.)
And the difference is not only speed, it is whether the query runs at all. The
SPATIAL_JOIN is a blocking, non-spilling operator: it materialises
its build side and its R-tree in memory and never spills to disk. With a large or
complex clip polygon (or a big input the planner puts on the build side), that buffer
can exceed RAM, and the query does not slow down, it OOM-kills. Hagen's production case is the
sharp version: the same clip ran for 30 minutes and died at 32 GiB as a join, then finished
in ~2 seconds once the literals were inlined. Inlining sidesteps the join
entirely, so it is not just the fastest option, it is the safest one: the
difference between "completes" and "cannot run at all."
Why it wins, part one: predicate ordering. Take the worst case first, a
shuffled file where no row group can be pruned (every group's lon/lat min-max spans the
whole planet, so all 499/499 are read). The literal version still runs in 2.3 s against
41 s for both the join and the CTE bbox. The literal bbox is a cheap range test the
optimizer applies first, so the costly ST_Contains only ever sees the ~5.7% of
points inside the box. A runtime bbox cannot be reordered ahead of the polygon test the same
way, so ST_Contains fires on all 50M. (And note what the slow path is not:
the scan is parallel either way. The cost is the predicate, not single-threading.)
Part two: row-group pruning, and it depends on the file. Sort the GeoParquet by
lat, lon (or a space-filling curve) and the same literal query reads just 116 of
499 row groups and drops to 0.6 s; leave it shuffled and the bbox stats overlap everything
and prune nothing. The extra speedup is real but it lives in how the file was written,
not in the query (see the GeoParquet cookbook).
bbox for speed, polygon for correctness. The literal bbox alone is the fastest thing here, but it is a rectangle: Canada's box also keeps a slab of the northern US, ~61% more points than the true polygon (2.85M vs 1.77M). Keep both predicates, the bbox to cut the search down to a few row groups, the polygon to make the clip exact.
When the data is indexed, the fork flips with size. Against a real flood
dataset of ~14.9M polygons stored as a .ddb with an R-tree, the literal polygon
becomes a constant-geometry predicate, the one shape that fires the index (§2), so the plan is
an RTREE_INDEX_SCAN. But that scan is single-threaded and its cost grows with the
clip size, while the SPATIAL_JOIN is parallel and roughly flat. So inlining wins a
small clip and loses a large one:
14.9M polygons → one US state (.ddb + R-tree):
Montana: join ~150 s vs inlined ~35 s ·
Texas: join ~160 s vs inlined ~394 s (inlined count matches the join exactly. Small clip → inline the geometry; continent-scale clip → let the parallel join run.)
Translation cheat-sheet
| PostGIS | DuckDB | Watch out |
|---|---|---|
| GiST index, used in every shape automatically | RTREE index, used only for a constant-geometry predicate | Build it explicitly; it is ignored in joins |
| Nested Loop + Index Scan (few outer rows) | loop constant-geometry RTREE_INDEX_SCAN | no planner-chosen index nested loop |
| Nested Loop / hash for big joins | SPATIAL_JOIN (streams the left, R-tree on the right) | put the compact side on the right; it must fit in memory (non-spilling: a large side OOM-kills) |
bbox prefilter applied automatically (shown as && in EXPLAIN) | implied inside ST_Intersects | neither engine makes you write && by hand |
ST_DWithin(geom, g, m) (indexed anywhere; ::geography for metres) | ST_DWithin only in a join, only reprojected to a metric CRS | single-row ST_DWithin = SEQ_SCAN; use ST_Intersects(ST_Buffer(...)) |
index filter + exact refine in WHERE | none (forces BLOCKWISE_NL_JOIN) | one spatial predicate per join, total |
geom <-> pt ORDER BY … LIMIT k (indexed KNN) | bound with ST_Buffer, then ORDER BY ST_Distance | no <->; LIMIT doesn't short-circuit (#20113) |
clip to a region: JOIN region ON ST_Contains(region, pt) | resolve the region once, inline a literal bbox + ST_GeomFromHEXWKB(…) polygon (§7) | a runtime bbox never prunes; literal bounds prune only if the file is spatially sorted; bbox loose, polygon exact |
Pitfalls: the maturity gap
- No automatic spatial index. DuckDB full-scans until you
CREATE INDEX … USING RTREE, and even then only for constant-geometry predicates. ST_DWithinin a single-rowWHERE→SEQ_SCAN. UseST_Intersects(geom, ST_Buffer(<const>, r))for a radius lookup.- A second spatial predicate (refine in
WHERE, or compoundON) →BLOCKWISE_NL_JOIN. One spatial predicate per join. - Degrees vs metres. A radius in EPSG:4326 is degrees; reproject to a
metric CRS (EPSG:5070 in the US) for the join. PostGIS just casts
::geography. ST_Distance_Spherereads(lat, lon), not(lon, lat); wrap inST_FlipCoordinatesor reproject.- No KNN.
ORDER BY ST_Distance LIMITscans everything (#20113). - A runtime bbox (from a CTE or subquery) does not prune Parquet row groups. Inline literal bounds so the reader can use column stats, and sort the file spatially or nothing prunes anyway (§7).
- The
SPATIAL_JOINis non-spilling. It buffers its build side and R-tree in memory; a large or complex clip polygon (or a big build side) can exceed RAM and OOM-kill the query rather than slowing down. Inlining a literal polygon avoids the join, so it is the safest choice for a large clip (§7). - Reading a
.ddbacross DuckDB versions can report "Corrupt database file"; the native format is storage-version-sensitive. Pin the version.
Sources
- PostGIS Intro Workshop, "Spatial Joins". the reference for
ST_Contains/ST_Intersects/ST_DWithinjoins translated here. - DuckDB, "Spatial Joins in DuckDB". the
SPATIAL_JOINoperator and its run-time R-tree. - DuckDB issue #20113.
ORDER BY <function>+LIMITevaluates the function for all rows.
Plans captured with EXPLAIN (ANALYZE) on PostgreSQL 15 / PostGIS 3.3 and DuckDB 1.4.x, both against the same FEMA flood data. Results identical on both engines; only the plans differ.
Next
To produce files these queries prune cheaply, see the GeoParquet Writing cookbook and the Reading cookbook. The index covers shared setup and publishing; the background is on the Cloud-Native Geospatial page.