geomermaids · PostGIS spatial joins, translated to DuckDB


I've been struggling quite a bit with spatial joins in DuckDB. Recipes that felt natural after 25 years of PostGIS suddenly hit wall after wall when I tried to apply them to the Duck. Not that it doesn't work, but the Duck has to be treated a certain way, since he's not as geo-aware as PostGIS can be. So I put together a small cookbook to help you avoid the same pitfalls.

PostGIS is twenty years mature: GiST is woven into the PostgreSQL planner, so the database thinks spatially as a first-class citizen: you write the join, the index is used, you never think about it. DuckDB's spatial extension is young and moving fast, but it is still an extension bolted onto a columnar core: the spatial index is not woven into join planning the same way, and there are sharp caveats. This page takes the spatial joins you already write in PostGIS and translates each one to DuckDB, query plan beside query plan, against the same FEMA data, so you can see exactly where the two agree and where DuckDB makes you drive by hand. The results are identical on both engines throughout; only the plans differ.

The two tables

Every example joins two tables: flood_zones, US-wide flood-hazard polygons published by FEMA, each with a flood risk score (0 = minimal, 25 = 0.2%-annual zone, 35 = 1%-annual zone) and reference_points, a small table of points we want to score (5 sitting inside real zones, 2 in open country). PostGIS holds ~5.9M polygons; the DuckDB copy is the same data subdivided into ~14.9M finer polygons, stored as a .ddb (~76 GB with its R-tree). Same data, fair fight.

First divergence: the index you forget vs the index you choose

In PostGIS you build a GiST index once and never think about it again. The planner reaches for it in every query shape, single-row or join, containment or distance:

-- PostGIS: build once, the planner uses it everywhere, forever
CREATE INDEX flood_zones_geom_idx ON flood_zones USING GIST (geometry);

DuckDB has an R-tree too, but two things are different, and they are the whole story of this page. First, you must create it explicitly. Second, DuckDB gives you two unrelated spatial execution paths rather than one planner-integrated index: an RTREE_INDEX_SCAN that only fires for a constant geometry, and a SPATIAL_JOIN operator that builds a throwaway R-tree at run time and streams one whole side. Which one you get depends on exactly how you phrase the query, and the optimizer will not rescue a bad phrasing the way PostgreSQL does.

-- DuckDB: same idea, but only consulted in some query shapes (see below)
CREATE INDEX flood_zones_geom_idx ON flood_zones USING RTREE (geometry);

1. Point-in-polygon: which zone is each point in?

The workshop's first spatial join ("what neighborhood is this station in?", PostGIS Intro · Spatial Joins) is a containment join, written as a bare ST_Contains / ST_Intersects in the ON clause. You write the same idiomatic SQL on both engines; nothing in the query mentions an index. The whole difference lives in the plan underneath, and the bbox prefilter you never type is PostGIS's doing, visible only in EXPLAIN.

PostGIS

SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
  ON ST_Intersects(f.geometry, r.geom)
GROUP BY r.id;
GroupAggregate                       (rows=7)
└─ Nested Loop                       (rows=7)
   ├─ Seq Scan reference_points r    (rows=7)
   └─ Index Scan flood_zones_geom_idx (×7)
        Index Cond: geometry && r.geom
        Filter: ST_Intersects(…)
        Rows Removed by Filter: 3

DuckDB

SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
  ON ST_Intersects(f.geometry, r.geom)
GROUP BY r.id;
PERFECT_HASH_GROUP_BY
└─ SPATIAL_JOIN   Join Type: INNER
   │  ST_Intersects(f.geometry, r.geom)
   ├─ SEQ_SCAN flood_zones   ~14.9M  (streamed)
   └─ SEQ_SCAN reference_points  7   (→ R-tree)

PostGIS ~23 ms  vs  DuckDB ~155 s (same answer, 7 points (~6,800× slower). PostGIS: 7 indexed probes. DuckDB: scans all ~14.9M polygons.)

Identical results:

id  | PostGIS | DuckDB
----+---------+--------
 1  |   35    |   35     ← inside a 1% (high-risk) zone
 2  |   35    |   35
 3  |   35    |   35
 4  |   25    |   25     ← inside a 0.2% zone
 5  |   25    |   25
101 |    0    |    0     ← open country
102 |    0    |    0

Read the two plans. You wrote ST_Intersects on both. PostGIS's planner turned it into a GiST-driven Nested Loop from the small side: a Nested Loop that, for each of the 7 points, does a GiST Index Scan on the 5.9M-polygon table. The bbox prefilter is the planner's, shown as Index Cond: geometry && r.geom (you never typed &&); it then rechecks the exact geometry as a Filter (Rows Removed by Filter: 3, the loose bbox, then the tight recheck). Seven index probes, ~23 ms. DuckDB's SPATIAL_JOIN builds its R-tree on the right input (the 7 points) and streams all ~14.9M polygons through it. It never touches the persistent R-tree in a join. For 7 probe points that is a full-table scan, ~155 s.

The shape of the difference matters more than the one number, and it runs through this whole page: PostGIS's cost grows with the number of points (one indexed probe each), while DuckDB's SPATIAL_JOIN cost is roughly flat: it scans the big table once whether you give it 7 points or 7 million. At 7 points that flat ~155 s scan is a catastrophe next to 23 ms; at millions of points it becomes the winning plan (§6). The rest of the sections are variations on this.

Does join order matter? No. Writing it the other way around (FROM reference_points r JOIN flood_zones f …) produces a byte-identical plan and the same ~155 s: DuckDB's optimizer reorders the SPATIAL_JOIN to build its R-tree on the smaller table regardless of how you write it, exactly as PostGIS's planner ignores written order. (When both sides are large and similar, the build-side choice does start to matter; see the cheat-sheet.)

2. The one query shape that does use DuckDB's index

So why could PostGIS probe its index 7 times while DuckDB scanned everything? Not a quirk; it's structural. PostgreSQL's planner treats GiST as an access path it can use anywhere, including "for each row of the join, probe the index" (the nested-loop-index-scan you just saw). DuckDB's optimizer does not plug the persistent R-tree into joins at all. That R-tree fires in exactly one situation: when you compare the geometry column to a single constant geometry. Watch it work: this is the one shape where the two engines do the same thing:

PostGIS · one point

SELECT score FROM flood_zones
WHERE ST_Intersects(geometry,
        ST_SetSRID(ST_MakePoint(-88.61,33.50),4326));
Index Scan flood_zones_geom_idx      (rows=1)
  Index Cond: geometry && <point>
  Filter: ST_Intersects(geometry, <point>)

DuckDB · one point

SELECT score FROM flood_zones
WHERE ST_Intersects(geometry,
        ST_GeomFromText('POINT(-88.61 33.50)'));
PROJECTION
└─ FILTER ST_Intersects(geometry, <point>)
   └─ RTREE_INDEX_SCAN  Index: flood_zones_geom_idx

PostGIS ~1.1 ms  vs  DuckDB ~0.2 ms (a tie. One constant geometry is the single shape where both reach for the index.)

That's the key: DuckDB's index works perfectly for one constant point at a time. What it won't do is what PostGIS did in §1: drive that index from the rows of a join. So the fork PostgreSQL hides becomes yours to make:

PostGIS makes this choice for you (per-probe index for small inputs, a different plan for large ones); in DuckDB it's a decision you own, and getting it wrong costs orders of magnitude with no warning.

3. Within a distance: ST_DWithin

The workshop's distance join, "census blocks within 200 m of a subway line", ON ST_DWithin(census.geom, subways.geom, 200). In PostGIS ST_DWithin expands the search box (ST_Expand) and rides the GiST index, no extra syntax from you. In DuckDB ST_DWithin only accelerates inside the join operator, and only after you reproject both sides to a metric CRS (EPSG:5070), because ST_DWithin's threshold is in the geometry's own units, degrees in EPSG:4326. The same is true in PostGIS: on lon/lat data the degree threshold above is only approximate (0.0015° is ~140-170 m at this latitude); for an exact metre cutoff you cast ::geography or reproject, exactly as DuckDB must. The points here sit well inside or well outside the zones, so the small threshold difference does not change which zone each falls in, and both engines return the same rows.

PostGIS

SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r
  ON ST_DWithin(f.geometry, r.geom, 0.0015)   -- degrees: ~140-170 m here
GROUP BY r.id;
Nested Loop
├─ Seq Scan reference_points r
└─ Index Scan flood_zones_geom_idx   (rows=3 ×7)
     Index Cond: geometry && ST_Expand(r.geom, 0.0015)
     Filter: ST_DWithin(…)
     Rows Removed by Filter: 2

DuckDB

WITH z AS (SELECT score,
   ST_Transform(geometry,'EPSG:4326','EPSG:5070',always_xy:=true) g FROM flood_zones),
     p AS (SELECT id,
   ST_Transform(geom,'EPSG:4326','EPSG:5070',always_xy:=true) g FROM reference_points)
SELECT p.id, MAX(z.score) AS score
FROM z JOIN p ON ST_DWithin(z.g, p.g, 125)    -- exact metres
GROUP BY p.id;
PERFECT_HASH_GROUP_BY
└─ SPATIAL_JOIN   Join Type: INNER
   │  ST_DWithin(z.g, p.g, 125)
   ├─ SEQ_SCAN flood_zones  (reprojected 5070)
   └─ SEQ_SCAN reference_points (reprojected)

PostGIS ~36 ms  vs  DuckDB ~154 s (same answer, 7 points (~4,200×). Same constant-vs-linear story as §1.)

PostGIS rides the same GiST index it always uses, just with an expanded search box. DuckDB gets its SPATIAL_JOIN, but only because the join condition is a single spatial predicate on metric geometries. Phrase it as a single-row WHERE ST_DWithin(geometry, <point>, …) instead of a join and DuckDB drops to a SEQ_SCAN (no index): ST_DWithin simply does not trigger the R-tree outside the join operator. PostGIS uses the index in either shape.

4. The trap that proves the thesis: refining in a WHERE

Here is a query every PostGIS user writes: filter cheaply with the index, then refine with the exact distance. Run the same query pattern on both engines and watch what happens.

PostGIS · folds both predicates

SELECT r.id, MAX(f.score) AS score
FROM flood_zones f
JOIN reference_points r ON ST_DWithin(f.geometry, r.geom, 0.0015)
WHERE ST_Distance(f.geometry, r.geom) <= 0.0015   -- exact refine
GROUP BY r.id;
Nested Loop
└─ Index Scan flood_zones_geom_idx   (rows=3 ×7)
     Index Cond: geometry && ST_Expand(r.geom, 0.0015)
     Filter: ST_DWithin(…)
             AND ST_Distance(…) <= 0.0015
     Rows Removed by Filter: 2        ← still indexed, still fast

DuckDB · gives up the operator

SELECT r.id, MAX(z.score) AS score
FROM z JOIN p ON ST_DWithin(z.g, p.g, 125)
WHERE ST_Distance(z.g, p.g) <= 125        -- second spatial predicate
GROUP BY r.id;
PERFECT_HASH_GROUP_BY
└─ BLOCKWISE_NL_JOIN   Join Type: INNER   ← nested loop!
   │  (R-tree dropped: 2 spatial predicates)
   ├─ SEQ_SCAN flood_zones   ~14.9M
   └─ SEQ_SCAN reference_points

PostGIS ~41 ms  vs  DuckDB ~160 s (same pattern. The extra predicate barely changes PostGIS; it makes DuckDB abandon the operator.)

This is the maturity gap in one query. PostgreSQL folds both spatial predicates into one indexed nested loop: the && ST_Expand still drives the GiST scan, the two functions ride along as filters. DuckDB sees a second spatial predicate and abandons the SPATIAL_JOIN entirely, falling back to BLOCKWISE_NL_JOIN, a row-by-row nested loop over 14.9M rows. The rule: exactly one spatial predicate per join in DuckDB; a non-spatial WHERE (e.g. score > 0) is fine, a second spatial one is not. And you do not need the refine: DuckDB's metric ST_DWithin is already exact (no planar/geography split to recheck). Drop the WHERE and the operator returns.

5. Nearest neighbour: the <-> operator has no equivalent

"The single nearest zone." PostGIS has an indexed KNN operator: GiST streams rows in distance order, so LIMIT 1 stops after one. DuckDB has neither the operator nor the early-stop.

PostGIS · indexed KNN

SELECT score FROM flood_zones
ORDER BY geometry <-> ST_MakePoint(-88.61,33.50)
LIMIT 1;
Limit                                (rows=1)
└─ Index Scan flood_zones_geom_idx   (rows=1)
     Order By: geometry <-> <point>   ← GiST, distance-ordered

DuckDB · full scan, no early stop

SELECT score FROM flood_zones
ORDER BY ST_Distance(geometry, ST_GeomFromText('POINT(-88.61 33.50)'))
LIMIT 1;
TOP_N   (Order By #0 ASC)
└─ SEQ_SCAN flood_zones   ~14.9M
   (ST_Distance computed for every row;
    LIMIT does not short-circuit; issue #20113)

PostGIS ~0.25 ms  vs  DuckDB ~149 s (same answer (~600,000×). PostGIS stops at the first index row; DuckDB computes the distance for all ~14.9M.)

No <-> KNN operator exists on DuckDB's R-tree, and worse, ORDER BY ST_Distance(...) LIMIT 1 evaluates the distance for every row regardless of the LIMIT, a reproduced bug, duckdb/duckdb#20113. The workaround is bound-then-rank: narrow with the indexable ST_Intersects(geometry, ST_Buffer(<point>, r)) first (an RTREE_INDEX_SCAN returning a handful of rows), then ORDER BY ST_Distance(...) LIMIT 1 over that tiny set.

6. Where DuckDB wins: big against big

5.4M points × ~9.9M polygons: DuckDB ~96 s  vs  PostGIS = 5.4M indexed probes (now the flat scan wins. DuckDB's ~constant cost finally beats PostGIS's per-point linear cost.)

The thesis is not "PostGIS always wins". It is "PostGIS is integrated and forgiving; DuckDB is powerful but you must drive it." There is one regime where DuckDB is decisively the right tool: when both sides are large. Streaming one big side once and probing a run-time R-tree is exactly what you want when the probe set is also huge. Scoring 5.4M points against the ~9.9M risk-bearing polygons (the real batch enrichment this came from) runs as a single SPATIAL_JOIN in ~96 s on one machine. The same workload in PostGIS would be 5.4M nested-loop index probes; the columnar, set-based SPATIAL_JOIN wins handily. (The mechanics are in DuckDB's own spatial joins announcement.) Few points → PostGIS-style per-probe index. Millions of points → DuckDB's join. Knowing which is which is the whole skill.

7. A reader's recipe: clip to one region by inlining the geometry

Hagen Hübel, a reader from Berlin, wrote in with a case this page had not covered: clip a big GeoParquet of points down to a single country polygon. The PostGIS reflex is a containment join, JOIN region r ON ST_Contains(r.geom, ST_Point(lon, lat)), or pulling the clip box from a CTE. On DuckDB that reflex is slow for two reasons at once: it plans as the §1 SPATIAL_JOIN and streams the whole file, and a bbox computed at run time (from a CTE or subquery) never reaches the Parquet reader's statistics, so nothing prunes and ST_Contains fires on every row.

Hagen's fix, which holds up under test: resolve the region once, then inline it as literal constants, a literal bbox and a literal polygon. No join, no subquery, and both problems vanish at the same time.

-- Stage 1: resolve the region to plain values (run once, cheap)
SELECT ST_AsHEXWKB(geom) AS wkb,
       ST_XMin(geom) AS xmin, ST_YMin(geom) AS ymin,
       ST_XMax(geom) AS xmax, ST_YMax(geom) AS ymax
FROM regions WHERE name = 'Canada';

-- Stage 2: splice the literals straight in (no join, no CTE)
SELECT count(*)
FROM read_parquet('points/*.parquet')
WHERE lon BETWEEN -141.0 AND -52.3        -- literal bbox: prunes row groups + pre-filters
  AND lat BETWEEN   41.7 AND  83.3
  AND ST_Contains(ST_GeomFromHEXWKB('0106...'),   -- literal polygon: exact clip
                  ST_Point(lon, lat));

Same clip, two ways. The whole difference lives in the plan:

DuckDB · the join reflex

SELECT count(*)
FROM read_parquet('points/*.parquet') p
JOIN regions r ON ST_Contains(r.geom, ST_Point(p.lon,p.lat))
WHERE r.name = 'Canada';
SPATIAL_JOIN   Join Type: INNER
├─ SEQ_SCAN read_parquet   (every row group)
└─ SEQ_SCAN regions  1     (→ R-tree on the one polygon)
   (runtime bbox: no row-group pruning;
    ST_Contains on every point)

DuckDB · inlined literals

SELECT count(*)
FROM read_parquet('points/*.parquet')
WHERE lon BETWEEN -141.0 AND -52.3
  AND lat BETWEEN   41.7 AND  83.3
  AND ST_Contains(ST_GeomFromHEXWKB('0106...'),
                  ST_Point(lon, lat));
FILTER  ST_Contains(<const poly>, ST_Point(lon,lat))
└─ PARQUET_SCAN  points/*.parquet
     Filters: lon>=-141.0 AND lon<=-52.3
              AND lat>=41.7 AND lat<=83.3
     (literal bounds prune row groups by stats;
      ST_Contains only on the survivors)

50M global points → Canada (GeoParquet): join ~41 s · CTE bbox ~41 s · inlined literals ~2.3 s (shuffled) → ~0.6 s (sorted) (same 1.77M rows. ~17× from the literals, another ~4× once the file prunes.)

And the difference is not only speed, it is whether the query runs at all. The SPATIAL_JOIN is a blocking, non-spilling operator: it materialises its build side and its R-tree in memory and never spills to disk. With a large or complex clip polygon (or a big input the planner puts on the build side), that buffer can exceed RAM, and the query does not slow down, it OOM-kills. Hagen's production case is the sharp version: the same clip ran for 30 minutes and died at 32 GiB as a join, then finished in ~2 seconds once the literals were inlined. Inlining sidesteps the join entirely, so it is not just the fastest option, it is the safest one: the difference between "completes" and "cannot run at all."

Why it wins, part one: predicate ordering. Take the worst case first, a shuffled file where no row group can be pruned (every group's lon/lat min-max spans the whole planet, so all 499/499 are read). The literal version still runs in 2.3 s against 41 s for both the join and the CTE bbox. The literal bbox is a cheap range test the optimizer applies first, so the costly ST_Contains only ever sees the ~5.7% of points inside the box. A runtime bbox cannot be reordered ahead of the polygon test the same way, so ST_Contains fires on all 50M. (And note what the slow path is not: the scan is parallel either way. The cost is the predicate, not single-threading.)

Part two: row-group pruning, and it depends on the file. Sort the GeoParquet by lat, lon (or a space-filling curve) and the same literal query reads just 116 of 499 row groups and drops to 0.6 s; leave it shuffled and the bbox stats overlap everything and prune nothing. The extra speedup is real but it lives in how the file was written, not in the query (see the GeoParquet cookbook).

bbox for speed, polygon for correctness. The literal bbox alone is the fastest thing here, but it is a rectangle: Canada's box also keeps a slab of the northern US, ~61% more points than the true polygon (2.85M vs 1.77M). Keep both predicates, the bbox to cut the search down to a few row groups, the polygon to make the clip exact.

When the data is indexed, the fork flips with size. Against a real flood dataset of ~14.9M polygons stored as a .ddb with an R-tree, the literal polygon becomes a constant-geometry predicate, the one shape that fires the index (§2), so the plan is an RTREE_INDEX_SCAN. But that scan is single-threaded and its cost grows with the clip size, while the SPATIAL_JOIN is parallel and roughly flat. So inlining wins a small clip and loses a large one:

14.9M polygons → one US state (.ddb + R-tree): Montana: join ~150 s vs inlined ~35 s · Texas: join ~160 s vs inlined ~394 s (inlined count matches the join exactly. Small clip → inline the geometry; continent-scale clip → let the parallel join run.)

Translation cheat-sheet

PostGISDuckDBWatch out
GiST index, used in every shape automaticallyRTREE index, used only for a constant-geometry predicateBuild it explicitly; it is ignored in joins
Nested Loop + Index Scan (few outer rows)loop constant-geometry RTREE_INDEX_SCANno planner-chosen index nested loop
Nested Loop / hash for big joinsSPATIAL_JOIN (streams the left, R-tree on the right)put the compact side on the right; it must fit in memory (non-spilling: a large side OOM-kills)
bbox prefilter applied automatically (shown as && in EXPLAIN)implied inside ST_Intersectsneither engine makes you write && by hand
ST_DWithin(geom, g, m) (indexed anywhere; ::geography for metres)ST_DWithin only in a join, only reprojected to a metric CRSsingle-row ST_DWithin = SEQ_SCAN; use ST_Intersects(ST_Buffer(...))
index filter + exact refine in WHEREnone (forces BLOCKWISE_NL_JOIN)one spatial predicate per join, total
geom <-> pt ORDER BY … LIMIT k (indexed KNN)bound with ST_Buffer, then ORDER BY ST_Distanceno <->; LIMIT doesn't short-circuit (#20113)
clip to a region: JOIN region ON ST_Contains(region, pt)resolve the region once, inline a literal bbox + ST_GeomFromHEXWKB(…) polygon (§7)a runtime bbox never prunes; literal bounds prune only if the file is spatially sorted; bbox loose, polygon exact

Pitfalls: the maturity gap

Sources

Plans captured with EXPLAIN (ANALYZE) on PostgreSQL 15 / PostGIS 3.3 and DuckDB 1.4.x, both against the same FEMA flood data. Results identical on both engines; only the plans differ.

Next

To produce files these queries prune cheaply, see the GeoParquet Writing cookbook and the Reading cookbook. The index covers shared setup and publishing; the background is on the Cloud-Native Geospatial page.