geomermaids · GeoParquet Reading Cookbook
Producing a good GeoParquet file is now mostly a matter of setting the right flags (see the GeoParquet Writing Cookbook). Reading the file back over a network, at interactive speed, under real load, from an arbitrary client, is where most cloud-native implementations actually stumble. This page covers the client-side half in three parts: setup, non-spatial queries, spatial queries.
Why reading is the hard half
The generation side has converged on a small set of best practices. The reading side has a much wider surface: HTTP semantics, statistics-aware predicate pushdown, client memory behaviour, CORS, CDN caching, range-request support. Any one of these can turn a 400 MB file that should yield a 1 MB query into a full download.
1. Getting started with DuckDB
DuckDB is the pragmatic query engine for cloud-native geospatial: one binary, no
server, spatial extension, native Parquet reader, HTTP + S3 support in
httpfs. The only setup worth doing properly is the init file — do it
once and every session is ready to hit a remote catalog.
The init file: ~/.duckdbrc
The duckdb CLI runs ~/.duckdbrc at startup if the file
exists. Stick a comprehensive session setup in there once and every invocation starts
with extensions loaded, HTTP tuned, and the geoparquet catalog's S3 endpoint
pre-configured. The same statements work inside a DuckDB script, a Python notebook,
or a DuckDB-WASM session — .duckdbrc is just the CLI convenience wrapper.
A recommended ~/.duckdbrc for cloud-native geospatial work:
-- ~/.duckdbrc -- runs at startup of every `duckdb` CLI session
---------- 1. Extensions ----------
INSTALL spatial; LOAD spatial; -- GEOMETRY type, ST_* functions, GeoParquet read
INSTALL httpfs; LOAD httpfs; -- HTTPS + S3 remote file support
INSTALL postgres; LOAD postgres; -- ATTACH postgres:// for PostGIS bridges
-- INSTALL h3 FROM community; LOAD h3; -- optional: H3 cell functions
---------- 2. Anonymous S3 endpoint for geoparquet.geomermaids.com ----------
-- Needed for glob-based catalog queries. Single-file HTTPS reads work without it.
SET s3_endpoint = 's3.geomermaids.com';
SET s3_url_style = 'path';
SET s3_use_ssl = true;
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
---------- 3. HTTP tuning ----------
SET http_keep_alive = true; -- reuse TCP connections across requests
SET http_timeout = 60; -- seconds; default 30 is tight for large reads
SET http_retries = 5; -- default 3; cold object storage can stutter
SET http_retry_wait_ms = 500; -- default 100; back off a bit harder
SET http_retry_backoff = 4; -- default; exponential backoff factor
SET enable_http_metadata_cache = true; -- reuse HEAD / footer results across queries
SET httpfs_connection_caching = true; -- reuse HTTP connection handles
---------- 4. Memory / parallelism ----------
-- DuckDB picks sensible defaults from the host. Override only if you know why.
-- SET memory_limit = '8GB';
-- SET threads = 8;
---------- 5. Quality of life ----------
-- Always show real time for each query.
-- Dot-commands don't accept inline comments, so keep the note on its own line.
.timer on
Verify it loaded by opening a fresh duckdb session and running
SELECT current_setting('s3_endpoint'); — it should print
s3.geomermaids.com.
Note: .duckdbrc is a CLI-only convention. DuckDB
embedded in Python, Node, or WASM does not auto-load it; run the same
SET / INSTALL / LOAD statements explicitly at
connection time.
How DuckDB reads a remote file
What DuckDB actually does when you call
read_parquet('https://…'):
- HEAD request to the URL. Confirms the server supports range
requests (
Accept-Ranges: bytes) and learns the content length. - Ranged GET for the Parquet footer (last few KB of the file). This holds the schema and the list of row groups with per-column min/max statistics.
- Statistics-driven row-group pruning. DuckDB combines your
WHEREclause with the footer stats and decides which row groups can be skipped entirely. - Ranged GETs for the surviving column chunks. Multiple columns × the remaining row groups, fired in parallel over a few keep-alive TCP connections.
- Decompress, materialize, return.
For a file where the WHERE clause lines up with Parquet statistics, a
400 MB file commonly yields a query that moves a few MB across the wire. When
statistics don't help, you pay for the full table scan.
This whole dance depends on HTTP range requests. If the server does not advertise
Accept-Ranges: bytes, or a client sits between you and the data that
strips the header (some corporate proxies, some CDN edge configurations), DuckDB
falls back to a full download and the rest of this page is irrelevant. Verify with
curl -I <url> before anything else.
2. Common queries: when statistics do the work
For non-spatial WHERE clauses, DuckDB's predicate pushdown is
straightforward: a comparison on a stored column is matched against the per-row-group
min/max in the footer, and groups that can't satisfy the predicate are skipped before
any data page is fetched.
Rule of thumb: statistics-based pruning works when the WHERE clause is a
direct comparison on a stored column. Anything that wraps the column
in a function, a pattern match without a fixed prefix, or a regex defeats the pruning
and falls back to a full scan.
What engages pushdown
| Query fragment | Pushed down? | Why |
|---|---|---|
WHERE state = 'MA' | yes | Direct column comparison, per-row-group min/max handles it. |
WHERE state IN ('MA','NH','VT') | yes | Lowered to an OR of equalities. |
WHERE state > 'L' | yes | Range comparison against min/max. |
WHERE state LIKE 'M%' | partial | Prefix patterns use min/max; mid-string wildcards do not. |
WHERE LOWER(state) = 'ma' | no | Function applied to the column; stats are for the raw column only. |
WHERE name LIKE '%street%' | no | Substring search; nothing to prune on. |
The practical consequence: do your column projection and normalization at write time, not query time. Keep the columns clients will filter on in their raw form so statistics can do their job. If you need case-insensitive search, store a lowercase copy of the column.
Pushed down vs not: measured
EXPLAIN ANALYZE is the fastest way to see the difference. Two queries
against the same 413 MB OSM buildings file for New York State, same 770-row result,
only the predicate shape changes.
-- Query A: predicate engages column statistics
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001';
-- 770 rows, 0.41s cold. Reads only the row groups whose addr_postcode range overlaps '10001'. -- Query B: same result, predicate defeats pushdown
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE LOWER(addr_postcode) = '10001';
-- 770 rows, 1.15s cold. Full table scan; LOWER() makes the column stats useless. A ~2.5× slowdown on the same file, same answer, same client. At catalog scale (hundreds of files per query) that factor compounds into unusable.
Tooling
EXPLAIN ANALYZE is a good starting point (tree shows which row groups
were read and per-operator times), but for remote reads the more interesting tool is
PRAGMA
enable_profiling. Unlike EXPLAIN ANALYZE, its output includes
HTTP-level counters: bytes transferred in and out, and a count of HEAD / GET / PUT /
POST / DELETE calls. On a network-bound query that's the number you actually want.
PRAGMA enable_profiling = 'query_tree';
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001'; The profile output above the query result includes a block like:
HTTPFS HTTP Stats
in: 263.5 KiB
out: 0 bytes
#HEAD: 1
#GET: 3
Total Time: 0.612s
263.5 KiB transferred to answer a query against a 413 MB file is 0.06%
of the total — that is what successful row-group pruning looks like on the wire. When a
query should be fast and isn't, this is the first number to check; a multi-MB
in: on a simple predicate means pushdown did not engage.
Other pragmas worth knowing:
PRAGMA enable_profiling = 'json'+PRAGMA profiling_output = 'out.json': structured output for scripts or long-running benchmarks.PRAGMA disable_profiling;: turn it off again.
And a few non-DuckDB checks worth having in the pocket:
curl -I <url>: confirmsAccept-Ranges: bytesis present. Without it, DuckDB cannot do range requests and falls back to a full download — your query shape no longer matters.- Browser DevTools Network tab (when using DuckDB-WASM): every ranged GET shows up as a separate request with its byte range and size. Fastest way to confirm that row-group pruning is actually happening in the wild.
- A cold first query is always slower than the second. Always benchmark at least the second run; the first pays the HEAD + footer fetch and any CDN cache warming.
3. Spatial queries: tricks and tips
Spatial queries follow a different cost model from column predicates. The "push the predicate down to statistics" idea still applies, but the statistics can only bound where geometries are, not their exact shape. Every spatial query is therefore a two-stage filter: cheap row-group pruning by bbox, then expensive geometry-by-geometry evaluation within the surviving groups.
The two-stage model
Neither GeoParquet 1.1 (bbox column) nor 2.0 (Parquet-native geometry stats) embeds a per-row spatial index inside a row group. What they provide is per-row-group min/max coordinates, which drives two stages:
- Row-group pruning (cheap, stats-based): skip any row group whose bbox cannot overlap the query geometry. No data pages fetched for pruned groups.
- Per-row evaluation (expensive, WKB parse + predicate test):
for every surviving row group, decode each geometry and run the actual
ST_Intersects/ST_Contains/ST_Distancepredicate row-by-row.
Spatial queries are linear in the number of features that survive the bbox filter, not in the total file size. Good row group sizing and Hilbert sorting on the write side matter because they tighten the bbox of each group, so more groups are prunable before the engine reaches the expensive stage. See row group sizing and spatial sorting in the GeoParquet cookbook for the write-time knobs.
Verify your file carries per-row-group geometry stats
On a properly-written GeoParquet 2.0 file, each row group carries a geo_bbox
struct (xmin, ymin, xmax, ymax, plus optional zmin/zmax/mmin/mmax) alongside the
geometry column's data pages. DuckDB exposes these via the parquet_metadata
function:
SELECT row_group_id, geo_bbox, geo_types
FROM parquet_metadata('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE path_in_schema = 'geometry'
ORDER BY row_group_id
LIMIT 5; What you want to see is a distinct, tight bbox per row group, not NULLs. The parquetry catalog's NY buildings file returns exactly that, Hilbert-clustered (adjacent row groups cover adjacent geographic areas):
row_group_id │ geo_bbox
─────────────┼────────────────────────────────────────────────────────────────────
0 │ xmin=-77.78, xmax=-76.80, ymin=41.999, ymax=42.19 [multipolygon]
1 │ xmin=-77.04, xmax=-75.81, ymin=41.998, ymax=42.47 [multipolygon]
2 │ xmin=-76.58, xmax=-75.81, ymin=42.19, ymax=42.75 [multipolygon]
3 │ xmin=-77.29, xmax=-76.30, ymin=42.47, ymax=42.75 [multipolygon]
Tight row-group bboxes mean the file is in principle ready for per-row-group
spatial pruning. If your own files show NULL geo_bbox everywhere, the
writer didn't emit geometry stats: rewrite with best-practice defaults (see the
GeoParquet cookbook).
Which version, which syntax?
GeoParquet 1.1 and 2.0 store the per-row-group bbox in different places, and DuckDB pushes predicates down through different code paths. The SQL you write has to match the file; use the wrong syntax and pushdown silently disengages (1.1 → full scan) or the query errors out on a missing column (2.0).
| Aspect | GeoParquet 1.1 | GeoParquet 2.0 |
|---|---|---|
| Where the bbox lives | Separate bbox STRUCT column, fields xmin /
ymin / xmax / ymax | Parquet-native geo_bbox statistics on the Geometry logical type |
| Row-group pruning filter | WHERE bbox.xmin < qxmax AND bbox.xmax > qxmin AND bbox.ymin < qymax AND bbox.ymax > qymin | WHERE st_intersects_extent(geometry, envelope) |
| Why it pushes down | Scalar comparisons on struct sub-fields → standard Parquet min/max stats | DuckDB planner recognises the function and reads
geo_bbox from the column chunk metadata directly |
| Precise predicate on survivors | AND ST_Intersects(geometry, envelope) | |
| Penalty for wrong syntax | st_intersects_extent still runs but finds no
geo_bbox stats to consume → full scan | bbox.xmin column does not exist → query errors out |
Detecting which version you have:
SELECT decode(value) AS geo_metadata
FROM parquet_kv_metadata('your-file.parquet')
WHERE decode(key) = 'geo';
-- The returned JSON has a "version" key: "1.1.0" or "2.0.0". Why it matters: the single most common cause of "my spatial query
is slow" on a cloud-native catalog is mismatching the predicate style to the file's
version. Spend one query on parquet_kv_metadata before you write the
rest.
GeoParquet 1.1: bbox struct filter
On a V1.1 file the spatial pre-filter is just scalar comparisons on the
bbox STRUCT sub-fields. DuckDB pushes them down to the per-row-group
min/max stats like any other scalar predicate; only the surviving row groups reach
the precise ST_Intersects check.
SELECT COUNT(*)
FROM read_parquet('your_v1_1_file.parquet')
WHERE bbox.xmin < -71.9 AND bbox.xmax > -72.1 -- cheap: column stats prune row groups
AND bbox.ymin < 41.2 AND bbox.ymax > 41.0
AND ST_Intersects(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2)); -- on survivors Measured on a 224 MB / 5 M-row file with 98 row groups, native V2 geometry plus an explicit bbox STRUCT column (the same layout V1.1 mandates for the covering bbox):
| Query | Time | Result |
|---|---|---|
Plain ST_Intersects(geometry, envelope) | ~445 ms | 5,510 rows |
Same, with bbox.xmin/xmax/ymin/ymax filter first | ~6 ms | 5,510 rows |
~74× speedup, identical result. The plain
ST_Intersects form parses every WKB in every row group; the
bbox-filtered form prunes the vast majority at the row-group level via standard
Parquet min/max stats and only parses the survivors.
This pattern is the workhorse for files written by GDAL with
WRITE_COVERING_BBOX=YES, by Overture Maps, and by any tool emitting
the V1.1 spec faithfully. On V2 native files (no bbox STRUCT column),
use the recipe below instead.
GeoParquet 2.0: st_intersects_extent, not ST_Intersects
This is the single most important trick on the page. DuckDB Spatial ships two
spatial-intersection functions, and only one of them engages row-group pruning against
V2 geo_bbox stats:
ST_Intersects(geometry, env)— precise predicate. Parses every WKB in every row group and tests true intersection. Not pushed down to row-group stats today.st_intersects_extent(geometry, env)— bbox-bbox comparison only. Cheap per-row, and the DuckDB planner can push it down to the Parquetgeo_bboxstats to skip whole row groups.
Measured on a local 5 M-row GeoParquet 2.0 file with native geometry (98 row groups,
no bbox struct column — just the standard Parquet geometry logical type
with per-row-group geo_bbox statistics as the spec intends):
| Query | Time | Behavior |
|---|---|---|
WHERE ST_Intersects(geometry, envelope) (selective) | ~445 ms | Scans every row group, parses every WKB. |
WHERE st_intersects_extent(geometry, envelope) (selective) | ~48 ms | Planner uses geo_bbox stats to prune row groups; only survivors are decoded. |
WHERE st_intersects_extent(geometry, far_envelope) (0 results) | 0.000 s | All 98 row groups pruned at the statistics level. Metadata-only query. |
The out-of-range row (0.000 s) is the conclusive proof that pushdown is happening — DuckDB didn't decode a single data page. Same engine, same file, same envelope shape, just a different function call.
The two-stage pattern: extent first, precise check on survivors
st_intersects_extent only compares bounding boxes. If you need a precise
predicate (true polygon intersection, distance, containment), combine both — the
cheap one prunes row groups, the precise one runs only on the survivors:
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE st_intersects_extent(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2)) -- cheap, pushed down
AND ST_Intersects(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2)); -- precise, on survivors
Measured on the same 5 M-row local file, this combined form runs in
~48 ms: essentially the same as st_intersects_extent alone,
because the precise ST_Intersects only touches the few hundred features
that survived the extent filter.
In practice, for a rectangular query envelope, st_intersects_extent alone
is already precise enough: a feature whose bbox intersects a rectangle really does
intersect the rectangle. The two-stage form matters when the query geometry is a
complex polygon or a buffered point, where bbox overlap doesn't imply geometry
intersection.
Why plain ST_Intersects is slow on remote V2 files
To close the loop on the measured numbers from earlier in this section: the same 413 MB NY buildings file, two cold spatial queries with different function choices.
| Query | Bytes transferred | HTTP GETs | Cold time |
|---|---|---|---|
WHERE ST_Intersects(geometry, envelope) | 344.3 MiB | 88 | ~65 s |
WHERE st_intersects_extent(geometry, far_envelope) (out of range) | 256 KiB | 1 | 0.5 s |
Two spatial queries, same file, same client — the difference between
"scan every byte" and "look at the metadata, pick nothing" is the
function call. On selective (non-out-of-range) queries, st_intersects_extent
also prunes, just partially; actual numbers depend on how many row groups overlap the
query envelope and on the writer's spatial clustering.
Rule: on V2 files with native geometry, reach for
st_intersects_extent first. Pair it with ST_Intersects only
when bbox-level overlap is not precise enough for your predicate.
Distance queries: pre-filter by expanded extent
ST_Distance cannot be pushed down directly. The practical pattern is to
bound the search region with an envelope expanded by the radius, prune row groups
with st_intersects_extent, and only then compute the precise distance
on survivors.
Two gotchas on geographic (lon/lat) data: ST_Distance returns the
Euclidean distance in coordinate-system units — i.e. degrees on CRS84 —
not meters. And ST_Distance_Sphere (the meters-returning variant) only
accepts POINTs, so polygons need to be summarised to a centroid first.
-- Buildings within ~1 km of a point in Manhattan.
-- Column is MultiPolygon, so we centroid for the meter-returning distance.
SET geometry_always_xy = true; -- silence the CRS axis-order warning
SELECT osm_id, name,
ST_Distance_Sphere(ST_Centroid(geometry), ST_Point(-73.9, 40.7)) AS d_m
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE st_intersects_extent( -- cheap, pushed down to geo_bbox
geometry,
ST_MakeEnvelope(-73.913, 40.691, -73.887, 40.709) -- ±0.013°/±0.009° ≈ ±1 km at lat 40.7
)
AND ST_Distance_Sphere(ST_Centroid(geometry), ST_Point(-73.9, 40.7)) < 1000
ORDER BY d_m
LIMIT 20;
For production, compute the envelope programmatically from the radius and the
latitude rather than hardcoding degrees. On projected data (meters-native CRS), the
math is linear and ST_Distance already returns meters.
What defeats spatial pushdown
ST_Transform(geometry, 3857)in theWHERE. Transforming the stored geometry column at query time makes every row group's stats useless, same issue asLOWER()on a string column. Transform the query geometry once in a CTE instead, and keep the stored column in its native CRS.ST_Buffer(geometry, 1000)at query time. Same problem: wrapping the stored column in a function defeats stats-based pruning. Either pre-buffer at write time or use the distance + bbox trick above.- Large query geometries that span many row groups. If your query bbox covers half the file's extent, pruning can't help much. This is a write-time problem (row groups too large or features not spatially sorted), not a query-time one.
- Mixing geometry types on GeoArrow-encoded files. A polygon query against a points-only column won't match anything. Obvious but easy to forget across a partitioned catalog.
Next
Generating files with the right properties for fast reads is covered in the GeoParquet cookbook. The conceptual context for why this all matters is on the Cloud-Native Geospatial page. If you're stuck on a specific slow query or want a second pair of eyes on a reading pipeline, get in touch.