geomermaids · GeoParquet Reading Cookbook

Producing a good GeoParquet file is now mostly a matter of setting the right flags (see the GeoParquet Writing Cookbook). Reading the file back over a network, at interactive speed, under real load, from an arbitrary client, is where most cloud-native implementations actually stumble. This page covers the client-side half in three parts: setup, non-spatial queries, spatial queries.

Why reading is the hard half

The generation side has converged on a small set of best practices. The reading side has a much wider surface: HTTP semantics, statistics-aware predicate pushdown, client memory behaviour, CORS, CDN caching, range-request support. Any one of these can turn a 400 MB file that should yield a 1 MB query into a full download.


1. Getting started with DuckDB

DuckDB is the pragmatic query engine for cloud-native geospatial: one binary, no server, spatial extension, native Parquet reader, HTTP + S3 support in httpfs. The only setup worth doing properly is the init file — do it once and every session is ready to hit a remote catalog.

The init file: ~/.duckdbrc

The duckdb CLI runs ~/.duckdbrc at startup if the file exists. Stick a comprehensive session setup in there once and every invocation starts with extensions loaded, HTTP tuned, and the geoparquet catalog's S3 endpoint pre-configured. The same statements work inside a DuckDB script, a Python notebook, or a DuckDB-WASM session — .duckdbrc is just the CLI convenience wrapper.

A recommended ~/.duckdbrc for cloud-native geospatial work:

-- ~/.duckdbrc -- runs at startup of every `duckdb` CLI session

---------- 1. Extensions ----------
INSTALL spatial;   LOAD spatial;    -- GEOMETRY type, ST_* functions, GeoParquet read
INSTALL httpfs;    LOAD httpfs;     -- HTTPS + S3 remote file support
INSTALL postgres;  LOAD postgres;   -- ATTACH postgres:// for PostGIS bridges
-- INSTALL h3 FROM community; LOAD h3;  -- optional: H3 cell functions

---------- 2. Anonymous S3 endpoint for geoparquet.geomermaids.com ----------
-- Needed for glob-based catalog queries. Single-file HTTPS reads work without it.
SET s3_endpoint            = 's3.geomermaids.com';
SET s3_url_style           = 'path';
SET s3_use_ssl             = true;
SET s3_access_key_id       = '';
SET s3_secret_access_key   = '';

---------- 3. HTTP tuning ----------
SET http_keep_alive             = true;    -- reuse TCP connections across requests
SET http_timeout                = 60;      -- seconds; default 30 is tight for large reads
SET http_retries                = 5;       -- default 3; cold object storage can stutter
SET http_retry_wait_ms          = 500;     -- default 100; back off a bit harder
SET http_retry_backoff          = 4;       -- default; exponential backoff factor
SET enable_http_metadata_cache  = true;    -- reuse HEAD / footer results across queries
SET httpfs_connection_caching   = true;    -- reuse HTTP connection handles

---------- 4. Memory / parallelism ----------
-- DuckDB picks sensible defaults from the host. Override only if you know why.
-- SET memory_limit = '8GB';
-- SET threads      = 8;

---------- 5. Quality of life ----------
-- Always show real time for each query.
-- Dot-commands don't accept inline comments, so keep the note on its own line.
.timer on

Verify it loaded by opening a fresh duckdb session and running SELECT current_setting('s3_endpoint'); — it should print s3.geomermaids.com.

Note: .duckdbrc is a CLI-only convention. DuckDB embedded in Python, Node, or WASM does not auto-load it; run the same SET / INSTALL / LOAD statements explicitly at connection time.

How DuckDB reads a remote file

What DuckDB actually does when you call read_parquet('https://…'):

  1. HEAD request to the URL. Confirms the server supports range requests (Accept-Ranges: bytes) and learns the content length.
  2. Ranged GET for the Parquet footer (last few KB of the file). This holds the schema and the list of row groups with per-column min/max statistics.
  3. Statistics-driven row-group pruning. DuckDB combines your WHERE clause with the footer stats and decides which row groups can be skipped entirely.
  4. Ranged GETs for the surviving column chunks. Multiple columns × the remaining row groups, fired in parallel over a few keep-alive TCP connections.
  5. Decompress, materialize, return.

For a file where the WHERE clause lines up with Parquet statistics, a 400 MB file commonly yields a query that moves a few MB across the wire. When statistics don't help, you pay for the full table scan.

This whole dance depends on HTTP range requests. If the server does not advertise Accept-Ranges: bytes, or a client sits between you and the data that strips the header (some corporate proxies, some CDN edge configurations), DuckDB falls back to a full download and the rest of this page is irrelevant. Verify with curl -I <url> before anything else.


2. Common queries: when statistics do the work

For non-spatial WHERE clauses, DuckDB's predicate pushdown is straightforward: a comparison on a stored column is matched against the per-row-group min/max in the footer, and groups that can't satisfy the predicate are skipped before any data page is fetched.

Rule of thumb: statistics-based pruning works when the WHERE clause is a direct comparison on a stored column. Anything that wraps the column in a function, a pattern match without a fixed prefix, or a regex defeats the pruning and falls back to a full scan.

What engages pushdown

Query fragmentPushed down?Why
WHERE state = 'MA'yesDirect column comparison, per-row-group min/max handles it.
WHERE state IN ('MA','NH','VT')yesLowered to an OR of equalities.
WHERE state > 'L'yesRange comparison against min/max.
WHERE state LIKE 'M%'partialPrefix patterns use min/max; mid-string wildcards do not.
WHERE LOWER(state) = 'ma'noFunction applied to the column; stats are for the raw column only.
WHERE name LIKE '%street%'noSubstring search; nothing to prune on.

The practical consequence: do your column projection and normalization at write time, not query time. Keep the columns clients will filter on in their raw form so statistics can do their job. If you need case-insensitive search, store a lowercase copy of the column.

Pushed down vs not: measured

EXPLAIN ANALYZE is the fastest way to see the difference. Two queries against the same 413 MB OSM buildings file for New York State, same 770-row result, only the predicate shape changes.

-- Query A: predicate engages column statistics
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001';
-- 770 rows, 0.41s cold.  Reads only the row groups whose addr_postcode range overlaps '10001'.
-- Query B: same result, predicate defeats pushdown
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE LOWER(addr_postcode) = '10001';
-- 770 rows, 1.15s cold.  Full table scan; LOWER() makes the column stats useless.

A ~2.5× slowdown on the same file, same answer, same client. At catalog scale (hundreds of files per query) that factor compounds into unusable.

Tooling

EXPLAIN ANALYZE is a good starting point (tree shows which row groups were read and per-operator times), but for remote reads the more interesting tool is PRAGMA enable_profiling. Unlike EXPLAIN ANALYZE, its output includes HTTP-level counters: bytes transferred in and out, and a count of HEAD / GET / PUT / POST / DELETE calls. On a network-bound query that's the number you actually want.

PRAGMA enable_profiling = 'query_tree';
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001';

The profile output above the query result includes a block like:

HTTPFS HTTP Stats
     in:  263.5 KiB
     out: 0 bytes
     #HEAD: 1
     #GET:  3

Total Time: 0.612s

263.5 KiB transferred to answer a query against a 413 MB file is 0.06% of the total — that is what successful row-group pruning looks like on the wire. When a query should be fast and isn't, this is the first number to check; a multi-MB in: on a simple predicate means pushdown did not engage.

Other pragmas worth knowing:

And a few non-DuckDB checks worth having in the pocket:


3. Spatial queries: tricks and tips

Spatial queries follow a different cost model from column predicates. The "push the predicate down to statistics" idea still applies, but the statistics can only bound where geometries are, not their exact shape. Every spatial query is therefore a two-stage filter: cheap row-group pruning by bbox, then expensive geometry-by-geometry evaluation within the surviving groups.

The two-stage model

Neither GeoParquet 1.1 (bbox column) nor 2.0 (Parquet-native geometry stats) embeds a per-row spatial index inside a row group. What they provide is per-row-group min/max coordinates, which drives two stages:

  1. Row-group pruning (cheap, stats-based): skip any row group whose bbox cannot overlap the query geometry. No data pages fetched for pruned groups.
  2. Per-row evaluation (expensive, WKB parse + predicate test): for every surviving row group, decode each geometry and run the actual ST_Intersects / ST_Contains / ST_Distance predicate row-by-row.

Spatial queries are linear in the number of features that survive the bbox filter, not in the total file size. Good row group sizing and Hilbert sorting on the write side matter because they tighten the bbox of each group, so more groups are prunable before the engine reaches the expensive stage. See row group sizing and spatial sorting in the GeoParquet cookbook for the write-time knobs.

Verify your file carries per-row-group geometry stats

On a properly-written GeoParquet 2.0 file, each row group carries a geo_bbox struct (xmin, ymin, xmax, ymax, plus optional zmin/zmax/mmin/mmax) alongside the geometry column's data pages. DuckDB exposes these via the parquet_metadata function:

SELECT row_group_id, geo_bbox, geo_types
FROM parquet_metadata('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE path_in_schema = 'geometry'
ORDER BY row_group_id
LIMIT 5;

What you want to see is a distinct, tight bbox per row group, not NULLs. The parquetry catalog's NY buildings file returns exactly that, Hilbert-clustered (adjacent row groups cover adjacent geographic areas):

row_group_id │ geo_bbox
─────────────┼────────────────────────────────────────────────────────────────────
           0 │ xmin=-77.78, xmax=-76.80, ymin=41.999, ymax=42.19   [multipolygon]
           1 │ xmin=-77.04, xmax=-75.81, ymin=41.998, ymax=42.47   [multipolygon]
           2 │ xmin=-76.58, xmax=-75.81, ymin=42.19,  ymax=42.75   [multipolygon]
           3 │ xmin=-77.29, xmax=-76.30, ymin=42.47,  ymax=42.75   [multipolygon]

Tight row-group bboxes mean the file is in principle ready for per-row-group spatial pruning. If your own files show NULL geo_bbox everywhere, the writer didn't emit geometry stats: rewrite with best-practice defaults (see the GeoParquet cookbook).

Which version, which syntax?

GeoParquet 1.1 and 2.0 store the per-row-group bbox in different places, and DuckDB pushes predicates down through different code paths. The SQL you write has to match the file; use the wrong syntax and pushdown silently disengages (1.1 → full scan) or the query errors out on a missing column (2.0).

AspectGeoParquet 1.1GeoParquet 2.0
Where the bbox lives Separate bbox STRUCT column, fields xmin / ymin / xmax / ymax Parquet-native geo_bbox statistics on the Geometry logical type
Row-group pruning filter WHERE bbox.xmin < qxmax AND bbox.xmax > qxmin AND bbox.ymin < qymax AND bbox.ymax > qymin WHERE st_intersects_extent(geometry, envelope)
Why it pushes down Scalar comparisons on struct sub-fields → standard Parquet min/max stats DuckDB planner recognises the function and reads geo_bbox from the column chunk metadata directly
Precise predicate on survivors AND ST_Intersects(geometry, envelope)
Penalty for wrong syntax st_intersects_extent still runs but finds no geo_bbox stats to consume → full scan bbox.xmin column does not exist → query errors out

Detecting which version you have:

SELECT decode(value) AS geo_metadata
FROM parquet_kv_metadata('your-file.parquet')
WHERE decode(key) = 'geo';
-- The returned JSON has a "version" key: "1.1.0" or "2.0.0".

Why it matters: the single most common cause of "my spatial query is slow" on a cloud-native catalog is mismatching the predicate style to the file's version. Spend one query on parquet_kv_metadata before you write the rest.

GeoParquet 1.1: bbox struct filter

On a V1.1 file the spatial pre-filter is just scalar comparisons on the bbox STRUCT sub-fields. DuckDB pushes them down to the per-row-group min/max stats like any other scalar predicate; only the surviving row groups reach the precise ST_Intersects check.

SELECT COUNT(*)
FROM read_parquet('your_v1_1_file.parquet')
WHERE bbox.xmin < -71.9 AND bbox.xmax > -72.1   -- cheap: column stats prune row groups
  AND bbox.ymin <  41.2 AND bbox.ymax >  41.0
  AND ST_Intersects(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2));  -- on survivors

Measured on a 224 MB / 5 M-row file with 98 row groups, native V2 geometry plus an explicit bbox STRUCT column (the same layout V1.1 mandates for the covering bbox):

QueryTimeResult
Plain ST_Intersects(geometry, envelope) ~445 ms 5,510 rows
Same, with bbox.xmin/xmax/ymin/ymax filter first ~6 ms 5,510 rows

~74× speedup, identical result. The plain ST_Intersects form parses every WKB in every row group; the bbox-filtered form prunes the vast majority at the row-group level via standard Parquet min/max stats and only parses the survivors.

This pattern is the workhorse for files written by GDAL with WRITE_COVERING_BBOX=YES, by Overture Maps, and by any tool emitting the V1.1 spec faithfully. On V2 native files (no bbox STRUCT column), use the recipe below instead.

GeoParquet 2.0: st_intersects_extent, not ST_Intersects

This is the single most important trick on the page. DuckDB Spatial ships two spatial-intersection functions, and only one of them engages row-group pruning against V2 geo_bbox stats:

Measured on a local 5 M-row GeoParquet 2.0 file with native geometry (98 row groups, no bbox struct column — just the standard Parquet geometry logical type with per-row-group geo_bbox statistics as the spec intends):

QueryTimeBehavior
WHERE ST_Intersects(geometry, envelope) (selective) ~445 ms Scans every row group, parses every WKB.
WHERE st_intersects_extent(geometry, envelope) (selective) ~48 ms Planner uses geo_bbox stats to prune row groups; only survivors are decoded.
WHERE st_intersects_extent(geometry, far_envelope) (0 results) 0.000 s All 98 row groups pruned at the statistics level. Metadata-only query.

The out-of-range row (0.000 s) is the conclusive proof that pushdown is happening — DuckDB didn't decode a single data page. Same engine, same file, same envelope shape, just a different function call.

The two-stage pattern: extent first, precise check on survivors

st_intersects_extent only compares bounding boxes. If you need a precise predicate (true polygon intersection, distance, containment), combine both — the cheap one prunes row groups, the precise one runs only on the survivors:

SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE st_intersects_extent(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2))  -- cheap, pushed down
  AND ST_Intersects(geometry, ST_MakeEnvelope(-72.1, 41.0, -71.9, 41.2));         -- precise, on survivors

Measured on the same 5 M-row local file, this combined form runs in ~48 ms: essentially the same as st_intersects_extent alone, because the precise ST_Intersects only touches the few hundred features that survived the extent filter.

In practice, for a rectangular query envelope, st_intersects_extent alone is already precise enough: a feature whose bbox intersects a rectangle really does intersect the rectangle. The two-stage form matters when the query geometry is a complex polygon or a buffered point, where bbox overlap doesn't imply geometry intersection.

Why plain ST_Intersects is slow on remote V2 files

To close the loop on the measured numbers from earlier in this section: the same 413 MB NY buildings file, two cold spatial queries with different function choices.

QueryBytes transferredHTTP GETsCold time
WHERE ST_Intersects(geometry, envelope) 344.3 MiB88~65 s
WHERE st_intersects_extent(geometry, far_envelope) (out of range) 256 KiB10.5 s

Two spatial queries, same file, same client — the difference between "scan every byte" and "look at the metadata, pick nothing" is the function call. On selective (non-out-of-range) queries, st_intersects_extent also prunes, just partially; actual numbers depend on how many row groups overlap the query envelope and on the writer's spatial clustering.

Rule: on V2 files with native geometry, reach for st_intersects_extent first. Pair it with ST_Intersects only when bbox-level overlap is not precise enough for your predicate.

Distance queries: pre-filter by expanded extent

ST_Distance cannot be pushed down directly. The practical pattern is to bound the search region with an envelope expanded by the radius, prune row groups with st_intersects_extent, and only then compute the precise distance on survivors.

Two gotchas on geographic (lon/lat) data: ST_Distance returns the Euclidean distance in coordinate-system units — i.e. degrees on CRS84 — not meters. And ST_Distance_Sphere (the meters-returning variant) only accepts POINTs, so polygons need to be summarised to a centroid first.

-- Buildings within ~1 km of a point in Manhattan.
-- Column is MultiPolygon, so we centroid for the meter-returning distance.
SET geometry_always_xy = true;   -- silence the CRS axis-order warning

SELECT osm_id, name,
       ST_Distance_Sphere(ST_Centroid(geometry), ST_Point(-73.9, 40.7)) AS d_m
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE st_intersects_extent(                         -- cheap, pushed down to geo_bbox
        geometry,
        ST_MakeEnvelope(-73.913, 40.691, -73.887, 40.709)   -- ±0.013°/±0.009° ≈ ±1 km at lat 40.7
      )
  AND ST_Distance_Sphere(ST_Centroid(geometry), ST_Point(-73.9, 40.7)) < 1000
ORDER BY d_m
LIMIT 20;

For production, compute the envelope programmatically from the radius and the latitude rather than hardcoding degrees. On projected data (meters-native CRS), the math is linear and ST_Distance already returns meters.

What defeats spatial pushdown


Next

Generating files with the right properties for fast reads is covered in the GeoParquet cookbook. The conceptual context for why this all matters is on the Cloud-Native Geospatial page. If you're stuck on a specific slow query or want a second pair of eyes on a reading pipeline, get in touch.