geomermaids · GeoParquet Writing cookbook
GeoParquet is a great format, but getting it right can be tricky. The ecosystem is still young, and navigating versions, tools, and encoding options can feel like a maze. This page covers best practices, explains the different versions and encoding options, and provides concrete recipes for GDAL, DuckDB, and gpio. Please report any issues or missing pieces through the contact form so we can keep this page accurate and up-to-date, and help grow the GeoParquet community.
Tools and setup
Four tools cover most GeoParquet workflows. Choose the one you are most comfortable with, or the one already in your pipeline. You can also use the Apache Arrow Python bindings for low-level Parquet operations, but that is outside the scope of this cookbook.
GDAL
The default choice for existing pipelines built on GDAL / OGR. The Parquet driver
writes GeoParquet 1.1 by default from 3.9 onwards, reads any version via
/vsicurl/. Version cutoffs: 3.8 added 1.0 support, 3.9 added 1.1 (bbox +
CRS), 3.12 adds 2.0 via USE_PARQUET_GEO_TYPES=YES.
brew install gdal # macOS
sudo apt install gdal-bin # Ubuntu / Debian
gdalinfo --version # expect 3.9 or newer
DuckDB
The modern query engine: SQL directly against remote GeoParquet over HTTP, joins across URLs, predicate pushdown, and a growing spatial function library. Also the most ergonomic way to read PostGIS and write GeoParquet in one step.
brew install duckdb # macOS
curl https://install.duckdb.org | sh # Linux
duckdb --version # expect 1.5.2 or newer
# inside DuckDB, once per session:
INSTALL spatial; LOAD spatial;
INSTALL httpfs; LOAD httpfs;
gpio: Swiss-army knife
geoparquet-io is the Swiss-army knife for GeoParquet: convert, validate, inspect, partition, upgrade. Opinionated defaults (Hilbert sort, ZSTD, bbox column, sensible row groups). If you do not care about individual flags, this is the fastest path to a spec-compliant file. Recommended as the default unless you need a specific GDAL or DuckDB feature it does not expose.
uv tool install geoparquet-io # recommended
# or
pip install geoparquet-io
gpio --version
gpio convert --help
gpio describe --help
GeoPandas: use indirectly
The original Python GeoParquet implementation. It works, but its default is still
GeoParquet 1.0 (CRS metadata, but no bbox covering column), and it trails the spec by a cycle or
two. Do not use gdf.to_parquet() directly in production.
Instead, export to GeoPackage from GeoPandas, then run gpio convert on the
GPKG. You keep the GeoPandas ergonomics for analysis and get a spec-compliant,
opinionated output file:
# Python: write GeoPackage from your GeoDataFrame
import geopandas as gpd
gdf.to_file('scratch.gpkg', layer='data', driver='GPKG')
# Shell: convert to GeoParquet with sensible defaults
gpio convert scratch.gpkg out.parquet
Best practices: making queries fast
Writing a valid GeoParquet file is easy. Writing one that a query engine can filter cheaply takes five pieces that work together:
- Spatial sort. Features ordered by a space-filling curve (Hilbert) or at least by bbox, so row-group bboxes are tight and prune well on ST_Intersects queries.
- Right-sized row groups. 50 k to 100 k rows is the sweet spot for spatial queries.
- Bbox column (GeoParquet 1.1) or Parquet-native geometry statistics (2.0), so engines can prune row groups by bounding box before parsing any geometries. (Within surviving row groups every feature is still parsed and tested; neither version carries a per-row spatial index yet.) See versions and encoding.
- ZSTD compression. Better ratio than Parquet's default SNAPPY, with comparable decode speed. Wins on any network-bound read (so: every cloud-native scenario). Rarely enabled by a writer's default settings.
- Attribute or grid partitioning. Hive-style directories (state=MA/, or a coarse H3/S2 cell like h3_r3=83f5.../) so queries filtering on the partition key skip whole files before opening them.
The asymmetry between tools: gpio applies the first four by default (partitioning
stays opt-in via --partition-by), which is why it is the opinionated path. GDAL and
DuckDB need explicit opt-ins for most of them. GeoPandas leaves most of this to the caller.
| Best practice | gpio | GDAL (ogr2ogr) | DuckDB |
|---|---|---|---|
| Spatial sort | default (Hilbert) | -lco SORT_BY_BBOX=YES | manual ORDER BY ST_Hilbert(geom, bounds) |
| Row group size | data-driven | -lco ROW_GROUP_SIZE=100000 | ROW_GROUP_SIZE 100_000 |
| Bbox column | default | WRITE_COVERING_BBOX=AUTO (default in 3.9+) | manual columns, or GEOPARQUET_VERSION 'V2' |
| ZSTD compression | default | -lco COMPRESSION=ZSTD | COMPRESSION zstd |
| Attribute partitioning | --partition-by state | scripted loop (one call per value) | PARTITION_BY (state) |
Rule of thumb: unless you have a specific reason to customize or an existing pipeline, use
gpio. It bakes in the four best practices above and tracks the ecosystem
as it evolves. Reach for the GDAL or
DuckDB recipes when you need a flag gpio
does not expose, or when you are already deep in those toolchains.
Which version: 1.0, 1.1, or 2.0?
GeoParquet has three versions in the wild, and different tools write different defaults. Worth knowing which one you are actually producing, because clients downstream may or may not read them.
| Version | What it has | Status |
|---|---|---|
| 1.0 | Parquet with a WKB geometry column and a geo metadata key (including CRS). No bbox column. | Minimum viable. Readable by everything, but spatial queries are full scans. |
| 1.1 | WKB geometry + bbox column + CRS metadata + geo metadata key. | Official OGC spec. The safe default for publishing today. |
| 2.0 | Native Parquet geometry type (no WKB conversion on read), CRS, bbox. | Delivers the format's full potential: Parquet-native coordinate statistics, zero-copy reads into DuckDB and Arrow clients, no separate bbox column needed. Still no spatial index inside the file. |
Tool support and what each writes by default:
| Tool | 1.0 | 1.1 | 2.0 |
|---|---|---|---|
| GDAL 3.8 | default | — | — |
| GDAL 3.9+ | WRITE_COVERING_BBOX=NO | default | USE_PARQUET_GEO_TYPES=YES (≥3.12) |
| DuckDB (Parquet writer) | default | — | GEOPARQUET_VERSION 'V2' |
| gpio | — | default | --geoparquet-version 2.0 |
| GeoPandas | default | schema_version='1.1.0' | — |
Practical advice: write 2.0 whenever you control the
reader. It is the first version that delivers the format's full potential. Fall back to
1.1 for publication to a wide, mixed audience. 1.0 essentially
never, except for clients stuck on pre-2024 tooling.
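To check which version a file actually declares, look at the geo key in the Parquet file metadata. The sketch below only parses that JSON value; it assumes you have already pulled the string out of the footer (with PyArrow, for instance, pq.read_metadata(path).metadata[b'geo']). The example metadata is hypothetical, shaped like what a 1.1 writer emits, and summarize_geo_metadata is an illustrative helper name, not part of any library.

```python
import json


def summarize_geo_metadata(geo_json: str) -> dict:
    """Summarize the GeoParquet `geo` file-metadata value (a JSON string)."""
    geo = json.loads(geo_json)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "version": geo.get("version"),
        "primary_column": primary,
        "encoding": column.get("encoding"),
        # An absent "crs" field means the spec default (OGC:CRS84) applies.
        "declares_crs": "crs" in column,
        "has_covering_bbox": "bbox" in column.get("covering", {}),
    }


# Hypothetical metadata for illustration only.
example = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon"],
            "covering": {
                "bbox": {
                    "xmin": ["bbox", "xmin"],
                    "ymin": ["bbox", "ymin"],
                    "xmax": ["bbox", "xmax"],
                    "ymax": ["bbox", "ymax"],
                }
            },
        }
    },
})

summary = summarize_geo_metadata(example)
print(summary)
```

A file with version 1.1.0 but has_covering_bbox false is still valid; it just gives readers nothing to prune on.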
Encoding: how geometries are stored
The encoding field in GeoParquet metadata can point to three very different
storage strategies. Worth understanding because it determines read speed,
interoperability, and whether you need a separate bbox column.
- WKB (GeoParquet 1.0 and 1.1 default). Each geometry is a binary blob in a single BYTE_ARRAY column, identical to what ST_AsBinary returns in PostGIS. Universally portable, but opaque to Parquet: no row-group statistics on coordinates, which is why 1.1 adds a companion bbox column.
- Native GeoArrow (GeoParquet 1.1, opt-in via GEOMETRY_ENCODING=GEOARROW in GDAL 3.9+). Raw coordinates as columnar fields. A Point column becomes struct<x, y>. A MultiPolygon becomes list<list<list<struct<x, y>>>> (multipolygon → polygons → rings → points). Parquet computes min/max directly on x and y, so statistics work without a bbox column. Faster reads, but one geometry type per column: you cannot mix types.
- Parquet Geometry / Geography logical types (GeoParquet 2.0, libarrow 21+). A single geometry column, but Parquet itself knows it holds geometry and computes coordinate statistics natively. Zero-copy into Arrow clients, with the portability of a single WKB-like column. This is what GEOPARQUET_VERSION 'V2' in DuckDB and USE_PARQUET_GEO_TYPES=YES in GDAL 3.12+ produce.
Rule of thumb: reach for the Parquet Geometry logical types (2.0) when your toolchain supports libarrow 21+. Drop down to WKB (1.0 / 1.1) only for compatibility with older clients. Native GeoArrow is a middle ground: fast columnar reads, but locks each column to one geometry type.
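The difference between an opaque WKB blob and columnar coordinates is easy to see in a few lines of standard-library Python. This is a minimal sketch for 2D points only, with made-up coordinates; point_to_wkb is an illustrative helper, not a library function.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB."""
    # byte order flag (1 = little-endian), geometry type (1 = Point), x, y as float64
    return struct.pack("<BIdd", 1, 1, x, y)


# Made-up sample coordinates (lon, lat).
points = [(-71.06, 42.36), (-73.99, 40.73), (-122.42, 37.77)]

# WKB layout: one binary blob per row. A columnar engine sees only bytes.
wkb_column = [point_to_wkb(x, y) for x, y in points]

# Native layout: x and y are plain numeric columns, so min/max is trivial.
xs = [x for x, _ in points]
ys = [y for _, y in points]
native_stats = (min(xs), min(ys), max(xs), max(ys))

# Getting the same numbers out of WKB means decoding every blob first.
decoded = [struct.unpack("<BIdd", blob)[2:] for blob in wkb_column]

print(len(wkb_column[0]), "bytes per WKB point")
print("bbox from native columns:", native_stats)
```

That decode step is the work a bbox column or native coordinate statistics let a reader skip for whole row groups.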
Row group sizing
A Parquet file is a sequence of row groups, each carrying per-column
min/max statistics in its metadata. On a spatial predicate like
ST_Intersects, a query engine reads those statistics first and skips any
row group whose bbox does not overlap the query geometry. Only the remaining groups are
decompressed and scanned.
Because GeoParquet (any version) does not yet embed a per-row spatial index, the per-row-group bbox is the only cheap mechanism available during a spatial filter. It prunes whole row groups before any geometry is decoded. Within surviving row groups, the engine still parses every feature and runs the real predicate row-by-row — so row group size decides how fine-grained the cheap stage of pruning is.
- Too large (~1 M rows per group): the bbox for each group covers a wide area, so queries like "features near this point" end up decompressing most of the file anyway.
- Too small (under 10 k rows): metadata overhead dominates, compression ratio worsens, and the cost of reading the metadata itself starts to hurt.
- Sweet spot: 50 k to 100 k rows per group, combined with spatial sorting so each group actually contains spatially close features.
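The pruning mechanism can be simulated without any Parquet library. The sketch below uses a synthetic 100 × 100 grid of points and 1,000-row groups (both arbitrary choices for illustration), computes one bbox per row group, and counts how many groups a small query box would force a reader to scan, once for shuffled rows and once for rows sorted by coordinate.

```python
import random


def row_group_bboxes(points, rows_per_group):
    """Return one (xmin, ymin, xmax, ymax) box per consecutive chunk of rows."""
    boxes = []
    for start in range(0, len(points), rows_per_group):
        chunk = points[start:start + rows_per_group]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def overlaps(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def groups_to_scan(points, rows_per_group, query):
    """Count row groups whose bbox overlaps the query box."""
    boxes = row_group_bboxes(points, rows_per_group)
    return sum(overlaps(box, query) for box in boxes)


# Synthetic dataset: 10,000 points on an integer grid.
grid = [(x, y) for x in range(100) for y in range(100)]

shuffled = grid[:]
random.Random(0).shuffle(shuffled)   # stands in for insertion order
spatially_sorted = sorted(grid)      # simple (x, y) sort

query = (10, 10, 12, 12)             # small query window
shuffled_hits = groups_to_scan(shuffled, 1000, query)
sorted_hits = groups_to_scan(spatially_sorted, 1000, query)

print("row groups scanned, shuffled:", shuffled_hits)
print("row groups scanned, sorted:  ", sorted_hits)
```

Same data, same row group size: only the row order changes how many groups survive the bbox check.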
Writer defaults vary. Set the option explicitly if you care:
| Tool | Option | Default |
|---|---|---|
| GDAL | -lco ROW_GROUP_SIZE=100000 | 65 536 |
| DuckDB | ROW_GROUP_SIZE 100_000 in the COPY ... TO options list | ~122 880 |
| GeoPandas | row_group_size=100_000 | no default (follows PyArrow) |
| gpio | --row-group-size 100000 | data-driven |
Spatial sorting
Right-sized row groups only help if features are laid out so that each group's bbox is tight. Without a spatial sort, features in insertion order leave each row group's bbox spanning the entire dataset, and the row-group pruning from the previous section collapses.
Two practical options:
- Hilbert curve sort. Space-filling curve that maps 2D coordinates to a 1D index preserving spatial locality. The best known default for general spatial workloads.
- Bbox min-corner sort. Simpler: sort features by (xmin, ymin) of their bounding box. Not as locally coherent as Hilbert but close enough in practice, and much cheaper to compute. This is what GDAL's SORT_BY_BBOX=YES does.
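For intuition on the Hilbert option, here is a minimal pure-Python sketch of the classic grid-to-curve mapping. It is the textbook algorithm, not necessarily what gpio or DuckDB implement internally. The hilbert_key helper, the 16-bit grid resolution, and the sample coordinates are assumptions made for illustration; the bounds must have non-zero width and height.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Map grid cell (x, y) on a 2^order x 2^order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(x: float, y: float, bounds, order: int = 16) -> int:
    """Scale a coordinate into the grid defined by the dataset bounds, then index it."""
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1
    gx = int((x - xmin) / (xmax - xmin) * cells)
    gy = int((y - ymin) / (ymax - ymin) * cells)
    return hilbert_index(gx, gy, order)


# Made-up sample points (lon, lat); the bounds are the overall dataset extent.
points = [(-71.06, 42.36), (-122.42, 37.77), (-73.99, 40.73), (-118.24, 34.05)]
bounds = (
    min(x for x, _ in points), min(y for _, y in points),
    max(x for x, _ in points), max(y for _, y in points),
)

hilbert_sorted = sorted(points, key=lambda p: hilbert_key(p[0], p[1], bounds))
print(hilbert_sorted)
```

The reference bounds matter: every feature has to be scaled against the same extent, otherwise the keys are not comparable across rows.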
Per-tool behavior:
- gpio: Hilbert sort by default. Nothing to configure.
- GDAL: -lco SORT_BY_BBOX=YES (GDAL 3.9+). Bbox-based sort, not strictly Hilbert, but effective for row-group pruning.
- DuckDB: no built-in option. Sort manually in the source query with ST_Hilbert from the spatial extension, as in the recipe below.
The full "perfect" DuckDB recipe hits every best practice at once: Hilbert sort on the geometry, covering bbox struct column, ZSTD, right-sized row groups. Source here is a PostGIS table, but any input table or query result works.
INSTALL spatial; LOAD spatial;
INSTALL postgres; LOAD postgres;
ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);
-- compute the overall extent once; ST_Hilbert needs it as the reference bounds
-- (the CTE sits inside the COPY parentheses, because COPY wraps a single query)
COPY (
WITH bounds AS (
SELECT ST_Extent_Agg(geom) AS ext FROM pg.public.parcels
)
SELECT
p.id, p.name, p.state_code,
ST_AsWKB(p.geom) AS geometry, -- GeoParquet WKB geometry column
{
'xmin': ST_XMin(p.geom),
'ymin': ST_YMin(p.geom),
'xmax': ST_XMax(p.geom),
'ymax': ST_YMax(p.geom)
} AS bbox -- GeoParquet 1.1 covering bbox column
FROM pg.public.parcels p, bounds b
ORDER BY ST_Hilbert(p.geom, b.ext) -- Hilbert spatial sort
) TO 'parcels.parquet' (
FORMAT parquet,
COMPRESSION zstd,
ROW_GROUP_SIZE 100_000
);
What each piece does:
- WITH bounds AS (... ST_Extent_Agg(geom) ...) computes the overall dataset extent once.
- ST_Hilbert(geom, extent) uses that reference box to produce a Hilbert curve index scalar per row.
- ORDER BY ST_Hilbert(...) sorts features by spatial proximity, so consecutive rows in the output are spatially close. The Parquet writer then places them in the same row groups, which is exactly what lets row-group bbox pruning work.
- The bbox struct column (xmin, ymin, xmax, ymax) matches the GeoParquet 1.1 covering bbox convention, so query engines that support it can push predicates without parsing WKB.
- ROW_GROUP_SIZE 100_000 + COMPRESSION zstd are the read-performance defaults from the best practices.
Version caveat: DuckDB writes this as GeoParquet 1.0
metadata (no geo key marking it as 1.1), even though the bbox column is
laid out the way 1.1 clients expect. If strict 1.1 compliance matters, pipe through
gpio convert to rewrite the metadata.
Prefer the newer 2.0 logical types? Drop the bbox struct column (Parquet
computes coordinate statistics natively for the geometry logical type), add
GEOPARQUET_VERSION 'V2', and combine with PARTITION_BY (state_code)
to Hilbert-sort within each partition in one go:
INSTALL spatial; LOAD spatial;
INSTALL postgres; LOAD postgres;
ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);
-- compute the overall extent once, same as before
-- (again inside the COPY parentheses, because COPY wraps a single query)
COPY (
WITH bounds AS (
SELECT ST_Extent_Agg(geom) AS ext FROM pg.public.parcels
)
SELECT
p.id, p.name, p.state_code,
ST_AsWKB(p.geom) AS geometry
FROM pg.public.parcels p, bounds b
ORDER BY ST_Hilbert(p.geom, b.ext)
) TO 'out' (
FORMAT parquet,
COMPRESSION zstd,
ROW_GROUP_SIZE 100_000,
PARTITION_BY (state_code),
GEOPARQUET_VERSION 'V2',
OVERWRITE_OR_IGNORE
);
The output is a Hive-partitioned directory:
out/
state_code=CA/data_0.parquet
state_code=TX/data_0.parquet
state_code=MA/data_0.parquet
...
Each file is a GeoParquet 2.0 file with Parquet's native geometry logical type,
ZSTD-compressed, 100 k-row row groups, Hilbert-sorted within the partition. Readers
filtering by state_code skip whole files at list time; readers filtering
by bbox use the per-row-group coordinate statistics to prune within the partition.
Two layers of pruning, no bbox column to maintain.
Compression
Parquet supports a handful of codecs. The default is SNAPPY, which was
chosen for Hadoop-era workloads where CPU was the bottleneck. For modern geospatial
data on S3 where network is the bottleneck, ZSTD is the right
default: better ratio, comparable decompression speed, and every modern
Parquet client reads it.
| Codec | When to use |
|---|---|
| ZSTD | Default. Best ratio-to-speed trade-off. |
| SNAPPY | Parquet default; legacy Hadoop/Spark ecosystems where it is universally supported. |
| GZIP | When you need older clients that only speak GZIP. |
| LZ4_RAW | Decode-speed-critical workloads; lower ratio than ZSTD. |
| BROTLI | Archival; best ratio but slow to write. |
| NONE | Already-compressed sources (JPEG imagery columns, encrypted bytes). |
For a catalog that gets scanned repeatedly from the edge, ZSTD level 3 (the default) already gives you ~2-3× smaller files than SNAPPY with negligible decode penalty. gpio applies ZSTD by default.
Recipes
Concrete commands for each path. All assume recent GDAL and DuckDB — see setup on the cookbook index.
The opinionated path: gpio
gpio
(geoparquet-io) is a Python CLI and library, built on DuckDB, GDAL, PyArrow,
and obstore. It applies the four best practices above automatically: Hilbert sort, bbox
column, ZSTD, sensible row groups. Fastest way to a spec-compliant output.
# convert anything OGR reads to GeoParquet 1.1 (default)
gpio convert in.shp out.parquet
# write GeoParquet 2.0 with native geometry type
gpio convert in.shp out.parquet --geoparquet-version 2.0
# with attribute partitioning (Hive-style directory)
gpio convert in.shp out/ --partition-by state
# convert and validate in one go
gpio convert in.gpkg out.parquet
gpio describe out.parquet
gpio describe prints the version, CRS, row groups, and whether a bbox
column is present. Use it to sanity-check files produced by other tools as well.
From Shapefile or FileGeodatabase (with GDAL)
GDAL 3.9+ writes GeoParquet 1.1 (WKB + bbox + CRS) by default. GDAL 3.8 writes 1.0; anything older does not support the spec. The driver takes Shapefile, FGDB, GeoPackage, or anything else OGR can read.
ogr2ogr -f Parquet out.parquet in.shp \
-lco COMPRESSION=ZSTD \
-lco ROW_GROUP_SIZE=100000 \
-lco GEOMETRY_ENCODING=WKB \
-lco SORT_BY_BBOX=YES \
-lco WRITE_COVERING_BBOX=AUTO
This sets every relevant option the driver exposes: ZSTD compression, 100 k row groups, WKB encoding, bbox sort, and the bbox column. For attribute partitioning, you need to run one ogr2ogr call per partition value in a shell loop; GDAL has no single-command equivalent.
Version control: WRITE_COVERING_BBOX=AUTO (default) gives you 1.1; set
NO to produce 1.0. USE_PARQUET_GEO_TYPES=YES (GDAL 3.12+,
libarrow 21+) writes the new Parquet Geometry / Geography logical types, which is the
GDAL-side path to GeoParquet 2.0.
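The one-call-per-partition loop can also be scripted from Python. The sketch below only builds and prints the commands; nothing is executed. The source file, the state column, and the partition values are placeholders, and ogr2ogr_partition_command is an illustrative helper name. It assumes values contain no single quotes, since they are pasted straight into the -where expression.

```python
import shlex

# Layer creation options taken from the ogr2ogr recipe above.
LAYER_OPTIONS = {
    "COMPRESSION": "ZSTD",
    "ROW_GROUP_SIZE": "100000",
    "SORT_BY_BBOX": "YES",
    "WRITE_COVERING_BBOX": "AUTO",
}


def ogr2ogr_partition_command(src: str, out_dir: str, column: str, value: str) -> list:
    """Build the ogr2ogr argument list that writes one Hive-style partition."""
    dst = f"{out_dir}/{column}={value}/data_0.parquet"
    cmd = ["ogr2ogr", "-f", "Parquet", dst, src, "-where", f"{column} = '{value}'"]
    for key, val in LAYER_OPTIONS.items():
        cmd += ["-lco", f"{key}={val}"]
    return cmd


# Placeholder inputs; replace with your own source, column, and values.
commands = [
    ogr2ogr_partition_command("in.shp", "out", "state", value)
    for value in ["CA", "MA", "TX"]
]

for cmd in commands:
    # To run for real: create the partition directory first (os.makedirs),
    # then call subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a formatted string keeps the quoting of the -where clause out of the shell's hands.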
From PostGIS (via DuckDB)
DuckDB's postgres extension lets you COPY a PostGIS query
straight to GeoParquet, no intermediate dump on disk.
INSTALL postgres; LOAD postgres;
INSTALL spatial; LOAD spatial;
ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);
COPY (
SELECT id, name, ST_AsWKB(geom) AS geometry
FROM pg.public.parcels
WHERE state_code = 'MA'
) TO 'parcels.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);
ST_AsWKB converts the PostGIS geometry to the binary encoding GeoParquet
expects. The file will round-trip cleanly through DuckDB, GeoPandas, and GDAL.
Version caveat: by default this produces GeoParquet 1.0
(WKB column, no bbox, no CRS metadata). To write 2.0 directly, add the
GEOPARQUET_VERSION option:
COPY (
SELECT id, name, ST_AsWKB(geom) AS geometry
FROM pg.public.parcels
WHERE state_code = 'MA'
) TO 'parcels.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000, GEOPARQUET_VERSION 'V2');
For 1.1 output (bbox + CRS, still WKB), route the 1.0 file through gpio —
1.1 is its default, so no version flag is needed:
gpio convert parcels.parquet parcels-1.1.parquet
For a properly Hilbert-sorted DuckDB output, see the spatial sorting recipe above.
Partitioning by attribute
For large multi-region datasets, partition the output into a Hive-style directory. Queries filtering on the partition key only read the relevant files.
INSTALL spatial; LOAD spatial;
COPY (
SELECT id, state, name, ST_AsWKB(geom) AS geometry
FROM read_parquet('in.parquet')
) TO 'out' (
FORMAT parquet,
PARTITION_BY (state),
COMPRESSION zstd,
ROW_GROUP_SIZE 100_000,
OVERWRITE_OR_IGNORE
);
The result is a directory laid out like:
out/
state=CA/data_0.parquet
state=TX/data_0.parquet
state=MA/data_0.parquet
...
DuckDB, GeoPandas, and GDAL can all query out/ as a single virtual dataset
and will only read the partitions that match a WHERE state = 'MA' filter.
This is exactly how Overture Maps and the
OSM GeoParquet site on this domain
lay out their data.
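What "only read the partitions that match" means in practice: the reader parses key=value segments out of each path and discards files before opening them. A minimal sketch of that idea, using the example layout above; partition_values and prune are illustrative helpers, not any engine's actual implementation.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Extract Hive-style key=value pairs from the directory segments of a path."""
    values = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, **filters):
    """Keep only the files whose partition values match every filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in filters.items())
    ]


files = [
    "out/state=CA/data_0.parquet",
    "out/state=TX/data_0.parquet",
    "out/state=MA/data_0.parquet",
]

selected = prune(files, state="MA")
print(selected)
```

No file is opened at any point; the filter runs on the listing alone, which is why partition pruning is so cheap.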
Grid-based partitioning. When there is no natural attribute to split on (global datasets, imagery-derived features, or any case where the obvious key is too skewed), partition by a coarse discrete global grid cell instead. H3 (Uber) and S2 (Google) are the two common choices. Compute a low-resolution cell index per feature (H3 resolution 2 or 3 gives you sub-continent cells) and use it as the partition key:
INSTALL h3 FROM community; LOAD h3; -- DuckDB H3 community extension
INSTALL spatial; LOAD spatial;
COPY (
SELECT
*,
h3_latlng_to_cell_string(
ST_Y(ST_Centroid(geom)),
ST_X(ST_Centroid(geom)),
3 -- resolution: coarser = fewer partitions
) AS h3_r3,
ST_AsWKB(geom) AS geometry
FROM read_parquet('in.parquet')
) TO 'out' (
FORMAT parquet,
PARTITION_BY (h3_r3),
COMPRESSION zstd,
ROW_GROUP_SIZE 100_000,
OVERWRITE_OR_IGNORE
);
The output directory is structured like out/h3_r3=83f5.../data_0.parquet. Clients that know the H3 index can filter to a coarse cell before reading any data files. The same pattern works with S2 (s2_from_latlng) or a plain geohash.
The choice between attribute and grid partitioning usually comes down to the query
pattern: if users always filter by an administrative key, partition by that key. If
they filter by arbitrary bounding boxes or proximity, a coarse grid is the right
choice. Sometimes both at once (nested: state=CA/h3_r5=.../).
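For the plain-geohash variant, the cell key needs no extension at all. Below is a minimal sketch of the standard geohash encoding; the three-character precision and the geohash_3 partition key name are arbitrary choices for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 3) -> str:
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    value = 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


# Partition directory name for a feature centroid (lat, lon).
cell = geohash(57.64911, 10.40744, precision=3)
print(f"geohash_3={cell}")
```

Shorter hashes mean coarser cells and fewer partitions, the same trade-off as the H3 resolution.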
Reading a GeoParquet back
Verification is the other half of the job. If you cannot read the file over HTTP from
at least DuckDB, it is not published yet. The examples below run against the
live OpenStreetMap GeoParquet catalog
this site publishes — daily snapshots, 98 regions × 16 themes, free and keyless. Copy
and paste directly. For the production-grade companion (session init file, predicate
pushdown deep-dive, EXPLAIN ANALYZE with real numbers, DuckDB-WASM in
the browser), see the GeoParquet Reading Cookbook.
DuckDB against a single GeoParquet URL
Count every OSM building in New York State, straight from the URL, no download:
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;
SELECT COUNT(*) AS buildings
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet');
With predicate pushdown and Parquet statistics, DuckDB fetches only the row groups it
needs. The buildings theme carries columns like tags, levels,
addr_street, addr_postcode, and state_iso; pick
whatever the pipeline promoted for your filter. Confirm how many bytes actually moved:
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001'; -- Chelsea, Manhattan
Look at the bytes_read line. A well-laid-out file returns a tiny fraction of total file size.
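Why so few bytes can be enough: a Parquet file ends with its metadata footer, a 4-byte little-endian footer length, and the magic bytes PAR1. A remote reader can fetch the last 8 bytes, then the footer, and only then decide which row groups to request. The sketch below works on a synthetic in-memory byte string rather than a real file or HTTP request; footer_range is an illustrative helper name.

```python
import struct


def footer_range(file_size: int, tail: bytes) -> tuple:
    """Given the last 8 bytes of a Parquet file, return (start, end) offsets of the footer."""
    if len(tail) != 8 or tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8
    return end - footer_len, end


# Synthetic stand-in for a file: magic, fake data pages, fake footer, length, magic.
fake_footer = b"row-group statistics would live here"
blob = (
    b"PAR1"
    + b"\x00" * 1000
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + b"PAR1"
)

# Over HTTP the first request would be a suffix range: "Range: bytes=-8".
start, end = footer_range(len(blob), blob[-8:])
print(f"footer occupies bytes {start}..{end - 1} of {len(blob)}")
```

Everything an engine needs for row-group pruning sits in that footer, which is why the first two requests are small no matter how large the file is.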
Catalog-wide queries via the S3 endpoint
DuckDB's httpfs cannot expand glob patterns (*) over plain
HTTPS because generic HTTP has no directory-listing primitive. The geoparquet catalog
exposes an anonymous S3-compatible endpoint at s3.geomermaids.com that
speaks just enough of the S3 API (ListObjectsV2 + ranged
GetObject) for glob expansion. Set it up once per session:
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;
SET s3_endpoint = 's3.geomermaids.com';
SET s3_url_style = 'path';
SET s3_use_ssl = true;
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
Now wildcards work. Count buildings per US state in one query:
SELECT state_iso, COUNT(*) AS buildings
FROM read_parquet('s3://parquetry/latest/country=US/state=*/buildings.parquet')
GROUP BY state_iso
ORDER BY buildings DESC
LIMIT 10;
state_iso is a literal column inside every file, so you group by it
directly without needing Hive-style partitioning flags. See the
catalog-wide queries
section on geoparquet.geomermaids.com for more examples (continent-wide
airports, wind turbines, etc.).
GDAL command line
ogrinfo /vsicurl/https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet -so -al
The /vsicurl/ prefix lets GDAL CLI tools read the remote file with range
requests, same mechanism as with COG. Unlike in QGIS, this path is streaming all the
way down. See the next section for why QGIS itself does not share that property.
QGIS specifics
QGIS is the most common desktop client for cloud-native data, and it handles the two main formats very unevenly. The raster side (COG) streams over HTTP with range requests and is first-class. The vector side (GeoParquet) is not. Knowing the asymmetry up front saves a lot of confused users.
GeoParquet: downloads the whole file
QGIS 3.36+ reads GeoParquet from URLs, but not via range requests. It downloads the entire file to a local cache before rendering anything. For a 100 MB file that is fine. For a multi-GB world-scale dataset it is often unworkable: the whole payload crosses the wire before a single feature draws, and the user experience on a corporate VPN or a flaky connection is miserable.
The reason is that QGIS's vector rendering pipeline expects to load features into
memory and build its own spatial index, rather than stream partial reads from object
storage. The /vsicurl/ virtual file system that works so well for COG
(and via ogrinfo / ogr2ogr on GeoParquet too) does not get
the same plumbing inside QGIS for vector sources. GeoParquet 2.0's Parquet-native
geometry statistics would make true streaming feasible, but the client work has not
caught up yet.
Workarounds until QGIS catches up:
- Filter with DuckDB first, then load the slimmer output. One SQL query against the remote URL, write a local .parquet with just the features and columns you need, open that in QGIS. Two steps but keeps the interactive part snappy.
- Vector tile server in front of your GeoParquet. Martin or pg_tileserv if the source can live in PostGIS; a small Cloudflare Worker in front of static Parquet archives for cases where a live database is overkill.
- DuckDB-WASM in the browser. For truly interactive exploration across multi-GB datasets without a server, paired with a STAC catalog and Deck.gl or MapLibre for the map.
- Convert to GeoPackage or FlatGeobuf for the analyst workflow. GeoPackage for local editing, FlatGeobuf if you want a single file with a built-in spatial index that QGIS can read lazily over HTTP.
None are as seamless as the COG story. It is the single biggest rough edge of the cloud-native vector stack for desktop users today.
GeoParquet-specific pitfalls
- Tiny Parquet row groups. Under 10k rows per group, metadata overhead eats the benefit of predicate pushdown. If your writer defaults to 1024 rows per group (old PyArrow does), set row_group_size explicitly.
- Plain pandas on a GeoDataFrame. pandas.to_parquet() drops the geometry metadata. Use gdf.to_parquet() from GeoPandas with schema_version='1.1.0'.
- Unsorted features with large row groups. Bbox-based pruning collapses if features are in insertion order rather than spatial order. Either SORT_BY_BBOX=YES in GDAL, manual ST_Hilbert in DuckDB, or let gpio do it for you.
- DuckDB default version confusion. A GeoParquet file produced by a plain COPY ... TO ... (FORMAT parquet) is 1.0 (no bbox column, no CRS). Do not rely on downstream tools to "figure it out".
Publishing pitfalls (Content-Type, CORS, mutable filenames) are on the cookbook index, since they apply to GeoParquet and COG alike.
Next
Working with rasters? Head to the COG cookbook. The index has shared setup, publishing, and a cross-format common-pitfalls list. The conceptual background to all of this is on the Cloud-Native Geospatial page.