geomermaids · GeoParquet Writing cookbook


GeoParquet is a great format, but getting it right can be tricky. The ecosystem is still young, and navigating versions, tools, and encoding options can feel like a maze. This page covers best practices, explains the different versions and encoding options, and provides concrete recipes for GDAL, DuckDB, and gpio. Please report any issues or missing pieces through the contact form so we can keep this page accurate and up-to-date, and help grow the GeoParquet community.

Tools and setup

Four tools cover most GeoParquet workflows. Choose the one you are most comfortable with, or the one already in your pipeline. You can also use the Apache Arrow Python bindings for low-level Parquet operations, but that is outside the scope of this cookbook.

GDAL

The default choice for existing pipelines built on GDAL / OGR. The Parquet driver writes GeoParquet 1.1 by default from 3.9 onwards and reads any version, including remote files via /vsicurl/. Version cutoffs: 3.8 added 1.0 support, 3.9 added 1.1 (bbox + CRS), 3.12 added 2.0 via USE_PARQUET_GEO_TYPES=YES.

brew install gdal              # macOS
sudo apt install gdal-bin       # Ubuntu / Debian
gdalinfo --version             # expect 3.9 or newer
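The version cutoffs above can be checked in a script. The following is a minimal Python sketch, assuming the `gdalinfo --version` banner starts with `GDAL <major>.<minor>`. The function name `geoparquet_support` and the sample banner are illustrative, not part of GDAL.

```python
import re


def geoparquet_support(version_banner: str) -> dict:
    """Map a `gdalinfo --version` banner to the GeoParquet versions GDAL can write.

    Cutoffs follow this page: 3.8 writes 1.0, 3.9 writes 1.1 by default,
    3.12 can opt in to 2.0.
    """
    match = re.search(r"GDAL (\d+)\.(\d+)", version_banner)
    if match is None:
        raise ValueError(f"unrecognised banner: {version_banner!r}")
    version = (int(match.group(1)), int(match.group(2)))

    if version < (3, 8):
        return {"default": None, "can_write": []}
    if version < (3, 9):
        return {"default": "1.0", "can_write": ["1.0"]}
    can_write = ["1.0", "1.1"]
    if version >= (3, 12):
        can_write.append("2.0")  # requires USE_PARQUET_GEO_TYPES=YES
    return {"default": "1.1", "can_write": can_write}


# Sample banner (assumed format); in practice, pass the real command output.
print(geoparquet_support("GDAL 3.9.1, released 2024/06/22"))
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "3.12" sorts before "3.9".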

DuckDB

The modern query engine: SQL directly against remote GeoParquet over HTTP, joins across URLs, predicate pushdown, and a growing spatial function library. Also the most ergonomic way to read PostGIS and write GeoParquet in one step.

brew install duckdb                       # macOS
curl https://install.duckdb.org | sh       # Linux
duckdb --version                           # expect 1.5.2 or newer

# inside DuckDB, once per session:
INSTALL spatial; LOAD spatial;
INSTALL httpfs; LOAD httpfs;

gpio — Swiss-army knife

geoparquet-io is the Swiss-army knife for GeoParquet: convert, validate, inspect, partition, upgrade. Opinionated defaults (Hilbert sort, ZSTD, bbox column, sensible row groups). If you do not care about individual flags, this is the fastest path to a spec-compliant file. Recommended as the default unless you need a specific GDAL or DuckDB feature it does not expose.

uv tool install geoparquet-io   # recommended
# or
pip install geoparquet-io

gpio --version
gpio convert --help
gpio describe --help

GeoPandas — use indirectly

The original Python GeoParquet implementation. It works, but its default is still GeoParquet 1.0 (no bbox column, no CRS metadata), and it trails the spec by a cycle or two. Do not use gdf.to_parquet() directly in production. Instead, export to GeoPackage from GeoPandas, then run gpio convert on the GPKG. You keep the GeoPandas ergonomics for analysis and get a spec-compliant, opinionated output file:

# Python: write GeoPackage from your GeoDataFrame
import geopandas as gpd
gdf = gpd.read_file('in.shp')   # placeholder input; use the GeoDataFrame you already have
gdf.to_file('scratch.gpkg', layer='data', driver='GPKG')
# Shell: convert to GeoParquet with sensible defaults
gpio convert scratch.gpkg out.parquet

Best practices: making queries fast

Writing a valid GeoParquet file is easy. Writing one that a query engine can filter cheaply takes five pieces that work together: a spatial sort, right-sized row groups, a bbox column, ZSTD compression, and attribute partitioning.

The asymmetry between tools: gpio applies the first four by default and exposes partitioning as a single flag, which is why it is the opinionated path. GDAL and DuckDB need explicit opt-ins for most of them. GeoPandas leaves most of this to the caller.

| Best practice | gpio | GDAL (ogr2ogr) | DuckDB |
| --- | --- | --- | --- |
| Spatial sort | default (Hilbert) | -lco SORT_BY_BBOX=YES | manual ORDER BY ST_Hilbert(geom, bounds) |
| Row group size | data-driven | -lco ROW_GROUP_SIZE=100000 | ROW_GROUP_SIZE 100_000 |
| Bbox column | default | WRITE_COVERING_BBOX=AUTO (default in 3.9+) | manual columns, or GEOPARQUET_VERSION 'V2' |
| ZSTD compression | default | -lco COMPRESSION=ZSTD | COMPRESSION zstd |
| Attribute partitioning | --partition-by state | scripted loop (one call per value) | PARTITION_BY (state) |

Rule of thumb: unless you have a specific reason to customize or an existing pipeline, use gpio. It applies four of the five best practices above by default (partitioning is one flag away) and tracks the ecosystem as it evolves. Reach for the GDAL or DuckDB recipes when you need a flag gpio does not expose, or when you are already deep in those toolchains.

Which version: 1.0, 1.1, or 2.0?

GeoParquet has three versions in the wild, and different tools write different defaults. Worth knowing which one you are actually producing, because clients downstream may or may not read them.

| Version | What it has | Status |
| --- | --- | --- |
| 1.0 | Parquet with a WKB geometry column. No bbox column, no CRS metadata. | Minimum viable. Readable by everything, but spatial queries are full scans. |
| 1.1 | WKB geometry + bbox column + CRS metadata + geo metadata key. | Official OGC spec. The safe default for publishing today. |
| 2.0 | Native Parquet geometry type (no WKB conversion on read), CRS, bbox. | Delivers the format's full potential: Parquet-native coordinate statistics, zero-copy reads into DuckDB and Arrow clients, no separate bbox column needed. Still no spatial index inside the file. |

Tool support and what each writes by default:

| Tool | 1.0 | 1.1 | 2.0 |
| --- | --- | --- | --- |
| GDAL 3.8 | default | - | - |
| GDAL 3.9+ | WRITE_COVERING_BBOX=NO | default | USE_PARQUET_GEO_TYPES=YES (≥3.12) |
| DuckDB (Parquet writer) | default | - | GEOPARQUET_VERSION 'V2' |
| gpio | - | default | --geoparquet-version 2.0 |
| GeoPandas | default | schema_version='1.1.0' | - |

Practical advice: write 2.0 whenever you control the reader. It is the first version that delivers the format's full potential. Fall back to 1.1 for publication to a wide, mixed audience. 1.0 essentially never, except for clients stuck on pre-2024 tooling.

Encoding: how geometries are stored

The encoding field in GeoParquet metadata can point to three very different storage strategies. Worth understanding because it determines read speed, interoperability, and whether you need a separate bbox column.

Rule of thumb: reach for the Parquet Geometry logical types (2.0) when your toolchain supports libarrow 21+. Drop down to WKB (1.0 / 1.1) only for compatibility with older clients. Native GeoArrow is a middle ground: fast columnar reads, but locks each column to one geometry type.
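To see which version and encoding a file declares, you can inspect the `geo` metadata value. The sketch below parses a hypothetical sample shaped like GeoParquet 1.1 file metadata. `SAMPLE_GEO` and `summarize_geo` are illustrative names. In a real file the string would come from the Parquet footer, for example through PyArrow's `pyarrow.parquet.read_metadata(path).metadata[b"geo"]`.

```python
import json

# Hypothetical `geo` metadata value, shaped like GeoParquet 1.1 file metadata.
SAMPLE_GEO = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
            "covering": {
                "bbox": {
                    "xmin": ["bbox", "xmin"],
                    "ymin": ["bbox", "ymin"],
                    "xmax": ["bbox", "xmax"],
                    "ymax": ["bbox", "ymax"],
                }
            },
        }
    },
})


def summarize_geo(geo_json: str) -> dict:
    """Report the declared version, the encoding, and whether a bbox covering is declared."""
    meta = json.loads(geo_json)
    primary = meta["primary_column"]
    column = meta["columns"][primary]
    return {
        "version": meta["version"],
        "primary_column": primary,
        "encoding": column["encoding"],
        "has_bbox_covering": "bbox" in column.get("covering", {}),
    }


print(summarize_geo(SAMPLE_GEO))
```

A file that stores geometry only through the Parquet logical types may carry no `geo` key at all, so treat a missing key as something to check for, not as an error.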

Row group sizing

A Parquet file is a sequence of row groups, each carrying per-column min/max statistics in its metadata. On a spatial predicate like ST_Intersects, a query engine reads those statistics first and skips any row group whose bbox does not overlap the query geometry. Only the remaining groups are decompressed and scanned.

Because GeoParquet (any version) does not yet embed a per-row spatial index, the per-row-group bbox is the only cheap mechanism available during a spatial filter. It prunes whole row groups before any geometry is decoded. Within surviving row groups, the engine still parses every feature and runs the real predicate row-by-row — so row group size decides how fine-grained the cheap stage of pruning is.
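The cheap pruning stage amounts to a rectangle-overlap test per row group. Here is a minimal Python sketch of that logic. The row-group bboxes and the query window are made-up values, and real engines read the bounds from Parquet statistics instead of a Python list.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(a: BBox, b: BBox) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def surviving_row_groups(row_group_bboxes, query: BBox):
    """Indexes of row groups that must still be decompressed and scanned."""
    return [i for i, bbox in enumerate(row_group_bboxes) if overlaps(bbox, query)]


# Hypothetical per-row-group bounds of a spatially sorted file (lon/lat).
row_groups = [
    BBox(-125.0, 32.0, -114.0, 42.0),  # west
    BBox(-106.0, 26.0, -93.0, 36.0),   # south
    BBox(-74.0, 41.0, -70.0, 43.0),    # north-east
]
query = BBox(-72.0, 41.5, -71.0, 42.5)

print(surviving_row_groups(row_groups, query))  # [2]
```

Smaller row groups mean more, tighter rectangles in this list, so fewer rows survive to the expensive per-feature stage.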

Writer defaults vary. Set the option explicitly if you care:

| Tool | Option | Default |
| --- | --- | --- |
| GDAL | -lco ROW_GROUP_SIZE=100000 | 65 536 |
| DuckDB | ROW_GROUP_SIZE 100_000 in the COPY ... TO options list | ~122 880 |
| GeoPandas | row_group_size=100_000 | no default (follows PyArrow) |
| gpio | --row-group-size 100000 | data-driven |

Spatial sorting

Right-sized row groups only help if features are laid out so that each group's bbox is tight. Without a spatial sort, features in insertion order leave each row group's bbox spanning the entire dataset, and the row-group pruning from the previous section collapses.
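To make the idea of a spatial sort concrete, here is a pure-Python sketch of a Hilbert sort key. It uses the classic iterative grid-to-curve mapping. The helper names, sample points, and bounds are illustrative, and the orientation may differ from what ST_Hilbert or gpio produce.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Distance along a Hilbert curve of the given order for grid cell (x, y)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(x: float, y: float, bounds, order: int = 16) -> int:
    """Snap a coordinate to a 2**order grid inside bounds, then index it."""
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1
    gx = int((x - xmin) / (xmax - xmin) * cells)
    gy = int((y - ymin) / (ymax - ymin) * cells)
    return hilbert_index(gx, gy, order)


# Illustrative points (lon, lat) and dataset extent.
points = [(-71.06, 42.36), (-118.24, 34.05), (-71.41, 41.82), (-122.42, 37.77)]
bounds = (-125.0, 24.0, -66.0, 50.0)

ordered = sorted(points, key=lambda p: hilbert_key(p[0], p[1], bounds))
print(ordered)
```

After sorting, the two west-coast points come first and the two east-coast points end up next to each other. That locality is what keeps each row group's bbox tight.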

Two practical options:

Per-tool behavior:

The full "perfect" DuckDB recipe hits every best practice at once: Hilbert sort on the geometry, covering bbox struct column, ZSTD, right-sized row groups. Source here is a PostGIS table, but any input table or query result works.

INSTALL spatial; LOAD spatial;
INSTALL postgres; LOAD postgres;

ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);

-- compute the overall extent once; ST_Hilbert needs it as the reference bounds.
-- The CTE sits inside the COPY subquery so it is parsed as part of the SELECT,
-- and ST_Extent turns the aggregated envelope into the BOX_2D that ST_Hilbert takes.
COPY (
  WITH bounds AS (
    SELECT ST_Extent(ST_Extent_Agg(geom)) AS ext FROM pg.public.parcels
  )
  SELECT
    p.id, p.name, p.state_code,
    ST_AsWKB(p.geom) AS geometry,          -- GeoParquet WKB geometry column
    {
      'xmin': ST_XMin(p.geom),
      'ymin': ST_YMin(p.geom),
      'xmax': ST_XMax(p.geom),
      'ymax': ST_YMax(p.geom)
    } AS bbox                              -- GeoParquet 1.1 covering bbox column
  FROM pg.public.parcels p, bounds b
  ORDER BY ST_Hilbert(p.geom, b.ext)       -- Hilbert spatial sort
) TO 'parcels.parquet' (
  FORMAT parquet,
  COMPRESSION zstd,
  ROW_GROUP_SIZE 100_000
);

What each piece does:

Version caveat: DuckDB writes this as GeoParquet 1.0 metadata (no geo key marking it as 1.1), even though the bbox column is laid out the way 1.1 clients expect. If strict 1.1 compliance matters, pipe through gpio convert to rewrite the metadata.

Prefer the newer 2.0 logical types? Drop the bbox struct column (Parquet computes coordinate statistics natively for the geometry logical type), add GEOPARQUET_VERSION 'V2', and combine with PARTITION_BY (state_code) to Hilbert-sort within each partition in one go:

INSTALL spatial; LOAD spatial;
INSTALL postgres; LOAD postgres;

ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);

-- compute the overall extent once, same as before.
-- The CTE sits inside the COPY subquery; ST_Extent yields the BOX_2D that ST_Hilbert takes.
COPY (
  WITH bounds AS (
    SELECT ST_Extent(ST_Extent_Agg(geom)) AS ext FROM pg.public.parcels
  )
  SELECT
    p.id, p.name, p.state_code,
    ST_AsWKB(p.geom) AS geometry
  FROM pg.public.parcels p, bounds b
  ORDER BY ST_Hilbert(p.geom, b.ext)
) TO 'out' (
  FORMAT parquet,
  COMPRESSION zstd,
  ROW_GROUP_SIZE 100_000,
  PARTITION_BY (state_code),
  GEOPARQUET_VERSION 'V2',
  OVERWRITE_OR_IGNORE
);

The output is a Hive-partitioned directory:

out/
  state_code=CA/data_0.parquet
  state_code=TX/data_0.parquet
  state_code=MA/data_0.parquet
  ...

Each file is a GeoParquet 2.0 file with Parquet's native geometry logical type, ZSTD-compressed, 100 k-row row groups, Hilbert-sorted within the partition. Readers filtering by state_code skip whole files at list time; readers filtering by bbox use the per-row-group coordinate statistics to prune within the partition. Two layers of pruning, no bbox column to maintain.

Compression

Parquet supports a handful of codecs. The default is SNAPPY, which was chosen for Hadoop-era workloads where CPU was the bottleneck. For modern geospatial data on S3 where network is the bottleneck, ZSTD is the right default: better ratio, comparable decompression speed, and every modern Parquet client reads it.

| Codec | When to use |
| --- | --- |
| ZSTD | Default. Best ratio-to-speed trade-off. |
| SNAPPY | Parquet default; legacy Hadoop/Spark ecosystems where it is universally supported. |
| GZIP | When you need older clients that only speak GZIP. |
| LZ4_RAW | Decode-speed-critical workloads; lower ratio than ZSTD. |
| BROTLI | Archival; best ratio but slow to write. |
| NONE | Already-compressed sources (pre-JPEG imagery columns, encrypted bytes). |

For a catalog that gets scanned repeatedly from the edge, ZSTD level 3 (the default) already gives you ~2-3× smaller files than SNAPPY with negligible decode penalty. gpio applies ZSTD by default.


Recipes

Concrete commands for each path. All assume recent GDAL and DuckDB — see setup on the cookbook index.

The opinionated path: gpio

gpio (geoparquet-io) is a Python CLI and library, built on DuckDB, GDAL, PyArrow, and obstore. It applies the four best practices above automatically: Hilbert sort, bbox column, ZSTD, sensible row groups. Fastest way to a spec-compliant output.


# convert anything OGR reads to GeoParquet 1.1 (default)
gpio convert in.shp out.parquet

# write GeoParquet 2.0 with native geometry type
gpio convert in.shp out.parquet --geoparquet-version 2.0

# with attribute partitioning (Hive-style directory)
gpio convert in.shp out/ --partition-by state

# convert, then inspect the result
gpio convert in.gpkg out.parquet
gpio describe out.parquet

gpio describe prints the version, CRS, row groups, and whether a bbox column is present. Use it to sanity-check files produced by other tools as well.

From Shapefile or FileGeodatabase (with GDAL)

GDAL 3.9+ writes GeoParquet 1.1 (WKB + bbox + CRS) by default. GDAL 3.8 writes 1.0; anything older does not support the spec. The driver takes Shapefile, FGDB, GeoPackage, or anything else OGR can read.

ogr2ogr -f Parquet out.parquet in.shp \
  -lco COMPRESSION=ZSTD \
  -lco ROW_GROUP_SIZE=100000 \
  -lco GEOMETRY_ENCODING=WKB \
  -lco SORT_BY_BBOX=YES \
  -lco WRITE_COVERING_BBOX=AUTO

This sets the five driver options that matter here (ZSTD, 100k row groups, WKB encoding, bbox sort, bbox column). For attribute partitioning, run one ogr2ogr call per partition value in a shell loop; GDAL has no single-command equivalent.
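The per-value loop can also be driven from Python. The sketch below only builds the ogr2ogr argument lists for a Hive-style layout and does not execute anything. The function name, the `state` field, the value list, and the `data_0.parquet` file name are assumptions for illustration, and the quoting of the `-where` value is deliberately naive.

```python
import shlex


def partition_commands(src: str, out_dir: str, field: str, values):
    """Build one ogr2ogr argv per partition value (Hive-style target paths)."""
    commands = []
    for value in values:
        target = f"{out_dir}/{field}={value}/data_0.parquet"
        where = f"{field} = '{value}'"  # naive quoting: assumes no quotes in value
        commands.append([
            "ogr2ogr", "-f", "Parquet", target, src,
            "-where", where,
            "-lco", "COMPRESSION=ZSTD",
            "-lco", "ROW_GROUP_SIZE=100000",
            "-lco", "SORT_BY_BBOX=YES",
        ])
    return commands


# Hypothetical partition values; in practice, query the distinct values first.
for argv in partition_commands("in.shp", "out", "state", ["CA", "TX", "MA"]):
    print(shlex.join(argv))
```

To run the commands, pass each list to `subprocess.run(argv, check=True)` after creating the partition directories, which this sketch assumes exist.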

Version control: WRITE_COVERING_BBOX=AUTO (default) gives you 1.1; set NO to produce 1.0. USE_PARQUET_GEO_TYPES=YES (GDAL 3.12+, libarrow 21+) writes the new Parquet Geometry / Geography logical types, which is the GDAL-side path to GeoParquet 2.0.

From PostGIS (via DuckDB)

DuckDB's postgres extension lets you COPY a PostGIS query straight to GeoParquet, no intermediate dump on disk.

INSTALL postgres; LOAD postgres;
INSTALL spatial; LOAD spatial;

ATTACH 'postgresql://user@localhost/mydb' AS pg (TYPE postgres);

COPY (
  SELECT id, name, ST_AsWKB(geom) AS geometry
  FROM pg.public.parcels
  WHERE state_code = 'MA'
) TO 'parcels.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000);

ST_AsWKB converts the PostGIS geometry to the binary encoding GeoParquet expects. The file will round-trip cleanly through DuckDB, GeoPandas, and GDAL.
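For a sense of what that binary encoding looks like, here is a minimal Python sketch that packs and unpacks a 2D point as little-endian WKB. It covers only this one geometry type, and the helper names are illustrative. Real pipelines should rely on DuckDB, GDAL, or Shapely for WKB.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point: byte-order flag 1 (little-endian), type 1 (Point), x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb: bytes):
    """Decode the 21-byte little-endian 2D point produced above."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


wkb = point_to_wkb(1.0, 2.0)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

The point takes 21 bytes: one for byte order, four for the geometry type, and eight for each coordinate.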

Version caveat: by default this produces GeoParquet 1.0 (WKB column, no bbox, no CRS metadata). To write 2.0 directly, add the GEOPARQUET_VERSION option:

COPY (
  SELECT id, name, ST_AsWKB(geom) AS geometry
  FROM pg.public.parcels
  WHERE state_code = 'MA'
) TO 'parcels.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000, GEOPARQUET_VERSION 'V2');

For 1.1 output (bbox + CRS, still WKB), route the 1.0 file through gpio — 1.1 is its default, so no version flag is needed:

gpio convert parcels.parquet parcels-1.1.parquet

For a properly Hilbert-sorted DuckDB output, see the spatial sorting recipe above.

Partitioning by attribute

For large multi-region datasets, partition the output into a Hive-style directory. Queries filtering on the partition key only read the relevant files.

INSTALL spatial; LOAD spatial;

COPY (
  SELECT id, state, name, ST_AsWKB(geom) AS geometry
  FROM read_parquet('in.parquet')
) TO 'out' (
  FORMAT parquet,
  PARTITION_BY (state),
  COMPRESSION zstd,
  ROW_GROUP_SIZE 100_000,
  OVERWRITE_OR_IGNORE
);

The result is a directory laid out like:

out/
  state=CA/data_0.parquet
  state=TX/data_0.parquet
  state=MA/data_0.parquet
  ...

DuckDB, GeoPandas, and GDAL can all query out/ as a single virtual dataset and will only read the partitions that match a WHERE state = 'MA' filter. This is exactly how Overture Maps and the OSM GeoParquet site on this domain lay out their data.
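Partition pruning of this kind needs nothing more than the file paths. The following Python sketch shows the idea: parse the `key=value` directory segments and drop files whose keys do not match the filter. The path list mirrors the layout above, and the helper names are illustrative.

```python
def parse_hive_path(path: str) -> dict:
    """Extract key=value partition segments from a Hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, **filters):
    """Keep only the files whose partition keys match every filter."""
    return [
        p for p in paths
        if all(parse_hive_path(p).get(k) == v for k, v in filters.items())
    ]


paths = [
    "out/state=CA/data_0.parquet",
    "out/state=TX/data_0.parquet",
    "out/state=MA/data_0.parquet",
]

print(prune(paths, state="MA"))  # ['out/state=MA/data_0.parquet']
```

No file is opened during pruning, which is why a filter on the partition key is so cheap compared with a filter on any other column.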

Grid-based partitioning. When there is no natural attribute to split on (global datasets, imagery-derived features, or any case where the obvious key is too skewed), partition by a coarse discrete global grid cell instead. H3 (Uber) and S2 (Google) are the two common choices. Compute a low-resolution cell index per feature (H3 resolution 2 or 3 gives you sub-continent cells) and use it as the partition key:

INSTALL h3 FROM community; LOAD h3;  -- DuckDB H3 community extension
INSTALL spatial; LOAD spatial;

COPY (
  SELECT
    *,
    h3_latlng_to_cell_string(
      ST_Y(ST_Centroid(geom)),
      ST_X(ST_Centroid(geom)),
      3                                -- resolution: coarser = fewer partitions
    ) AS h3_r3,
    ST_AsWKB(geom) AS geometry
  FROM read_parquet('in.parquet')
) TO 'out' (
  FORMAT parquet,
  PARTITION_BY (h3_r3),
  COMPRESSION zstd,
  ROW_GROUP_SIZE 100_000,
  OVERWRITE_OR_IGNORE
);

The output directory is structured like out/h3_r3=83f5.../data_0.parquet. Clients that know the H3 index can filter to a coarse cell before reading any data files. The same pattern works with S2 (s2_from_latlng) or a plain geohash.

The choice between attribute and grid partitioning usually comes down to the query pattern: if users always filter by an administrative key, partition by that key. If they filter by arbitrary bounding boxes or proximity, a coarse grid is the right choice. Sometimes both at once (nested: state=CA/h3_r5=.../).
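To illustrate how a coarse grid key is derived, the sketch below computes a geohash prefix in pure Python, as a dependency-free stand-in for an H3 or S2 cell. The `geohash=` directory name and the precision of 2 are illustrative choices, not a convention taken from any tool.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 3) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits, 5 bits per char."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    value = 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value = value << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(_BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


def partition_path(lat: float, lon: float, precision: int = 2) -> str:
    """Hive-style path for a feature, keyed on its coarse geohash cell."""
    return f"out/geohash={geohash(lat, lon, precision)}/data_0.parquet"


print(geohash(57.64911, 10.40744, 3))   # u4p
print(partition_path(57.64911, 10.40744))
```

Shorter prefixes give fewer, larger partitions, which plays the same role as a lower H3 resolution.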


Reading a GeoParquet back

Verification is the other half of the job. If you cannot read the file over HTTP from at least DuckDB, it is not published yet. The examples below run against the live OpenStreetMap GeoParquet catalog this site publishes — daily snapshots, 98 regions × 16 themes, free and keyless. Copy and paste directly. For the production-grade companion (session init file, predicate pushdown deep-dive, EXPLAIN ANALYZE with real numbers, DuckDB-WASM in the browser), see the GeoParquet Reading Cookbook.

DuckDB against a single GeoParquet URL

Count every OSM building in New York State, straight from the URL, no download:

INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;

SELECT COUNT(*) AS buildings
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet');

With predicate pushdown and Parquet statistics, DuckDB fetches only the row groups it needs. The buildings theme carries columns like tags, levels, addr_street, addr_postcode, and state_iso; pick whatever the pipeline promoted for your filter. Confirm how many bytes actually moved:

EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_parquet('https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet')
WHERE addr_postcode = '10001';   -- Chelsea, Manhattan

Look at the bytes_read line. A well-laid-out file returns a tiny fraction of total file size.

Catalog-wide queries via the S3 endpoint

DuckDB's httpfs cannot expand glob patterns (*) over plain HTTPS because generic HTTP has no directory-listing primitive. The geoparquet catalog exposes an anonymous S3-compatible endpoint at s3.geomermaids.com that speaks just enough of the S3 API (ListObjectsV2 + ranged GetObject) for glob expansion. Set it up once per session:

INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial;

SET s3_endpoint = 's3.geomermaids.com';
SET s3_url_style = 'path';
SET s3_use_ssl = true;
SET s3_access_key_id = ''; SET s3_secret_access_key = '';

Now wildcards work. Count buildings per US state in one query:

SELECT state_iso, COUNT(*) AS buildings
FROM read_parquet('s3://parquetry/latest/country=US/state=*/buildings.parquet')
GROUP BY state_iso
ORDER BY buildings DESC
LIMIT 10;

state_iso is a literal column inside every file, so you group by it directly without needing Hive-style partitioning flags. See the catalog-wide queries section on geoparquet.geomermaids.com for more examples (continent-wide airports, wind turbines, etc.).
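The need for a listing call becomes clear once glob expansion is written out: the client can only match a pattern against keys it has been given. Here is a minimal Python sketch, with made-up object keys standing in for a ListObjectsV2 response. In this sketch `*` stays within one path segment.

```python
import re


def glob_to_regex(pattern: str):
    """Translate a glob where '*' matches any characters except '/'."""
    body = "[^/]*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + "$")


def expand_glob(pattern: str, listed_keys):
    """Return the listed keys that match the pattern."""
    rx = glob_to_regex(pattern)
    return [key for key in listed_keys if rx.match(key)]


# Hypothetical keys, as a listing call might return them.
listed_keys = [
    "latest/country=US/state=US-NY/buildings.parquet",
    "latest/country=US/state=US-NY/roads.parquet",
    "latest/country=US/state=US-CA/buildings.parquet",
    "latest/country=FR/region=FR-IDF/buildings.parquet",
]

pattern = "latest/country=US/state=*/buildings.parquet"
print(expand_glob(pattern, listed_keys))
```

Without `listed_keys` there is nothing to match against, which is the situation a client is in over plain HTTPS.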

GDAL command line

ogrinfo /vsicurl/https://parquetry.geomermaids.com/latest/country=US/state=US-NY/buildings.parquet -so -al

The /vsicurl/ prefix lets GDAL CLI tools read the remote file with range requests, same mechanism as with COG. Unlike in QGIS, this path is streaming all the way down. See the next section for why QGIS itself does not share that property.

QGIS specifics

QGIS is the most common desktop client for cloud-native data, and it handles the two main formats very unevenly. The raster side (COG) streams over HTTP with range requests and is first-class. The vector side (GeoParquet) is not. Knowing the asymmetry up front saves a lot of confused users.

GeoParquet: downloads the whole file

QGIS 3.36+ reads GeoParquet from URLs, but not via range requests. It downloads the entire file to a local cache before rendering anything. For a 100 MB file that is fine. For a multi-GB world-scale dataset it is often unworkable: the whole payload crosses the wire before a single feature draws, and the user experience on a corporate VPN or a flaky connection is miserable.

The reason is that QGIS's vector rendering pipeline expects to load features into memory and build its own spatial index, rather than stream partial reads from object storage. The /vsicurl/ virtual file system that works so well for COG (and via ogrinfo / ogr2ogr on GeoParquet too) does not get the same plumbing inside QGIS for vector sources. GeoParquet 2.0's Parquet-native geometry statistics would make true streaming feasible, but the client work has not caught up yet.

Workarounds until QGIS catches up:

None are as seamless as the COG story. It is the single biggest rough edge of the cloud-native vector stack for desktop users today.

GeoParquet-specific pitfalls

Publishing pitfalls (Content-Type, CORS, mutable filenames) are on the cookbook index, since they apply to GeoParquet and COG alike.

Next

Working with rasters? Head to the COG cookbook. The index has shared setup, publishing, and a cross-format common-pitfalls list. The conceptual background to all of this is on the Cloud-Native Geospatial page.