
How can DuckDB leverage GDAL/OGR in everyday use ?
Guillaume Sueur
The new duck on the block
I think you may all know about both DuckDB and GDAL, so I won’t provide an expansive introduction here. Yesterday the DuckDB / PostGIS interaction was brilliantly showcased by Matt Forrest during his online event you can discover here.
What I appreciated the most was the chosen angle not to make DuckDB and PostGIS competitors but different tools which can take advantage of each other. And it’s quite the same angle I would like to consider today to delve into how DuckDB can be helpful in running GDAL/OGR tasks.
GDAL is a fantastic toolkit / library but is quite difficult to comprehend for the newbies and can even be tricky for well-seasoned silverbacks. The name itself is hard to pronounce, can come with an doppelgänger named OGR which applies to vector datasets only but is not something independent from GDAL (which is thus dedicated to raster datasets…) and so you end up using GDAL, OGR, GDAL/OGR expressions which makes things sound harder than they are. And then you have the options… i.e build options, configuration options, layer creation options, open options… which definitely bring the whole thing to a different level, closer to black magic than a simple, straightforward input to have the right thing done. While the consummate GIS specialist will take full advantage of these numerous options, the average users may feel overwhelmed by having to delve into the comprehensive and sometimes cryptic documentation to activate such power at their fingertips.
Meanwhile, DuckDB provides an easy-to-install, battery included executable file which comes with a spatial extension relying heavily on GDAL/OGR for managing dataset formats and GEOS for some other internals. Saying so we can easily figure out how DuckDB can leverage the usual ogr2ogr dance. Instead of using GDAL/OGR instructions, you can directly go through DuckDB to translate your data into a new format :
duckdb -s "INSTALL spatial;LOAD spatial;COPY (SELECT * FROM ST_Read('Downloads/USA_ZIP_Code.geojson')) TO 'Downloads/USA_ZIP_Code.gpkg' WITH (FORMAT GDAL, DRIVER 'GPKG');"
This command line instruction will give you the same result than your usual
ogr2ogr -f GPKG Downloads/USA_ZIP_Code.gpkg Downloads/USA_ZIP_Code.geojson
Few remarks then :
The ogr/gdal instruction is simpler and smaller. The DuckDB instruction requires the addition of the bummer INSTALL/LOAD spatial extension with its uppercase lowercase snare. Fewer options can be passed to OGR from DuckDB as the export function allows only to add LAYER_CREATION_OPTIONS and no other level of options which can be helpful to craft your final file. DuckDB comes with a smaller (older ?) set of accepted ogr formats. When brew directly installs a brand new GDAL 3.9 version with GeoParquet included, this Geowunderkind is still not present in the DuckDB spatial extension, which is a bit of a shame for a software heavily promoting the parquet format. So for simple geoformat translation one could say that DuckDB adds too much complication for no direct added-value.
When it comes to getting insights about your dataset, things get a bit different :
ogrinfo -al -so Downloads/USA_ZIP_Code.gpkg
INFO: Open of `Downloads/USA_ZIP_Code.gpkg'
using driver `GPKG' successful.
Layer name: USA_ZIP_Code
Geometry: Unknown (any)
Feature Count: 32268
Extent: (-179.147340, -14.548692) - (179.778465, 71.390482)
Layer SRS WKT:
GEOGCRS["Undefined geographic SRS",
DATUM["unknown",
ELLIPSOID["unknown",6378137,298.257223563,
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]
Data axis to CRS axis mapping: 2,1
FID Column = fid
Geometry Column = geom
OBJECTID: Integer (0.0)
ZIP_CODE: String (0.0)
PO_NAME: String (0.0)
STATE: String (0.0)
POPULATION: Integer (0.0)
POP_SQMI: Real (0.0)
SQMI: Real (0.0)
Shape__Area: Real (0.0)
Shape__Length: Real (0.0)
DuckDB result showed as print screen to preserve column alignment The ogrinfo all layers summary only is always tricky to remember and you may end up with some missing information if you forget the -al flag. DuckDB makes you forget about projections, because DuckDB is impervious to projections, and you may consider this as a good or bad thing. Overall I really like the simple and neat DuckDB’s description of the dataset which can moreover provide more detailed information than ogrinfo :
The SUMMARIZE statement brings in interesting statistics on the dataset that you can export easily to CSV with a command like this :
duckdb -s "INSTALL spatial;LOAD spatial;COPY (SELECT FROM (SUMMARIZE (SELECT EXCLUDE geom FROM ST_Read('Downloads/USA_ZIP_Code.gpkg')))) TO 'stats.csv' (HEADER, DELIMITER ',');"
And of course you can dig a little bit into the data :
Which eventually brings us to the conclusion and to my point (thanks for still being here) that far from being a replacement for GDAL/OGR if you have complex reprojections / translations of your dataset to achieve, areas where GDAL/OGR really excels, DuckDB will be an every day companion to those who, after having located a dataset online, will want to explore / extract / download a subset of it before even going further in their analysis still using the same tool. Using the same tool is great, because doing so you build knowledge, confidence and expertise among the users, which are domains most GDAL/OGR users don’t really feel comfortable with because how fussy it can be.