The Cloud-Native Geospatial Stack

If you’ve worked with geospatial data the traditional way, you know the workflow: find the dataset, kick off a download that takes the better part of an afternoon, unzip a folder full of shapefiles or GeoTIFFs, load it into QGIS or a PostGIS table, and then — finally — start asking questions. The data lives on your machine. Your pipeline is glued to that copy. When something upstream changes, you start over.

Cloud-native geospatial formats are a bet that this model is backwards. The question shouldn’t be “where is the data?” but “what do I need from it?” The formats are designed so that a client can answer a spatial or temporal query against a file sitting in object storage without downloading the whole thing. HTTP range requests, internally tiled or chunked data layouts, and spatial indexing baked into the file: these are the mechanisms. The result is that your pipeline and the data can live in completely different places, and that’s fine.
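To make the range-request idea concrete, here is a minimal sketch using only the Python standard library. The URL is a placeholder and the helper name range_request is made up for illustration; the request is built but never sent, so nothing here touches the network.

```python
import urllib.request


def range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # the end of an HTTP byte range is inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical URL: ask for the first 16 KiB of a remote file.
req = range_request("https://example.com/data/scene.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
```

A server that honors the header answers with 206 Partial Content and only those bytes. Everything else in this stack is a way of deciding which offsets to ask for.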

COG: the raster case

A Cloud-Optimized GeoTIFF is a regular GeoTIFF with two structural additions: overviews (pre-computed lower-resolution versions of the image) stored inside the file, and tiles arranged so that a range request can fetch just the pixels you need for a given bounding box and zoom level. The file can sit in S3 or Google Cloud Storage; a client fetches the header, sees the layout, and pulls only the relevant byte ranges.
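The tile arithmetic behind that last step is simple enough to sketch. The function names below are invented for illustration, and the sketch assumes square tiles and a single band; real readers such as GDAL also handle band interleaving, compression, and edge tiles. A tiled TIFF records a byte offset and byte count for every tile, so once you know which tiles a pixel window touches, you know which byte ranges to request.

```python
from math import ceil


def tiles_for_window(col_off: int, row_off: int, width: int, height: int,
                     tile_size: int = 512) -> list[tuple[int, int]]:
    """Return (tile_row, tile_col) pairs for every tile a pixel window overlaps."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def tile_index(tile_row: int, tile_col: int, image_width: int,
               tile_size: int = 512) -> int:
    """Position of a tile in the row-major per-tile offset and byte-count arrays."""
    tiles_across = ceil(image_width / tile_size)
    return tile_row * tiles_across + tile_col


# A 600 x 100 pixel window starting at column 1000, row 1000.
wanted = tiles_for_window(1000, 1000, 600, 100)
print(wanted)  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
```

Six tiles out of however many thousand the image holds: that ratio is where the savings come from.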

The practical effect is that you can work with a 50 GB raster, such as a mosaic built from Sentinel-2 scenes, without ever downloading 50 GB. You fetch the overview when you need a thumbnail. You fetch the full-resolution tiles when you need radiometric accuracy over a 10 km² AOI. Everything else stays in the bucket.
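Which level a client reads depends on how much detail the request needs. The following is a simplified selection rule written for illustration, not GDAL’s exact logic; the overview factors and the function name pick_overview are assumptions.

```python
def pick_overview(factors: list[int], wanted_decimation: float) -> int:
    """Pick the coarsest level whose decimation factor does not exceed the request.

    `factors` lists the overview decimation factors stored in the file
    (for example 2, 4, 8, 16). A return value of 1 means full resolution.
    """
    best = 1
    for factor in sorted(factors):
        if factor <= wanted_decimation:
            best = factor
    return best


overviews = [2, 4, 8, 16]          # hypothetical overview levels
print(pick_overview(overviews, 10))  # 8
print(pick_overview(overviews, 1))   # 1
```

A thumbnail request lands on a coarse level and moves very few bytes; an analysis request at native resolution lands on level 1 and pays only for the tiles inside the AOI.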

GDAL has shipped a dedicated COG output driver since 3.1, and because a COG is still a valid GeoTIFF, older readers can open one too. STAC catalogs link to COGs directly. Most modern raster libraries (rioxarray, stackstac, odc-stac) know how to stream from them. At this point, writing a new raster pipeline that downloads first is a choice you have to consciously make.

STAC: the catalog layer

The SpatioTemporal Asset Catalog spec defines a common JSON schema for describing geospatial datasets — a bounding box, a datetime, a set of asset links, and a collection of properties. That’s almost all of it. The simplicity is the point.
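Here is roughly what one record looks like, written as a Python dict. A STAC Item is a GeoJSON Feature with a few extra fields; the ID, coordinates, and asset URL below are invented, and the matches helper is a hypothetical stand-in for the filtering a catalog would do on the server.

```python
from datetime import datetime, timezone

# A hand-written, hypothetical STAC Item (all values are made up).
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "bbox": [-122.6, 37.6, -122.3, 37.9],  # west, south, east, north
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-122.6, 37.6], [-122.3, 37.6], [-122.3, 37.9],
                         [-122.6, 37.9], [-122.6, 37.6]]],
    },
    "properties": {"datetime": "2023-06-15T18:45:00Z"},
    "assets": {
        "red": {
            "href": "https://example.com/scenes/example-scene-001/red.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data"],
        }
    },
    "links": [],
}


def bbox_intersects(a: list[float], b: list[float]) -> bool:
    """True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(item: dict, bbox: list[float], start: datetime, end: datetime) -> bool:
    """Spatial and temporal filter over a single item."""
    stamp = item["properties"]["datetime"].replace("Z", "+00:00")
    when = datetime.fromisoformat(stamp)
    return bbox_intersects(item["bbox"], bbox) and start <= when <= end


june_start = datetime(2023, 6, 1, tzinfo=timezone.utc)
june_end = datetime(2023, 6, 30, tzinfo=timezone.utc)
print(matches(item, [-122.5, 37.7, -122.4, 37.8], june_start, june_end))  # True
```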

Before STAC, every data provider had its own catalog API: different query parameters, different response shapes, different auth schemes. Writing a pipeline that could ingest both Planet imagery and Sentinel-2 meant writing two separate clients. STAC standardizes the query surface: if a catalog implements the STAC API, pystac-client can talk to it, and the assets it describes are very often COGs.

The result is a discoverable, queryable layer over a distributed set of raster data that lives in object storage. A single search call (Client.search in pystac-client) with a bbox and a date range returns links to the matching scenes, ready to stream.
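A sketch of that search with pystac-client follows. The helper names are invented, and the endpoint and collection in the commented-out usage (Element 84’s Earth Search and its sentinel-2-l2a collection) are assumptions that may change; substitute whatever STAC API you actually use. The third-party import sits inside the function so the pure helper can be used without the library installed.

```python
def search_kwargs(bbox: tuple[float, float, float, float],
                  start: str, end: str, collection: str) -> dict:
    """Assemble keyword arguments for a STAC item search."""
    return {
        "collections": [collection],
        "bbox": list(bbox),            # west, south, east, north
        "datetime": f"{start}/{end}",  # closed interval, ISO 8601 dates
    }


def find_scene_assets(api_url: str, max_items: int = 10, **kwargs) -> dict:
    """Query a STAC API and return {item id: {asset key: href}}. Needs network access."""
    from pystac_client import Client  # third-party: pip install pystac-client

    client = Client.open(api_url)
    search = client.search(max_items=max_items, **kwargs)
    return {
        item.id: {key: asset.href for key, asset in item.assets.items()}
        for item in search.items()
    }


params = search_kwargs((-122.6, 37.6, -122.3, 37.9),
                       "2023-06-01", "2023-06-30", "sentinel-2-l2a")
print(params["datetime"])  # 2023-06-01/2023-06-30

# Example call (assumed endpoint, requires network):
# assets = find_scene_assets("https://earth-search.aws.element84.com/v1", **params)
```

The hrefs that come back are the inputs to your raster reader. Nothing has been downloaded yet.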

GeoParquet and PMTiles: the vector side

The raster world got COG and STAC. The vector world is catching up.

GeoParquet is Parquet with a geometry column, typically encoded as WKB, and spatial metadata stored under a geo key in the file footer. Parquet’s columnar layout means you can read just the geometry and a few attribute columns without touching the rest of the file. Pair it with DuckDB’s spatial extension and you can run ST_Within queries against a GeoParquet file in S3 with syntax that looks like ordinary SQL. Performance is often good enough that, for scan-heavy analytical queries, you can skip loading the data into PostGIS at all, though how fast depends on file size, row-group layout, and the bandwidth between you and the bucket.
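A sketch of what such a query might look like from Python. Several things here are assumptions: the bucket path and column names are invented, the geometry column is assumed to be called geometry, and the code assumes DuckDB 1.1 or later, where the spatial extension reads GeoParquet geometry columns as GEOMETRY values (on older versions the column arrives as a WKB blob and would need wrapping in ST_GeomFromWKB). The helper names are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def within_query(parquet_url: str, polygon_wkt: str, columns: list[str]) -> str:
    """Build a query selecting rows whose geometry lies inside a polygon.

    Column names are inserted as-is, so they must come from trusted input.
    """
    cols = ", ".join(columns)
    return (
        f"SELECT {cols} FROM read_parquet({sql_literal(parquet_url)}) "
        f"WHERE ST_Within(geometry, ST_GeomFromText({sql_literal(polygon_wkt)}))"
    )


def run_within_query(parquet_url: str, polygon_wkt: str, columns: list[str]) -> list:
    """Execute the query. Needs the duckdb package, network access, and a readable bucket."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()
    for extension in ("spatial", "httpfs"):
        con.install_extension(extension)
        con.load_extension(extension)
    return con.execute(within_query(parquet_url, polygon_wkt, columns)).fetchall()


aoi = "POLYGON((-122.6 37.6, -122.3 37.6, -122.3 37.9, -122.6 37.9, -122.6 37.6))"
print(within_query("s3://example-bucket/buildings.parquet", aoi, ["id", "height"]))
```

Only the id, height, and geometry columns need to be read. That is the columnar layout doing its job.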

PMTiles is a single-file archive format for map tiles (raster or vector) with an internal directory that lets HTTP range requests fetch individual tiles directly. The practical consequence is that you can host a full vector tileset (say, all OSM roads for a country) as one file in a public S3 bucket, with no tile server in the loop. The client fetches the header and directory, looks up the byte range for the tile it wants, then fetches the tile bytes. That’s the whole stack.
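The first step of that lookup can be sketched with the standard library. This reads only the start of a version 3 header (the magic bytes, the version, and the root directory’s offset and length) from a fake buffer built for the example. Decoding the directory itself, which is varint-encoded and may be compressed, is left out, and the helper name is invented.

```python
import struct

PMTILES_MAGIC = b"PMTiles"
INITIAL_FETCH_BYTES = 16384  # v3 keeps the header and root directory within the first 16 KiB


def parse_header_prefix(buf: bytes) -> dict:
    """Read the magic, version, and root directory location from a PMTiles v3 header."""
    if len(buf) < 24:
        raise ValueError("need at least the first 24 bytes of the archive")
    if buf[:7] != PMTILES_MAGIC:
        raise ValueError("not a PMTiles archive")
    version = buf[7]
    # Two little-endian unsigned 64-bit integers, starting at byte 8.
    root_offset, root_length = struct.unpack_from("<QQ", buf, 8)
    return {"version": version, "root_offset": root_offset, "root_length": root_length}


# Fake header bytes for illustration: version 3, root directory at byte 127, 500 bytes long.
fake = PMTILES_MAGIC + bytes([3]) + struct.pack("<QQ", 127, 500)
print(parse_header_prefix(fake))
# {'version': 3, 'root_offset': 127, 'root_length': 500}
```

In practice a client issues one range request for the first INITIAL_FETCH_BYTES, parses the header and root directory out of that response, and needs one more request (sometimes two, for large archives with leaf directories) to get the tile.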

What changes

The thing these formats have in common is that they push intelligence into the file layout and push the query into the client. The server becomes dumb object storage — S3, GCS, Azure Blob — and the compute stays wherever you need it.

For software engineers coming from the non-geo world, this looks familiar. It’s the same trade-off that Parquet made for analytics workloads, that Zarr is making for n-dimensional arrays, that FlatBuffers made for serialization. The geospatial community arrived at it later because the legacy formats (Shapefile, classic GeoTIFF, Esri geodatabases) were designed for a world where data lived on local disk and the format didn’t need to care about network access patterns.

That world is over. The petabyte-scale EO archives — Landsat, Sentinel, commercial constellations — live in object storage now, and the tools for working with them are being redesigned from the ground up with that assumption baked in. Learning the cloud-native stack isn’t a nice-to-have; it’s the prerequisite for working at the scale where geospatial data is actually interesting.