Workflows¶
This page describes the four end-to-end pipelines that produce the openpois
dataset, in the order they are executed. Each pipeline is implemented as a
series of scripts in the scripts/ directory; the scripts call library
functions documented in the API Reference.
All scripts read their configuration from config.yaml via
config_versioned.Config. See the individual script docstrings for the
exact config keys each script uses.
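If you want to check which keys are present before running anything, a quick look with PyYAML is enough. The sketch below is not the project's config_versioned.Config loader, just an illustrative way to inspect the same file:

```python
# Minimal sketch (not the project's Config class): peek at config.yaml
# with PyYAML to see which keys a script will read.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Keys referenced elsewhere on this page; adjust to whatever your config defines.
print(cfg["download"]["osm"]["history_cookie_file"])
print(cfg["publish"]["credentials_file"])
```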
Prerequisites¶
Python environment. Install the conda env from environment.yml and
the package itself in editable mode:
make build_env # conda env create -f environment.yml (env name: openpois)
conda activate openpois
make install_package # pip install -e .
Geofabrik OAuth (Pipeline 2 only). Pipeline 2 downloads full-history
PBFs from Geofabrik’s OAuth-protected internal server. Any OSM account grants
access. Generate a Netscape-format cookie jar by logging in at
https://osm-internal.download.geofabrik.de/ and exporting cookies, or by
running Geofabrik’s oauth_cookie_client.py. Save the cookie jar at the
path configured in config.yaml under download.osm.history_cookie_file.
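As a quick sanity check that the cookie jar works (the download script loads it itself), a sketch like the following, assuming the requests library and an example file name, will confirm access:

```python
# Sketch only: verify the saved Netscape-format cookie jar grants access to
# the OAuth-protected server before running the download script.
# "geofabrik_cookies.txt" is an example; use the path configured under
# download.osm.history_cookie_file.
from http.cookiejar import MozillaCookieJar
import requests

jar = MozillaCookieJar("geofabrik_cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

resp = requests.get("https://osm-internal.download.geofabrik.de/", cookies=jar)
print(resp.status_code)   # 200 indicates the cookies were accepted
```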
Source Cooperative credentials (publishing only). Publishing the data
back to Source Cooperative requires short-lived AWS-style credentials in a
JSON file at the path configured under publish.credentials_file (default:
.env.json at the repo root). The format is documented in the
scripts/publish/upload_to_source_coop.py docstring. Replicators who do
not intend to publish can stop after Pipeline 4 Step 3.
Pipeline 1: POI Snapshot Downloads¶
These two scripts are independent and can be run in any order (or in parallel). Each downloads a current US-wide POI snapshot from one data source and saves it as a GeoParquet file.
OSM snapshot
python scripts/osm_snapshot/download.py
Downloads the Geofabrik North America PBF extract (~11 GB), filters with
osmium, and parses with pyosmium. Output: osm_snapshot.parquet
(~7.8 M POIs).
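For orientation, the sketch below shows the general pyosmium pattern the parsing step relies on; the tag keys, file name, and output handling here are illustrative, not the script's actual logic:

```python
# Minimal pyosmium sketch: count nodes carrying common POI tag keys in a
# (pre-filtered) PBF. The real script extracts full attributes and writes
# GeoParquet; the tag keys below are an example subset.
import osmium

POI_KEYS = {"amenity", "shop", "tourism", "leisure"}

class POICounter(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def node(self, n):
        if any(k in n.tags for k in POI_KEYS):
            self.count += 1

handler = POICounter()
handler.apply_file("north-america-latest.osm.pbf")
print(handler.count)
```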
Overture Maps snapshot
python scripts/overture/download.py
Queries Overture Maps GeoParquet files on public S3 via DuckDB. No
credentials required. Output: overture_snapshot.parquet (~13 M POIs).
See openpois.io.overture.
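The query pattern looks roughly like the sketch below; the release path is an example and changes with each Overture release, so check the release configured for the project:

```python
# Sketch of the Overture query pattern: DuckDB reads the public GeoParquet
# files over S3 via the httpfs extension. The release path is illustrative.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-west-2';")

n = con.execute("""
    SELECT count(*)
    FROM read_parquet('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=places/type=place/*')
""").fetchone()[0]
print(n)
```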
Quick schema inspection (optional)
python scripts/snapshots/load_samples.py
Reads the first 100 rows of each snapshot without loading the full files,
saving snippet CSVs to the testing/ directory for column inspection.
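Reading just the head of a Parquet file is cheap with pyarrow; a rough equivalent of what the sample loader does (file and output names illustrative):

```python
# Sketch: read only the first rows of a (Geo)Parquet file without loading
# the whole thing, roughly what the sample loader does.
import pyarrow.parquet as pq

pf = pq.ParquetFile("osm_snapshot.parquet")
first_batch = next(pf.iter_batches(batch_size=100))
sample = first_batch.to_pandas()
sample.to_csv("testing/osm_snapshot_sample.csv", index=False)
print(sample.dtypes)
```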
Pipeline 2: OSM Historical Change-Rate Model¶
This pipeline downloads OpenStreetMap full-history PBFs (US + Puerto Rico) and fits a Poisson change-rate model to estimate how quickly different POI categories become outdated.
Step 1 — Download full-history PBFs
python scripts/osm_data/download_history.py
Requires the Geofabrik OAuth cookie jar described in Prerequisites above.
Downloads the US-mainland and Puerto Rico full-history extracts, filters
each with osmium tags-filter (POI tag keys only) and osmium
time-filter (the download.osm.start_date / end_date window), then
parses with pyosmium into per-version and per-change Parquet tables.
Outputs: osm_versions.parquet and osm_changes.parquet.
See openpois.io.osm_history_pbf.
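For reference, the two osmium passes have roughly the shape below when run by hand; the tag keys, dates, and file names are placeholders, since the real values come from config.yaml:

```python
# Sketch of the two filtering passes, run here via subprocess. The tag keys
# and date window are examples only; the script reads them from
# download.osm.* in config.yaml.
import subprocess

subprocess.run(
    ["osmium", "tags-filter", "us-internal.osh.pbf",
     "nwr/amenity", "nwr/shop", "nwr/tourism",          # example POI tag keys
     "-o", "us-pois.osh.pbf", "--overwrite"],
    check=True,
)
subprocess.run(
    ["osmium", "time-filter", "us-pois.osh.pbf",
     "2015-01-01T00:00:00Z", "2024-01-01T00:00:00Z",    # start_date / end_date window
     "-o", "us-pois-window.osh.pbf", "--overwrite"],
    check=True,
)
```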
Step 2 — Reformat into observations
python scripts/osm_data/format_tabular.py
Converts raw version histories into one-row-per-observation records, each
flagged to indicate whether the configured osm_data.tag_key changed, then
assigns a shared taxonomy label and explodes rows for POIs that map to
multiple labels. Output: osm_observations.parquet.
See openpois.osm.format_observations.
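The reshaping amounts to a flag-then-explode pattern; the pandas sketch below uses hypothetical column names, not the repo's schema:

```python
# Illustrative only: flag tag-value changes between consecutive versions,
# then explode multi-label POIs into one observation row per label.
import pandas as pd

versions = pd.DataFrame({
    "osm_id": [1, 1, 2],
    "version": [1, 2, 1],
    "amenity": ["cafe", "restaurant", "bank"],
    "labels": [["food_drink"], ["food_drink"], ["finance", "service"]],
}).sort_values(["osm_id", "version"])

prev = versions.groupby("osm_id")["amenity"].shift()
versions["tag_changed"] = prev.notna() & prev.ne(versions["amenity"])

observations = versions.explode("labels")
print(observations)
```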
Step 3 — Fit the change-rate model
python scripts/models/osm_turnover.py
Fits an empirical Bayes JAX model (either a constant rate per group or random
effects by POI type), estimating the Poisson change rate λ per group via
BlackJAX NUTS. Outputs fitted_params.csv and predictions.csv (and optionally
param_draws.csv).
See openpois.models.model_fitter, openpois.models.osm_models,
and openpois.models.setup.
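The actual model lives in openpois.models.osm_models; as a toy illustration of the BlackJAX machinery only, here is a constant-rate Poisson model with made-up data and a made-up prior:

```python
# Toy constant-rate Poisson model sampled with BlackJAX NUTS. Data, prior,
# and parameterisation are illustrative, not the repo's model.
import jax
import jax.numpy as jnp
import blackjax

exposure = jnp.array([1.0, 2.5, 3.0, 0.5, 4.0])   # years observed per POI
changes = jnp.array([0.0, 1.0, 2.0, 0.0, 3.0])    # tag changes per POI

def logdensity(params):
    # Rate modelled on the log scale with a weak Normal prior.
    log_lam = params["log_lam"]
    lam = jnp.exp(log_lam)
    log_prior = jax.scipy.stats.norm.logpdf(log_lam, 0.0, 2.0)
    log_lik = jnp.sum(jax.scipy.stats.poisson.logpmf(changes, lam * exposure))
    return log_prior + log_lik

warmup = blackjax.window_adaptation(blackjax.nuts, logdensity)
(state, tuned), _ = warmup.run(jax.random.PRNGKey(0),
                               {"log_lam": jnp.array(0.0)}, num_steps=500)
kernel = blackjax.nuts(logdensity, **tuned).step

def one_step(state, key):
    state, _ = kernel(key, state)
    return state, state.position["log_lam"]

_, draws = jax.lax.scan(one_step, state, jax.random.split(jax.random.PRNGKey(1), 1000))
print("posterior mean rate:", float(jnp.exp(draws).mean()))
```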
Step 4 — Visualise stability curves (optional)
python scripts/osm_data/data_viz.py
Produces Kaplan-Meier-style survival curve plots saved to
osm_data/viz/.
See openpois.osm.change_plots.
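For intuition, a Kaplan-Meier-style curve can be computed by hand from durations and change indicators; the sketch below uses toy numbers, not the pipeline's data:

```python
# Toy Kaplan-Meier-style survival estimate: durations are years until the
# tag changed (or censoring); "observed" marks whether a change was seen.
import numpy as np
import matplotlib.pyplot as plt

durations = np.array([0.5, 1.2, 1.8, 2.3, 3.5, 4.1, 5.0])
observed = np.array([1, 1, 0, 1, 0, 1, 0])

order = np.argsort(durations)
t, d = durations[order], observed[order]
at_risk = np.arange(len(t), 0, -1)       # POIs still under observation
surv = np.cumprod(1 - d / at_risk)       # product-limit estimate

plt.step(t, surv, where="post")
plt.xlabel("years since last edit")
plt.ylabel("fraction unchanged")
plt.savefig("km_example.png")
```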
Pipeline 3: Rate the OSM Snapshot¶
This pipeline applies the fitted change-rate model (Pipeline 2) to the OSM snapshot (Pipeline 1) to assign a confidence score to every POI.
Prerequisites: Pipeline 2 (model fitted) and Pipeline 1 OSM snapshot.
Step 1 — Apply model predictions
python scripts/osm_snapshot/apply_model.py
python scripts/osm_snapshot/apply_model.py --test # first 10 k rows only
Matches each POI to its best-fit model group (by tag key priority), then
looks up the predicted change probability at the POI’s age. Adds columns
conf_mean, conf_lower, conf_upper, t2_years,
model_version, and model_group. Output: osm_snapshot_rated.parquet.
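Conceptually the rating step is a group join plus a lookup at the POI's age; the sketch below is a hypothetical illustration (column names and the exponential survival form are assumptions, not the repo's exact formula):

```python
# Hypothetical sketch of the rating step: join each POI to its model group's
# fitted rate and evaluate the chance it is still unchanged at its age.
import numpy as np
import pandas as pd

pois = pd.DataFrame({"osm_id": [1, 2],
                     "model_group": ["amenity", "shop"],
                     "age_years": [2.0, 7.5]})
fitted = pd.DataFrame({"model_group": ["amenity", "shop"],
                       "lambda_per_year": [0.08, 0.20]})

rated = pois.merge(fitted, on="model_group", how="left")
# Under a constant-rate Poisson model, P(no change by age t) = exp(-lambda * t).
rated["conf_mean"] = np.exp(-rated["lambda_per_year"] * rated["age_years"])
print(rated)
```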
Step 2 — Partition for upload
python scripts/osm_snapshot/format_for_upload.py
Adds geohash columns and writes a Hive-style partitioned dataset so the web
map can fetch only the tiles it needs. Output: osm_snapshot_partitioned/.
See openpois.io.geohash_partition.
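The partitioning idea, sketched with pygeohash and pyarrow (library choice, precision, and column names are assumptions; the real implementation is openpois.io.geohash_partition):

```python
# Sketch of geohash-keyed Hive partitioning.
import pandas as pd
import pygeohash as pgh
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"name": ["Cafe A", "Shop B"],
                   "lat": [47.61, 47.62], "lon": [-122.33, -122.35]})
df["geohash4"] = [pgh.encode(lat, lon, precision=4)
                  for lat, lon in zip(df.lat, df.lon)]

# Hive-style layout: osm_snapshot_partitioned/geohash4=<hash>/part-0.parquet
pq.write_to_dataset(pa.Table.from_pandas(df),
                    root_path="osm_snapshot_partitioned",
                    partition_cols=["geohash4"])
```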
Step 3 — Build OSM PMTiles
python scripts/osm_snapshot/prepare_pmtiles.py
Generates a single-zoom (z14) PMTiles archive from the partitioned dataset
for use by the web map. Output: osm_snapshot.pmtiles.
See openpois.io.pmtiles.
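One plausible way to produce such an archive is tippecanoe, which can write .pmtiles directly; the repo's openpois.io.pmtiles may use a different tool, so treat this as an assumption-laden sketch:

```python
# Assumed sketch: build a single-zoom PMTiles archive with tippecanoe (v2+).
# File names and the GeoJSON input are illustrative; the real pipeline works
# from the partitioned dataset.
import subprocess

subprocess.run(
    ["tippecanoe",
     "-o", "osm_snapshot.pmtiles",   # .pmtiles extension selects PMTiles output
     "-Z", "14", "-z", "14",         # single zoom level: min = max = 14
     "-l", "pois",                   # layer name (illustrative)
     "--force",
     "osm_snapshot.geojson"],
    check=True,
)
```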
Pipeline 4: Conflation and Publishing¶
This pipeline conflates the rated OSM snapshot with the Overture Maps snapshot into a single unified POI dataset and publishes it to Source Cooperative.
Prerequisites: Pipeline 3 rated OSM snapshot and Pipeline 1 Overture snapshot.
Step 1 — Conflate
python scripts/conflation/conflate.py
python scripts/conflation/conflate.py --test # Seattle bbox only
Assigns shared taxonomy labels, finds spatial candidate pairs via BallTree,
scores each pair on distance, name, type, and identifier similarity, performs
greedy one-to-one matching, and saves a unified GeoParquet. Output:
conflated.parquet.
See openpois.conflation.match, openpois.conflation.merge, and
openpois.conflation.taxonomy.
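The candidate-finding step, in isolation, looks roughly like a haversine BallTree radius query; the coordinates, radius, and layout below are illustrative, and the scoring and greedy matching are omitted:

```python
# Sketch of the spatial-candidate step only: Overture points within ~50 m
# of each OSM point, via a haversine BallTree (coordinates in radians).
import numpy as np
from sklearn.neighbors import BallTree

osm_pts = np.radians([[47.6101, -122.3301], [47.6200, -122.3500]])   # lat, lon
ovt_pts = np.radians([[47.6102, -122.3302], [47.7000, -122.4000]])

tree = BallTree(ovt_pts, metric="haversine")
radius = 50 / 6_371_000   # 50 m as an angle on an Earth-radius sphere
candidates = tree.query_radius(osm_pts, r=radius)
print(candidates)         # per-OSM-point arrays of candidate Overture indices
```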
Step 2 — Summarise (optional)
python scripts/conflation/summarize.py
Produces a summary CSV with match counts and average match scores per
shared taxonomy label. Output: summary_by_label.csv.
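The summary is essentially a groupby-aggregate; a hypothetical pandas sketch (column names are not necessarily the conflated schema):

```python
# Illustrative per-label summary: match counts and mean match scores.
import pandas as pd

conflated = pd.DataFrame({
    "shared_label": ["food_drink", "food_drink", "finance"],
    "matched": [True, False, True],
    "match_score": [0.91, None, 0.77],
})
summary = (conflated.groupby("shared_label")
           .agg(n_pois=("matched", "size"),
                n_matched=("matched", "sum"),
                mean_match_score=("match_score", "mean")))
summary.to_csv("summary_by_label.csv")
```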
Step 3 — Partition and build conflated PMTiles
python scripts/conflation/format_for_upload.py
python scripts/conflation/prepare_pmtiles.py
Adds geohash columns and writes a Hive-style partitioned dataset, then
builds a single-zoom (z14) PMTiles archive of the conflated points.
Outputs: conflated_partitioned/ and conflated.pmtiles.
See openpois.io.geohash_partition and openpois.io.pmtiles.
Step 4 — Publish to Source Cooperative (optional)
python scripts/publish/upload_to_source_coop.py
Uploads the partitioned conflated dataset, the partitioned OSM dataset,
both PMTiles archives, and a per-version README to Source Cooperative
under the versions.source_coop folder. Requires the credentials file
described in Prerequisites. Skip this step if you only want the data
locally.
See openpois.io.source_coop and openpois.publish.build_readme.