API Reference¶
conflation¶
openpois.conflation.match¶
Spatial candidate matching and composite scoring for POI conflation. Provides a BallTree-based radius search to find nearby OSM–Overture candidate pairs within category-specific thresholds, a multi-component scorer (distance, name similarity, taxonomy agreement, shared identifiers), and a greedy one-to-one assignment step that filters below a minimum composite score.
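A minimal sketch of the two ends of this flow, assuming [lat, lon] coordinate arrays and precomputed composite pair scores (the real scorer also blends name similarity, taxonomy agreement, and shared identifiers):

```python
# Illustrative sketch only: coordinate columns, score weights, and function
# names here are assumptions, not this module's actual API.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000.0

def candidate_pairs(osm_latlon_deg, ovt_latlon_deg, radius_m):
    """For each OSM point, indices of Overture points within radius_m."""
    tree = BallTree(np.radians(ovt_latlon_deg), metric="haversine")
    return tree.query_radius(np.radians(osm_latlon_deg), r=radius_m / EARTH_RADIUS_M)

def greedy_one_to_one(pairs, scores, min_score=0.5):
    """Keep best-scoring pairs first; each POI is assigned at most once."""
    used_osm, used_ovt, kept = set(), set(), []
    for (i, j), score in sorted(zip(pairs, scores), key=lambda x: -x[1]):
        if score >= min_score and i not in used_osm and j not in used_ovt:
            kept.append((i, j))
            used_osm.add(i)
            used_ovt.add(j)
    return kept
```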
openpois.conflation.merge¶
Merge matched and unmatched POIs into a unified conflated GeoDataFrame. Produces a superset containing matched OSM–Overture pairs with blended confidence scores, unmatched OSM POIs at their original confidence, and unmatched Overture POIs at downweighted confidence. Uses a disk-backed split-then-concat pattern to avoid peak memory issues at CONUS scale.
Merge matched and unmatched POIs into a unified conflated dataset.
- Produces a GeoDataFrame superset:
Matched pairs (OSM + Overture) with blended confidence.
Unmatched OSM POIs with their original confidence.
Unmatched Overture POIs with downweighted confidence.
- Three entry points:
merge_matched_pois: in-memory, for tests/small datasets.
build_merge_parts: disk-backed, row-sliced. Writes multiple part parquets so peak memory is bounded by slice size.
build_merge_parts_chunked: disk-backed, spatial-chunk-sliced. Reuses the osm_primary / overture_primary arrays produced by the chunked matching driver so each per-chunk part is small and independent.
- openpois.conflation.merge.build_merge_parts(
- osm_gdf: geopandas.GeoDataFrame,
- overture_gdf: geopandas.GeoDataFrame,
- matches: pandas.DataFrame,
- osm_shared_labels: numpy.ndarray,
- overture_shared_labels: numpy.ndarray,
- overture_confidence_weight: float = 0.7,
- n_slices: int = 4,
Build each merge subset, writing to temp parquet files.
Unmatched OSM and Overture rows are split into n_slices contiguous row ranges each, and each slice is built and written independently. This caps peak memory at roughly (1 / n_slices) of the full-dataset footprint for unmatched parts. The matched part is written as a single file (it is the smallest and already bounded by the number of matches).
- Returns:
List of temp parquet file paths in concat order.
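A hypothetical driver for the row-sliced path; osm_gdf, overture_gdf, matches, the label arrays, and output_path are assumed to come from earlier pipeline steps.

```python
# Hypothetical usage of the disk-backed split-then-concat pattern; inputs are
# assumed outputs of the matching step.
from openpois.conflation.merge import build_merge_parts, save_conflated_from_parts

part_paths = build_merge_parts(
    osm_gdf, overture_gdf, matches,
    osm_shared_labels, overture_shared_labels,
    overture_confidence_weight=0.7,
    n_slices=8,  # doubling n_slices roughly halves peak memory for unmatched parts
)
n_pois = save_conflated_from_parts(part_paths, output_path)
```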
- openpois.conflation.merge.build_merge_parts_chunked(
- osm_gdf: geopandas.GeoDataFrame,
- overture_gdf: geopandas.GeoDataFrame,
- matches: pandas.DataFrame,
- osm_shared_labels: numpy.ndarray,
- overture_shared_labels: numpy.ndarray,
- osm_primary: numpy.ndarray,
- overture_primary: numpy.ndarray,
- n_chunks: int,
- overture_confidence_weight: float = 0.7,
Build per-spatial-chunk merge parts, writing one parquet per chunk.
Reuses the KD-bisected chunks produced by the chunked matching driver: for each chunk c we emit matched pairs whose OSM POI has osm_primary == c (the same OSM-anchored emit rule used during matching), unmatched OSM POIs with osm_primary == c, and unmatched Overture POIs with overture_primary == c.
Peak memory per chunk is bounded by chunk size × 18-column schema, so this stays within a few hundred MB for ~200k-POI chunks regardless of total dataset size.
- Parameters:
osm_gdf – Full source frames.
overture_gdf – Full source frames.
matches – Post-dedup match DataFrame (osm_idx unique).
osm_shared_labels – Parallel to source frames.
overture_shared_labels – Parallel to source frames.
osm_primary – (n,) int array assigning each OSM row to its primary chunk. Produced by chunking.assign_primary_chunk.
overture_primary – (n,) int array assigning each Overture row to its primary chunk. Produced by chunking.assign_primary_chunk.
n_chunks – Total number of chunks; used for offset arrays.
overture_confidence_weight – Blend weight w (see _build_matched_gdf).
- Returns:
List of per-chunk part file paths, in ascending chunk order.
- openpois.conflation.merge.merge_matched_pois(
- osm_gdf: geopandas.GeoDataFrame,
- overture_gdf: geopandas.GeoDataFrame,
- matches: pandas.DataFrame,
- osm_shared_labels: numpy.ndarray,
- overture_shared_labels: numpy.ndarray,
- overture_confidence_weight: float = 0.7,
Build the unified conflated dataset from matches + unmatched.
This in-memory version is suitable for tests and small datasets. For large datasets, use build_merge_parts (row-sliced) or build_merge_parts_chunked (spatial-chunk-sliced) + save_conflated_from_parts.
- Returns:
Conflated GeoDataFrame with unified schema.
- openpois.conflation.merge.save_conflated(
- gdf: geopandas.GeoDataFrame,
- output_path: Path,
Hilbert-sort and save as GeoParquet (zstd, 50k row groups).
- openpois.conflation.merge.save_conflated_from_parts(
- part_paths: list[Path],
- output_path: Path,
Stream temp parquet parts into the final output file.
Opens each part sequentially, unifies its schema against the writer, and appends its row groups. Only one part is held in memory at a time, so peak memory is bounded by the largest part — independent of the number of parts or the total dataset size. Skips Hilbert sorting to stay within memory limits.
- Returns:
Number of POIs written.
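A minimal sketch of that stream-and-append pattern, assuming every part casts cleanly to the first part's schema (the real function's schema unification may be more involved):

```python
import pyarrow.parquet as pq

def concat_parts(part_paths, output_path):
    writer, schema, total = None, None, 0
    for path in part_paths:
        pf = pq.ParquetFile(path)
        if writer is None:
            schema = pf.schema_arrow
            writer = pq.ParquetWriter(output_path, schema, compression="zstd")
        for i in range(pf.num_row_groups):  # one row group in memory at a time
            table = pf.read_row_group(i).cast(schema)
            writer.write_table(table)
            total += table.num_rows
    if writer is not None:
        writer.close()
    return total
```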
openpois.conflation.taxonomy¶
Taxonomy crosswalk between OSM tags and the Overture Maps category hierarchy.
Loads four CSV reference files (OSM crosswalk, Overture crosswalk, match radii,
and top-level key-to-L0 mappings) and provides functions to assign each POI a
shared_label string, a per-category spatial match radius, and an L0 bitmask
used for type-agreement scoring.
Taxonomy crosswalk between OSM tags and Overture Maps taxonomy.
Loads four CSV files that map OSM tag key/value pairs and Overture
(L0, L1) categories to a unified shared_label, plus per-label
match radii and top-level OSM-key-to-Overture-L0 mappings.
- gdf: pandas.DataFrame,
- osm_crosswalk: pandas.DataFrame,
- match_radii: pandas.DataFrame,
- filter_keys: list[str],
- default_radius_m: float = 100.0,
- return_all: bool = False,
Assign shared taxonomy labels to each OSM POI.
Two modes, selected by return_all:
return_all=False (default) — produces a single label per row. Uses filter_keys in priority order (first non-null match wins), falling back to the per-key wildcard row if the specific value is not in the crosswalk. Returns (label, radius) as object / float64 ndarrays of length len(gdf). Unmatched rows have label == "" and radius == default_radius_m. This is the path used by the conflation pipeline and snapshot model application.
return_all=True — produces zero or more labels per row, used by the model-training pipeline which duplicates observations across every applicable taxonomy category.
Pass 1 (specific matches): for every filter_key, every row whose value for that key is in the crosswalk receives that label. A row can collect multiple specific labels.
Pass 2 (wildcard fallback): applied only to rows that had zero specific matches in pass 1. Within such a row, wildcard keys are walked in the order they appear in the crosswalk CSV (_build_osm_label_lookups populates the wildcards dict via iterrows, preserving CSV order via dict insertion order); the first wildcard key with a non-null/non-empty value wins and is the only wildcard label assigned.
Returns (labels_per_row, radii_per_row) as lists of lists; each inner list has >= 0 entries and is de-duplicated (if two keys map to the same label, it appears once).
- gdf: pandas.DataFrame,
- overture_crosswalk: pandas.DataFrame,
- match_radii: pandas.DataFrame,
- default_radius_m: float = 100.0,
Assign a shared_label and match_radius_m to each Overture POI using a 4-tier cascade from most to least specific.
Tiers (applied in order, each only to unmatched rows):
(L0, L1, L2) — crosswalk rows with all three populated.
(L0, L2) — L1 empty in crosswalk; matches any L1.
(L0, L1) — L2 empty in crosswalk; catch-all for an L1.
L0-only — both L1 and L2 empty in crosswalk.
Backward-compatible: if the GeoDataFrame has no taxonomy_l2 column, tiers 1-2 produce no matches and behaviour falls back to the old (L0, L1) + L0 logic.
- Returns:
(shared_label ndarray of object, match_radius_m ndarray of float)
- openpois.conflation.taxonomy.compute_osm_l0_bits(
- gdf: pandas.DataFrame,
- top_level_matches: pandas.DataFrame,
For each OSM POI, compute a uint16 bitmask encoding which Overture L0 categories it broadly matches.
A non-null value in an OSM tag key (e.g. amenity) sets the bit(s) for every L0 linked to that key via top_level_matches. For example, amenity maps to both arts_and_entertainment (bit 1) and food_and_drink (bit 2), so any POI with a non-null amenity value gets 1 | 2 = 3.
- openpois.conflation.taxonomy.compute_overture_l0_bits(
- l0_array: numpy.ndarray,
For each Overture POI, compute a uint16 bitmask from its taxonomy_l0 value. Each POI has at most one L0 category, so a single bit is set.
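An illustrative use of the two bitmasks for type-agreement checks; the actual composite scorer lives in openpois.conflation.match, so the exact check below is an assumption about how the masks are consumed.

```python
# Illustrative only: how a scorer might test L0 agreement with the bitmasks.
import numpy as np

osm_bits = np.array([0b0011, 0b0100], dtype=np.uint16)       # multi-L0 OSM keys
overture_bits = np.array([0b0001, 0b1000], dtype=np.uint16)  # one L0 bit per POI

agrees = (osm_bits & overture_bits) != 0  # any shared L0 category
print(agrees)  # [ True False]
```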
- openpois.conflation.taxonomy.load_match_radii() → pandas.DataFrame¶
Load the match-radii CSV.
Columns: shared_label, match_radius_m.
- openpois.conflation.taxonomy.load_osm_crosswalk() → pandas.DataFrame¶
Load the OSM taxonomy crosswalk CSV.
Columns: osm_key, osm_value, shared_label.
io¶
openpois.io.osm_history_pbf¶
Download Geofabrik full-history PBFs (US + Puerto Rico), filter to POI tags
with osmium tags-filter, time-window with osmium time-filter, and parse
with pyosmium into per-version and per-change Parquet tables suitable for the
change-rate model. Uses an OAuth cookie jar against Geofabrik’s internal
server.
This module downloads US+PR full-history OpenStreetMap data for POI change-rate modelling using Geofabrik full-history PBF extracts, osmium-tool CLI pre-filtering, and pyosmium streaming.
It is broken into the following functions:
- download_history_pbf: Downloads a .osh.pbf file (optionally authenticated via
a Geofabrik OAuth cookie jar) via streaming HTTP.
- filter_history_pbf: Runs osmium tags-filter --omit-referenced to produce a
reduced POI-only history PBF.
- time_filter_history_pbf: Runs osmium time-filter FROM TO to slice the history
PBF to versions active in a given date range.
- parse_history_pbf: Streams a filtered history PBF with pyosmium and writes
per-version metadata (osm_versions.parquet) and per-version tag diffs (osm_changes.parquet).
- download_osm_history: End-to-end orchestrator. Downloads both the US-mainland
and Puerto Rico history extracts, filters and time-filters each, parses each, concatenates the results, and writes final versions/changes Parquets.
- Data sources:
US mainland (all 50 states incl. AK + HI, ~11 GB): https://osm-internal.download.geofabrik.de/north-america/us-internal.osh.pbf
Puerto Rico (separate extract): https://osm-internal.download.geofabrik.de/north-america/us/puerto-rico-internal.osh.pbf
Both URLs live on Geofabrik’s OAuth-protected internal server and require a valid OSM-account cookie jar. Any OSM account grants access; see the README section on cookie acquisition for details.
osmium-tool CLI must be installed (conda install -c conda-forge osmium-tool).
Note: This module is separate from openpois.io.osm_snapshot, which extracts the current POI snapshot only.
- openpois.io.osm_history_pbf.download_history_pbf(
- url: str,
- output_path: Path,
- cookie_file: Path | None = None,
- overwrite: bool = False,
Downloads a full-history PBF file from the given URL via streaming HTTP.
Writes to a temporary file in the same directory and renames atomically on success so a partial download never masquerades as a complete file.
- Parameters:
url – URL of the history PBF file to download.
output_path – Local path to save the downloaded PBF.
cookie_file – Path to a Netscape-format cookie jar for Geofabrik OAuth, or None for an unauthenticated session (fine for public extracts, required for the internal server).
overwrite – If False and output_path already exists, skip the download.
- Returns:
Path to the downloaded PBF file.
- Raises:
requests.HTTPError – If the HTTP request fails.
FileNotFoundError – If cookie_file is given but does not exist.
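A sketch of the stream-then-atomic-rename pattern described above, assuming requests; the chunk size and ".part" suffix are illustrative, not the module's actual values.

```python
import os
from pathlib import Path
import requests

def stream_download(url: str, output_path: Path, cookies=None) -> Path:
    tmp = output_path.with_suffix(output_path.suffix + ".part")
    with requests.get(url, cookies=cookies, stream=True, timeout=60) as resp:
        resp.raise_for_status()  # surfaces requests.HTTPError on failure
        with open(tmp, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    os.replace(tmp, output_path)  # atomic within the same filesystem
    return output_path
```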
- openpois.io.osm_history_pbf.download_osm_history(
- pbf_url: str,
- raw_pbf_path: Path,
- filtered_pbf_path: Path,
- time_filtered_pbf_path: Path,
- us_versions_path: Path,
- us_changes_path: Path,
- pr_pbf_url: str,
- raw_pr_pbf_path: Path,
- filtered_pr_pbf_path: Path,
- time_filtered_pr_pbf_path: Path,
- pr_versions_path: Path,
- pr_changes_path: Path,
- output_versions_path: Path,
- output_changes_path: Path,
- filter_keys: list[str],
- start_date: datetime | date,
- end_date: datetime | date,
- cookie_file: Path | None = None,
- overwrite_download: bool = False,
- overwrite_filter: bool = False,
- overwrite_parse: bool = False,
- chunk_size: int = 500000,
- verbose: bool = True,
End-to-end orchestrator: download the US-mainland and PR Geofabrik full-history PBFs, filter and time-filter each, parse each to Parquets, and concatenate into the final versions + changes files.
- Parameters:
pbf_url – URL of the US-mainland full-history PBF (Geofabrik internal).
raw_pbf_path – Local path for the raw US PBF.
filtered_pbf_path – Local path for the tags-filtered US PBF.
time_filtered_pbf_path – Local path for the time-filtered US PBF.
us_versions_path – Intermediate Parquet for US versions.
us_changes_path – Intermediate Parquet for US changes.
pr_pbf_url – URL of the Puerto Rico full-history PBF.
raw_pr_pbf_path – Local path for the raw PR PBF.
filtered_pr_pbf_path – Local path for the tags-filtered PR PBF.
time_filtered_pr_pbf_path – Local path for the time-filtered PR PBF.
pr_versions_path – Intermediate Parquet for PR versions.
pr_changes_path – Intermediate Parquet for PR changes.
output_versions_path – Final concatenated osm_versions.parquet.
output_changes_path – Final concatenated osm_changes.parquet.
filter_keys – OSM tag keys passed to tags-filter.
start_date – Start of the time-filter window.
end_date – End of the time-filter window.
cookie_file – Netscape-format cookie jar for Geofabrik OAuth.
overwrite_download – Re-download raw PBFs even if present.
overwrite_filter – Re-run tags-filter and time-filter even if present.
overwrite_parse – Re-run parse even if Parquets are present.
chunk_size – Rows per Parquet-writer flush in the streaming parser.
verbose – Print progress during parsing.
- Returns:
Tuple (output_versions_path, output_changes_path).
- openpois.io.osm_history_pbf.filter_history_pbf(
- input_pbf: Path,
- output_pbf: Path,
- osm_keys: list[str],
- overwrite: bool = False,
Runs osmium tags-filter --omit-referenced on a full-history PBF.
--omit-referenced is required for history files: multi-pass filtering (the default mode used by filter_pbf in osm_snapshot.py) fails on .osh.pbf. --omit-referenced does a single pass and emits only the matching objects themselves; referenced node coordinates are not retained. This is fine for the change-rate pipeline because we never resolve way geometry downstream.
The --output-format=osh.pbf flag keeps the output in history format.
- Parameters:
input_pbf – Path to the raw history PBF.
output_pbf – Path to write the filtered history PBF.
osm_keys – OSM tag keys to retain (e.g., [‘amenity’, ‘shop’]).
overwrite – If False and output_pbf exists, skip filtering.
- Returns:
Path to the filtered PBF file.
- Raises:
subprocess.CalledProcessError – If osmium exits with non-zero status.
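A hedged sketch of the osmium invocation this wrapper performs; the exact argument order and error handling in the real module may differ.

```python
import subprocess
from pathlib import Path

def run_history_tags_filter(input_pbf: Path, output_pbf: Path, osm_keys: list[str]) -> None:
    cmd = [
        "osmium", "tags-filter",
        "--omit-referenced",        # single-pass mode required for .osh.pbf
        "--output-format=osh.pbf",  # keep the output in history format
        "-o", str(output_pbf),
        str(input_pbf),
        *[f"nwr/{key}" for key in osm_keys],  # nodes/ways/relations with the key
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```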
- openpois.io.osm_history_pbf.parse_history_pbf(
- pbf_path: Path,
- versions_path: Path,
- changes_path: Path,
- chunk_size: int = 500000,
- overwrite: bool = False,
- verbose: bool = True,
Stream a filtered full-history PBF and write versions + changes Parquets.
The pyosmium FileProcessor emits every version of every element in (type, id, version) order for a history PBF. For each version we compare its tag set against the previous version of the same element (reset whenever (type, id) changes) and emit:
one row per version to versions_path with id, version, changeset, timestamp, user, uid, type;
one row per tag change (Added / Changed / Deleted) to changes_path with key, value, change, id, version.
Includes visible, lat, and lon as pseudo-tags so that visibility toggles (deletions) and coordinate edits are captured in osm_changes — matches the behaviour of the existing Overpass-based pipeline.
- Parameters:
pbf_path – Path to the (tag-filtered and optionally time-filtered) history PBF.
versions_path – Destination Parquet for per-version metadata.
changes_path – Destination Parquet for per-version tag diffs.
chunk_size – Number of rows to buffer before each flush. Same value is applied independently to the versions and changes buffers.
overwrite – If False and both destinations already exist, skip parsing.
verbose – If True, print progress every chunk_size versions.
- Returns:
Tuple (versions_path, changes_path).
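An illustrative version of the per-element tag diff; the real parser applies this across the sorted (type, id, version) stream and adds the visible / lat / lon pseudo-tags.

```python
# Illustrative tag diff between consecutive versions of one element.
def diff_tags(prev: dict[str, str], curr: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (key, value, change) rows with change in {Added, Changed, Deleted}."""
    rows = []
    for key, value in curr.items():
        if key not in prev:
            rows.append((key, value, "Added"))
        elif prev[key] != value:
            rows.append((key, value, "Changed"))
    rows += [(key, prev[key], "Deleted") for key in prev if key not in curr]
    return rows

# diff_tags({"name": "Cafe A"}, {"name": "Cafe B", "cuisine": "coffee"})
# -> [('name', 'Cafe B', 'Changed'), ('cuisine', 'coffee', 'Added')]
```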
- openpois.io.osm_history_pbf.time_filter_history_pbf(
- input_pbf: Path,
- output_pbf: Path,
- start_date: datetime | date,
- end_date: datetime | date,
- overwrite: bool = False,
Runs osmium time-filter FROM TO on a full-history PBF.
With two ISO-formatted timestamps, osmium time-filter preserves every version active during the window and keeps the output in history format. A single-timestamp call would collapse the file to a snapshot — that is not what we want here.
- Parameters:
input_pbf – Path to the tag-filtered history PBF.
output_pbf – Path to write the time-filtered history PBF.
start_date – Start of the window (inclusive). datetime or date.
end_date – End of the window (exclusive per osmium semantics).
overwrite – If False and output_pbf exists, skip the filter.
- Returns:
Path to the time-filtered PBF file.
- Raises:
subprocess.CalledProcessError – If osmium exits with non-zero status.
openpois.io.osm_snapshot¶
Download a current US-wide OSM POI snapshot from a Geofabrik PBF extract.
Streams the PBF (~11 GB), runs osmium tags-filter to reduce it to
matching tag keys, then parses nodes and way centroids with pyosmium into
a GeoParquet file. The osmium binary is resolved from the conda environment
rather than the system PATH.
This module downloads a current/latest POI snapshot for the US + Puerto Rico from OpenStreetMap using Geofabrik PBF extracts, osmium-tool CLI pre-filtering, and pyosmium parsing.
It is broken into the following functions:
- download_pbf: Downloads a PBF file from a URL via streaming HTTP.
- filter_pbf: Runs osmium tags-filter to produce a reduced POI-only PBF.
- parse_pbf_to_geodataframe: Parses the filtered PBF with pyosmium into a
GeoDataFrame of nodes (Points) and ways (Polygons or Points).
- download_osm_snapshot: End-to-end orchestrator. Downloads and parses both
the mainland US extract and the Puerto Rico extract, then concatenates the results.
- Data sources:
US mainland (all 50 states incl. AK + HI, ~11 GB): https://download.geofabrik.de/north-america/us-latest.osm.pbf
Puerto Rico (separate extract, ~tens of MB): https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf
Geofabrik extracts are cut along administrative boundaries, so no polygon post-filter is applied here — the two extracts together cover exactly the US + PR footprint.
osmium-tool CLI must be installed (conda install -c conda-forge osmium-tool).
Note: This module is separate from openpois.io.osm_history_pbf, which fetches full-history PBFs for change-rate modeling. This module downloads a current snapshot only.
- openpois.io.osm_snapshot.download_osm_snapshot(
- pbf_url: str,
- raw_pbf_path: Path,
- filtered_pbf_path: Path,
- output_path: Path,
- filter_keys: list[str],
- extract_keys: list[str],
- pr_pbf_url: str,
- raw_pr_pbf_path: Path,
- filtered_pr_pbf_path: Path,
- overwrite_download: bool = False,
- overwrite_filter: bool = False,
- source_label: str = 'osm',
- keep_all_keys: bool = False,
- chunk_size: int = 500000,
- max_area_nodes: int | None = None,
- chunk_dir: Path | None = None,
- verbose: bool = True,
End-to-end orchestrator: download both the US-mainland and Puerto Rico Geofabrik PBFs, filter each to POIs, parse each, concat, and save as GeoParquet.
For each PBF the steps are:
download_pbf — streams the PBF to the raw_pbf path.
filter_pbf — runs osmium tags-filter to produce a POI-only PBF.
parse_pbf_to_geodataframe — parses with pyosmium into a GeoDataFrame.
The two GeoDataFrames are concatenated and written to output_path.
Steps 1 and 2 are skipped if the target files already exist unless overwrite_download / overwrite_filter are True.
- Parameters:
pbf_url – URL of the US-mainland PBF extract (Geofabrik us-latest, all 50 states including AK + HI).
raw_pbf_path – Local path to store the US-mainland raw PBF.
filtered_pbf_path – Local path to store the US-mainland filtered PBF.
output_path – Path to write the output GeoParquet file.
filter_keys – OSM tag keys used to filter elements in the PBF. Elements lacking all of these keys are excluded.
extract_keys – OSM tag keys to include as output columns. If None, all tags on accepted elements are extracted.
pr_pbf_url – URL of the Puerto Rico PBF extract (Geofabrik puerto-rico-latest). Geofabrik serves this separately from the US extract.
raw_pr_pbf_path – Local path to store the PR raw PBF.
filtered_pr_pbf_path – Local path to store the PR filtered PBF.
overwrite_download – Re-download even if raw paths exist.
overwrite_filter – Re-filter even if filtered paths exist.
source_label – Value for the output ‘source’ column.
keep_all_keys – If True, all OSM tags are retained as columns in the output GeoDataFrame, not just those in extract_keys. filter_keys is still used to filter which elements are included.
chunk_size – Number of POI records per parquet chunk during parsing. Lower values reduce peak memory usage.
max_area_nodes – If set, relation-derived areas with more than this many total coordinate nodes are skipped before any Shapely geometry is built. Useful for excluding large multipolygons (parks, admin boundaries) that can exhaust memory. None disables the check.
chunk_dir – Directory under which a parse_chunks/ subdirectory is created to hold intermediate chunk files. Defaults to the parent of each filtered PBF. See parse_pbf_to_geodataframe for details.
verbose – If True, log progress after each chunk is flushed.
- Returns:
Path to the written GeoParquet file (same as output_path).
- openpois.io.osm_snapshot.download_pbf(
- url: str,
- output_path: Path,
- overwrite: bool = False,
Downloads a PBF file from the given URL to output_path via streaming HTTP.
- Parameters:
url – URL of the PBF file to download (e.g., a Geofabrik extract).
output_path – Local path to save the downloaded PBF.
overwrite – If False and output_path already exists, skip the download.
- Returns:
Path to the downloaded PBF file.
- Raises:
requests.HTTPError – If the HTTP request fails.
- openpois.io.osm_snapshot.filter_pbf(
- input_pbf: Path,
- output_pbf: Path,
- osm_keys: list[str],
- overwrite: bool = False,
Runs osmium tags-filter to extract nodes, ways, and relations matching the given keys.
- Constructs and runs a command of the form:
osmium tags-filter -o {output_pbf} {input_pbf} nwr/{key1} nwr/{key2} …
The referenced nodes for matched ways are retained so that way geometries can be resolved by pyosmium in a subsequent step.
- Parameters:
input_pbf – Path to the full PBF extract.
output_pbf – Path to write the filtered output PBF.
osm_keys – OSM tag keys to retain (e.g., [‘amenity’, ‘shop’]).
overwrite – If False and output_pbf exists, skip filtering.
- Returns:
Path to the filtered PBF file.
- Raises:
subprocess.CalledProcessError – If osmium exits with non-zero status.
FileNotFoundError – If osmium is not installed or not on PATH.
- openpois.io.osm_snapshot.parse_pbf_to_geodataframe(
- pbf_path: Path,
- filter_keys: list[str] | None = None,
- extract_keys: list[str] | None = None,
- source_label: str = 'osm',
- chunk_size: int = 500000,
- max_area_nodes: int | None = None,
- chunk_dir: Path | None = None,
- verbose: bool = True,
Parses a filtered PBF file with pyosmium and returns a GeoDataFrame.
Thin wrapper around parse_pbf_to_parquet that loads the written parquet into a GeoDataFrame. For very large extracts (e.g. a full US PBF), prefer parse_pbf_to_parquet and consume the parquet with PyArrow streaming to avoid holding all records in memory.
See parse_pbf_to_parquet for parameter documentation.
- Returns:
GeoDataFrame with columns source, osm_id (int64), osm_type (‘node’|’way’|’relation’), tag columns, name, geometry. CRS is EPSG:4326.
- Return type:
GeoDataFrame
- openpois.io.osm_snapshot.parse_pbf_to_parquet(
- pbf_path: Path,
- out_path: Path,
- filter_keys: list[str] | None = None,
- extract_keys: list[str] | None = None,
- source_label: str = 'osm',
- chunk_size: int = 500000,
- max_area_nodes: int | None = None,
- chunk_dir: Path | None = None,
- verbose: bool = True,
Parses a filtered PBF file with pyosmium and writes the result as a single GeoParquet file at out_path.
Memory-efficient alternative to parse_pbf_to_geodataframe: records are flushed to per-chunk parquet files on disk, then merged directly to out_path via a PyArrow streaming writer. A full GeoDataFrame is never materialised in memory. Peak memory is one chunk’s worth of records.
- Args: see parse_pbf_to_geodataframe. out_path is written with the same
schema that parse_pbf_to_geodataframe would produce in a GeoParquet round-trip (columns: source, osm_id, osm_type, name, geometry, plus any extract_keys tag columns).
- Returns:
out_path.
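A minimal pyosmium sketch of the node loop, assuming pyosmium >= 3.6 (osmium.FileProcessor); the real parser also resolves way and relation geometries and flushes records to parquet chunks.

```python
import osmium

def iter_poi_nodes(pbf_path, filter_keys):
    for obj in osmium.FileProcessor(str(pbf_path), osmium.osm.NODE):
        tags = {t.k: t.v for t in obj.tags}
        if not any(key in tags for key in filter_keys):
            continue  # element lacks all filter keys
        yield {
            "osm_id": obj.id,
            "osm_type": "node",
            "name": tags.get("name"),
            "lon": obj.location.lon,
            "lat": obj.location.lat,
        }
```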
openpois.io.overture¶
Download a current US-wide Overture Maps Places snapshot. Uses DuckDB’s
httpfs and spatial extensions to query Overture GeoParquet files
directly from public S3, filtering by bounding box and L0 taxonomy category.
No authentication is required. Auto-detects the latest Overture release date
from S3 if a specific date is not pinned.
This module downloads a current/latest Overture Maps Places snapshot for the US + Puerto Rico, filtered to a set of taxonomy categories.
Download strategy: a wide single-query scan of the full US+PR footprint
crashed DuckDB on memory-constrained hosts (it materialized 6M+ rows before
the spatial filter). This module instead iterates the 16 part-*.parquet
files that make up a release, queries each one with a bounded DuckDB session,
and writes a plain parquet intermediate per part. Intermediates survive across
invocations, so a crashed run can be resumed by re-running the script.
After every part is present on local disk, a single DuckDB COPY applies
the exact US+PR polygon filter (reading the boundary via the spatial
extension), builds the geometry column with ST_Point, and writes the
final GeoParquet without ever materializing rows in Python. The output file
is valid GeoParquet (readable by gpd.read_parquet with CRS preserved).
Spatial filter strategy (two-stage, all inside DuckDB):
Per-part WHERE uses predicate pushdown on Overture’s bbox struct column, OR-ing across one or more coarse bboxes. Multiple bboxes are required to capture the Alaskan Near Islands (+172 E) without scanning longitudes the main US bbox would miss.
The final COPY does an exact ST_Within check against the dissolved US+PR polygon to drop Canadian and Mexican border slivers.
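A hedged sketch of the per-part query construction; the part URL, column selection, and exact SQL are assumptions, though the bbox predicate and taxonomy.hierarchy[1] filter follow the description above.

```python
import duckdb

def part_query(part_url: str, bboxes: list[dict], l0_allow: list[str]) -> str:
    bbox_pred = " OR ".join(
        f"(bbox.xmin >= {b['xmin']} AND bbox.xmax <= {b['xmax']}"
        f" AND bbox.ymin >= {b['ymin']} AND bbox.ymax <= {b['ymax']})"
        for b in bboxes
    )
    l0_list = ", ".join(f"'{c}'" for c in l0_allow)
    return (
        f"SELECT * FROM read_parquet('{part_url}') "
        f"WHERE ({bbox_pred}) "
        f"AND taxonomy.hierarchy[1] IN ({l0_list})"  # L0 = first hierarchy element
    )

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET memory_limit = '4GB'; SET threads = 2;")
sql = part_query(
    "s3://overturemaps-us-west-2/release/<release>/theme=places/type=place/part-00000.parquet",
    bboxes=[{"xmin": -125.0, "ymin": 24.0, "xmax": -66.0, "ymax": 50.0}],
    l0_allow=["food_and_drink", "shopping"],
)
```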
Data source: s3://overturemaps-us-west-2/release/ (public, no auth required).
Category filtering uses the taxonomy.hierarchy array. The first element
(taxonomy.hierarchy[1] in SQL 1-based indexing) is the L0 category. The
deprecated categories.primary field must NOT be used; it will be removed in June 2026.
Memory knobs: duckdb_memory_limit and duckdb_threads are per
DuckDB connection. workers parallelizes per-part downloads via a
ThreadPoolExecutor. Peak host RAM ≈ workers × duckdb_memory_limit
and peak CPU ≈ workers × duckdb_threads — scale the per-worker knobs
down if raising workers beyond the default.
- openpois.io.overture.build_overture_s3_path(
- release_date: str,
- bucket: str,
Returns the S3 glob path for all Places Parquet files in a given release.
- Parameters:
release_date – Release identifier as returned by get_latest_release_date (e.g., ‘2026-02-18.0’).
bucket – The S3 bucket name.
- Returns:
S3 path string suitable for DuckDB read_parquet(), e.g. s3://overturemaps-us-west-2/release/2026-02-18.0/theme=places/type=place/
- openpois.io.overture.download_overture_snapshot(
- output_path: Path,
- taxonomy_allowlist: list,
- boundary_gdf: geopandas.GeoDataFrame,
- coarse_bboxes: list[dict],
- bucket: str,
- s3_region: str,
- release_date: str | None = None,
- source_label: str = 'overture',
- duckdb_memory_limit: str = '4GB',
- duckdb_threads: int = 2,
- workers: int = 2,
Downloads filtered Overture Maps Places data and writes it as GeoParquet.
The full-CONUS scan is split across the release’s part-*.parquet files. Each part streams through DuckDB into a plain parquet intermediate (under output_path.parent / ".parts" / <release_date> /); the loop is resumable — if an intermediate already exists, the part is skipped on the next run. After every part is present, a single DuckDB COPY applies the exact US+PR polygon filter and writes the final GeoParquet without materializing rows in Python.
- Parameters:
output_path – Path to write the output GeoParquet file.
taxonomy_allowlist – List of (L0, L1) pairs specifying which taxonomy branches to retain.
L1 = None means “all L1s under this L0”. Accepts pairs as two-element tuples or lists (YAML). Valid L0 values (from S3 data as of 2026-02-18): ‘food_and_drink’, ‘shopping’, ‘arts_and_entertainment’, ‘sports_and_recreation’, ‘health_care’, ‘services_and_business’, ‘travel_and_transportation’, ‘lifestyle_services’, ‘education’, ‘community_and_government’, ‘cultural_and_historic’, ‘lodging’, ‘geographic_entities’. See: https://docs.overturemaps.org/guides/places/taxonomy/
boundary_gdf – Single-row GeoDataFrame in EPSG:4326 containing the dissolved, buffered US+PR polygon. Used as the exact spatial filter; obtain it from openpois.io.boundary.
coarse_bboxes – List of bbox dicts (keys xmin, ymin, xmax, ymax) used as the DuckDB predicate-pushdown prefilter. Typically obtained from openpois.io.boundary.us_pr_bboxes.
bucket – S3 bucket name hosting Overture releases.
s3_region – AWS region of the S3 bucket.
release_date – Overture release identifier (e.g., ‘2026-02-18.0’). If None, the latest release is fetched automatically.
source_label – Value for the output ‘source’ column.
duckdb_memory_limit – Per-connection DuckDB memory cap (e.g., “4GB”).
duckdb_threads – Per-connection DuckDB thread count.
workers – Number of parts to download in parallel via a ThreadPoolExecutor. Peak host RAM is workers × duckdb_memory_limit and peak CPU is workers × duckdb_threads — scale down the per-worker knobs when increasing workers. Must be >= 1.
- Returns:
The output_path of the written GeoParquet file. The file is readable by gpd.read_parquet(path) (with columns=... support) with CRS preserved as OGC:CRS84 (equivalent to EPSG:4326).
- Raises:
ValueError – If workers is less than 1, or if the S3 listing is truncated or empty.
- openpois.io.overture.get_latest_release_date(
- bucket: str,
Finds the most recent Overture Maps release date by listing the S3 bucket.
Queries the public S3 HTTP API for prefix listings under the ‘release/’ key and returns the lexicographically largest date string found.
- Parameters:
bucket – The S3 bucket name hosting Overture releases.
- Returns:
Release date string in the format ‘YYYY-MM-DD.N’ as it appears in S3 (e.g., ‘2026-02-18.0’).
- Raises:
requests.HTTPError – If the S3 list request fails.
ValueError – If no release prefixes are found in the bucket.
openpois.io.geohash_partition¶
Utilities for spatially partitioning GeoDataFrames by geohash for efficient
web-map viewport queries. Computes geohash columns from geometry centroids,
writes Hive-style partitioned Parquet datasets (geohash_prefix=XX/), and
sorts rows within each partition by a finer geohash for spatial locality.
openpois.io.source_coop¶
Upload a locally partitioned dataset to Source Cooperative’s S3-compatible
storage. Walks the Hive partition directory, uploads each Parquet file under
a versioned prefix, and reports the public URL on completion. Credentials
come from a JSON file at the repo root (publish.credentials_file).
openpois.io.credentials¶
Load Source Cooperative AWS-compatible credentials from a JSON file. Tokens are short-lived (~1 hour); the loader logs a clear error pointing at the credentials regeneration URL when the file is stale or missing.
Load Source Cooperative temporary AWS credentials from a local .env.json file.
Source Coop credentials are short-lived (issued via the dashboard and scoped to a single repository prefix). The file format is the JSON payload shown in the Source Coop UI — four keys:
{
  "aws_access_key_id": "ASIA…",
  "aws_secret_access_key": "…",
  "aws_session_token": "…",
  "region_name": "us-west-2"
}
If the file has not been touched recently we warn the caller (but do not fail); STS tokens typically last an hour or so.
- openpois.io.credentials.load_source_coop_credentials(
- env_file: Path | str | None = None,
Read Source Coop temporary AWS credentials from .env.json. env_file defaults to ~/repos/openpois/.env.json. Raises FileNotFoundError or ValueError with a refresh hint if the file is missing or malformed. Prints a warning if the file’s mtime is older than ~1 hour (tokens may have expired).
models¶
openpois.models.jax_core¶
JAX/BlackJAX helpers: a PRNG factory, a jitted Markov-chain scan, a NUTS sampler with window adaptation, and a vmap-based predictive-draw utility.
openpois.models.model_fitter¶
BlackJAX NUTS fitter for POI change-rate models. Takes an event_rate_fun
plus starting parameters as a pytree, runs window-adapted NUTS to draw from
the posterior, and produces posterior summaries and predictive distributions
of change probability versus time.
openpois.models.osm_models¶
JAX model classes for OSM turnover. ConstantModel and
RandomByTypeModel package their own data, priors, and event-rate
functions to hand to ModelFitter. Selectable via get_model_class.
JAX-based models for OSM POI turnover rate estimation.
Each model class is self-contained: it ingests a raw observations DataFrame
plus a metadata dict, prepares the JAX arrays that ModelFitter needs, and
exposes event_rate_fun and param_likelihood as bound instance methods.
The fitted rate is interpreted as a Poisson event rate per observation; the
change probability is recovered inside ModelFitter via P = 1 - exp(-rate).
- class openpois.models.osm_models.ConstantModel(
- dataset: pandas.DataFrame,
- metadata: dict,
Bases: ModelFactory
Constant change rate with ZIE δ mixture.
λ = exp(log_lambda); δ = sigmoid(logit_delta). A fraction δ of individuals change at t = 0 (methodology §1.7); the remaining 1−δ fraction follow Exponential(λ). Only the first interval of each individual carries the (1−δ) discount — see log_likelihood_fun.
- Metadata keys:
- dt_col: Column containing per-observation interval length in years (default "tag_years").
- log_lambda_prior_scale: Standard deviation of the N(0, scale) prior on log_lambda (default DEFAULT_LOG_LAMBDA_PRIOR_SCALE).
- logit_delta_prior: (loc, scale) tuple for the Normal prior on logit_delta (default DEFAULT_LOGIT_DELTA_PRIOR = (−3, 1)).
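A hedged sketch of the ZIE Bernoulli-on-Poisson likelihood described above; the "changed" data key is an assumption, and the module's actual log_likelihood_fun may differ in detail.

```python
import jax.numpy as jnp
from jax.nn import sigmoid

def log_likelihood(params, data):
    lam = jnp.exp(params["log_lambda"])      # events per year
    delta = sigmoid(params["logit_delta"])   # mass of instant (t = 0) changers
    p = 1.0 - jnp.exp(-lam * data["dt"])     # P(change within the interval)
    # Only first intervals carry the delta mixture mass.
    p = jnp.where(data["is_first_interval"], delta + (1.0 - delta) * p, p)
    y = data["changed"]                      # 1 if the tag changed this interval
    return jnp.sum(y * jnp.log(p) + (1.0 - y) * jnp.log1p(-p))
```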
- class openpois.models.osm_models.ModelFactory(
- dataset: pandas.DataFrame,
- metadata: dict,
Bases: ABC
Base class for OSM turnover models.
Subclasses must implement build_model() to populate starting_params, param_ids (and group_lookup for random-effects variants), plus any per-observation columns in self.data beyond dt.
- __init__(
- dataset: pandas.DataFrame,
- metadata: dict,
- Parameters:
dataset – Observations DataFrame (already filtered/prepared by prepare_data_for_model).
metadata – Model configuration. Required keys vary by subclass; all subclasses honor dt_col (default "tag_years").
- abstractmethod build_model()¶
Populate starting_params, param_ids, and any extra data columns.
- abstractmethod build_predict_data(
- times: jax.numpy.ndarray,
Build the data dict passed to ModelFitter.predict for a time grid.
- derive_draws(
- draws: dict[str, jax.numpy.ndarray],
Augment posterior draws with any derived/back-transformed parameters.
Default is an identity map. Override in subclasses that sample a reparameterised form (e.g. non-centered
epsilon_raw) to expose the natural parameter (epsilon) to downstream consumers.
- abstractmethod event_rate_fun(
- params: dict[str, jax.numpy.ndarray],
- data: dict[str, jax.numpy.ndarray],
Poisson event rate per observation.
- class openpois.models.osm_models.RandomByTypeModel(
- dataset: pandas.DataFrame,
- metadata: dict,
Bases: ModelFactory
Random-effects model on both λ and δ, grouped by a shared label.
- Per-group log-rate:
log λ_g = log_lambda_0 + ε_g,  ε_g ~ N(0, exp(log_sigma)),  log_sigma ~ N(var_prior[0], var_prior[1])
- Per-group zero-inflated mixture mass on the logit scale:
logit δ_g = logit_delta_0 + η_g,  η_g ~ N(0, exp(log_tau)),  log_tau ~ N(logit_delta_var_prior[0], logit_delta_var_prior[1])
Two equivalent parameterisations are supported via metadata["reparam"] and apply symmetrically to both ε and η:
"centered" (legacy): the sampler traces epsilon / eta directly. Simple, but the posterior is funnel-shaped near sparse groups and NUTS tends to diverge there.
"non_centered" (default): the sampler traces epsilon_raw ~ N(0, 1) / eta_raw ~ N(0, 1) and we reconstruct epsilon = exp(log_sigma) * epsilon_raw and eta = exp(log_tau) * eta_raw post-hoc. Removes the funnel, usually yielding zero divergences and higher ESS on small groups with no change to well-identified groups.
Under non_centered the natural-scale epsilon and eta draws are exposed via derive_draws along with the per-group logit_delta and delta, so downstream consumers (predict, parameter tables, saved param_draws.csv) see the same derived names either way.
- Metadata keys:
- group: Column name in raw_data holding the grouping variable. Required.
- dt_col: Column containing per-observation interval length (default "tag_years").
- var_prior: (loc, scale) tuple for the hyperprior on log_sigma (default DEFAULT_VAR_PRIOR).
- logit_delta_prior: (loc, scale) tuple for the prior on the intercept logit_delta_0 (default DEFAULT_LOGIT_DELTA_PRIOR).
- logit_delta_var_prior: (loc, scale) tuple for the tight hyperprior on log_tau (default DEFAULT_LOGIT_DELTA_VAR_PRIOR).
- reparam: "centered" or "non_centered" (default "non_centered").
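A hedged sketch of the non-centered reconstruction on the λ side; parameter and data names (group_idx in particular) are assumptions.

```python
import jax.numpy as jnp

def group_log_rate(params, data):
    # Sampler traces epsilon_raw ~ N(0, 1); rescaling post-hoc removes the
    # funnel geometry that makes NUTS diverge on sparse groups.
    epsilon = jnp.exp(params["log_sigma"]) * params["epsilon_raw"]
    log_lambda_g = params["log_lambda_0"] + epsilon   # per-group log-rate
    return log_lambda_g[data["group_idx"]]            # broadcast to observations
```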
- openpois.models.osm_models.get_model_class(
- model_name: str,
Return a ModelFactory subclass by name from MODEL_REGISTRY.
- Parameters:
model_name – Registry key ("constant" or "random_by_type").
- Returns:
The corresponding ModelFactory subclass.
- Raises:
ValueError – If model_name is not a registered model.
openpois.models.setup¶
Data-preparation helpers. prepare_data_for_model filters and groups
observation records and computes the tag_years elapsed column used as
the per-observation interval length in fitting.
Data-preparation helpers for OSM turnover models.
- openpois.models.setup.prepare_data_for_model(
- data: pandas.DataFrame,
- group_key: str | None = None,
- group_values: list[str] | None = None,
- min_value_count: int | None = None,
- t1_col: str = 'last_obs_timestamp',
- t2_col: str = 'obs_timestamp',
Prepare an observations DataFrame for model fitting.
Per turnover-model-methodology.md §1.2, the per-row Bernoulli-on-Poisson likelihood requires Δ = t_k − t_{k−1} (inter-observation), so the default t1_col is last_obs_timestamp. Multiplying per-row Bernoullis telescopes to the correct individual likelihood. The previous default of last_tag_timestamp would have made Δ the duration since the individual’s start — correct for one-row-per-individual but biased downward on multi-version POIs.
Also emits is_first_interval — True exactly when last_obs_timestamp == last_tag_timestamp, i.e. this row is the first surviving observation of its (POI, name-iteration) individual. Used by the ZIE δ extension (methodology §1.7).
- Parameters:
data – Observations DataFrame as returned by format_observations.
group_key – Column name of the grouping variable. If None, no group filtering is applied.
group_values – If provided, only rows with group_key in this list are kept.
min_value_count – If provided, groups with fewer than this many observations are dropped.
t1_col – Name of the start-time timestamp column. Default last_obs_timestamp gives the inter-observation interval.
t2_col – Name of the end-time timestamp column.
- Returns:
Filtered DataFrame with additional tag_days, tag_years, and is_first_interval columns.
- Raises:
ValueError – If t1_col or t2_col is not present in data.
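A hypothetical call using the documented parameters; the observations frame and the "shared_label" grouping column are assumptions.

```python
from openpois.models.setup import prepare_data_for_model

obs = prepare_data_for_model(
    data=observations,
    group_key="shared_label",
    min_value_count=500,  # drop sparse groups before fitting
)
# obs now carries tag_days, tag_years (the fitting interval), and
# is_first_interval (consumed by the ZIE delta extension).
```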
openpois.models.apply¶
Apply saved change-rate model predictions to a POI snapshot. Loads
predictions.csv from a versioned model output directory and builds
fast numpy lookup arrays (indexed by group and time step) for both constant
and random-effects model variants.
Utilities for applying fitted change-rate model predictions to a POI snapshot.
Provides functions to load saved predictions and build fast numpy lookup arrays for both constant and random-effects model variants.
- openpois.models.apply.constant_lookup(
- pred_df: pandas.DataFrame,
Build a (101, 3) float64 lookup array for a constant model.
Row index = t2_int (0–100); columns = [p_mean, p_lower, p_upper].
- Parameters:
pred_df – Predictions DataFrame returned by load_predictions.
- Returns:
Array of shape (101, 3).
- openpois.models.apply.group_lookup(
- pred_df: pandas.DataFrame,
Build a (n_groups, 101, 3) float64 lookup array for a random-effects model.
First axis = group index (sorted alphabetically); second = t2_int (0–100); third = [p_mean, p_lower, p_upper].
- Parameters:
pred_df – Predictions DataFrame returned by load_predictions. Must contain a group_name column.
- Returns:
Tuple of (ordered_group_names, array of shape (n_groups, 101, 3)).
- openpois.models.apply.load_predictions(
- version_dir: Path,
Load predictions.csv from a model version directory.
Adds a t2_int column (t2 * 10, rounded to int) as an integer lookup key.
- Parameters:
version_dir – Path to the versioned model output directory containing predictions.csv.
- Returns:
DataFrame with columns t1, t2, p_mean, p_lower, p_upper, units, t2_int, and optionally group / group_name for random-effects models.
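An illustrative lookup flow using the documented array layout (row index t2_int = round(t2 * 10), columns [p_mean, p_lower, p_upper]); version_dir is assumed given.

```python
from openpois.models.apply import load_predictions, constant_lookup

pred_df = load_predictions(version_dir)
table = constant_lookup(pred_df)       # shape (101, 3)
p_mean, p_lower, p_upper = table[35]   # change probability at t2 = 3.5 years
```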
osm¶
openpois.osm.format_observations¶
Convert raw OSM version histories into modelling-ready observation records. Joins version and change tables to produce one row per element version, with timestamps for the previous and current tag values and a flag indicating whether the configured tag changed at this version.
This module formats OSM changes and versions into observations, which can be more easily queried and statistically analyzed.
- openpois.osm.format_observations.format_observations_duckdb(
- changes_path: Path,
- versions_path: Path,
- output_path: Path,
- tag_key: str,
- keep_keys: list[str],
- duckdb_memory_limit: str = '4GB',
- duckdb_threads: int | None = None,
- duckdb_temp_dir: Path | None = None,
- batch_rows: int = 100000,
- verbose: bool = True,
Stream POI observations from Parquet inputs to Parquet via DuckDB.
DuckDB pivots the long-form osm_changes.parquet wide by tag key, LEFT-joins osm_versions.parquet on (type, id, version), and returns rows sorted by (type, id, version); the sort spills to duckdb_temp_dir past duckdb_memory_limit. A Python scan then iterates the sorted stream through _advance_scan_state(), buffering emitted observations per DuckDB fetch batch and flushing them as pyarrow.Table record batches to a ParquetWriter.
duckdb_memory_limitplus one fetch batch of observations, regardless of input size.- Parameters:
changes_path – Input
osm_changes.parquet.versions_path – Input
osm_versions.parquet.output_path – Destination
.parquet. Overwritten.tag_key – Tag key to model (e.g.
"name").keep_keys – Tag keys to retain on each observation. Must not include special characters (validated against
[A-Za-z0-9_:]+).duckdb_memory_limit – DuckDB
memory_limitsetting. The sort operator spills to disk past this.duckdb_threads – DuckDB worker thread count. Defaults to
os.cpu_count().duckdb_temp_dir – Sort-spill directory. Defaults to
output_path.parent.batch_rows – Rows pulled per
fetchmanycall; also the ParquetWriter flush size.verbose – Print progress.
- Returns:
Total number of observation rows written.
openpois.osm.change_plots¶
Kaplan-Meier-style tag stability plots using plotnine. Computes the proportion of tag assignments that remain unchanged over time from observation records, and renders single-panel and faceted multi-panel figures saved as PNG files.