OpenPOIs | GitHub | Henry Spatial Analysis

API Reference¶

conflation¶

openpois.conflation.match¶

Spatial candidate matching and composite scoring for POI conflation. Provides a BallTree-based radius search to find nearby OSM–Overture candidate pairs within category-specific thresholds, a multi-component scorer (distance, name similarity, taxonomy agreement, shared identifiers), and a greedy one-to-one assignment step that filters below a minimum composite score.

openpois.conflation.merge¶

Merge matched and unmatched POIs into a unified conflated GeoDataFrame. Produces a superset containing matched OSM–Overture pairs with blended confidence scores, unmatched OSM POIs at their original confidence, and unmatched Overture POIs at downweighted confidence. Uses a disk-backed split-then-concat pattern to avoid peak memory issues at CONUS scale.

Merge matched and unmatched POIs into a unified conflated dataset.

Produces a GeoDataFrame superset:

Matched pairs (OSM + Overture) with blended confidence.
Unmatched OSM POIs with their original confidence.
Unmatched Overture POIs with downweighted confidence.

Three entry points:

merge_matched_pois: in-memory, for tests/small datasets.
build_merge_parts: disk-backed, row-sliced. Writes multiple part parquets so peak memory is bounded by slice size.
build_merge_parts_chunked: disk-backed, spatial-chunk-sliced. Reuses the osm_primary / overture_primary arrays produced by the chunked matching driver so each per-chunk part is small and independent.

openpois.conflation.merge.build_merge_parts( osm_gdf: geopandas.GeoDataFrame, overture_gdf: geopandas.GeoDataFrame, matches: pandas.DataFrame, osm_shared_labels: numpy.ndarray, overture_shared_labels: numpy.ndarray, overture_confidence_weight: float = 0.7, n_slices: int = 4, ) → list[Path][source]¶

Build each merge subset, writing to temp parquet files.

Unmatched OSM and Overture rows are split into n_slices contiguous row ranges each, and each slice is built and written independently. This caps peak memory at roughly (1 / n_slices) of the full-dataset footprint for unmatched parts. The matched part is written as a single file (it is the smallest and already bounded by the number of matches).

Returns:: List of temp parquet file paths in concat order.

openpois.conflation.merge.build_merge_parts_chunked( osm_gdf: geopandas.GeoDataFrame, overture_gdf: geopandas.GeoDataFrame, matches: pandas.DataFrame, osm_shared_labels: numpy.ndarray, overture_shared_labels: numpy.ndarray, osm_primary: numpy.ndarray, overture_primary: numpy.ndarray, n_chunks: int, overture_confidence_weight: float = 0.7, ) → list[Path][source]¶

Build per-spatial-chunk merge parts, writing one parquet per chunk.

Reuses the KD-bisected chunks produced by the chunked matching driver: for each chunk c we emit matched pairs whose OSM POI has osm_primary == c (the same OSM-anchored emit rule used during matching), unmatched OSM POIs with osm_primary == c, and unmatched Overture POIs with overture_primary == c.

Peak memory per chunk is bounded by chunk size × 18-column schema, so this stays within a few hundred MB for ~200k-POI chunks regardless of total dataset size.

Parameters:

osm_gdf – Full source frames.
overture_gdf – Full source frames.
matches – Post-dedup match DataFrame (osm_idx unique).
osm_shared_labels – Parallel to source frames.
overture_shared_labels – Parallel to source frames.
osm_primary – (n,) int arrays assigning each row to its primary chunk. Produced by chunking.assign_primary_chunk.
overture_primary – (n,) int arrays assigning each row to its primary chunk. Produced by chunking.assign_primary_chunk.
n_chunks – Total number of chunks; used for offset arrays.
overture_confidence_weight – Blend weight w (see _build_matched_gdf).

Returns:

List of per-chunk part file paths, in ascending chunk order.

openpois.conflation.merge.merge_matched_pois( osm_gdf: geopandas.GeoDataFrame, overture_gdf: geopandas.GeoDataFrame, matches: pandas.DataFrame, osm_shared_labels: numpy.ndarray, overture_shared_labels: numpy.ndarray, overture_confidence_weight: float = 0.7, ) → geopandas.GeoDataFrame[source]¶

Build the unified conflated dataset from matches + unmatched.

This in-memory version is suitable for tests and small datasets. For large datasets, use build_merge_parts (row-sliced) or build_merge_parts_chunked (spatial-chunk-sliced) + save_conflated_from_parts.

Returns:: Conflated GeoDataFrame with unified schema.

openpois.conflation.merge.save_conflated( gdf: geopandas.GeoDataFrame, output_path: Path, ) → None[source]¶: Hilbert-sort and save as GeoParquet (zstd, 50k row groups).

openpois.conflation.merge.save_conflated_from_parts( part_paths: list[Path], output_path: Path, ) → int[source]¶

Stream temp parquet parts into the final output file.

Opens each part sequentially, unifies its schema against the writer, and appends its row groups. Only one part is held in memory at a time, so peak memory is bounded by the largest part — independent of the number of parts or the total dataset size. Skips Hilbert sorting to stay within memory limits.

Returns:: Number of POIs written.

openpois.conflation.taxonomy¶

Taxonomy crosswalk between OSM tags and the Overture Maps category hierarchy. Loads four CSV reference files (OSM crosswalk, Overture crosswalk, match radii, and top-level key-to-L0 mappings) and provides functions to assign each POI a shared_label string, a per-category spatial match radius, and an L0 bitmask used for type-agreement scoring.

Taxonomy crosswalk between OSM tags and Overture Maps taxonomy.

Loads four CSV files that map OSM tag key/value pairs and Overture (L0, L1) categories to a unified shared_label, plus per-label match radii and top-level OSM-key-to-Overture-L0 mappings.

openpois.conflation.taxonomy.assign_osm_shared_label( gdf: pandas.DataFrame, osm_crosswalk: pandas.DataFrame, match_radii: pandas.DataFrame, filter_keys: list[str], default_radius_m: float = 100.0, return_all: bool = False, ) → tuple[numpy.ndarray, numpy.ndarray] | tuple[list[list[str]], list[list[float]]][source]¶

Assign shared taxonomy labels to each OSM POI.

Two modes, selected by return_all:

return_all=False (default) — produces a single label per row. Uses filter_keys in priority order (first non-null match wins), falling back to the per-key wildcard row if the specific value is not in the crosswalk. Returns (label, radius) as object / float64 ndarrays of length len(gdf). Unmatched rows have label == "" and radius == default_radius_m. This is the path used by the conflation pipeline and snapshot model application.
return_all=True — produces zero or more labels per row, used by the model-training pipeline which duplicates observations across every applicable taxonomy category.

Pass 1 (specific matches): for every filter_key, every row whose value for that key is in the crosswalk receives that label. A row can collect multiple specific labels.

Pass 2 (wildcard fallback): applied only to rows that had zero specific matches in pass 1. Within such a row, wildcard keys are walked in the order they appear in the crosswalk CSV (_build_osm_label_lookups populates the wildcards dict via iterrows, preserving CSV order via dict insertion order); the first wildcard key with a non-null/non-empty value wins and is the only wildcard label assigned.

Returns (labels_per_row, radii_per_row) as lists of lists; each inner list has >=0 entries and is de-duplicated (if two keys map to the same label, it appears once).

openpois.conflation.taxonomy.assign_overture_shared_label( gdf: pandas.DataFrame, overture_crosswalk: pandas.DataFrame, match_radii: pandas.DataFrame, default_radius_m: float = 100.0, ) → tuple[numpy.ndarray, numpy.ndarray][source]¶

Assign a shared_label and match_radius_m to each Overture POI using a 4-tier cascade from most to least specific.

Tiers (applied in order, each only to unmatched rows):

(L0, L1, L2) — crosswalk rows with all three populated.
(L0, L2) — L1 empty in crosswalk; matches any L1.
(L0, L1) — L2 empty in crosswalk; catch-all for an L1.
L0-only — both L1 and L2 empty in crosswalk.

Backward-compatible: if the GeoDataFrame has no taxonomy_l2 column, tiers 1-2 produce no matches and behaviour falls back to the old (L0, L1) + L0 logic.

Returns:: (shared_label ndarray of object, match_radius_m ndarray of float)

openpois.conflation.taxonomy.compute_osm_l0_bits( gdf: pandas.DataFrame, top_level_matches: pandas.DataFrame, ) → numpy.ndarray[source]¶

For each OSM POI, compute a uint16 bitmask encoding which Overture L0 categories it broadly matches.

A non-null value in an OSM tag key (e.g. amenity) sets the bit(s) for every L0 linked to that key via top_level_matches. For example, amenity maps to both arts_and_entertainment (bit 1) and food_and_drink (bit 2), so any POI with a non-null amenity value gets 1 | 2 = 3.

openpois.conflation.taxonomy.compute_overture_l0_bits( l0_array: numpy.ndarray, ) → numpy.ndarray[source]¶: For each Overture POI, compute a uint16 bitmask from its taxonomy_l0 value. Each POI has at most one L0 category, so a single bit is set.

openpois.conflation.taxonomy.load_match_radii() → pandas.DataFrame[source]¶

Load the match-radii CSV.

Columns: shared_label, match_radius_m.

openpois.conflation.taxonomy.load_osm_crosswalk() → pandas.DataFrame[source]¶

Load the OSM taxonomy crosswalk CSV.

Columns: osm_key, osm_value, shared_label.

openpois.conflation.taxonomy.load_overture_crosswalk() → pandas.DataFrame[source]¶

Load the Overture Maps taxonomy crosswalk CSV.

Columns: overture_l0, overture_l1, overture_l2, shared_label.

openpois.conflation.taxonomy.load_top_level_matches() → pandas.DataFrame[source]¶

Load the top-level OSM-key ↔ Overture-L0 CSV.

Columns: overture_l0, osm_key.

io¶

openpois.io.osm_history_pbf¶

Download Geofabrik full-history PBFs (US + Puerto Rico), filter to POI tags with osmium tags-filter, time-window with osmium time-filter, and parse with pyosmium into per-version and per-change Parquet tables suitable for the change-rate model. Uses an OAuth cookie jar against Geofabrik’s internal server.

This module downloads US+PR full-history OpenStreetMap data for POI change-rate modelling using Geofabrik full-history PBF extracts, osmium-tool CLI pre-filtering, and pyosmium streaming.

It is broken into the following functions:

download_history_pbf: Downloads a .osh.pbf file (optionally authenticated via
a Geofabrik OAuth cookie jar) via streaming HTTP.
filter_history_pbf: Runs osmium tags-filter –omit-referenced to produce a
reduced POI-only history PBF.
time_filter_history_pbf: Runs osmium time-filter FROM TO to slice the history
PBF to versions active in a given date range.
parse_history_pbf: Streams a filtered history PBF with pyosmium and writes
per-version metadata (osm_versions.parquet) and per-version tag diffs (osm_changes.parquet).
download_osm_history: End-to-end orchestrator. Downloads both the US-mainland
and Puerto Rico history extracts, filters and time-filters each, parses each, concatenates the results, and writes final versions/changes Parquets.

Data sources:

US mainland (all 50 states incl. AK + HI, ~11 GB): https://osm-internal.download.geofabrik.de/north-america/us-internal.osh.pbf
Puerto Rico (separate extract): https://osm-internal.download.geofabrik.de/north-america/us/puerto-rico-internal.osh.pbf

Both URLs live on Geofabrik’s OAuth-protected internal server and require a valid OSM-account cookie jar. Any OSM account grants access; see the README section on cookie acquisition for details.

osmium-tool CLI must be installed (conda install -c conda-forge osmium-tool).

Note: This module is separate from openpois.io.osm_snapshot, which extracts the current POI snapshot only.

openpois.io.osm_history_pbf.download_history_pbf( url: str, output_path: Path, cookie_file: Path | None = None, overwrite: bool = False, ) → Path[source]¶

Downloads a full-history PBF file from the given URL via streaming HTTP.

Writes to a temporary file in the same directory and renames atomically on success so a partial download never masquerades as a complete file.

Parameters:

url – URL of the history PBF file to download.
output_path – Local path to save the downloaded PBF.
cookie_file – Path to a Netscape-format cookie jar for Geofabrik OAuth, or None for an unauthenticated session (fine for public extracts, required for the internal server).
overwrite – If False and output_path already exists, skip the download.

Returns:

Path to the downloaded PBF file.

Raises:

requests.HTTPError – If the HTTP request fails.
FileNotFoundError – If cookie_file is given but does not exist.

openpois.io.osm_history_pbf.download_osm_history( pbf_url: str, raw_pbf_path: Path, filtered_pbf_path: Path, time_filtered_pbf_path: Path, us_versions_path: Path, us_changes_path: Path, pr_pbf_url: str, raw_pr_pbf_path: Path, filtered_pr_pbf_path: Path, time_filtered_pr_pbf_path: Path, pr_versions_path: Path, pr_changes_path: Path, output_versions_path: Path, output_changes_path: Path, filter_keys: list[str], start_date: datetime | date, end_date: datetime | date, cookie_file: Path | None = None, overwrite_download: bool = False, overwrite_filter: bool = False, overwrite_parse: bool = False, chunk_size: int = 500000, verbose: bool = True, ) → tuple[Path, Path][source]¶

End-to-end orchestrator: download the US-mainland and PR Geofabrik full-history PBFs, filter and time-filter each, parse each to Parquets, and concatenate into the final versions + changes files.

Parameters:

pbf_url – URL of the US-mainland full-history PBF (Geofabrik internal).
raw_pbf_path – Local path for the raw US PBF.
filtered_pbf_path – Local path for the tags-filtered US PBF.
time_filtered_pbf_path – Local path for the time-filtered US PBF.
us_versions_path – Intermediate Parquet for US versions.
us_changes_path – Intermediate Parquet for US changes.
pr_pbf_url – URL of the Puerto Rico full-history PBF.
raw_pr_pbf_path – Local path for the raw PR PBF.
filtered_pr_pbf_path – Local path for the tags-filtered PR PBF.
time_filtered_pr_pbf_path – Local path for the time-filtered PR PBF.
pr_versions_path – Intermediate Parquet for PR versions.
pr_changes_path – Intermediate Parquet for PR changes.
output_versions_path – Final concatenated osm_versions.parquet.
output_changes_path – Final concatenated osm_changes.parquet.
filter_keys – OSM tag keys passed to tags-filter.
start_date – Start of the time-filter window.
end_date – End of the time-filter window.
cookie_file – Netscape-format cookie jar for Geofabrik OAuth.
overwrite_download – Re-download raw PBFs even if present.
overwrite_filter – Re-run tags-filter and time-filter even if present.
overwrite_parse – Re-run parse even if Parquets are present.
chunk_size – Rows per Parquet-writer flush in the streaming parser.
verbose – Print progress during parsing.

Returns:

Tuple (output_versions_path, output_changes_path).

openpois.io.osm_history_pbf.filter_history_pbf( input_pbf: Path, output_pbf: Path, osm_keys: list[str], overwrite: bool = False, ) → Path[source]¶

Runs osmium tags-filter –omit-referenced on a full-history PBF.

–omit-referenced is required for history files: multi-pass filtering (the default mode used by filter_pbf in osm_snapshot.py) fails on .osh.pbf. –omit-referenced does a single pass and emits only the matching objects themselves; referenced node coordinates are not retained. This is fine for the change-rate pipeline because we never resolve way geometry downstream.

The --output-format=osh.pbf flag keeps the output in history format.

Parameters:

input_pbf – Path to the raw history PBF.
output_pbf – Path to write the filtered history PBF.
osm_keys – OSM tag keys to retain (e.g., [‘amenity’, ‘shop’]).
overwrite – If False and output_pbf exists, skip filtering.

Returns:

Path to the filtered PBF file.

Raises:

subprocess.CalledProcessError – If osmium exits with non-zero status.

openpois.io.osm_history_pbf.parse_history_pbf( pbf_path: Path, versions_path: Path, changes_path: Path, chunk_size: int = 500000, overwrite: bool = False, verbose: bool = True, ) → tuple[Path, Path][source]¶

Stream a filtered full-history PBF and write versions + changes Parquets.

The pyosmium FileProcessor emits every version of every element in (type, id, version) order for a history PBF. For each version we compare its tag set against the previous version of the same element (reset whenever (type, id) changes) and emit:

one row per version to versions_path with id, version, changeset, timestamp, user, uid, type;
one row per tag change (Added / Changed / Deleted) to changes_path with key, value, change, id, version.

Includes visible, lat, and lon as pseudo-tags so that visibility toggles (deletions) and coordinate edits are captured in osm_changes — matches the behaviour of the existing Overpass-based pipeline.

Parameters:

pbf_path – Path to the (tag-filtered and optionally time-filtered) history PBF.
versions_path – Destination Parquet for per-version metadata.
changes_path – Destination Parquet for per-version tag diffs.
chunk_size – Number of rows to buffer before each flush. Same value is applied independently to the versions and changes buffers.
overwrite – If False and both destinations already exist, skip parsing.
verbose – If True, print progress every chunk_size versions.

Returns:

Tuple (versions_path, changes_path).

openpois.io.osm_history_pbf.time_filter_history_pbf( input_pbf: Path, output_pbf: Path, start_date: datetime | date, end_date: datetime | date, overwrite: bool = False, ) → Path[source]¶

Runs osmium time-filter FROM TO on a full-history PBF.

With two ISO-formatted timestamps, osmium time-filter preserves every version active during the window and keeps the output in history format. A single-timestamp call would collapse the file to a snapshot — that is not what we want here.

Parameters:

input_pbf – Path to the tag-filtered history PBF.
output_pbf – Path to write the time-filtered history PBF.
start_date – Start of the window (inclusive). datetime or date.
end_date – End of the window (exclusive per osmium semantics).
overwrite – If False and output_pbf exists, skip the filter.

Returns:

Path to the time-filtered PBF file.

Raises:

subprocess.CalledProcessError – If osmium exits with non-zero status.

openpois.io.osm_snapshot¶

Download a current US-wide OSM POI snapshot from a Geofabrik PBF extract. Streams the PBF (~11 GB), runs osmium tags-filter to reduce it to matching tag keys, then parses nodes and way centroids with pyosmium into a GeoParquet file. The osmium binary is resolved from the conda environment rather than the system PATH.

This module downloads a current/latest POI snapshot for the US + Puerto Rico from OpenStreetMap using Geofabrik PBF extracts, osmium-tool CLI pre-filtering, and pyosmium parsing.

It is broken into the following functions:

download_pbf: Downloads a PBF file from a URL via streaming HTTP.
filter_pbf: Runs osmium tags-filter to produce a reduced POI-only PBF.
parse_pbf_to_geodataframe: Parses the filtered PBF with pyosmium into a
GeoDataFrame of nodes (Points) and ways (Polygons or Points).
download_osm_snapshot: End-to-end orchestrator. Downloads and parses both
the mainland US extract and the Puerto Rico extract, then concatenates the results.

Data sources:

US mainland (all 50 states incl. AK + HI, ~11 GB): https://download.geofabrik.de/north-america/us-latest.osm.pbf
Puerto Rico (separate extract, ~tens of MB): https://download.geofabrik.de/north-america/us/puerto-rico-latest.osm.pbf

Geofabrik extracts are cut along administrative boundaries, so no polygon post-filter is applied here — the two extracts together cover exactly the US + PR footprint.

osmium-tool CLI must be installed (conda install -c conda-forge osmium-tool).

Note: This module is separate from openpois.io.osm_history_pbf, which fetches full-history PBFs for change-rate modeling. This module downloads a current snapshot only.

openpois.io.osm_snapshot.download_osm_snapshot( pbf_url: str, raw_pbf_path: Path, filtered_pbf_path: Path, output_path: Path, filter_keys: list[str], extract_keys: list[str], pr_pbf_url: str, raw_pr_pbf_path: Path, filtered_pr_pbf_path: Path, overwrite_download: bool = False, overwrite_filter: bool = False, source_label: str = 'osm', keep_all_keys: bool = False, chunk_size: int = 500000, max_area_nodes: int | None = None, chunk_dir: Path | None = None, verbose: bool = True, ) → Path[source]¶

End-to-end orchestrator: download both the US-mainland and Puerto Rico Geofabrik PBFs, filter each to POIs, parse each, concat, and save as GeoParquet.

For each PBF the steps are:

download_pbf — streams the PBF to the raw_pbf path.
filter_pbf — runs osmium tags-filter to produce a POI-only PBF.
parse_pbf_to_geodataframe — parses with pyosmium into a GeoDataFrame.

The two GeoDataFrames are concatenated and written to output_path.

Steps 1 and 2 are skipped if the target files already exist unless overwrite_download / overwrite_filter are True.

Parameters:

pbf_url – URL of the US-mainland PBF extract (Geofabrik us-latest, all 50 states including AK + HI).
raw_pbf_path – Local path to store the US-mainland raw PBF.
filtered_pbf_path – Local path to store the US-mainland filtered PBF.
output_path – Path to write the output GeoParquet file.
filter_keys – OSM tag keys used to filter elements in the PBF. Elements lacking all of these keys are excluded.
extract_keys – OSM tag keys to include as output columns. If None, all tags on accepted elements are extracted.
pr_pbf_url – URL of the Puerto Rico PBF extract (Geofabrik puerto-rico-latest). Geofabrik serves this separately from the US extract.
raw_pr_pbf_path – Local path to store the PR raw PBF.
filtered_pr_pbf_path – Local path to store the PR filtered PBF.
overwrite_download – Re-download even if raw paths exist.
overwrite_filter – Re-filter even if filtered paths exist.
source_label – Value for the output ‘source’ column.
keep_all_keys – If True, all OSM tags are retained as columns in the output GeoDataFrame, not just those in extract_keys. filter_keys is still used to filter which elements are included.
chunk_size – Number of POI records per parquet chunk during parsing. Lower values reduce peak memory usage.
max_area_nodes – If set, relation-derived areas with more than this many total coordinate nodes are skipped before any Shapely geometry is built. Useful for excluding large multipolygons (parks, admin boundaries) that can exhaust memory. None disables the check.
chunk_dir – Directory under which a parse_chunks/ subdirectory is created to hold intermediate chunk files. Defaults to the parent of each filtered PBF. See parse_pbf_to_geodataframe for details.
verbose – If True, log progress after each chunk is flushed.

Returns:

Path to the written GeoParquet file (same as output_path).

openpois.io.osm_snapshot.download_pbf( url: str, output_path: Path, overwrite: bool = False, ) → Path[source]¶

Downloads a PBF file from the given URL to output_path via streaming HTTP.

Parameters:

url – URL of the PBF file to download (e.g., a Geofabrik extract).
output_path – Local path to save the downloaded PBF.
overwrite – If False and output_path already exists, skip the download.

Returns:

Path to the downloaded PBF file.

Raises:

requests.HTTPError – If the HTTP request fails.

openpois.io.osm_snapshot.filter_pbf( input_pbf: Path, output_pbf: Path, osm_keys: list[str], overwrite: bool = False, ) → Path[source]¶

Runs osmium tags-filter to extract nodes, ways, and relations matching the given keys.

Constructs and runs a command of the form:: osmium tags-filter -o {output_pbf} {input_pbf} nwr/{key1} nwr/{key2} …

The referenced nodes for matched ways are retained so that way geometries can be resolved by pyosmium in a subsequent step.

Parameters:

input_pbf – Path to the full PBF extract.
output_pbf – Path to write the filtered output PBF.
osm_keys – OSM tag keys to retain (e.g., [‘amenity’, ‘shop’]).
overwrite – If False and output_pbf exists, skip filtering.

Returns:

Path to the filtered PBF file.

Raises:

subprocess.CalledProcessError – If osmium exits with non-zero status.
FileNotFoundError – If osmium is not installed or not on PATH.

openpois.io.osm_snapshot.parse_pbf_to_geodataframe( pbf_path: Path, filter_keys: list[str] | None = None, extract_keys: list[str] | None = None, source_label: str = 'osm', chunk_size: int = 500000, max_area_nodes: int | None = None, chunk_dir: Path | None = None, verbose: bool = True, ) → geopandas.GeoDataFrame[source]¶

Parses a filtered PBF file with pyosmium and returns a GeoDataFrame.

Thin wrapper around parse_pbf_to_parquet that loads the written parquet into a GeoDataFrame. For very large extracts (e.g. a full US PBF), prefer parse_pbf_to_parquet and consume the parquet with PyArrow streaming to avoid holding all records in memory.

See parse_pbf_to_parquet for parameter documentation.

Returns:: source, osm_id (int64), osm_type (‘node’|’way’|’relation’), tag columns, name, geometry. CRS is EPSG:4326.
Return type:: GeoDataFrame with columns

openpois.io.osm_snapshot.parse_pbf_to_parquet( pbf_path: Path, out_path: Path, filter_keys: list[str] | None = None, extract_keys: list[str] | None = None, source_label: str = 'osm', chunk_size: int = 500000, max_area_nodes: int | None = None, chunk_dir: Path | None = None, verbose: bool = True, ) → Path[source]¶

Parses a filtered PBF file with pyosmium and writes the result as a single GeoParquet file at out_path.

Memory-efficient alternative to parse_pbf_to_geodataframe: records are flushed to per-chunk parquet files on disk, then merged directly to out_path via a PyArrow streaming writer. A full GeoDataFrame is never materialised in memory. Peak memory is one chunk’s worth of records.

Args: see parse_pbf_to_geodataframe. out_path is written with the same: schema that parse_pbf_to_geodataframe would produce in a GeoParquet round-trip (columns: source, osm_id, osm_type, name, geometry, plus any extract_keys tag columns).

Returns:: out_path.

openpois.io.overture¶

Download a current US-wide Overture Maps Places snapshot. Uses DuckDB’s httpfs and spatial extensions to query Overture GeoParquet files directly from public S3, filtering by bounding box and L0 taxonomy category. No authentication is required. Auto-detects the latest Overture release date from S3 if a specific date is not pinned.

This module downloads a current/latest Overture Maps Places snapshot for the US + Puerto Rico, filtered to a set of taxonomy categories.

Download strategy: a wide single-query scan of the full US+PR footprint crashed DuckDB on memory-constrained hosts (it materialized 6M+ rows before the spatial filter). This module instead iterates the 16 part-*.parquet files that make up a release, queries each one with a bounded DuckDB session, and writes a plain parquet intermediate per part. Intermediates survive across invocations, so a crashed run can be resumed by re-running the script.

After every part is present on local disk, a single DuckDB COPY applies the exact US+PR polygon filter (reading the boundary via the spatial extension), builds the geometry column with ST_Point, and writes the final GeoParquet without ever materializing rows in Python. The output file is valid GeoParquet (readable by gpd.read_parquet with CRS preserved).

Spatial filter strategy (two-stage, all inside DuckDB):

Per-part WHERE uses predicate pushdown on Overture’s bbox struct column, OR-ing across one or more coarse bboxes. Multiple bboxes are required to capture the Alaskan Near Islands (+172 E) without scanning longitudes the main US bbox would miss.
The final COPY does an exact ST_Within check against the dissolved US+PR polygon to drop Canadian and Mexican border slivers.

Data source: s3://overturemaps-us-west-2/release/ (public, no auth required).

Category filtering uses the taxonomy.hierarchy array. The first element (taxonomy.hierarchy[1] in SQL 1-based indexing) is the L0 category. The deprecated categories.primary field must NOT be used; it is removed in June 2026.

Memory knobs: duckdb_memory_limit and duckdb_threads are per DuckDB connection. workers parallelizes per-part downloads via a ThreadPoolExecutor. Peak host RAM ≈ workers × duckdb_memory_limit and peak CPU ≈ workers × duckdb_threads — scale the per-worker knobs down if raising workers beyond the default.

openpois.io.overture.build_overture_s3_path( release_date: str, bucket: str, ) → str[source]¶

Returns the S3 glob path for all Places Parquet files in a given release.

Parameters:

release_date – Release identifier as returned by get_latest_release_date (e.g., ‘2026-02-18.0’).
bucket – The S3 bucket name.

Returns:

S3 path string suitable for DuckDB read_parquet(), e.g. s3://overturemaps-us-west-2/release/2026-02-18.0/theme=places/type=place/

openpois.io.overture.download_overture_snapshot( output_path: Path, taxonomy_allowlist: list, boundary_gdf: geopandas.GeoDataFrame, coarse_bboxes: list[dict], bucket: str, s3_region: str, release_date: str | None = None, source_label: str = 'overture', duckdb_memory_limit: str = '4GB', duckdb_threads: int = 2, workers: int = 2, ) → Path[source]¶

Downloads filtered Overture Maps Places data and writes it as GeoParquet.

The full-CONUS scan is split across the release’s part-*.parquet files. Each part streams through DuckDB into a plain parquet intermediate (under output_path.parent / ".parts" / <release_date> /); the loop is resumable — if an intermediate already exists, the part is skipped on the next run. After every part is present, a single DuckDB COPY applies the exact US+PR polygon filter and writes the final GeoParquet without materializing rows in Python.

Parameters:

output_path – Path to write the output GeoParquet file.
taxonomy_allowlist – List of (L0, L1) pairs specifying which taxonomy branches to retain. L1 = None means “all L1s under this L0”. Accepts pairs as two-element tuples or lists (YAML). Valid L0 values (from S3 data as of 2026-02-18): ‘food_and_drink’, ‘shopping’, ‘arts_and_entertainment’, ‘sports_and_recreation’, ‘health_care’, ‘services_and_business’, ‘travel_and_transportation’, ‘lifestyle_services’, ‘education’, ‘community_and_government’, ‘cultural_and_historic’, ‘lodging’, ‘geographic_entities’. See: https://docs.overturemaps.org/guides/places/taxonomy/
boundary_gdf – Single-row GeoDataFrame in EPSG:4326 containing the dissolved, buffered US+PR polygon. Used as the exact spatial filter; obtain it from openpois.io.boundary.
coarse_bboxes – List of bbox dicts (keys xmin, ymin, xmax, ymax) used as the DuckDB predicate-pushdown prefilter. Typically obtained from openpois.io.boundary.us_pr_bboxes.
bucket – S3 bucket name hosting Overture releases.
s3_region – AWS region of the S3 bucket.
release_date – Overture release identifier (e.g., ‘2026-02-18.0’). If None, the latest release is fetched automatically.
source_label – Value for the output ‘source’ column.
duckdb_memory_limit – Per-connection DuckDB memory cap (e.g., “4GB”).
duckdb_threads – Per-connection DuckDB thread count.
workers – Number of parts to download in parallel via a ThreadPoolExecutor. Peak host RAM is workers × duckdb_memory_limit and peak CPU is workers × duckdb_threads — scale down the per-worker knobs when increasing workers. Must be >= 1.

Returns:

The output_path of the written GeoParquet file. The file is readable by gpd.read_parquet(path) (with columns=... support) with CRS preserved as OGC:CRS84 (equivalent to EPSG:4326).

Raises:

ValueError – If workers is less than 1, or if the S3 listing is truncated or empty.

openpois.io.overture.get_latest_release_date( bucket: str, ) → str[source]¶

Finds the most recent Overture Maps release date by listing the S3 bucket.

Queries the public S3 HTTP API for prefix listings under the ‘release/’ key and returns the lexicographically largest date string found.

Parameters:

bucket – The S3 bucket name hosting Overture releases.

Returns:

Release date string in the format ‘YYYY-MM-DD.N’ as it appears in S3 (e.g., ‘2026-02-18.0’).

Raises:

requests.HTTPError – If the S3 list request fails.
ValueError – If no release prefixes are found in the bucket.

openpois.io.geohash_partition¶

Utilities for spatially partitioning GeoDataFrames by geohash for efficient web-map viewport queries. Computes geohash columns from geometry centroids, writes Hive-style partitioned Parquet datasets (geohash_prefix=XX/), and sorts rows within each partition by a finer geohash for spatial locality.

openpois.io.source_coop¶

Upload a locally partitioned dataset to Source Cooperative’s S3-compatible storage. Walks the Hive partition directory, uploads each Parquet file under a versioned prefix, and reports the public URL on completion. Credentials come from a JSON file at the repo root (publish.credentials_file).

openpois.io.credentials¶

Load Source Cooperative AWS-compatible credentials from a JSON file. Tokens are short-lived (~1 hour); the loader logs a clear error pointing at the credentials regeneration URL when the file is stale or missing.

Load Source Cooperative temporary AWS credentials from a local .env.json file.

Source Coop credentials are short-lived (issued via the dashboard and scoped to a single repository prefix). The file format is the JSON payload shown in the Source Coop UI — four keys:

{
“aws_access_key_id”: “ASIA…”, “aws_secret_access_key”: “…”, “aws_session_token”: “…”, “region_name”: “us-west-2”

}

If the file has not been touched recently we warn the caller (but do not fail); STS tokens typically last an hour or so.

openpois.io.credentials.load_source_coop_credentials( env_file: Path | str | None = None, ) → dict[source]¶

Read Source Coop temporary AWS credentials from .env.json.

env_file defaults to ~/repos/openpois/.env.json. Raises FileNotFoundError or ValueError with a refresh hint if the file is missing or malformed. Prints a warning if the file’s mtime is older than ~1 hour (tokens may have expired).

models¶

openpois.models.jax_core¶

JAX/BlackJAX helpers: a PRNG factory, a jitted Markov-chain scan, a NUTS sampler with window adaptation, and a vmap-based predictive-draw utility.

openpois.models.model_fitter¶

BlackJAX NUTS fitter for POI change-rate models. Takes an event_rate_fun plus starting parameters as a pytree, runs window-adapted NUTS to draw from the posterior, and produces posterior summaries and predictive distributions of change probability versus time.

openpois.models.osm_models¶

JAX model classes for OSM turnover. ConstantModel and RandomByTypeModel package their own data, priors, and event-rate functions to hand to ModelFitter. Selectable via get_model_class.

JAX-based models for OSM POI turnover rate estimation.

Each model class is self-contained: it ingests a raw observations DataFrame plus a metadata dict, prepares the JAX arrays that ModelFitter needs, and exposes event_rate_fun and param_likelihood as bound instance methods.

The fitted rate is interpreted as a Poisson event rate per observation; the change probability is recovered inside ModelFitter via P = 1 - exp(-rate).

class openpois.models.osm_models.ConstantModel( dataset: pandas.DataFrame, metadata: dict, )[source]¶

Bases: ModelFactory

Constant change rate with ZIE δ mixture.

λ = exp(log_lambda); δ = sigmoid(logit_delta). A fraction δ of individuals change at t = 0 (methodology §1.7); the remaining 1−δ fraction follow Exponential(λ). Only the first interval of each individual carries the (1−δ) discount — see log_likelihood_fun.

Metadata keys:

dt_col: Column containing per-observation interval length in years: (default "tag_years").
log_lambda_prior_scale: Standard deviation of the N(0, scale) prior on: log_lambda (default DEFAULT_LOG_LAMBDA_PRIOR_SCALE).
logit_delta_prior: (loc, scale) tuple for the Normal prior on: logit_delta (default DEFAULT_LOGIT_DELTA_PRIOR = (−3, 1)).

build_model()[source]¶: Define log_lambda and logit_delta parameters + ZIE likelihood.

build_predict_data( times, )[source]¶: Build the data dict passed to ModelFitter.predict for a time grid.

derive_draws( draws, )[source]¶: Expose delta = sigmoid(logit_delta) alongside raw draws.

event_rate_fun( params, data, )[source]¶: Poisson event rate per observation.

param_likelihood( params, )[source]¶: Optional log-prior contribution. Default is flat (0.0).

class openpois.models.osm_models.ModelFactory( dataset: pandas.DataFrame, metadata: dict, )[source]¶

Bases: ABC

Base class for OSM turnover models.

Subclasses must implement build_model() to populate starting_params, param_ids (and group_lookup for random-effects variants), plus any per-observation columns in self.data beyond dt.

__init__( dataset: pandas.DataFrame, metadata: dict, )[source]¶

Parameters:

dataset – Observations DataFrame (already filtered/prepared by prepare_data_for_model).
metadata – Model configuration. Required keys vary by subclass; all subclasses honor dt_col (default "tag_years").

assign_targets()[source]¶: Create self.data['dt'], self.target, and is_first_interval.

abstractmethod build_model()[source]¶: Populate starting_params, param_ids, and any extra data columns.

abstractmethod build_predict_data( times: jax.numpy.ndarray, ) → dict[str, jax.numpy.ndarray][source]¶: Build the data dict passed to ModelFitter.predict for a time grid.

derive_draws( draws: dict[str, jax.numpy.ndarray], ) → dict[str, jax.numpy.ndarray][source]¶

Augment posterior draws with any derived/back-transformed parameters.

Default is an identity map. Override in subclasses that sample a reparameterised form (e.g. non-centered epsilon_raw) to expose the natural parameter (epsilon) to downstream consumers.

abstractmethod event_rate_fun( params: dict[str, jax.numpy.ndarray], data: dict[str, jax.numpy.ndarray], ) → jax.numpy.ndarray[source]¶: Poisson event rate per observation.

param_likelihood( params: dict[str, jax.numpy.ndarray], ) → jax.numpy.ndarray[source]¶: Optional log-prior contribution. Default is flat (0.0).

validate_inputs()[source]¶: Override to validate raw_data / metadata before build_model.

class openpois.models.osm_models.RandomByTypeModel( dataset: pandas.DataFrame, metadata: dict, )[source]¶

Bases: ModelFactory

Random-effects model on both λ and δ, grouped by a shared label.

Per-group log-rate:: log λ_g = log_lambda_0 + ε_g, ε_g ~ N(0, exp(log_sigma)) log_sigma ~ N(var_prior[0], var_prior[1])
Per-group zero-inflated mixture mass on the logit scale:: logit δ_g = logit_delta_0 + η_g, η_g ~ N(0, exp(log_tau)) log_tau ~ N(logit_delta_var_prior[0], logit_delta_var_prior[1])

Two equivalent parameterisations are supported via metadata["reparam"] and apply symmetrically to both ε and η:

"centered" (legacy): the sampler traces epsilon / eta directly. Simple, but the posterior is funnel-shaped near sparse groups and NUTS tends to diverge there.
"non_centered" (default): the sampler traces epsilon_raw ~ N(0, 1) / eta_raw ~ N(0, 1) and we reconstruct epsilon = exp(log_sigma) * epsilon_raw and eta = exp(log_tau) * eta_raw post-hoc. Removes the funnel, usually yielding zero divergences and higher ESS on small groups with no change to well-identified groups.

Under non_centered the natural-scale epsilon and eta draws are exposed via derive_draws along with the per-group logit_delta and delta, so downstream consumers (predict, parameter tables, saved param_draws.csv) see the same derived names either way.

Metadata keys:

group: Column name in raw_data holding the grouping variable. Required. dt_col: Column containing per-observation interval length (default

"tag_years").

var_prior: (loc, scale) tuple for the hyperprior on log_sigma: (default DEFAULT_VAR_PRIOR).
logit_delta_prior: (loc, scale) tuple for the prior on the: intercept logit_delta_0 (default DEFAULT_LOGIT_DELTA_PRIOR).
logit_delta_var_prior: (loc, scale) tuple for the tight hyperprior: on log_tau (default DEFAULT_LOGIT_DELTA_VAR_PRIOR).
reparam: "centered" or "non_centered" (default: "non_centered").

build_model()[source]¶: Encode group IDs and allocate per-group epsilon + logit_delta.

build_predict_data( times, )[source]¶: Build the data dict passed to ModelFitter.predict for a time grid.

derive_draws( draws, )[source]¶: Expose natural-scale epsilon/eta and per-group logit_delta/delta.

event_rate_fun( params, data, )[source]¶: Poisson event rate per observation.

log_1md_fun( params, data, )[source]¶: Per-observation log(1-δ_{g(i)}) for fresh-mode predictions.

param_likelihood( params, )[source]¶: Optional log-prior contribution. Default is flat (0.0).

validate_inputs()[source]¶: Require a valid group metadata entry that is a column in raw_data.

openpois.models.osm_models.get_model_class( model_name: str, ) → type[ModelFactory][source]¶

Return a ModelFactory subclass by name from MODEL_REGISTRY.

Parameters:: model_name – Registry key ("constant" or "random_by_type").
Returns:: The corresponding ModelFactory subclass.
Raises:: ValueError – If model_name is not a registered model.

openpois.models.setup¶

Data-preparation helpers. prepare_data_for_model filters and groups observation records and computes the tag_years elapsed column used as the per-observation interval length in fitting.

Data-preparation helpers for OSM turnover models.

openpois.models.setup.prepare_data_for_model( data: pandas.DataFrame, group_key: str | None = None, group_values: list[str] | None = None, min_value_count: int | None = None, t1_col: str = 'last_obs_timestamp', t2_col: str = 'obs_timestamp', ) → pandas.DataFrame[source]¶

Prepare an observations DataFrame for model fitting.

Per turnover-model-methodology.md §1.2, the per-row Bernoulli-on-Poisson likelihood requires Δ = t_k − t_{k−1} (inter-observation), so the default t1_col is last_obs_timestamp. Multiplying per-row Bernoullis telescopes to the correct individual likelihood. The previous default of last_tag_timestamp would have made Δ the duration since the individual’s start — correct for one-row-per-individual but biased downward on multi-version POIs.

Also emits is_first_interval — True exactly when last_obs_timestamp == last_tag_timestamp, i.e. this row is the first surviving observation of its (POI, name-iteration) individual. Used by the ZIE δ extension (methodology §1.7).

Parameters:

data – Observations DataFrame as returned by format_observations.
group_key – Column name of the grouping variable. If None, no group filtering is applied.
group_values – If provided, only rows with group_key in this list are kept.
min_value_count – If provided, groups with fewer than this many observations are dropped.
t1_col – Name of the start-time timestamp column. Default last_obs_timestamp gives the inter-observation interval.
t2_col – Name of the end-time timestamp column.

Returns:

Filtered DataFrame with additional tag_days, tag_years, and is_first_interval columns.

Raises:

ValueError – If t1_col or t2_col is not present in data.

openpois.models.apply¶

Apply saved change-rate model predictions to a POI snapshot. Loads predictions.csv from a versioned model output directory and builds fast numpy lookup arrays (indexed by group and time step) for both constant and random-effects model variants.

Utilities for applying fitted change-rate model predictions to a POI snapshot.

Provides functions to load saved predictions and build fast numpy lookup arrays for both constant and random-effects model variants.

openpois.models.apply.constant_lookup( pred_df: pandas.DataFrame, ) → numpy.ndarray[source]¶

Build a (101, 3) float64 lookup array for a constant model.

Row index = t2_int (0–100); columns = [p_mean, p_lower, p_upper].

Parameters:: pred_df – Predictions DataFrame returned by load_predictions.
Returns:: Array of shape (101, 3).

openpois.models.apply.group_lookup( pred_df: pandas.DataFrame, ) → tuple[list[str], numpy.ndarray][source]¶

Build a (n_groups, 101, 3) float64 lookup array for a random-effects model.

First axis = group index (sorted alphabetically); second = t2_int (0–100); third = [p_mean, p_lower, p_upper].

Parameters:: pred_df – Predictions DataFrame returned by load_predictions. Must contain a group_name column.
Returns:: Tuple of (ordered_group_names, array of shape (n_groups, 101, 3)).

openpois.models.apply.load_predictions( version_dir: Path, ) → pandas.DataFrame[source]¶

Load predictions.csv from a model version directory.

Adds a t2_int column (t2 * 10, rounded to int) as an integer lookup key.

Parameters:: version_dir – Path to the versioned model output directory containing predictions.csv.
Returns:: DataFrame with columns t1, t2, p_mean, p_lower, p_upper, units, t2_int, and optionally group / group_name for random-effects models.

osm¶

openpois.osm.format_observations¶

Convert raw OSM version histories into modelling-ready observation records. Joins version and change tables to produce one row per element version, with timestamps for the previous and current tag values and a flag indicating whether the configured tag changed at this version.

This module formats OSM changes and versions into observations, which can be more easily queried and statistically analyzed.

openpois.osm.format_observations.format_observations_duckdb( changes_path: Path, versions_path: Path, output_path: Path, tag_key: str, keep_keys: list[str], duckdb_memory_limit: str = '4GB', duckdb_threads: int | None = None, duckdb_temp_dir: Path | None = None, batch_rows: int = 100000, verbose: bool = True, ) → int[source]¶

Stream POI observations from Parquet inputs to Parquet via DuckDB.

DuckDB pivots the long-form osm_changes.parquet wide by tag key, LEFT-joins osm_versions.parquet on (type, id, version), and returns rows sorted by (type, id, version); the sort spills to duckdb_temp_dir past duckdb_memory_limit. A Python scan then iterates the sorted stream through _advance_scan_state(), buffering emitted observations per DuckDB fetch batch and flushing them as pyarrow.Table record batches to a ParquetWriter.

Peak RSS is bounded to roughly duckdb_memory_limit plus one fetch batch of observations, regardless of input size.

Parameters:

changes_path – Input osm_changes.parquet.
versions_path – Input osm_versions.parquet.
output_path – Destination .parquet. Overwritten.
tag_key – Tag key to model (e.g. "name").
keep_keys – Tag keys to retain on each observation. Must not include special characters (validated against [A-Za-z0-9_:]+).
duckdb_memory_limit – DuckDB memory_limit setting. The sort operator spills to disk past this.
duckdb_threads – DuckDB worker thread count. Defaults to os.cpu_count().
duckdb_temp_dir – Sort-spill directory. Defaults to output_path.parent.
batch_rows – Rows pulled per fetchmany call; also the ParquetWriter flush size.
verbose – Print progress.

Returns:

Total number of observation rows written.

openpois.osm.change_plots¶

Kaplan-Meier-style tag stability plots using plotnine. Computes the proportion of tag assignments that remain unchanged over time from observation records, and renders single-panel and faceted multi-panel figures saved as PNG files.