Scheduled Map Rebuild Workflows
Part of the Data Refresh & Automation Pipelines guide.
In geospatial applications, static datasets rarely stay accurate for long. This page explains how to build a deterministic, repeatable pipeline that regenerates map assets — GeoJSON, vector tiles, or raster layers — on a fixed cadence, without manual intervention. The primary audience is frontend and full-stack developers who own a geo-dashboard and need a reliable publish path; GIS analysts who produce the source data but hand off the delivery side; and platform teams who want audit trails, rollback capability, and predictable compute costs baked in from day one. The stack components you will wire together are a cron scheduler (GitHub Actions), a Python geospatial runtime (geopandas, shapely), versioned object storage, and a CDN edge with programmatic purging.
Prerequisites
Before building a production rebuild pipeline, confirm the following are in place:
- Python 3.10+ with geopandas>=0.14, shapely>=2.0, and jsonschema>=4.21 installed in a reproducible environment (requirements.txt or pyproject.toml)
- Source data access
- CI/CD scheduler
- Versioned object storage
- CDN with programmatic purge API: CloudFront, Cloudflare, or Fastly. Without the ability to purge manifest.json on demand, your cache invalidation strategy is incomplete.
- Frontend mapping library (MapLibre GL JS, Leaflet, or OpenLayers)
- Familiarity with the RFC 7946 GeoJSON specification, specifically coordinate order ([lon, lat])
Step 1: Trigger & Data Fetch
The scheduler fires at a predetermined interval — 0 2 * * * for 02:00 UTC is a common choice because it lands after most business-hour writes but before morning dashboards open. Implement schema validation immediately on the raw payload: reject malformed data before any expensive geospatial operation begins. Use SHA-256 checksums or HTTP ETags to detect whether the source has actually changed since the last successful run; if the checksum matches, exit early and avoid unnecessary CDN churn.
# scripts/fetch_data.py
import hashlib
import sys
from pathlib import Path

import httpx

SOURCE_URL = "https://data.example.com/features.geojson"
CHECKSUM_FILE = Path(".last_checksum")
OUTPUT_FILE = Path("raw.geojson")


def fetch() -> None:
    response = httpx.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    payload = response.content
    digest = hashlib.sha256(payload).hexdigest()

    if CHECKSUM_FILE.exists() and CHECKSUM_FILE.read_text().strip() == digest:
        print("Source unchanged, skipping rebuild.")
        sys.exit(0)

    OUTPUT_FILE.write_bytes(payload)
    CHECKSUM_FILE.write_text(digest)
    print(f"Fetched {len(payload):,} bytes sha256={digest[:12]}…")


if __name__ == "__main__":
    fetch()
If the source is a PostGIS database rather than an HTTP endpoint, replace the httpx call with a geopandas.read_postgis() call and serialise to GeoJSON with gdf.to_file("raw.geojson", driver="GeoJSON").
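The schema check described at the start of this step could look like the following minimal sketch. It uses only the standard library and checks structure rather than a full schema; the function name check_feature_collection and the exact rules are illustrative assumptions. A production pipeline would more likely validate against a JSON Schema with the jsonschema package listed in the prerequisites.

```python
import json

GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def check_feature_collection(raw: bytes) -> list[str]:
    """Return structural problems; an empty list means the payload passed."""
    try:
        data = json.loads(raw)
    except ValueError as exc:  # covers JSON and encoding errors
        return [f"not valid JSON: {exc}"]

    if not isinstance(data, dict) or data.get("type") != "FeatureCollection":
        return ["top-level object is not a FeatureCollection"]
    features = data.get("features")
    if not isinstance(features, list):
        return ["'features' is missing or not a list"]

    problems = []
    for i, feature in enumerate(features):
        if not isinstance(feature, dict) or feature.get("type") != "Feature":
            problems.append(f"features[{i}] is not a Feature")
            continue
        geometry = feature.get("geometry")
        # RFC 7946 allows a null geometry, so only non-null ones are checked
        if geometry is not None and (
            not isinstance(geometry, dict)
            or geometry.get("type") not in GEOMETRY_TYPES
        ):
            problems.append(f"features[{i}] has an unknown geometry type")
    return problems
```

Calling this on response.content before writing raw.geojson lets the fetch step fail fast, before any geopandas work starts.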
Step 2: Geospatial Processing & Optimisation
Raw features are rarely optimised for web delivery. Apply topology cleaning, coordinate rounding, and Douglas–Peucker simplification to reduce payload size without sacrificing visual fidelity. Strip unused attributes early — every extra property key adds up across thousands of features.
If the output will be served as vector tiles (see Tile vs Vector Rendering Strategies), enforce consistent zoom-level clipping and tile boundary snapping here to prevent rendering artefacts.
# scripts/process.py
import json
from pathlib import Path

import geopandas as gpd
from shapely.validation import make_valid

KEEP_FIELDS = {"id", "name", "category", "value"}
SIMPLIFY_TOLERANCE = 0.0001  # degrees; adjust per zoom target
COORD_PRECISION = 6  # decimal places


def process(input_path: Path, output_path: Path) -> None:
    gdf = gpd.read_file(input_path)

    # Ensure WGS 84 (EPSG:4326), which RFC 7946 requires
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    # Repair invalid geometries before simplification
    gdf["geometry"] = gdf["geometry"].apply(make_valid)

    # Drop null and empty geometries
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]

    # Simplify
    gdf["geometry"] = gdf["geometry"].simplify(
        SIMPLIFY_TOLERANCE, preserve_topology=True
    )

    # Keep only required attributes
    drop_cols = [c for c in gdf.columns if c not in KEEP_FIELDS | {"geometry"}]
    gdf = gdf.drop(columns=drop_cols)

    # Write GeoJSON, then round coordinates in the raw JSON to reduce file size
    gdf.to_file(output_path, driver="GeoJSON")
    data = json.loads(output_path.read_text())
    for feature in data["features"]:
        _round_coords(feature["geometry"], COORD_PRECISION)
    output_path.write_text(json.dumps(data, separators=(",", ":")))
    print(f"Processed {len(gdf)} features → {output_path.stat().st_size:,} bytes")


def _round_coords(geom: dict, precision: int) -> None:
    """Round every coordinate of a GeoJSON geometry dict in place."""
    if geom["type"] == "GeometryCollection":
        # make_valid can emit collections; round each member geometry
        for member in geom["geometries"]:
            _round_coords(member, precision)
    elif geom["type"] == "Point":
        geom["coordinates"] = [round(v, precision) for v in geom["coordinates"]]
    elif geom["type"] in {"LineString", "MultiPoint"}:
        geom["coordinates"] = [
            [round(v, precision) for v in pt] for pt in geom["coordinates"]
        ]
    elif geom["type"] in {"Polygon", "MultiLineString"}:
        geom["coordinates"] = [
            [[round(v, precision) for v in pt] for pt in ring]
            for ring in geom["coordinates"]
        ]
    elif geom["type"] == "MultiPolygon":
        geom["coordinates"] = [
            [[[round(v, precision) for v in pt] for pt in ring] for ring in poly]
            for poly in geom["coordinates"]
        ]


if __name__ == "__main__":
    process(Path("raw.geojson"), Path("optimised.geojson"))
Parallelise across geographic partitions or feature ID ranges if the dataset is large — Python’s concurrent.futures.ProcessPoolExecutor or a simple multiprocessing.Pool distributes the make_valid + simplify workload across CPU cores.
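To make the Douglas–Peucker simplification used in this step less of a black box, here is a minimal pure-Python sketch of the idea. It is illustrative only: the function names are hypothetical, distances are planar and measured to the chord segment, and the pipeline itself relies on the simplify() call shown in process.py rather than on this code.

```python
import math


def _point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop vertices that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord joining the endpoints
    index, farthest = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > farthest:
            index, farthest = i, d
    if farthest <= tolerance:
        return [points[0], points[-1]]
    # Keep that vertex and simplify both halves recursively
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

A larger tolerance removes more vertices, which is why SIMPLIFY_TOLERANCE should be tuned against the lowest zoom level the layer is expected to serve.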
Step 3: Validation & Quality Assurance
Never publish an unvalidated output. Frontend rendering engines silently drop invalid geometries, which produces confusing blank areas that are difficult to debug in production. Run all checks before touching cloud storage:
# scripts/validate.py
import sys
from pathlib import Path

import geopandas as gpd
from shapely.validation import explain_validity

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 50 MB
REQUIRED_FIELDS = {"id", "name", "category"}
EXPECTED_BBOX = (-180, -90, 180, 90)  # adjust to your region


def validate(path: Path) -> None:
    errors: list[str] = []

    # Size guard
    size = path.stat().st_size
    if size > MAX_UNCOMPRESSED_BYTES:
        errors.append(f"File too large: {size:,} bytes (limit {MAX_UNCOMPRESSED_BYTES:,})")

    gdf = gpd.read_file(path)

    # CRS guard
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        errors.append(f"Wrong CRS: {gdf.crs.to_epsg()} (expected 4326)")

    # Geometry validity
    invalid = gdf[~gdf.geometry.is_valid]
    for idx, row in invalid.iterrows():
        errors.append(f"Invalid geometry at index {idx}: {explain_validity(row.geometry)}")

    # Attribute completeness
    missing_cols = REQUIRED_FIELDS - set(gdf.columns)
    if missing_cols:
        errors.append(f"Missing required fields: {missing_cols}")

    # Null-value check on required fields
    for col in REQUIRED_FIELDS & set(gdf.columns):
        nulls = gdf[col].isna().sum()
        if nulls:
            errors.append(f"Column '{col}' has {nulls} null value(s)")

    # Spatial bounds
    minx, miny, maxx, maxy = gdf.total_bounds
    if not (
        EXPECTED_BBOX[0] <= minx and maxx <= EXPECTED_BBOX[2]
        and EXPECTED_BBOX[1] <= miny and maxy <= EXPECTED_BBOX[3]
    ):
        errors.append(f"Bounds outside expected range: {gdf.total_bounds.tolist()}")

    if errors:
        for e in errors:
            print(f"FAIL {e}", file=sys.stderr)
        sys.exit(1)

    print(f"OK {len(gdf)} features · {size:,} bytes · bounds {gdf.total_bounds.tolist()}")


if __name__ == "__main__":
    validate(Path(sys.argv[1]))
A failure here should trigger an alert and halt the pipeline entirely. Retain the failed output in a quarantine prefix (e.g. assets/quarantine/<timestamp>/) for post-mortem inspection.
Step 4: Atomic Deployment & Versioning
Publishing must be atomic to prevent partial updates reaching frontend consumers. Upload all assets to a new timestamped directory, verify each object’s ETag against the local checksum, and only then overwrite manifest.json. Frontends read manifest.json at initialisation and immediately get a consistent, complete dataset.
# scripts/deploy.py
import datetime
import hashlib
import json
import sys
from pathlib import Path

import boto3

BUCKET = "my-geo-assets"
LOCAL_OUTPUT = Path("optimised.geojson")


def deploy() -> None:
    s3 = boto3.client("s3")
    version = datetime.date.today().isoformat()  # e.g. "2026-06-23"
    prefix = f"assets/{version}"
    key = f"{prefix}/optimised.geojson"

    # Upload asset
    s3.upload_file(
        str(LOCAL_OUTPUT),
        BUCKET,
        key,
        ExtraArgs={
            "ContentType": "application/geo+json",
            "CacheControl": "public, max-age=86400",
        },
    )

    # Verify upload integrity via ETag.
    # The ETag equals the MD5 only for single-part uploads without SSE-KMS.
    head = s3.head_object(Bucket=BUCKET, Key=key)
    remote_etag = head["ETag"].strip('"')
    local_md5 = hashlib.md5(LOCAL_OUTPUT.read_bytes()).hexdigest()
    if remote_etag != local_md5:
        print("ETag mismatch: aborting manifest update", file=sys.stderr)
        sys.exit(1)

    # Atomically update the manifest
    manifest = {
        "version": version,
        "url": f"https://cdn.example.com/{key}",
        "featureCount": _count_features(LOCAL_OUTPUT),
    }
    s3.put_object(
        Bucket=BUCKET,
        Key="manifest.json",
        Body=json.dumps(manifest).encode(),
        ContentType="application/json",
        CacheControl="no-cache, no-store, must-revalidate",
    )
    print(f"Deployed version {version}: {manifest}")


def _count_features(path: Path) -> int:
    data = json.loads(path.read_text())
    return len(data.get("features", []))


if __name__ == "__main__":
    deploy()
Keep the last three versioned prefixes in storage at all times. Rolling back is then a single manifest update pointing at the previous timestamp — no reprocessing required.
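The retention rule can be expressed as a small pure function that decides which prefixes are safe to delete. This is a sketch under assumptions: versions are the ISO-date strings used by deploy.py, the function name is hypothetical, and the version referenced by the live manifest is always retained even if it is older than the last three.

```python
def prefixes_to_prune(versions: list[str], current: str, keep: int = 3) -> list[str]:
    """Return versions that are safe to delete, oldest first."""
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    ordered = sorted(set(versions))
    # Retain the newest `keep` versions plus whatever the manifest points at
    retained = set(ordered[-keep:]) | {current}
    return [v for v in ordered if v not in retained]
```

Keeping the manifest's version out of the prune list matters after a rollback, when the live version is no longer the newest one in storage.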
Step 5: Cache Busting & Frontend Sync
The final step synchronises CDN edges and browser caches with the newly published assets. Combine URL-fingerprinting for long-lived tile assets with an explicit purge of manifest.json, which carries a no-cache directive and must always be fetched fresh. This is the operational side of cache invalidation strategies applied specifically to map layers.
# .github/workflows/nightly-rebuild.yml
name: Nightly Map Rebuild
on:
  schedule:
    - cron: "0 2 * * *"
  workflow_dispatch:
env:
  AWS_REGION: us-east-1
jobs:
  rebuild:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Fetch source data
        run: python scripts/fetch_data.py
        # exits 0 silently if source unchanged (checksum match)
      - name: Process & optimise
        run: python scripts/process.py
      - name: Validate output
        run: python scripts/validate.py optimised.geojson
      - name: Deploy atomically
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: python scripts/deploy.py
      - name: Purge CDN cache for manifest
        env:
          CF_ZONE_ID: ${{ secrets.CF_ZONE_ID }}
          CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
        run: |
          curl -s -X POST \
            "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
            -H "Authorization: Bearer $CF_API_TOKEN" \
            -H "Content-Type: application/json" \
            --data '{"files":["https://cdn.example.com/manifest.json"]}' \
            | jq -e '.success'
      - name: Notify on failure
        if: failure()
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            --data "{\"text\":\"Nightly map rebuild failed on $(date -u +%Y-%m-%dT%H:%M:%SZ)\"}"
For tile-based outputs, purge only the affected tile grid cells rather than the entire origin — most CDN APIs accept a list of exact URLs, so enumerate only the tiles that cover the dataset’s bounding box.
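Enumerating the tiles that cover a bounding box can be sketched with the standard XYZ (slippy map) tile formula. The function names and the URL template below are hypothetical, and the sketch assumes tiles are addressed as {z}/{x}/{y} in Web Mercator; adapt it to whatever tiling scheme your tile generator actually uses.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Convert a WGS 84 position to XYZ tile indices at the given zoom."""
    n = 2 ** zoom
    lat = max(-85.05112878, min(85.05112878, lat))  # Web Mercator latitude limit
    x = math.floor((lon + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so lon=180 or the southern limit stays inside the grid
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_urls_for_bbox(bbox, zoom: int, template: str) -> list[str]:
    """List tile URLs covering bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    # Tile y grows southwards, so the southern edge gives the largest y
    x_min, y_max = lonlat_to_tile(min_lon, min_lat, zoom)
    x_max, y_min = lonlat_to_tile(max_lon, max_lat, zoom)
    return [
        template.format(z=zoom, x=x, y=y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]
```

The resulting list can be batched into the "files" array of the purge request; check your CDN's per-request URL limit before sending large batches.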
Verification & Smoke-Test
After each run, confirm the pipeline produced a live, fresh layer before considering the run complete:
# 1. Fetch the manifest and confirm the version field matches today's date
curl -sf https://cdn.example.com/manifest.json | jq '.version'
# 2. Download the asset URL from the manifest and count features
ASSET_URL=$(curl -sf https://cdn.example.com/manifest.json | jq -r '.url')
curl -sf "$ASSET_URL" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f'{len(d[\"features\"])} features · type={d[\"type\"]}')
"
# 3. Confirm the manifest is not being served from a stale cache
curl -sI https://cdn.example.com/manifest.json | grep -iE '^(cache-control|age|cf-cache-status):'
# 4. Browser devtools: open the dashboard, open Network tab, reload, filter
# for manifest.json — confirm Status 200, no "from disk cache", and
# the feature layer reloads with the expected bounding box.
In your mapping library, register a console log on the source load event:
// MapLibre GL JS example
map.on('sourcedata', (e) => {
  if (e.sourceId === 'geo-features' && e.isSourceLoaded) {
    const src = map.getSource('geo-features');
    console.info('Layer loaded:', src.serialize());
  }
});
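The manual manifest checks above could also be scripted. The sketch below inspects an already-parsed manifest.json using the three fields written by deploy.py (version, url, featureCount); the function name and the max_age_days threshold are assumptions, and the threshold should be relaxed if the fetch step regularly skips rebuilds because the source is unchanged.

```python
from datetime import date


def manifest_problems(manifest: dict, today: date, max_age_days: int = 1) -> list[str]:
    """Return human-readable problems found in a parsed manifest.json."""
    try:
        version = date.fromisoformat(str(manifest.get("version")))
    except ValueError:
        return [f"version is not an ISO date: {manifest.get('version')!r}"]

    problems = []
    age = (today - version).days
    if age > max_age_days:
        problems.append(f"manifest is {age} day(s) old")
    # The asset URL should live under the versioned prefix it claims
    if f"/assets/{version.isoformat()}/" not in str(manifest.get("url", "")):
        problems.append("url does not point at the versioned prefix")
    if not manifest.get("featureCount"):
        problems.append("featureCount is missing or zero")
    return problems
```

An empty list means the manifest looks fresh and internally consistent; anything else is worth surfacing in the same alert channel as pipeline failures.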
Troubleshooting
Why does manifest.json still serve the old version after a successful build?
The CDN cached the manifest with a max-age set from a previous upload that lacked Cache-Control: no-cache, no-store, must-revalidate. Upload the manifest again with the correct header, then trigger an explicit edge purge via the CDN’s API. Going forward, always set no-cache on manifest.json at upload time — the individual asset files under assets/<version>/ can carry a long max-age because their URLs are immutable.
The pipeline succeeds but Leaflet renders an empty layer — what went wrong?
The most common cause is a coordinate-axis flip: the output GeoJSON has coordinates in [lat, lon] order rather than the [lon, lat] order that RFC 7946 requires. This happens when the source follows the official EPSG:4326 axis order, which is latitude-first (some WFS services and hand-built exports do this). Note that to_crs(epsg=4326) reprojects between coordinate systems but does not reorder the axes of data that is already labelled EPSG:4326, so a flipped source has to be corrected by swapping each coordinate pair explicitly in the processing step. Verify the output by checking the first feature’s coordinates against a known point on a map.
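If the coordinates are confirmed to be latitude-first, one way to repair them is to swap each position in the GeoJSON coordinates array. The helpers below are a minimal sketch with hypothetical names; looks_flipped is only a heuristic, since a swapped pair whose values both fall within ±90 cannot be detected from the numbers alone.

```python
def swap_axes(coords):
    """Recursively swap the first two values of every position in a coordinates array."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [lat, lon, *rest] -> [lon, lat, *rest]
        return [coords[1], coords[0], *coords[2:]]
    return [swap_axes(child) for child in coords]


def looks_flipped(position) -> bool:
    """Heuristic: in [lon, lat] order the second value can never exceed 90 degrees."""
    return abs(position[1]) > 90
```

Applying swap_axes to geometry["coordinates"] works for every geometry type except GeometryCollection, whose members would need to be visited one by one.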
How do I skip the rebuild when source data has not changed?
Store the SHA-256 checksum of the last successful fetch in a persistent location, such as a small file in S3 or a GitHub Actions cache; a file written to the runner’s working directory does not survive between scheduled runs. At the top of the fetch step, compare the current checksum against the stored value. If they match, exit the step with code 0 and set a step output (skip=true); downstream steps are then skipped with an if: steps.fetch.outputs.skip != 'true' condition, which requires the fetch step to declare id: fetch.
The GitHub Actions job times out on large tile generation — how do I fix this?
Partition the tile generation by bounding box or zoom range and run each partition as a parallel matrix job. Set timeout-minutes on the individual jobs rather than globally so partial failures surface immediately. Cache the intermediate processed GeoJSON using actions/cache so that retrying a failed matrix job only re-runs the tile generation for that partition, not the full fetch-and-process chain.
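Partitioning by bounding box can be as simple as cutting the dataset extent into a grid and handing one cell to each matrix job. The helper below is a sketch with a hypothetical name; it splits in plain longitude and latitude, which is an assumption that suits roughly uniform data and may need weighting for datasets clustered in a few areas.

```python
def split_bbox(bbox, cols: int, rows: int) -> list[tuple[float, float, float, float]]:
    """Split (min_x, min_y, max_x, max_y) into cols x rows equal cells, row by row."""
    min_x, min_y, max_x, max_y = bbox
    width = (max_x - min_x) / cols
    height = (max_y - min_y) / rows
    return [
        (
            min_x + c * width,
            min_y + r * height,
            min_x + (c + 1) * width,
            min_y + (r + 1) * height,
        )
        for r in range(rows)
        for c in range(cols)
    ]
```

Each cell can be serialised into the workflow matrix (for example as a JSON list) so that every job generates tiles only for its own extent.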
How do I roll back to the previous build without rerunning the full pipeline?
Because assets are immutable under their timestamped prefix, a rollback is just a manifest update. Write a one-shot script that reads the second-most-recent prefix from S3, constructs a new manifest.json pointing at it, and uploads with no-cache. Then purge the CDN edge. The entire operation takes under ten seconds and requires no reprocessing.
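The core of such a rollback script is choosing the target version and rebuilding the manifest. The sketch below assumes ISO-date version prefixes and the assets/<version>/optimised.geojson layout used by deploy.py; the function name and the cdn_base parameter are hypothetical, and featureCount is left out because it would have to be read from the older asset.

```python
def rollback_manifest(
    versions: list[str],
    current: str,
    cdn_base: str,
    filename: str = "optimised.geojson",
) -> dict:
    """Build a manifest pointing at the newest version older than `current`."""
    # ISO dates compare chronologically as plain strings
    older = sorted(v for v in set(versions) if v < current)
    if not older:
        raise ValueError("no earlier version available to roll back to")
    target = older[-1]
    return {"version": target, "url": f"{cdn_base}/assets/{target}/{filename}"}
```

The returned dictionary would be uploaded as manifest.json with the same no-cache headers as a normal deploy, followed by the CDN purge.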
Gotchas & Edge Cases
- Axis-order bugs: PostGIS and geopandas both store coordinates as (x, y), which for EPSG:4326 means (longitude, latitude), so geopandas.read_postgis() normally returns the order RFC 7946 expects. The flip appears when a source follows the official EPSG:4326 latitude-first order. Calling .to_crs(epsg=4326) on data that is already labelled WGS 84 does not swap axes, so flipped geometries still appear in the wrong hemisphere until each coordinate pair is swapped explicitly.
- CDN Cache-Control header conflicts: If a reverse proxy between the origin and CDN strips or overrides Cache-Control headers, the no-cache directive on manifest.json will not reach the CDN edge. Verify the header survives by running curl -sI against the CDN URL and checking the raw response, not the browser’s cached view.
- ETag comparison on multipart uploads: S3’s ETag for objects uploaded via multipart upload is not a plain MD5; it is md5(concat(part_md5s))-N, where N is the part count. When files exceed the multipart threshold (8 MB by default in boto3), use the ChecksumSHA256 field from s3api head-object instead of the ETag for integrity verification. That field is only returned when the object was uploaded with a SHA-256 checksum and the request enables checksum mode.
- Partial state during atomic deploys: If the job fails between uploading the asset and updating manifest.json, the orphaned asset sits in storage but the manifest still points at the previous version, which is exactly the correct behaviour. Add a cleanup cron that prunes asset prefixes older than 7 days that are not referenced by the current manifest.
- Silent geometry drops in MapLibre GL JS: MapLibre silently skips features with geometries it cannot parse. Enable the browser console and look for [MapLibre GL] Feature index out of range or invalid geometry type warnings during dashboard load. The validation step in the pipeline should catch these upstream, but always verify in the browser as a final smoke-test.
- Timezone drift in cron expressions: GitHub Actions runs cron jobs in UTC. If your source data is updated at a business-hours boundary in a non-UTC timezone, schedule the rebuild trigger at least two hours after the expected source update to avoid a race between source write and pipeline fetch.
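For the multipart ETag gotcha, one alternative is to send a SHA-256 checksum with the upload and let S3 verify it. The sketch below is illustrative: the function names are hypothetical, s3_client is assumed to be a boto3 S3 client created elsewhere, and the upload uses a single put_object request, which suits files that comfortably fit in memory.

```python
import base64
import hashlib
from pathlib import Path


def sha256_base64(payload: bytes) -> str:
    """Base64-encoded SHA-256 digest, the encoding S3 uses for ChecksumSHA256."""
    return base64.b64encode(hashlib.sha256(payload).digest()).decode("ascii")


def upload_with_checksum(s3_client, bucket: str, key: str, path: Path) -> str:
    """Upload in one request; S3 rejects the object if the digest does not match."""
    payload = path.read_bytes()
    checksum = sha256_base64(payload)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=payload,
        ContentType="application/geo+json",
        ChecksumSHA256=checksum,
    )
    return checksum
```

Because the object now carries a stored SHA-256 checksum, a later head_object call with ChecksumMode="ENABLED" can return it for comparison, independent of how the ETag was computed.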
Scheduled vs. Event-Driven Architectures
Choosing between scheduled and reactive patterns depends on data volatility, user expectations, and infrastructure constraints. Scheduled rebuilds excel when data changes predictably, batch processing reduces costs, and frontend consumers can tolerate minor latency. They simplify debugging, enable comprehensive QA gates, and integrate cleanly with traditional CI/CD practices.
If your application requires sub-minute data freshness — live fleet tracking, emergency response routing, or IoT sensor dashboards — webhook-triggered updates or stream-processing pipelines are more appropriate. Event-driven architectures introduce higher complexity around deduplication, ordering guarantees, and partial state management, but they eliminate the staleness window entirely.
Many mature platforms implement a hybrid model: scheduled nightly rebuilds establish a clean, fully validated baseline, while webhooks apply incremental patches during business hours. This hybrid feeds naturally into incremental data processing patterns where only the changed features are re-validated and re-published, keeping compute costs proportional to the change rate rather than the full dataset size.
Related
- Data Refresh & Automation Pipelines — parent guide covering the full automation stack
- Cache Invalidation Strategies — URL fingerprinting, ETag rotation, and CDN purge patterns for map layers
- Webhook-Triggered Updates — event-driven alternative for sub-minute freshness requirements
- Incremental Data Processing — process only changed features to reduce compute cost between full rebuilds
- Automating Nightly GeoJSON Rebuilds with GitHub Actions — deep-dive on environment variable management, parallel tile generation, and Slack notifications
- Tile vs Vector Rendering Strategies — choosing raster vs vector tile output affects the processing and deploy steps above