Scheduled Map Rebuild Workflows

Part of the Data Refresh & Automation Pipelines guide.

In geospatial applications, static datasets rarely stay accurate for long. This page explains how to build a deterministic, repeatable pipeline that regenerates map assets — GeoJSON, vector tiles, or raster layers — on a fixed cadence, without manual intervention. The primary audience is frontend and full-stack developers who own a geo-dashboard and need a reliable publish path; GIS analysts who produce the source data but hand off the delivery side; and platform teams who want audit trails, rollback capability, and predictable compute costs baked in from day one. The stack components you will wire together are a cron scheduler (GitHub Actions), a Python geospatial runtime (geopandas, shapely), versioned object storage, and a CDN edge with programmatic purging.


[Figure: Scheduled Map Rebuild Pipeline. Five-stage linear pipeline, left to right: Cron Trigger (0 2 * * *) → Fetch & Validate (checksum + schema) → Process & Optimise (simplify, clean) → Atomic Deploy (manifest.json swap) → Cache Sync (CDN purge, reload). A validation failure triggers an alert and halts the pipeline with no publish.]

Prerequisites

Before building a production rebuild pipeline, confirm the following are in place:

  • Python 3.10+ with geopandas>=0.14, shapely>=2.0, and jsonschema>=4.21 installed in a reproducible environment (requirements.txt or pyproject.toml)
  • Source data access
  • CI/CD scheduler
  • Versioned object storage
  • CDN with programmatic purge API — CloudFront, Cloudflare, or Fastly. Without the ability to purge manifest.json on demand, your cache invalidation strategy
  • Frontend mapping library (MapLibre GL JS, Leaflet, or OpenLayers)
  • Familiarity with the RFC 7946 GeoJSON specification, specifically coordinate order ([lon, lat])

Step 1: Trigger & Data Fetch

The scheduler fires at a predetermined interval — 0 2 * * * for 02:00 UTC is a common choice because it lands after most business-hour writes but before morning dashboards open. Implement schema validation immediately on the raw payload: reject malformed data before any expensive geospatial operation begins. Use SHA-256 checksums or HTTP ETags to detect whether the source has actually changed since the last successful run; if the checksum matches, exit early and avoid unnecessary CDN churn.

# scripts/fetch_data.py
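import hashlib
import os
import sys
from pathlib import Path

import httpx

SOURCE_URL = "https://data.example.com/features.geojson"
# Must persist between runs (e.g. S3 or a CI cache); a fresh checkout has no copy
CHECKSUM_FILE = Path(".last_checksum")
OUTPUT_FILE = Path("raw.geojson")


def _set_skip_flag(skip: bool) -> None:
    """Expose skip=true|false to later GitHub Actions steps; no-op elsewhere."""
    github_output = os.environ.get("GITHUB_OUTPUT")
    if github_output:
        with open(github_output, "a", encoding="utf-8") as fh:
            fh.write(f"skip={'true' if skip else 'false'}\n")


def fetch() -> None:
    response = httpx.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    payload = response.content

    # Compare against the digest of the last fetched payload
    digest = hashlib.sha256(payload).hexdigest()
    if CHECKSUM_FILE.exists() and CHECKSUM_FILE.read_text().strip() == digest:
        print("Source unchanged, skipping rebuild.")
        _set_skip_flag(True)
        sys.exit(0)

    OUTPUT_FILE.write_bytes(payload)
    CHECKSUM_FILE.write_text(digest)
    _set_skip_flag(False)
    print(f"Fetched {len(payload):,} bytes  sha256={digest[:12]}…")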


if __name__ == "__main__":
    fetch()

If the source is a PostGIS database rather than an HTTP endpoint, replace the httpx call with a geopandas.read_postgis() call and serialise to GeoJSON with gdf.to_file("raw.geojson", driver="GeoJSON").
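Step 1 asks for validation of the raw payload before any geospatial work starts, which the fetch script above does not do on its own. The following is a minimal, dependency-free sketch of such a gate. The function name check_feature_collection and the required property names are illustrative assumptions; a production pipeline would more likely express the same rules as a JSON Schema and check them with jsonschema.

```python
import json

# Assumed rules for illustration; adjust to your own data contract.
REQUIRED_PROPERTIES = {"id", "name", "category"}
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def check_feature_collection(payload: bytes) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    try:
        data = json.loads(payload)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        return [f"not valid JSON: {exc}"]

    if not isinstance(data, dict) or data.get("type") != "FeatureCollection":
        return ["top-level object is not a FeatureCollection"]
    features = data.get("features")
    if not isinstance(features, list):
        return ["'features' is missing or not a list"]

    problems = []
    for i, feature in enumerate(features):
        if not isinstance(feature, dict) or feature.get("type") != "Feature":
            problems.append(f"feature {i}: not a Feature object")
            continue
        geometry = feature.get("geometry")
        # RFC 7946 allows a null geometry, so only non-null values are checked
        if geometry is not None and (
            not isinstance(geometry, dict)
            or geometry.get("type") not in GEOMETRY_TYPES
        ):
            problems.append(f"feature {i}: unknown geometry type")
        properties = feature.get("properties")
        if not isinstance(properties, dict):
            properties = {}
        missing = REQUIRED_PROPERTIES - set(properties)
        if missing:
            problems.append(f"feature {i}: missing properties {sorted(missing)}")
    return problems
```

Called on response.content before raw.geojson is written, a non-empty result would be the signal to print the problems and exit non-zero.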


Step 2: Geospatial Processing & Optimisation

Raw features are rarely optimised for web delivery. Apply topology cleaning, coordinate rounding, and Douglas–Peucker simplification to reduce payload size without sacrificing visual fidelity. Strip unused attributes early — every extra property key adds up across thousands of features.

If the tile vs vector rendering strategy you chose is based on vector tiles, enforce consistent zoom-level clipping and tile boundary snapping here to prevent rendering artefacts.

# scripts/process.py
import json
from pathlib import Path

import geopandas as gpd
from shapely.validation import make_valid

KEEP_FIELDS = {"id", "name", "category", "value"}
SIMPLIFY_TOLERANCE = 0.0001   # degrees — adjust per zoom target
COORD_PRECISION = 6           # decimal places


def process(input_path: Path, output_path: Path) -> None:
    gdf = gpd.read_file(input_path)

    # Ensure WGS 84 (EPSG:4326) — RFC 7946 requires it
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    # Repair invalid geometries before simplification
    gdf["geometry"] = gdf["geometry"].apply(make_valid)

    # Drop null geometries
    gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]

    # Simplify
    gdf["geometry"] = gdf["geometry"].simplify(
        SIMPLIFY_TOLERANCE, preserve_topology=True
    )

    # Keep only required attributes
    drop_cols = [c for c in gdf.columns if c not in KEEP_FIELDS | {"geometry"}]
    gdf = gdf.drop(columns=drop_cols)

    # Write GeoJSON; coordinates are still at full precision at this point
    gdf.to_file(output_path, driver="GeoJSON")

    # Truncate coordinate precision in the raw JSON to reduce file size
    data = json.loads(output_path.read_text())
    for feature in data["features"]:
        _round_coords(feature["geometry"], COORD_PRECISION)
    output_path.write_text(json.dumps(data, separators=(",", ":")))

    print(f"Processed {len(gdf)} features → {output_path.stat().st_size:,} bytes")


def _round_coords(geom: dict | None, precision: int) -> None:
    """Round every coordinate of a GeoJSON geometry in place."""
    if geom is None:
        return
    if geom["type"] == "GeometryCollection":
        # make_valid can emit GeometryCollections; recurse into each member
        for member in geom["geometries"]:
            _round_coords(member, precision)
        return
    geom["coordinates"] = _round_nested(geom["coordinates"], precision)


def _round_nested(coords: list, precision: int) -> list:
    # A position is a flat list of numbers; anything else is a list of lists
    if coords and isinstance(coords[0], (int, float)):
        return [round(v, precision) for v in coords]
    return [_round_nested(c, precision) for c in coords]


if __name__ == "__main__":
    process(Path("raw.geojson"), Path("optimised.geojson"))

Parallelise across geographic partitions or feature ID ranges if the dataset is large — Python’s concurrent.futures.ProcessPoolExecutor or a simple multiprocessing.Pool distributes the make_valid + simplify workload across CPU cores.
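For readers who want to see what the simplification tolerance actually controls, here is a plain-Python illustration of Douglas–Peucker. It is a teaching sketch, not the implementation behind GeoSeries.simplify, and the function names are hypothetical. It treats coordinates as planar, so the tolerance is in the same units as the input (degrees for EPSG:4326), and it measures distance to the chord segment rather than to the infinite line.

```python
import math

Point = tuple[float, float]


def _segment_distance(p: Point, a: Point, b: Point) -> float:
    """Planar distance from p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points: list[Point], tolerance: float) -> list[Point]:
    """Drop vertices that deviate from the simplified line by at most tolerance."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = _segment_distance(points[i], first, last)
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= tolerance:
        # Every intermediate vertex is close enough to the chord
        return [first, last]
    # Keep the farthest vertex and simplify both halves independently
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.00001), (2, 0), (3, 1), (4, 0)]
print(douglas_peucker(line, 0.0001))
# [(0, 0), (2, 0), (3, 1), (4, 0)]
```

The near-collinear vertex at (1, 0.00001) is removed because it sits within the 0.0001 tolerance, while the peak at (3, 1) survives.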


Step 3: Validation & Quality Assurance

Never publish an unvalidated output. Frontend rendering engines silently drop invalid geometries, which produces confusing blank areas that are difficult to debug in production. Run all checks before touching cloud storage:

# scripts/validate.py
import sys
from pathlib import Path

import geopandas as gpd
from shapely.validation import explain_validity

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024   # 50 MB
REQUIRED_FIELDS = {"id", "name", "category"}
EXPECTED_BBOX = (-180, -90, 180, 90)        # adjust to your region


def validate(path: Path) -> None:
    errors: list[str] = []

    # Size guard
    size = path.stat().st_size
    if size > MAX_UNCOMPRESSED_BYTES:
        errors.append(f"File too large: {size:,} bytes (limit {MAX_UNCOMPRESSED_BYTES:,})")

    gdf = gpd.read_file(path)

    # CRS guard
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        errors.append(f"Wrong CRS: {gdf.crs.to_epsg()} (expected 4326)")

    # Geometry validity
    invalid = gdf[~gdf.geometry.is_valid]
    for idx, row in invalid.iterrows():
        errors.append(f"Invalid geometry at index {idx}: {explain_validity(row.geometry)}")

    # Attribute completeness
    missing_cols = REQUIRED_FIELDS - set(gdf.columns)
    if missing_cols:
        errors.append(f"Missing required fields: {missing_cols}")

    # Null-value check on required fields
    for col in REQUIRED_FIELDS & set(gdf.columns):
        nulls = gdf[col].isna().sum()
        if nulls:
            errors.append(f"Column '{col}' has {nulls} null value(s)")

    # Spatial bounds
    minx, miny, maxx, maxy = gdf.total_bounds
    if not (
        EXPECTED_BBOX[0] <= minx and maxx <= EXPECTED_BBOX[2]
        and EXPECTED_BBOX[1] <= miny and maxy <= EXPECTED_BBOX[3]
    ):
        errors.append(f"Bounds outside expected range: {gdf.total_bounds.tolist()}")

    if errors:
        for e in errors:
            print(f"FAIL  {e}", file=sys.stderr)
        sys.exit(1)

    print(f"OK  {len(gdf)} features · {size:,} bytes · bounds {gdf.total_bounds.tolist()}")


if __name__ == "__main__":
    validate(Path(sys.argv[1]))

A failure here should trigger an alert and halt the pipeline entirely. Retain the failed output in a quarantine prefix (e.g. assets/quarantine/<timestamp>/) for post-mortem inspection.


Step 4: Atomic Deployment & Versioning

Publishing must be atomic to prevent partial updates reaching frontend consumers. Upload all assets to a new timestamped directory, verify each object’s ETag against the local checksum, and only then overwrite manifest.json. Frontends read manifest.json at initialisation and immediately get a consistent, complete dataset.

# scripts/deploy.py
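import datetime
import hashlib
import json
import sys
from pathlib import Path

import boto3
from boto3.s3.transfer import TransferConfig

BUCKET = "my-geo-assets"
LOCAL_OUTPUT = Path("optimised.geojson")

# Keep uploads single-part so the S3 ETag is a plain MD5 of the object.
# boto3 switches to multipart above 8 MB by default, which changes the ETag
# format. (Assumes the bucket does not use SSE-KMS, which also alters ETags.)
SINGLE_PART = TransferConfig(multipart_threshold=64 * 1024 * 1024)


def deploy() -> None:
    s3 = boto3.client("s3")
    # One version per UTC day, e.g. "2026-06-23". A same-day rerun overwrites
    # this prefix; use a full timestamp if you rebuild more than once a day.
    version = datetime.datetime.now(datetime.timezone.utc).date().isoformat()
    prefix = f"assets/{version}"
    key = f"{prefix}/optimised.geojson"

    # Upload asset
    s3.upload_file(
        str(LOCAL_OUTPUT),
        BUCKET,
        key,
        ExtraArgs={"ContentType": "application/geo+json",
                   "CacheControl": "public, max-age=86400"},
        Config=SINGLE_PART,
    )

    # Verify upload integrity via ETag
    head = s3.head_object(Bucket=BUCKET, Key=key)
    remote_etag = head["ETag"].strip('"')
    local_md5 = hashlib.md5(LOCAL_OUTPUT.read_bytes()).hexdigest()
    if remote_etag != local_md5:
        print("ETag mismatch, aborting manifest update", file=sys.stderr)
        sys.exit(1)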

    # Atomically update the manifest
    manifest = {
        "version": version,
        "url": f"https://cdn.example.com/{key}",
        "featureCount": _count_features(LOCAL_OUTPUT),
    }
    s3.put_object(
        Bucket=BUCKET,
        Key="manifest.json",
        Body=json.dumps(manifest).encode(),
        ContentType="application/json",
        CacheControl="no-cache, no-store, must-revalidate",
    )
    print(f"Deployed version {version}: {manifest}")


def _count_features(path: Path) -> int:
    data = json.loads(path.read_text())
    return len(data.get("features", []))


if __name__ == "__main__":
    deploy()

Keep the last three versioned prefixes in storage at all times. Rolling back is then a single manifest update pointing at the previous timestamp — no reprocessing required.
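The retention rule can be sketched as a pure function that decides which version prefixes are safe to delete. The name prefixes_to_prune is hypothetical, and the sketch assumes versions are ISO date strings (as produced by deploy.py), which sort chronologically as text. Listing and deleting the objects is left to your storage client.

```python
def prefixes_to_prune(versions: list[str], current: str, keep: int = 3) -> list[str]:
    """Return versions outside the newest `keep`, never including the live one."""
    ordered = sorted(set(versions), reverse=True)  # newest first
    # Always protect the version the manifest points at, even after a rollback
    protected = set(ordered[:keep]) | {current}
    return [v for v in ordered if v not in protected]


versions = ["2026-06-19", "2026-06-20", "2026-06-21", "2026-06-22", "2026-06-23"]
print(prefixes_to_prune(versions, current="2026-06-23"))
# ['2026-06-20', '2026-06-19']
```

Protecting the live version explicitly matters after a rollback, when the manifest may point at a prefix that is no longer among the three newest.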


Step 5: Cache Busting & Frontend Sync

The final step synchronises CDN edges and browser caches with the newly published assets. Combine URL-fingerprinting for long-lived tile assets with an explicit purge of manifest.json, which carries a no-cache directive and must always be fetched fresh. This is the operational side of cache invalidation strategies applied specifically to map layers.

# .github/workflows/nightly-rebuild.yml
name: Nightly Map Rebuild
on:
  schedule:
    - cron: "0 2 * * *"
  workflow_dispatch:

env:
  AWS_REGION: us-east-1

jobs:
  rebuild:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Fetch source data
        id: fetch
        run: python scripts/fetch_data.py
        # exits 0 on a checksum match; the steps below run unless the
        # fetch step has set the output skip=true

      - name: Process & optimise
        if: steps.fetch.outputs.skip != 'true'
        run: python scripts/process.py

      - name: Validate output
        if: steps.fetch.outputs.skip != 'true'
        run: python scripts/validate.py optimised.geojson

      - name: Deploy atomically
        if: steps.fetch.outputs.skip != 'true'
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: python scripts/deploy.py

      - name: Purge CDN cache for manifest
        if: steps.fetch.outputs.skip != 'true'
        env:
          CF_ZONE_ID: ${{ secrets.CF_ZONE_ID }}
          CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
        run: |
          curl -s -X POST \
            "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
            -H "Authorization: Bearer $CF_API_TOKEN" \
            -H "Content-Type: application/json" \
            --data '{"files":["https://cdn.example.com/manifest.json"]}' \
            | jq -e '.success'

      - name: Notify on failure
        if: failure()
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            --data "{\"text\":\"Nightly map rebuild failed on $(date -u +%Y-%m-%dT%H:%M:%SZ)\"}"

For tile-based outputs, purge only the affected tile grid cells rather than the entire origin — most CDN APIs accept a list of exact URLs, so enumerate only the tiles that cover the dataset’s bounding box.
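Enumerating the tiles that cover a bounding box might look like the sketch below. It assumes the standard XYZ (slippy map) tiling scheme in Web Mercator; the URL template and function names are hypothetical and should be replaced with your own tile path layout.

```python
import math

MAX_MERCATOR_LAT = 85.05112878  # latitude limit of the Web Mercator tile grid


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the XYZ tile (x, y) containing a lon/lat position at a zoom level."""
    lat = max(-MAX_MERCATOR_LAT, min(MAX_MERCATOR_LAT, lat))
    n = 2 ** zoom
    x = math.floor((lon + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so positions on the grid edge stay inside the valid tile range
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_urls_for_bbox(
    bbox: tuple[float, float, float, float], zoom: int, template: str
) -> list[str]:
    """List tile URLs covering bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    x0, y1 = lonlat_to_tile(min_lon, min_lat, zoom)  # south-west corner: largest y
    x1, y0 = lonlat_to_tile(max_lon, max_lat, zoom)  # north-east corner: smallest y
    return [
        template.format(z=zoom, x=x, y=y)
        for x in range(x0, x1 + 1)
        for y in range(y0, y1 + 1)
    ]


urls = tile_urls_for_bbox(
    (-1.0, -1.0, 1.0, 1.0), 1, "https://cdn.example.com/tiles/{z}/{x}/{y}.pbf"
)
print(len(urls))
# 4
```

Run this once per published zoom level and pass the combined list to the CDN purge call, batching if the provider caps the number of URLs per request.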


Verification & Smoke-Test

After each run, confirm the pipeline produced a live, fresh layer before closing the incident window:

# 1. Fetch the manifest and confirm the version field matches today's date
curl -sf https://cdn.example.com/manifest.json | jq '.version'

# 2. Download the asset URL from the manifest and count features
ASSET_URL=$(curl -sf https://cdn.example.com/manifest.json | jq -r '.url')
curl -sf "$ASSET_URL" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f'{len(d[\"features\"])} features · type={d[\"type\"]}')
"

# 3. Confirm the manifest is not being served from a stale cache
curl -sI https://cdn.example.com/manifest.json | grep -i 'cache-control\|age\|cf-cache-status'

# 4. Browser devtools: open the dashboard, open Network tab, reload, filter
#    for manifest.json — confirm Status 200, no "from disk cache", and
#    the feature layer reloads with the expected bounding box.

In your mapping library, register a console log on the source load event:

// MapLibre GL JS example
map.on('sourcedata', (e) => {
  if (e.sourceId === 'geo-features' && e.isSourceLoaded) {
    const src = map.getSource('geo-features');
    console.info('Layer loaded:', src.serialize());
  }
});

Troubleshooting

Why does manifest.json still serve the old version after a successful build?

The CDN cached the manifest with a max-age set from a previous upload that lacked Cache-Control: no-cache, no-store, must-revalidate. Upload the manifest again with the correct header, then trigger an explicit edge purge via the CDN’s API. Going forward, always set no-cache on manifest.json at upload time — the individual asset files under assets/<version>/ can carry a long max-age because their URLs are immutable.

The pipeline succeeds but Leaflet renders an empty layer — what went wrong?

The most common cause is a coordinate-axis flip: the output GeoJSON has coordinates in [lat, lon] order rather than the [lon, lat] order that RFC 7946 requires. This happens when the source delivers latitude-first coordinates, for example a service that follows the official EPSG:4326 axis order or a CSV import with swapped columns. A to_crs(epsg=4326) call will not repair this, because the data is already labelled WGS 84 and the call is a no-op. Swap the axes explicitly in the processing step, then verify the output by checking the first feature’s coordinates against a known point on a map.
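When the expected region is known, a flip can be detected and repaired mechanically. The sketch below is a heuristic under that assumption; the helper names are hypothetical, and the check is only meaningful when the expected bounding box is not symmetric under swapping (a whole-world box would never flag anything).

```python
def looks_axis_flipped(
    position: list[float], expected_bbox: tuple[float, float, float, float]
) -> bool:
    """True if position is outside the bbox as [lon, lat] but inside it when swapped."""
    min_lon, min_lat, max_lon, max_lat = expected_bbox

    def inside(lon: float, lat: float) -> bool:
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    first, second = position[0], position[1]
    return not inside(first, second) and inside(second, first)


def swap_axes(coords: list) -> list:
    """Recursively swap the first two ordinates of every position."""
    if coords and isinstance(coords[0], (int, float)):
        return [coords[1], coords[0], *coords[2:]]
    return [swap_axes(c) for c in coords]


norway_bbox = (4.0, 57.0, 32.0, 72.0)  # example region, lon/lat order
print(looks_axis_flipped([59.9, 10.7], norway_bbox))
# True
print(swap_axes([[59.9, 10.7], [60.0, 11.0]]))
# [[10.7, 59.9], [11.0, 60.0]]
```

swap_axes accepts the coordinates array of any GeoJSON geometry type, because it recurses until it reaches individual positions.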

How do I skip the rebuild when source data has not changed?

Store the SHA-256 checksum of the last successful fetch in a persistent location, such as a small file in S3 or a GitHub Actions cache. At the top of the fetch step, compare the current checksum against the stored value. If they match, exit the step with code 0 and set a step output flag (skip=true). Give the fetch step an id of fetch so that downstream steps can skip themselves with an if: steps.fetch.outputs.skip != 'true' condition.

The GitHub Actions job times out on large tile generation — how do I fix this?

Partition the tile generation by bounding box or zoom range and run each partition as a parallel matrix job. Set timeout-minutes on the individual jobs rather than globally so partial failures surface immediately. Cache the intermediate processed GeoJSON using actions/cache so that retrying a failed matrix job only re-runs the tile generation for that partition, not the full fetch-and-process chain.
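Splitting the dataset extent into a grid of partitions, one per matrix job, can be sketched as follows. The function name split_bbox and the row-major ordering are assumptions; each resulting cell would be passed to a tile-generation job as its clipping extent.

```python
BBox = tuple[float, float, float, float]


def split_bbox(bbox: BBox, cols: int, rows: int) -> list[BBox]:
    """Split (min_x, min_y, max_x, max_y) into cols x rows cells, row by row."""
    min_x, min_y, max_x, max_y = bbox
    width = (max_x - min_x) / cols
    height = (max_y - min_y) / rows
    cells = []
    for r in range(rows):
        for c in range(cols):
            x0 = min_x + c * width
            y0 = min_y + r * height
            # Reuse the exact outer edge on the last column and row so that
            # floating-point drift cannot leave a gap at the boundary
            x1 = max_x if c == cols - 1 else min_x + (c + 1) * width
            y1 = max_y if r == rows - 1 else min_y + (r + 1) * height
            cells.append((x0, y0, x1, y1))
    return cells


print(split_bbox((0, 0, 4, 2), cols=2, rows=2))
# [(0.0, 0.0, 2.0, 1.0), (2.0, 0.0, 4, 1.0), (0.0, 1.0, 2.0, 2), (2.0, 1.0, 4, 2)]
```

An even grid is the simplest choice; if features cluster in a few cells, partitioning by feature count instead will balance job durations better.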

How do I roll back to the previous build without rerunning the full pipeline?

Because assets are immutable under their timestamped prefix, a rollback is just a manifest update. Write a one-shot script that reads the second-most-recent prefix from S3, constructs a new manifest.json pointing at it, and uploads with no-cache. Then purge the CDN edge. The entire operation takes under ten seconds and requires no reprocessing.
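The selection logic of such a rollback script can be sketched as a pure function. The name rollback_manifest, the base URL, and the asset file name are assumptions for illustration; the sketch also omits featureCount, which a real script would need to read from the older asset or from a stored copy of its manifest.

```python
def rollback_manifest(
    versions: list[str],
    current: str,
    base_url: str = "https://cdn.example.com",
    asset_name: str = "optimised.geojson",
) -> dict:
    """Build a manifest pointing at the newest version older than `current`."""
    # ISO date strings compare chronologically, so plain string ordering works
    older = sorted(v for v in set(versions) if v < current)
    if not older:
        raise ValueError(f"no version older than {current} is available")
    target = older[-1]
    return {
        "version": target,
        "url": f"{base_url}/assets/{target}/{asset_name}",
    }


print(rollback_manifest(["2026-06-21", "2026-06-23", "2026-06-22"], "2026-06-23"))
# {'version': '2026-06-22', 'url': 'https://cdn.example.com/assets/2026-06-22/optimised.geojson'}
```

The resulting dictionary is uploaded as manifest.json with the same no-cache headers used in the deploy step, followed by the CDN purge.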


Gotchas & Edge Cases

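  • Axis-order bugs: PostGIS and geopandas both store EPSG:4326 coordinates as (x = longitude, y = latitude), so geopandas.read_postgis() normally returns them in the order RFC 7946 expects. Flips usually originate upstream, in sources that honour the official latitude-first axis order of EPSG:4326 or in imports with swapped columns. Calling .to_crs(epsg=4326) on data already labelled WGS 84 is a no-op and will not fix a flip. The symptom is geometries that appear in the wrong hemisphere; the remedy is an explicit axis swap plus a bounds check in validation.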
  • CDN Cache-Control header conflicts: If a reverse proxy between the origin and CDN strips or overrides Cache-Control headers, the no-cache directive on manifest.json will not reach the CDN edge. Verify the header survives by running curl -sI against the CDN URL and checking the raw response, not the browser’s cached view.
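  • ETag comparison on multipart uploads: S3’s ETag for objects uploaded via multipart upload is not a plain MD5. It is the MD5 of the concatenated part digests followed by -N, where N is the part count. boto3’s upload_file switches to multipart above 8 MB by default, so an MD5 comparison fails for larger files. Either keep the upload single-part (raise multipart_threshold in TransferConfig or use put_object), or upload with a checksum algorithm and compare the head-object checksum field (ChecksumSHA256) instead of the ETag, bearing in mind that for multipart uploads the SHA-256 value is also a composite of per-part checksums.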
  • Partial state during atomic deploys: If the job fails between uploading the asset and updating manifest.json, the orphaned asset sits in storage but the manifest still points at the previous version — which is exactly correct behaviour. Add a cleanup cron that prunes asset prefixes older than 7 days that are not referenced by the current manifest.
  • Silent geometry drops in MapLibre GL JS: MapLibre silently skips features with geometries it cannot parse. Enable the browser console and look for [MapLibre GL] Feature index out of range or invalid geometry type warnings during dashboard load. The validation step in the pipeline should catch these upstream, but always verify in the browser as a final smoke-test.
  • Timezone drift in cron expressions: GitHub Actions runs cron jobs in UTC. If your source data is updated at a business-hours boundary in a non-UTC timezone, schedule the rebuild trigger at least two hours after the expected source update to avoid a race between source write and pipeline fetch.

Scheduled vs. Event-Driven Architectures

Choosing between scheduled and reactive patterns depends on data volatility, user expectations, and infrastructure constraints. Scheduled rebuilds excel when data changes predictably, batch processing reduces costs, and frontend consumers can tolerate minor latency. They simplify debugging, enable comprehensive QA gates, and integrate cleanly with traditional CI/CD practices.

If your application requires sub-minute data freshness — live fleet tracking, emergency response routing, or IoT sensor dashboards — webhook-triggered updates or stream-processing pipelines are more appropriate. Event-driven architectures introduce higher complexity around deduplication, ordering guarantees, and partial state management, but they eliminate the staleness window entirely.

Many mature platforms implement a hybrid model: scheduled nightly rebuilds establish a clean, fully validated baseline, while webhooks apply incremental patches during business hours. This hybrid feeds naturally into incremental data processing patterns where only the changed features are re-validated and re-published, keeping compute costs proportional to the change rate rather than the full dataset size.
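The incremental idea can be illustrated with a per-feature digest comparison between two builds. This is a minimal sketch, assuming every feature carries a stable identifier in properties under a hypothetical id field; the function names are illustrative, not part of any library.

```python
import hashlib
import json


def feature_digest(feature: dict) -> str:
    """Hash a feature in a canonical form so key order does not affect the result."""
    canonical = json.dumps(feature, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_features(
    previous: list[dict], current: list[dict], id_field: str = "id"
) -> dict[str, list]:
    """Classify feature ids as added, removed, or changed between two builds."""
    old = {f["properties"][id_field]: feature_digest(f) for f in previous}
    new = {f["properties"][id_field]: feature_digest(f) for f in current}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }
```

Only the ids under added and changed need to pass through validation and publishing again, which keeps the work proportional to the change rate.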