Incremental Data Processing for Geo-Dashboards
Part of the Data Refresh & Automation Pipelines guide.
Full dataset rebuilds are unsustainable for modern geo-dashboards. When a municipal boundary shifts, a sensor network reports new telemetry, or a land-use classification updates, regenerating millions of vector tiles or re-indexing entire spatial tables introduces unacceptable latency, compute costs, and user-facing downtime. Incremental data processing solves this by isolating, transforming, and applying only the changed records while preserving spatial topology and dashboard state — enabling near-real-time updates without sacrificing the deterministic guarantees required for analytical dashboards.
How Incremental Processing Fits the Pipeline
The diagram below shows where delta processing sits between raw ingestion and the map serving layer. Change detection feeds into transformation and validation; only clean, committed records reach the tile cache and connected clients.
Prerequisites
Before implementing incremental updates, verify your stack meets these baseline requirements:
- Versioned spatial storage — a PostGIS-enabled database with explicit
updated_at - Deterministic primary keys
- Spatial indexing — GIST or R-tree indexes on geometry columns must be maintained; delta workflows rely on bounding-box pre-filtering to avoid full-table scans. Confirm that CRS & Projection Management
- Tile or vector serving layer
- Idempotent transformation logic
- PostGIS 3.1+ — required for the explicit
ST_MakeValid
Step 1 — Establish Baseline and Spatial Indexing
Materialize a complete snapshot of your geospatial dataset and compute a deterministic hash over geometry and key attributes. This snapshot becomes the reconciliation anchor for all subsequent delta operations.
SELECT
md5(
string_agg(
encode(ST_AsBinary(geom), 'hex') || feature_id::text,
'' ORDER BY feature_id
)
) AS baseline_hash,
COUNT(*) AS baseline_row_count,
NOW() AS last_synced_at
FROM spatial_features;
Store baseline_hash, baseline_row_count, and last_synced_at in a pipeline_state table. Always run ANALYZE on the table after baseline creation to refresh query-planner statistics. If your architecture relies on Scheduled Map Rebuild Workflows to generate the initial state, use that job for baseline generation and then switch the scheduler to incremental mode for subsequent runs.
Step 2 — Implement Change Detection
Identify modified, inserted, and deleted records since the last successful run. Three patterns cover the majority of production scenarios:
High-water mark — simplest approach, suitable when deletes are handled with soft-delete flags:
SELECT *
FROM spatial_features
WHERE updated_at > (SELECT last_synced_at FROM pipeline_state)
AND updated_at <= NOW();
Always account for clock skew and concurrent writes. A safer variant replaces the timestamp with a strictly monotonic sequence or the PostgreSQL xmin transaction ID to guarantee no records slip through during overlapping sync windows.
Logical replication slots — captures hard deletes and schema-transparent updates by streaming WAL changes. This approach requires no application-level timestamp updates and is preferable for high-throughput tables where rows are deleted rather than marked inactive.
Trigger-based audit tables — maintain a spatial_audit table that logs INSERT, UPDATE, and DELETE operations with full transactional consistency. The audit row includes the operation type, affected feature_id, and the pre-change geometry, enabling exact rollback or replay.
For high-frequency telemetry, decouple detection from transformation using a message queue to absorb write spikes and provide backpressure before geometries enter the validation stage.
Step 3 — Transform and Validate Deltas
Raw deltas rarely map directly to the production schema. Apply deterministic transformations that normalize geometries, resolve coordinate reference system mismatches, and enforce topology rules before any record touches the production table.
WITH validated_deltas AS (
SELECT
feature_id,
CASE
WHEN ST_IsValid(geom) THEN geom
-- ST_MakeValid with method param requires PostGIS 3.1+
ELSE ST_MakeValid(geom, '{"method":"linework"}')
END AS clean_geom,
updated_at,
status,
-- Reproject to WGS 84 if source is a local CRS
ST_Transform(geom, 4326) AS geom_wgs84
FROM staging_deltas
)
SELECT feature_id, clean_geom, geom_wgs84, updated_at, status
FROM validated_deltas
WHERE clean_geom IS NOT NULL;
Log invalid geometries that ST_MakeValid cannot repair to a dead-letter queue rather than failing the entire batch. Each dead-letter entry should carry the original payload, the ST_IsValidReason string, and a processing timestamp for manual triage. This preserves pipeline continuity while maintaining data quality — the same principle that underpins the Cache Invalidation Strategies dead-letter handling pattern.
For Python-based transformations using geopandas (0.14+), wrap geometry operations in try/except blocks and serialize failures to a structured JSON log.
Step 4 — Apply Updates and Maintain Topology
Execute delta application within a single transactional boundary. Use UPSERT to handle inserts and updates atomically. For soft deletes, toggle an is_active flag rather than physically removing rows — this preserves referential integrity for historical dashboard queries and simplifies audit trails.
BEGIN;
-- Record the batch to enforce idempotency on retry
INSERT INTO applied_batches (batch_id, applied_at)
VALUES (:'batch_id', NOW())
ON CONFLICT (batch_id) DO NOTHING;
-- Skip the rest if this batch was already applied
DO $$
BEGIN
IF NOT FOUND THEN RAISE EXCEPTION 'batch already applied'; END IF;
END $$;
INSERT INTO production_features (feature_id, geom, metadata, updated_at)
SELECT feature_id, clean_geom, metadata, updated_at
FROM validated_deltas
ON CONFLICT (feature_id) DO UPDATE SET
geom = EXCLUDED.geom,
metadata = EXCLUDED.metadata,
updated_at = EXCLUDED.updated_at;
-- Soft-delete records in the delete queue
UPDATE production_features
SET is_active = FALSE, updated_at = NOW()
WHERE feature_id IN (SELECT feature_id FROM delete_queue);
COMMIT;
After committing, refresh the spatial index to prevent bloat from repeated delta applications:
REINDEX INDEX CONCURRENTLY idx_production_features_geom;
Concurrent reindexing acquires only a ShareUpdateExclusiveLock, which blocks DDL but not reads or writes, so dashboard queries remain responsive during maintenance.
Step 5 — Invalidate Caches and Trigger Re-renders
Once deltas are committed, the serving layer must reflect the changes immediately. Calculate the bounding box of modified features and purge only the affected tile coordinates rather than flushing the entire cache — a full flush defeats the purpose of incremental processing. For in-depth cache lifecycle management see the Cache Invalidation Strategies guide.
import psycopg2
import requests
def invalidate_tiles_for_delta(conn, batch_id: str, tile_server_url: str) -> None:
"""Purge only tile coordinates intersecting the changed bounding box."""
with conn.cursor() as cur:
cur.execute(
"""
SELECT
ST_XMin(bbox) AS min_lng,
ST_YMin(bbox) AS min_lat,
ST_XMax(bbox) AS max_lng,
ST_YMax(bbox) AS max_lat
FROM (
SELECT ST_Extent(geom)::geometry AS bbox
FROM production_features
WHERE batch_id = %s
) sub
""",
(batch_id,),
)
row = cur.fetchone()
if row is None:
return
min_lng, min_lat, max_lng, max_lat = row
resp = requests.post(
f"{tile_server_url}/invalidate",
json={"bbox": [min_lng, min_lat, max_lng, max_lat]},
timeout=10,
)
resp.raise_for_status()
Additionally, use WebSocket push notifications to alert connected dashboard clients. Frontend map layers should refresh only the visible viewport — the Tile vs Vector Rendering Strategies you chose determines whether this means re-fetching MVT endpoints or updating a GeoJSON source in place. Implement exponential backoff on client-side retry logic to prevent thundering-herd effects during large delta bursts.
Step 6 — Verify and Smoke-Test
After each batch, confirm correctness before the next cycle runs:
-- 1. Row count should match expected delta size
SELECT COUNT(*) FROM production_features WHERE batch_id = :'batch_id';
-- 2. No invalid geometries should survive in production
SELECT COUNT(*) FROM production_features WHERE NOT ST_IsValid(geom);
-- 3. Recompute dataset hash and compare to expected value
SELECT md5(
string_agg(
encode(ST_AsBinary(geom), 'hex') || feature_id::text,
'' ORDER BY feature_id
)
) AS current_hash
FROM production_features
WHERE is_active = TRUE;
From the command line, verify tile freshness by requesting a known affected tile coordinate and inspecting the Last-Modified response header:
curl -I "https://tiles.example.com/data/layer/{z}/{x}/{y}.mvt" \
| grep -i "last-modified"
If the timestamp predates your commit, the tile server cache was not successfully invalidated — re-run the invalidation step rather than the entire pipeline.
Production Safeguards and Monitoring
Incremental pipelines require rigorous observability to prevent silent data corruption:
- Checksum reconciliation — after each delta batch, recompute the dataset hash and compare it against the expected value. Mismatches indicate dropped records or transformation drift.
- Lag monitoring — track the gap between
last_synced_atandNOW(). Alert if lag exceeds your SLA (e.g., five minutes for near-real-time dashboards). - Dead-letter queue size — alert immediately on any non-zero dead-letter count; geometries that could not be repaired represent data loss until manually resolved.
- Topology constraints — enforce
ST_IsValidand spatial relationship checks (ST_Intersects,ST_Contains) during validation. Broken topology causes rendering artifacts and breaks spatial joins downstream.
Emit structured log events for rows_processed, rows_failed, processing_duration_ms, and cache_purge_count. Integrate with Prometheus/Grafana or Datadog for real-time alerting. Set up automated runbooks that trigger on dead_letter_queue_size > 0 or spatial_index_bloat_ratio > 0.3.
Troubleshooting
Why does my high-water mark query miss deletes?
High-water marks on updated_at only capture rows that were modified — physically removed rows leave no trace in the source table. Switch to logical replication slots or a trigger-based audit table that records a DELETE operation row before the record disappears. Alternatively, enforce soft deletes application-wide so every removal sets is_active = FALSE and bumps updated_at.
ST_MakeValid repaired my geometry but the rendered layer still looks wrong — why?
ST_MakeValid can introduce new vertices that shift boundary positions. When your feature shares an edge with an adjacent polygon, the neighbour is not automatically healed, producing gaps or overlaps at shared boundaries. Run ST_Snap(a.geom, b.geom, tolerance) between adjacent features after repair to re-align shared edges.
Concurrent REINDEX keeps blocking my queries — how do I fix this?
REINDEX INDEX CONCURRENTLY requires a ShareUpdateExclusiveLock, which blocks DDL but not reads or writes. If contention still appears, check for long-running transactions holding locks — they prevent the concurrent build from finishing and can escalate to heavier locks. Use pg_stat_activity to identify and terminate stale transactions before reindexing.
My pipeline retried and applied the same delta twice — what happened?
The transactional outbox pattern in step 4 guards against this, but only if batch_id is generated outside the retry loop and remains stable across retries. If your orchestrator generates a new UUID on each attempt, the idempotency key changes and the guard fails. Pin batch_id to a deterministic value — for example, the SHA-256 of the source file path and modification timestamp — so retries always produce the same key.
The tile invalidation endpoint returns 200 but stale tiles persist — why?
Most tile servers distinguish between expiring a cache entry (which marks it stale but serves the old tile until the next request) and actively purging it (which removes the entry so the next request regenerates it). Confirm whether your server’s /invalidate endpoint performs a purge or an expiry. If it only expires, instruct the webhook-triggered update to issue a follow-up request to /purge or set Cache-Control: no-cache on the downstream fetch.
Delta batches grow larger over time even though source changes are small — why?
This usually indicates that the high-water mark is drifting because last_synced_at is not being persisted after each successful run. Verify that your pipeline writes the new watermark inside the same transaction as the UPSERT so that a crash mid-run rolls back both the data and the watermark, forcing a clean replay rather than a growing accumulation.
Gotchas and Edge Cases
- Clock skew on multi-node writes — if your source database uses streaming replication and reads come from replicas,
updated_aton the replica can lag the primary by milliseconds, causing the watermark to skip records that committed just before the sync. Query the primary for change detection, or use the WAL-based approach which is clock-independent. - CRS mismatch between delta source and production table — if a data provider switches from
EPSG:3857toEPSG:4326without notice, geometries silently land in the wrong projection and appear as a pinpoint in the wrong ocean. Add an explicitST_SRID(geom) = expected_sridassertion in the validation step. - Geometry type drift — a
POLYGONthat gains a hole becomes aPOLYGON WITH HOLE, but aMULTIPOLYGONthat loses all but one member may arrive as a barePOLYGON. Casting mismatches causeON CONFLICT DO UPDATEto fail at the column type level. Cast togeometryin the production column and handle type coercion in the transformation step. - Large delta bursts after source outages — when a source system is offline for hours and then reconnects, all accumulated changes arrive at once. Without backpressure, this can exhaust connection pool slots. Cap batch size (
LIMIT 10000) and loop until the watermark is caught up rather than processing the entire backlog in a single transaction. - Shared sequences across feature types — if
feature_idis drawn from a shared sequence used by multiple geometry types (points, lines, polygons), delta reconciliation can collide on the same ID from different layers. Prefix IDs with a layer code or use UUIDs generated per feature type.
Related
- Scheduled Map Rebuild Workflows — configure scheduled full rebuilds that generate the initial baseline and correct long-term drift.
- Cache Invalidation Strategies — targeted tile and browser cache purge patterns that complement delta commits.
- Webhook-Triggered Updates — event-driven pipeline triggers that fire incremental runs in response to source system changes.
- CRS & Projection Management — ensure coordinate reference system consistency before geometries enter the delta pipeline.
- Tile vs Vector Rendering Strategies — the rendering approach determines which cache layers need invalidation after a delta commit.