Incremental sitemap regeneration for dynamic CMS routes

Regenerating a full sitemap on every deploy stops scaling past a few thousand dynamic routes: it exhausts CMS API quotas, triggers cascading CDN invalidations, and blocks the pipeline. Incremental regeneration isolates route-level updates with a versioned route registry, webhook-triggered delta fetches, and chunked XML in which only modified chunks are rewritten and purged. It is one part of Localization & SEO Optimization, which demands precise cache control per region.

The architecture decouples route enumeration from XML serialization. Instead of iterating every CMS document per deploy, maintain a persistent versioned registry; webhook payloads trigger targeted fetches, and only modified slugs, locales, or content types enter the regeneration queue.
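
As a rough illustration of the registry idea, the sketch below keeps entries in an in-memory map and reports which chunks a batch of changes touches. The names `RegistryEntry`, `RouteRegistry`, `applyDelta`, and `chunkKey` are hypothetical, and a real deployment would presumably persist the map in a key-value store rather than in process memory.

```typescript
// Hypothetical registry shape; field names are illustrative, not a CMS schema.
interface RegistryEntry {
  entryId: string;
  path: string;
  locale: string;
  contentType: string;
  lastmod: string; // ISO 8601 timestamp reported by the CMS
}

// Chunks are keyed by locale and content type, e.g. "en-landing".
function chunkKey(entry: RegistryEntry): string {
  return `${entry.locale}-${entry.contentType}`;
}

class RouteRegistry {
  private entries = new Map<string, RegistryEntry>();
  private currentVersion = 0;

  get version(): number {
    return this.currentVersion;
  }

  // Merges changed entries and returns the keys of chunks that need rewriting.
  applyDelta(changed: RegistryEntry[]): Set<string> {
    const dirtyChunks = new Set<string>();
    for (const entry of changed) {
      const previous = this.entries.get(entry.entryId);
      const unchanged =
        previous !== undefined &&
        previous.path === entry.path &&
        previous.lastmod === entry.lastmod &&
        chunkKey(previous) === chunkKey(entry);
      if (unchanged) continue;
      // A moved entry dirties both the chunk it left and the one it joined.
      if (previous !== undefined) dirtyChunks.add(chunkKey(previous));
      dirtyChunks.add(chunkKey(entry));
      this.entries.set(entry.entryId, entry);
    }
    // Bump the version only when something actually changed.
    if (dirtyChunks.size > 0) this.currentVersion += 1;
    return dirtyChunks;
  }
}
```

Only the chunk keys returned by `applyDelta` would enter the regeneration queue. Deletions and unpublish events are left out of this sketch and would need their own handling.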

Three Failure Modes at Scale

Sitemap generation in headless environments fails three ways:

  1. API rate-limit exhaustion: Bulk enumeration during build hits CMS API ceilings, causing timeouts and failed deploys.
  2. Edge cache staleness: Aggressive max-age on sitemap-index.xml makes the CDN serve a stale index, delaying crawler discovery.
  3. Phantom locale routes: Fallback configs emit 200 OK URLs with duplicate or placeholder content, diluting crawl budget.

All three converge on the same fix: event-driven incremental updates, not monolithic rebuilds.

Deterministic Route Resolution

Start with a resolver that queries the CMS delivery API using cursor pagination for consistent traversal, filtering strictly on updatedAt to capture only deltas. This TypeScript utility shows the incremental fetch with typing, error boundaries, and cursor management:

TypeScript
import { CMSClient, RouteEntry } from './types';

interface FetchOptions {
  cmsClient: CMSClient;
  lastSyncTimestamp: string;
  batchSize?: number;
}

export async function fetchIncrementalRoutes({
  cmsClient,
  lastSyncTimestamp,
  batchSize = 100,
}: FetchOptions): Promise<RouteEntry[]> {
  const routes: RouteEntry[] = [];
  // A null cursor requests the first page; the CMS returns nextCursor until the last page.
  let cursor: string | null = null;

  try {
    do {
      // Request only the fields the sitemap needs, limited to entries changed since the last sync.
      const response = await cmsClient.getEntries({
        limit: batchSize,
        cursor,
        fields: ['slug', 'locale', 'updatedAt', 'contentType'],
        filter: `updatedAt > "${lastSyncTimestamp}"`,
      });

      // Map each CMS entry to a sitemap route record.
      const mappedRoutes = response.items.map((item) => ({
        path: `/${item.locale}/${item.slug}`,
        lastmod: item.updatedAt,
        priority: item.contentType === 'landing' ? 0.8 : 0.6,
        changefreq: 'weekly' as const,
        entryId: item.sys.id,
        locale: item.locale,
      }));

      routes.push(...mappedRoutes);
      cursor = response.nextCursor ?? null;
    } while (cursor);
  } catch (error) {
    console.error('Route enumeration failed:', error);
    // Keep the original error as `cause` so callers can inspect the underlying failure.
    throw new Error('Incremental fetch interrupted. Verify CMS API health.', { cause: error });
  }

  return routes;
}

The output feeds a chunked sitemap generator. The Sitemaps Protocol caps each file at 50,000 URLs and 50 MB uncompressed, so chunk the registry by locale and content type. Give each chunk its own ETag for conditional requests. Keep every chunk listed in the index, but update <lastmod> only for the chunks that changed, so crawlers can skip the rest.
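
The chunking step might look like the following sketch, which groups routes by locale and content type, splits each group at the 50,000-URL limit, and derives an ETag from a SHA-256 hash of the serialized XML. The `SitemapRoute` and `SitemapChunk` shapes, the file-naming scheme, and the `origin` parameter are assumptions for illustration; the 50 MB size limit is not checked here.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shapes for illustration.
interface SitemapRoute {
  path: string;
  lastmod: string;
  locale: string;
  contentType: string;
}

interface SitemapChunk {
  name: string;
  xml: string;
  etag: string;
}

const MAX_URLS_PER_FILE = 50_000; // Sitemaps Protocol limit per file

function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function buildChunks(
  routes: SitemapRoute[],
  origin: string,
  maxUrls: number = MAX_URLS_PER_FILE,
): SitemapChunk[] {
  // Group routes by locale and content type.
  const groups = new Map<string, SitemapRoute[]>();
  for (const route of routes) {
    const key = `${route.locale}-${route.contentType}`;
    const group = groups.get(key) ?? [];
    group.push(route);
    groups.set(key, group);
  }

  const chunks: SitemapChunk[] = [];
  for (const [key, group] of groups) {
    // Sort so identical input always serializes to identical XML (stable ETag).
    group.sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
    for (let start = 0; start < group.length; start += maxUrls) {
      const urls = group
        .slice(start, start + maxUrls)
        .map(
          (r) =>
            `  <url><loc>${escapeXml(origin + r.path)}</loc>` +
            `<lastmod>${escapeXml(r.lastmod)}</lastmod></url>`,
        )
        .join('\n');
      const xml =
        '<?xml version="1.0" encoding="UTF-8"?>\n' +
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
        `${urls}\n</urlset>\n`;
      // Strong ETag derived from the chunk content.
      const digest = createHash('sha256').update(xml).digest('hex').slice(0, 32);
      chunks.push({
        name: `sitemap-${key}-${start / maxUrls + 1}.xml`,
        xml,
        etag: `"${digest}"`,
      });
    }
  }
  return chunks;
}
```

Because the ETag is a pure function of the XML, an unchanged chunk keeps its previous ETag and can be skipped when deciding what to upload or purge.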

Webhook Deduplication

CMS platforms emit duplicate entry.update events — draft transitions, scheduled publishing, and metadata-only edits all fire redundant payloads. Derive an idempotency key from entryId plus a SHA-256 hash of the modified field paths, and track processed keys in an LRU cache over a 30–60s window. A matching key short-circuits the cycle, so only substantive mutations trigger serialization.
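
A minimal sketch of that deduplication step follows, assuming the webhook payload exposes the entry ID and a list of changed field paths (the `WebhookEvent` shape is hypothetical, since payload formats differ between CMS vendors). The cache combines a time window with LRU eviction; the 45-second default is just a value inside the 30–60s range mentioned above.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical payload shape; real webhook payloads vary by CMS.
interface WebhookEvent {
  entryId: string;
  changedFields: string[];
}

// entryId plus a SHA-256 hash of the sorted field paths.
function idempotencyKey(event: WebhookEvent): string {
  const fieldHash = createHash('sha256')
    .update([...event.changedFields].sort().join('\n'))
    .digest('hex');
  return `${event.entryId}:${fieldHash}`;
}

class RecentKeyCache {
  // Map preserves insertion order, so the first key is the least recently used.
  private seen = new Map<string, number>(); // key -> expiry time in ms
  private ttlMs: number;
  private maxEntries: number;

  constructor(ttlMs = 45_000, maxEntries = 10_000) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
  }

  // Returns true when the event should be processed, false for a duplicate.
  shouldProcess(key: string, now: number = Date.now()): boolean {
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > now) {
      // Duplicate inside the window: refresh recency, keep the original expiry.
      this.seen.delete(key);
      this.seen.set(key, expiry);
      return false;
    }
    this.seen.delete(key);
    this.seen.set(key, now + this.ttlMs);
    if (this.seen.size > this.maxEntries) {
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    return true;
  }
}
```

An in-process map only deduplicates within a single instance. If webhooks fan out across several serverless instances, the same check would presumably need a shared store with key expiry.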

Locale Fallback

Missing localized variants must be omitted from the sitemap or mapped to a canonical parent — never indexed as low-value fallback pages. Route Mapping for Multilingual Sites keeps hreflang and canonical tags synced with the XML. When generating locale chunks, cross-reference available translations against the fallback hierarchy; exclude missing variants from the sitemap but keep them in the routing layer for graceful degradation. This underpins Dynamic Sitemap Generation at global scale.
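
The split between the sitemap layer and the routing layer can be sketched as two small functions. The `LocalizedEntry` shape and the `FallbackChains` configuration format are assumptions for illustration, not a specific CMS or framework API.

```typescript
// Hypothetical entry shape.
interface LocalizedEntry {
  slug: string;
  availableLocales: string[]; // locales with a real, published translation
}

// Hypothetical fallback configuration, e.g. { 'fr-CA': ['fr', 'en'] }.
type FallbackChains = Record<string, string[]>;

// Routing layer: which locale's content is rendered for a request, or null if none exists.
function resolveServedLocale(
  entry: LocalizedEntry,
  requested: string,
  fallbacks: FallbackChains,
): string | null {
  const candidates = [requested, ...(fallbacks[requested] ?? [])];
  return candidates.find((locale) => entry.availableLocales.includes(locale)) ?? null;
}

// Sitemap layer: list a path only when the translation exists in that exact locale.
function sitemapPathsForLocale(entries: LocalizedEntry[], locale: string): string[] {
  return entries
    .filter((entry) => entry.availableLocales.includes(locale))
    .map((entry) => `/${locale}/${entry.slug}`);
}
```

With this split, a request for a missing variant still renders fallback content through `resolveServedLocale`, while `sitemapPathsForLocale` never advertises that URL to crawlers.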

CDN Synchronization

Purge only the chunk URLs whose ETag or Last-Modified value changed rather than purging whole directories. Cache tags that map to sitemap chunks let you invalidate surgically without touching static assets. Set Cache-Control: public, max-age=3600, stale-while-revalidate=86400 on updated chunks to balance freshness against edge performance. Conditional requests (see MDN on HTTP ETag) mean crawlers that send If-None-Match download only modified XML, cutting bandwidth and speeding indexation.
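
To make the conditional-request behavior concrete, here is a framework-neutral sketch of the origin-side decision: compare the incoming If-None-Match value with the chunk's ETag and answer 304 or 200. The function name and the plain-object response shape are hypothetical; an actual edge worker would translate the result into its platform's response type.

```typescript
// Hypothetical response shape, kept independent of any server framework.
interface ChunkResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

const CHUNK_CACHE_CONTROL = 'public, max-age=3600, stale-while-revalidate=86400';

// If-None-Match uses weak comparison, so a W/ prefix is ignored on both sides.
function stripWeakPrefix(tag: string): string {
  return tag.replace(/^W\//, '');
}

function respondToChunkRequest(
  chunk: { xml: string; etag: string },
  ifNoneMatch: string | null,
): ChunkResponse {
  const headers: Record<string, string> = {
    'Cache-Control': CHUNK_CACHE_CONTROL,
    ETag: chunk.etag,
  };
  // The header may carry a comma-separated list of validators, or "*".
  const candidates = (ifNoneMatch ?? '')
    .split(',')
    .map((tag) => stripWeakPrefix(tag.trim()));
  const current = stripWeakPrefix(chunk.etag);
  if (candidates.includes('*') || candidates.includes(current)) {
    // The client copy is current: send validators only, no body.
    return { status: 304, headers, body: null };
  }
  return {
    status: 200,
    headers: { ...headers, 'Content-Type': 'application/xml; charset=utf-8' },
    body: chunk.xml,
  };
}
```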

Pipeline Flow

The full sequence routes each webhook through deduplication before any serialization work happens, so only substantive mutations rewrite a chunk.

flowchart TD
  Hook["CMS webhook event"] --> Dedup{"Idempotency key in LRU cache?"}
  Dedup -->|"duplicate"| Drop["Drop event"]
  Dedup -->|"new"| Delta["Delta fetch: updatedAt filter + cursor"]
  Delta --> Chunk["Serialize modified routes by locale / type"]
  Chunk --> Etag["Assign ETag per chunk"]
  Etag --> Index["Reconcile sitemap-index.xml"]
  Index --> Purge["Targeted edge purge (cache tags)"]

The same sequence, step by step:

  1. Event Ingestion: CMS webhooks deliver payload metadata to a serverless function or edge worker.
  2. Deduplication Check: Idempotency keys are validated against an LRU cache. Redundant events are dropped.
  3. Delta Fetch: The route resolver queries the CMS API using updatedAt filters and cursor pagination.
  4. Chunk Serialization: Modified routes are grouped by locale/content type, serialized to XML, and assigned ETag values.
  5. Index Reconciliation: The sitemap-index.xml keeps every chunk listed and updates <lastmod> only for the modified chunk URLs.
  6. Edge Invalidation: Targeted cache tags or URL paths are purged. New chunks deploy with optimized Cache-Control directives.
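
The index reconciliation step can be sketched as a merge followed by serialization. This sketch assumes the index keeps listing every chunk and bumps `<lastmod>` only for the rewritten ones, which is how the Sitemaps Protocol index format is normally consumed; the `IndexEntry` shape and function names are hypothetical.

```typescript
// Hypothetical index entry: one per chunk file.
interface IndexEntry {
  loc: string; // absolute URL of the chunk
  lastmod: string; // W3C datetime of the chunk's last rewrite
}

function escapeIndexXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Merges rewritten chunks into the existing index; untouched chunks keep their lastmod.
function reconcileIndex(current: IndexEntry[], rewritten: IndexEntry[]): IndexEntry[] {
  const byLoc = new Map<string, IndexEntry>(
    current.map((entry): [string, IndexEntry] => [entry.loc, entry]),
  );
  for (const entry of rewritten) byLoc.set(entry.loc, entry);
  // Sort for deterministic output.
  return [...byLoc.values()].sort((a, b) => (a.loc < b.loc ? -1 : a.loc > b.loc ? 1 : 0));
}

function renderIndex(entries: IndexEntry[]): string {
  const items = entries
    .map(
      (entry) =>
        `  <sitemap><loc>${escapeIndexXml(entry.loc)}</loc>` +
        `<lastmod>${escapeIndexXml(entry.lastmod)}</lastmod></sitemap>`,
    )
    .join('\n');
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    `${items}\n</sitemapindex>\n`
  );
}
```

After reconciliation, only the index and the rewritten chunk URLs would need to be purged at the edge.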

Incremental regeneration turns sitemap management from a deployment bottleneck into a background, event-driven process. Isolating route updates, enforcing deduplication, and using edge cache primitives is what holds crawl efficiency steady as content velocity climbs.