Incremental Content Migration Scripts for Headless

Why Bulk Exports Fail

An incremental sync moves delta payloads from a legacy database into a headless content API without busting the whole cache, hitting rate limits, or merging draft and published state. Bulk exports that ignore the content lifecycle break ISR caches, orphan asset references, and desync preview environments. The working pattern is stateful and cursor-driven: track version vectors, apply idempotent upserts, and fire rebuild webhooks only for the route segments that changed. This keeps editorial velocity intact through the decoupling phase.

Three constraints cause most pipeline degradation:

  1. API quotas. CMS management APIs cap requests (often 10–50 req/s). Bulk payloads without exponential backoff hit 429 Too Many Requests (RFC 6585) immediately; without jittered retries, scripts exhaust connection pools and corrupt the sync cursor.
  2. Merged draft/published state. Naive utilities treat content as flat JSON and ignore the draft/published split that preview routing depends on. Merging both into one API call overwrites draft metadata, invalidates preview tokens, and forces the frontend back to stale production snapshots.
  3. Cache-key collisions. Frameworks derive cache keys from updatedAt or content hashes. Overwriting records without version locking forces full-site regeneration and exhausts Vercel/Netlify build concurrency. Without local state, the script reprocesses unchanged records, creating duplicates and breaking preview auth.

Architecture

The sync is a stateful loop: read the cursor, pull only the delta, split draft from published, upsert with version locks, then fire targeted webhooks.

flowchart TD
  A["Read cursor from state ledger"] --> B["Query legacy WHERE updated_at > cursor"]
  B --> C{"Status?"}
  C -->|draft| D["POST to draft endpoint"]
  C -->|published| E["POST to content endpoint"]
  D --> F["Version-locked upsert (If-Match)"]
  E --> F
  F --> G{"429 or 5xx?"}
  G -->|yes| H["Backoff + jitter, retry"]
  H --> F
  G -->|no| I["Persist etag + sync_status"]
  I --> J["Advance cursor"]
  J --> K["Dispatch targeted webhook (changed slugs)"]
  K --> L["ISR revalidation / partial rebuild"]

1. Local state ledger

Run a lightweight SQLite or Redis instance as the delta ledger. Each row stores content_id, legacy_updated_at, headless_etag, and sync_status — the source of truth for whether a payload needs transmitting or can be skipped.

2. Cursor-based pagination

Query the legacy database with a monotonic cursor: WHERE updated_at > :last_cursor ORDER BY updated_at ASC LIMIT 500. Persist the cursor after each batch so syncs resume across network partitions or restarts without reprocessing history.

3. Normalize and split states

Map legacy fields to the headless schema with explicit type coercion. Route status = 'draft' to the draft endpoint and status = 'published' to production — never in one call. This preserves preview routing and keeps in-progress content out of production.

4. Version-locked idempotent upserts

Send If-Match or X-Version-Id headers to the management API to avoid clobbering concurrent edits. Retry 429 and 5xx with exponential backoff plus jitter. An idempotency key of content_id + legacy_updated_at makes duplicate transmissions resolve to the same server state.

5. Targeted webhooks

After a successful upsert, emit a webhook carrying only the affected slugs and their new version hashes, and trigger ISR revalidation or partial rebuilds rather than a full one. This keeps the build surface minimal, in line with the broader Legacy System Decoupling Strategies.

Implementation (TypeScript)

A production-safe incremental sync against a generic headless CMS — local state tracking, cursor pagination, version-locked upserts, and targeted webhook dispatch.

TypeScript
import Database from 'better-sqlite3';
import { randomInt } from 'crypto';

// Configuration & Types
interface SyncState {
  content_id: string;
  legacy_updated_at: string;
  headless_etag: string | null;
  sync_status: 'pending' | 'success' | 'failed';
}

interface WebhookPayload {
  type: 'content_delta';
  affected_slugs: string[];
  version_hashes: Record<string, string>;
  timestamp: string;
}

interface CMSResponse {
  id: string;
  etag: string;
  slug: string;
}

class HeadlessDeltaSync {
  private db: Database.Database;
  private cmsBaseUrl: string;
  private cmsApiKey: string;
  private webhookUrl: string;
  private readonly BATCH_LIMIT = 500;
  private readonly MAX_RETRIES = 5;

  constructor(dbPath: string, cmsBaseUrl: string, cmsApiKey: string, webhookUrl: string) {
    this.db = new Database(dbPath);
    this.cmsBaseUrl = cmsBaseUrl;
    this.cmsApiKey = cmsApiKey;
    this.webhookUrl = webhookUrl;
    this.initializeStateTable();
  }

  private initializeStateTable(): void {
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS sync_state (
        content_id TEXT PRIMARY KEY,
        legacy_updated_at TEXT NOT NULL,
        headless_etag TEXT,
        sync_status TEXT DEFAULT 'pending'
      );
      CREATE INDEX IF NOT EXISTS idx_updated_at ON sync_state(legacy_updated_at);
    `);
  }

  private getCursor(): string {
    const row = this.db.prepare('SELECT MAX(legacy_updated_at) as cursor FROM sync_state WHERE sync_status = ?').get('success') as { cursor: string | null };
    return row.cursor ?? '1970-01-01T00:00:00Z';
  }

  private async fetchLegacyDelta(cursor: string): Promise<Array<Record<string, any>>> {
    // Simulated legacy DB query. Replace with actual ORM/SQL driver.
    console.log(`[Delta] Fetching records updated after ${cursor}...`);
    return []; // Placeholder for actual legacy data fetch
  }

  private async upsertToCMS(record: Record<string, any>, etag: string | null): Promise<CMSResponse | null> {
    const endpoint = record.status === 'draft' 
      ? `${this.cmsBaseUrl}/api/v1/drafts` 
      : `${this.cmsBaseUrl}/api/v1/content`;

    const headers: Record<string, string> = {
      'Authorization': `Bearer ${this.cmsApiKey}`,
      'Content-Type': 'application/json',
      'Idempotency-Key': `${record.id}-${record.updated_at}`
    };

    if (etag) {
      // Enforce optimistic concurrency control per MDN specifications
      headers['If-Match'] = etag;
    }

    let attempt = 0;
    while (attempt < this.MAX_RETRIES) {
      try {
        const res = await fetch(endpoint, {
          method: 'POST',
          headers,
          body: JSON.stringify(record)
        });

        if (res.status === 429) {
          const retryAfter = parseInt(res.headers.get('Retry-After') || '5', 10);
          await new Promise(r => setTimeout(r, retryAfter * 1000 + randomInt(0, 1000)));
          attempt++;
          continue;
        }

        if (!res.ok) throw new Error(`CMS Error: ${res.status} ${res.statusText}`);

        const data = await res.json();
        return { id: data.id, etag: res.headers.get('ETag') || '', slug: data.slug };
      } catch (err) {
        attempt++;
        if (attempt >= this.MAX_RETRIES) throw err;
        const backoff = Math.pow(2, attempt) * 100 + randomInt(0, 500);
        await new Promise(r => setTimeout(r, backoff));
      }
    }
    return null;
  }

  private persistState(record: Record<string, any>, cmsResponse: CMSResponse | null): void {
    const stmt = this.db.prepare(`
      INSERT INTO sync_state (content_id, legacy_updated_at, headless_etag, sync_status)
      VALUES (?, ?, ?, ?)
      ON CONFLICT(content_id) DO UPDATE SET
        legacy_updated_at = excluded.legacy_updated_at,
        headless_etag = excluded.headless_etag,
        sync_status = excluded.sync_status
    `);
    stmt.run(
      record.id,
      record.updated_at,
      cmsResponse?.etag ?? null,
      cmsResponse ? 'success' : 'failed'
    );
  }

  private async dispatchTargetedWebhook(affectedRecords: CMSResponse[]): Promise<void> {
    if (affectedRecords.length === 0) return;

    const payload: WebhookPayload = {
      type: 'content_delta',
      affected_slugs: affectedRecords.map(r => r.slug),
      version_hashes: affectedRecords.reduce((acc, r) => ({ ...acc, [r.id]: r.etag }), {}),
      timestamp: new Date().toISOString()
    };

    await fetch(this.webhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    });
  }

  public async runSync(): Promise<void> {
    const cursor = this.getCursor();
    const deltaRecords = await this.fetchLegacyDelta(cursor);
    const successfulUpserts: CMSResponse[] = [];

    for (const record of deltaRecords) {
      const state = this.db.prepare('SELECT headless_etag FROM sync_state WHERE content_id = ?').get(record.id) as { headless_etag: string | null } | undefined;
      const currentEtag = state?.headless_etag ?? null;

      const cmsRes = await this.upsertToCMS(record, currentEtag);
      if (cmsRes) {
        this.persistState(record, cmsRes);
        successfulUpserts.push(cmsRes);
      }
    }

    await this.dispatchTargetedWebhook(successfulUpserts);
    console.log(`[Sync] Completed. Processed ${deltaRecords.length} records. Dispatched ${successfulUpserts.length} rebuild triggers.`);
  }
}

// Usage
// const sync = new HeadlessDeltaSync('./sync-state.db', 'https://cms.example.com', 'sk_live_...', 'https://frontend.example.com/api/revalidate');
// await sync.runSync();

Operational Guardrails

  • Cursor monotonicity. Index updated_at at millisecond precision (ISO 8601). Clock skew between legacy and headless nodes causes missed deltas — sync NTP across the pipeline.
  • ETag handling. Never cache If-Match responses. A 412 Precondition Failed means the record changed concurrently: refetch the headless state, merge, retry.
  • Webhook payload limits. Batch slugs but cap each webhook (e.g., 50 routes). Above that, dispatch in chunks to avoid gateway timeouts.
  • Preview token rotation. Draft upserts must keep the original preview_token or regenerate it deterministically. New tokens break deep-link sharing and review workflows.
  • Observability. Log sync_duration_ms, retry_count, 429_encounters, and cache_hit_ratio. Alert when sync_status = 'failed' exceeds 5% of a batch.

State separation, version-locked upserts, and cursor pagination eliminate full-site cache busts while holding preview sync under a second. The pipeline scales linearly with content volume and degrades gracefully under throttling.