Content Mapping Templates for Legacy to Headless Transition

A legacy-to-headless migration needs a deterministic mapping contract that turns implicit, template-bound legacy data into explicit, schema-validated JSON. The contract preserves referential integrity, keeps rich-text ASTs intact, and synchronizes draft/publish state across decoupled environments. Without it, ingestion scripts emit unpredictable documents that break rendering pipelines and preview environments.

Why ingestion breaks without a contract

Legacy CMS platforms store content as serialized HTML in relational tables or document stores, relying on server-side rendering to resolve shortcodes, inline styles, and implicit media dependencies. Decoupling without a strict field-to-component contract fractures that structure: WYSIWYG editors embed layout-specific markup, tracking pixels, and relative asset paths with no structured metadata. Headless platforms, by contrast, enforce normalized fields, explicit cross-content references, and finite state machines separating draft, published, and archived records.

Bypass the mapping layer and three failure modes appear:

Orphaned assets and broken references. Relative URLs and shortcode media embeds fail against headless asset CDNs, producing broken image/video nodes.
Malformed rich-text ASTs. Serialized HTML injected into rich-text fields without AST normalization violates the target node schema and crashes preview renderers.
Draft state desync. Legacy publish timestamps mapped straight to live endpoints bypass validation gates and trigger premature webhook rebuilds.

A structured mapping template enforces schema normalization, deterministic transformation, and state-aware ingestion. Within Legacy System Decoupling Strategies, this contract is the single source of truth for translation — every legacy record maps to a predictable, API-ready document.

The three-phase pipeline

Each legacy record flows through inventory, transformation, and validation before it becomes a publishable document:

flowchart LR
  A["Legacy record"] --> B["Phase 1: schema inventory & field mapping"]
  B --> C["Phase 2: ETL transform (HTML-to-AST, URL normalize)"]
  C --> D["Tag _migration_draft, push to draft endpoint"]
  D --> E["Phase 3: JSON Schema validation"]
  E --> F{"valid in staging preview?"}
  F -->|yes| G["Hold for manual publication"]
  F -->|no| H["Log to migration ledger, rollback"]

1. Schema inventory and field mapping

Extract the legacy schema and map each table/column to a target content type. The template explicitly defines:

Source data paths (SQL columns, JSON keys, serialized-HTML selectors)
Target field types (string, number, reference, rich-text, media)
Transformation functions (HTML-to-AST, URL normalization, shortcode extraction)
Validation constraints (required fields, regex patterns, enum limits)

2. Transformation pipeline execution

Build an ETL runner that consumes the template, fetches content in paginated batches, applies field-level transformations, and pushes payloads to the headless API. Tag migrated entries with a _migration_draft flag and defer publication until validation passes. Use cursor-based pagination to avoid memory exhaustion and exponential backoff for rate limits.

3. Validation and state sync

Run JSON Schema validation against the model before submission. Confirm draft entries resolve in staging by querying the preview endpoint with explicit draft parameters. Log transformation outcomes, validation failures, and API responses to a migration ledger for auditability and rollback. Strict draft isolation here — per Preview & Draft Workflow Patterns — prevents premature publication and lets frontend teams test content safely before go-live.

Mapping template and ETL runner

This JSON template is the declarative source of truth for field transformations, defining source paths, target types, AST conversion, and draft handling.

{
  "mappingTemplate": {
    "version": "1.0.0",
    "sourceSystem": "legacyRelationalCMS",
    "targetSystem": "headlessCMS",
    "contentTypes": {
      "article": {
        "sourceTable": "wp_posts",
        "targetModel": "blog_post",
        "fieldMappings": [
          {
            "sourcePath": "post_title",
            "targetField": "title",
            "type": "string",
            "transform": null,
            "validation": { "required": true, "maxLength": 255 }
          },
          {
            "sourcePath": "post_content",
            "targetField": "body",
            "type": "rich_text",
            "transform": "html_to_ast",
            "validation": { "required": true, "allowedNodes": ["paragraph", "heading", "image", "link"] }
          },
          {
            "sourcePath": "post_excerpt",
            "targetField": "summary",
            "type": "string",
            "transform": "strip_html",
            "validation": { "maxLength": 160 }
          },
          {
            "sourcePath": "featured_image_id",
            "targetField": "hero_image",
            "type": "reference",
            "transform": "resolve_asset_url",
            "validation": { "required": false }
          }
        ],
        "stateConfig": {
          "draftFlag": "_migration_draft",
          "publishStrategy": "deferred",
          "previewEndpoint": "/api/preview?draft=true"
        }
      }
    }
  }
}

The runner consumes the template, applies transformations, validates payloads, and ingests with draft isolation.

import { readFileSync } from 'fs';
import Ajv from 'ajv';
import addFormats from 'ajv-formats';

// Load mapping template and target JSON Schema
const mappingTemplate = JSON.parse(readFileSync('./mapping-template.json', 'utf-8'));
const targetSchema = JSON.parse(readFileSync('./headless-cms-schema.json', 'utf-8'));

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const validate = ajv.compile(targetSchema);

interface LegacyRecord {
  id: number;
  post_title: string;
  post_content: string;
  post_excerpt: string;
  featured_image_id: string | null;
}

// Mock transformation functions (replace with production implementations)
const transforms = {
  html_to_ast: (html: string) => {
    // Use standards-compliant W3C HTML Parsing rules to convert to AST
    // Implementation typically leverages rehype/remark or custom DOMParser
    return { type: 'root', children: [{ type: 'paragraph', value: html }] };
  },
  strip_html: (html: string) => html.replace(/<[^>]*>/g, ''),
  resolve_asset_url: (id: string) => `https://cdn.example.com/assets/${id}.webp`
};

async function executeMigrationBatch(records: LegacyRecord[]) {
  const results = { success: 0, failed: 0, errors: [] as string[] };

  for (const record of records) {
    const payload: Record<string, unknown> = {
      _migration_draft: true, // Enforce draft isolation
      legacy_id: record.id
    };

    try {
      // Apply field mappings
      for (const mapping of mappingTemplate.mappingTemplate.contentTypes.article.fieldMappings) {
        const rawValue = record[mapping.sourcePath as keyof LegacyRecord];
        if (rawValue === undefined || rawValue === null) {
          if (mapping.validation.required) throw new Error(`Missing required field: ${mapping.sourcePath}`);
          continue;
        }

        const transformed = mapping.transform 
          ? transforms[mapping.transform as keyof typeof transforms](rawValue as string) 
          : rawValue;

        payload[mapping.targetField] = transformed;
      }

      // Validate against headless CMS schema
      const isValid = validate(payload);
      if (!isValid) {
        throw new Error(`Schema validation failed: ${JSON.stringify(validate.errors)}`);
      }

      // Push to headless CMS API (draft endpoint)
      await fetch('https://api.headless-cms.com/v1/content/blog_post', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Authorization': `Bearer ${process.env.CMS_TOKEN}` },
        body: JSON.stringify(payload)
      });

      results.success++;
    } catch (error) {
      results.failed++;
      results.errors.push(`Record ${record.id}: ${(error as Error).message}`);
    }
  }

  return results;
}

Validation checklist

AST node compliance. The html_to_ast transformer must strip legacy inline styles (style="...") and convert deprecated tags (<font>, <center>) to semantic equivalents before serialization.
Reference resolution. Resolve asset references synchronously or queue them for background processing; unresolved references throw 404s in preview.
Draft state verification. Query the preview endpoint with ?draft=true right after ingestion and confirm _migration_draft: true blocks public API exposure and suppresses webhook rebuilds until manual publication.
Idempotency. Make legacy_id a unique constraint in the target CMS so pipeline retries don’t create duplicates.

Enforcing this contract removes guesswork from migration, guarantees schema compliance at ingestion, and keeps full control over the draft-to-publish workflow.