Database extraction strategies for monolithic CMS

Direct relational extraction from monolithic platforms — WordPress, Drupal, Joomla, legacy PHP stacks — consistently produces payloads that aren’t headless-compatible. The schema was built for server-side rendering, not API-first consumption, so this page covers a read-only, idempotent pipeline that maps legacy relational structures to flat, type-safe content models without locking production, corrupting drafts, or breaking media references.

Why direct dumps fail

Monolithic databases optimize for admin-UI convenience and PHP query patterns, not strict data contracts. Content spreads across normalized tables (wp_posts, wp_postmeta, field_data_*, node_revision) with foreign keys enforced at the application layer, not the database. The critical failure points:

  1. Serialized blob storage. Custom fields, block layouts, and taxonomy relationships are stored as PHP-serialized strings or JSON blobs inside meta_value or field_data. Direct SQL dumps yield unparsed strings that break hydration and type inference.
  2. Implicit draft/revision coupling. Draft state lives in status flags (post_status = 'draft', revision tables, is_latest booleans) that don’t map to preview tokens. Extracting without state filtering merges unpublished revisions into production payloads and triggers cache-invalidation storms.
  3. Asset path drift. Media URLs are stored as relative paths, protocol-relative strings, or docroot-absolute references. Static builds and CDN edges can’t resolve them without deterministic path rewriting and protocol normalization.
  4. Missing referential integrity. Many schemas disable foreign-key constraints for performance. Naive JOIN-based extraction silently drops orphaned records, duplicates nodes, or throws null-pointer exceptions during rendering.

Building the idempotent pipeline

The pipeline reads from a replica and runs each row through five deterministic stages before emitting hash-keyed JSON:

flowchart LR
  A["Read replica / snapshot"] --> B["1. Read-only isolation"]
  B --> C["2. Explicit JOINs (filter trash/auto-draft/inherit)"]
  C --> D["3. Deserialize PHP/JSON metadata"]
  D --> E["4. Normalize asset paths (CDN, https)"]
  E --> F["5. Stream, validate, emit"]
  F --> G["Deterministic JSON keyed by content hash"]

1. Isolate read replicas

Never query a primary writer. Route extraction to a dedicated read replica or point-in-time snapshot, with read-only transaction isolation (READ COMMITTED or REPEATABLE READ) so SELECT locks don’t hit admin-UI latency or webhook-triggered rebuilds.

SQL
-- Example: Explicitly route to a read-only transaction in PostgreSQL/MySQL
SET TRANSACTION READ ONLY;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

2. Map joins explicitly

Replace implicit ORM queries with explicit LEFT JOIN chains that reconstruct content-type hierarchies, filtering out trash, auto-draft, and inherit revisions at the query level. This enforces the state boundaries that Legacy System Decoupling Strategies require before data leaves the database.

SQL
SELECT 
  p.ID AS content_id,
  p.post_title,
  p.post_content,
  p.post_status,
  p.post_date_gmt,
  pm.meta_key,
  pm.meta_value
FROM wp_posts p
LEFT JOIN wp_postmeta pm ON p.ID = pm.post_id
WHERE p.post_type IN ('post', 'page', 'custom_type')
  AND p.post_status NOT IN ('trash', 'auto-draft', 'inherit')
ORDER BY p.ID ASC, pm.meta_key ASC;

3. Deserialize metadata deterministically

Detect PHP serialization headers (a:, s:, O:, i:, b:) in meta_value and parse them into structured JSON, then validate against a strict content schema before emission. The JSON Schema specification gives the contract you need to catch malformed legacy data before the build pipeline.

TypeScript
import { unserialize } from 'php-serialize';

function parseMetaValue(raw: string): Record<string, unknown> {
  if (!raw) return {};
  
  // Detect PHP serialization signature
  if (/^[aOis]:\d+/.test(raw.trim())) {
    try {
      return unserialize(raw) as Record<string, unknown>;
    } catch {
      return { _error: 'deserialization_failed', raw };
    }
  }
  
  // Fallback to JSON or raw string
  try {
    return JSON.parse(raw);
  } catch {
    return { _raw: raw };
  }
}

4. Normalize asset paths

Rewrite guid, _wp_attached_file, or uri columns through a deterministic base-URL mapping. Generate CDN-ready absolute URLs, strip legacy docroot prefixes, and force https:// on protocol-relative URLs to avoid mixed-content warnings during SSG.

TypeScript
function normalizeAssetUrl(
  rawPath: string, 
  legacyBase: string, 
  cdnBase: string
): string {
  // Strip legacy docroot or relative prefixes
  const cleanPath = rawPath.replace(/^\/?(wp-content|sites\/default\/files)\//, '');
  
  // Enforce absolute CDN path
  const absoluteUrl = new URL(cleanPath, cdnBase).toString();
  
  return absoluteUrl;
}

5. Stream, validate, emit

Process in memory-efficient chunks — never load whole tables into RAM. Pipe rows through validation with Node.js stream or Python iterators, then emit deterministic JSON keyed by a stable content hash. Identical database states then produce identical build artifacts, avoiding needless CDN purges.

TypeScript
import { pipeline } from 'stream/promises';
import { Transform } from 'stream';

const extractionStream = new Transform({
  objectMode: true,
  transform(chunk, _encoding, callback) {
    const normalized = {
      id: `cms_${chunk.content_id}`,
      type: chunk.post_type,
      status: chunk.post_status === 'publish' ? 'published' : 'draft',
      title: chunk.post_title,
      body: chunk.post_content,
      meta: parseMetaValue(chunk.meta_value),
      assets: chunk.media_urls?.map(normalizeAssetUrl) || []
    };
    this.push(JSON.stringify(normalized));
    callback();
  }
});

// Pipe to file or GraphQL mutation queue
await pipeline(dbQueryStream, extractionStream, outputWriter);

Production hardening

  • Draft routing. Map extracted status: 'draft' nodes to isolated preview endpoints, not public SSG routes. This prevents accidental publication and matches Preview & Draft Workflow Patterns, where token-gated previews consume the same normalized payload under different routing.
  • Referential reconciliation. Cross-reference post_parent and menu_order in a post-extraction pass; flag orphaned relationships with warnings instead of failing the build.
  • Schema drift detection. Version the JSON output contract. When a legacy plugin introduces a new serialized structure, route it to a quarantine queue for manual mapping rather than breaking the frontend type system.
  • Observability. Instrument extraction latency, deserialization error rates, and asset-rewrite failures with structured JSON logging, correlated against Jamstack build durations.

Treat the monolithic database as an untrusted source and apply strict, deterministic normalization at the extraction boundary. That’s what lets you decouple legacy stores from modern frontends without data loss, draft corruption, or build instability.