Database extraction strategies for monolithic CMS
Direct relational extraction from monolithic platforms — WordPress, Drupal, Joomla, legacy PHP stacks — consistently produces payloads that aren’t headless-compatible. The schema was built for server-side rendering, not API-first consumption, so this page covers a read-only, idempotent pipeline that maps legacy relational structures to flat, type-safe content models without locking production, corrupting drafts, or breaking media references.
Why direct dumps fail
Monolithic databases optimize for admin-UI convenience and PHP query patterns, not strict data contracts. Content spreads across normalized tables (wp_posts, wp_postmeta, field_data_*, node_revision) with foreign keys enforced at the application layer, not the database. The critical failure points:
- Serialized blob storage. Custom fields, block layouts, and taxonomy relationships are stored as PHP-serialized strings or JSON blobs inside
meta_valueorfield_data. Direct SQL dumps yield unparsed strings that break hydration and type inference. - Implicit draft/revision coupling. Draft state lives in status flags (
post_status = 'draft', revision tables,is_latestbooleans) that don’t map to preview tokens. Extracting without state filtering merges unpublished revisions into production payloads and triggers cache-invalidation storms. - Asset path drift. Media URLs are stored as relative paths, protocol-relative strings, or docroot-absolute references. Static builds and CDN edges can’t resolve them without deterministic path rewriting and protocol normalization.
- Missing referential integrity. Many schemas disable foreign-key constraints for performance. Naive
JOIN-based extraction silently drops orphaned records, duplicates nodes, or throws null-pointer exceptions during rendering.
Building the idempotent pipeline
The pipeline reads from a replica and runs each row through five deterministic stages before emitting hash-keyed JSON:
flowchart LR A["Read replica / snapshot"] --> B["1. Read-only isolation"] B --> C["2. Explicit JOINs (filter trash/auto-draft/inherit)"] C --> D["3. Deserialize PHP/JSON metadata"] D --> E["4. Normalize asset paths (CDN, https)"] E --> F["5. Stream, validate, emit"] F --> G["Deterministic JSON keyed by content hash"]
1. Isolate read replicas
Never query a primary writer. Route extraction to a dedicated read replica or point-in-time snapshot, with read-only transaction isolation (READ COMMITTED or REPEATABLE READ) so SELECT locks don’t hit admin-UI latency or webhook-triggered rebuilds.
-- Example: Explicitly route to a read-only transaction in PostgreSQL/MySQL
SET TRANSACTION READ ONLY;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2. Map joins explicitly
Replace implicit ORM queries with explicit LEFT JOIN chains that reconstruct content-type hierarchies, filtering out trash, auto-draft, and inherit revisions at the query level. This enforces the state boundaries that Legacy System Decoupling Strategies require before data leaves the database.
SELECT
p.ID AS content_id,
p.post_title,
p.post_content,
p.post_status,
p.post_date_gmt,
pm.meta_key,
pm.meta_value
FROM wp_posts p
LEFT JOIN wp_postmeta pm ON p.ID = pm.post_id
WHERE p.post_type IN ('post', 'page', 'custom_type')
AND p.post_status NOT IN ('trash', 'auto-draft', 'inherit')
ORDER BY p.ID ASC, pm.meta_key ASC;
3. Deserialize metadata deterministically
Detect PHP serialization headers (a:, s:, O:, i:, b:) in meta_value and parse them into structured JSON, then validate against a strict content schema before emission. The JSON Schema specification gives the contract you need to catch malformed legacy data before the build pipeline.
import { unserialize } from 'php-serialize';
function parseMetaValue(raw: string): Record<string, unknown> {
if (!raw) return {};
// Detect PHP serialization signature
if (/^[aOis]:\d+/.test(raw.trim())) {
try {
return unserialize(raw) as Record<string, unknown>;
} catch {
return { _error: 'deserialization_failed', raw };
}
}
// Fallback to JSON or raw string
try {
return JSON.parse(raw);
} catch {
return { _raw: raw };
}
}
4. Normalize asset paths
Rewrite guid, _wp_attached_file, or uri columns through a deterministic base-URL mapping. Generate CDN-ready absolute URLs, strip legacy docroot prefixes, and force https:// on protocol-relative URLs to avoid mixed-content warnings during SSG.
function normalizeAssetUrl(
rawPath: string,
legacyBase: string,
cdnBase: string
): string {
// Strip legacy docroot or relative prefixes
const cleanPath = rawPath.replace(/^\/?(wp-content|sites\/default\/files)\//, '');
// Enforce absolute CDN path
const absoluteUrl = new URL(cleanPath, cdnBase).toString();
return absoluteUrl;
}
5. Stream, validate, emit
Process in memory-efficient chunks — never load whole tables into RAM. Pipe rows through validation with Node.js stream or Python iterators, then emit deterministic JSON keyed by a stable content hash. Identical database states then produce identical build artifacts, avoiding needless CDN purges.
import { pipeline } from 'stream/promises';
import { Transform } from 'stream';
const extractionStream = new Transform({
objectMode: true,
transform(chunk, _encoding, callback) {
const normalized = {
id: `cms_${chunk.content_id}`,
type: chunk.post_type,
status: chunk.post_status === 'publish' ? 'published' : 'draft',
title: chunk.post_title,
body: chunk.post_content,
meta: parseMetaValue(chunk.meta_value),
assets: chunk.media_urls?.map(normalizeAssetUrl) || []
};
this.push(JSON.stringify(normalized));
callback();
}
});
// Pipe to file or GraphQL mutation queue
await pipeline(dbQueryStream, extractionStream, outputWriter);
Production hardening
- Draft routing. Map extracted
status: 'draft'nodes to isolated preview endpoints, not public SSG routes. This prevents accidental publication and matches Preview & Draft Workflow Patterns, where token-gated previews consume the same normalized payload under different routing. - Referential reconciliation. Cross-reference
post_parentandmenu_orderin a post-extraction pass; flag orphaned relationships with warnings instead of failing the build. - Schema drift detection. Version the JSON output contract. When a legacy plugin introduces a new serialized structure, route it to a quarantine queue for manual mapping rather than breaking the frontend type system.
- Observability. Instrument extraction latency, deserialization error rates, and asset-rewrite failures with structured JSON logging, correlated against Jamstack build durations.
Treat the monolithic database as an untrusted source and apply strict, deterministic normalization at the extraction boundary. That’s what lets you decouple legacy stores from modern frontends without data loss, draft corruption, or build instability.