production

PostgreSQL dblink

Cross-database queries and data migration using PostgreSQL dblink extension

#postgres#database#sql#dblink#migration

Migrating data between PostgreSQL databases usually means pg_dump and pg_restore. But what if you need to sync live databases? Or copy a subset of rows based on a query? That’s where dblink comes in.

dblink lets you query remote PostgreSQL databases from within SQL. It works with AWS RDS out of the box. Enable the extension first:

CREATE EXTENSION IF NOT EXISTS dblink;

Then create a named connection:

SELECT dblink_connect('src', 'host=remote.host dbname=db user=postgres password=secret');
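
Once connected, queries against the remote database run from plain SQL. dblink() returns SETOF record, so you always supply a column definition list for the result. A quick sanity check against the documents table used throughout this post:

SELECT * FROM dblink('src', 'SELECT COUNT(*) FROM documents') AS t(n bigint);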

Copying Data Between Databases

The simplest case: pull rows from source, skip duplicates.

dblink() fetches the entire result set before returning; for large datasets, use cursors instead (more on that below).

INSERT INTO documents (id, content, created, modified)
SELECT * FROM dblink('src',
  $$
    SELECT id, content, created, modified
    FROM documents
    WHERE modified > NOW() - INTERVAL '14 days'
  $$
) AS t(
  id char(32),
  content text,
  created timestamptz,
  modified timestamptz
)
ON CONFLICT (id) DO NOTHING;

SELECT dblink_disconnect('src');

This works for small datasets. But what happens when you’re moving millions of rows?

Batched Migration with Cursors

dblink() loads everything into memory before returning. For large migrations, this either crashes or spools to disk. Cursors solve this by fetching in batches:

SELECT dblink_connect('src', 'host=remote.host dbname=db user=postgres password=secret');

SELECT dblink_open('src', 'cursor',
  $$
    SELECT id, content, created, modified
    FROM documents
    WHERE modified > NOW() - INTERVAL '14 days'
  $$
);

CREATE OR REPLACE PROCEDURE migrate_data() LANGUAGE plpgsql AS $$
DECLARE
  batch_size INT := 5000;
  rows_fetched_n INT;
  rows_inserted_n INT;
  rows_n INT := 0;
BEGIN
  LOOP
    -- fetch the next batch from the remote cursor
    WITH rows_fetched AS (
      SELECT *
      FROM dblink_fetch('src', 'cursor', batch_size) AS t(
        id char(32),
        content text,
        created timestamptz,
        modified timestamptz
      )
    ),
    rows_inserted AS (
      INSERT INTO documents (id, content, created, modified)
      SELECT t.id, t.content, t.created, t.modified
      FROM rows_fetched t
      ON CONFLICT (id) DO NOTHING
      RETURNING 1
    )
    SELECT
      (SELECT COUNT(*) FROM rows_fetched),
      (SELECT COUNT(*) FROM rows_inserted)
    INTO rows_fetched_n, rows_inserted_n;

    rows_n := rows_n + rows_inserted_n;
    RAISE NOTICE 'Batch: fetched=%, inserted=%, total=%', rows_fetched_n, rows_inserted_n, rows_n;
    COMMIT;
    -- a short batch means the remote cursor is exhausted
    EXIT WHEN rows_fetched_n < batch_size;
  END LOOP;
  RAISE NOTICE 'Total rows transferred: %', rows_n;
END $$;

CALL migrate_data();
DROP PROCEDURE migrate_data();

SELECT dblink_close('src', 'cursor');
SELECT dblink_disconnect('src');

Why a procedure instead of DO $$? Anonymous blocks couldn't execute transaction control at all before PostgreSQL 11, and even now a DO block can only COMMIT when it runs outside an explicit transaction block. Without intermediate commits, a million-row migration becomes one giant transaction holding locks until completion, with a complete rollback if anything fails. The procedure's per-batch COMMIT lets each batch persist independently.

Why the CTE pattern? A subtle bug lurks in the obvious alternative. GET DIAGNOSTICS ... ROW_COUNT after the INSERT returns the number of rows inserted, not fetched. With ON CONFLICT DO NOTHING, duplicates are skipped: fetch 5000 rows of which 3000 are duplicates and ROW_COUNT reports 2000, so the loop exits early, thinking the cursor is exhausted.
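
For contrast, here is a sketch of the loop body you might write first (same procedure skeleton, but with a single batch_n counter in the DECLARE section):

  LOOP
    -- insert straight from dblink_fetch, skipping duplicates
    INSERT INTO documents (id, content, created, modified)
    SELECT * FROM dblink_fetch('src', 'cursor', batch_size) AS t(
      id char(32),
      content text,
      created timestamptz,
      modified timestamptz
    )
    ON CONFLICT (id) DO NOTHING;

    -- ROW_COUNT reflects the INSERT: rows skipped by ON CONFLICT are not counted
    GET DIAGNOSTICS batch_n = ROW_COUNT;
    COMMIT;
    -- wrong: a batch full of duplicates looks like an exhausted cursor
    EXIT WHEN batch_n < batch_size;
  END LOOP;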

The CTE pattern counts fetched rows separately from inserted rows, so the loop only stops when the remote cursor actually runs dry.

A batch size of 5000 is a starting point. Tune it based on row size and network latency: larger rows call for smaller batches to keep memory in check, while high-latency links favor larger batches to amortize round trips.

Delta Sync: Transfer Only Missing Rows

OK, cursors handle memory. But we're still transferring every row only to discard duplicates on arrival. What if the destination already has 90% of the data?

Compute the diff where the data lives. Transfer only what’s missing.

The trick: IDs are small, full rows are large. Pull the destination's IDs over to the source, filter there, then have the destination pull only the delta.

Step 1 - On source: Pull destination IDs, create filtered view

SELECT dblink_connect('dst', 'host=dst.host dbname=db user=u password=p');

CREATE TABLE migration_ids AS
SELECT id::char(32) AS id
FROM dblink('dst', $$ SELECT id FROM documents $$) AS t(id char(32));

CREATE INDEX ON migration_ids(id);
SELECT dblink_disconnect('dst');

CREATE VIEW migration_documents AS
SELECT d.id, d.content, d.created, d.modified
FROM documents d
LEFT JOIN migration_ids mi ON mi.id = d.id
WHERE mi.id IS NULL
  AND d.modified > NOW() - INTERVAL '90 days';
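
Before moving on, it's worth checking on the source how big the delta actually is:

SELECT COUNT(*) AS rows_to_transfer FROM migration_documents;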

Step 2 - On destination: Pull from pre-filtered view

SELECT dblink_connect('src', 'host=src.host dbname=db user=u password=p');

SELECT dblink_open('src', 'cursor', $$ SELECT * FROM migration_documents $$);

-- Batched insert (same pattern as above)
-- ...

SELECT dblink_close('src', 'cursor');
SELECT dblink_disconnect('src');

Step 3 - Cleanup on source:

DROP VIEW migration_documents;
DROP TABLE migration_ids;

This two-phase approach pays off when syncing databases that share most of their data. Instead of transferring everything and discarding 90%, you transfer only the 10% that’s actually missing.

One caveat: rows added or modified on the source after the sync starts can be missed by this pass. For live sync scenarios, run the delta sync periodically or accept eventual consistency.
