Query Planning and Optimization

What is the difference between a CTE and a subquery in terms of optimization?

The 30-Second Answer: Before PostgreSQL 12, CTEs (Common Table Expressions) were optimization fences: always materialized and executed once. Since PostgreSQL 12, a non-recursive, side-effect-free CTE that is referenced only once is inlined by default and optimized like a subquery. Subqueries have always been open to the planner, which can inline them, flatten them into joins, or push predicates into them. Use CTEs for readability, recursion, or (with MATERIALIZED) guaranteed single execution; use subqueries or NOT MATERIALIZED when you want the optimizer to have maximum flexibility.

The 2-Minute Answer (If They Want More): In PostgreSQL versions before 12, CTEs were always materialized (executed once, with the result held in a temporary store that spills to disk once it outgrows work_mem), acting as optimization barriers. The planner could not "see through" a CTE to optimize it with the outer query. This was beneficial when you wanted to ensure a subquery executed only once (to avoid repeated expensive computations) but harmful when the optimizer could have made better decisions by considering the full query context.

PostgreSQL 12 introduced automatic CTE inlining: a CTE that is non-recursive, has no side effects (a plain SELECT with no volatile functions), and is referenced only once is now inlined and optimized together with the main query, just like a subquery. A CTE referenced more than once is still materialized by default. You can force materialization with the MATERIALIZED keyword, or ask for inlining of a multiply-referenced CTE with NOT MATERIALIZED.
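The difference between the two strategies can be sketched with a toy Python model. This is not how PostgreSQL is implemented; the `ORDERS` list, the function names, and the predicates are all made up for illustration. The point it shows is that a materialized CTE is computed and stored in full before the outer filter runs, while an inlined CTE lets both conditions be applied in one pass.

```python
# Toy model of CTE materialization vs inlining; not PostgreSQL internals.
ORDERS = [{"id": i, "customer_id": i % 100, "total": i * 1.5} for i in range(1, 1001)]

def run_materialized(cte_pred, outer_pred):
    """Fence: evaluate the CTE fully, store it, then filter the stored rows."""
    stored = [row for row in ORDERS if cte_pred(row)]      # what a "CTE Scan" reads
    result = [row for row in stored if outer_pred(row)]
    return result, len(stored)                              # rows materialized

def run_inlined(cte_pred, outer_pred):
    """Inlined: both predicates are applied together, nothing is stored."""
    result = [row for row in ORDERS if cte_pred(row) and outer_pred(row)]
    return result, 0

def recent(row):            # stands in for the CTE's WHERE clause
    return row["id"] > 500

def one_customer(row):      # stands in for the outer query's WHERE clause
    return row["customer_id"] == 23

mat_rows, mat_stored = run_materialized(recent, one_customer)
inl_rows, inl_stored = run_inlined(recent, one_customer)
print(len(mat_rows), mat_stored, inl_stored)
```

Both strategies return the same rows; only the amount of intermediate work differs. In a real database the inlined form has a further advantage the toy cannot show: the combined condition may be answered from an index.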

Subqueries, on the other hand, have always been subject to full optimization. The planner can inline them, transform them into joins, push down predicates, or choose different execution strategies based on the overall query plan. This gives maximum flexibility but means a correlated subquery in a WHERE clause might execute once per outer row (nested loop behavior).

The key differences: CTEs provide query organization and can prevent redundant execution when used multiple times in a query. They're useful for recursive queries (which require CTE syntax) and when you explicitly want to materialize an expensive result once. Subqueries offer better optimization potential, especially for simple filters or EXISTS clauses, and the optimizer can choose the most efficient execution strategy.

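For write operations (INSERT/UPDATE/DELETE in CTEs), the statement is never inlined: it runs exactly once and always to completion, whether or not the outer query reads all of its output. Note that the sub-statements of a WITH clause execute concurrently against the same snapshot, so the order in which their changes are applied is not predictable; the reliable way to pass results between them is RETURNING. This is one area where CTEs have capabilities that subqueries cannot replicate.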

In practice: use subqueries for simple filters and EXISTS checks where you want maximum optimization. Use CTEs for complex queries that reference the same result multiple times, for recursive queries, or for chaining data-modifying statements through RETURNING. In PostgreSQL 12+, use MATERIALIZED when you want guaranteed single execution of an expensive operation, and NOT MATERIALIZED when you want to ensure optimizer flexibility.

Code Example:

-- Setup test data
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT,
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP
);

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    country VARCHAR(50)
);

INSERT INTO orders (customer_id, total_amount, created_at)
SELECT
    floor(random() * 1000)::INT + 1,  -- 1..1000, matching customers.id
    random() * 1000,
    NOW() - (random() * 365 || ' days')::INTERVAL
FROM generate_series(1, 100000);

INSERT INTO customers (name, country)
SELECT
    'Customer ' || i,
    CASE (random() * 5)::INT
        WHEN 0 THEN 'USA'
        WHEN 1 THEN 'UK'
        WHEN 2 THEN 'Canada'
        ELSE 'Other'
    END
FROM generate_series(1, 1000) i;

ANALYZE orders;
ANALYZE customers;


-- 1. CTE (PostgreSQL < 12: always materialized)
-- (PostgreSQL >= 12: may be inlined unless MATERIALIZED keyword used)
EXPLAIN ANALYZE
WITH high_value_customers AS (
    SELECT customer_id, SUM(total_amount) as total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total_amount) > 5000
)
SELECT c.name, h.total
FROM customers c
JOIN high_value_customers h ON c.id = h.customer_id;

-- PostgreSQL 12+ might inline this, but you can force materialization:
EXPLAIN ANALYZE
WITH high_value_customers AS MATERIALIZED (
    SELECT customer_id, SUM(total_amount) as total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total_amount) > 5000
)
SELECT c.name, h.total
FROM customers c
JOIN high_value_customers h ON c.id = h.customer_id;


-- 2. Equivalent subquery (always optimizable)
EXPLAIN ANALYZE
SELECT c.name, sub.total
FROM customers c
JOIN (
    SELECT customer_id, SUM(total_amount) as total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total_amount) > 5000
) sub ON c.id = sub.customer_id;

-- Optimizer can inline and optimize with outer query


-- 3. Multiple references - CTE advantage
-- With CTE (computed once)
EXPLAIN ANALYZE
WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        SUM(total_amount) as total
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    current.month,
    current.total as current_total,
    previous.total as previous_total,
    current.total - previous.total as growth
FROM monthly_sales current
LEFT JOIN monthly_sales previous
    ON previous.month = current.month - INTERVAL '1 month'
ORDER BY current.month;

-- With subquery (might compute twice - inefficient!)
EXPLAIN ANALYZE
SELECT
    current.month,
    current.total as current_total,
    previous.total as previous_total,
    current.total - previous.total as growth
FROM (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        SUM(total_amount) as total
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
) current
LEFT JOIN (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        SUM(total_amount) as total
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
) previous ON previous.month = current.month - INTERVAL '1 month'
ORDER BY current.month;
-- This scans orders table twice!


-- 4. Forcing CTE to NOT materialize (PostgreSQL 12+)
EXPLAIN ANALYZE
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT * FROM recent_orders WHERE total_amount > 100;

-- Optimizer will inline this and potentially push down the total_amount filter


-- 5. Correlated subquery vs CTE
-- Correlated subquery (executes per outer row - can be slow)
EXPLAIN ANALYZE
SELECT
    c.name,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) as order_count
FROM customers c
WHERE c.country = 'USA';

-- CTE alternative (executes once)
EXPLAIN ANALYZE
WITH order_counts AS (
    SELECT customer_id, COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, COALESCE(oc.order_count, 0)
FROM customers c
LEFT JOIN order_counts oc ON c.id = oc.customer_id
WHERE c.country = 'USA';

-- Or subquery with JOIN (optimizable)
EXPLAIN ANALYZE
SELECT c.name, COALESCE(oc.order_count, 0)
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id
) oc ON c.id = oc.customer_id
WHERE c.country = 'USA';


-- 6. Recursive CTE (only possible with CTE syntax)
-- Assumes an employees(id, name, manager_id) table exists
WITH RECURSIVE org_hierarchy AS (
    -- Base case
    SELECT id, name, 1 as level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case
    SELECT e.id, e.name, oh.level + 1
    FROM employees e
    JOIN org_hierarchy oh ON e.manager_id = oh.id
)
SELECT * FROM org_hierarchy;

-- Cannot be done with subqueries!


-- 7. Data-modifying CTE (never inlined)
-- Assumes orders_archive has the same columns as orders
WITH deleted_orders AS (
    DELETE FROM orders
    WHERE created_at < NOW() - INTERVAL '5 years'
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM deleted_orders;

-- The DELETE runs exactly once, to completion; the INSERT sees the deleted rows only through RETURNING


-- 8. Performance comparison: EXISTS vs CTE
-- EXISTS subquery (often faster for checking existence)
EXPLAIN ANALYZE
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id AND o.total_amount > 1000
);

-- CTE alternative
EXPLAIN ANALYZE
WITH high_value_orders AS (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE total_amount > 1000
)
SELECT c.name
FROM customers c
JOIN high_value_orders h ON c.id = h.customer_id;


-- 9. Controlling optimization with MATERIALIZED keyword
-- Force materialization for expensive operation you want once
EXPLAIN ANALYZE
WITH expensive_calc AS MATERIALIZED (
    SELECT
        customer_id,
        AVG(total_amount) as avg_amount,
        STDDEV(total_amount) as stddev_amount,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) as median
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM expensive_calc WHERE avg_amount > 500;

-- Prevent materialization to allow optimizer flexibility
EXPLAIN ANALYZE
WITH filtered_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > '2024-01-01'
)
SELECT * FROM filtered_orders WHERE customer_id = 123;
-- Optimizer can combine both WHERE conditions efficiently


-- 10. Viewing actual CTE materialization in plan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH sales_summary AS (
    SELECT customer_id, COUNT(*) as cnt
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, s.cnt
FROM customers c
JOIN sales_summary s ON c.id = s.customer_id;

-- Look for "CTE sales_summary" in the plan
-- If you see "CTE Scan on sales_summary", it was materialized
-- If you see the CTE inlined into the main plan, it was optimized away

CTE vs Subquery Decision Tree:

graph TD
    A[Need to Extract Part of Query] --> B{What's the use case?}

    B -->|Recursive query| C[MUST use CTE<br/>WITH RECURSIVE]
    B -->|Data modification| D[MUST use CTE<br/>for RETURNING]
    B -->|Referenced multiple times| E{Is operation expensive?}
    B -->|Referenced once| F{Need optimization control?}

    E -->|Yes| G[Use CTE with MATERIALIZED<br/>Execute once, reuse result]
    E -->|No| H[Either works<br/>CTE for readability]

    F -->|Yes - force inline| I[Use CTE NOT MATERIALIZED<br/>or subquery]
    F -->|No - want flexibility| J[Use subquery<br/>Let optimizer decide]

    K{Query complexity?} --> L[Simple filter/EXISTS]
    K --> M[Complex aggregation]
    K --> N[Multiple self-joins]

    L --> O[Use subquery<br/>Better optimization]
    M --> P[Use CTE<br/>Better readability]
    N --> G

    style C fill:#ffe1e1
    style D fill:#ffe1e1
    style G fill:#e1f5ff
    style J fill:#e1ffe1

Optimization Behavior:

graph TB
    A[PostgreSQL 12+ Behavior] --> B[CTE]
    A --> C[Subquery]

    B --> B1{CTE Properties}
    B1 -->|Referenced once + no side effects| B2[May be inlined<br/>default behavior]
    B1 -->|WITH ... MATERIALIZED| B3[Always materialized<br/>Execute once]
    B1 -->|WITH ... NOT MATERIALIZED| B4[Always inlined<br/>Optimize freely]
    B1 -->|Referenced multiple times| B5[Usually materialized<br/>Avoid recomputation]
    B1 -->|Has side effects INSERT/UPDATE/DELETE| B6[Always materialized<br/>Runs exactly once]

    C --> C1[Subquery Properties]
    C1 --> C2[Always subject to<br/>full optimization]
    C2 --> C3[Can be inlined]
    C2 --> C4[Can be converted to JOIN]
    C2 --> C5[Can have predicates<br/>pushed down]
    C2 --> C6[Execution strategy varies<br/>by optimizer]

    style B3 fill:#ffe1e1
    style B4 fill:#e1ffe1
    style C2 fill:#e1f5ff

Performance Characteristics:

| Aspect | CTE (MATERIALIZED) | CTE (NOT MATERIALIZED) | Subquery |
|---|---|---|---|
| Optimization | Limited | Full | Full |
| Referenced Multiple Times | ✅ Efficient (once) | ❌ May recompute | ❌ May recompute |
| Recursive Queries | ✅ Only option | ⚠️ Hint ignored | ❌ Not supported |
| Readability | ✅ High | ✅ High | ⚠️ Medium |
| Optimizer Flexibility | ❌ Blocked | ✅ Full | ✅ Full |
| Data Modification | ✅ Supported | ⚠️ Hint ignored | ❌ Not in FROM |
| Memory Usage | ⚠️ Materializes | ✅ Minimal | ✅ Minimal |

What is the difference between sequential scan, index scan, and bitmap scan?

The 30-Second Answer: Sequential scan reads all table rows in order (full table scan). Index scan uses an index to find specific rows and fetches them individually. Bitmap scan combines the benefits of both: it scans the index to build a bitmap of matching rows, then fetches them from the table in physical order, which is more efficient for retrieving many rows.

The 2-Minute Answer (If They Want More): A sequential scan reads every row in a table from start to finish in physical storage order. It's efficient for small tables or when retrieving a large percentage of rows (typically >5-10%) because it involves sequential I/O operations, which are faster than random access. Sequential scans don't require indexes and avoid the overhead of index lookups.
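The cost side of this trade-off can be made concrete. A minimal sketch, assuming the planner's default cost parameters (`seq_page_cost` = 1.0, `cpu_tuple_cost` = 0.01, `cpu_operator_cost` = 0.0025): a sequential scan is charged one sequential page read per page, plus a per-row CPU charge, plus one operator evaluation per row for each WHERE clause. The function name is hypothetical, and index-scan costing is left out because it is considerably more involved.

```python
# Default planner cost parameters in PostgreSQL.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025

def seq_scan_cost(pages, rows, filter_clauses=0):
    """Estimated total cost of a sequential scan (hypothetical helper)."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * CPU_TUPLE_COST + rows * CPU_OPERATOR_COST * filter_clauses
    return io_cost + cpu_cost

# A 358-page, 10,000-row table, with and without one WHERE clause.
no_filter = seq_scan_cost(358, 10_000)
one_filter = seq_scan_cost(358, 10_000, filter_clauses=1)
print(round(no_filter, 2), round(one_filter, 2))
```

For this table the arithmetic gives 458 without a filter and 483 with one. An index scan, by contrast, is charged `random_page_cost` (default 4.0) for each heap page it fetches out of order, which is why it only wins when few rows match.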

An index scan uses a B-tree or other index structure to locate specific rows matching the query condition. For each matching index entry, PostgreSQL performs a random read to fetch the actual table row (heap tuple). This is efficient when retrieving a small number of rows, but becomes expensive with many rows because each row fetch is a separate random I/O operation. Index scans can return results in sorted order if using a B-tree index.

A bitmap scan is a hybrid approach used when retrieving a moderate number of rows (roughly 1-10% of the table, though the planner decides by cost rather than by a fixed threshold). It works in two phases. First, it scans the index and builds an in-memory bitmap of the locations of matching rows (bitmap index scan); if the bitmap would outgrow work_mem, it degrades to tracking whole pages, which is why plans show a Recheck Cond. Second, it visits the marked pages in physical order (bitmap heap scan). This dramatically reduces random I/O by fetching each page only once, even if multiple matching rows exist on that page.

Bitmap scans can also efficiently combine multiple indexes using AND/OR operations (BitmapAnd, BitmapOr nodes). PostgreSQL builds a bitmap for each index condition, then performs bitmap operations before accessing the table. This is more efficient than combining results from multiple separate index scans. The trade-off is that bitmap scans don't preserve index ordering and require additional memory for the bitmap structure.
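The two phases and the BitmapAnd step can be sketched in a few lines of Python. This is a toy model under stated assumptions: the table contents, the three-rows-per-page layout, and every name below are made up, and the "indexes" are plain dictionaries rather than B-trees. It models the exact (per-row) bitmap only, not the lossy per-page fallback.

```python
from collections import defaultdict

ROWS_PER_PAGE = 3
# (category, in_stock) in physical heap order; made-up data.
TABLE = [
    ("electronics", True), ("electronics", False), ("other", True),
    ("other", True), ("books", True), ("other", False),
    ("electronics", True), ("other", True), ("electronics", True),
    ("books", False), ("other", True), ("other", True),
    ("electronics", False), ("books", True), ("other", False),
]

heap = defaultdict(dict)          # page -> {slot: row}
category_idx = defaultdict(set)   # value -> {(page, slot)}
stock_idx = defaultdict(set)
for row_id, (category, in_stock) in enumerate(TABLE):
    page, slot = divmod(row_id, ROWS_PER_PAGE)
    heap[page][slot] = (row_id, category, in_stock)
    category_idx[category].add((page, slot))
    stock_idx[in_stock].add((page, slot))

def bitmap_index_scan(index, value):
    """Phase 1: collect matching row locations; the heap is not touched."""
    return set(index[value])

def bitmap_heap_scan(bitmap):
    """Phase 2: visit each needed page once, in physical order."""
    slots_by_page = defaultdict(list)
    for page, slot in bitmap:
        slots_by_page[page].append(slot)
    rows, pages_read = [], 0
    for page in sorted(slots_by_page):
        pages_read += 1
        rows.extend(heap[page][slot] for slot in sorted(slots_by_page[page]))
    return rows, pages_read

electronics = bitmap_index_scan(category_idx, "electronics")
in_stock = bitmap_index_scan(stock_idx, True)

rows_single, pages_single = bitmap_heap_scan(electronics)
rows_both, pages_both = bitmap_heap_scan(electronics & in_stock)  # BitmapAnd
print(pages_single, pages_both)
```

With one condition, five matching rows cost three page reads instead of five separate row fetches. Intersecting the two bitmaps before touching the heap drops that to two page reads, which is the saving BitmapAnd is designed to deliver.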

Code Example:

-- Create test table with different data distributions
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    category VARCHAR(50),
    price DECIMAL(10,2),
    in_stock BOOLEAN,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_products_price ON products(price);
CREATE INDEX idx_products_stock ON products(in_stock);

-- Insert test data
INSERT INTO products (category, price, in_stock)
SELECT
    CASE floor(random() * 10)::INT  -- floor gives uniform 0..9, so each named category is ~10% of rows
        WHEN 0 THEN 'electronics'
        WHEN 1 THEN 'books'
        WHEN 2 THEN 'clothing'
        ELSE 'other'
    END,
    random() * 1000,
    random() > 0.3
FROM generate_series(1, 100000);

ANALYZE products;

-- 1. SEQUENTIAL SCAN - retrieving most rows
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE price > 100;
-- Output: Seq Scan on products (cost=0.00..2179.00 rows=90000 width=...)
-- Used because we're retrieving ~90% of rows

-- 2. INDEX SCAN - retrieving few rows in order
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE category = 'electronics' ORDER BY id;
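-- Possible output: Index Scan using products_pkey on products (cost=0.29..850.50 rows=10000 width=...)
-- Walking the primary key returns rows already ordered by id;
-- depending on costs, the planner may instead pick a bitmap scan followed by a sort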

-- 3. BITMAP SCAN - retrieving moderate number of rows
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE category = 'electronics';
-- Output:
-- Bitmap Heap Scan on products (cost=184.50..1456.75 rows=10000 width=...)
--   Recheck Cond: (category = 'electronics')
--   -> Bitmap Index Scan on idx_products_category (cost=0.00..182.00 rows=10000)

-- 4. BITMAP SCAN with multiple index combination
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE category = 'electronics' AND in_stock = true;
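-- Output (illustrative; with ~70% of rows in stock, the planner may skip
-- idx_products_stock and apply in_stock as a plain filter instead):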
-- Bitmap Heap Scan on products (cost=289.50..1561.75 rows=7000 width=...)
--   Recheck Cond: (in_stock = true AND category = 'electronics')
--   -> BitmapAnd (cost=289.50..289.50 rows=7000 width=0)
--       -> Bitmap Index Scan on idx_products_stock (cost=0.00..95.00 rows=70000)
--       -> Bitmap Index Scan on idx_products_category (cost=0.00..182.00 rows=10000)

-- 5. Force different scan types for comparison
SET enable_seqscan = off;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE) SELECT * FROM products WHERE category = 'electronics';
-- Steers the planner toward an Index Scan (enable_* = off penalizes a plan type, it does not forbid it)

SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_bitmapscan = on;
EXPLAIN (ANALYZE) SELECT * FROM products WHERE category = 'electronics';
-- Steers the planner toward a Bitmap Scan

-- Reset
RESET enable_seqscan;
RESET enable_bitmapscan;
RESET enable_indexscan;

-- View actual buffer hits for scan types
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT * FROM products WHERE category = 'books';
-- Check "Buffers: shared hit=X read=Y" to see I/O patterns

Scan Type Comparison:

graph TD
    A[Query: WHERE condition] --> B{Planner Estimates Rows}

    B -->|Very Few Rows <br/>~0.1-1%| C[Index Scan]
    B -->|Moderate Rows<br/>~1-10%| D[Bitmap Scan]
    B -->|Many Rows<br/>>10%| E[Sequential Scan]

    C --> C1[1. Traverse B-tree Index]
    C1 --> C2[2. For each match:<br/>Random read heap tuple]
    C2 --> C3[Advantages:<br/>- Preserves order<br/>- Efficient for few rows]
    C3 --> C4[Disadvantages:<br/>- Many random I/Os<br/>- Expensive for many rows]

    D --> D1[Phase 1:<br/>Bitmap Index Scan]
    D1 --> D2[Build bitmap of<br/>matching page locations]
    D2 --> D3[Phase 2:<br/>Bitmap Heap Scan]
    D3 --> D4[Read pages in<br/>physical order]
    D4 --> D5[Advantages:<br/>- Combines multiple indexes<br/>- Sequential page reads]
    D5 --> D6[Disadvantages:<br/>- Memory overhead<br/>- No order preservation]

    E --> E1[Read all pages<br/>sequentially]
    E1 --> E2[Check condition<br/>for each row]
    E2 --> E3[Advantages:<br/>- Simple & fast I/O<br/>- No index overhead]
    E3 --> E4[Disadvantages:<br/>- Reads entire table<br/>- Slow for large tables]

    style C fill:#ffe1e1
    style D fill:#e1f5ff
    style E fill:#e1ffe1

What is the difference between EXPLAIN and EXPLAIN ANALYZE?

The 30-Second Answer: EXPLAIN shows the query planner's estimated execution plan without running the query, while EXPLAIN ANALYZE actually executes the query and shows both the planned estimates and actual runtime statistics (real row counts, execution time, buffer hits). EXPLAIN ANALYZE is more accurate but has side effects since it executes the query.

The 2-Minute Answer (If They Want More): EXPLAIN generates the query execution plan by running the planner/optimizer but stops before executing the query. It shows estimated costs, estimated row counts, and the planned operations based on table statistics. This is safe to run on production since it doesn't modify data or consume significant resources, but the estimates can be wrong if statistics are outdated or if the data distribution is unusual.

EXPLAIN ANALYZE goes further by actually executing the query and collecting real-time performance metrics. It shows the actual number of rows returned by each node, the actual execution time in milliseconds for each operation, and, when the BUFFERS option is added, buffer usage (cache hits vs disk reads). This allows you to compare estimated vs actual values to identify statistics problems or planner errors.
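Comparing estimates with actuals is mechanical enough to script. The sketch below is one possible approach, not an official tool: it assumes the default text output format, where each plan node carries a `(cost=... rows=... width=...)` group followed by an `(actual time=... rows=... loops=...)` group, and flags nodes whose row estimate is off by a chosen factor. The helper name, the threshold, and the sample plan text are all made up.

```python
import re

NUM = r"\d+(?:\.\d+)?"
# The estimate group followed by the actual group for one plan node.
NODE_RE = re.compile(
    r"\(cost=" + NUM + r"\.\." + NUM + r" rows=(?P<est>\d+) width=\d+\)\s*"
    r"\(actual time=" + NUM + r"\.\." + NUM
    + r" rows=(?P<act>" + NUM + r") loops=\d+\)"
)

def misestimates(plan_text, factor=10.0):
    """Return (estimated, actual, ratio) for nodes off by at least `factor`."""
    found = []
    for match in NODE_RE.finditer(plan_text):
        est = max(int(match["est"]), 1)        # clamp to avoid dividing by zero
        act = max(float(match["act"]), 1.0)
        ratio = max(est / act, act / est)
        if ratio >= factor:
            found.append((int(match["est"]), float(match["act"]), ratio))
    return found

# Hand-written sample in the shape of EXPLAIN ANALYZE text output.
sample_plan = """
Aggregate  (cost=1234.25..1234.26 rows=1 width=8) (actual time=45.100..45.101 rows=1 loops=1)
  ->  Seq Scan on orders  (cost=0.00..1234.00 rows=100 width=0) (actual time=0.020..44.000 rows=5000 loops=1)
        Filter: (status = 'pending')
"""
bad_nodes = misestimates(sample_plan)
print(bad_nodes)
```

Only the scan node is reported here, since its estimate of 100 rows is off by a factor of 50 while the Aggregate above it is exact. For anything beyond a quick check, `EXPLAIN (ANALYZE, FORMAT JSON)` is easier to process reliably than regex over text.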

The key difference is accuracy vs safety. EXPLAIN is fast, safe, and doesn't affect the database, but provides only estimates. EXPLAIN ANALYZE provides truth but has important side effects: it actually runs the query (including INSERT/UPDATE/DELETE), consumes resources, acquires locks, and can take significant time for slow queries.

To safely test data modification queries with EXPLAIN ANALYZE, wrap them in a transaction with ROLLBACK. You can also use EXPLAIN ANALYZE with additional options like BUFFERS (shows I/O statistics), TIMING (can be disabled to reduce overhead), and VERBOSE (shows additional details like output columns and filter expressions).

When optimizing queries, start with EXPLAIN to understand the plan structure and identify obvious issues. Use EXPLAIN ANALYZE when you need actual performance data or when estimated row counts seem suspicious. Always be cautious with EXPLAIN ANALYZE on production databases, especially for write queries or queries that might take a long time.

Code Example:

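-- These examples assume orders has a status column and an index
-- idx_orders_customer_id on customer_id; sample outputs are illustrative.

-- 1. EXPLAIN - Shows plan only, doesn't execute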
EXPLAIN
SELECT * FROM orders WHERE customer_id = 123;

-- Output (estimates only):
-- Index Scan using idx_orders_customer_id on orders
--   (cost=0.42..25.50 rows=15 width=128)
--   Index Cond: (customer_id = 123)

-- Safe for production, no side effects
-- Shows estimated costs and rows only


-- 2. EXPLAIN ANALYZE - Executes and shows actual stats
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

-- Output (with actual data):
-- Index Scan using idx_orders_customer_id on orders
--   (cost=0.42..25.50 rows=15 width=128)
--   (actual time=0.123..0.456 rows=18 loops=1)
--   Index Cond: (customer_id = 123)
-- Planning Time: 0.234 ms
-- Execution Time: 0.567 ms

-- Notice:
-- - Estimated rows=15 vs actual rows=18 (statistics slightly off)
-- - Actual time in milliseconds
-- - Planning vs Execution time breakdown


-- 3. Comparing estimates vs actuals
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders WHERE status = 'pending';

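-- Look for large discrepancies between estimated and actual rows on a plan node
-- (for COUNT(*), check the scan node beneath the Aggregate):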
-- (cost=0.00..1234.56 rows=100 width=8)      <- ESTIMATE
-- (actual time=45.123..45.124 rows=5000 loops=1)  <- ACTUAL
-- If estimated rows=100 but actual rows=5000, run ANALYZE!


-- 4. Safe way to test UPDATE/DELETE with EXPLAIN ANALYZE
BEGIN;
EXPLAIN ANALYZE
UPDATE orders SET status = 'cancelled' WHERE id = 999;
ROLLBACK;  -- Prevents actual changes

-- Output shows:
-- Update on orders (cost=0.42..8.44 rows=1 width=...)
--   (actual time=0.123..0.124 rows=0 loops=1)
-- -> Index Scan using orders_pkey on orders (...)
-- Planning Time: 0.1 ms
-- Execution Time: 0.5 ms


-- 5. EXPLAIN ANALYZE with additional options
EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE)
SELECT o.*, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > NOW() - INTERVAL '30 days';

-- ANALYZE = execute and show actual stats
-- BUFFERS = show buffer cache hits/misses
-- TIMING = show timing for each node (default on, can disable for less overhead)
-- VERBOSE = show output columns and detailed filter info

-- Sample output with BUFFERS:
-- Hash Join (cost=... rows=1000 width=...)
--   (actual time=12.345..56.789 rows=987 loops=1)
--   Buffers: shared hit=245 read=12 dirtied=3
--   ...

-- Buffers explanation:
-- hit=245   : blocks found in cache (fast)
-- read=12   : blocks read from disk (slow)
-- dirtied=3 : blocks modified


-- 6. Disable timing to reduce overhead
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM huge_table WHERE value > 1000;

-- Useful when:
-- - Query is very slow
-- - System clock calls are expensive
-- - You only care about row counts, not timing


-- 7. Compare different query approaches
-- Query A - using IN clause
EXPLAIN ANALYZE
SELECT * FROM products WHERE id IN (SELECT product_id FROM order_items);

-- Query B - using EXISTS
EXPLAIN ANALYZE
SELECT * FROM products p
WHERE EXISTS (SELECT 1 FROM order_items oi WHERE oi.product_id = p.id);

-- Query C - using JOIN
EXPLAIN ANALYZE
SELECT DISTINCT p.* FROM products p
JOIN order_items oi ON p.id = oi.product_id;

-- Compare "Execution Time" to see which is fastest


-- 8. Testing parallel query execution
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT COUNT(*) FROM huge_table WHERE value > 1000;

-- Look for:
-- Finalize Aggregate (actual time=...)
--   -> Gather (actual time=...)
--       Workers Planned: 4
--       Workers Launched: 4
--       -> Partial Aggregate (actual time=...)


-- 9. Identifying statistics problems
-- First, check if statistics are outdated
SELECT
    schemaname,
    relname,  -- pg_stat_user_tables names the table column relname, not tablename
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- If last_analyze is old, update statistics
ANALYZE orders;

-- Re-run EXPLAIN ANALYZE and compare


-- 10. Format output for better readability
-- JSON format
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM orders WHERE status = 'pending';

-- YAML format
EXPLAIN (ANALYZE, FORMAT YAML)
SELECT * FROM orders WHERE status = 'pending';

-- Use with visualization tools like:
-- - https://explain.depesz.com/
-- - https://explain.dalibo.com/
-- - pgAdmin built-in visualizer

Decision Flow:

graph TD
    A[Need to Analyze Query] --> B{Is it a production system?}

    B -->|Yes| C{Is query SELECT only?}
    B -->|No| D[Use EXPLAIN ANALYZE freely]

    C -->|Yes| E{Is query fast <1s?}
    C -->|No - INSERT/UPDATE/DELETE| F[Use BEGIN; EXPLAIN ANALYZE; ROLLBACK;]

    E -->|Yes| G[Safe to use EXPLAIN ANALYZE]
    E -->|No - Slow query| H{Can afford the runtime?}

    H -->|Yes| I[EXPLAIN ANALYZE with caution<br/>during low-traffic period]
    H -->|No| J[Use EXPLAIN only<br/>Check statistics quality]

    G --> K[Compare estimated vs actual]
    I --> K
    J --> L[Look for obvious plan issues]
    F --> K

    K --> M{Large discrepancy?}
    M -->|Yes| N[Run ANALYZE to update stats]
    M -->|No| O[Plan is accurate]

    N --> P[Re-run EXPLAIN ANALYZE]
    P --> O

    style G fill:#e1ffe1
    style F fill:#fff4e1
    style J fill:#ffe1e1
    style I fill:#ffe1e1

Key Differences Summary:

graph LR
    A[EXPLAIN] --> A1[No Query Execution]
    A --> A2[Shows Estimates Only]
    A --> A3[Zero Side Effects]
    A --> A4[Safe for Production]
    A --> A5[Fast: planning time only]

    B[EXPLAIN ANALYZE] --> B1[Executes Query Fully]
    B --> B2[Shows Estimates + Actuals]
    B --> B3[Has Side Effects]
    B --> B4[Use with Caution]
    B --> B5[Takes Full Query Time]

    style A fill:#e1ffe1
    style B fill:#ffe1e1

JSON and Full-Text Search

What is the difference between JSON and JSONB in PostgreSQL?

The 30-Second Answer: JSON stores data as plain text exactly as input, while JSONB stores data in a decomposed binary format. JSONB is faster for processing and supports indexing, but JSON preserves formatting and key order. For most use cases, JSONB is preferred due to its performance advantages.

The 2-Minute Answer (If They Want More): PostgreSQL offers two JSON data types with distinct characteristics. The JSON type stores an exact copy of the input text, preserving whitespace, key ordering, and duplicate keys. Every operation on JSON data requires reparsing the text, making it slower for processing but faster for initial insertion.

JSONB (JSON Binary) stores data in a decomposed binary format that removes whitespace, doesn't preserve key order, and keeps only the last value for duplicate keys. This binary format makes JSONB significantly faster for processing operations like searching, indexing, and extracting values because the data doesn't need to be reparsed each time.
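The normalization rules can be mimicked in Python to make them tangible. This is a toy model only: real JSONB is a binary format, whereas the sketch below returns text. The key ordering it applies (shorter keys first, then bytewise) is an assumption about how current PostgreSQL versions order object keys, and both function names are hypothetical.

```python
import json

def store_as_json(text):
    """JSON column: validate the input, then keep it character for character."""
    json.loads(text)                      # raises ValueError on invalid JSON
    return text

def store_as_jsonb(text):
    """JSONB-like normalization (toy): last duplicate wins, keys reordered."""
    def build(pairs):
        merged = {}
        for key, value in pairs:          # a later duplicate overwrites an earlier one
            merged[key] = value
        ordered = sorted(merged, key=lambda k: (len(k.encode()), k.encode()))
        return {key: merged[key] for key in ordered}
    # Re-serializing also discards the original whitespace.
    return json.dumps(json.loads(text, object_pairs_hook=build))

raw = '{"name":   "Alice", "age": 30, "name": "Bob"}'
as_json = store_as_json(raw)
as_jsonb = store_as_jsonb(raw)
print(as_json)
print(as_jsonb)
```

The JSON path hands back the input untouched, extra spaces and duplicate key included. The JSONB path yields a single `name` key holding the last value, with `age` sorted ahead of it.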

JSONB supports indexing (primarily GIN indexes, plus B-tree and hash indexes for whole-value equality or on extracted expressions) and provides operators for efficient querying, making it ideal for applications that frequently query or manipulate JSON data. The tradeoff is slightly slower insertion due to the conversion overhead and marginally higher storage space due to the binary format.

In practice, JSONB is the recommended choice for almost all scenarios unless you specifically need to preserve exact formatting or key order (such as when storing configuration files where order matters semantically).

Code Example:

-- Create table with both types
CREATE TABLE json_comparison (
    id SERIAL PRIMARY KEY,
    data_json JSON,
    data_jsonb JSONB
);

-- Insert identical data
INSERT INTO json_comparison (data_json, data_jsonb) VALUES
('{"name": "Alice", "age": 30, "name": "Bob"}',
 '{"name": "Alice", "age": 30, "name": "Bob"}');

-- JSON preserves duplicate keys and order
SELECT data_json FROM json_comparison;
-- Result: {"name": "Alice", "age": 30, "name": "Bob"}

-- JSONB keeps only last duplicate and may reorder keys
SELECT data_jsonb FROM json_comparison;
-- Result: {"age": 30, "name": "Bob"}

-- JSONB supports containment operators such as @> (not available for the JSON type)
SELECT * FROM json_comparison
WHERE data_jsonb @> '{"age": 30}';

-- JSONB supports indexing
CREATE INDEX idx_jsonb_data ON json_comparison USING GIN (data_jsonb);

-- Performance comparison: JSONB is faster for extraction
SELECT data_json->>'name' FROM json_comparison;  -- Slower (requires parsing)
SELECT data_jsonb->>'name' FROM json_comparison; -- Faster (binary access)

Storage Format Diagram:

graph TD
    A[Input: '{"name": "Alice", "age": 30}'] --> B{Data Type?}
    B -->|JSON| C[Store as Text<br/>Exact copy<br/>Preserves whitespace & order]
    B -->|JSONB| D[Convert to Binary<br/>Remove whitespace<br/>Decompose structure]

    E[Query Request] --> F{Data Type?}
    F -->|JSON| G[Parse Text<br/>Slow]
    F -->|JSONB| H[Binary Access<br/>Fast]

    style C fill:#ffcccc
    style D fill:#ccffcc
    style G fill:#ffcccc
    style H fill:#ccffcc

What is a tsvector and tsquery?

The 30-Second Answer: tsvector is a sorted list of normalized, unique lexemes (word stems) with positional information, representing a searchable document. tsquery is a search expression containing lexemes and Boolean operators (AND, OR, NOT) representing search criteria. They work together using the @@ operator to perform full-text searches in PostgreSQL.

The 2-Minute Answer (If They Want More): tsvector and tsquery are the fundamental data types powering PostgreSQL's full-text search system. Understanding their structure and relationship is key to effective text searching.

A tsvector is a document representation optimized for searching. When text is converted to tsvector, PostgreSQL tokenizes it into words, normalizes each word to its base form (lexeme) using linguistic rules, removes duplicates and stop words, and stores the result as a sorted list. Each lexeme includes positional information showing where it appeared in the original text and optionally weight labels (A, B, C, D) indicating importance. For example, the text "The cats are running" becomes a tsvector like 'cat':2 'run':4, where the numbers are word positions and the stop words "the" and "are" have been removed (they still count toward positions, which is why 'cat' sits at 2).
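The conversion steps can be imitated with a short Python sketch. It is a toy, not PostgreSQL's parser: the stop-word set and the stem table are tiny hypothetical stand-ins for a real text search configuration, and the function names are made up. What it does reproduce is the shape of the result: sorted unique lexemes, each with the positions where it occurred.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for a real text search configuration.
STOP_WORDS = {"the", "a", "is", "are", "and", "over"}
STEMS = {"cats": "cat", "running": "run", "jumps": "jump", "dogs": "dog"}

def toy_to_tsvector(text):
    """Return {lexeme: [positions]} with lexemes in sorted order."""
    positions = defaultdict(list)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for pos, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue                      # dropped, but it still used up a position
        positions[STEMS.get(token, token)].append(pos)
    return dict(sorted(positions.items()))

def render(vector):
    """Format the vector the way a tsvector is displayed."""
    return " ".join(
        f"'{lexeme}':{','.join(map(str, pos))}" for lexeme, pos in vector.items()
    )

short_doc = render(toy_to_tsvector("The cats are running"))
repeated = render(toy_to_tsvector("running cats and running dogs"))
print(short_doc)
print(repeated)
```

Repeated words collapse into one lexeme with several positions, so the second document lists `run` once with positions 1 and 4.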

A tsquery represents a search condition using normalized lexemes combined with Boolean operators. It supports AND (&), OR (|), NOT (!), phrase matching (<->), and prefix matching (:*). When you create a tsquery, PostgreSQL applies the same normalization rules as tsvector, ensuring "running" in your query matches "run" in the document. The query structure forms a tree that can efficiently match against tsvector data.

The @@ operator performs the actual matching, checking if a tsvector satisfies a tsquery condition. PostgreSQL evaluates the Boolean logic tree against the lexemes and positions in the tsvector, returning true or false. When combined with GIN indexes, this operation becomes extremely fast even across millions of documents.
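How a query tree is evaluated against a vector can be sketched as a small recursive function. This is a simplified model with made-up names: queries are nested tuples rather than a parsed tsquery, lexemes are assumed to be normalized already, and the phrase operator covers only the adjacent-word case (`<->`). The document vector uses the lexemes and positions that the 'english' configuration produces for "The quick brown fox jumps over the lazy dog".

```python
def matches(vector, query):
    """Evaluate a toy tsquery tree against a {lexeme: [positions]} vector."""
    op = query[0]
    if op == "lex":                       # single lexeme
        return query[1] in vector
    if op == "and":                       # a & b
        return matches(vector, query[1]) and matches(vector, query[2])
    if op == "or":                        # a | b
        return matches(vector, query[1]) or matches(vector, query[2])
    if op == "not":                       # !a
        return not matches(vector, query[1])
    if op == "phrase":                    # a <-> b: b directly follows a
        first = vector.get(query[1], [])
        second = set(vector.get(query[2], []))
        return any(pos + 1 in second for pos in first)
    raise ValueError(f"unknown operator: {op}")

doc = {"brown": [3], "dog": [9], "fox": [4], "jump": [5], "lazi": [8], "quick": [2]}

both_present = matches(doc, ("and", ("lex", "fox"), ("lex", "dog")))
one_missing = matches(doc, ("and", ("lex", "fox"), ("lex", "cat")))
adjacent = matches(doc, ("phrase", "quick", "brown"))
wrong_order = matches(doc, ("phrase", "brown", "quick"))
with_negation = matches(doc, ("and", ("lex", "fox"), ("not", ("lex", "cat"))))
print(both_present, one_missing, adjacent, wrong_order, with_negation)
```

Boolean operators only need to know whether a lexeme is present; the phrase operator is the one that relies on the stored positions, which is why a tsvector stripped of positions cannot answer phrase queries.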

These types also support composition: tsvectors can be concatenated with || to combine documents, and tsqueries can be combined with Boolean operators to create complex search conditions. This flexibility makes them powerful building blocks for sophisticated search systems.

Code Example:

-- 1. Creating and examining tsvector
-- Simple conversion
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
-- Result: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
-- Notice: "the", "over" removed (stop words), "jumps"→"jump", "lazy"→"lazi"

-- See individual components
SELECT lexeme, positions
FROM unnest(to_tsvector('english', 'running cats and running dogs'));
-- Result: one row per unique lexeme: cat {2}, dog {5}, run {1,4}

-- Concatenating tsvectors
SELECT
    to_tsvector('english', 'PostgreSQL database') ||
    to_tsvector('english', 'relational storage');
-- Result: 'databas':2 'postgresql':1 'relat':3 'storag':4

-- 2. Weighted tsvector (for relevance ranking)
SELECT
    setweight(to_tsvector('english', 'PostgreSQL'), 'A') ||  -- Title (highest weight)
    setweight(to_tsvector('english', 'Database system'), 'B'); -- Content
-- Result: 'databas':2B 'postgresql':1A 'system':3B
-- Letters A-D indicate weight (importance)

-- 3. Creating tsquery - various methods
-- Standard tsquery (manual operators)
SELECT to_tsquery('english', 'PostgreSQL & (database | storage)');
-- Result: 'postgresql' & ( 'databas' | 'storag' )

-- Plain text to tsquery (converts spaces to AND)
SELECT plainto_tsquery('english', 'PostgreSQL full text search');
-- Result: 'postgresql' & 'full' & 'text' & 'search'

-- Phrase to tsquery (preserves word order)
SELECT phraseto_tsquery('english', 'full text search');
-- Result: 'full' <-> 'text' <-> 'search'

-- Websearch syntax (Google-like)
SELECT websearch_to_tsquery('english', '"full text" search -database');
-- Result: 'full' <-> 'text' & 'search' & !'databas'

-- 4. Matching with @@ operator
SELECT
    to_tsvector('english', 'PostgreSQL is a powerful database system') @@
    to_tsquery('english', 'PostgreSQL & database');
-- Result: true (both terms found)

SELECT
    to_tsvector('english', 'PostgreSQL is great') @@
    to_tsquery('english', 'PostgreSQL & MySQL');
-- Result: false (MySQL not found)

-- 5. Practical example with table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    search_vector tsvector
);

INSERT INTO documents (title, body) VALUES
('PostgreSQL Tutorial', 'Learn PostgreSQL database fundamentals and advanced features'),
('MySQL Guide', 'Comprehensive guide to MySQL database administration'),
('Database Comparison', 'Comparing PostgreSQL, MySQL, and MongoDB databases');

-- Update search_vector with weighted content
UPDATE documents
SET search_vector =
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', body), 'B');

-- Search using @@
SELECT title, body
FROM documents
WHERE search_vector @@ to_tsquery('english', 'PostgreSQL & database');

-- 6. Understanding positions in tsvector
SELECT to_tsvector('english', 'First sentence. Second sentence with more words.');
-- Positions track word locations: 'first':1 'second':3 'sentenc':2,4 'word':7
-- Notice 'sentence' appears at positions 2 and 4

-- 7. Phrase matching using positions
SELECT
    to_tsvector('english', 'full text search capabilities') @@
    phraseto_tsquery('english', 'full text search');
-- Result: true (words are adjacent in correct order)

SELECT
    to_tsvector('english', 'full database text search') @@
    phraseto_tsquery('english', 'full text search');
-- Result: false (words not adjacent - "database" is between)

-- 8. Prefix matching with :*
SELECT
    to_tsvector('english', 'PostgreSQL database') @@
    to_tsquery('english', 'post:*');
-- Result: true (matches "postgresql")

-- 9. Complex query composition
SELECT to_tsquery('english',
    '(PostgreSQL | MySQL) & database & !(MongoDB | NoSQL)'
);
-- Finds documents with (PostgreSQL OR MySQL) AND database, excluding any that mention MongoDB or NoSQL

-- 10. Debugging and introspection
-- See how text is broken down
SELECT * FROM ts_debug('english', 'The PostgreSQL database system');
-- Returns detailed tokenization: token type, dictionary, lexemes

-- Lexeme statistics across all documents (ts_stat is called once, in FROM)
SELECT word AS lexeme,
       ndoc AS document_count
FROM ts_stat('SELECT search_vector FROM documents')
ORDER BY document_count DESC
LIMIT 10;

-- 11. Position-based proximity search
-- Adjacent words (<-> is shorthand for <1>: positions differ by exactly 1)
SELECT to_tsquery('english', 'database <-> system');
-- Matches "database system" (words next to each other)

-- Positions differ by exactly 2 (one word in between)
SELECT to_tsquery('english', 'database <2> system');
-- Matches "database management system"

-- 12. Working with tsvector directly (without conversion)
-- Sometimes useful for pre-processed data
SELECT 'cat:1 dog:2 bird:3'::tsvector @@ 'cat'::tsquery;
-- Result: true

-- Combine pre-built tsvectors
SELECT 'cat:1 dog:2'::tsvector || 'bird:3 fish:4'::tsvector;
-- Result: 'bird':5 'cat':1 'dog':2 'fish':6
-- Positions from the second vector are offset by the largest position in the first (2)

tsvector Structure:

graph TD
    A[Input Text:<br/>'The cats are running quickly'] --> B[Tokenize]
    B --> C[Normalize/Stem]
    C --> D[Remove Stop Words]
    D --> E[tsvector:<br/>'cat':2 'quick':5 'run':4]

    E --> F[Lexeme: 'cat']
    E --> G[Lexeme: 'run']
    E --> H[Lexeme: 'quick']

    F --> F1[Position: 2<br/>Weight: none]
    G --> G1[Position: 4<br/>Weight: none]
    H --> H1[Position: 5<br/>Weight: none]

    style E fill:#ccffcc
    style F fill:#cce5ff
    style G fill:#cce5ff
    style H fill:#cce5ff

tsquery Structure:

graph TD
    A[Query: 'PostgreSQL & database | search'] --> B[Parse & Normalize]
    B --> C[tsquery Tree]

    C --> D[OR |]
    D --> E[AND &]
    D --> F[search]

    E --> G[PostgreSQL]
    E --> H[database]

    I[Text: 'postgresql' & 'databas' | 'search'] -.->|Normalized| C

    style C fill:#ffffcc
    style D fill:#ffcccc
    style E fill:#ffcccc
    style F fill:#cce5ff
    style G fill:#cce5ff
    style H fill:#cce5ff

Matching Process:

sequenceDiagram
    participant Doc as Document Text
    participant TSV as tsvector
    participant TSQ as tsquery
    participant Match as @@ Operator
    participant Result as Result

    Doc->>TSV: to_tsvector('english', text)
    Note over TSV: Normalize, stem, position

    Note over TSQ: to_tsquery('english', query)
    TSQ->>Match: Query tree
    TSV->>Match: Lexeme list

    Match->>Match: Evaluate Boolean logic
    Match->>Match: Check positions
    Match->>Match: Apply operators

    Match->>Result: true/false

    Note over Result: Can calculate rank<br/>with ts_rank()

Advanced Operations:

-- Strip weights from tsvector
SELECT strip(setweight(to_tsvector('english', 'PostgreSQL'), 'A'));

-- Get length (number of lexemes)
SELECT length(to_tsvector('english', 'The quick brown fox'));
-- Result: 3 (after removing stop words)

-- Convert tsvector back to array of lexemes
SELECT array_agg(lexeme)
FROM unnest(to_tsvector('english', 'PostgreSQL database system')) AS t;

-- Negation in tsquery
SELECT to_tsquery('english', '!database');
-- Matches documents that DON'T contain "database"

-- Query rewriting (for synonyms, spelling)
-- ts_rewrite(query, target, substitute)
SELECT ts_rewrite(
    to_tsquery('english', 'PostgreSQL'),
    to_tsquery('english', 'PostgreSQL'),
    to_tsquery('english', 'PostgreSQL | Postgres | PG')
);

Performance Considerations:

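-- Pre-compute tsvector for better performance (generated columns need PostgreSQL 12+)
-- The earlier example created a plain search_vector column, so drop it first
ALTER TABLE documents DROP COLUMN IF EXISTS search_vector;
ALTER TABLE documents
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;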

-- Index for fast searching
CREATE INDEX idx_documents_search
ON documents USING GIN (search_vector);

-- Now searches are fast
EXPLAIN ANALYZE
SELECT title
FROM documents
WHERE search_vector @@ to_tsquery('english', 'PostgreSQL & database');


VACUUM and Maintenance

What is the difference between VACUUM and VACUUM FULL?

The 30-Second Answer: VACUUM reclaims space within a table by marking dead tuples as reusable but generally doesn't shrink the table file, while VACUUM FULL completely rewrites the table to remove all dead space and returns disk space to the operating system. VACUUM FULL requires an exclusive lock and is much slower, whereas regular VACUUM runs concurrently with normal operations.

The 2-Minute Answer (If They Want More): Regular VACUUM is a lightweight maintenance operation that runs concurrently with read and write operations. It scans the table, marks dead tuples as available for reuse, and updates internal maps, but the table file size usually remains the same (only completely empty pages at the end of the table can be truncated). The reclaimed space is available for future inserts and updates within that table, but the disk space isn't returned to the operating system. This makes VACUUM fast and non-blocking, suitable for frequent execution.

VACUUM FULL, on the other hand, is a heavy operation that completely rewrites the entire table and its indexes into new files, removing all dead space. It requires an ACCESS EXCLUSIVE lock on the table, blocking all reads and writes during execution. After completion, it returns the freed disk space to the operating system, resulting in a smaller table file. This operation can take hours on large tables and requires additional disk space equal to the size of the table and its indexes.

You should use regular VACUUM for routine maintenance (usually handled by autovacuum) and only use VACUUM FULL in exceptional circumstances when severe bloat has occurred and you have a maintenance window available. In most cases, pg_repack or similar tools are preferred over VACUUM FULL because they allow concurrent access.

The key trade-off is availability versus space reclamation. Regular VACUUM maintains performance with minimal disruption, while VACUUM FULL provides maximum space reclamation at the cost of downtime. Modern best practices emphasize preventing bloat through proper autovacuum tuning rather than relying on VACUUM FULL.
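The difference between "mark reusable" and "rewrite" can be illustrated with a toy model of a table as a list of fixed-size pages. This is a deliberately simplified sketch: the slot states, the function names, and the four-slot page size are assumptions for illustration, and the model ignores indexes, the free space map, and truncation of empty trailing pages.

```python
LIVE, DEAD, FREE = "live", "dead", "free"


def vacuum(pages):
    """Plain VACUUM: dead slots become free for reuse; the page count is unchanged."""
    return [[FREE if slot == DEAD else slot for slot in page] for page in pages]


def vacuum_full(pages, slots_per_page):
    """VACUUM FULL: copy only live tuples into a new, densely packed set of pages."""
    live = [slot for page in pages for slot in page if slot == LIVE]
    new_pages = [live[i:i + slots_per_page] for i in range(0, len(live), slots_per_page)]
    if new_pages and len(new_pages[-1]) < slots_per_page:
        # Pad the last page so every page has the same number of slots.
        new_pages[-1] += [FREE] * (slots_per_page - len(new_pages[-1]))
    return new_pages


table = [
    [LIVE, DEAD, DEAD, LIVE],
    [DEAD, DEAD, LIVE, DEAD],
    [LIVE, DEAD, DEAD, DEAD],
]

after_vacuum = vacuum(table)
after_full = vacuum_full(table, slots_per_page=4)
print(len(after_vacuum), len(after_full))
# 3 1
```

In the model, plain VACUUM leaves three pages with free slots scattered through them, while the full rewrite ends up with a single page; the old pages are what gets handed back to the operating system.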

Code Example:

-- Regular VACUUM - runs concurrently, no exclusive lock
VACUUM my_table;

-- Check table size before VACUUM FULL
SELECT
    pg_size_pretty(pg_total_relation_size('my_table')) as total_size,
    pg_size_pretty(pg_relation_size('my_table')) as table_size,
    pg_size_pretty(pg_indexes_size('my_table')) as indexes_size;

-- VACUUM FULL - requires exclusive lock, returns space to OS
VACUUM FULL my_table;

-- Check table size after VACUUM FULL
SELECT
    pg_size_pretty(pg_total_relation_size('my_table')) as total_size,
    pg_size_pretty(pg_relation_size('my_table')) as table_size,
    pg_size_pretty(pg_indexes_size('my_table')) as indexes_size;

-- Alternative: Use pg_repack (requires extension)
-- This rewrites the table while allowing reads/writes (it holds only brief locks at the start and end)
CREATE EXTENSION pg_repack;
-- Then run from command line: pg_repack -t my_table dbname

-- Monitor VACUUM progress (PostgreSQL 9.6+; VACUUM FULL is reported in pg_stat_progress_cluster, 12+)
SELECT
    p.pid,
    now() - a.xact_start AS duration,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS percent_complete
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON p.pid = a.pid;

Comparison Table:

| Feature | VACUUM | VACUUM FULL |
|---|---|---|
| Locking | SHARE UPDATE EXCLUSIVE (does not block reads/writes) | ACCESS EXCLUSIVE lock |
| Concurrent Access | Allows reads/writes | Blocks all access |
| Disk Space | Marks space reusable | Returns space to OS |
| Table File Size | Usually remains the same | Shrinks to minimum |
| Speed | Fast (minutes) | Slow (hours for large tables) |
| Extra Disk Needed | None | Up to a full extra copy of the table and its indexes, temporarily |
| Use Case | Routine maintenance | Emergency bloat removal |

Visual Explanation:

graph LR
    subgraph "VACUUM (Standard)"
        A1[Table: 100GB] --> B1[VACUUM]
        B1 --> C1[Table: 100GB]
        C1 --> D1[Dead space marked reusable]
        D1 --> E1[No OS disk space returned]
    end

    subgraph "VACUUM FULL"
        A2[Table: 100GB, 40GB dead] --> B2[VACUUM FULL]
        B2 --> C2[Exclusive Lock]
        C2 --> D2[Complete Table Rewrite]
        D2 --> E2[Table: 60GB]
        E2 --> F2[40GB returned to OS]
    end

    style C2 fill:#ffcccc
    style E1 fill:#ffffcc
    style F2 fill:#ccffcc


Replication

What is the difference between synchronous and asynchronous replication?

The 30-Second Answer: Synchronous replication waits for at least one standby to confirm WAL records are written before committing a transaction on the primary, guaranteeing zero data loss but with higher latency. Asynchronous replication commits immediately on the primary without waiting for standbys, offering better performance but risking data loss if the primary fails before WAL is replicated.

The 2-Minute Answer (If They Want More): The fundamental difference lies in the transaction commit behavior and the durability guarantees provided. In asynchronous replication (the default), when a client commits a transaction, the primary server confirms the commit as soon as the WAL is written to its own disk, without waiting for any standby servers. WAL records are streamed to standbys, but this happens independently of the commit. This provides the best performance but means a primary failure could result in committed transactions that were never replicated.

In synchronous replication, the primary waits for acknowledgment from one or more standby servers before confirming the commit to the client. The level of synchronization can be configured with synchronous_commit: 'remote_write' waits until WAL is written to the standby's OS cache, 'on' waits until WAL is flushed to the standby's disk, and 'remote_apply' waits until changes are actually applied and visible on the standby. This guarantees that committed data exists on multiple servers but increases commit latency in proportion to network round-trip time.

The choice between synchronous and asynchronous replication involves a classic trade-off between consistency/durability and performance/availability. Many production systems use a hybrid approach: synchronous replication to one local standby for zero data loss, and asynchronous to remote disaster recovery sites to avoid geographic latency penalties. PostgreSQL allows fine-grained control through the synchronous_standby_names parameter, supporting configurations like "first N standbys" or "any N standbys."
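The difference between the "first N" and "any N" forms can be sketched as a small latency model in Python. This is an illustrative approximation, not PostgreSQL code: `commit_wait_ms`, its parameters, and the acknowledgment times are hypothetical, and the model ignores the local WAL flush and standbys that connect or disconnect mid-commit.

```python
def commit_wait_ms(mode, num_sync, standby_names, ack_ms):
    """Time the primary waits for standby confirmation before acknowledging COMMIT.

    mode          -- "FIRST" (priority-based) or "ANY" (quorum-based)
    num_sync      -- how many standbys must confirm
    standby_names -- names in the order listed in synchronous_standby_names
    ack_ms        -- {name: acknowledgment latency in ms} for connected standbys
    """
    connected = [name for name in standby_names if name in ack_ms]
    if len(connected) < num_sync:
        raise RuntimeError("not enough connected standbys: the commit would block")
    if mode == "FIRST":
        # Wait for all of the num_sync highest-priority connected standbys.
        return max(ack_ms[name] for name in connected[:num_sync])
    if mode == "ANY":
        # Wait until any num_sync standbys have confirmed.
        return sorted(ack_ms[name] for name in connected)[num_sync - 1]
    raise ValueError(f"unknown mode: {mode}")


names = ["standby1", "standby2", "standby3"]
acks = {"standby1": 40, "standby2": 5, "standby3": 12}  # made-up latencies

print(commit_wait_ms("FIRST", 1, names, acks))  # 40
print(commit_wait_ms("ANY", 2, names, acks))    # 12
```

In this model the priority form always waits on the standbys listed first, even if they are the slowest, whereas the quorum form is bounded by the N-th fastest acknowledgment.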

Comparison Diagram:

sequenceDiagram
    participant Client
    participant Primary
    participant Standby

    Note over Client,Standby: Asynchronous Replication
    Client->>Primary: COMMIT
    Primary->>Primary: Write WAL to disk
    Primary-->>Client: OK (Immediate)
    Primary->>Standby: Stream WAL (async)
    Standby->>Standby: Apply WAL

    Note over Client,Standby: Synchronous Replication
    Client->>Primary: COMMIT
    Primary->>Primary: Write WAL to disk
    Primary->>Standby: Stream WAL
    Standby->>Standby: Write/Apply WAL
    Standby-->>Primary: ACK
    Primary-->>Client: OK (After ACK)

Configuration Example:

# Asynchronous Replication (Default)
# No special configuration needed - just set up streaming replication
# postgresql.conf on Primary:
synchronous_standby_names = ''         # Empty (the default): commits never wait for a standby

# Synchronous Replication
# postgresql.conf on Primary (choose ONE synchronous_commit level):
synchronous_commit = on                # Wait for WAL to be flushed to disk on standby
# OR
# synchronous_commit = remote_apply    # Wait for WAL to be applied on standby (highest consistency)
# OR
# synchronous_commit = remote_write    # Wait for WAL to be written to OS cache on standby

# Specify which standbys to synchronize with (choose ONE form)
synchronous_standby_names = 'standby1,standby2'  # First available standby from this list

# Advanced (PostgreSQL 10+): priority-based (FIRST) or quorum-based (ANY)
# synchronous_standby_names = 'FIRST 1 (standby1, standby2, standby3)'  # Priority: wait for the highest-priority connected standby
# synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'    # Quorum: wait for any 2 standbys

# On Standby: Set application_name to match synchronous_standby_names
# postgresql.conf (PostgreSQL 12+) or recovery.conf (11 and earlier):
primary_conninfo = 'host=primary port=5432 user=replicator password=pass application_name=standby1'

Trade-offs Comparison:

-- Demonstrating transaction behavior differences

-- Asynchronous Replication
BEGIN;
INSERT INTO orders (id, customer, amount) VALUES (1001, 'Alice', 500.00);
COMMIT;  -- Returns immediately after WAL written on primary
         -- Risk: If primary crashes now, standby might not have this transaction

-- Synchronous Replication
BEGIN;
INSERT INTO orders (id, customer, amount) VALUES (1002, 'Bob', 750.00);
COMMIT;  -- Waits for standby acknowledgment (adds network latency)
         -- Guarantee: Transaction exists on primary AND at least one standby

-- Check current synchronous commit mode
SHOW synchronous_commit;

-- Temporarily change for a specific transaction
BEGIN;
SET LOCAL synchronous_commit = off;  -- This transaction only: don't wait for the WAL flush, locally or on standbys
INSERT INTO logs (message) VALUES ('Non-critical log entry');
COMMIT;

-- Monitoring synchronous replication
SELECT
    application_name,
    client_addr,
    state,
    sync_state,  -- 'sync' = synchronous, 'async' = asynchronous, 'potential' = could become sync
    sync_priority,
    pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
ORDER BY sync_priority;

Performance Impact Analysis:

-- Measure commit latency difference
-- Create test table
CREATE TABLE replication_test (
    id SERIAL PRIMARY KEY,
    data TEXT,
    created_at TIMESTAMP DEFAULT now()
);

-- Baseline: wait only for the local WAL flush, not for any standby
SET synchronous_commit = local;
\timing on
INSERT INTO replication_test (data) VALUES ('local-only test');  -- autocommit, no explicit COMMIT needed
-- Illustrative: around 1-5ms

-- Synchronous: also wait for the standby to flush the WAL
-- (requires synchronous_standby_names to be set on the primary)
SET synchronous_commit = on;
INSERT INTO replication_test (data) VALUES ('sync test');
-- Illustrative: around 5-50ms (depends on network latency to standby)


High Availability and Backup

What is the difference between logical and physical backups?

The 30-Second Answer: Physical backups copy the entire data directory as binary files (using pg_basebackup or filesystem snapshots), enabling fast full-cluster recovery but requiring identical PostgreSQL versions. Logical backups export database objects as SQL statements (using pg_dump), allowing selective restore, cross-version migration, and portability, but are slower for large databases.

The 2-Minute Answer (If They Want More): Physical backups are block-level copies of PostgreSQL's data files, including all tablespaces, WAL files, and configuration files. Tools like pg_basebackup create consistent snapshots while the database is running. These backups are fast to create and restore because they copy files directly without parsing data. However, they're platform-specific and must be restored to the same PostgreSQL major version and architecture. Physical backups are ideal for disaster recovery, point-in-time recovery (PITR), and setting up replication standbys.

Logical backups, created with pg_dump or pg_dumpall, export database contents as SQL statements (CREATE TABLE, COPY or INSERT, etc.) or in archive formats (custom, directory, tar) that pg_restore can read. They're portable across PostgreSQL versions, operating systems, and architectures. You can selectively restore specific schemas or tables. This makes them well suited to database migrations, version upgrades, and partial restores. However, for multi-terabyte databases, logical dumps can take hours, and while each pg_dump run is a consistent snapshot of one database, pg_dumpall does not capture a single consistent point in time across databases.

Modern backup strategies often combine both approaches: continuous archiving (physical) for PITR and disaster recovery, plus periodic logical dumps for version upgrades and selective restores. Cloud services and tools like pgBackRest, Barman, or WAL-G automate this hybrid approach.

Comparison Table:

| Aspect | Physical Backup | Logical Backup |
|---|---|---|
| Tools | pg_basebackup, filesystem snapshots, pgBackRest | pg_dump, pg_dumpall |
| Speed | Very fast (file copying) | Slower (data parsing) |
| Size | Larger (includes indexes, dead tuples) | Smaller (only live data) |
| Restore | Full cluster restore | Selective (schema/table level) |
| PITR | Supported (with WAL archiving) | Not supported |
| Version compatibility | Same major version only | Cross-version compatible |
| Platform | Platform-specific | Platform-independent |
| Use case | Disaster recovery, replication | Migrations, partial restore |

Code Examples:

# ============================================
# PHYSICAL BACKUP - pg_basebackup
# ============================================

# Create a physical backup
pg_basebackup -h localhost -U postgres -D /backup/base -Ft -z -P

# Options:
#   -D: destination directory
#   -Ft: tar format
#   -z: gzip compression
#   -P: show progress

# Include the WAL needed to make the backup consistent (PITR additionally requires WAL archiving)
pg_basebackup -h localhost -U postgres \
  -D /backup/base \
  -Ft -z -P \
  -X stream  # Stream WAL files during backup

# Restore from physical backup
# 1. Stop PostgreSQL
systemctl stop postgresql

# 2. Clear data directory
rm -rf /var/lib/postgresql/14/main/*

# 3. Extract backup
tar -xzf /backup/base/base.tar.gz -C /var/lib/postgresql/14/main/
tar -xzf /backup/base/pg_wal.tar.gz -C /var/lib/postgresql/14/main/pg_wal/

# 4. Set permissions
chown -R postgres:postgres /var/lib/postgresql/14/main/

# 5. Start PostgreSQL
systemctl start postgresql
# ============================================
# LOGICAL BACKUP - pg_dump
# ============================================

# Dump entire database
pg_dump -h localhost -U postgres -d mydb -F c -f mydb.dump

# Options:
#   -F c: custom format (compressed, allows parallel restore)
#   -F p: plain SQL format
#   -F t: tar format
#   -F d: directory format (parallel dump)

# Dump specific tables
pg_dump -h localhost -U postgres -d mydb \
  -t users -t orders \
  -F c -f partial.dump

# Dump specific schema
pg_dump -h localhost -U postgres -d mydb \
  -n public \
  -F c -f public_schema.dump

# Dump only schema (no data)
pg_dump -h localhost -U postgres -d mydb \
  --schema-only \
  -F c -f schema.dump

# Dump only data (no schema)
pg_dump -h localhost -U postgres -d mydb \
  --data-only \
  -F c -f data.dump

# Parallel dump for large databases
pg_dump -h localhost -U postgres -d mydb \
  -F d -j 4 \
  -f /backup/mydb_dir

# Dump all databases
pg_dumpall -h localhost -U postgres -f all_databases.sql

# Dump only global objects (roles, tablespaces)
pg_dumpall -h localhost -U postgres --globals-only -f globals.sql
# ============================================
# RESTORE LOGICAL BACKUP
# ============================================

# Restore from custom format
pg_restore -h localhost -U postgres -d mydb -v mydb.dump

# Parallel restore
pg_restore -h localhost -U postgres -d mydb \
  -j 4 -v /backup/mydb_dir

# Restore specific table
pg_restore -h localhost -U postgres -d mydb \
  -t users \
  mydb.dump

# Restore with clean (drop existing objects first)
pg_restore -h localhost -U postgres -d mydb \
  --clean \
  mydb.dump

# Restore with create database
pg_restore -h localhost -U postgres \
  --create -d postgres \
  mydb.dump

# Restore from plain SQL format
psql -h localhost -U postgres -d mydb -f backup.sql

# Restore all databases
psql -h localhost -U postgres -f all_databases.sql
-- ============================================
-- VERIFY BACKUPS
-- ============================================

-- Check database size before backup
SELECT
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Monitor long-running pg_dump (by default pg_dump sets application_name to 'pg_dump')
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    query_start,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE application_name = 'pg_dump';

-- After restore, refresh statistics, then check approximate row counts
ANALYZE;
SELECT
    schemaname,
    relname AS tablename,
    n_live_tup AS row_count  -- estimate, not an exact count
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
# ============================================
# AUTOMATED BACKUP SCRIPT
# ============================================

#!/bin/bash
# backup_postgres.sh - Combined physical and logical backup
set -euo pipefail  # Stop at the first failed command

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backup"
RETENTION_DAYS=7

# Physical backup with WAL
pg_basebackup -h localhost -U postgres \
  -D "${BACKUP_DIR}/physical_${DATE}" \
  -Ft -z -P -X stream

# Logical backup
pg_dumpall -h localhost -U postgres \
  -f "${BACKUP_DIR}/logical_${DATE}.sql"

# Compress logical backup
gzip "${BACKUP_DIR}/logical_${DATE}.sql"

# Remove old backups (only this script's own top-level files and directories)
find "${BACKUP_DIR}" -mindepth 1 -maxdepth 1 -type f -name 'logical_*.sql.gz' \
  -mtime +"${RETENTION_DAYS}" -delete
find "${BACKUP_DIR}" -mindepth 1 -maxdepth 1 -type d -name 'physical_*' \
  -mtime +"${RETENTION_DAYS}" -exec rm -rf {} +

echo "Backup completed: ${DATE}"
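The retention step of the script above can also be expressed as a small, testable function. The sketch below is one possible approach, not part of the original script: it decides expiry from the timestamp embedded in the `physical_<DATE>` and `logical_<DATE>.sql.gz` names rather than from file modification time, and the function and pattern names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches only names produced by the backup script's naming scheme.
BACKUP_NAME = re.compile(r"^(physical|logical)_(\d{8}_\d{6})(\.sql\.gz)?$")


def expired_backups(names, now, retention_days=7):
    """Return the backup names whose embedded timestamp is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in names:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # never touch entries that don't look like our backups
        taken_at = datetime.strptime(match.group(2), "%Y%m%d_%H%M%S")
        if taken_at < cutoff:
            expired.append(name)
    return sorted(expired)


listing = [
    "physical_20240101_020000",
    "logical_20240101_020000.sql.gz",
    "physical_20240118_020000",
    "lost+found",
]
print(expired_backups(listing, now=datetime(2024, 1, 20, 3, 0, 0)))
# ['logical_20240101_020000.sql.gz', 'physical_20240101_020000']
```

Keeping the selection logic separate from the deletion makes it possible to dry-run the retention policy before anything is removed.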


What is PgBouncer and why is connection pooling important?

The 30-Second Answer: PgBouncer is a lightweight connection pooler for PostgreSQL that sits between applications and the database, reusing a small pool of database connections to serve many client connections. It prevents connection exhaustion, reduces memory overhead (each PostgreSQL connection uses ~10MB), and improves performance by eliminating connection setup latency for every request.

The 2-Minute Answer (If They Want More): PostgreSQL handles each connection as a separate backend process, which consumes memory and system resources. In modern microservices architectures where hundreds of application instances each open dozens of connections, databases can quickly hit connection limits (default max_connections is 100). Even with increased limits, thousands of idle connections waste gigabytes of RAM and degrade performance due to context switching.

PgBouncer solves this by maintaining a small pool of active database connections (e.g., 20-50) and multiplexing hundreds or thousands of client connections through them. When an application needs to query the database, PgBouncer assigns it an available connection from the pool, executes the query, and immediately returns the connection to the pool for reuse. This dramatically reduces database memory footprint and allows databases to handle far more concurrent applications.

PgBouncer offers three pooling modes: session pooling (connection held until client disconnects), transaction pooling (connection held only during transaction), and statement pooling (connection held only during query execution). Transaction pooling is most common as it balances safety with efficiency. However, it has limitations: session-level features like SQL-level prepared statements, LISTEN/NOTIFY, session-level SET, and temporary tables don't work reliably, because consecutive transactions from one client may run on different server connections. (Protocol-level prepared statements are supported in transaction mode from PgBouncer 1.21 via max_prepared_statements.)

Beyond connection management, PgBouncer provides features like connection limits per database and per user, authentication, routing of database names to different backend hosts, and online reconfiguration without downtime. It's so lightweight (~10-20MB RAM for thousands of connections) that it's often deployed on the same server as the application or database. Managed offerings such as Heroku's connection pooling and AWS RDS Proxy provide similar pooling as a service.
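The multiplexing idea behind transaction pooling can be shown with a toy simulation. This is a conceptual sketch and not how PgBouncer is implemented: the `ToyTransactionPool` class, its method names, and the client labels are all hypothetical.

```python
from collections import deque


class ToyTransactionPool:
    """Hands out a fixed number of server connections, one per active transaction."""

    def __init__(self, pool_size):
        self.idle = deque(range(pool_size))  # ids of idle server connections
        self.waiting = deque()               # clients queued for a server connection
        self.assigned = {}                   # client -> server connection id
        self.peak_in_use = 0

    def begin(self, client):
        """Start a transaction; returns False if the client has to wait."""
        if self.idle:
            self.assigned[client] = self.idle.popleft()
            self.peak_in_use = max(self.peak_in_use, len(self.assigned))
            return True
        self.waiting.append(client)
        return False

    def commit(self, client):
        """End a transaction and hand the server connection to the next waiter, if any."""
        server = self.assigned.pop(client)
        if self.waiting:
            self.assigned[self.waiting.popleft()] = server
        else:
            self.idle.append(server)


pool = ToyTransactionPool(pool_size=2)
started = [pool.begin(client) for client in ("c1", "c2", "c3")]
pool.commit("c1")  # c3 now receives the server connection c1 was using
print(started, sorted(pool.assigned), pool.peak_in_use)
# [True, True, False] ['c2', 'c3'] 2
```

Three clients are served while the database never sees more than two connections, which is the effect the pool-size settings control in a real deployment.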

Architecture Diagram:

graph TB
    subgraph "Application Layer (100s of instances)"
        APP1[App Instance 1<br/>20 connections]
        APP2[App Instance 2<br/>20 connections]
        APP3[App Instance 3<br/>20 connections]
        APP4[App Instance 4<br/>20 connections]
        APP5[App Instance 5<br/>20 connections]
    end

    subgraph "PgBouncer Layer"
        PGB[PgBouncer<br/>500 client connections<br/>↓<br/>25 server connections]
    end

    subgraph "PostgreSQL"
        PG[(PostgreSQL<br/>25 active connections<br/>vs 500 without pooling)]
    end

    APP1 -.->|Pooled| PGB
    APP2 -.->|Pooled| PGB
    APP3 -.->|Pooled| PGB
    APP4 -.->|Pooled| PGB
    APP5 -.->|Pooled| PGB

    PGB -->|Reused| PG

    style PGB fill:#FFD700
    style PG fill:#90EE90

    note1[Without PgBouncer:<br/>500 connections = ~5GB RAM]
    note2[With PgBouncer:<br/>25 connections = ~250MB RAM]

Configuration Example:

# ============================================
# /etc/pgbouncer/pgbouncer.ini
# ============================================

[databases]
# Database connection strings
mydb = host=localhost port=5432 dbname=mydb
production = host=db.example.com port=5432 dbname=prod user=app_user password=secret

# Connection via Unix socket
localdb = host=/var/run/postgresql port=5432 dbname=local

# Per-database connection limits
analytics = host=localhost port=5432 dbname=analytics pool_size=10 max_db_connections=15

[pgbouncer]
# ============================================
# Basic Settings
# ============================================

# Listen address and port
listen_addr = 0.0.0.0
listen_port = 6432

# Authentication
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Admin interface
admin_users = admin
stats_users = monitoring

# ============================================
# Pooling Configuration
# ============================================

# Pooling mode
# - session: Connection held until client disconnects (safest, least efficient)
# - transaction: Connection held only during transaction (recommended)
# - statement: Connection held only during query (breaks most applications)
pool_mode = transaction

# Connection limits
# Note: pgbouncer.ini treats ; and # as comment markers only at the start of a line,
# so each comment sits on its own line rather than after a value.

# Maximum client connections to PgBouncer
max_client_conn = 500
# Default server connections per user/database pair
default_pool_size = 25
# Minimum server connections to keep open
min_pool_size = 5
# Extra server connections a pool may open when clients wait too long
reserve_pool_size = 5
# Seconds a client must wait before the reserve pool is used
reserve_pool_timeout = 3

# Maximum server connections per user and per database
max_user_connections = 50
max_db_connections = 50

# ============================================
# Performance Tuning
# ============================================

# Close server connection after 1 hour
server_lifetime = 3600
# Close idle server connection after 10 minutes
server_idle_timeout = 600
# Timeout (seconds) for connecting to PostgreSQL
server_connect_timeout = 15

# Client idle timeout: 0 = disabled (client never kicked)
client_idle_timeout = 0
# Timeout (seconds) for client login
client_login_timeout = 60

# Query timeout: 0 = disabled
query_timeout = 0
# How long (seconds) queries wait for a server connection
query_wait_timeout = 120

# ============================================
# Logging
# ============================================

log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
# log_stats = 1                 # Log stats every stats_period

# Verbosity: 0 = quiet, 1 = normal, 2 = verbose
verbose = 0

# ============================================
# Advanced Settings
# ============================================

# Ignore startup parameters that PgBouncer does not track (some drivers, e.g. JDBC, send extra_float_digits)
ignore_startup_parameters = extra_float_digits,application_name

# DNS settings
dns_max_ttl = 15
dns_nxdomain_ttl = 15

# TLS/SSL
# client_tls_sslmode = prefer
# client_tls_cert_file = /path/to/cert.pem
# client_tls_key_file = /path/to/key.pem
# ============================================
# /etc/pgbouncer/userlist.txt
# Format: "username" "password_hash"
# ============================================

"app_user" "md5d8578edf8458ce06fbc5bb76a58c5ca4"
"admin" "md5c93ccd78b2076528346216b3b2f701e6"
"monitoring" "md5a029d0df84eb5549c641e04a9ef389e5"

# Generate hash with (then prefix the hex digest with "md5"):
# echo -n "passwordusername" | md5sum
# For example, password "secret" for user "app_user":
# echo -n "secretapp_user" | md5sum
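The same entry can be produced programmatically. The sketch below assumes the md5 format described above ("md5" followed by the MD5 hex digest of the password concatenated with the username); the helper name `pgbouncer_md5_entry` is hypothetical.

```python
import hashlib


def pgbouncer_md5_entry(username, password):
    """Build one userlist.txt line in the md5 format: "user" "md5<hex digest>"."""
    # The digest is computed over password followed by username, as in the shell example.
    digest = hashlib.md5((password + username).encode("utf-8")).hexdigest()
    return f'"{username}" "md5{digest}"'


print(pgbouncer_md5_entry("app_user", "secret"))
```

Generating the lines from a script avoids copy-paste mistakes such as forgetting the `md5` prefix or swapping the order of password and username.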
# ============================================
# Install and Start PgBouncer
# ============================================

# Install on Ubuntu/Debian
apt-get update
apt-get install pgbouncer

# Install on CentOS/RHEL
yum install pgbouncer

# Start PgBouncer
systemctl start pgbouncer
systemctl enable pgbouncer

# Check status
systemctl status pgbouncer

# Test connection through PgBouncer
psql -h localhost -p 6432 -U app_user mydb

# Check logs
tail -f /var/log/postgresql/pgbouncer.log
-- ============================================
-- PgBouncer Admin Console
-- ============================================

-- Connect to admin console (run from the shell):
--   psql -h localhost -p 6432 -U admin pgbouncer

-- Show pools
SHOW POOLS;
-- Output columns:
--   database   | user      | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait
--   mydb       | app_user  | 15        | 0          | 10        | 5       | 0       | 0         | 0        | 0

-- Show clients
SHOW CLIENTS;

-- Show servers
SHOW SERVERS;

-- Show statistics
SHOW STATS;
-- Useful columns:
--   total_xact_count  : Total transactions
--   total_query_count : Total queries
--   total_wait_time   : Time clients waited for connection
--   avg_query_time    : Average query execution time

-- Show configuration
SHOW CONFIG;

-- Show databases
SHOW DATABASES;

-- Show users
SHOW USERS;

-- Reload configuration (without restart)
RELOAD;

-- Pause database (new connections wait)
PAUSE mydb;

-- Resume database
RESUME mydb;

-- Disable database (reject new connections)
DISABLE mydb;

-- Enable database
ENABLE mydb;

-- Kill database connections
KILL mydb;

-- Reconnect all server connections
RECONNECT mydb;

-- Shutdown PgBouncer
SHUTDOWN;
# ============================================
# Application Configuration
# ============================================

# Python (psycopg2)
import psycopg2
from psycopg2 import pool

# Connect through PgBouncer
conn = psycopg2.connect(
    host="localhost",
    port=6432,  # PgBouncer port
    database="mydb",
    user="app_user",
    password="secret"
)

# Connection pooling in application + PgBouncer
# Application pool: 5 connections per worker
# PgBouncer pool: 25 connections to database
# 100 workers * 5 app connections = 500 total, but only 25 to database

app_pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=5,
    host="localhost",
    port=6432,
    database="mydb",
    user="app_user",
    password="secret"
)

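# Get connection from pool
conn = app_pool.getconn()
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM users WHERE id = %s", (123,))
        user = cursor.fetchone()
    # psycopg2 opens a transaction implicitly; end it explicitly so the
    # PgBouncer server connection can be reused (transaction pooling)
    conn.commit()
finally:
    # Always return the connection to the application pool
    app_pool.putconn(conn)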
// ============================================
// Node.js Configuration
// ============================================

// Using node-postgres (pg)
const { Pool } = require('pg');

// Application pool connected to PgBouncer
const pool = new Pool({
  host: 'localhost',
  port: 6432,  // PgBouncer port
  database: 'mydb',
  user: 'app_user',
  password: 'secret',
  max: 10,  // Application pool size
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// Query through the pool
async function getUser(id) {
  const client = await pool.connect();
  try {
    const result = await client.query('SELECT * FROM users WHERE id = $1', [id]);
    return result.rows[0];
  } finally {
    client.release();
  }
}
# ============================================
# Monitoring PgBouncer
# ============================================

# Check PgBouncer stats via admin console
psql -h localhost -p 6432 -U monitoring pgbouncer -c "SHOW STATS"

# Monitor connection pool usage
watch -n 2 'psql -h localhost -p 6432 -U monitoring pgbouncer -c "SHOW POOLS"'

# Check for waiting clients (indicates pool exhaustion)
# -A -t -F'|' prints unaligned rows without a header; cl_waiting is the 4th column
psql -h localhost -p 6432 -U monitoring -At -F'|' pgbouncer -c "SHOW POOLS" \
  | awk -F'|' '$4 > 0 {print "WARNING: " $4 " clients waiting for connection"}'

# Log analysis for slow queries
grep "query_time" /var/log/postgresql/pgbouncer.log | \
  awk '{print $NF}' | \
  sort -n | \
  tail -20
# ============================================
# PgBouncer Performance Tuning
# ============================================

# Calculate optimal pool size
# Rule of thumb: (2 * num_cores) + effective_spindle_count
# For 8-core server with SSD: (2 * 8) + 1 = 17
# Start with this, then monitor and adjust

# Monitor PostgreSQL connection count
psql -h localhost -U postgres -c "
  SELECT
    count(*) as total_connections,
    count(*) FILTER (WHERE state = 'active') as active,
    count(*) FILTER (WHERE state = 'idle') as idle
  FROM pg_stat_activity;
"

# Compare with PgBouncer
# (the admin console accepts SHOW commands only, not SELECT queries;
#  compare the cl_active, sv_active, and sv_idle columns)
psql -h localhost -p 6432 -U monitoring pgbouncer -c "SHOW POOLS"

# Memory savings calculation
# Without PgBouncer: 500 connections * 10MB = 5GB
# With PgBouncer: 25 connections * 10MB = 250MB
# Savings: 4.75GB (95% reduction)
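
The arithmetic in the comments above can be checked with a short Python sketch. The function names are hypothetical, and the 10 MB per connection figure is the rough estimate used in this section, not a measured value.

```python
def rule_of_thumb_pool_size(num_cores: int, effective_spindles: int) -> int:
    """Starting point from the rule of thumb: (2 * cores) + spindles."""
    return 2 * num_cores + effective_spindles


def connection_memory_mb(connections: int, mb_per_connection: float = 10.0) -> float:
    """Rough backend memory for a given number of PostgreSQL connections.

    The 10 MB default is an assumption taken from the estimate above.
    """
    return connections * mb_per_connection


if __name__ == "__main__":
    # 8-core server with SSD (one effective spindle)
    print("starting pool size:", rule_of_thumb_pool_size(8, 1))

    direct = connection_memory_mb(500)   # every client connects to PostgreSQL
    pooled = connection_memory_mb(25)    # PgBouncer holds 25 server connections
    saved = direct - pooled
    print(f"saved {saved / 1000:.2f} GB ({100 * saved / direct:.0f}% reduction)")
```

Treat the result as a starting value only; the pool size still has to be validated against SHOW POOLS under real load.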


Performance Tuning

What is the difference between session, transaction, and statement pooling modes?

The 30-Second Answer: Session pooling assigns one connection per client for the entire session (least efficient, most compatible). Transaction pooling releases the connection after each transaction (better efficiency, some restrictions). Statement pooling releases the connection after each statement (most efficient, many restrictions on features like prepared statements and temp tables).

The 2-Minute Answer (If They Want More):

Connection pooling is essential for PostgreSQL scalability because each PostgreSQL connection consumes significant memory (typically 5-10MB) and CPU for process management. Connection poolers like PgBouncer sit between clients and PostgreSQL, multiplexing many client connections onto fewer database connections.

Session pooling (also called client pooling) maintains a 1:1 mapping between client and server connections for the duration of the client session. When a client connects, it gets a dedicated PostgreSQL connection until disconnect. This is the most compatible mode - all PostgreSQL features work, including prepared statements, LISTEN/NOTIFY, advisory locks, and temporary tables. However, it provides minimal pooling benefits, mainly reducing connection establishment overhead. Use this mode when applications rely heavily on session-level features or when migrating from direct connections.

Transaction pooling releases the server connection back to the pool when the client transaction completes (COMMIT or ROLLBACK). This allows many more clients than database connections, as most applications spend time idle between transactions. The trade-off is that session state does not follow the client: named prepared statements cannot be relied on across transactions (unless the pooler tracks protocol-level prepared statements, as recent PgBouncer releases can), only SET LOCAL is safe because a plain SET stays on a server connection that other clients will reuse, and session-level features like temporary tables don't work across transactions. This is the recommended mode for most OLTP applications - it provides excellent efficiency with reasonable compatibility. Applications must be written to avoid holding transactions open during user think time.

Statement pooling returns the connection to the pool after every SQL statement, achieving maximum multiplexing. This severely restricts PostgreSQL features: multi-statement transactions are not allowed (each statement effectively runs in autocommit mode), prepared statements are very limited, and most session state is lost between statements. This mode is rarely used except for specific scenarios like analytics dashboards where each query is independent and maximum connection sharing is needed. Most applications cannot use this mode without significant modifications.
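
To make the difference between session and transaction pooling concrete, here is a minimal Python sketch that counts how many server connections a small workload needs in each mode. The workload, client names, and function names are invented for illustration; a real pooler also has to deal with queueing, timeouts, and per-user pools.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (transaction start, transaction end) in seconds


def session_mode_connections(clients: Dict[str, List[Interval]]) -> int:
    """Session pooling: every connected client pins one server connection."""
    return len(clients)


def transaction_mode_connections(clients: Dict[str, List[Interval]]) -> int:
    """Transaction pooling: a server connection is needed only while a
    transaction is open, so the requirement is the peak number of
    overlapping transactions."""
    events = []
    for intervals in clients.values():
        for start, end in intervals:
            events.append((start, 1))    # transaction begins
            events.append((end, -1))     # transaction ends
    # At equal timestamps, ends (-1) sort before begins (+1)
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical workload: four clients, short transactions, long idle gaps
workload = {
    "client_1": [(0.0, 0.1), (5.0, 5.1)],
    "client_2": [(0.05, 0.15)],
    "client_3": [(2.0, 2.1)],
    "client_4": [(2.05, 2.2), (5.05, 5.2)],
}

if __name__ == "__main__":
    print("session mode:    ", session_mode_connections(workload))
    print("transaction mode:", transaction_mode_connections(workload))
```

In this toy workload, four clients never have more than two transactions open at once, which is why transaction pooling can serve them with half the server connections.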

graph TB
    subgraph "Session Pooling"
        C1[Client 1] -->|entire session| PC1[Pool Conn 1]
        C2[Client 2] -->|entire session| PC2[Pool Conn 2]
        C3[Client 3] -->|entire session| PC3[Pool Conn 3]
        PC1 --> PG1[(PostgreSQL)]
        PC2 --> PG1
        PC3 --> PG1
    end

    subgraph "Transaction Pooling"
        C4[Client 1] -->|during txn| PC4[Pool Conn 1]
        C5[Client 2] -->|during txn| PC4
        C6[Client 3] -->|during txn| PC5[Pool Conn 2]
        C7[Client 4] -->|during txn| PC5
        PC4 --> PG2[(PostgreSQL)]
        PC5 --> PG2
    end

    subgraph "Statement Pooling"
        C8[Client 1] -->|per statement| PC6[Pool Conn 1]
        C9[Client 2] -->|per statement| PC6
        C10[Client 3] -->|per statement| PC6
        C11[Client 4] -->|per statement| PC6
        PC6 --> PG3[(PostgreSQL)]
    end

    style C1 fill:#e1f5ff
    style C2 fill:#e1f5ff
    style C3 fill:#e1f5ff
    style C4 fill:#e1f5ff
    style C5 fill:#e1f5ff
    style C6 fill:#e1f5ff
    style C7 fill:#e1f5ff
    style C8 fill:#e1f5ff
    style C9 fill:#e1f5ff
    style C10 fill:#e1f5ff
    style C11 fill:#e1f5ff

Code Example:

; PgBouncer configuration examples (pgbouncer.ini)
; Comments must start a line with ";" or "#"; inline comments are not supported

; SESSION POOLING (pool_mode = session)
; Most compatible, least efficient
; Use when: applications need temporary tables, prepared statements, advisory locks
[databases]
mydb = host=localhost port=5432 dbname=production

[pgbouncer]
pool_mode = session
; Many clients allowed
max_client_conn = 1000
; But need many PostgreSQL connections
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 5
; All PostgreSQL features work:
; - Named prepared statements
; - Temporary tables
; - LISTEN/NOTIFY
; - Advisory locks
; - SET LOCAL and SET SESSION

; TRANSACTION POOLING (pool_mode = transaction) - RECOMMENDED
; Good balance of efficiency and compatibility
; Use for: most OLTP applications
[databases]
mydb = host=localhost port=5432 dbname=production

[pgbouncer]
pool_mode = transaction
; Many more clients
max_client_conn = 5000
; Fewer PostgreSQL connections needed
default_pool_size = 20
reserve_pool_size = 5
max_db_connections = 50
server_idle_timeout = 600

; Compatible features:
; - Unnamed (protocol-level) prepared statements; SQL PREPARE only within one transaction
; - Multi-statement transactions
; - Transaction-level SET statements (SET LOCAL)

; NOT compatible:
; - Temporary tables across transactions (the next transaction may run on another connection)
; - SET SESSION (only SET LOCAL works)
; - Prepared statements across transactions
; - LISTEN/NOTIFY
; - WITH HOLD cursors
; - Advisory locks across transactions

; STATEMENT POOLING (pool_mode = statement)
; Maximum efficiency, many restrictions
; Use for: independent query workloads, analytics dashboards
[databases]
mydb = host=localhost port=5432 dbname=production

[pgbouncer]
pool_mode = statement
; Maximum clients
max_client_conn = 10000
; Minimum PostgreSQL connections
default_pool_size = 10
reserve_pool_size = 2

; Very limited compatibility:
; - No multi-statement transactions (each statement runs in autocommit mode)
; - No prepared statements
; - No session variables
; - No temporary tables

-- Application code examples for transaction pooling

-- GOOD: Short-lived transactions
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- Connection immediately returned to pool

-- BAD: Transaction held during user interaction
BEGIN;
SELECT * FROM products WHERE id = 123;
-- Application waits for user input (holds connection!)
-- ... user thinks for 30 seconds ...
INSERT INTO cart (product_id) VALUES (123);
COMMIT;
-- Connection unavailable for 30+ seconds

-- GOOD: Separate transactions
SELECT * FROM products WHERE id = 123;
-- User thinks (no connection held)
BEGIN;
INSERT INTO cart (product_id) VALUES (123);
COMMIT;

-- Monitoring PgBouncer performance
-- Connect to pgbouncer admin console
psql -p 6432 -U pgbouncer pgbouncer

-- Show pool statistics
SHOW POOLS;
-- Look at:
-- - cl_active: active client connections
-- - cl_waiting: queued clients (should be low)
-- - sv_active: active server connections
-- - sv_idle: idle server connections
-- - maxwait: max queue time (should be < 1s)

-- Show per-database stats
SHOW STATS;

-- Show configuration
SHOW CONFIG;

-- Example pool sizing calculation
-- Target: 1000 concurrent users, average 100ms transaction time
-- User think time: 5 seconds between transactions
-- Concurrent transactions = 1000 * (0.1 / 5.1) = ~20
-- With a 1.5x buffer for overhead and spikes: default_pool_size = ~30

-- Formula: pool_size = (concurrent_users * avg_transaction_time) /
--                      (avg_transaction_time + avg_think_time)
SELECT
    1000 AS concurrent_users,
    0.1 AS avg_txn_time_sec,
    5.0 AS avg_think_time_sec,
    round((1000 * 0.1) / (0.1 + 5.0)) AS calculated_pool_size,
    round((1000 * 0.1) / (0.1 + 5.0) * 1.5) AS recommended_with_buffer;
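
The same sizing formula can be wrapped in a small Python helper for trying other workloads. This is a sketch: the function name and the 1.5x buffer default are assumptions for illustration, and the buffered value is rounded up rather than to the nearest integer.

```python
import math
from typing import Tuple


def estimate_pool_size(concurrent_users: int,
                       avg_txn_time_sec: float,
                       avg_think_time_sec: float,
                       buffer_factor: float = 1.5) -> Tuple[int, int]:
    """Return (calculated_pool_size, recommended_with_buffer).

    A client occupies a server connection only for the fraction of time it
    spends inside a transaction: txn_time / (txn_time + think_time).
    """
    busy_fraction = avg_txn_time_sec / (avg_txn_time_sec + avg_think_time_sec)
    base = concurrent_users * busy_fraction
    return round(base), math.ceil(base * buffer_factor)


if __name__ == "__main__":
    calculated, with_buffer = estimate_pool_size(1000, 0.1, 5.0)
    print("calculated pool size:", calculated)
    print("with buffer:         ", with_buffer)
```

The formula assumes transactions arrive evenly over time; bursty traffic needs a larger buffer, which cl_waiting and maxwait in SHOW POOLS will reveal.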


Partitioning

What is partition pruning and how does it improve performance?

The 30-Second Answer: Partition pruning is PostgreSQL's query optimization technique that eliminates scanning irrelevant partitions by analyzing query WHERE clauses against partition constraints. Instead of scanning all partitions, PostgreSQL identifies and scans only the partitions that could contain matching rows, dramatically reducing I/O operations, improving query performance, and decreasing resource consumption for large partitioned tables.

The 2-Minute Answer (If They Want More): When you query a partitioned table, PostgreSQL's query planner examines the WHERE clause conditions and compares them against each partition's constraints (the range bounds, list values, or hash distribution). If the planner can prove that a partition cannot contain any matching rows, it excludes that partition from the query execution plan entirely. For constant values this happens at plan time, before any data is read from disk.

Partition pruning provides several performance benefits: It reduces the number of disk blocks that need to be read, minimizes memory usage for query execution, decreases CPU time spent processing irrelevant data, enables more efficient use of indexes (fewer index pages to scan), and allows parallel queries to focus workers on relevant partitions only. The performance improvement is proportional to the number of partitions eliminated.

For partition pruning to work effectively, queries must include conditions on the partition key. The planner performs both static pruning (during plan creation, based on constant values) and dynamic pruning (during execution, for parameterized values like prepared statements). Partition pruning works with equality conditions, range conditions (BETWEEN, <, >), and IN lists for the partition key.

To maximize partition pruning effectiveness: choose partition keys that align with your query patterns, ensure WHERE clauses filter on partition keys, use appropriate comparison operators, avoid wrapping the partition key in functions or expressions (for example EXTRACT(MONTH FROM log_date)) because the planner then cannot match the condition against the partition bounds, and monitor EXPLAIN output to verify pruning is occurring. You can also use the enable_partition_pruning configuration parameter to control this behavior.
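
The core idea of pruning range partitions can be sketched in a few lines of Python: keep only the partitions whose bounds overlap the queried key range. This is a simplified model under stated assumptions (a single DATE key, inclusive lower and exclusive upper bounds, hypothetical names), not PostgreSQL's actual planner logic.

```python
from datetime import date
from typing import List, NamedTuple


class RangePartition(NamedTuple):
    name: str
    lower: date   # inclusive bound (FROM)
    upper: date   # exclusive bound (TO)


def prune(partitions: List[RangePartition], lo: date, hi: date) -> List[str]:
    """Return the partitions that may hold rows with lo <= key <= hi."""
    return [p.name for p in partitions if p.lower <= hi and lo < p.upper]


# Six monthly partitions for January through June 2024
partitions = [
    RangePartition(f"access_logs_2024_{m:02d}", date(2024, m, 1), date(2024, m + 1, 1))
    for m in range(1, 7)
]

if __name__ == "__main__":
    # Equality on the partition key: one partition survives
    print(prune(partitions, date(2024, 2, 15), date(2024, 2, 15)))
    # BETWEEN '2024-02-01' AND '2024-03-31': two partitions survive
    print(prune(partitions, date(2024, 2, 1), date(2024, 3, 31)))
```

A condition such as EXTRACT(MONTH FROM log_date) = 2 gives no key range to compare against the bounds, which is why every partition has to be kept in that case.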

Code Example:

-- Create a partitioned table for demonstration
CREATE TABLE access_logs (
    log_id BIGSERIAL,
    log_date DATE NOT NULL,
    user_id INT,
    action VARCHAR(50),
    ip_address INET,
    response_time INT
) PARTITION BY RANGE (log_date);

-- Create monthly partitions for 2024
CREATE TABLE access_logs_2024_01 PARTITION OF access_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE access_logs_2024_02 PARTITION OF access_logs
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE access_logs_2024_03 PARTITION OF access_logs
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

CREATE TABLE access_logs_2024_04 PARTITION OF access_logs
    FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');

CREATE TABLE access_logs_2024_05 PARTITION OF access_logs
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

CREATE TABLE access_logs_2024_06 PARTITION OF access_logs
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');

-- Insert sample data
INSERT INTO access_logs (log_date, user_id, action, ip_address, response_time)
SELECT
    DATE '2024-01-01' + (random() * 180)::int,
    (random() * 1000)::int,
    (ARRAY['login', 'logout', 'view', 'edit', 'delete'])[floor(random() * 5 + 1)],
    ('192.168.1.' || floor(random() * 255))::inet,
    (random() * 1000)::int
FROM generate_series(1, 100000);

-- ========================================
-- EXAMPLE 1: Effective partition pruning
-- ========================================
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM access_logs
WHERE log_date = '2024-02-15';

-- The plan shows only access_logs_2024_02 being scanned:
-- Seq Scan on access_logs_2024_02 access_logs
-- (1 of 6 partitions; pruned partitions simply do not appear in the plan)

-- ========================================
-- EXAMPLE 2: Range query with pruning
-- ========================================
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM access_logs
WHERE log_date BETWEEN '2024-02-01' AND '2024-03-31';

-- Only scans 2 of 6 partitions (Feb and Mar);
-- the other 4 do not appear in the plan

-- ========================================
-- EXAMPLE 3: No partition pruning (bad)
-- ========================================
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM access_logs
WHERE EXTRACT(MONTH FROM log_date) = 2;

-- Scans ALL 6 partitions because the function on log_date prevents pruning

-- ========================================
-- EXAMPLE 4: Dynamic partition pruning
-- ========================================
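-- With prepared statement
PREPARE log_query (DATE) AS
SELECT * FROM access_logs WHERE log_date = $1;

-- Force a generic plan so pruning is deferred to execution time
-- (by default the first executions use custom plans, pruned at plan time)
SET plan_cache_mode = force_generic_plan;

EXPLAIN (ANALYZE, BUFFERS) EXECUTE log_query('2024-03-10');
-- Dynamic pruning at execution time:
-- the plan reports "Subplans Removed: 5" and scans 1 of 6 partitions

RESET plan_cache_mode;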

-- ========================================
-- EXAMPLE 5: Multi-level partitioning (sub-partitioning)
-- ========================================
CREATE TABLE events (
    event_id BIGSERIAL,
    event_date DATE NOT NULL,
    region VARCHAR(50) NOT NULL,
    event_type VARCHAR(50),
    data JSONB
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2024_q1 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01')
    PARTITION BY LIST (region);

CREATE TABLE events_2024_q1_usa PARTITION OF events_2024_q1
    FOR VALUES IN ('USA');

CREATE TABLE events_2024_q1_europe PARTITION OF events_2024_q1
    FOR VALUES IN ('Europe');

-- Query with both partition keys - maximum pruning
EXPLAIN SELECT * FROM events
WHERE event_date = '2024-02-15' AND region = 'USA';
-- Only scans events_2024_q1_usa

-- ========================================
-- Monitoring partition pruning
-- ========================================
-- Check if partition pruning is enabled
SHOW enable_partition_pruning;

-- Disable for testing (not recommended in production)
SET enable_partition_pruning = off;

-- Re-run query to see difference
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM access_logs WHERE log_date = '2024-02-15';
-- Now scans all 6 partitions (slower)

-- Re-enable pruning
SET enable_partition_pruning = on;

-- ========================================
-- View partition sizes and live row counts
-- ========================================
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    pg_stat_get_live_tuples(
        (schemaname||'.'||tablename)::regclass::oid
    ) AS live_tuples
FROM pg_tables
WHERE tablename LIKE 'access_logs_%'
ORDER BY tablename;

-- ========================================
-- Performance comparison
-- ========================================
-- Without partition key (full scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM access_logs WHERE user_id = 500;
-- Scans: All 6 partitions

-- With partition key (pruned)
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM access_logs
WHERE log_date = '2024-02-15' AND user_id = 500;
-- Scans: Only 1 partition (roughly 6x less I/O if rows are spread evenly)

Partition Pruning Visualization:

graph TD
    A[Query: SELECT * FROM logs<br/>WHERE log_date = '2024-02-15'] --> B{Query Planner<br/>Partition Pruning}

    B -->|Analyze WHERE clause| C{Check Partition<br/>Constraints}

    C -->|Jan 1 - Feb 1| D[logs_2024_01<br/>❌ PRUNED<br/>Feb 15 not in range]
    C -->|Feb 1 - Mar 1| E[logs_2024_02<br/>✓ SCANNED<br/>Feb 15 in range]
    C -->|Mar 1 - Apr 1| F[logs_2024_03<br/>❌ PRUNED<br/>Feb 15 not in range]
    C -->|Apr 1 - May 1| G[logs_2024_04<br/>❌ PRUNED<br/>Feb 15 not in range]
    C -->|May 1 - Jun 1| H[logs_2024_05<br/>❌ PRUNED<br/>Feb 15 not in range]
    C -->|Jun 1 - Jul 1| I[logs_2024_06<br/>❌ PRUNED<br/>Feb 15 not in range]

    E --> J[Query Result<br/>83% I/O reduction<br/>1/6 partitions scanned]

    style A fill:#fff3cd
    style B fill:#e1f5ff
    style C fill:#e1f5ff
    style D fill:#f8d7da
    style E fill:#d4edda
    style F fill:#f8d7da
    style G fill:#f8d7da
    style H fill:#f8d7da
    style I fill:#f8d7da
    style J fill:#d1ecf1

Performance Impact Table:

Scenario                                   | Partitions Scanned | I/O Reduction | Performance Gain
Query with partition key (exact match)     | 1 of 12            | 92%           | 10-12x faster
Query with partition key (range: 2 months) | 2 of 12            | 83%           | 5-6x faster
Query with partition key (range: quarter)  | 3 of 12            | 75%           | 3-4x faster
Query without partition key                | 12 of 12           | 0%            | No improvement
Query with function on partition key       | 12 of 12           | 0%            | No improvement


MVCC and Concurrency Control

What is transaction ID wraparound and how do you prevent it?

The 30-Second Answer: Transaction ID wraparound occurs when PostgreSQL's 32-bit transaction ID counter reaches its maximum and wraps around to zero, potentially making old data appear to be in the future. It's prevented by regular VACUUM operations that "freeze" old transaction IDs before the wraparound threshold is reached.

The 2-Minute Answer (If They Want More): PostgreSQL uses 32-bit integers for transaction IDs (XIDs), providing approximately 4 billion unique transaction IDs. Due to MVCC's visibility rules, transaction IDs are compared to determine which row versions are visible. The comparison is circular (modulo 2^32): relative to any given XID, roughly 2 billion XIDs count as older and 2 billion as newer. If a row's XID were allowed to become more than about 2^31 transactions old, it would suddenly appear to be in the future, making the row invisible and causing catastrophic data loss.

To prevent this, PostgreSQL marks rows that are old enough to be visible to all current and future transactions as "frozen" (historically by replacing xmin with the special FrozenTransactionId, XID 2; since version 9.4 by setting a flag bit in the tuple header). VACUUM freezes rows older than vacuum_freeze_min_age when it visits their pages, and autovacuum forces an aggressive freezing pass once a table's oldest unfrozen XID reaches autovacuum_freeze_max_age (200 million by default).

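The critical parameters are autovacuum_freeze_max_age (default 200 million) and vacuum_freeze_min_age (default 50 million). When a table's oldest unfrozen XID reaches autovacuum_freeze_max_age, an aggressive autovacuum is triggered regardless of other settings, even if autovacuum is otherwise disabled. If autovacuum can't keep up, PostgreSQL first logs warnings and, once the remaining headroom is nearly exhausted (about 1 to 3 million XIDs before wraparound, depending on version), refuses to assign new transaction IDs until the affected tables are vacuumed. The database then rejects writes rather than risking data loss.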

Warning signs include increasing transaction ID age, autovacuum running constantly on specific tables, and log messages about wraparound prevention. Preventive measures include monitoring transaction ID age, ensuring autovacuum is running efficiently, manually running VACUUM FREEZE on large tables during maintenance windows, and tuning autovacuum parameters for high-transaction workloads.
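
The circular comparison behind wraparound can be illustrated with a small Python sketch. It mimics the idea of treating the difference of two XIDs as a signed 32-bit number; the function names are hypothetical and the special XIDs (0, 1, 2) are deliberately ignored.

```python
XID_SPACE = 2 ** 32    # XIDs are 32-bit counters
HALF_SPACE = 2 ** 31   # about 2 billion XIDs in each direction


def xid_precedes(a: int, b: int) -> bool:
    """True if XID a counts as older than XID b under circular comparison.

    Equivalent to checking whether (a - b), read as a signed 32-bit
    integer, is negative.
    """
    return (a - b) % XID_SPACE >= HALF_SPACE


def xid_age(current: int, xid: int) -> int:
    """Number of transactions between xid and current, modulo 2^32."""
    return (current - xid) % XID_SPACE


if __name__ == "__main__":
    row_xid = 100

    # Normal case: the row's XID is older than the current XID
    print(xid_precedes(row_xid, 200))

    # The counter has advanced by more than 2^31 without freezing:
    # the same row no longer counts as older, so it looks "in the future"
    far_ahead = row_xid + HALF_SPACE + 1
    print(xid_precedes(row_xid, far_ahead))

    # Age stays well defined across the physical wrap of the counter
    print(xid_age(current=5, xid=XID_SPACE - 10))
```

Freezing removes a row from this comparison altogether, which is why keeping age(relfrozenxid) far below 2 billion is the operational goal.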

Code Example:

-- Check transaction ID age for all databases
SELECT datname,
       age(datfrozenxid) as xid_age,
       pg_size_pretty(pg_database_size(datname)) as size,
       round(100.0 * age(datfrozenxid) / 2000000000, 2) as percent_to_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

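-- Check tables closest to needing freezing
-- (relfrozenxid is stored in pg_class, not in pg_stat_user_tables)
SELECT n.nspname AS schemaname, c.relname AS tablename,
       age(c.relfrozenxid) as xid_age,
       pg_size_pretty(pg_total_relation_size(c.oid)) as table_size,
       round(100.0 * age(c.relfrozenxid) / 2000000000, 2) as percent_to_emergency
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;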

-- Check autovacuum settings
SHOW autovacuum_freeze_max_age;  -- Default: 200000000
SHOW vacuum_freeze_min_age;      -- Default: 50000000
SHOW vacuum_freeze_table_age;    -- Default: 150000000

-- Manual freeze for a specific table (during maintenance window)
VACUUM FREEZE users;

-- Aggressive vacuum for entire database
VACUUM FREEZE;

-- Monitor autovacuum activity
SELECT schemaname, relname AS tablename,
       last_vacuum,
       last_autovacuum,
       n_tup_ins + n_tup_upd + n_tup_del as total_changes,
       n_dead_tup,
       autovacuum_count
FROM pg_stat_user_tables
WHERE autovacuum_count > 0
ORDER BY last_autovacuum DESC NULLS LAST
LIMIT 20;

-- Check for wraparound emergency
SELECT datname,
       age(datfrozenxid),
       2000000000 - age(datfrozenxid) as xids_remaining,
       CASE
           WHEN age(datfrozenxid) > 1800000000 THEN 'CRITICAL - Emergency!'
           WHEN age(datfrozenxid) > 1500000000 THEN 'WARNING - Take action soon'
           WHEN age(datfrozenxid) > 1000000000 THEN 'CAUTION - Monitor closely'
           ELSE 'OK'
       END as status
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- Tuning autovacuum for high-transaction tables
ALTER TABLE high_activity_table SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- More aggressive
    autovacuum_vacuum_threshold = 1000,
    autovacuum_freeze_min_age = 10000000,
    autovacuum_freeze_max_age = 150000000
);

-- Check current transaction ID
SELECT txid_current();

-- Monitor long-running transactions (they prevent freezing)
SELECT pid,
       now() - xact_start as duration,
       state,
       query,
       age(backend_xid) as xid_age
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

Transaction ID Wraparound Diagram:

graph TD
    A[Transaction Counter] -->|Increases with each transaction| B{Age Check}
    B -->|< 50M XID old| C[Normal Operation]
    B -->|50M-200M XID old| D[Autovacuum Freezes Rows]
    B -->|> 200M XID old| E[Aggressive Autovacuum]
    E --> F[Freeze Old XIDs]
    B -->|> 2B XID old| G[EMERGENCY: DB Shutdown]

    D --> H[Row Updated: xmin = FrozenXID]
    F --> H
    H --> I[Row Always Visible]

    G --> J[Manual VACUUM FREEZE Required]
    J --> K[Database Restart]

    style G fill:#ff0000,color:#fff
    style E fill:#ff9900,color:#000
    style D fill:#ffff00,color:#000
    style C fill:#00ff00,color:#000


Indexing

What is a BRIN index and when is it most effective?

The 30-Second Answer: BRIN (Block Range Index) stores min/max values for groups of table blocks instead of indexing individual rows, resulting in tiny indexes (often 1000x smaller than B-tree). It's most effective for very large tables with natural physical ordering, like time-series data or append-only logs.

The 2-Minute Answer (If They Want More):

BRIN indexes work by dividing the table into ranges of consecutive pages (default 128 pages) and storing summary information (min/max values) for each range. When querying, PostgreSQL checks if the search value could exist in each range and only scans those ranges where it's possible.

The effectiveness of BRIN depends heavily on physical correlation - how well the indexed column's values correlate with their physical storage order. For example, a log table where new rows are always appended with increasing timestamps has perfect correlation. A table that's frequently updated or has random insertions will have poor correlation.

BRIN indexes are ideal for: massive tables (billions of rows), naturally ordered data (timestamps, auto-incrementing IDs), append-only workloads, and data warehousing scenarios. They use minimal storage (often a few MB for multi-TB tables) and have very fast creation times.

However, BRIN is ineffective for: frequently updated tables, randomly distributed values, point lookups requiring exact matches, or small tables where the overhead isn't worth it. The tradeoff is between index size/maintenance cost and query precision - BRIN might scan more rows than B-tree but uses far less space.
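
How much a BRIN-style summary helps depends on physical ordering, which a toy Python model can show. The sketch below keeps one (min, max) pair per block of rows and reports which blocks a range query must still read. It is a simplification with hypothetical names: "rows per range" stands in for pages_per_range, and real BRIN works on heap pages.

```python
from typing import List, Tuple


def build_brin(values: List[int], rows_per_range: int) -> List[Tuple[int, int]]:
    """Summarise each consecutive block of rows as (min, max)."""
    return [
        (min(chunk), max(chunk))
        for chunk in (values[i:i + rows_per_range]
                      for i in range(0, len(values), rows_per_range))
    ]


def ranges_to_scan(summary: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Indexes of block ranges whose [min, max] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(summary) if mn <= hi and lo <= mx]


# Append-only style data: values follow physical order (correlation near 1.0)
ordered = list(range(1000))
# Same values in a scattered physical order (37 is coprime with 1000)
scattered = [(i * 37) % 1000 for i in range(1000)]

if __name__ == "__main__":
    for label, data in (("ordered", ordered), ("scattered", scattered)):
        summary = build_brin(data, rows_per_range=100)
        hits = ranges_to_scan(summary, 250, 260)
        print(f"{label}: scan {len(hits)} of {len(summary)} block ranges")
```

With ordered data a single block range matches the query; with scattered data every range's min/max spans almost the whole value domain, so the summary excludes nothing.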

Code Example:

-- Create a time-series table with naturally ordered data
CREATE TABLE sensor_readings (
    id BIGSERIAL,
    sensor_id INTEGER,
    temperature DECIMAL(5,2),
    reading_time TIMESTAMP,
    metadata JSONB
);

-- Insert time-ordered data (simulating append-only)
INSERT INTO sensor_readings (sensor_id, temperature, reading_time, metadata)
SELECT
    (random() * 1000)::INTEGER,
    random() * 100,
    TIMESTAMP '2024-01-01' + (n || ' seconds')::INTERVAL,
    '{}'::JSONB
FROM generate_series(1, 10000000) AS n;

-- Create BRIN index on timestamp column
CREATE INDEX idx_readings_time_brin
ON sensor_readings USING BRIN(reading_time);

-- Compare with B-tree index
CREATE INDEX idx_readings_time_btree
ON sensor_readings(reading_time);

-- Check index sizes
SELECT
    indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'sensor_readings';
-- BRIN: typically < 1 MB
-- B-tree: typically 200-500 MB

-- Check physical correlation (closer to 1.0 or -1.0 is better for BRIN)
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'sensor_readings'
  AND attname = 'reading_time';
-- Should be close to 1.0 for timestamp in append-only table

-- Effective BRIN query (range scan)
-- (the sample data covers 2024-01-01 through late April 2024)
EXPLAIN ANALYZE
SELECT AVG(temperature)
FROM sensor_readings
WHERE reading_time BETWEEN '2024-03-01' AND '2024-03-31';

-- Adjust pages_per_range for tuning
CREATE INDEX idx_readings_custom_brin
ON sensor_readings USING BRIN(reading_time)
WITH (pages_per_range = 256);

-- Vacuum to update BRIN summarization
VACUUM sensor_readings;


Security

What are the security best practices for PostgreSQL in production?

The 30-Second Answer: PostgreSQL production security requires multiple layers: use SCRAM-SHA-256 authentication with strong passwords, enforce SSL/TLS for all connections, apply the principle of least privilege with specific role permissions, enable audit logging, keep the database updated with security patches, restrict network access with pg_hba.conf and firewalls, and implement row-level security for sensitive data. Regular security audits and monitoring are essential.

The 2-Minute Answer (If They Want More):

Production PostgreSQL security follows defense-in-depth principles across network, authentication, authorization, encryption, and monitoring layers. Start with network security: never expose PostgreSQL directly to the internet, use VPCs or private networks, configure pg_hba.conf to whitelist only necessary IP ranges, and enforce SSL/TLS for all connections using certificates.

Authentication and authorization require careful planning. Use SCRAM-SHA-256 (not MD5), enforce strong password policies, implement separate roles for different application components (read-only, read-write, admin), and never use the postgres superuser for application connections. Apply the principle of least privilege—grant only necessary permissions on specific schemas and tables, revoke public schema permissions, and use GRANT statements precisely.

Data protection involves multiple techniques: encrypt data at rest using disk-level or filesystem-level encryption (core PostgreSQL has no built-in transparent data encryption), encrypt data in transit with SSL/TLS, use column-level encryption such as the pgcrypto extension for highly sensitive data (like credit cards or SSNs), implement row-level security for multi-tenant isolation, and regularly back up data with encrypted backups stored securely off-site.

Monitoring and auditing are critical: enable pgaudit or similar extensions for comprehensive logging, monitor for suspicious queries and connection patterns, set up alerts for failed authentication attempts, log all DDL changes, and regularly review granted permissions. Keep PostgreSQL updated with security patches, harden the OS, disable unnecessary extensions, and conduct regular security assessments including penetration testing.

Code Example:

-- Comprehensive PostgreSQL Security Configuration

-- ============================================
-- 1. AUTHENTICATION & USER MANAGEMENT
-- ============================================

-- Set password encryption to modern standard
SET password_encryption = 'scram-sha-256';

-- Create application-specific roles (not superusers!)
CREATE ROLE app_readonly LOGIN PASSWORD 'strong_readonly_password'
    CONNECTION LIMIT 50
    VALID UNTIL '2026-12-31';

CREATE ROLE app_readwrite LOGIN PASSWORD 'strong_readwrite_password'
    CONNECTION LIMIT 100
    VALID UNTIL '2026-12-31';

CREATE ROLE app_admin LOGIN PASSWORD 'strong_admin_password'
    CONNECTION LIMIT 10
    VALID UNTIL '2026-12-31'
    CREATEROLE;  -- Can manage other roles but not superuser

-- Create a role for database owner (not superuser)
CREATE ROLE db_owner LOGIN PASSWORD 'owner_password'
    CREATEDB;

-- Never use these in production applications:
-- DO NOT: GRANT ALL PRIVILEGES ON DATABASE mydb TO app_user;
-- DO NOT: ALTER ROLE app_user SUPERUSER;
-- DO NOT: CREATE ROLE app_user WITH PASSWORD 'password123';  -- weak password

-- ============================================
-- 2. DATABASE & SCHEMA SECURITY
-- ============================================

-- Create application database
CREATE DATABASE production_app OWNER db_owner;

\c production_app

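-- Revoke default public schema permissions
REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON DATABASE production_app FROM PUBLIC;

-- The revoke above also removes CONNECT, so grant it back to the application roles
GRANT CONNECT ON DATABASE production_app TO app_readonly, app_readwrite, app_admin;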

-- Create separate schemas for different security levels
CREATE SCHEMA app_data AUTHORIZATION db_owner;
CREATE SCHEMA sensitive_data AUTHORIZATION db_owner;
CREATE SCHEMA audit_logs AUTHORIZATION db_owner;

-- Grant specific schema access
GRANT USAGE ON SCHEMA app_data TO app_readonly, app_readwrite;
GRANT USAGE ON SCHEMA sensitive_data TO app_admin;

-- ============================================
-- 3. TABLE-LEVEL PERMISSIONS
-- ============================================

-- Create tables with proper ownership
CREATE TABLE app_data.users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE sensitive_data.payment_info (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES app_data.users(id),
    encrypted_card_data BYTEA,  -- Always encrypt sensitive data
    last_four CHAR(4),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE audit_logs.access_log (
    id BIGSERIAL PRIMARY KEY,
    user_role TEXT,
    action TEXT,
    table_name TEXT,
    row_id INTEGER,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address INET
);

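-- Grant specific table permissions (principle of least privilege)
GRANT SELECT ON ALL TABLES IN SCHEMA app_data TO app_readonly;
GRANT SELECT, INSERT, UPDATE ON app_data.users TO app_readwrite;
-- Note: No DELETE permission for app_readwrite (soft deletes only)
-- INSERT into a SERIAL column also needs USAGE on the backing sequence
GRANT USAGE ON SEQUENCE app_data.users_id_seq TO app_readwrite;

GRANT SELECT, INSERT ON sensitive_data.payment_info TO app_admin;
GRANT USAGE ON SEQUENCE sensitive_data.payment_info_id_seq TO app_admin;
-- Note: Even admin can't UPDATE or DELETE payment info (audit trail)

-- Ensure future tables (and their sequences) inherit permissions
ALTER DEFAULT PRIVILEGES IN SCHEMA app_data
    GRANT SELECT ON TABLES TO app_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA app_data
    GRANT SELECT, INSERT, UPDATE ON TABLES TO app_readwrite;
ALTER DEFAULT PRIVILEGES IN SCHEMA app_data
    GRANT USAGE ON SEQUENCES TO app_readwrite;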

-- ============================================
-- 4. ROW-LEVEL SECURITY
-- ============================================

ALTER TABLE sensitive_data.payment_info ENABLE ROW LEVEL SECURITY;
ALTER TABLE sensitive_data.payment_info FORCE ROW LEVEL SECURITY;

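-- Only allow users to see their own payment information.
-- Note: a policy only filters rows. app_readwrite would also need USAGE on
-- sensitive_data and table privileges before this policy has any effect.
CREATE POLICY user_own_payment ON sensitive_data.payment_info
    FOR ALL
    TO app_readwrite
    USING (user_id = current_setting('app.current_user_id')::INTEGER);

-- Admin can see all, but only with explicit session variable set
CREATE POLICY admin_audit_payment ON sensitive_data.payment_info
    FOR SELECT
    TO app_admin
    USING (current_setting('app.admin_audit_mode', true) = 'true');

-- With RLS enabled, the INSERT privilege granted to app_admin needs its own policy
CREATE POLICY admin_insert_payment ON sensitive_data.payment_info
    FOR INSERT
    TO app_admin
    WITH CHECK (true);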

-- ============================================
-- 5. ENCRYPTION
-- ============================================

-- Install pgcrypto for encryption functions
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Function for encrypting sensitive data
CREATE OR REPLACE FUNCTION encrypt_sensitive_data(data TEXT)
RETURNS BYTEA AS $$
BEGIN
    -- Use application-managed encryption key (store securely, not in DB!)
    RETURN pgp_sym_encrypt(data, current_setting('app.encryption_key'));
END;
$$ LANGUAGE plpgsql SECURITY DEFINER
   SET search_path = public, pg_temp;  -- pin search_path; assumes pgcrypto is installed in public

-- Function for decrypting (only accessible to authorized roles)
CREATE OR REPLACE FUNCTION decrypt_sensitive_data(encrypted_data BYTEA)
RETURNS TEXT AS $$
BEGIN
    RETURN pgp_sym_decrypt(encrypted_data, current_setting('app.encryption_key'));
END;
$$ LANGUAGE plpgsql SECURITY DEFINER
   SET search_path = public, pg_temp;  -- same assumption as above

REVOKE ALL ON FUNCTION decrypt_sensitive_data(BYTEA) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION decrypt_sensitive_data(BYTEA) TO app_admin;
-- The functions live in schema public, whose access was revoked from PUBLIC earlier
GRANT USAGE ON SCHEMA public TO app_admin;

-- ============================================
-- 6. AUDIT LOGGING
-- ============================================

-- Install pgaudit extension (if available)
-- CREATE EXTENSION IF NOT EXISTS pgaudit;

-- Custom audit trigger for sensitive tables
CREATE OR REPLACE FUNCTION audit_trigger_func()
RETURNS TRIGGER AS $$
BEGIN
    -- session_user, not current_user: inside a SECURITY DEFINER function
    -- current_user is the function owner rather than the connected role
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        INSERT INTO audit_logs.access_log (user_role, action, table_name, row_id)
        VALUES (session_user, TG_OP, TG_TABLE_NAME, NEW.id);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO audit_logs.access_log (user_role, action, table_name, row_id)
        VALUES (session_user, TG_OP, TG_TABLE_NAME, OLD.id);
        RETURN OLD;
    END IF;
    RETURN NULL;  -- not reached for the row-level events handled above
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- Apply audit trigger to sensitive tables
CREATE TRIGGER audit_payment_info
    AFTER INSERT OR UPDATE OR DELETE ON sensitive_data.payment_info
    FOR EACH ROW EXECUTE FUNCTION audit_trigger_func();

-- ============================================
-- 7. CONNECTION SECURITY
-- ============================================

-- Configure postgresql.conf settings (edit postgresql.conf file):
-- ssl = on
-- ssl_cert_file = '/path/to/server.crt'
-- ssl_key_file = '/path/to/server.key'
-- ssl_ca_file = '/path/to/root.crt'
-- ssl_min_protocol_version = 'TLSv1.2'
-- password_encryption = 'scram-sha-256'

-- Configure pg_hba.conf for strict access control
-- (clientcert=verify-full replaces the older clientcert=1, which PostgreSQL 14 removed):
-- hostssl all             app_readonly    10.0.1.0/24     scram-sha-256
-- hostssl all             app_readwrite   10.0.1.0/24     scram-sha-256
-- hostssl all             app_admin       10.0.2.0/24     scram-sha-256 clientcert=verify-full
-- host    all             all             0.0.0.0/0       reject

-- ============================================
-- 8. MONITORING & ALERTING
-- ============================================

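-- Failed logins never reach pg_stat_activity (it lists established sessions only);
-- look for them in the server log. This view instead flags user/client pairs
-- holding more than five idle connections.
CREATE VIEW audit_logs.idle_connection_counts AS
SELECT
    usename,
    client_addr,
    COUNT(*) AS connection_count,
    MAX(backend_start) AS latest_backend_start
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY usename, client_addr
HAVING COUNT(*) > 5;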

-- View for monitoring privilege escalations
CREATE VIEW audit_logs.privilege_grants AS
SELECT
    grantee,
    table_schema,
    table_name,
    privilege_type
FROM information_schema.role_table_grants
WHERE grantee NOT IN ('postgres', 'db_owner')
ORDER BY grantee, table_schema, table_name;

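-- Function to audit current connections
CREATE OR REPLACE FUNCTION audit_current_connections()
RETURNS TABLE (
    username TEXT,
    database TEXT,
    client_addr INET,
    state TEXT,
    query_start TIMESTAMPTZ,  -- pg_stat_activity.query_start is timestamptz
    current_query TEXT
) AS $$
BEGIN
    -- Columns are qualified with the alias "a" because unqualified names
    -- (client_addr, state, query_start) would clash with the output columns
    RETURN QUERY
    SELECT
        a.usename::TEXT,
        a.datname::TEXT,
        a.client_addr,
        a.state::TEXT,
        a.query_start,
        LEFT(a.query, 100)::TEXT
    FROM pg_stat_activity a
    WHERE a.datname IS NOT NULL
    ORDER BY a.query_start DESC;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- SECURITY DEFINER functions are executable by PUBLIC unless revoked
REVOKE ALL ON FUNCTION audit_current_connections() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION audit_current_connections() TO app_admin;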

-- ============================================
-- 9. SECURITY MAINTENANCE QUERIES
-- ============================================

-- Check for users with superuser privileges (should be minimal)
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolsuper = true;

-- Check for tables without RLS when they should have it
SELECT schemaname, tablename, rowsecurity
FROM pg_tables
WHERE schemaname IN ('sensitive_data')
AND rowsecurity = false;

-- Find users with excessive privileges
SELECT grantee, COUNT(*) as privilege_count
FROM information_schema.role_table_grants
WHERE privilege_type IN ('INSERT', 'UPDATE', 'DELETE')
GROUP BY grantee
HAVING COUNT(*) > 10;

-- Check password encryption method
SELECT rolname, rolpassword
FROM pg_authid
WHERE rolpassword NOT LIKE 'SCRAM-SHA-256$%'
AND rolcanlogin = true;

-- ============================================
-- 10. REGULAR SECURITY TASKS (Schedule these)
-- ============================================

-- Rotate passwords regularly (example for one role)
-- ALTER ROLE app_readwrite PASSWORD 'new_strong_password' VALID UNTIL '2027-12-31';

-- Review and revoke unnecessary privileges
-- REVOKE INSERT ON app_data.users FROM app_readonly;

-- Refresh planner statistics so queries keep efficient plans (reduces accidental load from slow queries)
-- ANALYZE;

-- Clean old audit logs (retain per compliance requirements)
-- DELETE FROM audit_logs.access_log WHERE timestamp < NOW() - INTERVAL '90 days';
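
The row-level security policy in section 4 reads `app.current_user_id` from the session, so the application has to set it before querying. A minimal sketch, assuming the connected role has been granted access to the table and using a made-up user id of 42:

```sql
BEGIN;

-- Third argument true = local to this transaction, so the value cannot
-- leak into the next request when connections are pooled
SELECT set_config('app.current_user_id', '42', true);

-- The policy now limits the result to rows where user_id = 42
SELECT id, last_four, created_at
FROM sensitive_data.payment_info;

COMMIT;
```

Scoping the setting to the transaction is a design choice that suits connection pools, where a session-wide value could otherwise be inherited by an unrelated request.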

Security Layers Architecture:

graph TB
    subgraph External
        A[Internet]
    end

    subgraph Network Layer
        B[Firewall]
        C[VPC/Private Network]
        D[Load Balancer with SSL]
    end

    subgraph Application Layer
        E[Application Server]
        F[Connection Pool]
    end

    subgraph Database Layer
        G[pg_hba.conf - IP Whitelist]
        H[SSL/TLS Encryption]
        I[Authentication - SCRAM-SHA-256]
    end

    subgraph Authorization Layer
        J[Role-Based Access Control]
        K[Schema-Level Permissions]
        L[Table-Level Permissions]
        M[Row-Level Security]
    end

    subgraph Data Protection
        N[Encryption at Rest]
        O[Column Encryption]
        P[Backup Encryption]
    end

    subgraph Monitoring
        Q[Audit Logging]
        R[Connection Monitoring]
        S[Query Analysis]
        T[Anomaly Detection]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    K --> L
    L --> M
    M --> N
    M --> O

    N --> P

    J -.Monitor.-> Q
    G -.Monitor.-> R
    F -.Monitor.-> S
    R --> T

Security Checklist:

graph LR
    A[Security Checklist] --> B[Network Security]
    A --> C[Authentication]
    A --> D[Authorization]
    A --> E[Encryption]
    A --> F[Monitoring]

    B --> B1[✓ Firewall configured]
    B --> B2[✓ Private network only]
    B --> B3[✓ pg_hba.conf restricted]

    C --> C1[✓ SCRAM-SHA-256 enabled]
    C --> C2[✓ Strong passwords]
    C --> C3[✓ No default passwords]
    C --> C4[✓ Password expiration]

    D --> D1[✓ Least privilege applied]
    D --> D2[✓ No superuser in apps]
    D --> D3[✓ Schema isolation]
    D --> D4[✓ RLS for sensitive data]

    E --> E1[✓ SSL/TLS enforced]
    E --> E2[✓ Data at rest encrypted]
    E --> E3[✓ Column encryption for PII]
    E --> E4[✓ Encrypted backups]

    F --> F1[✓ pgaudit enabled]
    F --> F2[✓ Failed login alerts]
    F --> F3[✓ Privilege change logs]
    F --> F4[✓ Query monitoring]

PostgreSQL Architecture

Explain the PostgreSQL process architecture (postmaster, backend processes, background workers).

The 30-Second Answer: PostgreSQL uses a multi-process architecture where the postmaster is the main supervisor process that listens for client connections and spawns dedicated backend processes for each connection. Background workers handle maintenance tasks like checkpointing, WAL writing, and autovacuum, ensuring database reliability and performance without blocking user queries.

The 2-Minute Answer (If They Want More): The PostgreSQL architecture is built around three main process types. The postmaster (the top-level postgres process) is the parent process that starts when you launch PostgreSQL. It listens on the configured port (default 5432), manages server initialization, spawns all other processes, and monitors their health. When a client connects, the postmaster forks a new dedicated backend process, and that backend then authenticates the client and serves the connection.

Backend processes are the workhorses - each client connection gets its own dedicated backend process that executes queries, manages transactions, and accesses shared memory and disk. This process-per-connection model provides strong isolation between clients and simplifies crash recovery, though it has higher memory overhead than thread-based architectures.
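
To see this from inside a session, the queries below report the postmaster's start time, the PID of the backend serving the current connection, and how many client backends exist relative to the configured limit. They rely only on built-in functions and views.

```sql
-- When the postmaster (and therefore the server) started
SELECT pg_postmaster_start_time();

-- PID of the backend process dedicated to this connection
SELECT pg_backend_pid();

-- One backend per client connection: compare the count with the limit
SELECT count(*) AS client_backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```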

Background workers run continuously to maintain database health: the checkpointer writes dirty buffers to disk at checkpoints, the WAL writer flushes write-ahead logs, the background writer reduces I/O spikes by writing dirty buffers ahead of demand, and the autovacuum launcher spawns workers to reclaim space and update statistics. In PostgreSQL 14 and earlier, a separate stats collector process gathered performance metrics; PostgreSQL 15 moved statistics into shared memory and removed that process. This division of labor allows PostgreSQL to maintain ACID guarantees while serving concurrent queries efficiently.
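
Each of these processes is governed by a handful of server settings. As a rough sketch, the query below lists some of the standard ones from `pg_settings`; the selection is illustrative rather than exhaustive.

```sql
-- Settings that control the checkpointer, background writer,
-- WAL writer, and autovacuum
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name IN (
    'checkpoint_timeout',
    'max_wal_size',
    'bgwriter_delay',
    'wal_writer_delay',
    'autovacuum',
    'autovacuum_naptime',
    'autovacuum_max_workers'
)
ORDER BY name;
```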

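The process architecture also includes auxiliary processes like the archiver (for WAL archiving), logical replication workers, and parallel query workers. This multi-process design trades some resource overhead for robustness: a crashing backend does not take the postmaster down with it. The postmaster does terminate the remaining sessions and reinitialize shared memory as a precaution, then runs crash recovery and accepts connections again without manual intervention.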
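
Parallel query workers can be observed the same way as other processes. The sketch below shows the limits on dynamically started workers and any parallel workers running right now; the `leader_pid` column assumes PostgreSQL 13 or later.

```sql
-- Limits on dynamically started worker processes
SELECT name, setting
FROM pg_settings
WHERE name IN (
    'max_worker_processes',
    'max_parallel_workers',
    'max_parallel_workers_per_gather'
);

-- Parallel workers currently running, grouped under their leader backend
-- (leader_pid is available in PostgreSQL 13 and later)
SELECT leader_pid, count(*) AS workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker'
GROUP BY leader_pid;
```

The second query usually returns no rows on an idle server, since parallel workers exist only while a parallel plan is executing.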

Code Example:

-- View all PostgreSQL processes and their roles
SELECT
    pid,
    usename,
    application_name,
    backend_type,
    state,
    query_start,
    LEFT(query, 50) as query_preview
FROM pg_stat_activity
ORDER BY backend_type, pid;

-- Check background worker processes
-- (the stats collector was removed in PostgreSQL 15, so it is omitted from this list)
SELECT
    pid,
    backend_type,
    backend_start
FROM pg_stat_activity
WHERE backend_type IN (
    'autovacuum launcher',
    'autovacuum worker',
    'background writer',
    'checkpointer',
    'walwriter',
    'archiver'
);

-- View process hierarchy on Linux/Unix
-- Run in shell: ps aux | grep postgres
-- Output shows postmaster and all child processes

Process Architecture Diagram:

graph TD
    A[Postmaster<br/>Main Supervisor Process] --> B[Backend Process 1<br/>Client Connection 1]
    A --> C[Backend Process 2<br/>Client Connection 2]
    A --> D[Backend Process N<br/>Client Connection N]
    A --> E[Checkpointer<br/>Write dirty buffers to disk]
    A --> F[Background Writer<br/>Reduce I/O spikes]
    A --> G[WAL Writer<br/>Flush WAL buffers]
    A --> H[Autovacuum Launcher<br/>Spawns vacuum workers]
    A --> I[Stats Collector<br/>Gather metrics, PostgreSQL 14 and earlier]
    A --> J[Archiver<br/>Archive WAL segments]

    H --> K[Autovacuum Worker 1]
    H --> L[Autovacuum Worker 2]

    style A fill:#ff9999
    style B fill:#99ccff
    style C fill:#99ccff
    style D fill:#99ccff
    style E fill:#99ff99
    style F fill:#99ff99
    style G fill:#99ff99
    style H fill:#99ff99
    style I fill:#99ff99
    style J fill:#99ff99
