What is BSON and how does it relate to JSON?

BSON (Binary JSON) is a binary-encoded serialization format that MongoDB uses internally to store documents and transmit data. While JSON is text-based and human-readable, BSON extends JSON with additional data types like Date, Binary data, and ObjectId, and is optimized for fast encoding, decoding, and traversal.
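
To make "binary-encoded" concrete, here is a minimal sketch of how a tiny document could be laid out as BSON bytes. It assumes a flat document whose values are only strings or 32-bit integers; `encodeBson` is a hypothetical helper written for illustration, not part of any MongoDB driver, and real applications should rely on the driver's own BSON library.

```javascript
// Toy BSON encoder: flat documents with string and int32 values only.
// Layout: int32 total length, then elements (type byte, key as C-string, value), then 0x00.
function encodeBson(doc) {
  const parts = [];
  for (const [key, value] of Object.entries(doc)) {
    const name = Buffer.from(key + '\0', 'utf8');
    if (typeof value === 'string') {
      const str = Buffer.from(value + '\0', 'utf8');
      const len = Buffer.alloc(4);
      len.writeInt32LE(str.length); // string length includes the trailing null byte
      parts.push(Buffer.from([0x02]), name, len, str); // 0x02 = UTF-8 string
    } else if (Number.isInteger(value) && value >= -(2 ** 31) && value < 2 ** 31) {
      const num = Buffer.alloc(4);
      num.writeInt32LE(value);
      parts.push(Buffer.from([0x10]), name, num); // 0x10 = 32-bit integer
    } else {
      throw new TypeError(`Unsupported value for field "${key}"`);
    }
  }
  const body = Buffer.concat(parts);
  const header = Buffer.alloc(4);
  header.writeInt32LE(body.length + 5); // 4-byte header + body + 1-byte terminator
  return Buffer.concat([header, body, Buffer.from([0x00])]);
}

console.log(encodeBson({ hello: 'world' }).toString('hex'));
// 160000000268656c6c6f0006000000776f726c640000
```

The length prefixes are what make traversal fast: a reader can skip over a value without parsing it, which is not possible with plain JSON text.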

What is a MongoDB document and what is its maximum size?

A MongoDB document is a BSON object consisting of field-value pairs, similar to a JSON object. Documents can contain nested documents, arrays, and various data types, with a maximum size limit of 16 megabytes to ensure efficient performance and prevent excessive RAM usage.

What is a collection in MongoDB?

A collection is a group of MongoDB documents, analogous to a table in relational databases. Collections don't enforce a schema by default, so documents within the same collection can have different fields and structures, though they typically store related data with a similar structure. Optional schema validation rules can be added when stricter guarantees are needed.

What is the _id field and ObjectId in MongoDB?

The _id field is the primary key of every MongoDB document; if you don't supply one, it is generated automatically. ObjectId is the default type for _id: a 12-byte identifier made up of a 4-byte timestamp, a 5-byte random value generated once per process, and a 3-byte incrementing counter (older versions used a machine identifier and process ID in place of the random value). This makes collisions extremely unlikely across distributed systems without any coordination.
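
Because the first four bytes of an ObjectId are a timestamp, the creation time can be recovered from the id alone. The sketch below works on the 24-character hex form and assumes the current layout (4-byte timestamp, 5-byte random value, 3-byte counter); `objectIdParts` is a hypothetical helper for illustration. In real code, the driver's `ObjectId.getTimestamp()` does the timestamp part for you.

```javascript
// Split an ObjectId hex string into its three components.
function objectIdParts(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new TypeError('Expected a 24-character hex string');
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // bytes 0-3: seconds since the Unix epoch
  return {
    createdAt: new Date(seconds * 1000),
    random: hex.slice(8, 18),                    // bytes 4-8: per-process random value
    counter: parseInt(hex.slice(18), 16),        // bytes 9-11: incrementing counter
  };
}

const parts = objectIdParts('507f1f77bcf86cd799439011');
console.log(parts.createdAt.toISOString()); // 2012-10-17T21:13:27.000Z
```

A practical consequence is that sorting by _id roughly sorts by creation time, which is handy for "newest first" queries when no separate timestamp field exists.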

What are indexes in MongoDB and why are they important?

Indexes are special data structures that store a small portion of the collection's data in an easy-to-traverse form, dramatically improving query performance. Without indexes, MongoDB must perform a collection scan examining every document, while indexed queries can locate documents in logarithmic time using B-tree structures.
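
The difference between a collection scan and an index lookup can be felt with a toy model. In this sketch a sorted array searched by binary search stands in for the B-tree; the function names and the `sku` field are made up for illustration, and the real storage engine is considerably more sophisticated.

```javascript
// Collection scan: look at every document until one matches.
function collectionScan(docs, field, value) {
  let examined = 0;
  for (const doc of docs) {
    examined++;
    if (doc[field] === value) return { doc, examined };
  }
  return { doc: null, examined };
}

// "Index": sorted [key, position] pairs, a stand-in for a B-tree.
function buildIndex(docs, field) {
  return docs.map((doc, pos) => [doc[field], pos]).sort((a, b) => a[0] - b[0]);
}

// Index lookup: binary search over the sorted keys, then one document fetch.
function indexLookup(docs, index, value) {
  let lo = 0;
  let hi = index.length - 1;
  let examined = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    examined++;
    const [key, pos] = index[mid];
    if (key === value) return { doc: docs[pos], examined };
    if (key < value) lo = mid + 1;
    else hi = mid - 1;
  }
  return { doc: null, examined };
}

// 100,000 documents with unique, shuffled-looking sku values.
const docs = Array.from({ length: 100000 }, (_, i) => ({ _id: i, sku: (i * 7919) % 100000 }));
const skuIndex = buildIndex(docs, 'sku');
const target = docs[docs.length - 1].sku; // worst case for the scan

const scan = collectionScan(docs, 'sku', target);
const indexed = indexLookup(docs, skuIndex, target);
console.log(`scan examined ${scan.examined} documents`);
console.log(`index lookup examined ${indexed.examined} keys`);
```

The scan has to touch every document in the worst case, while the lookup needs at most about log2(n) comparisons, which is 17 for 100,000 entries.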

What is a covered query in MongoDB?

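A covered query is one where every field in the filter and every field returned by the projection is part of the same index, allowing MongoDB to return results directly from the index without examining any documents. This is typically much faster than fetching documents, because index entries are smaller than full documents and are more likely to already be in memory. Note that _id is returned by default, so it must be excluded from the projection unless it is part of the index.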
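
A quick way to reason about coverage is to compare the index keys against the fields a query filters on and returns. The helper below is a simplified sketch with a hypothetical name (`isCovered`); it ignores real-world conditions such as array (multikey) fields and embedded documents, so treat it as a mental model rather than the server's actual rule set. On a live system, `explain("executionStats")` reporting `totalDocsExamined: 0` is the way to confirm a query is covered.

```javascript
// Simplified coverage check: every filter field and every returned field
// must be an index key, and the projection must be an inclusion projection.
function isCovered(indexKeys, filter, projection) {
  const indexed = new Set(Object.keys(indexKeys));
  const filterCovered = Object.keys(filter).every((f) => indexed.has(f));

  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  if (returned.length === 0) return false; // no inclusion list means whole documents are returned
  // _id comes back by default unless the projection explicitly sets _id: 0
  if (projection._id !== 0 && !returned.includes('_id')) returned.push('_id');

  return filterCovered && returned.every((f) => indexed.has(f));
}

const index = { status: 1, email: 1 };
console.log(isCovered(index, { status: 'active' }, { email: 1, _id: 0 })); // true
console.log(isCovered(index, { status: 'active' }, { email: 1 }));         // false: _id is returned
console.log(isCovered(index, { status: 'active' }, { name: 1, _id: 0 }));  // false: name not indexed
```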

What is embedding vs referencing in MongoDB data modeling?

Embedding stores related data within a single document as nested objects or arrays, providing excellent read performance. Referencing stores relationships using document IDs that point to data in other collections, keeping documents smaller and avoiding duplication but requiring multiple queries or $lookup operations to retrieve related data.

What is the oplog in MongoDB?

The oplog (operations log) is a special capped collection that records all write operations that modify data in a MongoDB replica set. Secondaries copy oplog entries from their sync source (usually the primary) and apply them in order to stay consistent; it's the mechanism that powers replication.
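
The replay idea can be sketched with an in-memory Map playing the role of a secondary. The entry shape below borrows the real oplog's `op`, `o`, and `o2` field names, but it is a deliberately simplified toy format (real entries carry timestamps, namespaces, and version-specific update encodings), and `applyOplog` is a hypothetical function.

```javascript
// Apply simplified oplog-style entries to an in-memory "secondary".
function applyOplog(collection, entries) {
  for (const entry of entries) {
    switch (entry.op) {
      case 'i': // insert: o is the full document
        collection.set(entry.o._id, { ...entry.o });
        break;
      case 'u': { // update: o2 identifies the document, o.$set holds the new values
        const existing = collection.get(entry.o2._id);
        if (existing) Object.assign(existing, entry.o.$set);
        break;
      }
      case 'd': // delete: o identifies the document
        collection.delete(entry.o._id);
        break;
      default:
        throw new Error(`Unknown op: ${entry.op}`);
    }
  }
  return collection;
}

const entries = [
  { op: 'i', o: { _id: 1, qty: 5 } },
  { op: 'u', o2: { _id: 1 }, o: { $set: { qty: 8 } } },
  { op: 'i', o: { _id: 2, qty: 1 } },
  { op: 'd', o: { _id: 2 } },
];

const secondary = applyOplog(new Map(), entries);
console.log([...secondary.values()]); // [ { _id: 1, qty: 8 } ]
```

Each entry records the resulting value rather than the original command (a set to 8, not an increment by 3), so replaying the same entries a second time leaves the data unchanged. That idempotency is what lets a secondary safely re-apply operations after a restart.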

What is sharding in MongoDB and when would you use it?

Sharding is MongoDB's horizontal scaling strategy that distributes data across multiple servers based on a shard key. You use sharding when your dataset exceeds a single server's storage capacity, when you need to distribute read/write operations for better performance, or when you need geographic data distribution for compliance or latency optimization.

What is a shard key and how do you choose one?

A shard key is the indexed field or fields that MongoDB uses to partition data across shards. A good shard key has high cardinality (many unique values), even distribution to avoid hotspots, and alignment with your most common query patterns to minimize cross-shard queries.
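
The hotspot problem is easiest to see with a small simulation. This sketch compares ranged placement against hashed placement for an ever-increasing key such as a timestamp or counter. The shard count, the boundaries, and the use of MD5 as the hash are all assumptions chosen for illustration, and the model ignores the balancer, which in a real cluster would eventually split and migrate chunks.

```javascript
const crypto = require('crypto');

// Ranged placement: each shard owns a contiguous slice of the key space.
function rangedShard(key, boundaries) {
  const idx = boundaries.findIndex((upper) => key < upper);
  return idx === -1 ? boundaries.length : idx;
}

// Hashed placement: hash the key, then bucket the hash (MD5 is only a stand-in).
function hashedShard(key, shardCount) {
  const digest = crypto.createHash('md5').update(String(key)).digest();
  return digest.readUInt32BE(0) % shardCount;
}

function countPerShard(keys, place, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (const key of keys) counts[place(key)]++;
  return counts;
}

// Boundaries fit the existing data (keys below 3000), giving 4 shards ...
const boundaries = [1000, 2000, 3000];
// ... but every new insert carries a larger key than the last.
const newKeys = Array.from({ length: 4000 }, (_, i) => 3000 + i);

const ranged = countPerShard(newKeys, (k) => rangedShard(k, boundaries), 4);
const hashed = countPerShard(newKeys, (k) => hashedShard(k, 4), 4);
console.log('ranged:', ranged); // ranged: [ 0, 0, 0, 4000 ]
console.log('hashed:', hashed); // spread roughly evenly across the 4 shards
```

The trade-off is that hashing spreads writes but scatters neighbouring key values, so range queries on the shard key can no longer be sent to a single shard.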

What are multi-document transactions in MongoDB?

Multi-document transactions allow you to execute multiple operations across multiple documents and collections as a single atomic unit. If any operation fails, the entire transaction is rolled back, ensuring data consistency with ACID guarantees similar to traditional RDBMS transactions.

What is the difference between MongoDB and SQL databases?

MongoDB uses a flexible document model with JSON-like documents and dynamic schemas, while SQL databases use rigid table structures with predefined schemas. MongoDB encourages denormalization and embedding data for read performance, whereas SQL databases normalize data across tables and use JOINs, making MongoDB better suited for applications with evolving data models and horizontal scaling needs.

MongoDB Interview Questions (Free Preview)

CRUD Operations

What is the difference between remove() and deleteOne()/deleteMany()?

The 30-Second Answer: The remove() method is deprecated and has been replaced by deleteOne() and deleteMany(). While remove() could delete one or all matching documents depending on the justOne parameter, the new methods have clearer semantics and consistent return values, and are the recommended approach in modern MongoDB applications.

The 2-Minute Answer (If They Want More): The remove() method was the original deletion method in MongoDB, but it had ambiguous behavior that led to confusion and potential bugs. By default, remove(filter) would delete all matching documents, but passing {justOne: true} as a second parameter would delete only one. This dual behavior made code harder to read and maintain.

MongoDB introduced deleteOne() and deleteMany() to provide explicit, self-documenting code. When you see deleteOne(), you immediately know only one document will be deleted. When you see deleteMany(), you know all matching documents will be deleted. This clarity reduces bugs and makes code reviews easier.

Beyond clarity, the return values are more consistent: both new methods return an object with acknowledged and deletedCount, while remove() had different return formats across drivers and MongoDB versions. Performance is not the main reason to switch; any difference is likely negligible, because deleteOne() and remove() with justOne both stop after the first match.

In production, you should never use remove() in new code. If you're maintaining legacy code that uses remove(), I recommend refactoring to the new methods during your next update cycle. Most MongoDB drivers have deprecated remove() and show warnings, and some no longer ship it at all (the Node.js driver removed it in version 5.0, for example).

Code Example:

// OLD WAY (deprecated) - remove()
// Delete all matching documents (default behavior)
await db.collection('users').remove({ inactive: true });

// Delete only one matching document
await db.collection('users').remove(
  { email: 'test@example.com' },
  { justOne: true }
);

// Delete all documents in collection (dangerous!)
await db.collection('temp').remove({});

// NEW WAY (recommended) - deleteOne() and deleteMany()
// Delete only one matching document (clear intent)
const deleteOneResult = await db.collection('users').deleteOne(
  { email: 'test@example.com' }
);
console.log(`Deleted ${deleteOneResult.deletedCount} document`);

// Delete all matching documents (clear intent)
const deleteManyResult = await db.collection('users').deleteMany(
  { inactive: true }
);
console.log(`Deleted ${deleteManyResult.deletedCount} documents`);

// Delete all documents in collection (clear but dangerous)
await db.collection('temp').deleteMany({});

// Migration example: Converting remove() to new methods
// BEFORE:
// await collection.remove({ status: 'expired' });
// await collection.remove({ _id: id }, { justOne: true });

// AFTER:
await collection.deleteMany({ status: 'expired' });
await collection.deleteOne({ _id: id });

// Consistent return values with new methods
const cutoffDate = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000); // 30 days ago
const result = await db.collection('logs').deleteMany({
  timestamp: { $lt: cutoffDate }
});

console.log(`Operation acknowledged: ${result.acknowledged}`);
console.log(`Deleted count: ${result.deletedCount}`);

// Putting it together: a cleanup job built on deleteMany()
async function cleanupOldData() {
  const cutoffDate = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);

  // Clear and explicit
  const logsDeleted = await db.collection('logs').deleteMany({
    createdAt: { $lt: cutoffDate }
  });

  const sessionsDeleted = await db.collection('sessions').deleteMany({
    expiresAt: { $lt: new Date() }
  });

  return {
    logsDeleted: logsDeleted.deletedCount,
    sessionsDeleted: sessionsDeleted.deletedCount
  };
}

What is the difference between find() and findOne()?

The 30-Second Answer: The find() method returns a cursor to all matching documents and supports chaining methods like sort(), limit(), and skip(), while findOne() returns only the first matching document directly as an object (or null if no match) and doesn't support cursor operations.

The 2-Minute Answer (If They Want More): The find() method returns a cursor, not the actual documents. A cursor is like a pointer to the result set that you can iterate over, and it's lazy: documents are fetched from the server in batches as you iterate rather than all at once. This makes find() memory-efficient even with large result sets. You can chain cursor methods to sort results, limit the number of documents, skip documents for pagination, or apply projections to include/exclude fields.

The findOne() method is designed for retrieving a single document and resolves to the document object itself (or null if nothing matches). It behaves much like find().limit(1) followed by reading the single result, but is more convenient. You cannot chain cursor methods to findOne(), and it doesn't matter if 100 documents match your filter: you'll only get one.

In production, I use findOne() when querying by unique identifiers like _id or email, or when I know I only need one document and don't care about which one if multiple match. I use find() when I need multiple documents, need to apply sorting or pagination, or need cursor-based iteration for large datasets. For performance, findOne() can stop searching after finding the first match, especially if you have an appropriate index.

One important gotcha: findOne() returns null if no document matches, while find() returns an empty cursor. When converting a cursor to an array with toArray(), you get an empty array [] if nothing matches, not null.

Code Example:

// findOne() - Returns a single document object or null
const user = await db.collection('users').findOne(
  { email: 'john@example.com' }
);

if (user) {
  console.log(`Found user: ${user.name}`);
} else {
  console.log('User not found');
}

// findOne() with projection (select specific fields)
const userBasicInfo = await db.collection('users').findOne(
  { _id: userId },
  { projection: { name: 1, email: 1, _id: 0 } }  // Include name and email, exclude _id
);

// find() - Returns a cursor to multiple documents
const cursor = db.collection('products').find(
  { category: 'electronics' }
);

// Iterate through cursor
await cursor.forEach(product => {
  console.log(product.name);
});

// find() with cursor methods chained
const topProducts = await db.collection('products')
  .find({ inStock: true })
  .sort({ rating: -1 })      // Sort by rating descending
  .limit(10)                  // Return only 10 documents
  .skip(0)                    // Skip 0 (useful for pagination)
  .project({ name: 1, price: 1, rating: 1 })  // Select specific fields
  .toArray();                 // Convert cursor to array

// Pagination with find()
const page = 2;
const pageSize = 20;
const paginatedResults = await db.collection('articles')
  .find({ published: true })
  .sort({ publishedDate: -1 })
  .skip((page - 1) * pageSize)
  .limit(pageSize)
  .toArray();

// find() with iteration for large datasets (memory efficient)
const largeCursor = db.collection('logs').find(
  { timestamp: { $gte: startDate } }
);

// Process one document at a time without loading all into memory
for await (const log of largeCursor) {
  await processLog(log);  // Process each log individually
}

// Common mistake: forgetting to call toArray() (or iterate) on the cursor
const wrongProducts = db.collection('products').find({});  // This is a cursor!
console.log(wrongProducts);  // Logs cursor object, not documents

const rightProducts = await db.collection('products').find({}).toArray();  // Correct
console.log(rightProducts);  // Logs array of documents

// Comparing behavior when no documents match
const notFoundOne = await db.collection('users').findOne({ age: 200 });
console.log(notFoundOne);  // null

const notFoundMany = await db.collection('users').find({ age: 200 }).toArray();
console.log(notFoundMany);  // []

MongoDB Fundamentals

What is the difference between MongoDB and other NoSQL databases like Cassandra or Redis?

The 30-Second Answer: MongoDB is a document-oriented database optimized for flexible schemas and complex queries, Cassandra is a wide-column store designed for high write throughput and linear scalability across multiple data centers, and Redis is an in-memory key-value store optimized for sub-millisecond latency and caching. Each excels at different use cases based on their underlying architecture and trade-offs.

The 2-Minute Answer (If They Want More): These NoSQL databases solve different problems and make different trade-offs based on the CAP theorem (Consistency, Availability, Partition tolerance). MongoDB is a CP system that prioritizes consistency and partition tolerance, using a document model that's intuitive for developers and supports rich queries, secondary indexes, and aggregation pipelines. It's ideal for applications needing flexible schemas, complex queries, and strong consistency within a replica set.

Cassandra is an AP system prioritizing availability and partition tolerance, with a design that draws on Amazon's Dynamo and Google's BigTable. It uses a wide-column store model where data is organized into partitions of rows, and it's masterless: all nodes are equal, eliminating single points of failure. Cassandra excels at write-heavy workloads, multi-datacenter deployments, and applications requiring linear scalability and very high availability. The trade-off is eventual consistency by default (consistency levels are tunable per query) and a more restrictive query model (CQL) compared to MongoDB.

Redis is fundamentally different - it's an in-memory data structure store that can function as a database, cache, or message broker. It supports various data structures (strings, hashes, lists, sets, sorted sets, streams) and provides sub-millisecond latency by keeping data in RAM. Redis is primarily used for caching, session storage, real-time analytics, and pub/sub messaging. While it offers persistence options, it's not designed for large datasets that exceed available memory.

The choice between them depends on your requirements: use MongoDB for general-purpose applications needing rich queries and flexible schemas, Cassandra for massive scale write-heavy workloads across multiple data centers, and Redis for caching, real-time features, or when you need specialized data structures with extremely low latency. Many architectures use multiple databases together - for example, MongoDB for primary storage with Redis for caching.

Code Example:

// MongoDB - Rich document queries:
db.users.find({
  "address.city": "San Francisco",
  age: { $gte: 25, $lte: 35 },
  tags: { $in: ["premium", "verified"] }
}).sort({ created: -1 }).limit(10)

// Aggregation pipeline:
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$userId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
  { $limit: 10 }
])

// Cassandra (CQL) - Write-optimized, restricted queries:
// Note: Cassandra requires partition key in queries
CREATE TABLE users_by_city (
  city text,
  user_id timeuuid,  -- timeuuid so rows can be filtered by creation time
  name text,
  age int,
  PRIMARY KEY (city, user_id)
);

INSERT INTO users_by_city (city, user_id, name, age)
VALUES ('San Francisco', now(), 'Alice', 28);

-- Must query by partition key (city):
SELECT * FROM users_by_city
WHERE city = 'San Francisco'
AND user_id > minTimeuuid('2025-01-01');

-- Cassandra excels at time-series data:
CREATE TABLE sensor_data (
  sensor_id int,
  timestamp timestamp,
  temperature float,
  PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

// Redis - In-memory key-value operations:
// Set simple key-value
SET user:1000:name "Alice"
GET user:1000:name

// Hash for structured data:
HSET user:1000 name "Alice" email "alice@example.com" age 28
HGETALL user:1000

// Sorted set for leaderboards:
ZADD leaderboard 1500 "player1" 1750 "player2" 2000 "player3"
ZREVRANGE leaderboard 0 9 WITHSCORES  // Top 10 players

// Caching with expiration:
SETEX session:abc123 3600 '{"userId": 1000, "role": "admin"}'

// Pub/Sub for real-time messaging:
PUBLISH notifications '{"type": "new_message", "from": "Alice"}'

// Redis Streams for event sourcing:
XADD events * user_id 1000 action "purchase" amount 99.99

What is MongoDB and how does it differ from traditional relational databases?

The 30-Second Answer: MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents instead of tables with fixed schemas. Unlike relational databases that use SQL and require predefined schemas with tables and rows, MongoDB allows you to store semi-structured data with varying fields, making it ideal for applications with evolving data models.

The 2-Minute Answer (If They Want More): MongoDB represents a fundamental shift from the relational database paradigm. While traditional RDBMS like MySQL or PostgreSQL organize data into tables with strictly defined columns and rows, MongoDB stores data as documents in collections. These documents are BSON (Binary JSON) objects that can have different structures from one another, even within the same collection.

The key differences impact how you design and scale applications. In relational databases, you normalize data across multiple tables and use foreign keys to establish relationships, then JOIN them at query time. MongoDB encourages denormalization and embedding related data within a single document, which can significantly improve read performance by eliminating the need for expensive JOIN operations. This makes MongoDB particularly well-suited for applications with high read-to-write ratios or where data is naturally hierarchical.

MongoDB also differs in its scaling approach. While relational databases typically scale vertically (more powerful hardware), MongoDB is designed for horizontal scaling through sharding, distributing data across multiple servers. The schema flexibility allows developers to iterate quickly without migrations, though this flexibility requires discipline to avoid data inconsistency issues.

Another significant difference is the query language. MongoDB uses a rich document-based query language with method chaining and aggregation pipelines, rather than SQL. Transactions also have a shorter history: single-document operations have always been atomic, but multi-document ACID transactions only arrived in MongoDB 4.0 for replica sets and 4.2 for sharded clusters, whereas relational databases have had mature transaction systems for decades.

Code Example:

// Relational approach (conceptual):
// Users table: id, name, email
// Orders table: id, user_id, total
// OrderItems table: id, order_id, product, quantity

// MongoDB document approach - embedding related data:
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  name: "John Doe",
  email: "john@example.com",
  orders: [
    {
      orderId: "ORD-001",
      date: ISODate("2025-01-15"),
      total: 299.99,
      items: [
        { product: "Laptop", quantity: 1, price: 299.99 }
      ]
    }
  ],
  preferences: {
    newsletter: true,
    theme: "dark"
  }
}

// Querying in MongoDB:
db.users.findOne({ email: "john@example.com" })

// Equivalent relational query would require JOINs:
// SELECT * FROM users u
// LEFT JOIN orders o ON u.id = o.user_id
// LEFT JOIN order_items oi ON o.id = oi.order_id
// WHERE u.email = 'john@example.com'

MongoDB Drivers and Applications

What is the difference between callback, promise, and async/await patterns with MongoDB?

The 30-Second Answer: For many major versions, MongoDB's Node.js driver accepted error-first callbacks and returned a promise when the callback was omitted; version 5.0 dropped callback support, so current drivers are promise-only. Async/await is syntactic sugar over promises. Callbacks use the error-first pattern and can lead to callback hell, promises chain with .then(), and async/await provides synchronous-looking code. Modern code should use async/await for readability and simpler error handling.

The 2-Minute Answer (If They Want More): Code written against the MongoDB Node.js driver has gone through three styles of async patterns. The earliest code used error-first callbacks, requiring nested callbacks for sequential operations. This led to "callback hell" and made error handling verbose and error-prone.

Promise-returning calls allow method chaining with .then() and .catch(). Promises improved error handling and made sequential operations more readable, but nesting still creeps in when a later step needs values from an earlier one.

Async/await, available natively since Node.js 7.6 and in every LTS release from 8 onward, provides the cleanest syntax by making asynchronous code look synchronous. Under the hood, it's still promises, but with better readability, linear error handling via try/catch, and easier debugging. Modern MongoDB applications should use async/await throughout; legacy callback-based code has to be wrapped or migrated when upgrading to a driver version without callback support.

The performance difference is negligible, since they're different syntax for the same underlying mechanism. The choice is about readability, maintainability, and error handling. Async/await wins on all counts for new code. One caveat: be careful with async/await in loops; use Promise.all() for parallel operations instead of sequential await calls.

Code Example:

const { MongoClient } = require('mongodb');
const client = new MongoClient('mongodb://localhost:27017');

// ========================================
// 1. CALLBACK PATTERN (legacy, avoid; not supported by Node.js driver 5.0+)
// ========================================
function findUserCallback(email, callback) {
  client.connect((err) => {
    if (err) return callback(err);

    const db = client.db('mydb');
    const users = db.collection('users');

    users.findOne({ email }, (err, user) => {
      if (err) return callback(err);

      if (!user) {
        return callback(new Error('User not found'));
      }

      // Get user's orders (nested callback = "callback hell")
      const orders = db.collection('orders');
      orders.find({ userId: user._id }).toArray((err, userOrders) => {
        if (err) return callback(err);

        user.orders = userOrders;
        callback(null, user);
      });
    });
  });
}

// Usage - error handling is verbose and error-prone
findUserCallback('john@example.com', (err, user) => {
  if (err) {
    console.error('Error:', err);
    return;
  }
  console.log('User:', user);
});

// ========================================
// 2. PROMISE PATTERN (.then/.catch)
// ========================================
function findUserPromise(email) {
  return client.connect()
    .then(() => {
      const db = client.db('mydb');
      return db.collection('users').findOne({ email });
    })
    .then(user => {
      if (!user) {
        throw new Error('User not found');
      }

      const db = client.db('mydb');
      return db.collection('orders')
        .find({ userId: user._id })
        .toArray()
        .then(orders => {
          user.orders = orders;
          return user;
        });
    })
    .catch(error => {
      console.error('Error:', error);
      throw error;
    });
}

// Usage - better than callbacks but still nested
findUserPromise('john@example.com')
  .then(user => console.log('User:', user))
  .catch(err => console.error('Error:', err));

// ========================================
// 3. ASYNC/AWAIT PATTERN (modern, recommended)
// ========================================
async function findUserAsync(email) {
  try {
    await client.connect();

    const db = client.db('mydb');
    const users = db.collection('users');
    const orders = db.collection('orders');

    const user = await users.findOne({ email });

    if (!user) {
      throw new Error('User not found');
    }

    // Sequential operations are clear and linear
    const userOrders = await orders.find({ userId: user._id }).toArray();
    user.orders = userOrders;

    return user;

  } catch (error) {
    console.error('Error:', error);
    throw error;
  }
}

// Usage - clean and synchronous-looking
async function main() {
  try {
    const user = await findUserAsync('john@example.com');
    console.log('User:', user);
  } catch (error) {
    console.error('Failed:', error);
  }
}

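// ========================================
// PARALLEL OPERATIONS
// ========================================
// Shared database handle used by the two functions below (assumes the client is connected)
const db = client.db('mydb');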

// WRONG: Sequential awaits (slow)
async function getUserDataSequential(userId) {
  const user = await db.collection('users').findOne({ _id: userId });
  const orders = await db.collection('orders').find({ userId }).toArray();
  const reviews = await db.collection('reviews').find({ userId }).toArray();
  // Takes roughly the sum of the three query times: each await waits for the previous one to finish
  return { user, orders, reviews };
}

// CORRECT: Parallel execution with Promise.all (fast)
async function getUserDataParallel(userId) {
  const [user, orders, reviews] = await Promise.all([
    db.collection('users').findOne({ _id: userId }),
    db.collection('orders').find({ userId }).toArray(),
    db.collection('reviews').find({ userId }).toArray()
  ]);
  // All queries run simultaneously
  return { user, orders, reviews };
}

// ========================================
// ERROR HANDLING COMPARISON
// ========================================

// Async/await: try/catch handles all errors in one place
async function robustOperation() {
  try {
    await client.connect();
    const db = client.db('mydb');

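    const user = await db.collection('users').findOne({ email: 'test@example.com' });
    if (!user) {
      // findOne() resolves to null when nothing matches; fail early instead of reading user._id
      throw new Error('User not found');
    }
    const orders = await db.collection('orders').find({ userId: user._id }).toArray();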

    return { user, orders };

  } catch (error) {
    // Single error handler for all operations
    if (error.code === 11000) {
      console.error('Duplicate key error');
    } else if (error.name === 'MongoNetworkError') {
      console.error('Network error, retry logic here');
    } else {
      console.error('Unexpected error:', error);
    }
    throw error;
  } finally {
    // Cleanup always runs
    await client.close();
  }
}

// ========================================
// REAL-WORLD EXAMPLE: Express API
// ========================================
const express = require('express');
const app = express();

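// Async/await with Express: Express 4 does not catch rejected promises from async handlers,
// so wrap the body in try/catch and forward errors with next(). Express 5 forwards them automatically.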
app.get('/users/:email', async (req, res, next) => {
  try {
    const db = client.db('mydb');
    const user = await db.collection('users').findOne({
      email: req.params.email
    });

    if (!user) {
      return res.status(404).json({ error: 'User not found' });
    }

    res.json(user);

  } catch (error) {
    next(error); // Pass to error handler middleware
  }
});

// Error handler middleware
app.use((error, req, res, next) => {
  console.error('Request failed:', error);
  res.status(500).json({
    error: 'Internal server error',
    message: error.message
  });
});

// ========================================
// MIGRATION HELPER: Callback to Promise
// ========================================
const { promisify } = require('util');

// Convert callback-based function to promise
function legacyCallbackFunction(email, callback) {
  // ... legacy code ...
}

const promisifiedFunction = promisify(legacyCallbackFunction);

// Now can use with async/await
async function usePromisified() {
  const result = await promisifiedFunction('john@example.com');
  console.log(result);
}

MongoDB Atlas and Cloud

What is Atlas Search and how does it work?

The 30-Second Answer: Atlas Search is a full-text search engine built on Apache Lucene that's integrated directly into MongoDB Atlas. It allows me to perform advanced search operations like fuzzy matching, autocomplete, and relevance scoring using the familiar MongoDB aggregation pipeline syntax.

The 2-Minute Answer (If They Want More):

Atlas Search provides sophisticated text search capabilities without requiring a separate search infrastructure like Elasticsearch. Once I define a search index, Atlas maintains it for me, syncing it with my MongoDB data in near real-time. Under the hood, it uses Apache Lucene for indexing and searching, but exposes the functionality through MongoDB's native query language.

When I create a search index, I define which fields to index and how to analyze them (tokenization, stemming, language-specific analyzers). Atlas Search then maintains this index automatically as documents are inserted, updated, or deleted. The search functionality is accessed through the $search aggregation stage, which integrates seamlessly with other pipeline stages for filtering, sorting, and transforming results.

In production applications, I use Atlas Search for features like product search with autocomplete, content discovery with fuzzy matching to handle typos, and relevance-based ranking using custom scoring functions. It supports advanced features like synonyms, highlighting matching terms, faceted search for filtering results, and geo-spatial search combined with text search.

The key advantage over traditional text indexes in MongoDB is the richness of search features: I can perform phrase matching, proximity searches, wildcard searches, and combine multiple search criteria with boolean operators. The relevance scoring helps surface the most appropriate results, and I can customize scoring weights based on field importance or recency.

Code Example:

// Define a search index (done in Atlas UI or via API)
{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "lucene.standard"
      },
      "description": {
        "type": "string",
        "analyzer": "lucene.english"
      },
      "tags": {
        "type": "string"
      }
    }
  }
}

// Use $search in aggregation pipeline
db.products.aggregate([
  {
    $search: {
      "index": "product_search_index",
      "compound": {
        "should": [
          {
            "text": {
              "query": "wireless headphones",
              "path": "title",
              "score": { "boost": { "value": 3 } }
            }
          },
          {
            "text": {
              "query": "wireless headphones",
              "path": "description",
              "fuzzy": { "maxEdits": 1 }
            }
          }
        ]
      }
    }
  },
  {
    $project: {
      title: 1,
      description: 1,
      score: { $meta: "searchScore" }
    }
  },
  {
    $limit: 10
  }
]);

// Autocomplete example (assumes "title" is indexed with the autocomplete field type in autocomplete_index)
db.products.aggregate([
  {
    $search: {
      "index": "autocomplete_index",
      "autocomplete": {
        "query": "wire",
        "path": "title",
        "tokenOrder": "sequential",
        "fuzzy": { "maxEdits": 1 }
      }
    }
  },
  {
    $project: {
      title: 1,
      _id: 0
    }
  },
  {
    $limit: 5
  }
]);

Advanced Topics

What is the difference between mongod and mongos?

The 30-Second Answer: mongod is the core MongoDB database server process that stores and manages data, while mongos is a routing service for sharded clusters that directs client requests to the appropriate shards. In a standalone or replica set deployment, you only run mongod processes; mongos only comes into play when you implement sharding.

The 2-Minute Answer (If They Want More): mongod is the primary daemon process for MongoDB - it handles data requests, manages background operations, and maintains data files on disk. When you run a standalone MongoDB instance or set up a replica set, you're running mongod processes. Each mongod instance manages a single database server with its own data directory, listens for client connections, and performs all database operations.

mongos, on the other hand, is a query router that sits in front of sharded clusters. It doesn't store any data itself - instead, it routes queries to the appropriate shards based on the shard key, aggregates results from multiple shards, and presents a unified interface to clients. Applications connect to mongos instances exactly like they would connect to mongod, but mongos handles the complexity of distributing operations across shards.
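
The routing decision described above can be sketched in plain JavaScript. This is a toy model under stated assumptions, not how mongos is implemented: `toyHash`, `routeQuery`, and the shard names are hypothetical, and a real mongos routes by chunk ranges recorded on the config servers rather than a simple modulo.

```javascript
// Toy model of mongos routing. Everything here is illustrative:
// a real mongos consults chunk metadata held by the config servers.
const SHARDS = ["shard1", "shard2", "shard3"];
const SHARD_KEY = "userId";

// Hypothetical stand-in for a hash function (NOT MongoDB's hashed index function).
function toyHash(value) {
  const text = String(value);
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash;
}

// Decide which shards must receive a query with the given filter.
// Only equality on the shard key is modeled here.
function routeQuery(filter) {
  if (Object.prototype.hasOwnProperty.call(filter, SHARD_KEY)) {
    // Shard key present: exactly one shard can hold the matching documents.
    const index = toyHash(filter[SHARD_KEY]) % SHARDS.length;
    return { type: "targeted", shards: [SHARDS[index]] };
  }
  // Shard key absent: any shard might hold matches, so ask all of them.
  return { type: "scatter-gather", shards: [...SHARDS] };
}

const targeted = routeQuery({ userId: 12345 });
const broadcast = routeQuery({ name: "Alice" });
console.log(targeted.type, targeted.shards);
console.log(broadcast.type, broadcast.shards);
```

The point of the sketch is the branch: a filter that names the shard key resolves to one shard, and anything else fans out to every shard.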

In a typical sharded cluster architecture, you have multiple mongod processes running as shards (each potentially part of a replica set), multiple config servers (also mongod processes storing cluster metadata), and one or more mongos routers. The mongos processes consult the config servers to understand the cluster topology and data distribution, then route queries accordingly.

From an operational perspective, mongod requires disk storage and performs memory-intensive operations, while mongos is relatively lightweight since it's just routing. You typically run mongos on application servers or dedicated routing servers, and you can scale mongos instances independently of your data shards. For development or small deployments, you work directly with mongod; mongos becomes necessary when you need horizontal scaling through sharding.

Code Example:

# Starting mongod (standalone instance)
mongod --dbpath /data/db --port 27017

# Starting mongod with replica set
mongod --replSet rs0 --dbpath /data/db --port 27017

# Starting mongod as a config server for sharded cluster
mongod --configsvr --replSet configRS --dbpath /data/configdb --port 27019

# Starting mongod as a shard server
mongod --shardsvr --replSet shard1 --dbpath /data/shard1 --port 27018

# Starting mongos (query router)
mongos --configdb configRS/config1:27019,config2:27019,config3:27019 --port 27017

# mongos does NOT use --dbpath (it doesn't store data)
# mongos REQUIRES --configdb (to know cluster topology)

# Configuration file examples

# mongod.conf (standalone or replica set member)

storage:
  dbPath: /data/db
  journal:
    enabled: true  # Remove on MongoDB 6.1+: journaling is always on and this option no longer exists
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4

systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true

net:
  port: 27017
  bindIp: 0.0.0.0

replication:
  replSetName: rs0

# For shard server, add:
sharding:
  clusterRole: shardsvr

# mongos.conf (query router)
systemLog:
  destination: file
  path: /var/log/mongodb/mongos.log
  logAppend: true

net:
  port: 27017
  bindIp: 0.0.0.0

sharding:
  configDB: configRS/config1:27019,config2:27019,config3:27019

# Note: No storage section - mongos doesn't store data

// Connecting to mongod (standalone or replica set)
const { MongoClient } = require('mongodb');

// Direct connection to mongod
const clientDirect = new MongoClient('mongodb://localhost:27017');

// Replica set connection (connects to mongod instances)
const clientRS = new MongoClient('mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0');

// Connecting to sharded cluster via mongos
const clientSharded = new MongoClient('mongodb://mongos1:27017,mongos2:27017,mongos3:27017');

// From application perspective, usage is identical
// mongos handles all routing internally
const users = await clientSharded.db('mydb').collection('users').find({}).toArray();

# Sharded cluster setup example

# 1. Start config servers (mongod with --configsvr)
mongod --configsvr --replSet configRS --port 27019 --dbpath /data/config1
mongod --configsvr --replSet configRS --port 27020 --dbpath /data/config2
mongod --configsvr --replSet configRS --port 27021 --dbpath /data/config3

# Initialize config server replica set
mongosh --port 27019
rs.initiate({
  _id: "configRS",
  configsvr: true,
  members: [
    { _id: 0, host: "localhost:27019" },
    { _id: 1, host: "localhost:27020" },
    { _id: 2, host: "localhost:27021" }
  ]
})

# 2. Start shard servers (mongod with --shardsvr)
# Shard 1
mongod --shardsvr --replSet shard1 --port 27018 --dbpath /data/shard1-1
mongod --shardsvr --replSet shard1 --port 28018 --dbpath /data/shard1-2

# Shard 2
mongod --shardsvr --replSet shard2 --port 27028 --dbpath /data/shard2-1
mongod --shardsvr --replSet shard2 --port 28028 --dbpath /data/shard2-2

# Initialize each shard replica set (shard1 shown; repeat for shard2 on port 27028)
mongosh --port 27018
rs.initiate({
  _id: "shard1",
  members: [
    { _id: 0, host: "localhost:27018" },
    { _id: 1, host: "localhost:28018" }
  ]
})

# 3. Start mongos routers
mongos --configdb configRS/localhost:27019,localhost:27020,localhost:27021 --port 27017
mongos --configdb configRS/localhost:27019,localhost:27020,localhost:27021 --port 27117

# 4. Connect to mongos and add shards
mongosh --port 27017
sh.addShard("shard1/localhost:27018,localhost:28018")
sh.addShard("shard2/localhost:27028,localhost:28028")

# 5. Enable sharding on database and collection
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { userId: "hashed" })

# Check cluster status
sh.status()

# mongos routing example
# When you insert/query via mongos:
db.users.insertOne({ userId: 12345, name: "Alice" })
# mongos:
# 1. Hashes userId to determine target shard
# 2. Routes insert to appropriate mongod (shard)
# 3. Returns result to client

# Query routing
db.users.find({ userId: 12345 })
# mongos routes to specific shard (targeted query)

db.users.find({ name: "Alice" })
# mongos broadcasts to all shards (scatter-gather query)

Sharding

What is scatter-gather and how does it affect performance?

The 30-Second Answer: Scatter-gather is when mongos sends a query to all shards (scatter), waits for responses, and merges results (gather) because the query lacks the shard key. Latency is gated by the slowest shard, total work and network traffic grow with the number of shards, and merging large result sets can exhaust mongos memory, which makes it a common performance bottleneck.

The 2-Minute Answer (If They Want More): Scatter-gather occurs when mongos cannot determine which shard(s) hold the relevant data, so it broadcasts the query to all shards. Each shard executes the query independently and returns results to mongos, which merges them and returns to the client. The latency is roughly: max(shard1_latency, shard2_latency, ..., shardN_latency) + merge_time, meaning your slowest shard determines overall latency.
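
As a rough illustration of the latency formula above, here is a minimal sketch. The shard latencies and merge time are made-up numbers, and `scatterGatherLatency` is a hypothetical helper rather than anything MongoDB exposes.

```javascript
// Rough latency model for a scatter-gather query, following the formula above.
// Shards run in parallel, so the slowest one gates the response.
function scatterGatherLatency(shardLatenciesMs, mergeTimeMs) {
  return Math.max(...shardLatenciesMs) + mergeTimeMs;
}

const latencies = [42, 50, 47, 180]; // hypothetical per-shard latencies in ms, one slow shard
const total = scatterGatherLatency(latencies, 5);
console.log(`estimated latency: ${total}ms`); // 180 + 5 = 185
```

Three of the four shards answer in about 50ms, but the client still waits for the 180ms straggler.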

The performance impact compounds with cluster size. Because shards execute in parallel, a query that takes 50ms on a single shard still takes roughly 50ms on a 2-shard or a 10-shard cluster, but merge overhead grows with the number of shards and the volume of results. If each shard returns 10,000 documents, mongos must merge 100,000 documents for a 10-shard cluster. I've seen mongos processes OOM (out of memory) when scatter-gather queries returned millions of documents.

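Sort operations are especially problematic. A query like db.orders.find({}).sort({ order_date: -1 }).limit(100) without the shard key has to run on every shard. If the sort and limit were applied only on mongos, it would have to fetch ALL documents from ALL shards (potentially millions) and sort them in memory, which would be catastrophic for performance. The optimization is to push the sort and limit down to the individual shards first: each shard returns its top 100, then mongos only merges N×100 documents. Current MongoDB versions apply this push-down to both find() and aggregation pipelines, but every shard still pays for its own sort, so an index that supports the sort matters.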
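
The push-down of sort and limit can be sketched with plain arrays standing in for shards. This is an illustration only: `shardTopN` and `mergeTopN` are hypothetical helpers, and `order_date` is a plain number so the example stays self-contained.

```javascript
// Work done on each shard: sort locally and return only the top `limit`.
function shardTopN(shardDocs, limit) {
  return [...shardDocs].sort((a, b) => b.order_date - a.order_date).slice(0, limit);
}

// Work done on the router: merge at most shardCount * limit candidates.
function mergeTopN(perShardResults, limit) {
  const candidates = perShardResults.flat();
  const merged = candidates.sort((a, b) => b.order_date - a.order_date).slice(0, limit);
  return { merged, candidatesMerged: candidates.length };
}

// Hypothetical data: three shards holding three orders each.
const shards = [
  [{ id: "a", order_date: 5 }, { id: "b", order_date: 9 }, { id: "c", order_date: 1 }],
  [{ id: "d", order_date: 8 }, { id: "e", order_date: 2 }, { id: "f", order_date: 7 }],
  [{ id: "g", order_date: 3 }, { id: "h", order_date: 6 }, { id: "i", order_date: 4 }],
];

const limit = 2;
const { merged, candidatesMerged } = mergeTopN(shards.map((s) => shardTopN(s, limit)), limit);
console.log(merged.map((d) => d.id)); // global top 2 by order_date
console.log(candidatesMerged);        // 3 shards * limit 2 = 6 candidates, not all 9 documents
```

The router handles shardCount × limit documents regardless of how many documents each shard stores, which is why the push-down matters.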

In production, I minimize scatter-gather in several ways: include the shard key in all queries when possible; use covered queries with indexes so shards return less data; use aggregation pipelines that push filtering stages to shards; implement application-level sharding awareness (query known shards for multi-tenant apps); and for unavoidable scatter-gather queries like admin dashboards, cache results aggressively or run them async. I also monitor scatter-gather queries using the profiler and set up alerts when they exceed latency thresholds.

Code Example:

// Shard key: { customer_id: 1 }

// SCATTER-GATHER: No shard key in query
const query1 = db.orders.find({ status: "pending" }).explain("executionStats");
// query1.executionStats.totalDocsExamined - summed across all shards
// query1.executionStats.executionTimeMillis - reflects scatter-gather latency

// SCATTER-GATHER WITH SORT: Very expensive
db.orders.find({ status: "pending" })
  .sort({ order_date: -1 })
  .limit(100)
  .explain("executionStats");
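// Each shard must find and sort its own pending orders before returning its top 100;
// without a supporting index (e.g. { status: 1, order_date: -1 }) that is a blocking sort on every shard

// EQUIVALENT AGGREGATION PIPELINE: explain() shows the shard/merge split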
db.orders.aggregate([
  { $match: { status: "pending" } },
  { $sort: { order_date: -1 } },
  { $limit: 100 }
]).explain();
// $sort and $limit pushed to each shard
// Each shard returns top 100
// mongos merges (shardCount × 100) documents

// TARGETED: Include shard key
db.orders.find({
  customer_id: "C123",
  status: "pending"
}).explain("executionStats");
// Only 1 shard queried - much faster

// Monitor scatter-gather impact
db.currentOp({
  "command.filter.customer_id": { $exists: false },
  ns: "mydb.orders",
  waitingForLock: false
});

// Profile slow queries (run against each shard's mongod; the profiler is not available through mongos)
db.setProfilingLevel(1, { slowms: 100 });
db.system.profile.find({
  ns: "mydb.orders",
  millis: { $gt: 100 }
}).sort({ millis: -1 }).forEach(doc => {
  print(`Query: ${JSON.stringify(doc.command)}`);
  print(`Millis: ${doc.millis}`);
  print(`Docs examined: ${doc.docsExamined}`);
  print(`---`);
});

// Application-level optimization: shard-aware queries
// For multi-tenant app sharded by tenant_id
const tenantId = getCurrentTenantId();
db.orders.find({
  tenant_id: tenantId, // Always include shard key
  status: "pending"
});

// Use $facet to combine multiple scatter-gather queries
db.orders.aggregate([
  {
    $facet: {
      pendingOrders: [
        { $match: { status: "pending" } },
        { $sort: { order_date: -1 } },
        { $limit: 10 }
      ],
      recentOrders: [
        { $match: { order_date: { $gte: ISODate("2024-01-01") } } },
        { $sort: { order_date: -1 } },
        { $limit: 10 }
      ]
    }
  }
]);
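// One round trip instead of two, but $facet sub-pipelines cannot use indexes,
// so benchmark this against two separate indexed queries before adopting it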

// Set maxTimeMS to prevent runaway scatter-gather queries
db.orders.find({ status: "pending" })
  .maxTimeMS(5000) // Abort after 5 seconds
  .toArray();

Transactions

What is the difference between read concern levels (local, majority, snapshot, linearizable)?

The 30-Second Answer: Read concern levels define the consistency guarantees for read operations. local returns the most recent data (fastest, but might be rolled back), majority returns only data acknowledged by a majority of nodes (prevents rollback reads), snapshot provides point-in-time consistency for multi-document reads in transactions, and linearizable provides the strongest guarantee that reads reflect all prior majority-acknowledged writes.

The 2-Minute Answer (If They Want More): Each read concern level offers different trade-offs between consistency and performance. Understanding these differences is crucial for building reliable distributed applications.

local is the default and fastest option. It returns the most recent data available on the node being queried, regardless of whether that data has been replicated. The risk is that if the primary fails, this data might be rolled back, meaning your application could have read data that effectively "never existed" from the cluster's perspective.

majority ensures you only read data that has been acknowledged by a majority of replica set members. This prevents reading data that could be rolled back. In practice, I use this for any operation where consistency matters more than low latency - user account balances, order statuses, or any data where reading stale or potentially rolled-back data would cause problems.
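
To make the difference between local and majority concrete, here is a toy simulation of a three-member replica set. All names and numbers are hypothetical, and this is not how mongod tracks its commit point internally; it only models "what each member has applied" versus "what a majority has applied".

```javascript
// Toy replica set: each member records how many oplog entries it has applied.
const oplog = [
  { op: 1, balance: 100 },
  { op: 2, balance: 80 },
  { op: 3, balance: 30 }, // written on the primary, not yet replicated anywhere
];
const appliedThrough = { primary: 3, secondaryA: 2, secondaryB: 1 };

// Highest op number that a majority of members have applied.
function majorityCommitPoint(applied) {
  const sorted = Object.values(applied).sort((a, b) => b - a);
  const majority = Math.floor(sorted.length / 2) + 1;
  return sorted[majority - 1];
}

// Return the balance visible to a read on `member` at the given read concern level.
function read(member, level) {
  const visibleThrough =
    level === "majority"
      ? Math.min(appliedThrough[member], majorityCommitPoint(appliedThrough))
      : appliedThrough[member]; // "local": whatever this member has applied
  return oplog[visibleThrough - 1].balance;
}

console.log(read("primary", "local"));    // 30: sees op 3, which could still roll back
console.log(read("primary", "majority")); // 80: sees op 2, the majority-committed value
```

If the primary failed right now, op 3 would be rolled back, so the local read returned a balance that the cluster never durably held.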

snapshot is specifically designed for multi-document transactions. It provides a point-in-time view of data, ensuring all reads within a transaction see a consistent snapshot. This prevents anomalies like non-repeatable reads or phantom reads. Without snapshot isolation, a transaction might see different versions of documents if they're being modified by other operations.
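
The idea of a point-in-time view can be sketched with a tiny multi-version store. This is a conceptual model under simplifying assumptions (a single logical clock, no conflicts or aborts); `VersionedStore` is a hypothetical class, not a MongoDB API.

```javascript
// Toy multi-version store: every write appends a new version tagged with a logical timestamp.
class VersionedStore {
  constructor() {
    this.clock = 0;
    this.versions = new Map();
  }

  write(key, value) {
    this.clock += 1;
    if (!this.versions.has(key)) this.versions.set(key, []);
    this.versions.get(key).push({ ts: this.clock, value });
  }

  // Read the newest version whose timestamp is <= asOf.
  read(key, asOf = this.clock) {
    const history = this.versions.get(key) ?? [];
    for (let i = history.length - 1; i >= 0; i--) {
      if (history[i].ts <= asOf) return history[i].value;
    }
    return undefined;
  }

  // Pin the current timestamp; all reads through the snapshot use it.
  snapshot() {
    const asOf = this.clock;
    return { read: (key) => this.read(key, asOf) };
  }
}

const store = new VersionedStore();
store.write("stock", 10);

const txn = store.snapshot();
const first = txn.read("stock");
store.write("stock", 7);            // concurrent write from another client
const second = txn.read("stock");   // still the pinned version
const latest = store.read("stock"); // a fresh read sees the new value
console.log({ first, second, latest });
```

Both reads inside the snapshot agree even though the value changed in between, which is the repeatable-read behavior the paragraph describes.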

linearizable provides the strongest guarantee: reads reflect all majority-acknowledged writes that completed before the read started. It applies only to queries that uniquely identify a single document and must run against the primary, which confirms with a majority of the replica set that it is still primary before responding. I rarely use this because it has significant performance overhead (and it should be paired with maxTimeMS in case a majority is unreachable), but it's valuable when you need absolute certainty about ordering - for example, reading a configuration value that was just updated and must be consistent across all operations.

Code Example:

const collection = db.collection('inventory');

// LOCAL: Fastest, but might read data that gets rolled back
// Good for: Analytics, dashboards, non-critical data
// (read concern is passed as a find option; the Node.js cursor has no readConcern() method)
const localRead = await collection.find(
  { category: 'electronics' },
  { readConcern: { level: 'local' } }
).toArray();

// MAJORITY: Only reads data acknowledged by majority of nodes
// Good for: Financial data, user accounts, critical business data
const majorityRead = await collection.find(
  { status: 'in-stock' },
  { readConcern: { level: 'majority' } }
).toArray();

// SNAPSHOT: Point-in-time consistency for transactions
// Good for: Multi-document operations requiring consistency
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    // All reads see the same snapshot
    const product = await collection.findOne(
      { _id: 'product-123' },
      { session }
    );

    const relatedProducts = await collection.find(
      { category: product.category },
      { session }
    ).toArray();

    // Both reads see consistent data from the same point in time
  }, {
    readConcern: { level: 'snapshot' }
  });
} finally {
  await session.endSession();
}

// LINEARIZABLE: Strongest consistency, single document only
// Good for: Configuration values, leader election, critical flags
const configDoc = await collection.findOne(
  { _id: 'app-config' },
  {
    readConcern: { level: 'linearizable' },
    // Linearizable only works on primary
    readPreference: 'primary',
    // Bound the wait in case a majority of members is unreachable
    maxTimeMS: 10000
  }
);

// Comparison example: Demonstrating the difference
async function demonstrateReadConcerns() {
  // Write with majority concern
  await collection.updateOne(
    { _id: 'doc-1' },
    { $set: { value: 100, timestamp: new Date() } },
    { writeConcern: { w: 'majority' } }
  );

  // Local might see newer uncommitted writes
  const local = await collection.findOne(
    { _id: 'doc-1' },
    { readConcern: { level: 'local' } }
  );

  // Majority only sees majority-committed data
  const majority = await collection.findOne(
    { _id: 'doc-1' },
    { readConcern: { level: 'majority' } }
  );

  // These might differ in edge cases (network partition, replication lag)
  console.log('Local:', local);
  console.log('Majority:', majority);
}

Indexing

What is the impact of indexes on write performance?

The 30-Second Answer: Indexes slow down write operations because MongoDB must update every relevant index whenever documents are inserted, updated, or deleted. Each additional index adds overhead, with multikey indexes on large arrays being particularly expensive. The trade-off is faster reads versus slower writes.

The 2-Minute Answer (If They Want More): Every index on a collection creates write overhead because MongoDB maintains indexes synchronously with document changes. For an insert operation, MongoDB must write the document and create entries in all indexes. For updates, if indexed fields change, MongoDB must remove old index entries and insert new ones. For deletes, all index entries must be removed.

The overhead varies by index type. Simple single-field indexes add minimal overhead - just one B-tree insertion per write. Compound indexes are slightly more expensive but still manageable. Multikey indexes on array fields can be very expensive because updating a single array might require adding or removing many index entries (one per array element).
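
A back-of-the-envelope way to see this is to count index entries per insert. The model below is deliberately simplified (one entry per index, or one per array element for a multikey index) and `indexEntriesForInsert` is a hypothetical helper, not a MongoDB function.

```javascript
// Count how many index entries a single insert has to create.
// Simplified model: one entry per index, except that an index containing an
// array field needs roughly one entry per array element (multikey).
function indexEntriesForInsert(doc, indexes) {
  return indexes.map((fields) => {
    const arrayField = fields.find((f) => Array.isArray(doc[f]));
    const entries = arrayField ? Math.max(1, doc[arrayField].length) : 1;
    return { index: fields.join("_"), entries };
  });
}

// Hypothetical document and index definitions.
const doc = {
  name: "Product 1",
  category: "Category 1",
  price: 19.99,
  tags: ["wireless", "audio", "sale", "new"],
};
const indexes = [["category"], ["category", "price"], ["tags"]];

const perIndex = indexEntriesForInsert(doc, indexes);
const totalEntries = perIndex.reduce((sum, i) => sum + i.entries, 0);
console.log(perIndex);
console.log(`index entries written: ${totalEntries}`); // 1 + 1 + 4 = 6
```

The single-field and compound indexes cost one entry each, while the multikey index on tags alone accounts for four of the six entries.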

Index overhead also affects memory and disk I/O. Index updates must be written to disk and loaded into memory, potentially evicting useful data from the cache. During bulk write operations on heavily indexed collections, index maintenance can become the bottleneck, sometimes consuming more time than writing the actual documents.

In production, I balance read and write performance by strategically choosing which indexes to create. For write-heavy workloads (logging, time-series data, high-throughput ingestion), I minimize indexes and sometimes use partial indexes to reduce the indexed document set. For read-heavy workloads, I'm more liberal with indexes since the read performance gains outweigh write costs. I also monitor index usage with $indexStats to identify and remove unused indexes.

Code Example:

// Measure write performance without indexes
db.products.drop()

// Insert 10000 documents without indexes
const startNoIndex = Date.now()
for (let i = 0; i < 10000; i++) {
  db.products.insertOne({
    name: `Product ${i}`,
    category: `Category ${i % 10}`,
    price: Math.random() * 100,
    tags: [`tag${i % 5}`, `tag${i % 7}`],
    inStock: i % 2 === 0
  })
}
const timeNoIndex = Date.now() - startNoIndex
print(`No indexes: ${timeNoIndex}ms`)

// Create multiple indexes
db.products.drop()
db.products.createIndex({ category: 1 })
db.products.createIndex({ price: 1 })
db.products.createIndex({ tags: 1 })  // Multikey index
db.products.createIndex({ category: 1, price: 1 })  // Compound
db.products.createIndex({ inStock: 1 })

// Insert same 10000 documents with indexes
const startWithIndexes = Date.now()
for (let i = 0; i < 10000; i++) {
  db.products.insertOne({
    name: `Product ${i}`,
    category: `Category ${i % 10}`,
    price: Math.random() * 100,
    tags: [`tag${i % 5}`, `tag${i % 7}`],
    inStock: i % 2 === 0
  })
}
const timeWithIndexes = Date.now() - startWithIndexes
print(`With 5 indexes: ${timeWithIndexes}ms`)
print(`Slowdown: ${(timeWithIndexes / timeNoIndex).toFixed(2)}x`)

// Update performance impact
// (uses the { category: 1 } and { category: 1, price: 1 } indexes created above)

// Update that changes indexed field
db.products.updateMany(
  { category: "Category 1" },
  { $set: { category: "Category 99" } }
)
// Must update entries in every index that includes category, for all matching documents

// Update that doesn't change indexed fields
db.products.updateMany(
  { category: "Category 2" },
  { $set: { description: "New description" } }
)
// No index entries change - faster

// Multikey index write overhead
db.posts.createIndex({ tags: 1 })

// Adding tag to array - creates new index entry
db.posts.updateOne(
  { _id: 1 },
  { $push: { tags: "newtag" } }
)
// Must insert one new index entry

// Replacing entire array - expensive!
db.posts.updateOne(
  { _id: 1 },
  { $set: { tags: ["tag1", "tag2", "tag3", "tag4", "tag5"] } }
)
// Must remove old index entries and create 5 new ones

// Bulk write operations
// Unordered bulk write continues on errors
const bulk = db.products.initializeUnorderedBulkOp()
for (let i = 0; i < 1000; i++) {
  bulk.insert({ name: `Product ${i}`, category: `Cat ${i % 10}` })
}
bulk.execute()
// One round trip per batch instead of one per document; index entries are still written for every insert

// Unique index write failures
db.users.createIndex({ email: 1 }, { unique: true })

try {
  db.users.insertOne({ email: "duplicate@example.com" })
  db.users.insertOne({ email: "duplicate@example.com" })
} catch (e) {
  print("Duplicate key error: " + e.message)
}
// Second insert fails due to unique constraint

// Partial index reduces write overhead
// Option A: full index covering every document
db.logs.createIndex({ userId: 1, timestamp: -1 })

// Option B (instead of A): partial index on active logs only (much smaller)
db.logs.createIndex(
  { userId: 1, timestamp: -1 },
  {
    partialFilterExpression: { status: "active" },
    name: "active_logs_index"
  }
)
// Inserts with status != "active" don't update this index (faster)

// TTL index cleanup overhead
db.sessions.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 3600 }
)
// Background thread periodically deletes expired documents
// Deletion removes documents AND their index entries

// Monitor write performance
db.products.insertOne({ name: "Test" })
db.serverStatus().opcounters
// Shows insert/update/delete counts

// Profile slow writes
db.setProfilingLevel(1, { slowms: 100 })
db.system.profile.find({
  op: "insert",
  millis: { $gt: 100 }
}).sort({ ts: -1 }).limit(5)

// Identify unused indexes (candidates for removal)
db.products.aggregate([{ $indexStats: {} }])
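// Check 'accesses.ops' - counters reset on restart and are per node, so an index with
// 0 accesses is only a removal candidate after checking every replica set member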

Performance and Optimization

What is the impact of document size on performance?

The 30-Second Answer: Larger documents consume more memory and network bandwidth, reduce the number of documents that fit in cache, and increase page fault likelihood. MongoDB has a 16MB document limit, but I keep documents well below this, typically under 1-2MB, to optimize cache efficiency, reduce network overhead, and improve query performance.

The 2-Minute Answer (If They Want More): Document size directly impacts MongoDB performance across multiple dimensions. First, memory and cache efficiency: WiredTiger cache stores documents in memory, and larger documents mean fewer fit in cache. If your average document is 100KB vs 10KB, you can cache 10x fewer documents in the same memory, increasing cache misses and disk reads. This effect compounds when documents are frequently updated, as the entire document must be rewritten on disk.
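
The cache arithmetic above is easy to check. The sketch below is a rough estimate only: the 4GB cache size is an assumed figure, `docsThatFitInCache` is a hypothetical helper, and the model ignores indexes and internal overhead, so treat the result as an upper bound.

```javascript
// Back-of-the-envelope estimate of how many documents fit in cache.
function docsThatFitInCache(cacheBytes, avgDocBytes) {
  return Math.floor(cacheBytes / avgDocBytes);
}

const KB = 1024;
const GB = 1024 ** 3;
const cacheBytes = 4 * GB; // hypothetical cache size

const smallDocs = docsThatFitInCache(cacheBytes, 10 * KB);
const largeDocs = docsThatFitInCache(cacheBytes, 100 * KB);
console.log({ smallDocs, largeDocs, ratio: smallDocs / largeDocs });
```

With these assumed numbers the same cache holds about 419,000 documents at 10KB each but only about 42,000 at 100KB each, the 10x difference described above.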

Network bandwidth and latency are significantly affected by document size. Transferring a 5MB document takes longer than 50KB, especially in distributed systems or when clients are geographically distant from databases. This latency multiplies across query result sets: returning 100 large documents can saturate network connections and cause timeout issues. Using projection to return only needed fields from large documents is critical.
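
To get a feel for what projection saves, here is a small sketch that compares payload sizes. It uses JSON text length as a stand-in for BSON size (the two differ), and `approxBytes`, `project`, and the sample product are all hypothetical.

```javascript
// Approximate payload size using JSON text length (BSON sizes differ; this is a stand-in).
function approxBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Keep only the listed top-level fields, similar in spirit to an inclusion projection.
function project(doc, fields) {
  return Object.fromEntries(fields.filter((f) => f in doc).map((f) => [f, doc[f]]));
}

const product = {
  sku: "ABC123",
  name: "Wireless Headphones",
  price: 79.99,
  inStock: true,
  fullDescription: "x".repeat(200_000), // hypothetical large field
};

const fullBytes = approxBytes(product);
const slimBytes = approxBytes(project(product, ["name", "price", "inStock"]));
console.log({ fullBytes, slimBytes });
```

The projected document is a few dozen bytes while the full one is over 200KB, and that gap is multiplied by every document in a result set.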

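Write performance degrades with larger documents because WiredTiger generally writes a new version of the whole document on update, even when an operator like $inc or $set changes only one small field; those operators mainly shrink what travels over the network and into the oplog. Documents that keep growing make every later update more expensive. On the legacy MMAPv1 storage engine, documents that outgrew their allocated space also had to be relocated on disk, and padding factor and document moves were the metrics to watch; WiredTiger has no padding factor, so I monitor average and maximum document size over time instead.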

Index efficiency is also impacted. While indexes themselves reference documents by location, larger documents mean indexes represent a smaller percentage of your working set, reducing the effectiveness of index-only queries (covered queries). Additionally, if documents contain large arrays being indexed, the index size grows significantly.

The 16MB document limit is a hard constraint, but reaching even half this size is problematic. I design schemas to keep documents focused and reasonably sized, typically under 100KB for high-traffic collections. For large data like images or long text, I store in GridFS or external storage (S3, etc.) and reference from documents. When documents naturally become large (e.g., denormalized product catalogs), I use techniques like bucketing or splitting data across multiple documents with references.

Code Example:

// Check average document size
const stats = db.collection.stats();
console.log({
  avgDocSize: stats.avgObjSize,
  totalSize: stats.size,
  storageSize: stats.storageSize,
  count: stats.count
});

// ❌ BAD: Storing large binary data in documents
db.users.insertOne({
  name: "John Doe",
  profileImage: new Binary(/* 5MB image buffer */),  // Huge document!
  photos: [/* Array of large images */]
});

// ✅ GOOD: Reference external storage
db.users.insertOne({
  name: "John Doe",
  profileImageUrl: "https://cdn.example.com/images/john-doe.jpg",
  photoUrls: ["url1", "url2", "url3"]
});

// ❌ BAD: Unbounded arrays causing document growth
db.posts.updateOne(
  { _id: postId },
  { $push: { comments: newComment } }  // Array grows indefinitely
);

// ✅ GOOD: Store comments in a separate collection
db.comments.insertOne({
  postId: postId,
  comment: newComment,
  created: new Date()
});

// Or use the bucket pattern: at most 100 comments per bucket document
// (comment_buckets is an illustrative collection name)
db.comment_buckets.updateOne(
  { postId: postId, commentCount: { $lt: 100 } },
  {
    $push: { comments: newComment },
    $inc: { commentCount: 1 }
  },
  { upsert: true }  // Starts a new bucket once the current one is full
);

// Use projection to avoid transferring large fields
// ❌ BAD: Return entire large document
db.products.findOne({ sku: "ABC123" });

// ✅ GOOD: Project only needed fields
db.products.findOne(
  { sku: "ABC123" },
  {
    name: 1,
    price: 1,
    inStock: 1,
    // Exclude large fields like 'reviews', 'fullDescription'
  }
);

// Monitor document size distribution (run periodically to spot growth)
db.collection.aggregate([
  { $project: {
    size: { $bsonSize: "$$ROOT" }
  }},
  { $group: {
    _id: null,
    avgSize: { $avg: "$size" },
    maxSize: { $max: "$size" },
    minSize: { $min: "$size" }
  }}
]);

// Use GridFS for files > 16MB or better organization
const fs = require('fs');
const { GridFSBucket } = require('mongodb');
const bucket = new GridFSBucket(db);

// Store large file (the local path is illustrative)
const uploadStream = bucket.openUploadStream('large-document.pdf');
fs.createReadStream('./large-document.pdf').pipe(uploadStream);

// Reference in document
db.documents.insertOne({
  title: "User Manual",
  gridFsFileId: uploadStream.id
});

Data Modeling

What is the Computed pattern?

The 30-Second Answer: The Computed pattern pre-calculates and stores expensive-to-compute values rather than calculating them on every query. For example, storing a user's total post count or a product's average rating directly in the document instead of aggregating it each time, improving read performance at the cost of additional write complexity to keep computed values synchronized.

The 2-Minute Answer (If They Want More): The Computed pattern addresses the tradeoff between read and write performance. Some values are expensive to calculate - requiring aggregations across many documents, complex mathematical operations, or multiple collection scans. If these computed values are read frequently but change infrequently, it makes sense to calculate them once, store the result, and update it only when the underlying data changes.

In production, I use this pattern for dashboard statistics, leaderboards, analytics summaries, and derived metrics. For instance, instead of counting a user's posts with an aggregation on every profile view, I store the post count directly in the user document and increment it when a post is created. This turns an O(n) aggregation into an O(1) field lookup, dramatically improving performance for high-traffic pages.

The implementation requires careful synchronization between the source data and computed values. I use several strategies: atomic updates with $inc for simple counters, background jobs for batch recomputation of complex metrics, and change streams or database triggers to update computed values when source data changes. The key decision is determining the acceptable staleness - can the computed value lag behind reality, or must it be perfectly synchronized?
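
The synchronization strategies above can be sketched in memory, with an array and an object standing in for two collections. The field names (`ratingSum`, `totalReviews`) and both functions are assumptions for illustration, not part of any MongoDB API.

```javascript
// In-memory sketch of the Computed pattern: keep a running sum and count
// next to the source data so the average is an O(1) read.
const reviews = [];
const stats = { totalReviews: 0, ratingSum: 0, averageRating: null };

function addReview(rating) {
  reviews.push({ rating });     // write to the source data
  stats.totalReviews += 1;      // analogous to $inc on a counter
  stats.ratingSum += rating;    // analogous to $inc on a running sum
  stats.averageRating = stats.ratingSum / stats.totalReviews;
}

// Batch recomputation from the source data, used to detect or repair drift.
function recomputeStats() {
  const ratingSum = reviews.reduce((sum, r) => sum + r.rating, 0);
  return {
    totalReviews: reviews.length,
    ratingSum,
    averageRating: reviews.length ? ratingSum / reviews.length : null,
  };
}

[5, 4, 4, 3].forEach(addReview);
console.log(stats.averageRating);            // 16 / 4 = 4
console.log(recomputeStats().averageRating); // same value, computed the slow way
```

Storing the running sum rather than only the average is a design choice: the average can then be derived exactly, instead of being rebuilt from a previously rounded value.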

I also consider the computation frequency and cost. For values that change constantly (like social media trending scores), I might compute them periodically in batches rather than on every write. For critical business metrics (like account balances), I ensure they're updated atomically within the same transaction as the source data. The Computed pattern is most valuable when reads vastly outnumber writes and when computation cost is high relative to storage cost.

Code Example:

// WITHOUT COMPUTED PATTERN - Expensive aggregation on every query
// User document (no computed fields)
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  username: "jane_dev",
  email: "jane@example.com"
}

// Posts in separate collection
{
  _id: ObjectId("607f1f77bcf86cd799439021"),
  authorId: ObjectId("507f1f77bcf86cd799439011"),
  title: "MongoDB Tips",
  content: "..."
}

// Every profile view requires expensive aggregation
const userWithStats = await db.users.aggregate([
  { $match: { _id: userId } },
  {
    $lookup: {
      from: "posts",
      localField: "_id",
      foreignField: "authorId",
      as: "posts"
    }
  },
  {
    $addFields: {
      postCount: { $size: "$posts" }
    }
  },
  { $project: { posts: 0 } }  // Don't return all posts, just the count
]).toArray();
// This aggregation runs on EVERY profile view - expensive!

// WITH COMPUTED PATTERN - Pre-computed field
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  username: "jane_dev",
  email: "jane@example.com",
  // Pre-computed statistics
  stats: {
    postCount: 47,           // Updated when posts created/deleted
    followerCount: 823,      // Updated when followed/unfollowed
    totalLikes: 1523,        // Aggregated from all posts
    lastPostDate: ISODate("2024-01-20T10:00:00Z")
  }
}

// Simple O(1) query - no aggregation needed!
const user = await db.users.findOne({ _id: userId });
console.log(user.stats.postCount);  // Instant result

// MAINTAINING COMPUTED VALUES - Update on write
async function createPost(userId, postData) {
  const session = client.startSession();
  const now = new Date();  // One timestamp shared by the post and the computed field

  try {
    await session.withTransaction(async () => {
      // 1. Insert the post
      await db.posts.insertOne({
        authorId: userId,
        title: postData.title,
        content: postData.content,
        createdAt: now,
        likes: 0
      }, { session });

      // 2. Update computed statistics atomically
      await db.users.updateOne(
        { _id: userId },
        {
          $inc: { "stats.postCount": 1 },
          $set: { "stats.lastPostDate": now }
        },
        { session }
      );
    });
  } finally {
    // Release the session even if the transaction throws
    await session.endSession();
  }
}

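async function deletePost(postId, userId) {
  const session = client.startSession();

  try {
    await session.withTransaction(async () => {
      // 1. Delete the post
      const { deletedCount } = await db.posts.deleteOne({ _id: postId }, { session });

      // Nothing was deleted, so leave the computed count alone
      if (deletedCount === 0) return;

      // 2. Decrement computed count
      await db.users.updateOne(
        { _id: userId },
        { $inc: { "stats.postCount": -1 } },
        { session }
      );
    });
  } finally {
    // Release the session even if the transaction throws
    await session.endSession();
  }
}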

async function likePost(postId, authorId) {
  const session = client.startSession();

  try {
    await session.withTransaction(async () => {
      // 1. Increment likes on post
      await db.posts.updateOne(
        { _id: postId },
        { $inc: { likes: 1 } },
        { session }
      );

      // 2. Increment author's total likes (computed across all posts)
      await db.users.updateOne(
        { _id: authorId },
        { $inc: { "stats.totalLikes": 1 } },
        { session }
      );
    });
  } finally {
    // Release the session even if the transaction throws
    await session.endSession();
  }
}

// PRODUCT WITH COMPUTED AVERAGE RATING
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  name: "MongoDB Course",
  price: 49.99,
  // Computed statistics
  stats: {
    totalReviews: 1247,
    averageRating: 4.5,      // Pre-computed average (5667 rating points / 1247 reviews)
    ratingDistribution: {     // Pre-computed histogram
      5: 823,
      4: 312,
      3: 87,
      2: 18,
      1: 7
    },
    totalRevenue: 62347.53   // Pre-computed from all purchases
  }
}

// Update rating statistics when review added
async function addReview(productId, rating, reviewText) {
  const session = client.startSession();

  try {
    await session.withTransaction(async () => {
      // 1. Insert review
      await db.reviews.insertOne({
        productId: productId,
        rating: rating,
        text: reviewText,
        date: new Date()
      }, { session });

      // 2. Update computed statistics
      const product = await db.products.findOne({ _id: productId }, { session });
      const currentTotal = product.stats.totalReviews;
      const currentAvg = product.stats.averageRating;

      // Calculate new rolling average
      // Note: currentAvg is the stored, rounded value, so small rounding drift can
      // accumulate; periodic recomputation from the reviews collection corrects it
      const newAverage = ((currentAvg * currentTotal) + rating) / (currentTotal + 1);

      await db.products.updateOne(
        { _id: productId },
        {
          $inc: {
            "stats.totalReviews": 1,
            [`stats.ratingDistribution.${rating}`]: 1
          },
          $set: {
            "stats.averageRating": Math.round(newAverage * 10) / 10  // Round to 1 decimal
          }
        },
        { session }
      );
    });
  } finally {
    // Release the session even if the transaction throws
    await session.endSession();
  }
}

// BATCH RECOMPUTATION - For complex computed values
// Run periodically (e.g., nightly) to fix any drift
async function recomputeUserStatistics(userId) {
  const pipeline = [
    { $match: { authorId: userId } },
    {
      $group: {
        _id: "$authorId",
        postCount: { $sum: 1 },
        totalLikes: { $sum: "$likes" },
        lastPostDate: { $max: "$createdAt" }
      }
    }
  ];

  const [stats] = await db.posts.aggregate(pipeline).toArray();

  // Write zeroed stats when the user has no posts left (e.g., after deleting
  // the last post), so stale values are not kept
  await db.users.updateOne(
    { _id: userId },
    {
      $set: {
        "stats.postCount": stats?.postCount ?? 0,
        "stats.totalLikes": stats?.totalLikes ?? 0,
        "stats.lastPostDate": stats?.lastPostDate ?? null
      }
    }
  );
}

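// USING CHANGE STREAMS for automatic recomputation
// Delete events carry only documentKey (the _id, plus the shard key on sharded
// collections), so request pre-images to recover authorId.
// Requires MongoDB 6.0+ with changeStreamPreAndPostImages enabled on the collection.
const postChangeStream = db.posts.watch([], { fullDocumentBeforeChange: "whenAvailable" });

postChangeStream.on("change", async (change) => {
  if (change.operationType === "insert" || change.operationType === "delete") {
    const authorId = change.operationType === "insert"
      ? change.fullDocument.authorId
      : change.fullDocumentBeforeChange?.authorId;

    // Trigger recomputation for this user (skipped if no pre-image was available)
    if (authorId) {
      await recomputeUserStatistics(authorId);
    }
  }
});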

// E-COMMERCE INVENTORY EXAMPLE
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  sku: "LAPTOP-001",
  name: "Developer Laptop",
  // Computed inventory across all warehouses
  inventory: {
    totalAvailable: 47,      // Sum across all locations
    totalReserved: 12,       // Reserved for pending orders
    totalInStock: 59,        // Available + Reserved
    locations: [
      { warehouse: "US-WEST", available: 23, reserved: 5 },
      { warehouse: "US-EAST", available: 24, reserved: 7 }
    ],
    lastUpdated: ISODate("2024-01-20T10:00:00Z")
  }
}

// DASHBOARD METRICS - Computed daily
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  date: ISODate("2024-01-20T00:00:00Z"),
  // All computed metrics for fast dashboard loading
  metrics: {
    totalUsers: 15234,
    activeUsers: 8234,
    totalRevenue: 234523.45,
    averageOrderValue: 87.23,
    conversionRate: 0.034,
    topProducts: [
      { productId: ObjectId("..."), name: "MongoDB Course", sales: 234 }
    ]
  },
  computedAt: ISODate("2024-01-20T23:00:00Z")
}
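
The rolling-average formula used in addReview can be checked in isolation. The sketch below is plain JavaScript with no database involved; the starting values (4.5 over 1247 reviews) and the helper names are illustrative assumptions. It compares keeping the stored average at full precision with rounding it to one decimal before storing it and reusing it on the next update.

```javascript
// Incremental mean: fold one new rating into an existing average.
function rollingAverage(currentAvg, currentCount, rating) {
  return (currentAvg * currentCount + rating) / (currentCount + 1);
}

const round1 = (x) => Math.round(x * 10) / 10;

// Feed 1000 five-star ratings into a product that starts at 4.5 over 1247 reviews.
let precise = 4.5;   // stored at full precision
let rounded = 4.5;   // rounded to one decimal on every write, then reused
let count = 1247;

for (let i = 0; i < 1000; i++) {
  precise = rollingAverage(precise, count, 5);
  rounded = round1(rollingAverage(rounded, count, 5));
  count++;
}

console.log(round1(precise)); // 4.7
console.log(rounded);         // 4.5
```

With this many existing reviews, a single rating moves the average by less than the rounding step, so the rounded value snaps back to 4.5 every time. Storing the unrounded average (or a running sum of ratings) and rounding only when displaying avoids that, and a periodic batch recomputation corrects whatever drift remains.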


Security

What is field-level encryption in MongoDB?

The 30-Second Answer: Field-level encryption (FLE) encrypts specific document fields on the client side before the data is sent to MongoDB, so sensitive values stay encrypted in transit, at rest, and in the server's memory. MongoDB offers two features built on this idea, Client-Side Field Level Encryption (CSFLE) and Queryable Encryption, and both can run either automatically (configured in the driver) or explicitly (encrypt and decrypt calls in application code).

The 2-Minute Answer (If They Want More): Field-Level Encryption in MongoDB allows you to encrypt specific sensitive fields like social security numbers, credit card data, or medical records while leaving non-sensitive fields searchable and indexable. The encryption happens on the client side using encryption keys that MongoDB never sees, providing end-to-end security even if the database or backups are compromised.

MongoDB offers two FLE implementations: Client-Side Field Level Encryption (CSFLE, available since MongoDB 4.2) and Queryable Encryption (introduced as a preview in MongoDB 6.0 and generally available for equality queries in 7.0). Both support explicit encryption, where application code calls encrypt and decrypt itself, and automatic encryption, where the driver encrypts and decrypts fields based on a schema; automatic encryption requires MongoDB Atlas or MongoDB Enterprise. Both use envelope encryption: data encryption keys are stored in a key vault collection, themselves encrypted by a master key held in a Key Management System (KMS) like AWS KMS, Azure Key Vault, or Google Cloud KMS.
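
Envelope encryption is easier to see in a small sketch. The example below uses only Node's built-in crypto module and is not how the MongoDB driver implements it: the local `masterKey` stands in for a key that would normally stay inside the KMS, and the helper names `seal` and `open` are made up for illustration.

```javascript
const crypto = require('node:crypto');

// AES-256-GCM helper: returns everything needed to decrypt later.
function seal(key, plaintext) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, data, tag: cipher.getAuthTag() };
}

function open(key, { iv, data, tag }) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(data), decipher.final()]);
}

// Assumption: stand-in for the KMS-held master key.
const masterKey = crypto.randomBytes(32);

// 1. Generate a data encryption key (DEK) and wrap it with the master key.
const dataKey = crypto.randomBytes(32);
const wrappedDataKey = seal(masterKey, dataKey); // what a key vault would store

// 2. Encrypt the field value with the DEK, not with the master key.
const encryptedSsn = seal(dataKey, Buffer.from('123-45-6789', 'utf8'));

// 3. To read: unwrap the DEK with the master key, then decrypt the field.
const unwrappedKey = open(masterKey, wrappedDataKey);
const ssn = open(unwrappedKey, encryptedSsn).toString('utf8');
console.log(ssn); // 123-45-6789
```

The point of the two layers is that rotating or revoking the master key only requires re-wrapping the small data keys, not re-encrypting every document.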

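CSFLE offers deterministic and randomized algorithms. Deterministic encryption produces the same ciphertext for the same plaintext, which enables equality queries but reveals which documents share a value, so it is a weaker choice for low-cardinality fields. Randomized encryption provides stronger security, but CSFLE cannot query those fields - you must decrypt to read the value. Queryable Encryption always uses randomized encryption and still supports equality queries on fields configured for them.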
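
The difference between the two modes can be shown with a toy example built on Node's crypto module. This is only an illustration of the idea, not MongoDB's algorithms (CSFLE uses AEAD_AES_256_CBC_HMAC_SHA_512 in deterministic and random variants), and it omits authentication; the function names and the IV derivation are assumptions made for the sketch.

```javascript
const crypto = require('node:crypto');

const key = crypto.randomBytes(32);    // data encryption key
const ivKey = crypto.randomBytes(32);  // separate key used only to derive IVs

function encrypt(plaintext, iv) {
  const cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
  const body = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, body]).toString('base64');
}

// Randomized: a fresh IV per call, so equal plaintexts give different ciphertexts.
const encryptRandom = (plaintext) => encrypt(plaintext, crypto.randomBytes(16));

// Deterministic: the IV is derived from the plaintext, so equal plaintexts
// give equal ciphertexts and can be matched without decrypting.
const encryptDeterministic = (plaintext) => {
  const iv = crypto.createHmac('sha256', ivKey).update(plaintext).digest().subarray(0, 16);
  return encrypt(plaintext, iv);
};

console.log(encryptDeterministic('123-45-6789') === encryptDeterministic('123-45-6789')); // true
console.log(encryptRandom('123-45-6789') === encryptRandom('123-45-6789'));               // false
```

An equality query on a deterministically encrypted field works the same way: the query value is encrypted to the same ciphertext and compared byte for byte. That is also why repeated values are visible to anyone who can read the stored ciphertexts.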

In production environments handling sensitive data, I implement FLE for fields containing PII, financial data, or health information. This helps meet the requirements of regulations like GDPR, HIPAA, or PCI-DSS, because the protected values stay unreadable even if database backups are exposed or administrators have full database access.

Code Example:

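// Set up automatic encryption using the Queryable Encryption configuration format
// Requires the mongodb-client-encryption package and MongoDB Atlas or Enterprise
const { MongoClient, ClientEncryption } = require('mongodb');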

// Configure KMS providers (AWS KMS example)
const kmsProviders = {
  aws: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
  }
};

// Create Data Encryption Keys (Queryable Encryption needs a separate key per encrypted field)
const client = new MongoClient(uri);  // uri: your MongoDB connection string
const encryption = new ClientEncryption(client, {
  keyVaultNamespace: 'encryption.__keyVault',
  kmsProviders
});

const masterKey = {
  key: 'arn:aws:kms:us-east-1:123456789012:key/...',
  region: 'us-east-1'
};
const ssnKey = await encryption.createDataKey('aws', { masterKey });
const medicalRecordKey = await encryption.createDataKey('aws', { masterKey });

// Define encrypted fields (keyId is a single key UUID, not an array)
const encryptedFieldsMap = {
  'myapp.users': {
    fields: [
      {
        path: 'ssn',
        bsonType: 'string',
        keyId: ssnKey,
        queries: { queryType: 'equality' }  // Enables equality queries on this field
      },
      {
        path: 'medicalRecord',
        bsonType: 'string',
        keyId: medicalRecordKey
        // No queries - encrypted but not queryable
      }
    ]
  }
};

// Connect with automatic encryption
const secureClient = new MongoClient(uri, {
  autoEncryption: {
    keyVaultNamespace: 'encryption.__keyVault',
    kmsProviders,
    encryptedFieldsMap
  }
});

// Insert document - sensitive fields automatically encrypted
await secureClient.db('myapp').collection('users').insertOne({
  name: 'John Doe',  // Not encrypted
  email: 'john@example.com',  // Not encrypted
  ssn: '123-45-6789',  // Automatically encrypted
  medicalRecord: 'Patient history...'  // Automatically encrypted
});

// Query encrypted field (works because ssn is configured for equality queries)
const user = await secureClient.db('myapp').collection('users')
  .findOne({ ssn: '123-45-6789' });  // Automatically encrypts query
// Result automatically decrypts fields


Aggregation Framework

What is the aggregation framework in MongoDB?

The 30-Second Answer: The aggregation framework is MongoDB's powerful data processing pipeline that transforms and analyzes documents through a sequence of stages. It's like Unix pipes for data - you pass documents through stages like $match, $group, and $project to filter, reshape, and compute results, making it essential for analytics, reporting, and complex queries that go beyond simple find() operations.

The 2-Minute Answer (If They Want More): The aggregation framework is MongoDB's primary tool for running analytics and data transformations directly in the database. Unlike the older MapReduce approach (deprecated since MongoDB 5.0), it provides a declarative pipeline-based syntax that's both more performant and easier to understand.

Each aggregation operation is a pipeline consisting of one or more stages. Documents enter the first stage, get processed, and the output flows to the next stage - similar to how Unix commands can be piped together. Common stages include $match (filtering), $group (aggregating), $project (reshaping), $sort, $lookup (joins), and many others.
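
The stage-by-stage flow can be modeled in a few lines of plain JavaScript. This is a toy in-memory model, not how the server executes pipelines; the helpers `match`, `group`, `sortDesc`, and `runPipeline` are invented for the sketch and only loosely mirror $match, $group, and $sort.

```javascript
// Each "stage" is a function from an array of documents to a new array.
const match = (predicate) => (docs) => docs.filter(predicate);

const group = (keyFn, field) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    totals.set(key, (totals.get(key) ?? 0) + doc[field]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

const sortDesc = (field) => (docs) => [...docs].sort((a, b) => b[field] - a[field]);

// Running the pipeline feeds each stage's output into the next stage.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const orders = [
  { category: 'books', amount: 20, status: 'completed' },
  { category: 'games', amount: 60, status: 'completed' },
  { category: 'books', amount: 15, status: 'cancelled' },
  { category: 'books', amount: 30, status: 'completed' }
];

const result = runPipeline(orders, [
  match((o) => o.status === 'completed'),  // like $match
  group((o) => o.category, 'amount'),      // like $group with $sum
  sortDesc('total')                        // like $sort
]);

console.log(result);
// [ { _id: 'games', total: 60 }, { _id: 'books', total: 50 } ]
```

The model also shows why stage order matters: filtering first means the grouping stage sees three documents instead of four, which is the same reason $match is usually placed as early as possible in a real pipeline.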

The framework can handle complex operations like multi-collection joins, recursive graph traversal, statistical calculations, and data transformations. It's optimized to push operations down to indexes when possible and can leverage MongoDB's query optimizer. For large datasets, aggregation pipelines can be more efficient than client-side processing since they reduce network transfer and utilize the database's native optimization.

In production, I use aggregation for everything from simple grouping operations to complex ETL workflows, real-time analytics dashboards, and generating derived collections. It's particularly powerful when combined with indexes and proper pipeline ordering to maximize performance.

Code Example:

// Example: Calculate total sales by category with average price
db.orders.aggregate([
  // Stage 1: Filter recent orders
  {
    $match: {
      orderDate: { $gte: ISODate("2025-01-01") },
      status: "completed"
    }
  },
  // Stage 2: Unwind items array to process each item
  {
    $unwind: "$items"
  },
  // Stage 3: Group by category and calculate metrics
  {
    $group: {
      _id: "$items.category",
      totalRevenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
      avgPrice: { $avg: "$items.price" },
      itemCount: { $sum: 1 }  // Counts unwound line items, not distinct orders
    }
  },
  // Stage 4: Sort by revenue descending
  {
    $sort: { totalRevenue: -1 }
  },
  // Stage 5: Reshape output
  {
    $project: {
      category: "$_id",
      totalRevenue: { $round: ["$totalRevenue", 2] },
      avgPrice: { $round: ["$avgPrice", 2] },
      itemCount: 1,
      _id: 0
    }
  }
]);


Replication and High Availability

What is a replica set in MongoDB?

The 30-Second Answer: A replica set is MongoDB's built-in replication mechanism consisting of multiple MongoDB instances that maintain the same data set. It provides redundancy and high availability with automatic failover - if the primary node fails, the replica set automatically elects a new primary from the secondary nodes.

The 2-Minute Answer (If They Want More): A replica set is a group of MongoDB servers that maintain identical copies of data, providing fault tolerance and high availability. The typical configuration includes one primary node that receives all write operations and multiple secondary nodes that replicate data from the primary asynchronously.

The replica set uses an election protocol modeled on the Raft consensus algorithm to automatically elect a new primary if the current primary becomes unavailable. This ensures your application continues to function even when individual nodes fail. Electing a primary requires a majority of voting members, so an odd number of voting members is recommended (an arbiter can supply the extra vote when another data-bearing member is not feasible). A replica set can have up to 50 members, of which at most 7 can vote; common configurations are 3 or 5 members.
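
The majority requirement is simple arithmetic, sketched here in plain JavaScript. The function name `electionMath` is made up for the illustration.

```javascript
// For a given number of voting members, compute the votes needed to elect
// a primary and how many members can be lost while still reaching that count.
function electionMath(votingMembers) {
  const majority = Math.floor(votingMembers / 2) + 1;
  return { votingMembers, majority, faultTolerance: votingMembers - majority };
}

for (const n of [3, 4, 5, 6, 7]) {
  console.log(electionMath(n));
}
// { votingMembers: 3, majority: 2, faultTolerance: 1 }
// { votingMembers: 4, majority: 3, faultTolerance: 1 }
// { votingMembers: 5, majority: 3, faultTolerance: 2 }
// { votingMembers: 6, majority: 4, faultTolerance: 2 }
// { votingMembers: 7, majority: 4, faultTolerance: 3 }
```

Going from 3 to 4 voting members raises the majority without raising the number of failures the set can survive, which is why even member counts add cost without adding fault tolerance.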

Beyond high availability, replica sets enable read scaling by allowing applications to read from secondary nodes (accepting that those reads may be slightly stale, since replication is asynchronous), geographic distribution of data by placing members in different data centers, and dedicated analytics nodes using hidden or delayed members. Every write is recorded in the primary's oplog (operations log), and secondaries continuously pull and apply these operations to stay synchronized.
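
Oplog-based replication can be pictured with a toy in-memory model. The classes below are invented for illustration and leave out nearly everything a real replica set does (networking, elections, idempotent operation formats, oplog size limits); they only show a secondary pulling and applying entries it has not yet seen.

```javascript
// Toy model: the primary appends each write to an ordered log.
class Primary {
  constructor() {
    this.data = new Map();
    this.oplog = [];
  }
  set(key, value) {
    this.data.set(key, value);
    this.oplog.push({ ts: this.oplog.length + 1, op: 'set', key, value });
  }
}

// A secondary remembers the last entry it applied and replays newer ones in order.
class Secondary {
  constructor() {
    this.data = new Map();
    this.lastApplied = 0;
  }
  sync(source) {
    for (const entry of source.oplog) {
      if (entry.ts <= this.lastApplied) continue; // already applied
      this.data.set(entry.key, entry.value);
      this.lastApplied = entry.ts;
    }
  }
}

const primary = new Primary();
const secondary = new Secondary();

primary.set('likes', 1);
secondary.sync(primary);
primary.set('likes', 2);

// Until the next sync, the secondary serves the older value (replication lag).
const staleRead = secondary.data.get('likes');
secondary.sync(primary);
const freshRead = secondary.data.get('likes');
console.log(staleRead, freshRead); // 1 2
```

The stale read between the two syncs is the behavior to keep in mind when routing reads to secondaries.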

In production, I typically deploy a minimum of three members across different availability zones to ensure the replica set can survive zone failures while maintaining quorum for elections.

Code Example:

// Initialize a replica set (run on one member)
rs.initiate({
  _id: "myReplicaSet",
  members: [
    { _id: 0, host: "mongodb0.example.net:27017" },
    { _id: 1, host: "mongodb1.example.net:27017" },
    { _id: 2, host: "mongodb2.example.net:27017" }
  ]
})

// Check replica set status
rs.status()

// Connection string for application
const uri = "mongodb://mongodb0.example.net:27017,mongodb1.example.net:27017,mongodb2.example.net:27017/?replicaSet=myReplicaSet"
