
Natural Language to MongoDB


I recently completed a project measuring how well large language models (LLMs) can translate natural language queries into MongoDB queries.

The Benchmark

Over spring 2025, I developed a benchmark to evaluate how different LLMs perform at generating MongoDB Shell (mongosh) code from natural language queries. The benchmark consists of 766 test cases across 8 databases using the MongoDB Atlas sample datasets.
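
For a concrete sense of the task, here is a hypothetical test case in the style of the benchmark, written against the sample_mflix Atlas sample database (the question wording and reference query are illustrative, not an actual benchmark entry):

```js
// Hypothetical test case against the sample_mflix Atlas sample database.
// Natural language query:
//   "What are the 5 highest-rated movies released after 2010?"
// Reference mongosh code the model is expected to produce:
db.movies
  .find(
    { year: { $gt: 2010 }, "imdb.rating": { $exists: true } },
    { _id: 0, title: 1, year: 1, "imdb.rating": 1 }
  )
  .sort({ "imdb.rating": -1 })
  .limit(5);
```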

Here’s what I evaluated:

Models Evaluated

These models represented the state of the art from major AI labs at the time of benchmarking.

Key Evaluation Metrics

Key Findings

Model Performance Correlates with General Capabilities

The results show a strong correlation (R² = 0.615) between how well models perform on this MongoDB benchmark and their performance on general benchmarks like MMLU-Pro. This suggests that as models get better overall, they’ll continue improving at MongoDB-specific tasks.

[Figure: XMaNeR score vs. MMLU-Pro performance]

Top Performers

  1. Claude 3.7 Sonnet - 86.7% average XMaNeR score
  2. o3-mini - 82.9% average XMaNeR score
  3. Gemini 2 Flash - 82.9% average XMaNeR score

[Figure: average XMaNeR score by model]

Prompting Strategy Matters (But Less for Better Models)

Different prompting approaches produced meaningfully different results. Interestingly, the highest-performing models were less sensitive to prompting variations: Claude 3.7 Sonnet had only a 5.86% range between its best and worst experiments, while lower-performing models like Mistral Large 2 had an 18.72% range.

[Figure: average score vs. range across prompting experiments]

Most Effective Prompting Strategies

[Figure: prompting strategy heatmap]

Practical Recommendations

Based on these benchmark results, here are actionable recommendations for building natural language to MongoDB query systems:

  1. Invest in annotated schemas - describing what database fields mean meaningfully improves results
  2. Include sample documents in your prompts whenever possible - models understand these better than programmatically generated database schemas (see the prompt sketch after this list)
  3. Test different prompting strategies for your specific model and use case
  4. Consider the cost-performance trade-off of agentic approaches
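
As a rough sketch of recommendations 1 and 2, a prompt could combine an annotated schema with a real sample document. The field annotations, sample document, and helper function below are illustrative, not the benchmark's actual prompts:

```js
// Illustrative prompt assembly combining an annotated schema with a
// sample document. All names and wording here are hypothetical.
const annotatedSchema = `
Collection: movies
  title (string): the movie's title
  year (number): year of theatrical release
  imdb.rating (number): IMDb score, 0-10
  genres (array of strings): e.g. ["Drama", "Comedy"]
`;

const sampleDocument = JSON.stringify(
  {
    title: "Inception",
    year: 2010,
    imdb: { rating: 8.8 },
    genres: ["Action", "Sci-Fi"],
  },
  null,
  2
);

function buildPrompt(userQuestion) {
  return [
    "Generate MongoDB Shell (mongosh) code that answers the user's question.",
    `Annotated schema:\n${annotatedSchema}`,
    `Sample document:\n${sampleDocument}`,
    `User question: ${userQuestion}`,
    "Return only the mongosh code.",
  ].join("\n\n");
}
```

Sample documents give the model concrete field names and value shapes that a generated JSON schema often obscures.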

Dataset

Dataset Profile

The benchmark dataset consists of 766 test cases distributed across 8 MongoDB Atlas sample databases. The dataset includes diverse query complexities and MongoDB operations.
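
Here, "diverse query complexities" spans everything from single-collection finds to multi-stage aggregation pipelines. These two hypothetical examples show the spread:

```js
// Simple end of the spectrum: a single-collection find.
db.movies.find({ year: 2015 }, { title: 1, _id: 0 });

// Complex end: a multi-stage aggregation with grouping and sorting.
db.movies.aggregate([
  { $match: { year: { $gte: 2000 } } },
  { $unwind: "$genres" },
  { $group: { _id: "$genres", avgRating: { $avg: "$imdb.rating" } } },
  { $sort: { avgRating: -1 } },
  { $limit: 10 },
]);
```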

[Figure: test case distribution by dataset]

[Figure: query operator counts]

Dataset Generation Pipeline

Rather than manually creating the hundreds of test cases needed, I built a scalable pipeline that programmatically generates natural language queries and their corresponding MongoDB queries.

[Figure: tree-of-generation pipeline]

The generation process follows this flow, sketched in code after the list:

  1. User personas: Generate diverse user types who might query the database
  2. Use cases: Create realistic scenarios for each user persona
  3. Natural language queries: Generate multiple ways each user might ask their question
  4. MongoDB queries: Create the corresponding mongosh code to answer each question
  5. Filter: keep only queries where a plurality of LLMs agree on the answer and more than one LLM answers correctly. This ensures that each query is ‘answerable’.
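
A minimal sketch of the whole pipeline, assuming a hypothetical generateList(prompt) helper that asks an LLM for an array of strings (the real implementation lives in the MongoDB Chatbot repository):

```js
// Hypothetical LLM helper: returns an array of strings for a prompt.
// Swap in your LLM client of choice; this stub just marks the call site.
async function generateList(prompt) {
  throw new Error(`LLM call not wired up: ${prompt.slice(0, 40)}...`);
}

// Sketch of the tree-of-generation pipeline (steps 1-4).
async function generateTestCases(databaseDescription) {
  const testCases = [];
  const personas = await generateList(
    `List user personas who might query this database:\n${databaseDescription}`
  );
  for (const persona of personas) {
    const useCases = await generateList(
      `List realistic scenarios in which "${persona}" would query the database.`
    );
    for (const useCase of useCases) {
      const nlQueries = await generateList(
        `Write different ways "${persona}" might phrase a question for: ${useCase}`
      );
      for (const nlQuery of nlQueries) {
        const mongoshQueries = await generateList(
          `Write mongosh code that answers: ${nlQuery}`
        );
        for (const code of mongoshQueries) {
          testCases.push({ persona, useCase, nlQuery, code });
        }
      }
    }
  }
  return testCases; // Filter for answerability before using (step 5, below).
}

// Step 5, sketched: given each model's execution result for a test case,
// keep the case only if a plurality of models agree on one answer and
// more than one model produced it.
function isAnswerable(resultsByModel) {
  const tally = new Map();
  for (const result of resultsByModel) {
    const key = JSON.stringify(result);
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  const counts = [...tally.values()].sort((a, b) => b - a);
  const top = counts[0] ?? 0;
  const second = counts[1] ?? 0;
  return top > second && top > 1;
}
```

Each nested step multiplies the number of candidates; the final filter keeps only the answerable ones.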

Advantages of This Approach

  1. Scalable: Generate N users × M use cases × P natural language queries × Q MongoDB queries. For example, 8×8×8×8 = 4,096 test cases.
  2. Flexible: The process adapts to any MongoDB database. Simply point it at your collections.
  3. Extensible: Intervene at any step to create targeted datasets for specific features or use cases. For example, in the natural language query generation step, you could prompt it to create only time-series-related queries.

This approach creates a comprehensive benchmark that covers realistic query patterns while maintaining quality and relevance to actual MongoDB usage.

Resources and Documentation

Dataset and Detailed Results Analysis

The complete dataset, benchmark results, and source code are available on Hugging Face.

Source Code

The benchmark generation pipeline and evaluation code can be found in the MongoDB Chatbot repository.

MongoDB Documentation

Building on these benchmark findings, the MongoDB documentation team and I created new guidance for building natural language to MongoDB query systems. You can find the official documentation at Natural Language to MongoDB Queries. This page includes practical prompting strategies and best practices derived from this research.

The documentation covers optimal prompt components, example schemas, and implementation patterns that emerged from testing thousands of natural language to MongoDB query translations across different models and configurations.

