text-embedding-3-small vs text-embedding-ada-002 vs BGE-large-en-v1.5: Which Embedding Model for RAG?

Choosing an embedding model for a Pinecone, Weaviate, or pgvector RAG pipeline comes down to three variables: retrieval quality (as measured on the MTEB benchmark), token limit, and cost per million tokens. text-embedding-ada-002 (OpenAI), one of the most widely deployed models in production LangChain and LlamaIndex pipelines, produces 1,536-dimensional vectors, has an 8,191-token limit, and costs $0.10/1M tokens. text-embedding-3-small matches or beats it on each of these: higher scores on MTEB's retrieval (BEIR) tasks, the same 1,536 dimensions, the same token limit, and a price of $0.02/1M tokens, which is a 5× cost reduction with better retrieval quality. text-embedding-3-large produces 3,072-dimensional vectors at $0.13/1M tokens for use cases where ranking precision matters more than cost.

For teams that want free, self-hosted embeddings, BGE-large-en-v1.5 (BAAI, 512-token limit) consistently outperforms ada-002 on BEIR benchmarks and runs in-process via sentence-transformers or Transformers.js without API calls. BGE-M3 extends this to multilingual retrieval with an 8,192-token limit, making it a strong default for non-English corpora.
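
To make the cost comparison concrete, here is a minimal Python sketch that estimates the one-time cost of embedding a corpus with each OpenAI model. The prices and dimensions are the ones quoted above; the 50M-token corpus size and the function name embedding_cost_usd are hypothetical, chosen only for illustration.

```python
# Prices (USD per 1M tokens) and vector dimensions as quoted in the text above.
# Check current pricing before budgeting; these are not fetched from any API.
MODELS = {
    "text-embedding-ada-002": {"usd_per_1m": 0.10, "dims": 1536},
    "text-embedding-3-small": {"usd_per_1m": 0.02, "dims": 1536},
    "text-embedding-3-large": {"usd_per_1m": 0.13, "dims": 3072},
}


def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """Estimate the cost of embedding `total_tokens` tokens once with `model`."""
    return total_tokens / 1_000_000 * MODELS[model]["usd_per_1m"]


if __name__ == "__main__":
    corpus_tokens = 50_000_000  # hypothetical corpus size
    for name in MODELS:
        cost = embedding_cost_usd(corpus_tokens, name)
        print(f"{name}: ${cost:.2f}")
```

A self-hosted model such as BGE-large-en-v1.5 has no per-token fee, so it is left out of the table; its cost is the compute you run it on.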

The model you pick interacts directly with your chunking strategy: with BGE-large's 512-token limit, chunks longer than roughly 380 words are silently truncated before embedding, so any text past the cutoff never reaches the vector and cannot be retrieved. The RAG Chunk Inspector lets you paste a document, set chunk size and overlap, and preview each chunk's token count against the model's limit, so you can confirm that no chunk exceeds the embedding boundary before indexing into Pinecone or Weaviate.
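
The same check can be approximated in a few lines of Python. The sketch below is not the RAG Chunk Inspector's implementation: it splits on whitespace and estimates tokens from the rough "512 tokens ≈ 380 words" ratio mentioned above, which is an assumption that varies by tokenizer and by text. For real indexing, count tokens with the model's own tokenizer. All function names here are hypothetical.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word-based chunks of `chunk_size` words with `overlap` words shared."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def estimate_tokens(chunk: str) -> int:
    """Rough token estimate: 512 tokens per 380 words (heuristic, rounded up)."""
    n_words = len(chunk.split())
    return -(-n_words * 512 // 380)  # integer ceiling division


def oversized_chunks(chunks: list[str], token_limit: int) -> list[tuple[int, int]]:
    """Return (chunk index, estimated tokens) for chunks above `token_limit`."""
    flagged = []
    for i, chunk in enumerate(chunks):
        tokens = estimate_tokens(chunk)
        if tokens > token_limit:
            flagged.append((i, tokens))
    return flagged


if __name__ == "__main__":
    document = " ".join(["word"] * 1000)  # placeholder document
    chunks = chunk_words(document, chunk_size=400, overlap=50)
    for index, tokens in oversized_chunks(chunks, token_limit=512):
        print(f"chunk {index}: ~{tokens} tokens exceeds the 512-token limit")
```

With a 400-word chunk size, the full-size chunks are flagged against a 512-token limit, while dropping the chunk size to 380 words or fewer keeps the estimate within it.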