LES: Choosing the Right Slovene Embedding Model
In the first post we introduced the LES benchmark and described the datasets it uses. In the second post we explained the evaluation metrics and showed why the "best" model depends on the task. In this final post we dig into the full results and extract practical guidance for choosing a Slovene embedding model.
The analysis below is based on results from May 2026. As new models are added to the leaderboard, specific rankings may shift, but the patterns described here should remain relevant.
New to LES? Here's a quick recap from the first and second posts:
- Classification: assign a text to a predefined category (e.g. sentiment, genre, topic). Evaluated with accuracy.
- Clustering: group similar texts together without using labels. Evaluated with V-measure (0 = random grouping, 1 = perfect clusters).
- Retrieval: find the most relevant document for a given query (the core of RAG systems). Evaluated with nDCG@10 (0 = worst ranking, 1 = perfect top-10 ranking).
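To make the retrieval metric concrete, here is a minimal sketch of nDCG@10 using the standard definition with linear gain. The exact implementation used by LES is not described in this post, so treat the function names and the toy relevance lists as illustrative assumptions.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy rankings (hypothetical): one relevant document among 12 candidates.
top_hit = [1] + [0] * 11          # relevant document ranked first
third_hit = [0, 0, 1] + [0] * 9   # relevant document ranked third
missed = [0] * 11 + [1]           # relevant document ranked 12th, outside the top 10

print(ndcg_at_k(top_hit))    # 1.0
print(ndcg_at_k(third_hit))  # 0.5
print(ndcg_at_k(missed))     # 0.0
```

The logarithmic discount is why a relevant document at rank 3 already halves the score when there is only one relevant document per query.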
Patterns Worth Noting
The leaderboard contains results for 25 models across 26 tasks. Rather than walking through every number, we focus on four questions that come up repeatedly when choosing a model for a real application.
1. Does the Evaluation Dataset Corpus Size Affect Which Model Is Best?
The WikipediaQA retrieval dataset in LES is evaluated at increasing corpus sizes: approximately 10k, 50k, 100k, 200k, and 330k documents. The queries remain the same; only the amount of noise (irrelevant documents that the model must filter through) increases.
As shown in Figure 1, all models degrade as the corpus grows, but the rate of degradation varies significantly. Among the top 10 models, google/gemini-embedding-001 experiences only a 6% drop in nDCG@10 from the smallest to the largest corpus, while openai/text-embedding-3-large drops by 14%.
More importantly, rankings shift. Qwen/Qwen3-Embedding-8B and BAAI/bge-m3 start at similar scores on the 10k corpus, but by 200k documents their gap widens to nearly 2%. intfloat/multilingual-e5-small begins as the weakest among the top 10 but overtakes two models by the time the full corpus is used.
At the extreme end, cjvt/GaMS-1B loses 27% of its performance, the largest drop among all evaluated models.
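If you run the same comparison on your own corpus, the two quantities discussed above can be computed as in the sketch below. The scores and model names are placeholders, not LES results. The sketch also assumes the drop is measured relative to the small-corpus score, which this post does not state explicitly.

```python
def relative_drop(score_small, score_large):
    """Percentage of the small-corpus score that is lost at the large corpus."""
    return 100.0 * (score_small - score_large) / score_small

def ranking(scores):
    """Model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder nDCG@10 values, NOT taken from the LES leaderboard.
scores_10k = {"model_a": 0.80, "model_b": 0.78, "model_c": 0.75}
scores_330k = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.69}

for name in scores_10k:
    drop = relative_drop(scores_10k[name], scores_330k[name])
    print(f"{name}: {drop:.1f}% drop")

print(ranking(scores_10k))   # ['model_a', 'model_b', 'model_c']
print(ranking(scores_330k))  # ['model_b', 'model_c', 'model_a']
```

In this toy example the model that leads on the small corpus ends up last on the large one, which is the kind of shift worth checking for before committing to a model.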
Key takeaway
Rankings shift as corpus size grows. Always evaluate on a corpus that matches your production scale since a model competitive at 10k documents may fall behind at 300k.
2. Is the Largest Model Worth the Cost?
Figure 2 plots model size against the overall LES score, which is simply the mean performance across the classification, clustering, and retrieval measures.
The starkest example: Snowflake/snowflake-arctic-embed-l-v2.0 (0.6B parameters) scores 55.63 overall, while Qwen/Qwen3-Embedding-8B (8B parameters, 13× larger) scores 55.72. That is a 0.09 point improvement for 13× the compute.
At the smaller end, Alibaba-NLP/gte-multilingual-base (0.3B parameters) scores 53.52, roughly 9% behind the top model but at a fraction of the size. For many production scenarios, that tradeoff is worth making.
Furthermore, higher-dimensional embeddings increase storage costs and slow down similarity search. If a 768-dim model scores close to a 1024-dim model, the smaller embedding may be preferable in a vector database at scale. Resource differences extend beyond embedding size: Qwen/Qwen3-Embedding-8B has a context window of 32,768 tokens and 8B parameters, while intfloat/multilingual-e5-large-instruct uses a 512-token window at 0.6B parameters, yet the smaller model performs only 3% worse overall. Depending on your latency and infrastructure constraints, that tradeoff may be easy to accept.
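To put the storage argument into numbers, the sketch below estimates the raw size of a flat vector index. It assumes one float32 vector per document, with no chunking, compression, or index overhead. Those are simplifying assumptions, and a real vector database will differ.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a flat, uncompressed vector index (float32 by default)."""
    return num_vectors * dim * bytes_per_value

NUM_DOCS = 330_000  # roughly the largest WikipediaQA corpus size in LES

for dim in (768, 1024):
    size = index_size_bytes(NUM_DOCS, dim)
    print(f"{dim} dims: {size / 1e9:.2f} GB")
# 768 dims: 1.01 GB
# 1024 dims: 1.35 GB
```

Exhaustive similarity search touches every stored value, so its cost grows in the same proportion as the storage: under these assumptions, moving from 768 to 1024 dimensions adds about a third to both.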
Key takeaway
A bigger model is not always better. The 0.3B–0.6B parameter range consistently delivers 90–95% of the top score at a fraction of the compute cost. This is a worthwhile tradeoff for most production deployments.
3. Does the Number of Classes Matter?
To isolate the effect of class granularity, we compare model performance across datasets with different numbers of target classes: FrenkBinary and FrenkMulticlass (hate speech detection, using the same texts but with 2 and 6 labels respectively), X-GENRE (text genre classification, 9 classes), and KASTitle (university thesis classification by faculty, 9 classes). The table below displays classification performance on these four datasets with the number of classes shown in parentheses. The measure displayed is accuracy (proportion of correctly classified texts, higher is better).
| Model | FrenkBinary (2) | FrenkMulticlass (6) | XGenre (9) | KASTitle (9) |
|---|---|---|---|---|
| google/gemini-embedding-001 | 0.6635 | 0.3816 | 0.5559 | 0.5958 |
| openai/text-embedding-3-large | 0.5970 | 0.3394 | 0.4877 | 0.5838 |
| cjvt/GaMS-1B | 0.5947 | 0.3011 | 0.5268 | 0.5714 |
| Qwen/Qwen3-Embedding-8B | 0.5924 | 0.2959 | 0.4363 | 0.5531 |
| intfloat/multilingual-e5-large-instruct | 0.6039 | 0.3094 | 0.4413 | 0.5647 |
| BAAI/bge-m3 | 0.5875 | 0.3192 | 0.4503 | 0.5226 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5748 | 0.3100 | 0.4436 | 0.5487 |
| openai/text-embedding-3-small | 0.5775 | 0.2886 | 0.4553 | 0.5227 |
| Qwen/Qwen3-Embedding-4B | 0.5877 | 0.2884 | 0.4117 | 0.5399 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 0.5715 | 0.2667 | 0.4084 | 0.5402 |
Binary classification is easier for every model. But the size of the drop when moving to more classes varies.
intfloat/multilingual-e5-large-instruct ranks 6th overall on the classification tasks but places 2nd on FrenkBinary. However, its performance drops more steeply than most on the multiclass version, suggesting it captures broad polarity well but struggles with fine-grained distinctions.
Interestingly, the 9-class tasks (X-GENRE and KASTitle) do not always show a larger drop than the 6-class FrenkMulticlass. This suggests that difficulty depends not just on the number of classes but on how semantically distinct those classes are. Distinguishing between faculties or text genres involves clearer topical boundaries than distinguishing between types of offensive speech.
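The size of the binary-to-multiclass drop can be read directly off the classification table. The short sketch below does the subtraction for each model and sorts by the absolute drop in accuracy. The values are copied from the FrenkBinary and FrenkMulticlass columns above, and the variable names are our own.

```python
# (FrenkBinary accuracy, FrenkMulticlass accuracy), copied from the classification table.
accuracy = {
    "google/gemini-embedding-001": (0.6635, 0.3816),
    "openai/text-embedding-3-large": (0.5970, 0.3394),
    "cjvt/GaMS-1B": (0.5947, 0.3011),
    "Qwen/Qwen3-Embedding-8B": (0.5924, 0.2959),
    "intfloat/multilingual-e5-large-instruct": (0.6039, 0.3094),
    "BAAI/bge-m3": (0.5875, 0.3192),
    "Snowflake/snowflake-arctic-embed-l-v2.0": (0.5748, 0.3100),
    "openai/text-embedding-3-small": (0.5775, 0.2886),
    "Qwen/Qwen3-Embedding-4B": (0.5877, 0.2884),
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": (0.5715, 0.2667),
}

# Absolute accuracy lost when moving from 2 to 6 labels on the same texts.
drops = {model: round(binary - multi, 4) for model, (binary, multi) in accuracy.items()}

# Largest drop first.
by_drop = sorted(drops, key=drops.get, reverse=True)
for model in by_drop:
    print(f"{drops[model]:.4f}  {model}")
```

Sorting by drop rather than by raw accuracy is a quick way to see which models hold up when the label space becomes finer-grained.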
In clustering, the picture shifts:
| Model | FrenkBinary (2) | FrenkMulticlass (6) | XGenre (9) | KASTitle (9) |
|---|---|---|---|---|
| cjvt/GaMS-1B | 0.0296 | 0.0821 | 0.4230 | 0.4547 |
| Qwen/Qwen3-Embedding-4B | 0.0255 | 0.0793 | 0.4339 | 0.4180 |
| openai/text-embedding-3-large | 0.0187 | 0.0869 | 0.4354 | 0.4508 |
| Alibaba-NLP/gte-multilingual-base | 0.0040 | 0.0887 | 0.4200 | 0.4131 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.0050 | 0.1015 | 0.4023 | 0.4365 |
| intfloat/multilingual-e5-large-instruct | 0.0150 | 0.0818 | 0.4217 | 0.4738 |
| openai/text-embedding-3-small | 0.0111 | 0.0731 | 0.4192 | 0.3817 |
| Qwen/Qwen3-Embedding-8B | 0.0252 | 0.1220 | 0.3681 | 0.3809 |
| Lajavaness/bilingual-embedding-small | 0.0149 | 0.0748 | 0.4336 | 0.3756 |
| google/embeddinggemma-300m | 0.0293 | 0.0745 | 0.4146 | 0.4067 |
Here a striking inversion appears: all models score near zero on binary clustering yet achieve V-measure scores above 0.35 on the 9-class tasks. As discussed in the previous blog post, V-measure captures both cluster purity and completeness, so values close to zero suggest near-random grouping. When there are only two clusters, even small impurities destroy the score, while tasks with more classes offer richer topical structure that embeddings can separate more naturally.
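For readers who want to see what a near-zero V-measure means in practice, here is a minimal pure-Python sketch of the standard definition: the harmonic mean of homogeneity and completeness. It illustrates the metric only. It says nothing about how LES produces the clusters, and the toy label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label assignment (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(targets, given):
    """H(targets | given), computed from joint counts."""
    n = len(targets)
    joint = Counter(zip(targets, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (_, g), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity (purity) and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else max(0.0, 1.0 - conditional_entropy(classes, clusters) / h_c)
    completeness = 1.0 if h_k == 0 else max(0.0, 1.0 - conditional_entropy(clusters, classes) / h_k)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

classes = [0, 0, 1, 1]                    # toy gold labels
perfect = v_measure(classes, [1, 1, 0, 0])    # same grouping, different cluster ids
unrelated = v_measure(classes, [0, 1, 0, 1])  # clusters carry no information about classes
print(perfect)    # 1.0
print(unrelated)  # approximately 0.0
```

The second case mirrors the binary clustering results: every cluster contains both classes in equal proportion, so knowing the cluster tells you nothing about the label.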
Key takeaway
The right model depends on your label space. Both classification and clustering difficulty depend more on how semantically distinct your classes are than on how many there are. The two tasks may favour different models for the same label space.
4. Does Query-Document Structure Matter in Retrieval?
Retrieval tasks in LES vary in the relationship between query and document. Some tasks match short texts to longer texts (sentence-to-paragraph, or s2p), while others match long texts to long texts (paragraph-to-paragraph, or p2p).
In the RTVArticles dataset, we can compare both:
- s2p: title → abstract
- p2p: abstract → full article body
| Model | RTVArticles s2p | RTVArticles p2p | Average |
|---|---|---|---|
| google/gemini-embedding-001 | 0.96457 | 0.97445 | 0.96951 |
| openai/text-embedding-3-large | 0.92683 | 0.95959 | 0.94321 |
| google/text-multilingual-embedding-002 | 0.92505 | 0.95543 | 0.94024 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.92925 | 0.95067 | 0.93996 |
| Alibaba-NLP/gte-multilingual-base | 0.92765 | 0.95219 | 0.93992 |
| BAAI/bge-m3 | 0.93262 | 0.94442 | 0.93852 |
| intfloat/multilingual-e5-large-instruct | 0.91942 | 0.95079 | 0.93511 |
| Qwen/Qwen3-Embedding-8B | 0.91577 | 0.95171 | 0.93374 |
| Qwen/Qwen3-Embedding-4B | 0.89824 | 0.94060 | 0.91942 |
| openai/text-embedding-3-small | 0.82997 | 0.90855 | 0.86926 |
The results show the nDCG@10 score, which measures how well the model ranks the relevant document within the top 10 results (higher is better, 1.0 is perfect). The p2p setting is easier for every model in the table, with scores between 0.91 and 0.97, while s2p scores are more spread out, ranging from 0.83 to 0.96.
The overall retrieval winner, google/gemini-embedding-001, leads in both settings. But among the remaining models, rankings shift noticeably. BAAI/bge-m3 ranks second highest in the s2p column (behind gemini) but drops to 8th in p2p, while openai/text-embedding-3-large moves in the opposite direction, jumping from 5th in s2p to 2nd in p2p.
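The rank shifts can be reproduced from the RTVArticles table with a few lines of Python. The sketch below uses the nDCG@10 values copied from that table. The helper name `ranks` is our own.

```python
# (s2p nDCG@10, p2p nDCG@10), copied from the RTVArticles table.
scores = {
    "google/gemini-embedding-001": (0.96457, 0.97445),
    "openai/text-embedding-3-large": (0.92683, 0.95959),
    "google/text-multilingual-embedding-002": (0.92505, 0.95543),
    "Snowflake/snowflake-arctic-embed-l-v2.0": (0.92925, 0.95067),
    "Alibaba-NLP/gte-multilingual-base": (0.92765, 0.95219),
    "BAAI/bge-m3": (0.93262, 0.94442),
    "intfloat/multilingual-e5-large-instruct": (0.91942, 0.95079),
    "Qwen/Qwen3-Embedding-8B": (0.91577, 0.95171),
    "Qwen/Qwen3-Embedding-4B": (0.89824, 0.94060),
    "openai/text-embedding-3-small": (0.82997, 0.90855),
}

def ranks(column):
    """Map each model to its 1-based rank in the given column (0 = s2p, 1 = p2p)."""
    ordered = sorted(scores, key=lambda m: scores[m][column], reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}

s2p_rank, p2p_rank = ranks(0), ranks(1)

for model in scores:
    shift = s2p_rank[model] - p2p_rank[model]  # positive = moves up in p2p
    print(f"s2p #{s2p_rank[model]:>2}  p2p #{p2p_rank[model]:>2}  ({shift:+d})  {model}")
```

The same per-column ranking works for any pair of leaderboard columns, which makes it easy to check whether the setting closest to your use case agrees with the overall ordering.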
Key takeaway
Overall retrieval rankings can be misleading. If your use case involves short queries against long documents (the typical RAG setup), filter by the s2p results specifically, since model rankings shift noticeably between s2p and p2p.
Practical Summary
Based on the full LES results, the table below maps common use cases to the most relevant LES datasets and recommends models in two tiers: best raw performance and best value at a smaller size. Models marked with API are API-only and require sending data to an external service.
| Your use case | What to look at in LES | Best performance | Best value (smaller) |
|---|---|---|---|
| Sentiment classification | ParlaSent, SentiNews classification | google/gemini-embedding-001 (API), cjvt/GaMS-1B, Qwen/Qwen3-Embedding-8B | cjvt/GaMS-1B, intfloat/multilingual-e5-large-instruct, BAAI/bge-m3 |
| Hate speech detection | FrenkBinary/Multiclass (check recall) | google/gemini-embedding-001 (API), intfloat/multilingual-e5-large-instruct, openai/text-embedding-3-large (API) | intfloat/multilingual-e5-large-instruct, BAAI/bge-m3 |
| Genre or topic classification | X-GENRE, Sib200 classification | google/gemini-embedding-001 (API), openai/text-embedding-3-large (API), cjvt/GaMS-1B | BAAI/bge-m3, cjvt/GaMS-1B |
| Topic clustering | Clustering V-measure across datasets | cjvt/GaMS-1B, Qwen/Qwen3-Embedding-4B, openai/text-embedding-3-large (API) | cjvt/GaMS-1B, Alibaba-NLP/gte-multilingual-base, Snowflake/snowflake-arctic-embed-l-v2.0 |
| RAG, small corpus (<50k docs) | WikipediaQA 10k, Zakonodaja | google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B | BAAI/bge-m3, intfloat/multilingual-e5-large-instruct |
| RAG, large corpus (>100k docs) | WikipediaQA 300k+ | google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B | intfloat/multilingual-e5-large-instruct |
| News retrieval | RTVArticles s2p and p2p | google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B | BAAI/bge-m3, Alibaba-NLP/gte-multilingual-base |
Final Thoughts
Over three posts we introduced LES, explained its evaluation metrics, and analysed what the results reveal. The central finding is simple but important: there is no single best Slovene embedding model. The right choice depends on your task type, your corpus size, the granularity of your categories, and the tradeoffs you are willing to accept between performance and cost.
LES does not yet cover every embedding use case. Tasks such as semantic textual similarity, bitext mining, reranking, and pair classification are not part of the current benchmark. Some domains remain underrepresented and certain datasets are small enough that results should be interpreted with caution.
The right model is the one that fits your task, your corpus, and your constraints. LES gives you the evidence to find it.
If you have a dataset that could improve the benchmark, feel free to contact us at info@valira.ai.