LES: Choosing the Right Slovene Embedding Model

In the first post we introduced the LES benchmark and described the datasets it uses. In the second post we explained the evaluation metrics and showed why the "best" model depends on the task. In this final post we dig into the full results and extract practical guidance for choosing a Slovene embedding model.

The analysis below is based on results from May 2026. As new models are added to the leaderboard, specific rankings may shift, but the patterns described here should remain relevant.

New to LES? Here's a quick recap from the first and second posts:

• Classification: assign a text to a predefined category (e.g. sentiment, genre, topic). Evaluated with accuracy.
• Clustering: group similar texts together without using labels. Evaluated with V-measure (0 = random grouping, 1 = perfect clusters).
• Retrieval: find the most relevant document for a given query (the core of RAG systems). Evaluated with nDCG@10 (0 = worst ranking, 1 = perfect top-10 ranking).
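
To make the retrieval score concrete, here is a minimal sketch of nDCG@10 under the assumption of binary relevance (a document is either relevant to the query or it is not). The function name and its inputs are illustrative, and the benchmark's own implementation may differ in details.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance (assumes ranked_ids has no duplicates)."""
    relevant = set(relevant_ids)
    # Discounted gain: a hit at 0-based position `rank` is worth 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal ranking: all relevant documents placed at the very top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant document, returned first, second, or not at all.
print(ndcg_at_k(["a", "b", "c"], ["a"]))
print(ndcg_at_k(["b", "a", "c"], ["a"]))
print(ndcg_at_k(["b", "c", "d"], ["a"]))
```

With a single relevant document per query, the score depends only on the position at which that document appears in the top 10.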

Patterns Worth Noting

The leaderboard contains results for 25 models across 26 tasks. Rather than walking through every number, we focus on four questions that come up repeatedly when choosing a model for a real application.

1. Does the Evaluation Dataset Corpus Size Affect Which Model Is Best?

The WikipediaQA retrieval dataset in LES is evaluated at increasing corpus sizes: approximately 10k, 50k, 100k, 200k, and 330k documents. The queries remain the same; only the amount of noise (irrelevant documents that the model must filter through) increases.
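
If you want to run the same kind of experiment on your own data, one possible way to build such corpora is sketched below. How LES constructs its subsets is not described here, so the approach (nested subsets with a fixed random seed) and the name `nested_corpora` are assumptions for illustration only.

```python
import random


def nested_corpora(relevant_ids, distractor_ids, sizes, seed=0):
    """Build corpora of increasing size that all contain the relevant documents.

    Every larger corpus is a superset of the smaller ones, so the only thing
    that changes between evaluation runs is the number of distractors.
    """
    rng = random.Random(seed)
    shuffled = list(distractor_ids)
    rng.shuffle(shuffled)
    relevant = list(relevant_ids)
    corpora = {}
    for size in sorted(sizes):
        n_distractors = size - len(relevant)
        if n_distractors < 0 or n_distractors > len(shuffled):
            raise ValueError(f"cannot build a corpus of size {size}")
        # Taking a prefix of one shuffled list keeps the corpora nested.
        corpora[size] = relevant + shuffled[:n_distractors]
    return corpora


# Toy usage: 2 relevant documents and 10 distractors.
corpora = nested_corpora(["r1", "r2"], [f"d{i}" for i in range(10)], [4, 8])
print({size: len(docs) for size, docs in corpora.items()})
```

Keeping the subsets nested means that a score difference between two corpus sizes can be attributed to the added distractors rather than to a different sample.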

As shown in Figure 1, all models degrade as the corpus grows, but the rate of degradation varies significantly. Among the top 10 models, google/gemini-embedding-001 experiences only a 6% drop in nDCG@10 from the smallest to the largest corpus, while openai/text-embedding-3-large drops by 14%.

More importantly, rankings shift. Qwen/Qwen3-Embedding-8B and BAAI/bge-m3 start at similar scores on the 10k corpus, but by 200k documents their gap widens to nearly 2%. intfloat/multilingual-e5-small begins as the weakest among the top 10 but overtakes two models by the time the full corpus is used.

At the extreme end, cjvt/GaMS-1B loses 27% of its performance, the largest drop among all evaluated models.

Key takeaway

Rankings shift as corpus size grows. Always evaluate on a corpus that matches your production scale since a model competitive at 10k documents may fall behind at 300k.

2. Is the Largest Model Worth the Cost?

Figure 2 shows model size against the overall LES score, which is simply the mean performance over the classification, clustering, and retrieval measures.

The starkest example: Snowflake/snowflake-arctic-embed-l-v2.0 (0.6B parameters) scores 55.63 overall, while Qwen/Qwen3-Embedding-8B (8B parameters, 13× larger) scores 55.72. That is a 0.09 point improvement for 13× the compute.
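
The arithmetic behind this comparison can be reproduced in a few lines. The parameter counts and scores are the ones quoted above; the variable names are illustrative.

```python
# Parameter counts (in billions) and overall LES scores quoted in the text.
small = {"name": "Snowflake/snowflake-arctic-embed-l-v2.0", "params_b": 0.6, "score": 55.63}
large = {"name": "Qwen/Qwen3-Embedding-8B", "params_b": 8.0, "score": 55.72}

size_ratio = large["params_b"] / small["params_b"]
score_gain = large["score"] - small["score"]
relative_gain_pct = 100 * score_gain / small["score"]

print(f"size ratio:    {size_ratio:.1f}x")
print(f"score gain:    {score_gain:.2f} points")
print(f"relative gain: {relative_gain_pct:.2f}%")
```

Parameter count is only a proxy for cost. Actual latency and memory use also depend on hardware, batch size, and sequence length, so it is worth measuring them on your own setup.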

At the smaller end, Alibaba-NLP/gte-multilingual-base (0.3B parameters) scores 53.52, roughly 9% behind the top model but at a fraction of the size. For many production scenarios, that tradeoff is worth making.

Furthermore, higher-dimensional embeddings increase storage costs and slow down similarity search. If a 768-dim model scores close to a 1024-dim model, the smaller embedding may be preferable in a vector database at scale. Resource differences extend beyond embedding size: Qwen/Qwen3-Embedding-8B has a context window of 32,768 tokens and 8B parameters, while intfloat/multilingual-e5-large-instruct uses a 512-token window at 0.6B parameters, yet the smaller model performs only 3% worse overall. Depending on your latency and infrastructure constraints, that tradeoff may be easy to accept.
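
As a rough sense of scale for the storage argument, the sketch below estimates the size of the raw vectors alone, assuming 32-bit floats and ignoring any index overhead or compression a vector database may add. The function name and the corpus size of 330,000 (the approximate size of the largest WikipediaQA corpus) are illustrative choices.

```python
def raw_vector_size_gib(num_vectors, dim, bytes_per_value=4):
    """Size of the raw embedding matrix in GiB (float32 by default)."""
    return num_vectors * dim * bytes_per_value / 2**30


num_docs = 330_000  # roughly the largest WikipediaQA corpus size
for dim in (768, 1024):
    print(f"{dim}-dim: {raw_vector_size_gib(num_docs, dim):.2f} GiB")
```

At this corpus size the difference is roughly 0.94 GiB versus 1.26 GiB, which is modest. Because raw storage grows linearly with both the number of vectors and the dimension, the same 25% saving becomes more noticeable as the corpus grows.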

Key takeaway

A bigger model is not always better. The 0.3B–0.6B parameter range consistently delivers 90–95% of the top score at a fraction of the compute cost. This is a worthwhile tradeoff for most production deployments.

3. Does the Number of Classes Matter?

To isolate the effect of class granularity, we compare model performance across datasets with different numbers of target classes: FrenkBinary and FrenkMulticlass (hate speech detection, using the same texts but with 2 and 6 labels respectively), X-GENRE (text genre classification, 9 classes), and KASTitle (university thesis classification by faculty, 9 classes). The table below displays classification performance on these four datasets with the number of classes shown in parentheses. The measure displayed is accuracy (proportion of correctly classified texts, higher is better).

Model	FrenkBinary (2)	FrenkMulticlass (6)	X-GENRE (9)	KASTitle (9)
google/gemini-embedding-001	0.6635	0.3816	0.5559	0.5958
openai/text-embedding-3-large	0.5970	0.3394	0.4877	0.5838
cjvt/GaMS-1B	0.5947	0.3011	0.5268	0.5714
Qwen/Qwen3-Embedding-8B	0.5924	0.2959	0.4363	0.5531
intfloat/multilingual-e5-large-instruct	0.6039	0.3094	0.4413	0.5647
BAAI/bge-m3	0.5875	0.3192	0.4503	0.5226
Snowflake/snowflake-arctic-embed-l-v2.0	0.5748	0.3100	0.4436	0.5487
openai/text-embedding-3-small	0.5775	0.2886	0.4553	0.5227
Qwen/Qwen3-Embedding-4B	0.5877	0.2884	0.4117	0.5399
sentence-transformers/paraphrase-multilingual-mpnet-base-v2	0.5715	0.2667	0.4084	0.5402

Binary classification is easier for every model, but the size of the drop when moving to more classes varies.

intfloat/multilingual-e5-large-instruct ranks 6th overall on the classification tasks but places 2nd on FrenkBinary. However, its performance drops more steeply than most on the multiclass version, suggesting it captures broad polarity well but struggles with fine-grained distinctions.
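
To compare how steeply each model drops between the two Frenk variants, you can compute the drop both in absolute accuracy and relative to the binary score. The sketch below uses the accuracy values from the classification table above; the helper name `frenk_drops` is illustrative.

```python
# (FrenkBinary accuracy, FrenkMulticlass accuracy), copied from the table above.
frenk_accuracy = {
    "google/gemini-embedding-001": (0.6635, 0.3816),
    "openai/text-embedding-3-large": (0.5970, 0.3394),
    "cjvt/GaMS-1B": (0.5947, 0.3011),
    "Qwen/Qwen3-Embedding-8B": (0.5924, 0.2959),
    "intfloat/multilingual-e5-large-instruct": (0.6039, 0.3094),
    "BAAI/bge-m3": (0.5875, 0.3192),
    "Snowflake/snowflake-arctic-embed-l-v2.0": (0.5748, 0.3100),
    "openai/text-embedding-3-small": (0.5775, 0.2886),
    "Qwen/Qwen3-Embedding-4B": (0.5877, 0.2884),
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": (0.5715, 0.2667),
}


def frenk_drops(scores):
    """Return (model, absolute drop, relative drop), largest absolute drop first."""
    rows = []
    for model, (binary, multiclass) in scores.items():
        absolute = binary - multiclass
        rows.append((model, absolute, absolute / binary))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for model, absolute, relative in frenk_drops(frenk_accuracy):
    print(f"{model:<62} {absolute:.4f}  {relative:.1%}")
```

The two views can order models differently, so it helps to decide up front which one matters for your application.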

Interestingly, the 9-class tasks (X-GENRE and KASTitle) do not show a larger drop than the 6-class FrenkMulticlass: every model in the table scores higher on both 9-class tasks than on FrenkMulticlass. This suggests that difficulty depends not just on the number of classes but on how semantically distinct those classes are. Distinguishing between faculties or text genres involves clearer topical boundaries than distinguishing between types of offensive speech.

In clustering, the picture shifts:

Model	FrenkBinary (2)	FrenkMulticlass (6)	X-GENRE (9)	KASTitle (9)
cjvt/GaMS-1B	0.0296	0.0821	0.4230	0.4547
Qwen/Qwen3-Embedding-4B	0.0255	0.0793	0.4339	0.4180
openai/text-embedding-3-large	0.0187	0.0869	0.4354	0.4508
Alibaba-NLP/gte-multilingual-base	0.0040	0.0887	0.4200	0.4131
Snowflake/snowflake-arctic-embed-l-v2.0	0.0050	0.1015	0.4023	0.4365
intfloat/multilingual-e5-large-instruct	0.0150	0.0818	0.4217	0.4738
openai/text-embedding-3-small	0.0111	0.0731	0.4192	0.3817
Qwen/Qwen3-Embedding-8B	0.0252	0.1220	0.3681	0.3809
Lajavaness/bilingual-embedding-small	0.0149	0.0748	0.4336	0.3756
google/embeddinggemma-300m	0.0293	0.0745	0.4146	0.4067

Here a striking inversion appears: all models score near zero on binary clustering and stay below 0.13 on the 6-class FrenkMulticlass, yet achieve V-measure scores above 0.35 on the 9-class tasks. As discussed in the previous blog post, V-measure captures both cluster purity and completeness, so values close to zero suggest near-random grouping. When there are only two clusters, even small impurities destroy the score, while tasks with more classes offer richer topical structure that embeddings can separate more naturally.
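
To see how quickly V-measure falls in the two-cluster case, here is a minimal pure-Python sketch of the standard definition (the harmonic mean of homogeneity and completeness), applied to a synthetic example with two balanced classes. The helper names and the synthetic data are assumptions for illustration and are not taken from the benchmark.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy of a label sequence, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(targets, given):
    """H(targets | given), in nats."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_classes, h_clusters = _entropy(classes), _entropy(clusters)
    homogeneity = 1.0 if h_classes == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_classes
    completeness = 1.0 if h_clusters == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_clusters
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def two_cluster_score(n_per_class, n_misplaced):
    """Two balanced classes; n_misplaced items of each class land in the wrong cluster."""
    keep = n_per_class - n_misplaced
    classes = [0] * n_per_class + [1] * n_per_class
    clusters = [0] * keep + [1] * n_misplaced + [1] * keep + [0] * n_misplaced
    return v_measure(classes, clusters)


for misplaced in (0, 10, 40, 50):
    print(misplaced, round(two_cluster_score(100, misplaced), 3))
```

In this synthetic setup, misplacing 10 of every 100 items already brings the score down to about 0.53, and misplacing 40 of every 100 brings it to about 0.03, even though 60% of items still sit in the majority cluster for their class.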

Key takeaway

The right model depends on your label space. Both classification and clustering difficulty depend more on how semantically distinct your classes are than on how many there are. The two tasks may favour different models for the same label space.

4. Does Query-Document Structure Matter in Retrieval?

Retrieval tasks in LES vary in the relationship between query and document. Some tasks match short texts to longer texts (sentence-to-paragraph, or s2p), while others match long texts to long texts (paragraph-to-paragraph, or p2p).

In the RTVArticles dataset, we can compare both:

• s2p: title → abstract
• p2p: abstract → full article body

Model	RTVArticles s2p	RTVArticles p2p	Average
google/gemini-embedding-001	0.96457	0.97445	0.96951
openai/text-embedding-3-large	0.92683	0.95959	0.94321
google/text-multilingual-embedding-002	0.92505	0.95543	0.94024
Snowflake/snowflake-arctic-embed-l-v2.0	0.92925	0.95067	0.93996
Alibaba-NLP/gte-multilingual-base	0.92765	0.95219	0.93992
BAAI/bge-m3	0.93262	0.94442	0.93852
intfloat/multilingual-e5-large-instruct	0.91942	0.95079	0.93511
Qwen/Qwen3-Embedding-8B	0.91577	0.95171	0.93374
Qwen/Qwen3-Embedding-4B	0.89824	0.94060	0.91942
openai/text-embedding-3-small	0.82997	0.90855	0.86926

The results show the nDCG@10 score, which measures how well the model ranks the relevant document in the top 10 results (higher is better, 1.0 is perfect). The p2p setting is easier for every model in the table, with scores between 0.91 and 0.97, while s2p scores are lower and more spread out, between 0.83 and 0.96.

The overall retrieval winner, google/gemini-embedding-001, leads in both settings. But among the remaining models, rankings shift noticeably. BAAI/bge-m3 ranks second highest in the s2p column (behind gemini) but drops to 8th in p2p, while openai/text-embedding-3-large moves in the opposite direction, jumping from 5th in s2p to 2nd in p2p.
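
The rank shifts can be read directly off the table. The sketch below recomputes each model's position in the s2p and p2p columns from the values above; the helper name `ranks` is illustrative.

```python
# (s2p nDCG@10, p2p nDCG@10), copied from the RTVArticles table above.
rtv_ndcg = {
    "google/gemini-embedding-001": (0.96457, 0.97445),
    "openai/text-embedding-3-large": (0.92683, 0.95959),
    "google/text-multilingual-embedding-002": (0.92505, 0.95543),
    "Snowflake/snowflake-arctic-embed-l-v2.0": (0.92925, 0.95067),
    "Alibaba-NLP/gte-multilingual-base": (0.92765, 0.95219),
    "BAAI/bge-m3": (0.93262, 0.94442),
    "intfloat/multilingual-e5-large-instruct": (0.91942, 0.95079),
    "Qwen/Qwen3-Embedding-8B": (0.91577, 0.95171),
    "Qwen/Qwen3-Embedding-4B": (0.89824, 0.94060),
    "openai/text-embedding-3-small": (0.82997, 0.90855),
}


def ranks(scores, column):
    """Map each model to its 1-based rank in the given column (higher score ranks first)."""
    ordered = sorted(scores, key=lambda model: scores[model][column], reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


s2p_rank = ranks(rtv_ndcg, 0)
p2p_rank = ranks(rtv_ndcg, 1)
for model in rtv_ndcg:
    shift = s2p_rank[model] - p2p_rank[model]  # positive: ranks better in p2p
    print(f"{model:<42} s2p #{s2p_rank[model]:<2} p2p #{p2p_rank[model]:<2} shift {shift:+d}")
```

Running the same comparison on the setting that matches your own query and document lengths is a quick way to check whether the overall ranking applies to you.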

Key takeaway

Overall retrieval rankings can be misleading. If your use case involves short queries against long documents (the typical RAG setup), filter by the s2p results specifically, since model rankings shift noticeably between s2p and p2p.

Practical Summary

Based on the full LES results, here are concrete starting points for common use cases. The table below maps common use cases to the most relevant LES datasets and recommends models in two tiers: best raw performance and best value at a smaller size. Models marked with (API) are API-only and require sending data to an external service.

Your use case	What to look at in LES	Best performance	Best value (smaller)
Sentiment classification	ParlaSent, SentiNews classification	google/gemini-embedding-001 (API), cjvt/GaMS-1B, Qwen/Qwen3-Embedding-8B	cjvt/GaMS-1B, intfloat/multilingual-e5-large-instruct, BAAI/bge-m3
Hate speech detection	FrenkBinary/Multiclass (check recall)	google/gemini-embedding-001 (API), intfloat/multilingual-e5-large-instruct, openai/text-embedding-3-large (API)	intfloat/multilingual-e5-large-instruct, BAAI/bge-m3
Genre or topic classification	X-GENRE, Sib200 classification	google/gemini-embedding-001 (API), openai/text-embedding-3-large (API), cjvt/GaMS-1B	BAAI/bge-m3, cjvt/GaMS-1B
Topic clustering	Clustering V-measure across datasets	cjvt/GaMS-1B, Qwen/Qwen3-Embedding-4B, openai/text-embedding-3-large (API)	cjvt/GaMS-1B, Alibaba-NLP/gte-multilingual-base, Snowflake/snowflake-arctic-embed-l-v2.0
RAG: small corpus (<50k docs)	WikipediaQA 10k, Zakonodaja	google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B	BAAI/bge-m3, intfloat/multilingual-e5-large-instruct
RAG: large corpus (>100k docs)	WikipediaQA 300k+	google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B	intfloat/multilingual-e5-large-instruct
News retrieval	RTVArticles s2p and p2p	google/gemini-embedding-001 (API), Snowflake/snowflake-arctic-embed-l-v2.0, Qwen/Qwen3-Embedding-8B	BAAI/bge-m3, Alibaba-NLP/gte-multilingual-base

Final Thoughts

Over three posts we introduced LES, explained its evaluation metrics, and analysed what the results reveal. The central finding is simple but important: there is no single best Slovene embedding model. The right choice depends on your task type, your corpus size, the granularity of your categories, and the tradeoffs you are willing to accept between performance and cost.

LES does not yet cover every embedding use case. Tasks such as semantic textual similarity, bitext mining, reranking, and pair classification are not part of the current benchmark. Some domains remain underrepresented and certain datasets are small enough that results should be interpreted with caution.

The right model is the one that fits your task, your corpus, and your constraints. LES gives you the evidence to find it.

If you have a dataset that could improve the benchmark, feel free to contact us at info@valira.ai.

Neli Čatar

June 8, 2026
