Getting stuck on a problem used to mean searching documentation or forums. Today, it often means opening an LLM. It works surprisingly well. But many real-world problems are not conversational. We may need to classify thousands of documents, cluster articles by topic or build a retrieval system for Slovene legal texts. In these cases, the largest available model is often not the best choice. Smaller embedding models can be faster, cheaper and, when chosen correctly, more accurate.

The problem? Choosing the right embedding model for Slovene is difficult.

That is why we built Lestvica embeddingov za slovenščino (LES), a Slovene-focused embedding benchmark based on the MTEB evaluation framework. LES evaluates embedding models across classification, clustering and retrieval tasks, using exclusively Slovene datasets. This allows us to measure how well different models capture semantic similarity in Slovene text.

In this first post, we introduce the LES benchmark and the datasets used in its evaluation. In upcoming posts, we will present the initial results, compare model performance on Slovene tasks, and discuss what drives these differences.

Why Slovene Needs Its Own Benchmark

Multilingual benchmarks often contain very little Slovene data. As a result, models that perform well globally may still perform poorly on Slovene tasks. Languages differ in morphology, vocabulary and training data availability. These differences can significantly affect how well embedding models capture semantic meaning.

LES addresses this by providing:

  • A Slovene-only evaluation benchmark
  • Multiple task types reflecting real-world NLP use cases
  • A public leaderboard for model comparison

Instead of guessing which model works best for Slovene, we can now measure it.

Tasks Evaluated in LES

LES currently evaluates embedding models across three task categories: Classification, Clustering and Retrieval.

In classification tasks, the model must assign a text to one of several predefined categories. Typical examples include sentiment analysis, topic classification and hate speech detection. In clustering tasks, the model groups semantically similar texts together without using labels during inference. High-quality embeddings should naturally place texts from the same category into the same clusters. Lastly, in retrieval tasks, the model must identify the most relevant document for a given query. Retrieval tasks measure how well embeddings capture semantic similarity between queries and documents.
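
As a toy illustration of the classification setting, a pipeline might embed labeled examples once and then classify a new text by majority vote among its nearest neighbours in embedding space. This is only a sketch: the vectors and labels below are invented, and a real setup would obtain the vectors from an embedding model.

```python
import numpy as np

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Predict a label by majority vote among the k most similar embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    nearest = np.argsort(-(t @ q))[:k]          # indices of top-k cosine scores
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Made-up 2-D "embeddings" for two classes:
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["a", "a", "b", "b"]
pred = knn_predict(np.array([1.0, 0.05]), train, labels, k=3)  # -> "a"
```

Classification over frozen embeddings is cheap: the expensive step (encoding) happens once per text, after which a simple classifier on top is often enough.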

Because classification and clustering both rely on labeled text datasets, LES evaluates them using the same underlying datasets. Retrieval tasks require a different structure consisting of queries paired with relevant documents.
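
The retrieval setting can be sketched in the same spirit: embed the query and every document, then rank documents by cosine similarity. A minimal, self-contained example with invented vectors standing in for real embeddings:

```python
import numpy as np

def cosine_sim(query, docs):
    """Cosine similarity between one query vector and each document row."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

docs = np.array([
    [0.9, 0.1, 0.0],   # "document" 0
    [0.0, 1.0, 0.1],   # "document" 1
    [0.1, 0.9, 0.2],   # "document" 2
])
query = np.array([0.0, 0.95, 0.15])

scores = cosine_sim(query, docs)
best = int(np.argmax(scores))  # index of the highest-scoring document
```

In practice the document embeddings are precomputed and indexed, so answering a query reduces to one encoding call plus a similarity search.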

Datasets

LES evaluates models across datasets that differ in domain, size and task type.

Dataset | Nr. of examples | Nr. of labels | Source | License

Classification and Clustering
FrenkBinary | 10k | 2 (offensive, acceptable) | HF | –
FrenkMulticlass | 10k | 6 (acceptable or type of offense) | HF | –
X-GENRE | 1.8k | 9 (genre) | HF | CC-BY-SA-4.0
ParlaSent 1.0 | 2.6k | 3 (positive, negative, neutral) | HF | CC-BY-SA-4.0
Sib200 | 1k | 7 (categories) | HF | CC-BY-SA-4.0
SentiNews | 10.4k | 3 (positive, negative, neutral) | HF | CC-BY-SA-4.0
KAS (titles, abstracts) | 22.4k | 9 (faculties) | CLARIN.SI | ACA ID-BY-NC-INF-NORED v1.0

Retrieval
KAS (titles, abstracts) | 63k | – | CLARIN.SI | ACA ID-BY-NC-INF-NORED v1.0
RTVArticles | 21k | – | RTV Slovenija | –
WikipediaQA | 323k | – | Wikipedija | –
Zakonodaja | 16.4k | – | Lexpera | –
SodnaPraksa | 130k | – | Lexpera | –

Below is a short description of each dataset and its role in the benchmark.

Classification and Clustering
FrenkBinary & FrenkMulticlass

Hate speech detection datasets. Both contain the same texts; the binary version labels texts as offensive or acceptable, while the multiclass version additionally distinguishes between different types of offensive content. These datasets test whether embeddings can capture subtle semantic differences in offensive language.

X-GENRE

Requires models to determine the genre of a text, such as news, instruction, prose and lyrical text. This task evaluates whether embeddings can capture stylistic and structural properties of texts.

ParlaSent

Contains Slovene parliamentary debates annotated with sentiment labels. Each text is classified as positive, negative or neutral. The dataset evaluates how well embeddings represent sentiment in political discourse.

Sib200

A relatively small topic classification dataset. Texts must be assigned to categories such as politics, entertainment, health and geography. Despite its small size, it provides a useful signal about how embeddings behave in low-data scenarios.

SentiNews

A sentiment analysis dataset built from Slovene news content. Models must determine whether the text expresses positive, negative or neutral sentiment.

KAS

The dataset consists of Slovene university thesis metadata. For classification and clustering tasks, models must predict the faculty based on the thesis title or abstract. Only the nine largest faculties were retained. This dataset evaluates whether embeddings can capture academic domain semantics.

The data is available in the CLARIN.SI repository and can be accessed after identification and acceptance of the ACA ID-BY-NC-INF-NORED license.

Retrieval Datasets
KAS

The model must retrieve the correct thesis abstract given the thesis title. This task evaluates semantic similarity between short queries and longer academic documents.

The data is available in the CLARIN.SI repository and can be accessed after identification and acceptance of the ACA ID-BY-NC-INF-NORED license.

RTVArticles

A collection of titles, abstracts and articles from RTV Slovenija. The retrieval tasks include retrieving the abstract given a title and retrieving the article body given the abstract. This dataset evaluates embeddings in a news retrieval scenario.

The dataset was put together by automatically collecting publicly available articles from the RTV Slovenija website and keeping only those written in Slovene. There was no manual selection involved, so the collection reflects a fairly natural mix of topics you would expect from a general news source. If you would like to work with a similar dataset, you can build your own version by collecting the content directly from RTV Slovenija.

WikipediaQA

Combines several datasets built from Slovene Wikipedia. Approximately 10,000 questions were automatically generated from Wikipedia articles using an LLM. The benchmark is evaluated on increasingly large corpora:

  • 10k documents
  • 50k documents
  • 100k documents
  • 200k documents
  • ~300k documents

This setup allows us to study how embedding models behave as dataset size and noise increase.

The documents come from Slovene Wikipedia and were gathered automatically, without focusing on any specific topics. This keeps the dataset broad and a bit noisy, which is useful when testing how models behave in more realistic conditions. For the same reasons, the dataset itself is not distributed. If you’re interested in reproducing it, you can create a similar dataset by collecting the content directly from Wikipedia.

Zakonodaja

A dataset of Slovene legal documents paired with automatically generated questions. The retrieval task requires matching each question with the corresponding law or legal provision. Document length varies significantly, ranging from 5 to 10.2k words, which makes this a challenging retrieval task.
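
Documents of this length typically exceed an embedding model's input window. A common workaround (not necessarily the approach used in LES) is to split each document into overlapping word windows, embed each chunk separately, and let the best-scoring chunk represent the whole document at retrieval time. A minimal chunker along those lines:

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split a long text into overlapping windows of at most chunk_size words.

    Consecutive chunks share `overlap` words so that sentences falling on a
    chunk boundary still appear intact in at least one chunk.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)]
    step = chunk_size - overlap          # assumes overlap < chunk_size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # last window already reaches the end
    return chunks
```

A 500-word document with the defaults yields three chunks of up to 200 words, each sharing 50 words with its neighbour.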

The data was provided by Lexpera and is therefore not publicly available.

Sodna Praksa

A dataset of Slovene judicial decisions where the retrieval task requires matching a legal case with the corresponding law or legal provision. This benchmark evaluates embeddings in a legal reasoning scenario, where queries and documents involve complex legal language and implicit references to legislation.

The dataset was provided by Lexpera and is not publicly available.

What Comes Next

This post introduced the idea behind LES. In the next posts we will take a closer look at the benchmark and its results:

  • How the evaluation pipeline is implemented.
  • Which metrics are most informative in different scenarios.
  • How to interpret the results when selecting a model for your task.
  • Which embedding models perform best for Slovene tasks.

Key Insights (and a Few Surprises)

Here are some interesting observations that we discovered while analysing the LES results:

  • The best overall performing model is the Qwen3 8B embedding model, although it does not rank highest in any individual task.
  • OpenAI's text-embedding-3-large model ranked third overall.
  • The Lajavaness French–English bilingual embedding model is based on the multilingual E5 architecture and ranked one place above the original E5 model.
  • The best-performing Slovene model was GaMS 1B, which ranked 16th, six places above the 2B GaMS model.
  • GaMS 1B performed best on the clustering task and ranked second on the classification task.

Keep in mind that these insights are based on the current set of models and datasets in LES. As we expand the benchmark and include more models, we may discover new patterns and insights about how different embedding models perform across various tasks and domains.

Try LES Yourself

If you have a dataset that could improve the benchmark, feel free to contact us at info@valira.ai.