We are introducing the Slovenian Christmas Songs Dataset, a curated collection of 54 Slovenian Christmas song lyrics, cleaned and structured for NLP experimentation. It’s a small NLP playground suitable for quick analyses and a reminder that datasets don’t have to be big to be useful. The dataset is available on 🤗 Hugging Face, where anyone can download it or suggest changes!
Unwrapping the Slovenian Christmas Songs Dataset
The dataset was manually constructed by reviewing multiple YouTube and Spotify playlists and researching additional sources on the web. It focuses on traditional and contemporary Christmas music in the Slovenian language.
Columns:
- title – The name of the song.
- artist – The performer or artist. Left empty if the song is a traditional folk song.
- year – The release year. Empty if unknown or hard to determine.
- native – Indicates whether the song is originally Slovenian or a cover of another song.
- lyrics – The song's lyrics (if available).
- lyrics_source – Where the lyrics were obtained from.
- song_link – A link to listen to the song.
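To make the schema concrete, here is a minimal sketch of how the empty-value conventions could be handled in code. It assumes the data has been exported to CSV, which is our assumption and not necessarily the format on the Hub. The rows are placeholders, not real entries, and the helper names (`parse_year`, `load_songs`, `is_traditional`, `has_lyrics`) are hypothetical.

```python
import csv
import io
from typing import Optional

# Placeholder rows that only mimic the documented columns.
# They are NOT real dataset entries.
SAMPLE_CSV = """title,artist,year,native,lyrics,lyrics_source,song_link
Example Song A,Example Artist,1999,placeholder,la la la,placeholder,placeholder
Example Song B,,,placeholder,,placeholder,placeholder
"""


def parse_year(raw: str) -> Optional[int]:
    """Return the year as an int, or None when the field is empty."""
    raw = raw.strip()
    return int(raw) if raw else None


def load_songs(text: str):
    """Parse CSV text into dicts and add two convenience flags."""
    songs = []
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = parse_year(row["year"])
        # Per the column description, an empty artist marks a traditional folk song.
        row["is_traditional"] = row["artist"].strip() == ""
        # Lyrics are only present "if available", so record that explicitly.
        row["has_lyrics"] = row["lyrics"].strip() != ""
        songs.append(row)
    return songs


songs = load_songs(SAMPLE_CSV)
with_lyrics = [song["title"] for song in songs if song["has_lyrics"]]
print(with_lyrics)
```

Filtering on `has_lyrics` first is a sensible habit here, since any text analysis would otherwise silently include empty strings.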
Below, we briefly explore the data with simple lexical and sentiment analysis. All code is available here.
What words define Slovenian Christmas songs?
As you might expect, raw lyrics are not exactly analysis-ready. They come with punctuation, repetition, grammatical intricacies, and a surprising number of ways to say the same thing. So before drawing any conclusions about snow, peace, or love, we put the text through a preprocessing pipeline.
Step 1: Lowercase Everything
Is it Božič or božič? This is not your high school Slovenian class, so here we don't care. Lowercase everything!
Step 2: Tokenization and Stopwords
Next, we split the lyrics into individual words. Using the nltk library, we remove standard Slovenian stopwords: words like in, je, or na that are essential for grammar but not very helpful for understanding meaning.
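Steps 1 and 2 can be sketched with the standard library alone. This is a simplified stand-in, not the code behind the post: the stopword set contains only the three examples mentioned above (the post uses NLTK's Slovenian list), the regex tokenizer is our assumption, and the sample line is invented.

```python
import re
from collections import Counter

# Tiny stand-in stopword set built from the examples in the text.
# The post itself relies on NLTK's Slovenian stopword list.
STOPWORDS = {"in", "je", "na"}


def tokenize(text):
    """Lowercase the text and return runs of letters (Unicode-aware)."""
    # [^\W\d_] matches any word character that is not a digit or underscore.
    return re.findall(r"[^\W\d_]+", text.lower())


def count_words(text, stopwords=STOPWORDS):
    """Count the tokens that are not stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Invented sample line, not taken from the dataset.
sample = "Božič je tu in sneg je na strehi, božič!"
counts = count_words(sample)
print(counts.most_common(3))
```

Because lowercasing happens before counting, Božič and božič end up in the same bucket. The numbers listed next come from the full dataset, not from this toy line.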
# top 10 most frequent words and their corresponding counts
[('božič', 124),
('noč', 45),
('zdaj', 44),
('ljudi', 31),
('sneg', 30),
('srce', 28),
('dan', 25),
('čas', 25),
('sneži', 25),
('ljudje', 23)]
Step 3: Lemmatization (Or: Why "Ljudi" and "Ljudje" Are the Same Thing)
Slovenian is a beautifully inflected language, which is great for poetry and slightly less great for word counts. The same concept can appear in many grammatical forms. For example, ljudi and ljudje are different words on the surface, but semantically they refer to the same thing: people.
To avoid counting grammatical variations as separate ideas, we use the classla library to lemmatize all words and reduce them to their base forms. After this step, different cases and tenses collapse into a single representative form. At the end of the day the analysis is about concepts, not grammar.
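The effect of lemmatization on the counts can be illustrated without any NLP library. In the sketch below, a hand-written lookup table stands in for classla. We assume both ljudi and ljudje map to the lemma človek, which is consistent with the lemmatized counts in this post. The extra entry for noči is ours, added purely for illustration.

```python
from collections import Counter

# Hand-written lookup standing in for a real lemmatizer such as classla.
# The "noči" entry is an illustrative assumption, not taken from the post.
LEMMAS = {"ljudi": "človek", "ljudje": "človek", "noči": "noč"}


def lemmatize(token):
    """Return the base form if known, otherwise the token unchanged."""
    return LEMMAS.get(token, token)


def merge_counts(surface_counts):
    """Re-key a Counter of surface forms by lemma, summing the counts."""
    merged = Counter()
    for token, n in surface_counts.items():
        merged[lemmatize(token)] += n
    return merged


# Counts for these three surface forms are copied from the Step 2 list.
surface = Counter({"božič": 124, "ljudi": 31, "ljudje": 23})
lemma_counts = merge_counts(surface)
print(lemma_counts.most_common())
```

Here človek reaches 54 from just two surface forms. In the full pipeline, other inflected forms presumably add to the total.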
# top 10 most frequent words and their corresponding counts
[('biti', 553),
('jaz', 222),
('ves', 196),
('božič', 125),
('naj', 124),
('noč', 66),
('človek', 62),
('spet', 48),
('zdaj', 44),
('dan', 42)]
Step 4: Less Grammar, More Christmas
After lemmatization, very frequent words such as biti (to be) or jaz (I) began to dominate the counts. While essential for grammar, these words appear in almost any text and add little insight into what distinguishes Christmas songs.
To make the results easier to interpret, we filter the vocabulary by part of speech and keep only content words: mainly nouns and adjectives that carry thematic meaning. In other words, Christmas lost a lot of being and gained more snow.
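A part-of-speech filter of this kind boils down to a membership test on each word's tag. The sketch below assumes Universal Dependencies tags (`NOUN`, `ADJ`, `AUX`, `PRON`) and uses hand-assigned tags on a few lemmas from this post. In practice the tags would come from the tagger, and the exact set of tags kept is our assumption.

```python
from collections import Counter

# Tags to keep; "mainly nouns and adjectives" per the text.
CONTENT_TAGS = {"NOUN", "ADJ"}

# (lemma, tag) pairs with hand-assigned Universal POS tags, for illustration only.
tagged = [
    ("biti", "AUX"),
    ("jaz", "PRON"),
    ("božič", "NOUN"),
    ("bel", "ADJ"),
    ("sneg", "NOUN"),
    ("biti", "AUX"),
    ("božič", "NOUN"),
]


def content_counts(pairs, keep=CONTENT_TAGS):
    """Count lemmas whose part-of-speech tag is in the keep set."""
    return Counter(lemma for lemma, tag in pairs if tag in keep)


counts = content_counts(tagged)
print(counts.most_common())
```

Filtering by tag rather than by a longer stopword list means frequent function words drop out without having to enumerate them.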
# top 10 most frequent words and their corresponding counts
[('božič', 125),
('noč', 66),
('človek', 62),
('dan', 42),
('bel', 41),
('srce', 40),
('sneg', 36),
('nebo', 35),
('božičen', 35),
('snežinka', 32)]
The Result
After all this preprocessing, what remains is a compact, interpretable vocabulary: the words that appear again and again, and the ones we are used to hearing on the radio at this time of year.
To give them a festive stage, we arrange them into a Christmas tree word cloud.

The emotional journey of Slovenian Christmas songs
Ever wondered what makes a Christmas song truly resonate? We asked GPT-5.2 to dig into the lyrics of each classic and rate them from 1 to 10 across four emotional dimensions: peacefulness, joy, nostalgia, and spirituality.
- Peaceful – Calm, serene, reflective; imagery of snow, quiet, gentle scenes.
- Joyful – Upbeat, festive, celebratory; lots of laughter, dancing, gifts, excitement.
- Nostalgic – Memories, longing, warm feelings about past Christmases, family traditions.
- Spiritual – Religious themes, nativity, angels, prayers, church references.
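Ratings produced by a language model are worth validating before they go into a table. The sketch below shows one way to build a rating request and check the reply. The prompt wording, the JSON reply format, and the integer-only scores are all our assumptions. The post does not show its actual prompt, and no API call is made here.

```python
import json

DIMENSIONS = ("peaceful", "joyful", "nostalgic", "spiritual")


def build_prompt(title, lyrics):
    """Assemble a hypothetical rating prompt (not the prompt used in the post)."""
    keys = ", ".join(DIMENSIONS)
    return (
        f"Rate the song '{title}' from 1 to 10 on each of: {keys}.\n"
        "Answer with a JSON object using exactly those keys.\n\n"
        f"Lyrics:\n{lyrics}"
    )


def parse_ratings(reply):
    """Parse a JSON reply and check each dimension is an integer from 1 to 10."""
    data = json.loads(reply)
    ratings = {}
    for dim in DIMENSIONS:
        value = data[dim]  # raises KeyError if a dimension is missing
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"invalid rating for {dim}: {value!r}")
        ratings[dim] = value
    return ratings


# Made-up reply for illustration; not an actual model output.
reply = '{"peaceful": 8, "joyful": 5, "nostalgic": 7, "spiritual": 2}'
ratings = parse_ratings(reply)
print(ratings)
```

Rejecting out-of-range or missing values early keeps a single malformed reply from skewing the per-category statistics.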
Interesting findings
- Measured by the Euclidean distance between the songs' four-dimensional rating vectors (one score per emotional category), the two most similar songs are Na božično noč and FSE, KA BI ZA BOŽIČ.
- We were surprised to see Magnifico's In ko enkrat bom umrl appear among the top five spiritual songs. However, after examining the lyrics, the rating makes sense: the song contains numerous religious references that might not be obvious at first glance. Some examples include:
"V nebesih se je vnel prepir, ko sva kalila nočni mir."
"Aleluja baby, baby, Telo in dušo sem ti dal. Hvaležna si bila, ko Satana iz tebe ven sem gnal."
- Overall, the songs score highest in the Peaceful and Nostalgic categories, while Spiritual shows the greatest variance across the dataset. See table below.
| Category | Mean | Variance |
|---|---|---|
| Peaceful | 7.30 | 1.27 |
| Joyful | 6.63 | 2.77 |
| Nostalgic | 7.09 | 1.56 |
| Spiritual | 3.59 | 4.17 |
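Both the similarity finding and the summary table come down to a few lines of arithmetic. The sketch below uses invented scores for three placeholder songs, in the order peaceful, joyful, nostalgic, spiritual. It uses population variance (`pvariance`), which is an assumption, since the post does not say whether population or sample variance was reported.

```python
import math
from itertools import combinations
from statistics import mean, pvariance

CATEGORIES = ("Peaceful", "Joyful", "Nostalgic", "Spiritual")

# Invented scores for placeholder songs, ordered as in CATEGORIES.
scores = {
    "Song A": (8, 6, 7, 2),
    "Song B": (8, 7, 7, 2),
    "Song C": (5, 9, 4, 9),
}


def most_similar_pair(scores):
    """Return the pair of titles with the smallest Euclidean distance."""
    return min(
        combinations(scores, 2),
        key=lambda pair: math.dist(scores[pair[0]], scores[pair[1]]),
    )


pair = most_similar_pair(scores)

# zip(*...) turns per-song rows into per-category columns.
summary = {
    name: (mean(column), pvariance(column))
    for name, column in zip(CATEGORIES, zip(*scores.values()))
}

print(pair)
for name, (avg, var) in summary.items():
    print(f"{name}: mean={avg:.2f} variance={var:.2f}")
```

With 54 songs there are only 1,431 pairs, so comparing every pair directly is perfectly adequate.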
Conclusion
What did we learn? Slovenian Christmas songs love snow, quiet nights, warm hearts, and a healthy dose of nostalgia. Of course, our analysis is a bit simplified — this is Christmas science after all. Still, the dataset is there for you to play with. Whether you use it to experiment with sentiment models, build visualizations, or simply explore how Christmas sounds in Slovenian, we hope it brings a bit of insight (and maybe a bit of holiday spirit) along the way.
We also put together YouTube and Spotify playlists so you can actually listen to the songs behind the charts. Put them on, browse the dataset, and enjoy a very data‑inspired December.
Contributing 🎄🎶
We tried to gather as many Slovenian Christmas songs as we could, but Santa's sleigh might have missed a few!
If you know a song that isn't in the dataset, or spot something that needs fixing, we'd love your help to make this collection even merrier.
Ways to contribute:
- 🎁 Open an issue on the dataset repository
- 🎁 Submit a pull request with new songs or updates
- 🎁 Drop us a note with suggestions or corrections
Let's build this festive dataset together!
