Ranking Mario Kart Players with Glicko-2

We have a closed group that plays Mario Kart 8 Deluxe against each other on game nights, but we had no way of quantifying how good each player was. We were fighting only based on vibes and self-adoration: "I'm the best!" "No, I'm the best, I won the last race!"
We wanted a scoring system similar to Elo in chess, tailored to Mario Kart and focused primarily on ranking our local group. The game does have a global online ranking system (the VR score used in online races), but Nintendo keeps the exact formula under wraps, and many players do not even use the online game mode. More importantly, we were most interested in local comparisons within our group.
With that in mind, we applied and modified the Glicko-2 scoring system so it would rank our group and we would not argue (as much) anymore since our statements would be backed by data. We built GlickoKart—a free web app where anyone can track their own Mario Kart ratings using this system.
Game Specifics
Mario Kart 8 Deluxe on Nintendo Switch supports up to 4 players, with NPC racers filling the remaining slots for a total of 12 participants per race. The game is played in sessions called Grand Prix, and each Grand Prix includes four races in a row. After all four races are finished, players are ranked from 1st to 12th based on their total points.
The point system is very easy to understand. Each race gives points depending on where you finish. First place gets 15 points, second gets 12, and it keeps going down until last place, which gets 1 point. Over a full Grand Prix, total scores can range from as low as 4 points to as high as 60 points.
Mario Kart 8 Deluxe Scoring System (per race):
| Place | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 15 | 12 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
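The scoring rules are simple enough to encode directly; a minimal sketch in Python (the names are ours, not from the game):

```python
# Per-race points awarded by finishing position (1st through 12th).
RACE_POINTS = [15, 12, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

def grand_prix_total(places):
    """Total points over a Grand Prix, given the finishing place (1-12)
    achieved in each of the four races."""
    return sum(RACE_POINTS[place - 1] for place in places)

# Winning every race yields the maximum of 60 points;
# finishing last in every race yields the minimum of 4.
```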
One of the biggest reasons Mario Kart is hard to rate is the amount of luck involved. Items play a huge role in every race, and they can completely change the outcome in just a few seconds. Getting hit by a shell or pulling a powerful item like a bullet bill at the right moment can matter just as much as driving well.
This is closely tied to how the game handles AI difficulty. Mario Kart offers easy, normal, and hard NPC opponents, but Nintendo also adds extra systems on top of that. The game uses rubber-banding and position-based item probabilities to keep races close. NPCs that fall behind are often given better items and speed boosts, to keep the game more interesting even for skilled players. These mechanics are well documented in earlier Mario Kart games and still apply in Mario Kart 8 Deluxe [1, 2, 3].
This has an important effect on player ratings. For example, a mid-level player who usually finishes 2nd might not simply drop to 3rd when a stronger player joins the race. Instead, the stronger player can pull the NPCs forward through rubber-banding and item pressure, which may cause the mid-level player to finish even lower. In other words, a third place when competing with a very strong player can and should be more impressive than a second place in a group of equally skilled players.
All of this means that race results are influenced not only by player skill, but also by how the game actively reacts to the skill levels and positions of everyone in the race. Any rating system needs to take these hidden mechanics into account.
On top of these mechanics, Mario Kart also offers many different game modes. There are multiple speed classes (50cc, 100cc, 150cc, and 200cc) and different item setups (normal, custom, frantic, and others), each of which changes how chaotic or skill-based a race feels.
Experiment Setup
Because these settings have such a big impact on results, we decided to standardize the environment for our analysis and stick to a single, fixed mode. This allows us to compare performances more fairly and focus on player consistency rather than rule differences. We kept things consistent by using 150cc, 4 races per Grand Prix, normal item settings, and hard AI difficulty. This setup was the most common among people we knew who played the game, and it also ended up being the most fun for us.
Exploratory Data Analysis
Before jumping into ratings, it helps to first look at the data itself. Over the past six months, we tracked every Mario Kart session played in our group. This gives us a solid picture of how each player performs relative to the others, not just in one-off races, but over time.
In total, we analyzed 76 races across six players. Looking at this data helps build intuition: who usually finishes near the front, who is still learning, and how much overlap there is between players.
Position Distribution by Player
Shows the percentage of races each player finished in each position. Lower positions (1-3) are better!
The chart above shows how often each player finishes in each position. Even without any math, some clear patterns start to appear:
- Top tier (Mr. B & Mr. L): Both players dominate first place overall. Mr. B has the highest win rate by a clear margin, both proportionally and in total wins, while Mr. L follows closely but shows slightly more lower finishes (e.g. multiple 4th-place results that Mr. B does not have).
- Mid-tier (Ms. M): Fewer outright wins, but a more uniform spread across the top four positions. Compared to Mr. B and Mr. L, whose results drop off sharply after first place, Ms. M’s distribution is flatter and reflects steady competitiveness rather than dominance.
- Learning tier (Mr. A, Ms. N & Ms. G): Mr. A’s results are centered around the middle of the field, forming the most symmetric distribution among all players. Ms. G and Ms. N have played fewer races overall and show wider, lower-average distributions, with higher variability in outcomes.
This already gives us a rough ranking of skill levels in the group. But it also highlights a problem: how do we turn this intuition into a single number that actually reflects current skill?
We have already revealed that we chose a rating system, but now let us motivate that choice and explain why simpler approaches fail, even though this may seem like an easy problem at first glance. We could, for example, just count who wins the most races. But not everyone plays the same number of games, so that quickly falls apart. Okay, then maybe we take the average finishing position. That works better, but it has its own issues. It reflects long-term history much more than current skill. If someone suddenly starts practicing and improves a lot, their average will take a long time to catch up. There is an additional problem we mentioned earlier, where the same placement can carry different weight depending on who your opponents are.
What we really want is a system that reacts to change. If your little brother grinds Mario Kart for a month and suddenly becomes good, the numbers should notice. We also want the system to take into account the relative difference in skills when processing results. This is exactly the kind of problem that rating systems are designed to solve: they evaluate current skill and offer a natural relative comparison between players.
Choosing the Rating System
Once we had a better feel for the data, the next question was how to actually turn all of this into a fair ranking. There are many different rating systems out there, and most people have probably heard of at least one of them.
The most famous example is Elo, which is used in chess and many online games. Over time, more advanced systems have been built on top of it to handle things Elo struggles with, such as uncertainty, uneven activity, and games with more than two players. Since Mario Kart is a chaotic, multiplayer game where players don't all play the same number of races and skill can change over time, we needed something more flexible than a simple win/loss counter. After comparing different approaches, we decided to use the Glicko-2 [4, 5] rating system. Below is the reasoning behind that choice.
Why Glicko over Elo?
Elo represents each player with a single number, which makes it easy to understand but also quite limited. Glicko [6] improves on this by adding Rating Deviation (RD), which measures how confident the system is in a player's rating.
For example, a rating of 1500 ± 350 means the system is very unsure, while 1500 ± 50 means it is quite confident. This is especially useful for new players or players who don't play very often, since their ratings should be allowed to move quickly as the system learns more about them.
This behavior matches our data much better than Elo, where every result is treated with the same level of confidence.
Why Glicko-2 over Glicko?
Glicko-2 extends the system even further by introducing volatility (σ), which captures how consistent a player's performance is over time. It influences how Rating Deviation changes through time.
A player who sometimes wins and sometimes finishes last will have high volatility, while a player who reliably finishes in the same range will have low volatility. This is a great match for Mario Kart, where randomness and items can cause swings in results.
Volatility also matters when players take breaks. If someone doesn't play for a while, the system becomes less sure about their skill, and based on their volatility the player's Rating Deviation will rise. For inconsistent players with higher volatility, uncertainty grows faster than for consistent players. In our implementation, RD is capped at 250; if a player's RD is above this value, we do not increase it until it falls below the cap.
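A rough sketch of how RD could grow for an inactive player, with the 250 cap described above (the cap is our modification; `SCALE` is the standard Glicko-2 conversion constant, and the function names are illustrative):

```python
import math

SCALE = 173.7178   # Glicko-2 scale-conversion constant [4]
RD_CAP = 250.0     # our modification: RD never grows past this

def grow_rd(rd, sigma):
    """One inactive rating period: RD grows according to volatility."""
    if rd >= RD_CAP:
        return rd  # already at or above the cap: do not increase further
    phi = rd / SCALE                    # convert to the internal scale
    phi = math.sqrt(phi**2 + sigma**2)  # standard Glicko-2 growth step
    return min(phi * SCALE, RD_CAP)     # convert back, respecting the cap
```

Note how a larger volatility makes the same period of inactivity inflate RD faster, which is exactly the behavior described above.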
Glicko-2 Parameters
| Variable | Meaning | High Value | Low Value | Default |
|---|---|---|---|---|
| Rating (r) | Estimated skill level | Strong player | Weaker player | 1500 |
| Rating Deviation (RD) | Confidence in the rating | Uncertain rating | Confident rating | 350 |
| Volatility (σ) | How much a player's performance varies | Erratic results | Consistent results | 0.06 |
Default values are taken from the original Glicko-2 paper [4].
How Glicko-2 Works (High Level)
At a high level, Glicko-2 works in rating periods. Instead of updating ratings after every game, all games played within a period are treated as if they happened at the same time. Once the period ends, ratings are updated in one go.
A single rating update follows the same general process each time:
1. Each player begins the rating period with a current rating, a rating deviation, and a volatility value. New players start with default values.
2. Ratings and rating deviations are converted to Glicko-2's internal scale, which centers values around zero and rescales them for better computational stability. Note: we could use this internal scale directly as the rating, but people are so used to the Elo-style rating range that converting back is a more user-friendly approach.
3. For each player, the system looks at all opponents faced during the period and estimates how likely each outcome was based on their ratings.
4. Expected results are compared with actual results to measure whether the player over-performed or under-performed.
5. Using this difference, the player's volatility is updated to reflect how surprising their results were during the period.
6. The rating deviation is adjusted to account for how much new information the system gained from these games.
7. The player's rating is then updated using the new deviation, volatility, and game outcomes.
8. Finally, the updated values are converted back to the familiar rating scale.
If a player does not compete during a rating period, their rating and volatility stay the same, but their rating deviation increases slightly, reflecting growing uncertainty.
This overview intentionally skips most of the math. For a full, formal description of the algorithm and its exact update equations, we strongly recommend the official Glicko-2 documentation [4].
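As a small taste of the math we skipped, steps 2 and 8 (the scale conversions) look roughly like this in Python; the constant 173.7178 comes from the Glicko-2 paper [4]:

```python
SCALE = 173.7178  # conversion constant from the Glicko-2 paper [4]

def to_internal(rating, rd):
    """Step 2: convert to the zero-centered internal scale (mu, phi)."""
    return (rating - 1500.0) / SCALE, rd / SCALE

def to_public(mu, phi):
    """Step 8: convert back to the familiar Elo-style scale."""
    return mu * SCALE + 1500.0, phi * SCALE
```

A brand-new player (1500 ± 350) thus starts at exactly mu = 0 on the internal scale.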
Our Modifications to Glicko-2
Glicko-2 already solves many of the problems we care about: uncertainty, changing skill, and uneven activity. However, Mario Kart introduces challenges that the original system was never designed for, such as multiplayer races, score-based outcomes, and heavy randomness.
To make Glicko-2 better reflect how Mario Kart actually works, we made a small number of targeted modifications. Importantly, we did not change the core structure of the system, only how game outcomes are interpreted and fed into it.
1. Score Differences Instead of Win / Loss
The key place where Glicko-2 updates a player's rating is the rating update equation:

$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)\,\bigl(s_j - E(\mu, \mu_j, \phi_j)\bigr)$$

In this expression, all quantities are written on Glicko-2's internal scale, where ratings are centered around zero for numerical stability. The value $\mu$ is the player's rating at the start of the rating period, while $\mu'$ is the updated rating after all games in the period have been taken into account. The factor $\phi'^2$ is the square of the updated rating deviation, and it controls how responsive the rating is to new information: higher uncertainty allows for larger changes, while lower uncertainty makes the rating more stable.

The sum runs over the $m$ opponents faced during the period. Each opponent $j$ contributes a term that is weighted by $g(\phi_j)$, which down-weights results against opponents whose own ratings are very uncertain. The expected result $E(\mu, \mu_j, \phi_j)$ represents how well the player was predicted to perform against that opponent given both ratings and uncertainties.

What ultimately drives the update is the difference $s_j - E(\mu, \mu_j, \phi_j)$ between what actually happened and what was expected; surprising results move the rating more, while unsurprising results have little effect. In short, the equation adjusts the rating in proportion to how unexpectedly well or poorly the player performed, while accounting for uncertainty on both sides.
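The weighting function $g$ and the expected result $E$ have closed forms given in the Glicko-2 paper [4]; a direct Python transcription:

```python
import math

def g(phi):
    """Shrinks the impact of results against opponents whose own
    ratings are uncertain (large phi)."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)

def expected(mu, mu_j, phi_j):
    """Expected score E of a player rated mu against an opponent
    (mu_j, phi_j), everything on the internal scale."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))
```

Against an equally rated opponent the expected score is exactly 0.5, regardless of the opponent's uncertainty.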
In standard Glicko-2, the observed outcome $s_j$ is very simple:

$$s_j = \begin{cases} 1 & \text{win} \\ 0.5 & \text{draw} \\ 0 & \text{loss} \end{cases}$$

This works well for games with clear binary outcomes like chess: you either win, lose, or draw. In Mario Kart, however, this throws away a lot of useful information. A 1-point win and a 20-point win are treated exactly the same, even though they feel very different when you actually play the game.
Instead of just recording who beat whom, we use the difference in final score to represent how decisive that win or loss was.
2. Converting Score Differences to Glicko-2 Format
What ultimately matters for the rating update is the difference between what actually happened and what the system expected. This is captured by the following term inside the update equation:

$$s_j - E(\mu, \mu_j, \phi_j)$$
If this value is positive, the player outperformed expectations; if it is negative, they underperformed. Everything else in the equation simply scales this signal based on uncertainty.
Using standard Glicko-2 would still “work”: finishing above someone counts as a win, below as a loss, and equal points as a draw. But we wanted a more game-informed signal—one that reflects how much a player won by. Glicko-2 expects observed outcomes to lie in the range $[0, 1]$. Score differences in Mario Kart range from −56 to +56, so we need a way to map one to the other.
First Attempt: Linear Mapping
The most straightforward idea is a linear mapping:

$$s = \frac{d + 56}{112}$$

where $d \in [-56, 56]$ is the score difference (your total minus your opponent's). This maps the full range of possible score differences neatly into $[0, 1]$, with a zero difference landing exactly at 0.5. On paper, this looks ideal: simple, symmetric, and easy to interpret.
In practice, however, it doesn't reflect how Mario Kart games usually play out. The theoretical maximum difference of 56 points (60 vs. 4) is extremely rare. Most competitive races fall in the 1–20 point range.
With a linear mapping, even a very convincing 12-point win—such as winning all four races while your opponent finishes second every time—maps to only about 0.6. That is much closer to a draw than to a decisive win, which doesn't match how the result feels in the game.
The Solution: Sigmoid Mapping
To better capture the typical range of outcomes, we switched to a sigmoid function of the score difference $d$:

$$s = \frac{1}{1 + e^{-k d}}$$
Intuitively, this function behaves much closer to how players perceive race results:
- values close to 0.5 correspond to near draws
- values close to 1 represent convincing wins
- values close to 0 represent convincing losses
The parameter $k$ controls how quickly the function transitions from draw-like outcomes to decisive ones. A higher $k$ makes small differences matter more; a lower value treats more results as close to a draw.
After experimenting with our data, we settled on $k = 0.25$. With this choice:
- 1–4 points stay closer to a draw than a win
- around 6 points feels like a narrow but real win (~0.8)
- around 12 points becomes a clear win (~0.95)
- very large differences approach 1 and indicate domination
Plugging this version of $s_j$ into the Glicko-2 update equation means that ratings now reflect not just who won, but how decisively they won—while preserving the original idea of “actual minus expected” performance.
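Putting the two mappings side by side as code (a sketch; `k=0.25` is the steepness we settled on, and the function names are ours):

```python
import math

def linear_outcome(diff):
    """First attempt: map a score difference in [-56, 56]
    linearly onto [0, 1]."""
    return (diff + 56.0) / 112.0

def sigmoid_outcome(diff, k=0.25):
    """Our mapping: a sigmoid that saturates for decisive
    wins and losses."""
    return 1.0 / (1.0 + math.exp(-k * diff))
```

For a 12-point win the linear map gives only about 0.61, barely above a draw, while the sigmoid gives about 0.95, matching the intuition above.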
Modeling Trade-offs
One important limitation of this approach is that a fixed transformation from score difference to outcome value can never be perfect. In Mario Kart, points are not distributed linearly by finishing position, which means that the same point difference does not necessarily represent the same performance gap across the full range of outcomes. Any smooth mapping inevitably compresses and stretches different regions of this space in a somewhat subjective way. We therefore choose a transformation that we felt best models the range of outcomes we care about most: results near the top of the leaderboard.
The plot below shows the binary, linear, and sigmoid mappings we considered. Although we ultimately focus on score differences, we also experimented with place differences, and the interactive plot lets you explore that idea as well.
How to interpret: The x-axis shows the score difference (your score minus opponent's score), and the y-axis shows the outcome value (0 = loss, 0.5 = draw, 1 = win). We chose sigmoid with steepness 0.25 for score differences as it smoothly captures the strength of victory.
3. Pairwise Comparisons in Multiplayer
The final challenge is multiplayer races. A Mario Kart race is not a single 1-vs-1 match, so we need a way to translate one multiplayer outcome into something Glicko-2 can work with.
The solution is to break each race into a set of pairwise 1-vs-1 comparisons. After the race finishes, every player is compared against every other player exactly once. This is not an approximation or a hack—it is exactly how Glicko-2 is designed to handle multiple games within a rating period. This becomes clear when looking at the rating update equation:

$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)\,\bigl(s_j - E(\mu, \mu_j, \phi_j)\bigr)$$
Each term in the sum corresponds to one opponent. In a four-player race, each player faces three opponents, resulting in three terms in their sum. Across the whole race, this creates six distinct 1-vs-1 matchups.
We treat one full race (or Grand Prix) as a single rating period. All pairwise comparisons are collected, and each player's rating is updated once at the end using the combined signal from all opponents.
In a four-player race, for example, Player A is compared separately against B, C, and D; Player B is compared against A, C, and D; and so on. Each comparison produces its own $s$ value based on the score difference, and all of them are combined in the final rating update.
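That expansion can be sketched as follows (player names and the `sigmoid_outcome` helper are illustrative, not our production code):

```python
import math
from itertools import combinations

def sigmoid_outcome(diff, k=0.25):
    """Map a score difference to a Glicko-2 outcome in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k * diff))

def pairwise_outcomes(scores):
    """Expand one Grand Prix into per-player lists of (opponent, s) pairs.
    `scores` maps player name -> total Grand Prix points."""
    games = {player: [] for player in scores}
    for a, b in combinations(scores, 2):
        s = sigmoid_outcome(scores[a] - scores[b])
        games[a].append((b, s))
        games[b].append((a, 1.0 - s))  # the sigmoid is symmetric around 0.5
    return games
```

With four players this yields three opponents per player, i.e. six distinct matchups, exactly as described above.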
Multiplayer Trade-offs
This approach works well in practice, but it does come with an important caveat. Pairwise comparisons within a multiplayer race are not truly independent: if a player beats one opponent, they are more likely to beat the others as well. Treating these comparisons as independent can slightly overstate how much information a single race provides. This is a known limitation of applying pairwise rating systems to multiplayer settings. Systems like TrueSkill [7], which are designed specifically for multiplayer games, address this more directly, but at the cost of additional complexity.
Implementation Detail: Illinois Algorithm
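This heading refers to step 5 of the Glicko-2 algorithm [4], where the new volatility σ' is found by numerically solving a one-variable equation using the Illinois method, a variant of regula falsi. A generic sketch of the root finder itself (an illustration under our own naming, not our production code):

```python
def illinois(f, a, b, tol=1e-10, max_iter=100):
    """Find a root of f on [a, b], where f(a) and f(b) have opposite
    signs. Identical to the false-position method, except that when
    the same endpoint is retained twice in a row its stored function
    value is halved (the "Illinois" trick), which prevents the
    one-sided stalling plain false position suffers from."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "the root must be bracketed"
    for _ in range(max_iter):
        c = b - fb * (b - a) / (fb - fa)  # secant step through (a, b)
        fc = f(c)
        if abs(fc) < tol:
            return c
        if fc * fb < 0:
            a, fa = b, fb   # the bracket flipped sides: keep b as new a
        else:
            fa /= 2.0       # Illinois modification: decay the stale endpoint
        b, fb = c, fc
    return b
```

In Glicko-2 the bracketed function comes from the volatility optimization; here any sign-changing function works.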
Our Results: Player Ratings Over Time
Below you can explore how player ratings evolved over 76 races. Toggle between different metrics and select which players to display.
Semi-transparent dots indicate races where a player participated.
Key Observations from the Ratings
- Who is the best? From the position distributions alone, you might expect Mr. B to be the clear ratings winner, but the Glicko-2 system captures the time dependency as well, which reveals surprising results. Looking at the ratings alone, Mr. L ends up slightly ahead of Mr. B. However, when we include Rating Deviation (RD) and look at the 95% confidence intervals, we see they overlap significantly. We're satisfied that Glicko-2 captures Mr. L's catch-up trajectory while still reflecting the underlying uncertainty.
- Prior experience shows early: At the start of our game nights, Mr. B had the most prior experience with Mario Kart; knowing the tracks, shortcuts, and optimal lines. This is visible in the big initial rating jump. Mr. L's and Ms. M's previous experience with the game also shows in their early performances. As we kept playing, we all learned the tracks better, which slowly leveled the playing field.
- Confidence builds over time: RD starts at 350 for all players (very uncertain) and steadily decreases as they play more games. After 20-30 races, most players have RD values below 100, meaning the system is quite confident about their ratings. We can also see RD creeping up for players like Ms. G when they miss several games in a row, reflecting the system's growing uncertainty about inactive players.
- Volatility spikes tell stories: If we look at races 61–62 in the volatility chart, both Mr. A and Mr. B show sudden spikes. This was a great day for Mr. A (1st place with 42 points, then 3rd with 39) and an unusually bad one for Mr. B (6th with 36 points, then 5th with 34). The system correctly flagged these as surprising results, increasing volatility for both players.
- Randomness doesn't break it: Despite Mario Kart's notorious randomness (blue shells, lightning bolts, etc.), the Glicko-2 system does an excellent job tracking long-term skill differences. Luck averages out over multiple races.
Conclusion
What started as friendly arguments about who's the best Mario Kart player has turned into a deep dive into rating systems. We took the Glicko-2 system and adapted it to the unique challenges of the game.
The key modifications were pairwise comparisons for multiplayer, score differences instead of win/loss, and a sigmoid function to weight those differences. They turned out to work remarkably well. Despite the game's randomness and the built-in tradeoffs of the system, it accurately (based on our experience) tracks skill levels and builds confidence over time.
Of course, there's always room for improvement. We could create mode-specific ratings for different speed classes (50cc, 100cc, 150cc, 200cc) and item configurations, or extend this system to other multiplayer games with numerical scores. We could also use the ratings for matchmaking, building balanced teams in team-based modes or suggesting handicaps.
Most importantly, we now have data to back up our bragging rights. The arguments haven't stopped completely (this is Mario Kart after all) but at least now they're statistically informed.
We are grateful to Erik Štrumbelj for his valuable feedback, the stimulating discussions so far, and the many future debates we look forward to.
Want to settle the debate in your own group? We built GlickoKart, a free web app where you can track your Mario Kart ratings using this exact system. Create a group, log your race results, and watch the rankings evolve over time. Finally, you'll have statistical proof of who's really the best.
References
- [1] guiguilegui (2016). Rubber banding in Super Mario Kart. https://guiguilegui.wordpress.com/2016/11/16/rubber-banding-in-super-mario-kart/
- [2] Shin, A., Holmes, F., Henry, R. C. (2025). When Losing Means Winning: Algorithmic Fairness in Mario Kart. https://www.audreyhshin.com/assets/papers/CS_231_Final_Paper-5.pdf
- [3] Super Mario Wiki. Mario Kart 8 item probability distributions. https://www.mariowiki.com/Mario_Kart_8_item_probability_distributions
- [4] Glickman, M. E. (2012). Example of the Glicko-2 system. https://www.glicko.net/glicko/glicko2.pdf
- [5] English Chess Federation. The Glicko system for beginners. https://www.englishchess.org.uk/wp-content/uploads/2012/04/The_Glicko_system_for_beginners1.pdf
- [6] Glickman, M. E. (1999). Parameter estimation in large dynamic paired comparison experiments. https://www.glicko.net/research/acjpaper.pdf
- [7] Herbrich, R., Minka, T., Graepel, T. (2006). TrueSkill™: A Bayesian Skill Rating System. https://proceedings.neurips.cc/paper_files/paper/2006/file/f44ee263952e65b3610b8ba51229d1f9-Paper.pdf
- [8] Kirkman, R. pyglicko2 implementation. https://github.com/ryankirkman/pyglicko2/blob/master/glicko2.py
- [9] r/TheSilphRoad. Farming volatility: How a major flaw in a well-known rating system takes over the GBL leaderboard. https://www.reddit.com/r/TheSilphRoad/comments/hwff2d/farming_volatility_how_a_major_flaw_in_a/