Fortnite Insider

Postmortem for Playground LTM

by Khadija Saifi
July 18, 2018
in News

The much-anticipated Playground Limited Time Mode (LTM) was live for only a few hours before it had to be disabled due to matchmaking issues. The issues have since been resolved and the Playground LTM is live again. Here are the notes on why the LTM had to be disabled and how the problem was resolved:

Heya folks,

We recently stood up our Playground LTM on June 27th, at about 4 AM EDT. Following this, we experienced an overload of our matchmaking service which caused both the default modes and Playground to fall over.  We worked to get the service to where it needed to be, and were finally able to roll out the mode on the evening of July 2nd.

What happened?
Our matchmaking is built on something called the Matchmaking Service (MMS), which is responsible for facilitating the “handshake” between players looking to join a match and an available dedicated server open to host that match.  Each node in the matchmaking cluster keeps a large list of open dedicated servers that it can work with, randomly distributed by region to keep a roughly proportional amount of free servers for each.  Players that connect to MMS request a server for their region, MMS assigns that player to a node, and the node picks a free server for the requested region from its list.
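The request flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual MMS implementation: the class names, the random node assignment, and the region label are assumptions made for the example.

```python
import random
from collections import defaultdict


class Node:
    """One matchmaking node holding its own list of free servers (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.free = defaultdict(list)  # region -> free dedicated server ids

    def add_server(self, region, server_id):
        self.free[region].append(server_id)

    def pick_local(self, region):
        # Return a free server for the region, or None if the local list is dry.
        servers = self.free[region]
        return servers.pop() if servers else None


class MatchmakingService:
    """Assigns each request to a node; random assignment is an assumption."""

    def __init__(self, node_count, seed=0):
        self.nodes = [Node(f"node-{i}") for i in range(node_count)]
        self.rng = random.Random(seed)

    def register_server(self, region, server_id):
        # Servers are spread randomly across nodes, as the notes describe.
        self.rng.choice(self.nodes).add_server(region, server_id)

    def request_match(self, region):
        node = self.rng.choice(self.nodes)
        return node.name, node.pick_local(region)


mms = MatchmakingService(node_count=3)
for i in range(30):
    mms.register_server("NA-East", f"server-{i}")

node_name, server = mms.request_match("NA-East")
print(node_name, server)
```

A `None` result here is the case the rest of the notes deal with: the chosen node has no free server left for that region in its own list.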

Since Playground mode makes matches for every 1-4 people instead of 100, it requires between 25 and 100 times as many matches as normal depending on party size.  While we could pack virtual servers a bit tighter per physical CPU for Playground mode, we still had to use 15 times as many servers as we had been running for the other modes.  We were able to secure the total server capacity, but it meant the list that each node had to manage was suddenly 15 times as long as well.
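The "25 to 100 times" figure follows from dividing a 100-player match by the party size. A quick check, assuming the standard match size of 100 mentioned above (the function name is made up for the example):

```python
STANDARD_MATCH_SIZE = 100  # players per match in the default modes, per the notes


def match_multiplier(party_size):
    """How many Playground matches replace one standard match for a given party size."""
    if not 1 <= party_size <= 4:
        raise ValueError("Playground parties are 1-4 players")
    return STANDARD_MATCH_SIZE / party_size


for size in (1, 2, 3, 4):
    print(f"party of {size}: {match_multiplier(size):.1f}x as many matches")
```

Full parties of four give the lower bound (25x) and solo players the upper bound (100x). The 15x server figure in the notes is lower than either because more virtual servers were packed onto each physical CPU.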

When an MMS node can’t find a free server for the requested region within its own list, it has to go ask all of the other nodes for a spare one by reading from each of their local lists. When you’re a node and your list is suddenly 15 times longer, it slows you down. When you have to go check all of the other lists and each one is also 15 times longer, each of those reads is up to 15 times slower as well, which can translate to computation times that are orders of magnitude longer than normal. When we released Playground, the overwhelming demand exhausted the local lists for MMS nodes far faster than the system could refresh them. Each node was running to every other node to request extra servers that just weren’t there yet, or at the very least took a long time to pick out of the non-local lists. The long compute times caused the CPU to end up with a backlog of pending requests, resulting in a feedback loop that eventually caused the system to grind to a halt.
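To see how a longer list plus a local miss can snowball into a backlog, here is a rough back-of-the-envelope model. Everything in it is an assumption for illustration: the linear scan per list, the node count, the list lengths, and the per-tick work budget are invented numbers, not figures from the notes.

```python
def worst_case_scan(list_len, node_count, local_miss):
    """Entries examined for one request, assuming a linear scan of each list read."""
    lists_read = node_count if local_miss else 1
    return list_len * lists_read


def backlog_over_time(arrivals_per_tick, work_budget_per_tick, cost_per_request, ticks):
    """Pending requests after each tick when a node can only do a fixed amount of work."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        served = min(backlog, work_budget_per_tick // cost_per_request)
        backlog -= served
        history.append(backlog)
    return history


# Hypothetical numbers: 10 nodes, 1,000 entries per list before Playground.
normal_cost = worst_case_scan(list_len=1_000, node_count=10, local_miss=False)
playground_cost = worst_case_scan(list_len=15_000, node_count=10, local_miss=True)

healthy = backlog_over_time(100, 1_000_000, normal_cost, ticks=5)
overloaded = backlog_over_time(100, 1_000_000, playground_cost, ticks=5)
print(playground_cost // normal_cost, healthy, overloaded)
```

In this toy setup the per-request cost rises by the list growth (15x) multiplied by the number of lists read, so the same work budget serves far fewer requests per tick and the queue never drains.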

What did we do to fix it?
The first thing we did after disabling the mode was to split Playground MMS to run on its own service cluster. This was necessary not only to keep a traffic jam from affecting the base game modes, but also to allow us to iterate and tweak the service as often as we needed while we worked to get Playground back online. We tried increasingly dramatic levels of re-architecting, and tested at each stage until we reached the acceptance criteria to re-release the mode.

Once we identified the root of the problem as the exhaustion of sessions from local lists, the solution was to give the cluster the ability to bulk rebalance sessions from other nodes to ensure repeated lookups were not necessary.  With the system constantly shifting regional capacity from nodes with an excess to nodes that might be running low, the odds of a node running dry for a particular region and having to search outside its local list have been drastically reduced.  While not an issue right now in the primary Fortnite Battle Royale game modes, this is an upgrade we are bringing over to the main MMS cluster as well to future-proof the system.
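A bulk rebalance of this kind might look something like the sketch below for a single region. The even-split target and the function name are assumptions; the notes do not say how the real rebalancer decides how much capacity to move.

```python
def rebalance(pools):
    """Even out free servers for one region across nodes (hypothetical policy).

    pools maps node name -> list of free server ids and is modified in place.
    """
    if not pools:
        return pools
    total = sum(len(servers) for servers in pools.values())
    base, extra = divmod(total, len(pools))
    names = sorted(pools)
    # Each node's target share; the first `extra` nodes take one more server.
    targets = {name: base + (1 if i < extra else 0) for i, name in enumerate(names)}

    # Collect the excess from nodes above target in one pass...
    surplus = []
    for name in names:
        while len(pools[name]) > targets[name]:
            surplus.append(pools[name].pop())
    # ...then hand it to nodes below target in bulk.
    for name in names:
        while len(pools[name]) < targets[name]:
            pools[name].append(surplus.pop())
    return pools


pools = {"node-a": list(range(9)), "node-b": [], "node-c": [9]}
rebalance(pools)
print({name: len(servers) for name, servers in pools.items()})
```

Run periodically, a pass like this tops up a node that is running low before it runs dry, so it seldom has to search other nodes' lists on a per-request basis.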

We pushed the load-testing process to the limits during our MMS restructuring, because the scale of what we were trying to simulate was so far beyond normal usage or testing patterns.  We needed to spin up many millions of theoretical users and hurl them at our Playground MMS system in a big, crashing wave in an attempt to strain our new session rebalancer.  While the tweak – test – evaluate cycle took several hours per loop, it allowed us to develop and refine the rebalance behavior to a point where we felt it could stand up to the traffic, as well as to identify and fix edge-case bugs that could have torpedoed the effort to bring Playground back online.

What have we learned?
In short, we learned a lot about our own matchmaking system and its failure points as well. We planned and prepared for what we thought to be the maximum sustained matchmaking throughput and capacity based on the size of our player base (plus a healthy buffer), but didn’t properly anticipate the edge case of the initial “land rush” of players exhausting local lists.

On the restart of the mode itself, we had an additional learning experience.  We opted to bring back Playground in small steps by individual regions and platforms, with the goal of reducing the initial load on the system so we could scale into it.  We actually encouraged the opposite, as players swapped regions into those that had the mode re-enabled and forced us to slow the rollout as we dealt with capacity issues.  The silver lining is that we certainly have much better visibility into the total available cloud resources in Asia than ever before, and we want to give a shoutout to our cloud partners for working with us to ensure we could quickly adjust!

The process of getting Playground stable and in the hands of our players was tougher than we would have liked, but was a solid reminder that complex distributed systems fail in unpredictable ways.  We were forced to make significant emergency upgrades to our Matchmaking Service, but these changes will serve the game well as we continue to grow and expand our player base into the future.

Khadija Saifi

Co-founder, Lead Writer and Finance at Fortnite Insider. Khadija has been gaming for more than 10 years in her free time, playing mainly FPS, especially Apex Legends as well as puzzle games online. Khadija specializes in challenge and puzzle guides and breaking news. BA (Hons) Accounting & Finance. Contact: [email protected]

Fan site - not affiliated with Epic Games or Fortnite
© 2021 Fortnite Insider