Unity QA screws us again or: How we learned to stop worrying and fix it ourselves.

We build Table Tennis Touch atop the Unity 3D game engine. Anyone who’s read my past posts will know we have a love/hate relationship with Unity: they provide us with a pretty decent engine at an affordable price, letting us target multiple platforms with relative ease – without them we’d have likely never built Table Tennis Touch, and for that we’re grateful. But their QA can be dire and it has cost us dearly on several occasions. I’m thinking:

Their most recent lapse in QA has caused our Google Play rating to drop 0.13 points. That may sound tiny but it reflects significant discontent in our player base – and rightly so. Players don’t buy a premium game for it to fail. So what’s the problem?

The Problem

After downloading v2.1.0104.1 (2048) from Google Play, some players reported the game wouldn’t start, that it crashed on launch. That’s a terrible situation and since none of our diagnostics had a chance to run before the crash, we had no idea what was causing the problem. Needless to say, we were unable to reproduce it with our test devices.

Note: As far as we know, the problem only affects Android.

Fortunately, one of our players was especially helpful. After some stellar support from Vanessa he was able to send us an 11 MB logcat dump. Wading through we found the following smoking gun (abbreviated):

Unity   : Upgrading PlayerPrefs storage
...
art     : Throwing OutOfMemoryError "PushLocalFrame"
art     : ... JNI ERROR (app bug): local reference table overflow (max=512)
art     : ... local reference table dump:
art     : ...   Last 10 entries (of 512):
art     : ...       511: 0x12e510c0 java.util.HashMap$HashMapEntry
art     : ...       510: 0x12e6f860 java.lang.OutOfMemoryError
art     : ...       509: 0x12e6f900 java.lang.String "java.lang.OutOfM... (42 chars)
art     : ...       508: 0x6f1f0240 java.lang.Class<java.lang.Object>
art     : ...       507: 0x12e6f860 java.lang.OutOfMemoryError
...
CRASH   : signal 6 (SIGABRT), code -6 (?), fault addr --------

It looks a lot like Unity is running out of memory whilst “Upgrading PlayerPrefs storage”. We’ve reported the issue to Unity as case 764422.

Note:

PlayerPrefs is Unity’s approach to storing player data on a device (eg. high score, tutorial viewed etc.).

Such is my way, I get a bit ranty so if you want to skip to the solution, click here.

<pre-empt>tl;dr</pre-empt>

Analysis

Three questions spring to mind:

  1. Why is Unity upgrading PlayerPrefs storage?
  2. Why is Unity running out of memory during the upgrade?
  3. Why have we not seen this in our tests?

Why is Unity upgrading PlayerPrefs storage?

Digging back through Unity release notes we found this innocuous entry for 5.3

Android: Fixed an issue where PlayerPrefs with keys and values containing line breaks would break

We had skipped Unity 5.3 moving straight to 5.3.1 so this would be the first time we’d encounter the change. But what did they do? And could it be responsible for the crash?

Comparing the PlayerPrefs file from an earlier version of TTT with the failing version we noticed the following:

  • the filename had changed (it now has the suffix: .v2.playerprefs.xml)
  • Unity was storing a few more lines of data
  • our data was now URL Encoded

Why is Unity running out of memory during the upgrade?

The PlayerPrefs file is nothing special; it’s just an XML file. We’ve been parsing XML files since the last millennium, many of them much, much larger than our PlayerPrefs. But the stack trace reveals something of Unity’s approach: there is a JNIerror.

For whatever reason, parsing the XML for migration was exceeding the maximum number of references and overflowing the local reference table, so thanks for that. It certainly seems like the sort of thing that a test suite at Unity should detect.

Why have we not seen this in our tests?

This is the most important question for us. You hope your game engine is stable, but if it isn’t it’s incumbent upon you as a developer to catch the problem and deal with it. As I’ll explain shortly, the problem is fairly easy to reproduce once you know what causes it. So what went wrong?

Whilst not feature rich, this was a big release for us as we’d upgraded Unity from 4.6.7 to 5.3.1. Consequently our testing was extensive. But we did not see this bug. And therein lies the problem for us: we had focused so much on the large changes (API, PhysX, etc.) that this innocuous fix from Unity went unnoticed; after all, every run of the game loads PlayerPrefs, changes them and saves them back out again. It’s one of those features that gets indirect testing on each run of the game. Unfortunately, it was harbouring a rather nasty side effect. We’ll certainly be adding the scenario to our test suite.

The (potential) Solution

The approach I’m about to describe is now in production. It is based on an intelligent inference of Unity’s approach, but as with all inferences, it may be incomplete or just plain wrong so please progress with caution.

Proof

Before creating a solution we needed to prove our theory. I’m using “prove” in a loose sense – we’re more trying to replicate the scenario and ensure our fix will work.

We located a standard TTT PlayerPrefs file for Unity Android 4.6.7 and presented that to TTT built with Unity 5.3.1. We indeed saw the log:

Unity   : Upgrading PlayerPrefs storage

but there was no crash. We felt somewhat vindicated that this was not an obvious case we’d missed during testing.

Working on the theory that the PlayerPrefs file needed to be large in order to trigger the OutOfMemory exception, we then quadrupled its contents. But I want to stress that this is not a big XML file, it’s 39 KB. In my old life it wasn’t uncommon to find XML files in the tens if not hundreds of MB. And also remember that prior to this release Unity was routinely handling files of this size.

When we presented this larger old format XML file to TTT sure enough, Unity crashed with the same exception seen in production. We were able to reproduce the bug and that’s a step closer to proving a fix!

The Fix

After researching Unity’s approach it seemed that we could avoid the crash if we prevented Unity from migrating the PlayerPrefs file. But we need that data in our game else upgrading players will lose their progress (if they don’t have Cloud enabled). So what if we migrate it safely ourselves, tricking Unity into thinking it had nothing to do? So we:

  1. intercept the Unity PlayerPrefs migration
  2. determine if migration is required
  3. create a new PlayerPrefs file with the new naming convention
  4. add the new data lines that Unity has created
  5. URL encode our data
  6. save the file

Tada! With this approach we were able to launch TTT with the massive 39 KB PlayerPrefs file. Unity sees the new file and assumes it has already performed the migration and continues launching the game.

This fix is now in production and I’m nervous. Like I said, it’s based on inferences and the assumption that we can do a better job of migrating our data than Unity can. I’m very conscious of hubris lurking in the shadows, but some judicious error handling and logging will hopefully show us if the fix has been successful.

Conclusion

What a waste of time. No one comes out of this well. The players are unhappy and we’ve wasted about three days of dev time.

If we make a mistake in our programming, our players suffer and we lose reputation. That’s fair. If Unity make a mistake then that damage increases many orders of magnitude. For some developers, it might be enough to put them out of business and that’s a responsibility that Unity must take seriously if they’re to live up to their goal of “democratising game development”.

To democratise can’t just mean to make a feature-rich engine that is affordable. It also has to be robust and rigorously tested. The so-called democratisation opens game development to the tiniest of companies, companies that don’t have the resources to employ (large) QA teams, so underlying engine quality is essential.

We pride ourselves in our quality – both visual and robustness. Being a small team though, we have to focus our QA attention on areas where we know we have made changes. It simply isn’t cost effective to test every feature of TTT, on every OS, on every device with every possible configuration permutation.

This eminently sensible approach only works if we’re building on a stable foundation, ie. if the game engine itself is stable. If the engine has made significant changes then we need to roll that area into our testing strategy. But no one in their right mind would expect a bug fix correcting the use of line breaks to prevent upgrading players from launching a game.

The code in question runs every time a Unity-based game is opened. Whilst the change might be minor, the code will be run literally millions, possibly billions of times. I know Unity have invested heavily in QA so it’s very worrying for a bug this simple but with such damaging potential to creep through.

James Gratton


CTO

James had a long career coding trading software for big banks. He quit it all and learned how to make games. He’s the the technical brain behind Yakuto.

Comments