Planet Rendering - Appendix A (Performance Tuning)

Posted 23 Jun 2012 by Dean Harding

I mentioned way back in part one of this series that I wasn't going to be too concerned with performance - as long as I could generate a 512x512 image on my desktop in under a second it "should" be fast enough on the phone.

Well, now that I actually have completed images (for the most part), it turns out that some of them do in fact take about 1 second to generate on my desktop (depending on the components, some take much less -- for example, a "terran" planet with only a bit of perlin noise and a uniform atmosphere only takes 250ms). The problem is, that does in fact translate to pretty bad performance on the phone.

The first thing I did on the phone was move rendering of the planets to a low-priority background thread, and then cache the generated images. I also limit the size of the images to exactly the size required for the device (150x150 on a Galaxy S or 200x200 pixels on my higher-pixel-density Galaxy Nexus).
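To make the "background thread plus cache" idea concrete, here is a minimal sketch in plain Java. It is not the actual game code: PlanetRenderer, PlanetImageCache and the "key@size" cache key are all names invented for illustration, and on Android you would likely use the platform's own threading and bitmap types instead.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical renderer interface: returns size*size ARGB pixels for a planet. */
interface PlanetRenderer {
    int[] render(String planetKey, int size);
}

/** Renders each planet at most once per size, on a single low-priority thread. */
final class PlanetImageCache {
    private final Map<String, CompletableFuture<int[]>> mCache = new ConcurrentHashMap<>();
    private final ExecutorService mExecutor = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "planet-render");
        thread.setPriority(Thread.MIN_PRIORITY); // don't compete with the UI thread
        thread.setDaemon(true);                  // don't keep the process alive
        return thread;
    });
    private final PlanetRenderer mRenderer;

    PlanetImageCache(PlanetRenderer renderer) {
        mRenderer = renderer;
    }

    /** Returns the cached image, or queues a render the first time a key/size pair is requested. */
    CompletableFuture<int[]> get(String planetKey, int size) {
        String cacheKey = planetKey + "@" + size;
        return mCache.computeIfAbsent(cacheKey, key ->
                CompletableFuture.supplyAsync(() -> mRenderer.render(planetKey, size), mExecutor));
    }
}
```

Because the size is part of the cache key, a 150x150 request and a 200x200 request for the same planet are rendered and stored separately, and repeated requests for the same pair share one result.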

Baseline Timing

So let's take the following "inferno" planet image as a baseline:

Baseline planet

This is an "inferno" planet, and to show you the XML used to generate it as well as the time taken, here it is in my little test application:

This planet makes use of pretty much all of the features that I developed, so it's a good place to look for performance issues. You can see, though, that it takes quite a while to render: 844ms or so (I usually click "Refresh" a few times to make sure initial startup time is not included).
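Clicking "Refresh" a few times is a manual warm-up. The same idea can be sketched in code, assuming the render can be wrapped in a Runnable; RenderTimer and averageMillis are hypothetical names, not part of the test application.

```java
/** Times a task after a few untimed warm-up runs, so start-up cost is excluded. */
final class RenderTimer {
    /** Returns the mean wall-clock time in milliseconds over timedRuns executions. */
    static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        // Warm-up: class loading, JIT compilation and cache filling happen here.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / timedRuns;
    }
}
```

Averaging over several timed runs also smooths out the odd run that gets hit by a garbage collection, which matters when the changes being measured are only worth 20ms or so.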

Worse, even though on the phone it's generating a much smaller image, it actually takes longer than that to generate. On my Galaxy Nexus, it's between 2 and 5 seconds. On my Nexus S it's usually between 1.5 and 3 seconds. There are a few possible reasons for that:

1. It's more sensitive to garbage collection - looking at the output in LogCat, it does at least half a dozen GCs while generating the images.
2. Dalvik is just that much slower, or perhaps there are certain idioms that are just slow in Dalvik.
3. Some dumb choices on our own part.

No. 2 is kind of hard to do anything about. One issue is that the JVM on different devices is, well, different, so optimizing for one device may not necessarily yield results (or may even make things worse) on another. No. 1 is pretty universal though: allocating objects (and the resulting garbage collection) can be quite expensive. So we will be going through our code and trying to reduce the number of GCs that we trigger. Of course, No. 3 is quite likely too :-)

Simple Stuff First

But first, there are some very simple things we can do. Notice that in the XML definition of our planet, the atmosphere's template is defined to run from octave 3 to 8. If we simply drop that 8 to 6, we reduce the number of octaves to calculate without drastically affecting image quality. You can see this in the screenshot below:

We've dropped a whole 200ms just by adjusting that one parameter! And the image looks almost identical (if you look closely, there are some slight differences in the outer atmosphere, but nothing to write home about).
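Why does the octave range matter so much? Here is a rough sketch of a typical octave loop. It is not the PerlinNoise class from this series: the bilinear interpolation, the persistence parameter and the call counter are all assumptions made for illustration. The point is only that every octave costs another full round of lattice lookups per pixel, while contributing less and less to the result.

```java
import java.util.Random;

/** Illustrative multi-octave value noise that counts how many raw lookups it performs. */
final class OctaveNoiseSketch {
    private final long mSeed;
    private final Random mRand = new Random();
    long rawNoiseCalls; // incremented on every lattice lookup

    OctaveNoiseSketch(long seed) {
        mSeed = seed;
    }

    private double rawNoise(int x, int y, int octave) {
        rawNoiseCalls++;
        long seed = ((octave * 1000000L) + (x * 1000000000L) + (y * 100000000000L)) ^ mSeed;
        mRand.setSeed(seed);
        return (mRand.nextDouble() * 2.0) - 1.0;
    }

    private static double lerp(double a, double b, double t) {
        return a + (b - a) * t;
    }

    /** Bilinear interpolation between the four lattice points around (x, y). */
    private double interpolatedNoise(double x, double y, int octave) {
        int x0 = (int) Math.floor(x);
        int y0 = (int) Math.floor(y);
        double fx = x - x0;
        double fy = y - y0;
        double top = lerp(rawNoise(x0, y0, octave), rawNoise(x0 + 1, y0, octave), fx);
        double bottom = lerp(rawNoise(x0, y0 + 1, octave), rawNoise(x0 + 1, y0 + 1, octave), fx);
        return lerp(top, bottom, fy);
    }

    /** Sums octaves startOctave..endOctave (inclusive), scaling each one by persistence. */
    double noise(double u, double v, int startOctave, int endOctave, double persistence) {
        double total = 0.0;
        double amplitude = 1.0;
        for (int octave = startOctave; octave <= endOctave; octave++) {
            double frequency = 1 << octave; // each octave doubles the detail
            total += interpolatedNoise(u * frequency, v * frequency, octave) * amplitude;
            amplitude *= persistence; // ...but contributes less to the final value
        }
        return total;
    }
}
```

In this sketch, octaves 3 to 8 cost 24 raw lookups per sample and octaves 3 to 6 cost 16, so a third of the work disappears. With a persistence of 0.5 (an assumed value), the two dropped octaves account for under 5% of the total amplitude, which fits with the image looking almost identical.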

Taking out the trash

So if adjusting one parameter of our Perlin noise function has such a large effect, perhaps we should take a look at what the Perlin Noise function does. Recall from part 4 that our rawNoise function looks like this:

private double rawNoise(int x, int y, int octave) {
    long seed = ((octave * 1000000L) + (x * 1000000000L)
              + (y * 100000000000L)) ^ mRawSeed;
    double r = new Random(seed).nextDouble();

    // we want the value to be between -1 and +1
    return (r * 2.0) - 1.0;
}

You can see here that we're allocating a new Random object every time we call rawNoise. And we call rawNoise a lot. Luckily, the Random class has a setSeed method which we can use instead.

private final Random mRawRand = new Random(); // member field, created once and reused

private double rawNoise(int x, int y, int octave) {
    long seed = ((octave * 1000000L) + (x * 1000000000L)
              + (y * 100000000000L)) ^ mRawSeed;
    mRawRand.setSeed(seed);
    double r = mRawRand.nextDouble();

    // we want the value to be between -1 and +1
    return (r * 2.0) - 1.0;
}

With that simple change, we're now generating less garbage each loop.

Hmm, OK, so that saves us about 20ms. Not great.
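It's worth checking that the change doesn't alter the generated planets. The java.util.Random documentation states that new Random(seed) is equivalent to creating a Random and then calling setSeed(seed), so both versions should return identical values. Here is a small sketch that compares them side by side; RawNoiseComparison is a name made up for this example.

```java
import java.util.Random;

/** Compares the allocating and the reusing version of rawNoise. */
final class RawNoiseComparison {
    private final long mRawSeed;
    // Reused between calls. The setSeed/nextDouble pair is not atomic, so this
    // instance should only ever be used from one thread at a time.
    private final Random mRawRand = new Random();

    RawNoiseComparison(long rawSeed) {
        mRawSeed = rawSeed;
    }

    private long seedFor(int x, int y, int octave) {
        return ((octave * 1000000L) + (x * 1000000000L) + (y * 100000000000L)) ^ mRawSeed;
    }

    /** Original version: one new Random object (and one piece of garbage) per call. */
    double allocating(int x, int y, int octave) {
        double r = new Random(seedFor(x, y, octave)).nextDouble();
        return (r * 2.0) - 1.0;
    }

    /** Tuned version: re-seeds the shared Random instead of allocating. */
    double reusing(int x, int y, int octave) {
        mRawRand.setSeed(seedFor(x, y, octave));
        double r = mRawRand.nextDouble();
        return (r * 2.0) - 1.0;
    }
}
```

One caveat: if rendering is ever spread across several threads, each thread would need its own noise object (or its own Random), otherwise one thread's setSeed could land between another thread's setSeed and nextDouble.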

Helper Classes

The helper classes (Vector3, Colour, etc) are generally pretty simple, but they get used a lot. Let's look at our Colour helper class:

public class Colour {
    public int argb;

    public Colour() {
        argb = 0x00000000;
    }
    public Colour(int argb) {
        this.argb = argb;
    }

    . . .
}

So it's basically just a simple wrapper around a 32-bit integer. Every time we call setPixelColour or do some interpolation between colours or whatever, it's generating a new piece of garbage. What if we just make the Colour class a bunch of static methods that work directly on int "argb" values?

Well I tried that, and it turns out it makes no difference whatsoever. This is why you need to test every change: there's no point making your code less readable if it makes no difference. It's also why you need source code control, even for one-man projects - I can revert that change and try something else! One thing I also wanted to try was changing the basic "storage" of the Colour from an integer argb value to 4 doubles. Most of the calculation is done with doubles, and it's really only the rendering into the final image that needs a 32-bit integer, so it might be faster to do all of our calculations without constantly converting to and from doubles.

Turns out that makes a bit of a difference, and it actually makes the class simpler and easier to understand. So now we're down to around 600ms, another 20ms shaved off.
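For illustration, a double-based colour class might look something like the sketch below. This is not the actual Colour class: the name ColourSketch, the 0..1 component range and the method names are all assumptions. The idea it shows is that packing into a 32-bit integer happens exactly once, when the pixel is written.

```java
/** Illustrative colour type that stores four doubles and packs to ARGB only on output. */
final class ColourSketch {
    final double a, r, g, b; // each component nominally in the range 0..1

    ColourSketch(double a, double r, double g, double b) {
        this.a = a;
        this.r = r;
        this.g = g;
        this.b = b;
    }

    /** Unpacks a 32-bit ARGB value, e.g. one parsed from the XML template. */
    static ColourSketch fromArgb(int argb) {
        return new ColourSketch(
                ((argb >>> 24) & 0xFF) / 255.0,
                ((argb >>> 16) & 0xFF) / 255.0,
                ((argb >>> 8) & 0xFF) / 255.0,
                (argb & 0xFF) / 255.0);
    }

    /** Linear interpolation done entirely in doubles: no packing or unpacking. */
    static ColourSketch interpolate(ColourSketch lhs, ColourSketch rhs, double t) {
        return new ColourSketch(
                lhs.a + (rhs.a - lhs.a) * t,
                lhs.r + (rhs.r - lhs.r) * t,
                lhs.g + (rhs.g - lhs.g) * t,
                lhs.b + (rhs.b - lhs.b) * t);
    }

    private static int toByte(double value) {
        double clamped = Math.max(0.0, Math.min(1.0, value));
        return (int) Math.round(clamped * 255.0);
    }

    /** Packs to 32-bit ARGB; only needed when writing the final image. */
    int toArgb() {
        return (toByte(a) << 24) | (toByte(r) << 16) | (toByte(g) << 8) | toByte(b);
    }
}
```

Note that interpolate still allocates a new object per call in this sketch; writing the result into an existing instance instead would be the next thing to try if garbage turned out to matter here.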

Perlin Noise

In my benchmarks, it's actually the PerlinNoise class that takes most of the time. If you look at the definition for our inferno planet, the PerlinNoise class is actually used twice and that's where many of the performance issues come from.

One thing that I've noticed is that while I implemented a "smooth" option, I never actually use it. By taking it out, I get another 20ms or so of performance.

Summary

So all-in-all, we did alright for ourselves: from 844ms down to 580ms, a saving of about 31%. I'm sure with a bit more effort (and perhaps a bit more tuning of the template parameters) we could do even better.

Series Index

Here are some quick links to the rest of this series: