Optimizing 10 lines of Arduino code: You’re wrong, so am I.

I’m going to show you a simple ten-line Arduino function, and how through surprisingly confused experimentation I was finally able to make it 40% smaller and nearly twice as fast. Along the way, you’ll see how not only were some “generally accepted” intuitions about efficiency misguided, but my own instincts and intuitions about how to improve things were no better! But when I hit something that I didn’t understand, or that surprised me, I kept digging, even if it confused me. A lot. Which it did.

Oh, and I should say this at the outset: none of what you’re about to see today requires reading “machine code”. Reading machine code can be incredibly helpful sometimes in cases like this, but today I’m going to show you what you can do just looking at the size of the executable. All you’ll see here is a ten-line function in plain C, compiled for Arduino/AVR which is a simple 8-bit CPU, and the code sizes reported by the Arduino IDE. It’s an illustration, I hope, of the idea that what matters in performance optimization is the general approach, not the particular tools you use.

The “dirty laundry” disclaimer

I usually do these sorts of performance optimization tear-downs on my own code, and I feel pretty comfortable dumping my own dirty laundry out on the table for illustrative purposes. But this time I’m starting with a snippet of code I found on-line, and I’m a little tentative about this: it’s sort of every open-source developer’s nightmare, having their code picked apart by someone else for no good reason. So let me be perfectly clear: this post is not a critique of the original code; the original code, as posted, was already much faster than it needed to be. The performance surprises came from what the compiler was generating, and even then only when I scaled up from a “20 LED” project to a thousand-LED project.  Fully scaling everything up requires a lot of rethinking, but this one function is an interesting case by itself.

If anything, this post is just a reminder that we should all be mindfully humble and presume that sometimes our instincts about code size, speed, and efficiency are completely wrong, or at least surprisingly misguided.

OK, ’nuff said. On to the code.

The Code

So here’s the code I saw today. It takes an LED index (from 0-19) and the position of a glowing ‘thing’ (also from 0-19) and returns a brightness to use for this LED from 1 (dim) to 32 (bright). If the ‘thing’ is at the same position as this LED, this LED should glow brightly. If the thing is further away, this LED should only glow dimly. Here’s the original function:

uint8_t brightness_1( const int8_t led, const int8_t pos) { 
  switch (abs(led-pos)) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

It’s compact, legible, and probably pretty efficient, I thought. I checked by compiling it, and on AVR, this source compiles to roughly 80 bytes, which is about 40 machine code instructions.

Wait. What?!? 40 instructions?! What the heck is it doing?!? Advanced calculus? All the variables are one byte (single register) each, the function is doing only two simple, simple things: one trivial absolute value calculation, and a few conditionals! What’s the compiler doing, I wondered? Forty instructions seemed like way too many, and I decided to investigate.

Baby Steps

Just as a reality check, I re-wrote the function in what I call “baby steps C”: each small step of the code written out one small piece at a time. It can help understand what the function is actually doing.

uint8_t brightness_2( const int8_t led, const int8_t pos) {
  // Absolute value, by hand
  int8_t diff;
  diff = led - pos;
  if( diff < 0 ) {
    diff = -diff;
  }

  // Conditionals, by hand
  if( diff == 0) return 32;
  if( diff == 1) return 16;
  if( diff == 2) return 6;
  if( diff == 3) return 2;
  return 1;
}

This code compiled into just 42 bytes, around 21 instructions. Smaller. Way smaller. OK, so what’s up, gcc?

Well, as noted, the code only really has two parts: the absolute value caluclation, and the conditionals, so I was pretty sure that the bloat was coming from one of those two things.

gcc ” : – P ” switch

Now I know from prior experience that particularly on AVR platforms (like Arduino), gcc really hates to create a “jump table” out of a switch statement; it almost always turns a switch into a series of “if” and “goto” statements. It turns out on AVR that’s actually often faster and smaller than a jump table, so gcc is actually often making the right choice in these cases. But if gcc is turning the “switch” in the original into a series of “if”s and “goto”s by itself anyway, why is the second version so much smaller (and faster)?

I decided to try this: do the ‘abs’ by hand, but use leave the switch in place:

uint8_t brightness_3(const int8_t led, const int8_t pos) { 
  // Absolute value, by hand 
  int8_t diff;
  diff = led - pos;
  if( diff < 0) diff = -diff;

  switch (diff) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

50 bytes, about 25 instructions. Basically “small” again. Maybe that call to “abs(…)” was the problem that was taking up all the space. I decided to confirm that simple hypothesis by switching back to the call to the abs(…) library function and checking the code size. I predicted, of course, that the code would go back to 80 bytes, more or less. I reinstated the call to abs(…) and changed the variable ‘diff’ to be unsigned, which is safe, as the result of abs(…) will never be negative.

uint8_t brightness_4(const int8_t led, const int8_t pos) { 
  uint8_t diff;
  diff = abs( led - pos );

  switch (diff) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

This version was 66 bytes. That was completely unexpected: it was neither large nor small. It was medium-sized, and completely confusing. My hypothesis that the “abs(…)” call was causing the size increase went right out the window. How was this code any different from the original 80 byte code?? It was calling abs(…), just like the original code, and it was using a switch, just like the original code. And yet, it was markedly smaller.

Well. Wait a minute. There actually were some differences. I had changed the type of ‘diff’ from signed to unsigned. I changed it back.

uint8_t brightness_5(const int8_t led, const int8_t pos) { 
  int8_t diff;
  diff = abs( led - pos );

  switch (diff) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

Size went up +6 bytes, to 72 bytes. OK, so: having ‘diff’ be a signed variable instead of unsigned was costing me a few instructions. But wait — this code was still eight bytes (four instructions) shorter than the original code, even though it was nearly identical!

If I put the “abs(led-pos)” back inside the switch, the size went back up to exactly where it started from: 80 bytes.

Desperate for anything to try next, I decided to split the difference with this odd combination: I moved the “abs(led-pos)” into the switch, but left the seemingly useless “diff=” in front of it.

uint8_t brightness_6(const int8_t led, const int8_t pos) { 
  int8_t diff;

  switch ( diff = abs( led - pos ) ) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

Back down to 72 bytes! If I took out the “diff=” it went back up to 80. Something about that assignment was changing how the compiler was seeing things.

For completeness, I tried changing the “int8_t diff” to an unsigned “uint8_t diff”:

uint8_t brightness_7(const int8_t led, const int8_t pos) { 
  uint8_t diff;

  switch ( diff = abs( led - pos ) ) {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

And with that the code shrank six bytes again — even though the variable itself was never “used” after the swich statement. That was a rather odd and interesting clue: the variable wasn’t being used, but the type of the variable was making a difference in the code size!

Just my type

I decided to see if the type made a difference by itself.

Spoiler alert: changing the data type made ALL the difference.

To save space here, I’ll show you the three different variations and how big each one was:

uint8_t brightness_8(const int8_t led, const int8_t pos) { 
//switch ( (uint8_t) ( abs( led - pos ) ) )  // 66 bytes
//switch ( ( int8_t) ( abs( led - pos ) ) )  // 72 bytes
  switch (           ( abs( led - pos ) ) )  // 80 bytes
  {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

What’s going on? Well, different data types take different amounts of work to wrangle. That makes sense. Smaller, simpler data types, like unsigned single bytes, take less work and fewer instructions. Bigger data types take more work, more instructions. When we tell the compiler to treat the result of the “abs(…)” expression as a single unsigned byte, it’s somehow taking “less work”. But less work than … what? What’s the data type that it’s defaulting to? Well, even without reading the machine code (and I promised I wasn’t going to do that), we can try an experiment. Again, I’ll save some space by showing the size of a few different variations at once:

uint8_t brightness_9(const int8_t led, const int8_t pos) { 
//switch ( (unsigned int) ( abs( led - pos ) ) )  // 74 bytes
//switch (   (signed int) ( abs( led - pos ) ) )  // 80 bytes
//switch (          (int) ( abs( led - pos ) ) )  // 80 bytes
  switch (                ( abs( led - pos ) ) )  // 80 bytes
  {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

Based on this, I’m going to say that the default data type that the compiler is using for the result of the “abs(…)” expression is simply “int”. Now, I already knew in the back of my mind that the ANSI/ISO “C” language spec says that the data type of arithmetic expressions is, by default, “int”, but I wasn’t thinking about it here.  I deal with compiler internals for a living, so knowing some of the quirks of the ANSI/ISO “C” language spec is merely an occupational hazard. But it’s also interesting to see that you can more or less derive the type that the compiler is using just from the executable size, and some experimentation.

OK, so what? Weren’t we trying to make this code smaller and faster? Yes. So, let’s do that right now. We know that expression data types are defaulting to “int”, which on Arduino is a 16-bit data type, and since it’s running on just an 8-bit platform… regular 16-bit “int” is slow. And big. Let’s fix it.

There are two expressions in the function that don’t have an explicit type: the result of the “abs(…)” call and the argument inside the “abs(…)” call. Let’s give each of those expressions and explicit type, using single byte types for everything:

uint8_t brightness_10(const int8_t led, const int8_t pos) { 
  switch ((uint8_t)( abs( (int8_t)(led - pos) ) ) )
  {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

Save, recompile, and… it’s down from 80 byte (about 40 instructions) to 56 bytes (about 28 intructions), just by explicitly specifying data types that are CPU-native.

This, right here, is the point where I felt like I could declare a victory: the new generated code is 30% smaller (and, in this case, faster), we didn’t have to make any radical changes to it, and the code is approximately as maintainable as it was when we got here. Plus, at long last, I had a clue about what was making the code so apparently bloated in the first place: it was using 16-bit “int”s all over the place simply because it hadn’t been told to use 8-bit data types.

Looking at this, one might make an intuitive assumption that since all the input data types were 8-bit, that the computations and comparisons would all be efficiently 8-bit, too. Nope. That ‘intuition’ would be just wrong.

Now you may recall that the “baby steps C” version was only 42 bytes, making it smaller (and most likely faster) than the 56 byte version we have above. Even now that we’ve gotten our data types all lined up as 8-bit values, why is the compiler generating bigger code than for our “baby steps C” version?

Ripped abs

Let’s try substituting our own hand-rolled absolute value function, but leaving the switch statement.

uint8_t brightness_11(const int8_t led, const int8_t pos) { 
  int8_t diff; 
  diff = led - pos;
  if( diff < 0 ) diff = -diff;

  switch ( diff )
  {
    case 0: return 32;
    case 1: return 16;
    case 2: return 6;
    case 3: return 2;
    default: return 1;
  }
}

A little smaller: 50 bytes (25 instructions). I’m guessing here that “abs(…)” is defined to take a 16-bit int, and that it’s promoting the 8-bit value to 16-bit, caluclating the absolute value, and return it as 16-bits. Our hand-rolled absolute value computation keeps everything as 8-bit values, which is probably why it’s smaller. If we were reading machine code today, we could check, but we aren’t, so for now we’ll just move forward.

Throw the switch

Next let’s focus on testing to see if the “switch” versus “if”s makes a difference.

uint8_t brightness_12(const int8_t led, const int8_t pos) { 
  int8_t diff; 
  diff = led - pos;
  if( diff < 0 ) diff = -diff;
  // Using if's instead of a switch 
  if( diff == 0) return 32;
  if( diff == 1) return 16;
  if( diff == 2) return 6;
  if( diff == 3) return 2;
                 return 1;
}

42 bytes (21 instructions) using hand-written “if”s versus 50 bytes (25 instructions) before with the switch statement. This is another one of those times where “widely accepted knowledge” and the ensuing intuitions go right out the window. We’re taught in school, and it’s ‘commonly accepted as true’ that (1) switch statements compile down into a “jump table”, not a collection of “if”s and “goto”s, and (2) that the resulting code is both faster and more efficient than a chain of simple hand-written ‘if’ statements. And you know what? Neither one of those widely-held beliefs is universally true.

The “myth of super-efficient switch statements” is particularly unlikely to be true if you have a relatively small number of cases (e.g. say, fewer than 15?) in your switch statement. The exact ‘cutoff’ number varies from platform to platform and compiler to compiler, but if you have only five cases like the code we’re looking at here, I’d bet a nickel that most compilers would emit machine code that’s basically a collection of “if”s and “goto”s instead of a jump table. And that means that a simple set of hand-written “if”s can potentially produce smaller, faster code. In this case, it looks like that’s exactly what’s happening.

At this point, I suspect the code is about as small as we can make it without getting into assembly language or (worse) C code that veers into unsupported/unspecified language issues. Again, this is a great point at which to declare a victory.

And while we may not be able to make the code any smaller, we can almost certaintly make it faster by putting the most common code path first in the set of “if”s.

First things first

Since there are seven special values that the delta between ‘led’ and ‘pos’ might have that need special handling (-3, -2, -1, 0, 1, 2, and 3), and a much larger number of values (everything else) for which we just want to return “1”, we should put that test first:

uint8_t brightness_13(const int8_t led, const int8_t pos) { 
  int8_t diff; 
  diff = led - pos;
  if( diff < 0 ) diff = -diff;

  // test for the most common case first
  if( diff > 3 )  return 1;
  // then handle all the more special cases
  if( diff == 3)  return 2;
  if( diff == 2)  return 6;
  if( diff == 1)  return 16;
  /* diff == 0 */ return 32;
}

This code is also 42 bytes (21 instructions) long, but it most cases, it’ll only have to execute two “if” statements (one in the absolute value computation, and one to see if ‘diff > 3’) before being able to tell what value to return (‘1’). This compares very favorably with the previous code arrangement, where the most common case was actually listed _last_.

If the common case occurs about two thirds of the time, and the other cases make up the other one third of the time all together, this code reordering results in nearly a 2X speedup in just this one function.

If every switch statement turned into a jump table every time, the order of the cases wouldn’t matter. But as we’ve seen, switch statements don’t turn into jump tables all that often; they turn into a collection of “if”s and “goto”s, and when that happens, the order of the cases actually does matter a fair amount. Again, the “intuitive understanding” of what the machine code is actually doing turns out to be way off base.

Intuitively wrong

So there you have it. A ten-line Arduino C function dissected, and reassembled into code that’s both smaller and faster, without being substantially more difficult to maintin. And along the way, we discovered that not only several “well known” intuitions about speed and code size not entirely correct, but that as I tried to figure out what was going on, my intuitions came up wrong a couple of times, too!

“Performance tuning” is called that for a reason, instead of being called “performance guessing.” You establish a metric (like, in this case, mostly just code size), and work on optimizing it through hypothesis, testing, and frequent error. Along the way, you have to make theories about what’s going on, test them, and be willing to say that your guess was wrong, or that you’re just really confused. And when that happens, you have to push forward. This is the key thing. No matter how confusing things get, you have to keep your wits and really work to gain and understanding that, ultimately, you can use to make your code really sing.

Thanks for your time; I hope you found it interesting, useful, and fun.

-Mark


18 Comments on “Optimizing 10 lines of Arduino code: You’re wrong, so am I.”

  1. Eric says:

    I hope that I’m not the first to read this, but as the first commenter, let me say, “well done!”

    This is exactly what I needed to scratch the optimization itch I’ve had since I started programming Arduino. My job has taken me from ‘where I grew up’ (C/C++) all the way to very high-level languages, and I was scared I’d lost my ability to write lean code. After reading (and, I’ll admit, re-reading) this post, I now feel confident. Thanks!

    • kriegsman says:

      Glad to hear it, Eric!
      I’ve got another post coming (someday soon?) which does the same thing, but DOES look at the AVR machine language that gcc puts out. Thanks for stopping by, and stay tuned!

  2. Giannis says:

    I was looking for optimization switches in Arduino when I came across this article, Very interesting indeed! Keep investigating!

  3. Logan says:

    I’ve been coding an arduino synthesizer, I knew about uint8_t, but I didn’t know the performance impact was so big. I replaced every possible variable with it and uint16_t in my code and although everything was working good enough before, this sped things up and removed lots of small audible glitch sounds! My code utilizes a few switch statements and I was curious if they were the way to go or not.

    It hadn’t occurred to me to take individual functions out of my code and see how big they compiled to by themselves. This is only a rough gauge of the speed of the code as instructions can loop and stuff, but it is a good comparative indicator of the codes efficiency.

  4. Hi Mark,
    This was an entertaining and instructive walk-through – well done!
    I’m returning to coding after a decades-long hiatus and I wasn’t ever all that good even when I was doing it for money. This example of how to work through function logic is the kind of thing that might have made me better then. Starting with the most common case test seems an obvious route to faster execution, but hadn’t occurred to me until you wrote it. Ah, hindsight!
    I’m looking forward to making my sluggish Arduino code a little livelier now.
    Thanks!
    Jeff

  5. John Meacham says:

    Since your range is continuous, you can get rid of almost all the conditionals (even the absolute value one) anyway. gcc can sometimes figure out this optimization, but i wouldn’t count on the avr backend doing it.

    It ends up being 24 bytes of code and 7 bytes of data for a total of 31 bytes, plus it is a lot shorter.

    By casting to uint8 the negative numbers wrap around to large ones, so the single > 6 checks for both less than zero and greater than 6 at the same time. Then you just look up the index in the table which is a single ‘ld’ instruction after adding 3 to handle both the negative and positive sides at once.

    uint8_t brightness_14(const int8_t led, const int8_t pos) {
    static const uint8_t rt[] = { 2, 6, 16, 32, 16, 6, 2};
    uint8_t diff = led – pos + 3;
    if (diff > 6)
    return 1;
    else
    return rt[diff];
    }

    • kriegsman says:

      Nice little two-step there! Thank you for showing folks yet another way to tackle this code — especially noting that there’s only one small segment of the post-subtraction number line that we care about, and that we can offset the zero point there and keep everything on the positive side. Nice code.

  6. johnmeacham says:

    On another note, you should be careful about replacing signed values with unsigned willy nilly if code size is important, which produces better code depends a lot on context, here is an example

    bool f1(int x) {
    return x + 1 > x;
    }

    bool f2(unsigned x) {
    return x + 1 > x;
    }

    code size of f1 is 4 bytes
    code size of f2 is 16 bytes or 4 times bigger.

    The reason, f1 compiles down to ‘return true’ while f2 has to check if the value is already the maximum value.

    numeric overflow in C is only defined for unsigned types, this means the compiler is free to assume that overflow never occurs for signed types when optimizing, meaning that x + 1 is always greater than x for signed types as far as the optimizer is concerned and so it optimizes it to true. unsigned types are defined in terms of modular arithmetic, so code dealing with unsigned types has to account for overflow.

    On the other hand, signed multiplication/division is a lot more complicated than unsigned, and some cpus have a signed mul instruction but not an unsigned (or vice version), so you can’t know for sure which is better without knowing a bit about your target arch and the compiler internals or by doing some experimentation.

    I highly recommend reading the assembly, gcc will spit it out with the -S option, it can be quite enlightening to compare the differences when you make changes, like you would be able to see exactly which instructions it was able to omit when you changed the signedness.

    • kriegsman says:

      I agree strongly with pretty much everything you just said. If you haven’t already seen it I think you’d really like John Regehr’s blog in which he gets elbow-deep in these issues of unspecified and undefined behaviors as well as a whole bunch of other related compiler-level topics. http://blog.regehr.org

      Me, I do this stuff (compiler-internals, language lawyering, and what you think the source code meant vs what the resultant machine code actually does and who’s fault that is) for a living and have for the last 13 years or so. So believe me when I say that I totally agree with your comment, and that (1) verifying the correctness of your compiled code has to come first, and (2) sometimes there’s just no substitute for reading the machine code!
      Thank you again for adding your thoughts here for folks to learn from.
      Tell me, what do you do for a living, and where did you learn all this ‘good stuff’?

  7. Rho says:

    Awesome! Thanks for the insight 😀
    *belief shattered*

  8. Bigred says:

    Some 5 years later and all I can say is what a great expose on whats happening under the compiler hood.Well written – well done.

  9. That’s awesome thanks, I read every word and thought about it, not the usual way to read posts on these type of forums but I really wanted to hear what you were saying.
    I used some of these optimization ideas in my code without really knowing the compiler was well understood by me or not, just expecting the code to compile in a similar shape as the code itself.

  10. Andy says:

    Very interesting, but the use of return to exit each case of what would be the switch, bothers me.

    Yes, it’s the best way to do it, in this particular case, because there’s no more code after that, but if there was, you couldn’t. I’d expect the switch to be compiled as the equivalent of something like:

    if(exp1) { … goto thebreak; }
    if(exp2) { … goto thebreak; }

    do_the_default;
    thebreak: // all done

    My guess is, if you had to code it like that, or by using if().. else if()… else… (because there was more to do, after the switch), you couldn’t get much of an improvement on the size of the code for the switch statement.

    Using uint8_t as much as possible is a useful insight though. Is there any difference between that and “unsigned byte”, other than saving a few keystrokes each time you write it?

    • kriegsman says:

      I believe that “byte” is an Arduino-specific typedef; “uint8_t” is ANSI/ISO standard in stdint.h, and so more portable. Also for many people, Arduino is their first C/C++ coding experience, and I like to encourage people into habits that they can take with them as they bring their skill to a new platform.

      As for the control flow returns/gotos question, I agree with you that in general it’s better to have diamond-shaped control flow paths, that join up again at the end.

      Interestingly, many compilers will generate the machine code with a bunch of gotos and a single return even if you write multiple returns. Depending on how much stack cleanup has to be done before returning to the caller, the generated code can be smaller that way, and the compiler knows it.

  11. fgrieu says:

    I propose

    uint8_t brightness( const int8_t led, const int8_t pos) {
    if ((led += pos)3) return 1;
    return (((0x27F>>(led<<2))&15)+1)<<1;
    }

    main changes: use led instead of dif; replace 3 of the tests with a formula.

  12. fgrieu says:

    I propose

    uint8_t brightness( const int8_t led, const int8_t pos) {
    if ((led += pos) > 3) return 1;
    return (((0x27F>>(led<<2))&15)+1)<<1;
    }

    main changes: use led instead of dif; replace 3 of the tests with a formula.

    • kriegsman says:

      Different approach for sure! How does that measure up when compiled for AVR/Arduino? AVR doesn’t have a “barrel shifter” unit on the chip, so if you have a bit-shift of variable size (like the >> in that code), I think it’s going to have to compile it into a full blown loop of shift-test-branch instructions. This would probably be the right approach on ARM or Intel architecture, but I tend to think it’ll actually be larger and slower on AVR/Arduino boards. But as I often say, “maybe I’m wrong!” Give it a test and let us know what you actually find when it’s compiled for AVR! And thanks for the comment!


Leave a comment