Posts: 9
Registered: ‎04-04-2017

EFM32HG vs EFM8UB1 float math Benchmark

I've done a basic float math performance benchmark of the EFM32HG vs the EFM8UB1, to find out at least roughly where these two devices stand next to each other.

 

My questions are at the end of the post.

 

Method:

 

There are 12000 float calculations per program loop (+, -, *, /).
Duration is acquired by measuring a port output with an oscilloscope.
The port output is negated once per program loop, so one period on the oscilloscope = two loops = 24000 calculations = the measured result.

 

--------------------------------
Code EFM8:
--------------------------------

 

#define AMNT 3000
float a, b, c, d;
float x = 1.000123f;
unsigned int i;

while (1){

      a = 3.1f;
      b = 4.1f;
      c = 4.1f;
      d = 500000.1f;

      for (i = 0; i < AMNT; i++){

            a *= x;
            b += x;
            c -= x;
            d /= x;
      }

      PROBE = !PROBE;
}

 

--------------------------------
Code EFM32:
--------------------------------

#define AMNT 3000
float a, b, c, d;
float x = 1.000123f;
unsigned short i;
bool switchIt = false;

while (1) {

      a = 3.1f;
      b = 4.1f;
      c = 4.1f;
      d = 500000.1f;

      for (i = 0; i < AMNT; i++) {

            a *= x;
            b += x;
            c -= x;
            d /= x;
      }

      switchIt = !switchIt;

      if (switchIt) GPIO_PinOutSet(gpioPortC, 0);
      else GPIO_PinOutClear(gpioPortC, 0);
}


----------------------------------------------------------
Extra measurement - code variation "V2":
----------------------------------------------------------

 

Changing the "for" loop to the following:

 

for (i = 0; i < AMNT; i++) a *= x;
for (i = 0; i < AMNT; i++) b += x;
for (i = 0; i < AMNT; i++) c -= x;
for (i = 0; i < AMNT; i++) d /= x;

 

---------------------------------------------
Results:
---------------------------------------------

 

EFM32HG @ 21 MHz:

a) Optimization level: none

Duration: 326.5 ms (V2: 271.5 ms)
Size of added code: 3624 bytes

EFM8UB1 @ 48 MHz:

a) Optimization level: "Favor speed" + level 0 optimization

Duration: 304.4 ms (V2: 316.1 ms)
Size of added code: 1050 bytes

b) Optimization level: "Favor size" + level 0 optimization

Duration: 323 ms (V2: 347.2 ms)
Size of added code: 1026 bytes

c) Optimization level: "Favor size" + level 9 optimization

Duration: 324.7 ms (V2: 351.3 ms)
Size of added code: 1003 bytes

 

When I scale the "a)" measurements to the core frequency of the other chip, I come to the following conclusions:

 

1. The EFM8 at 21 MHz would take 695.8 ms (V2: 722.5 ms), i.e. 304.4 ms * 48/21.
2. The EFM32 is 2.13 times (V2: 2.66x) faster in these calculations at the same frequency.

 

My questions:

 

1. Given that the 8-bit core has a much harder time (more instructions to execute) calculating floats than the 32-bit core does, I expected the EFM32 to outperform the EFM8 by about 4x at the same frequency.

Are the results about right, or is there a major flaw in my approach?
An explanation would help.

 

2. Also, given that the 32-bit core has fewer instructions to execute,

shouldn't its code be smaller than the 8-bit version?

I observed the opposite: EFM32 - 3624 bytes, EFM8 - 1050 bytes. Why?

 

Thanks

Posts: 488
Registered: ‎02-21-2014

Re: EFM32HG vs EFM8UB1 float math Benchmark

1) 8-bit devices suck at 32-bit math. The 8051 can only operate on one byte at a time, so there is a ton of moving pieces of the number in and out of various registers to perform each operation. If you repeat the experiment with 16-bit math, you'll probably see the HG and UB1 perform about the same. If you repeat the experiment with 8-bit math, you should see the UB1 pull ahead of the HG.
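
For illustration only, a 16-bit integer version of the inner loop could look roughly like this (the loop structure and AMNT are carried over from the original benchmark; the initial values are arbitrary):

#define AMNT 3000
/* hypothetical 16-bit variant: only the operand width differs from the float version */
unsigned short a = 3, b = 4, c = 4, d = 50000;
unsigned short x = 3;
unsigned int i;

for (i = 0; i < AMNT; i++){

      a *= x;
      b += x;
      c -= x;
      d /= x;
}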

 

2) 8051 instructions can be as small as 1 byte (though usually 2, sometimes 3). ARM instructions are usually 4 bytes, though I think we usually compile in Thumb mode, which would be 2 bytes per instruction. So the instructions themselves are larger. Also, the ARM architecture is read/modify/write, so there is a lot of extra loading of values into registers in order to modify them, then storing them back into RAM. The 8051 can do a lot of this in place instead.

 

Of course, erikm or vanmierlo (or some other expert) will probably come in here and tell me how wrong I am, but that's my understanding.

Posts: 9
Registered: ‎04-04-2017

Re: EFM32HG vs EFM8UB1 float math Benchmark

Thank you for your input.

 

I'm aware that in the case of the 8-bit core the work has to be done in many small chunks, making the 32-bit core outperform it significantly at the same core frequency.

 

The picture, however, changes when we compare these MCUs at their maximum internal oscillator frequencies, that is 48 MHz for the EFM8UB1 and 21 MHz for the EFM32HG.

 

Where the EFM32 outperformed significantly in per-MHz performance, it might now win by perhaps 10%, and that may not justify migrating the project to that MCU, especially given that the EFM32HG309 costs roughly 60-70% more than the EFM8UB1(2) and requires more footprint on the PCB.

 

My decision making is now a balance of performance + cost + PCB footprint requirements.

 

This is the reason I need to know whether my method of measuring is more or less correct and whether such results are to be expected.

 

The second part of your reply regarding the code size issue makes more sense to me now, thanks.

Posts: 582
Registered: ‎09-18-2015

Re: EFM32HG vs EFM8UB1 float math Benchmark

Running EFM32 code without optimization is not quite the same as running EFM8 code without optimization.

 

In particular, the GCC Debug configuration used in Simplicity Studio imposes a fair bit of overhead to maintain traceability and local variable scope.

 

You should probably build your EFM32 project using the Release configuration and change the optimization level from -O3 to -O1, as this would be more in line with what you're getting from what I'm assuming is the Keil C compiler on your 8-bit code.

 

John

Posts: 9
Registered: ‎04-04-2017

Re: EFM32HG vs EFM8UB1 float math Benchmark

JohnB,

 

That's a very good point.

 

I've done as you suggested, re-ran the EFM32 benchmark, and I'm getting slightly improved results.

 

Also, the following code has been added to the EFM32 version to keep the compiler from eliminating the otherwise unused variables:

 

if (a + b + c + d > 0) switchIt = !switchIt; // ~4/12000 = adds ~0.033% extra negligible delay
else switchIt = !switchIt;

 

"-O0" 326.5 ms (V2: 280.9 ms)
"-O1" 312.8 ms (V2: 249.8 ms)
"-O2" 311.6 ms (V2: 249.3 ms)
"-O3" 311.5 ms (V2: 249.3 ms)
"-Os" 313.4 ms (V2: 265.5 ms)

 

Conclusions, having picked the fastest configuration:

 

Both MCUs at equal frequency:

 

EFM32HG is 2.234 times faster (V2: 2.898 times faster) than EFM8UB1

 

Both MCUs at their top internal oscillator frequencies (EFM32 @ 21 MHz and EFM8 @ 48 MHz):

 

EFM32HG is 1.023 times slower (V2: 1.268 times faster) than EFM8UB1

 

Thanks for the suggestion.

Posts: 488
Registered: ‎02-21-2014

Re: EFM32HG vs EFM8UB1 float math Benchmark

You can also declare variables 'volatile' to prevent the compiler from removing them. This tells the compiler that the variable can be changed outside the scope of the core's operation, so it has the side effect that the compiler won't perform any optimizations involving caching the variable's value for later use or eliminating redundant reads. So, for example:

 

 

volatile char x, y;

x = y;
x = y;

 

 

In the non-volatile case, the compiler would see that the two reads of y to x are redundant and optimize one of them out. In the volatile case, this code would produce two reads of y to x.
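
Applied to your benchmark, a minimal sketch would be to change only the declarations (note that volatile forces every access through memory, which adds some overhead of its own to the measured times):

/* volatile keeps the otherwise unused results from being optimized away,
   at the cost of an explicit load/store for every access */
volatile float a, b, c, d;
float x = 1.000123f;

This would avoid the need for the extra if/else trick at the end of the loop.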

 

Posts: 9
Registered: ‎04-04-2017

Re: EFM32HG vs EFM8UB1 float math Benchmark

Thank you both for your contribution, I have learned something new.

 

I would like to note that I find it interesting that the V2 variant of the code runs considerably faster on the EFM32 than the original code does; it makes me wonder why.

 

If this sort of increase in performance is predictable just by coding the right way, then it's worth knowing how, isn't it?

Posts: 3,150
Registered: ‎02-07-2002

Re: EFM32HG vs EFM8UB1 float math Benchmark

@Jacobido, Nice experiment.

@BrianL, Thanks for inviting me into the discussion ;-)

 

First, any Cortex-M (like EFM32) only supports Thumb instructions, so expect instructions to be 2 bytes.

 

Further, I guess this very much depends on register pressure. Both CPU cores work fastest on data that resides in registers. The 8051 only has eight 8-bit registers while the ARM has sixteen 32-bit ones. This probably explains why the 4 separate loops run faster than 1 bigger loop.

 

After that comes the speed of operating on data in memory. The 8051 can do quite a lot in place in direct memory and a little less in indirect memory. The ARM must perform 3 instructions to read, then modify, and finally write the result.

 

Then, floating point is done in software on both CPUs, as neither has a hardware floating point unit. And floating point is not quite the same as doing 16-bit or 32-bit integer math. The library implementing the operations can also take shortcuts to trade accuracy against speed. Maybe the ARM even uses hard faults to catch unsupported floating point instructions and then emulates them in software.

 

It would be instructive to dive into the generated and linked in assembly code to see how they differ.

Posts: 9
Registered: ‎04-04-2017

Re: EFM32HG vs EFM8UB1 float math Benchmark

Thanks for the in-depth explanation, @vanmierlo, it makes more sense now.

 

Posts: 9
Registered: ‎04-04-2017

Re: EFM32HG vs EFM8UB1 float math Benchmark

With regard to the memory model of the 8051, there should still be some room for improvement, as I've used the xdata model as opposed to pdata or data.

Posts: 3,150
Registered: ‎02-07-2002

Re: EFM32HG vs EFM8UB1 float math Benchmark

pdata and xdata accesses on an 8051 are read-modify-write just like on an ARM. I expect the 8051 math library does not use them much, in order to achieve this performance.

Posts: 8,176
Registered: ‎08-13-2003

Re: EFM32HG vs EFM8UB1 float math Benchmark


I think we usually compile in Thumb mode

 

you better, cortex is thumb only

erik
Posts: 75
Registered: ‎09-03-2015

Re: EFM32HG vs EFM8UB1 float math Benchmark

Hi @Jacobido

 

This is a really interesting question.

I noticed that you are using an if statement and the GPIO_PinOutSet/Clear functions for toggling a GPIO pin on the EFM32HG. This can be done with a single function call instead.

 

    GPIO_PinOutToggle(gpioPortC, 0);


This will use the DOUTTGL register to toggle the GPIO and give you some improvement. One thing to note about the emlib functions is that they have optional argument validation that will impact any benchmark. The DEBUG_EFM macro is used to turn this argument validation on, so when benchmarking on EFM32 devices you should double-check the macros you define, since this can have a size and speed impact.
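
Roughly, the mechanism behind this looks like the following (paraphrased from memory, not the exact emlib source; see em_assert.h in your SDK for the real definition):

/* Sketch of the emlib argument-validation pattern: with DEBUG_EFM defined,
   every emlib call pays for the parameter check. */
#if defined(DEBUG_EFM)
#define EFM_ASSERT(expr)    ((expr) ? ((void)0) : assertEFM(__FILE__, __LINE__))
#else
#define EFM_ASSERT(expr)    ((void)(expr))
#endif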

I took a look at the assembly created by GCC when compiling the benchmarking function with high optimization. Here is the assembly listing; it's quite nice.

00000f50 <foo>:
     f50:   2201        movs    r2, #1
     f52:   4904        ldr r1, [pc, #16]   ; (f64 <foo+0x14>)
     f54:   4b04        ldr r3, [pc, #16]   ; (f68 <foo+0x18>)
     f56:   3b01        subs    r3, #1
     f58:   b29b        uxth    r3, r3
     f5a:   2b00        cmp r3, #0
     f5c:   d1fb        bne.n   f56 <foo+0x6>
     f5e:   660a        str r2, [r1, #96]   ; 0x60
     f60:   e7f8        b.n f54 <foo+0x4>
     f62:   46c0        nop         ; (mov r8, r8)
     f64:   40006000    .word   0x40006000
     f68:   00000bb8    .word   0x00000bb8


This assembly function translates back to the following C code:

while (1)
{
  for (int i=0; i<3000; i++)
  {}
  GPIO_PinOutToggle(gpioPortC, 0);
}


This is really fast on the EFM32HG, but not what you were looking for, I believe. As was mentioned earlier in the thread, you could add volatile to try to force the compiler to do something else; however, doing this will most likely not benchmark what you are going to do in your application. I suggest changing the benchmark to be more like what you want to do in your application, to see which MCU is the best fit. If you want to test float operations in a more isolated way, you could benchmark some functions like these, for instance:

void float_mul(unsigned n, float * a, float x) {
  for (unsigned i = 0; i < n; i++) {
    a[i] = a[i] * x;
  }
}

void float_add(unsigned n, float * a, float x) {
  for (unsigned i = 0; i < n; i++) {
    a[i] = a[i] + x;
  }
}

void float_sub(unsigned n, float * a, float x) {
  for (unsigned i = 0; i < n; i++) {
    a[i] = a[i] - x;
  }
}

void float_div(unsigned n, float * a, float x) {
  for (unsigned i = 0; i < n; i++) {
    a[i] = a[i] / x;
  }
}


And use varying arguments and array sizes to see how it affects performance, for example with a driver loop like the sketch below.
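
Here is a possible driver loop (a sketch only; the buffer size N and fill value are arbitrary, and the GPIO toggle is the same probe method as in your benchmark, so the refill loop adds a little overhead to the measured period):

#define N 1000
float buf[N];
unsigned i;

while (1) {
      for (i = 0; i < N; i++) buf[i] = 1.5f;   /* reset the operands each pass */
      float_mul(N, buf, 1.000123f);            /* time just the N multiplies */
      GPIO_PinOutToggle(gpioPortC, 0);         /* one probe edge per pass */
}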