A US-based silicon wafer materials and tools supplier approached Bluewater Systems to improve an image-processing algorithm used to visually verify silicon wafers. The algorithm consisted mostly of a 3x3 matrix multiply applied at each pixel of a large monochrome image.

The chosen ARM core was the StrongARM, the attraction being the relatively low cost and power consumption. The challenge was to significantly improve the execution time.

The first step was to analyse the existing C code and understand the algorithm. With that done, we could select a better algorithm if one was available, and finally hand-code the performance-critical parts of the algorithm in ARM assembler.

We then created a simple test environment which could process an image with both the original algorithm and our new algorithm, then compare the two outputs byte for byte.
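A minimal sketch of such a harness follows; the `filter_fn` signature and all names are illustrative assumptions, not the client's actual interface:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative verification harness: run the reference and optimised
 * filters over the same source image, then compare the two output
 * buffers byte for byte. */
typedef void (*filter_fn)(const uint8_t *src, uint8_t *dst,
                          int width, int height);

/* Returns -1 if the buffers match, else the offset of the first
 * differing byte. */
long first_mismatch(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return (long)i;
    return -1;
}

long verify(filter_fn ref, filter_fn opt, const uint8_t *src,
            uint8_t *ref_dst, uint8_t *opt_dst, int width, int height)
{
    ref(src, ref_dst, width, height);
    opt(src, opt_dst, width, height);
    return first_mismatch(ref_dst, opt_dst, (size_t)width * height);
}
```

Reporting the offset of the first mismatch, rather than a simple pass/fail, makes it much quicker to track an error back to a particular pixel.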

The supplied algorithm was essentially optimal as far as it went (it had been written by PhD staff at the client company who knew image processing very well). A few minor tweaks were made to the C code, but there was limited scope for improvement here.

However, inspection showed that the algorithm was loading 9 single byte values for each output byte. While the values loaded were caching well, they were being loaded a byte at a time and out of order. The code took no advantage of the fact that three consecutive bytes were loaded from each of three rows. On RISC processors which use a load/store architecture this is wasteful.
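For reference, the per-pixel pattern looked roughly like the following sketch of a generic 3x3 convolution: nine single-byte loads per output byte. The kernel, clamping and names here are assumptions for illustration, not the client's actual code:

```c
#include <stdint.h>

/* Naive per-pixel 3x3 convolution: for every output byte, nine
 * single-byte loads are issued, one per kernel tap. */
void convolve3x3_naive(const uint8_t *src, uint8_t *dst,
                       int width, int height, const int kernel[9])
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)       /* three rows  */
                for (int kx = -1; kx <= 1; kx++)   /* three bytes */
                    acc += kernel[(ky + 1) * 3 + (kx + 1)] *
                           src[(y + ky) * width + (x + kx)];
            if (acc < 0)   acc = 0;                /* clamp to a byte */
            if (acc > 255) acc = 255;
            dst[y * width + x] = (uint8_t)acc;
        }
    }
}
```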

The first step was therefore to improve on this. By loading a word and pulling the relevant bytes out of it with the barrel shifter, performance might increase. However, this would be at the cost of extra register use and instructions. This attempt was coded in C. Unfortunately, although the resulting code was about 5% faster, it was also more complex.
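The idea can be sketched as follows. This portable stand-in assembles the word byte by byte so it is endian- and alignment-safe; on the StrongARM the real code would use a single aligned LDR, with the shifts and masks folded into data-processing instructions via the barrel shifter. Helper names are illustrative:

```c
#include <stdint.h>

/* Fetch four pixel bytes with one load, then peel out the ones needed
 * for the current column with shifts and masks. */
static inline uint32_t load_word(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static inline unsigned byte_of(uint32_t word, int n)   /* n = 0..3 */
{
    return (word >> (8 * n)) & 0xffu;
}
```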

A more ambitious step was therefore tried. This involved remembering previous data values in ARM registers for the next column along. For example, if a row consists of bytes abcde, then abc are used for the first calculation and bcd for the next. By storing b and c in a register from the first calculation, they need not be reloaded in the second. This must be repeated for all three rows.

This was also coded in C. It yielded a respectable speed improvement of around 30% for an acceptable increase in complexity. This optimisation is particularly effective on RISC processors like the ARM, where plenty of registers are available.
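The sliding-window idea can be sketched for a single row as below (a flat 1x3 tap is used here purely for illustration; applied to all three rows the technique cuts loads from nine to three per output pixel):

```c
#include <stdint.h>

/* Only one new byte is loaded per output column; the two previous
 * bytes are carried in locals, which the compiler can keep in
 * registers. */
void row_sum3_sliding(const uint8_t *row, int width, int *out)
{
    unsigned a = row[0], b = row[1];    /* primed from the first two bytes */
    for (int x = 1; x < width - 1; x++) {
        unsigned c = row[x + 1];        /* the only load this column */
        out[x - 1] = (int)(a + b + c);
        a = b;                          /* slide the window along */
        b = c;
    }
}
```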

Examination of the resulting assembler code generated by the compiler indicated that time was being spent copying the two new values for each row (bc) into the two 'save values' for each row (ab for next time around). There were 6 such copies between registers. We decided therefore to unroll the loop and process 4 output pixels each time around the loop. This would avoid the copying, since the correct temporary values could simply be used directly each time. This would use a lot of registers, however.
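Unrolling by four can be sketched for a single row as follows (again using an illustrative flat 1x3 tap). The window bytes rotate through the named locals, so only two carries remain per four outputs instead of two per output:

```c
#include <stdint.h>

/* Four output columns per iteration; a remainder loop handles any
 * columns left over when the row width is not a multiple of four. */
void row_sum3_unrolled4(const uint8_t *row, int width, int *out)
{
    int x = 1;
    unsigned a = row[0], b = row[1];
    for (; x + 3 < width - 1; x += 4) {
        unsigned c = row[x + 1];
        unsigned d = row[x + 2];
        unsigned e = row[x + 3];
        unsigned f = row[x + 4];
        out[x - 1] = (int)(a + b + c);
        out[x]     = (int)(b + c + d);
        out[x + 1] = (int)(c + d + e);
        out[x + 2] = (int)(d + e + f);
        a = e;                          /* two carries per four outputs */
        b = f;
    }
    for (; x < width - 1; x++) {        /* remainder columns */
        unsigned c = row[x + 1];
        out[x - 1] = (int)(a + b + c);
        a = b;
        b = c;
    }
}
```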

Coding this algorithm was relatively straightforward; however, the resulting code was significantly slower than the original. We modified it to pack three values into a register and unpack them with the barrel shifter. This yielded an additional 20% speedup (50% in total).
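The packing trick can be sketched like this: the carried pixel values for a row are kept in a single 32-bit register instead of one register each, and unpacked on use with shifts, which the ARM barrel shifter provides essentially for free as part of a data-processing instruction. Helper names are hypothetical:

```c
#include <stdint.h>

/* Pack three 8-bit pixel values into one 32-bit word, trading two
 * registers per row for a few shift operations. */
static inline uint32_t pack3(unsigned a, unsigned b, unsigned c)
{
    return (a & 0xffu) | ((b & 0xffu) << 8) | ((c & 0xffu) << 16);
}

static inline unsigned unpack3(uint32_t packed, int slot)  /* slot 0..2 */
{
    return (packed >> (8 * slot)) & 0xffu;
}
```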

Examination of the assembler output of the C compiler showed that significant register spilling was still occurring: the stack was being used to hold temporary values that could not be kept in registers. Changing a few C compiler options (removing the frame pointer to free a register) helped a little.

The decision was then made to move to ARM assembler. Using the C compiler output as a base, we first simply removed the register-spilling operations (stack loads and stores). The code of course no longer worked, but it indicated the performance we might achieve if the spills could be eliminated. The result was convincing and showed that this route was worth following, at least for RISC processors such as the ARM.

We then worked on removing the register spilling problems. We moved the outer loop and inner loop counters out of registers and onto the stack since these were not actually used in the inner loop itself. This freed up two more registers.

By rearranging some of the code, we were able to remove all of the register spills from the inner loop. This yielded a total speed-up of around 120%, more than double the original performance.

In total the project took about a week to more than double the performance of an already reasonable algorithm. This is a significant period to spend on a single algorithm, but in this case it was well worthwhile. The newly coded algorithm allowed the client to significantly improve the performance of their device on the chosen processor and produce a useful working prototype.