Algorithm Coding - Our Approach
Selecting the right algorithm for a problem is a non-trivial task. Information must be gathered about the many factors which affect the choice. Once the algorithm is chosen it must be coded, or standard code used. It may need modifying for embedded use, and small parts of it may benefit from coding in ARM assembler for performance reasons. The best algorithm may even depend on the particular ARM microprocessor being used, although this is rare.
While good algorithm design and coding is a skill which is hard to document and pass on to others, it can be learned by experience.
DETERMINING THE PARAMETERS
- the number of data to be processed — this may affect the broad algorithm choice. A good algorithm for 5 data elements may not scale to 500.
- whether the algorithm should be optimised for time or space — the two choices often point in opposite directions
- whether the algorithm is in fact critical at all — optimising a non-critical algorithm may help the ego, but is really just a waste of programming time
- the data set — can we take advantage of any knowledge of the data? For example, if 95% of the data are zero, resulting in a trivial case, then this should be taken advantage of with special code
- the ARM core being targeted — there are many factors here, including raw performance, cache type and size (if any), architecture variant (some have special DSP instructions, for example) and availability of high speed tightly-coupled on-chip memory, which may be used to advantage
- the system performance parameters — if the memory system is slow, it makes sense to load and save data rarely and in blocks, trying to re-use data as many times as possible to improve the hit-rate of the cache or fit the data in registers or on-chip memory in the ARM microprocessor. Other similar factors may point in different directions
- coding time — how much time is in fact available for coding in the budget. There may only be time for selecting a suitable algorithm, coding it in C and selecting one critical loop for hand-coding in ARM assembler. This needs to be carefully discussed at the start so that a working program is in fact obtained in time!
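The data-set point above can be sketched in C. This is a hypothetical example, assuming (as in the 95%-zero scenario described) that a zero input trivially produces a zero output; expensive_transform() is an invented stand-in for the real per-sample work.

```c
#include <stddef.h>

/* Invented placeholder for the real, costly per-sample processing. */
static int expensive_transform(int x)
{
    return (x * x + 7) / 3;
}

/* If 95% of the input samples are zero and zero maps trivially to zero,
 * testing for the trivial case first skips the expensive path almost
 * every time through the loop. */
void process(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0)
            out[i] = 0;                        /* trivial case: no work */
        else
            out[i] = expensive_transform(in[i]);
    }
}
```

Whether the extra branch pays for itself depends on the actual zero fraction and on how well the branch predicts, so the decision should be confirmed by measurement.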
There are a number of good algorithm books, and with the advent of the internet, it is often possible to simply do a web search for a suitable standard algorithm. Many very simple algorithms such as searching, sorting and 'container objects' are supported by standard C or C++ libraries.
Some algorithms are too memory-intensive to be suitable for embedded systems. Others operate correctly most of the time, but exhibit a pathological worst case which would cause the embedded system to crash or hang for an extended period. Both of these should ideally be avoided.
Converting data into a form which a particular algorithm can use may take time. Sometimes in an embedded system, a simple algorithm must be chosen because it is too expensive (either on CPU time or memory) to select the 'best' algorithm. These cases must be examined one by one.
Standard algorithms may benefit from slight modifications to improve their performance or memory usage. In extreme cases, or for relatively simple problems, an algorithm must be devised from scratch to perform the required task. Testing then becomes all the more important, as well as analysing worst-case and average-case performance.
For many algorithms a standard C or C++ implementation exists which can freely be used in commercial projects. When one does not, coding it is normally fairly straightforward.
Most algorithms have inner loops which process large amounts of data and which benefit from extra attention. Without moving to assembler it is often possible to improve the C code and produce a better result. For example, a loop which processes one data item each time around may be changed to process four instead. In some cases this simple change may double the performance.
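The four-at-a-time change described above can be illustrated with a simple summation loop. This is a sketch, not taken from any particular project; the four independent accumulators reduce per-element loop overhead and give the compiler more scheduling freedom, while a tail loop handles the leftover elements.

```c
#include <stddef.h>

/* Sum an array four elements per iteration. The main loop performs one
 * count test and branch per four elements instead of per element; the
 * tail loop picks up the remaining 0-3 items. */
int sum_unrolled(const int *data, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {       /* main loop: four items per pass */
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)                 /* tail loop: leftover items */
        s0 += data[i];

    return s0 + s1 + s2 + s3;
}
```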
In other cases, a loop may do some checking each time around, but the checking can be moved outside the loop, perhaps with a second loop added. Depending on the data this may result in a performance improvement.
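As a sketch of moving a check outside the loop: if a flag tested on every iteration cannot change while the loop runs, the test can be hoisted and the body duplicated into two simpler loops. The routine below is invented for illustration.

```c
#include <stddef.h>

/* Naively one would write a single loop containing
 *     out[i] = negate ? -in[i] : in[i];
 * which tests 'negate' on every pass. Since the flag is loop-invariant,
 * testing it once and running one of two branch-free loops removes the
 * per-iteration check entirely. */
void copy_maybe_negated(const int *in, int *out, size_t n, int negate)
{
    if (negate) {
        for (size_t i = 0; i < n; i++)
            out[i] = -in[i];
    } else {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];
    }
}
```

The cost is duplicated code, so as the document notes, whether this wins depends on the data and should be measured.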
An extreme case, but a remarkably common one, is where some very common data items can be processed by a different means. For example, in the Huffman decoding part of JPEG decompression, the next 4-16 bits must be repeatedly decoded to produce a value. This variable-length code is slow and tedious to decode bit by bit, even on a very efficient microprocessor with excellent bit-handling instructions such as the ARM. A simple 8-bit lookup table consumes less than 1KB of memory, can be set up in advance (outside the loop), and (since the 4-8-bit codes are the most common) provides a dramatic speed-up. Longer codes can be signalled by a special code in the table, and handled using the slow code.
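The lookup-table idea can be sketched as follows. This is a toy illustration of the technique, not JPEG's actual tables: the two-symbol code and the table layout are invented, and a length of zero is used as the escape marker telling the caller to fall back to the slow bit-by-bit decoder.

```c
#include <stdint.h>

typedef struct {
    uint8_t symbol;   /* decoded value for this 8-bit prefix */
    uint8_t length;   /* code length in bits; 0 = escape to slow path */
} lut_entry;

static lut_entry lut[256];

/* Build the 256-entry table once, outside the decode loop, for a toy
 * code: '0' -> 'A' (1 bit), '10' -> 'B' (2 bits); any longer prefix
 * escapes to the slow decoder. Every 8-bit window starting with a
 * short code maps to the same entry, so one lookup resolves it. */
static void build_lut(void)
{
    for (int i = 0; i < 256; i++) {
        if ((i & 0x80) == 0) {
            lut[i].symbol = 'A';
            lut[i].length = 1;
        } else if ((i & 0xC0) == 0x80) {
            lut[i].symbol = 'B';
            lut[i].length = 2;
        } else {
            lut[i].length = 0;     /* long code: take the slow path */
        }
    }
}

/* Decode one symbol from an 8-bit window of the bitstream. Returns the
 * symbol and the bits consumed, or -1 when the table says 'escape'. */
static int decode_next(uint8_t window, int *consumed)
{
    lut_entry e = lut[window];
    if (e.length == 0)
        return -1;                 /* caller uses the bit-by-bit decoder */
    *consumed = e.length;
    return e.symbol;
}
```

The real JPEG version is larger, but the shape is the same: one table index replaces up to eight iterations of the bit-serial loop for the common codes.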
PROFILING AND TESTING
Performance analysis is an essential part of algorithm coding. As a change or improvement is made, it is important to be sure that it has actually changed things for the better rather than introduced a backwards step. This is particularly difficult to second-guess when processor cache and memory systems come into play.
It is therefore common to optimise an algorithm on a board with a logic analyser or scope connected, so that accurate timings can be taken. The system is brought to the same starting state each time, then the data are run through the algorithm and the performance measured. Having said this, if you need a scope to detect the performance improvement from your last several hours work, then you are probably well into the region of diminishing returns. Performance can also be measured conveniently using a microprocessor simulator such as that available in the ARM RealView tools.
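When neither a scope nor a simulator is to hand, a coarse host-side measurement can still rank two implementations of the same routine. The sketch below, assuming only the standard C library, times a candidate function over many runs from the same starting state so that timer resolution does not dominate; work() is an invented placeholder for the routine under test.

```c
#include <time.h>

/* Time 'runs' calls of fn and return the elapsed time in milliseconds.
 * clock() measures CPU time on most hosts; resolution is coarse, hence
 * the repeated runs. */
static double time_ms(void (*fn)(void), int runs)
{
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn();
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload; 'sink' stops the compiler deleting the loop. */
static volatile long sink;
static void work(void)
{
    long s = 0;
    for (int i = 0; i < 100000; i++)
        s += i;
    sink = s;
}
```

Such numbers only compare alternatives on the same host; absolute figures for the target still need the board, analyser or simulator measurements described above.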
Testing is critical when optimising routines. How frustrating would it be to spend an entire day recoding an algorithm based on a brilliant idea you had in the morning, making it faster and faster, only to find at the end of the day that the new algorithm doesn't actually work. Backtracking is difficult, unless you checked in each revision, since it is not possible to know whether you broke the code at 9:30 or 4:30.
A simple test which can be quickly run each time the algorithm changes will avoid this pitfall. This is particularly true of hand-coded assembler, which is very prone to coding errors.
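A minimal sketch of such a quick test, under the common arrangement of keeping a trusted reference implementation alongside the optimised one: run both over a batch of pseudo-random inputs with a fixed seed (so failures reproduce) and compare the results. The two sum routines are invented stand-ins for a real reference/optimised pair.

```c
#include <stdlib.h>
#include <stddef.h>

/* Trusted, obviously-correct reference. */
static int sum_ref(const int *d, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += d[i];
    return s;
}

/* The 'optimised' routine under test (here, unrolled by two). */
static int sum_fast(const int *d, size_t n)
{
    int s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += d[i];
        s1 += d[i + 1];
    }
    for (; i < n; i++)
        s0 += d[i];
    return s0 + s1;
}

/* Run many random trials; returns 1 if the optimised routine always
 * matches the reference, 0 on the first mismatch. */
static int quick_check(void)
{
    int buf[64];
    srand(12345);                       /* fixed seed: reproducible */
    for (int trial = 0; trial < 100; trial++) {
        size_t n = (size_t)(rand() % 64);
        for (size_t i = 0; i < n; i++)
            buf[i] = rand() % 2001 - 1000;
        if (sum_fast(buf, n) != sum_ref(buf, n))
            return 0;
    }
    return 1;
}
```

Run after every change, a check like this pins down exactly which revision broke the code.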
HAND-CODING IN ARM ASSEMBLER
Coding in ARM assembler is relatively straightforward. It is easy to learn and instantly attractive once you grasp the basic concepts.
Hand-optimising ARM code, on the other hand, is a skill learned only from long experience. The ARM lends itself to hand-coding due to features such as load/store multiple instructions, the barrel shifter and a relatively generous register set. While there are many tricks and techniques which we could write about, the overall requirement is experience of applying ARM assembler to many, many tasks. This is, after all, why a skilled programmer is able to beat a C compiler with over 100 man-years of development behind it.
Many times we have been sent problem code for improvement. Sometimes we have found glaringly obvious improvements, and doubled the speed in 30 minutes. Most times we have had to work hard for a 50% speed-up, and harder still for the next 20%. Many techniques and tricks are involved, including cunning use of ARM assembler, but the key determination of performance is the chosen algorithm, regardless of its coding.