Algorithm Coding
In most systems there are a few critical algorithms which must be implemented efficiently. A good implementation may dramatically improve the performance of the entire system.
To many people this means coding in assembler. Assembler, they say, is the only way to get the best performance out of an embedded CPU.
In fact, efficient algorithm coding is not just about writing in assembler. A bubble sort of n=1000 items is still a very poor sort when written in assembler, because its work grows with the square of n, and it will be much slower than a quicksort of the same 1000 items written in C. We recognise this fact, and we can put our considerable algorithmic knowledge and experience to work to select the best algorithm for a given problem.
Also, since the ARM is such an excellent C target, C and C++ compilers can often do remarkably well at turning a well-written C algorithm into efficient assembler.
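The bubble sort versus quicksort comparison above can be made concrete by counting comparisons rather than timing anything. The following is a minimal sketch, not a benchmark: all names in it (compare_count, bubble_sort, count_bubble_compares and so on) are made up for illustration, and the standard library's qsort() stands in for quicksort even though the C standard does not require it to be implemented that way.

```c
#include <stddef.h>
#include <stdlib.h>

enum { N = 1000 };                     /* problem size used in the text */

/* Number of element comparisons made so far (illustrative counter). */
static unsigned long compare_count;

/* Three-way integer comparison that also bumps the counter. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    compare_count++;
    return (x > y) - (x < y);
}

/* Textbook bubble sort with no early exit: n*(n-1)/2 comparisons. */
void bubble_sort(int *v, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            if (compare_ints(&v[j], &v[j + 1]) > 0) {
                int t = v[j];
                v[j] = v[j + 1];
                v[j + 1] = t;
            }
        }
    }
}

/* Deterministic pseudo-random fill (simple linear congruential step). */
void fill_pseudo_random(int *v, size_t n, unsigned long seed)
{
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1103515245UL + 12345UL;
        v[i] = (int)((seed >> 16) & 0x7fff);
    }
}

/* Returns 1 if v[0..n-1] is in non-decreasing order. */
int is_sorted(const int *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (v[i - 1] > v[i])
            return 0;
    return 1;
}

/* Comparisons used by bubble sort on N pseudo-random items. */
unsigned long count_bubble_compares(void)
{
    int v[N];
    fill_pseudo_random(v, N, 1UL);
    compare_count = 0;
    bubble_sort(v, N);
    return is_sorted(v, N) ? compare_count : 0;
}

/* Comparisons used by the library qsort() on the same input. */
unsigned long count_qsort_compares(void)
{
    int v[N];
    fill_pseudo_random(v, N, 1UL);
    compare_count = 0;
    qsort(v, N, sizeof v[0], compare_ints);
    return is_sorted(v, N) ? compare_count : 0;
}
```

Calling count_bubble_compares() and count_qsort_compares() from a small main() and printing both values shows the gap. The bubble sort figure is fixed at N*(N-1)/2 whatever language it is coded in, while the library sort's count depends on the implementation but should be far smaller on typical input.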
Of course, for core parts of an algorithm, particularly on embedded platforms, assembler has its place. Over the past decade or so, we have hand-coded JPEG compression and decompression, many DSP algorithms, matrix-based image processing routines and dozens of special-purpose routines to extract maximum performance out of many different ARM chip variants. For example, it is possible to make use of fast on-chip memory to dramatically speed up algorithms which require temporary storage, such as JPEG. This sort of optimisation is largely specific to embedded systems.
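As a rough illustration of the on-chip storage idea, the sketch below keeps the scratch buffer for a two-pass 8x8 block operation in a dedicated linker section. It rests on several assumptions: the section name ".fast_ram" is hypothetical and would have to be mapped to on-chip memory in the project's own linker script, the macro FAST_RAM and the function summed_area_8x8 are invented for this example, and the running-sum passes are only a simple stand-in for the row and column passes of a real JPEG transform.

```c
#include <stdint.h>

/* On GCC-style ARM toolchains, place the buffer in a named section.
 * ".fast_ram" is a hypothetical name; elsewhere the macro is empty
 * and the buffer lives in ordinary memory. */
#if defined(__GNUC__) && defined(__arm__)
#define FAST_RAM __attribute__((section(".fast_ram")))
#else
#define FAST_RAM
#endif

enum { BLOCK = 8 };                    /* 8x8 blocks, as used by JPEG */

/* Intermediate results between the two passes live here. */
static int32_t workspace[BLOCK * BLOCK] FAST_RAM;

/* Two-pass running sum over one 8x8 block (row-major in and out).
 * out[r][c] ends up as the sum of in[0..r][0..c]. */
void summed_area_8x8(const int16_t *in, int32_t *out)
{
    /* Pass 1: running sums along each row, written to the workspace. */
    for (int r = 0; r < BLOCK; r++) {
        int32_t acc = 0;
        for (int c = 0; c < BLOCK; c++) {
            acc += in[r * BLOCK + c];
            workspace[r * BLOCK + c] = acc;
        }
    }

    /* Pass 2: running sums down each column, read from the workspace. */
    for (int c = 0; c < BLOCK; c++) {
        int32_t acc = 0;
        for (int r = 0; r < BLOCK; r++) {
            acc += workspace[r * BLOCK + c];
            out[r * BLOCK + c] = acc;
        }
    }
}
```

Every intermediate value is written once and read once from the workspace, so the speed of that one small buffer affects the whole inner loop. Because the buffer is a single static array, this sketch is not re-entrant.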