Non-cryptographic hash functions is an interesting topic. There exists a number of more or less popular ones, including those by Bob Jenkins, Fowler, Noll and Vo, and Paul Hsieh. These are all good enough to be used in practice (although their behavior with some classes of difficult keys differs a bit) and perform well on most 32- or 64-bit CPUs.

What about 8-bit microcontrollers and processors? Many support only 8-bit shifts or rotations, one bit per instruction. This is clearly bad for hashes based on 32- or 64-bit shifts and rotations, especially when the hash state is too big to fit in registers. Even using the sparse primes of the FNV hash, a 32- or 64-bit multiplication requires quite a few instructions on a typical 8-bit system.

How can we improve performance on 8-bit systems? One obvious way is to process data eight bits at a time, as far as possible. Another is to use lookup tables, because memory access is usually very cheap. I have used the AES S-box table in my hashes, for two reasons: it is designed to mix the bits within a byte well, and it has a rather simple algebraic description allowing it to be computed in much less than the 256 bytes its lookup table requires. Yet another characteristic of simple 8-bit machines is that they execute instructions strictly in series, so one can not draw much advantage of the parallelism present in algorithms like Jenkins'.

This version is very fast on processors such as the 6502, with only one general-purpose register.

for i = 1 to length hash[i mod 4] <- hash[i mod 4] xor S-box[data[i] xor hash[i-1 mod 4]] end for i = 1 to 4 hash[i] <- hash[i] xor S-box[hash[i-1 mod 4]] end

In practice you will keep the last hash[i mod 4] value in the accumulator, as
is done in the following 6502 code implementing one iteration of the unrolled
main loop (*i* should be replaced by the numbers 0 through 3). This
executes in just 16 cycles per byte minus loop overhead. For reference,
data block copying on the 6502 requires 8 cycles per byte without loop
overhead. The final mixing step (not shown below) requires an additional 52
cycles in total.

eor data+i,y tax lda sbox,x eor hash+i sta hash+i

I have chosen the size of the hash to be 4 bytes, but other sizes may be used as well. The only performance difference should be that for very short keys, the overhead of the final mixing step will be larger.

The fact that only the most recently computed byte is mixed with the input data improves performance, and in the average case the S-box function will propagate changes to future bytes in a chaotic fashion. However, one of the 256 possible input bytes will cancel out a given change in the previous byte, which results in insufficient mixing of the data. Normally this is not a huge problem, but for keys which are similar there is a tendency to get similar hash values.

I originally started researching 8-bit hashing algorithms to provide a good checksum for Commodore tapes. Since a one-byte XOR checksum is normally used, almost any other function is a vast improvement in all respects other than speed. Although the basic version performs pretty well on average, its inability to handle sparse bit arrays of constant length makes it a relatively poor candidate for a checksum algorithm.

One simple improvement that behaves much better with difficult inputs, at the
cost of some speed, is to keep a variable *t* that is updated
independently of the state, but is mixed into the state in every step. This
version passes 2^53 key pairs in Jenkins' frog.c program testing hash
functions for collisions with inputs that are sparse bit arrays of equal size,
as opposed to just about 2^27 for the basic version (with an 8-byte output).
Pseudocode follows, but
the main loop of a 6502 implementation would require about 35 cycles per byte.

for i = 1 to length t <- S-box[t xor data[i]] hash[i mod 8] <- hash[i mod 8] xor S-box[t xor data[i] xor hash[i-1 mod 8]] end for i = 1 to 8 t <- S-box[t] hash[i] <- hash[i] xor S-box[t xor hash[i-1 mod 8]] end

This can be improved by adding yet another variable, *s*, which
is similar in function to *t*. At the cost of some more performance, this
reduces the frog.c collision rate even further (to the point where my patience
ran out before a collision could be found, at over 2^60 key pairs). The main
loop of this version requires about 50 cycles per byte on a 6502, around three
times more than the basic version.

for i = 1 to length s <- S-box[s xor data[i]] t <- S-box[t xor s xor data[i]] hash[i mod 8] <- hash[i mod 8] xor S-box[t xor data[i] xor hash[i-1 mod 8]] end for i = 1 to 8 s <- S-box[s xor t] t <- S-box[s xor t] hash[i] <- hash[i] xor S-box[t xor hash[i-1 mod 8]] end

I also wrote a simple program simulating insertions into a hash table (using open hashing), counting the number of total operations when inserting 98569 different English words and names into a table with 131072 (2^17) slots, corresponding to a load of about 0.7520. From the table below it should be obvious that the basic version is not particularly good for text hashing, but that both of the improved versions do a good job, indistinguishable from a random mapping. One should keep in mind though, that the 20% speedup in hash table insertion measured comes at the price of using a hashing algorithm that is more than twice as slow!

Function | Average number of operations |
---|---|

Basic | 183002 |

Version 2 | 149525 |

Version 3 | 149278 |

Random | 149641 |

A very fast PRNG with decent statistical properties can be built using a variant of this algorithm, by separating the state into two parts A and B, and executing A <- mix(A, B), B <- mix(B, A) to generate the next state. Then use A and B as outputs. This algorithm does pretty well in the DIEHARD suite of statistical tests to detect bias in PRNGs. Note that "good" in this case means that it's probably good enough for use in most simulations, games or probabilistic algorithms on a C64. Given past output it is trivial to predict the next state, so it is clearly not suitable for security applications. The following 6502 implementation of the A <- mix(A, B) function requires about 22 cycles per byte of output.

clc lda state_A+3 ; Repeat the following 3 more times, increasing the indices by 1 (mod 4) eor state_B+0 tax lda sbox,x eor state_A+2 eor state_A+3 adc state_A+0 sta state_A+0