Originally Posted by hillel:
“@mjr Please give me a reference/research to support this, by pm if warranted. I am also interested in Use Cases, Unit Test Cases for Firmware, ... (I have a professional interest in this area and want to follow it through.)”
I'm generalising here about problems with computers or embedded systems where one particular unit experiences random behaviour, especially if the problem disappears with a new software build that wasn't trying to address such an issue. A common explanation would be faulty RAM, where under certain conditions one or more bits are misread. The data or code that occupies that part of RAM may relate to some important underlying function in the OS, e.g. the thread scheduler, in one firmware build. Since this is a critical function of the OS, a crash could occur if either the instructions being executed or the data structures being manipulated are corrupted.
But in a later build, memory could be allocated in a different order and the size of the code will likely be different, so the RAM will be used differently. So perhaps that faulty memory is now unused, e.g. the alignment/packing of the data structures means the faulty bit(s) never store any data. Or let's say the fault is that a single bit always reads 1 regardless of what was stored in it: maybe the code or data now stored there should have that bit set to 1 anyway, so no anomalous behaviour occurs.
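To make the stuck-bit case concrete, here is a toy sketch in Python. Everything in it (the `FaultyRam` class, `load_and_verify`, the addresses and byte values) is made up for illustration and is not taken from any real firmware; it only assumes a single bit that always reads back as 1.

```python
class FaultyRam:
    """Toy RAM model with one bit stuck at 1 (hypothetical, for illustration only)."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.bad_addr:
            value |= self.bad_mask  # the stuck bit always reads back as 1
        return value


def load_and_verify(ram, base, image):
    """Write image starting at base; return the addresses that read back wrong."""
    for i, byte in enumerate(image):
        ram.write(base + i, byte)
    return [base + i for i, byte in enumerate(image) if ram.read(base + i) != byte]


ram = FaultyRam(size=16, bad_addr=5, bad_bit=3)

# "Build 1": the byte landing on address 5 has bit 3 clear, so it is corrupted.
build1 = load_and_verify(ram, base=4, image=bytes([0x10, 0x20, 0x30, 0x40]))

# "Build 2": same data, different layout, so address 5 is never used.
build2 = load_and_verify(ram, base=8, image=bytes([0x10, 0x20, 0x30, 0x40]))

# "Build 3": address 5 is used, but the byte there already has bit 3 set.
build3 = load_and_verify(ram, base=4, image=bytes([0x10, 0x28, 0x30, 0x40]))
```

Only the first layout sees the fault, even though the hardware is equally broken in all three cases.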
There would be other hardware-related explanations too, e.g. a faulty CPU: it only has to do the wrong thing once in many billions of instruction cycles for incorrect behaviour to occur, which could manifest as a crash.
The best analogy I could give would be to look at what happens when people "overclock" their PCs, i.e. run the CPU and/or RAM faster than their rated speed. When they push the tolerances too far (and this varies from one piece of hardware to the next, even if they're supposedly identical), they experience random crashes (e.g. blue-screen crashes on Windows) or other anomalous behaviour, ultimately because of a hardware error.
None of this completely rules in a hardware fault. It could be, for example, that you had an uncommon usage pattern that just happened to trigger a bug that was silently corrected in a later build, e.g. something as innocuous as the timing of remote control keypresses triggering a subtle thread race condition when two operations overlap with very specific timings. But it would make me wary of ruling one out.
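For the software-side alternative, the kind of timing-dependent race described above can be sketched deterministically. This is a made-up Python illustration, not real firmware code: the `increment` and `run` names are hypothetical, and a generator stands in for a thread so that the interleaving can be chosen explicitly rather than left to a scheduler.

```python
def increment(state):
    """Non-atomic read-modify-write, split into two steps by a generator."""
    tmp = state["count"]       # step 1: read the shared value
    yield                      # another operation may run here
    state["count"] = tmp + 1   # step 2: write back the stale value plus one


def run(schedule):
    """Run operations A and B, one step per letter of schedule (e.g. "ABAB")."""
    state = {"count": 0}
    ops = {"A": increment(state), "B": increment(state)}
    for name in schedule:
        next(ops[name], None)  # advance that operation by one step
    return state["count"]


no_overlap = run("AABB")  # A finishes before B starts
overlap = run("ABAB")     # both read before either writes back
```

With no overlap both increments are counted; with the overlapping schedule one of them is lost. On a real device the schedule is decided by timing, which is why such a bug might only show up for one user's particular pattern of keypresses.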