Re: [LUG] Folding@home 'errors'

 

On Tue, 16 Mar 2010, tom wrote:

I run folding@home on 4 machines here and one quite often finishes early:
" Simulation instability has been encountered. The run has entered a
[23:12:46]   state from which no further progress can be made.
[23:12:46] This may be the correct result of the simulation, however if you
[23:12:46]   often see other project units terminating early like this
[23:12:46] too, you may wish to check the stability of your computer (issues
[23:12:46]   such as high temperature, overclocking, etc.)."

There is no overclocking, the CPU is ~29°C and memtest can run for days without finding a problem...
Any clues/tips?

Once upon a time I worked in the R&D department of an old British supercomputer company... I mainly wrote test & diagnostics and low-level driver code, worked with the hardware & chip designers, did some design & integration, system building, etc.

And even then, I could get a system to run all my diagnostics for days on end, in & out of the burn-in ovens, and then it would fail miserably when subjected to application code )-:

And even more years ago I looked after a PDP11 running Unix v6. Every quarter we'd get the DEC engineer in as part of the maintenance contract - he'd hoover the core memory, etc. and run all his diagnostics, but I remember him saying that running Unix on them was a much better test than any of his diagnostics ever were!

So you need to think bigger than just memtest - have you tried cpuburn? However, that's just a set of CPU tests. There is a user-land memory tester too - it's 'memtester' under Debian. Potentially not as thorough as memtest86+, but you can run it in conjunction with other things.
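To give a feel for what a user-land memory check does, here is a minimal Python sketch. It is not how memtester works internally (that is an assumption-free statement: I'm only illustrating the general fill-and-verify idea), and the function name and the choice of patterns are made up for the example. A buffer this small mostly lives in CPU cache, so treat it as an illustration rather than a real test.

```python
def pattern_check(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern and count bytes that read back wrong.

    Hypothetical helper for illustration; the patterns are an arbitrary choice.
    """
    buf = bytearray(size_bytes)
    errors = 0
    for p in patterns:
        buf[:] = bytes([p]) * size_bytes      # write pass: every byte set to p
        errors += size_bytes - buf.count(p)   # read pass: anything not equal to p is an error
    return errors


if __name__ == "__main__":
    # 16 MiB is only enough to show the idea; a real tester uses far more memory.
    print("bad bytes:", pattern_check(16 * 1024 * 1024))
```

The point of doing this in user-land is the one made above: it can run alongside CPU, disk and network load, which a boot-time tester can't.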

But who knows where the issue is - 9 times out of 10 it wasn't bad memory on those old boards - it was more usually bad PCBs/memory controllers (all custom designed).

As part of a soaktest/burn-in for new servers, I try to get them to run as many different things as possible - so get the PCI(e) bus(es) exercised with disk IO (run bonnie or some custom scripts - dd'ing /dev/urandom to a file, for example) and some network activity (FTP big files and small files to/from another PC).
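A custom disk script along those lines might look something like the sketch below: write random data to a file, read it back and compare checksums. The function name, file name and sizes are just placeholders for the example, not anything from a real tool. Bear in mind the read-back may well be served from the page cache rather than the disk, so use a file bigger than RAM (or drop caches first) if you want the data to really make the round trip.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, chunk_bytes=1 << 20):
    """Write random data to path, read it back, and compare SHA-256 digests."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            written.update(chunk)
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    # Small size so the sketch finishes quickly; a soak test would loop for hours.
    with tempfile.TemporaryDirectory() as tmp:
        ok = write_and_verify(os.path.join(tmp, "soak.bin"), 8 * 1024 * 1024)
        print("disk readback", "OK" if ok else "MISMATCH")
```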

I try to get my systems doing as much as possible - so as well as individual sub-system tests, run a disk test and a network test and a cpu test all at the same time...
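Running everything at once can be as simple as starting all the tests together and waiting for the lot. A rough sketch, assuming each test is a command that exits non-zero on failure; `run_together` and the stand-in jobs are invented for the example, so swap in whatever CPU, disk and network tests you actually use:

```python
import subprocess
import sys


def run_together(commands):
    """Start every command at once, wait for all of them, return exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Stand-in workloads so the sketch runs anywhere; replace them with real
    # tests (a CPU burner, a disk benchmark, a file transfer) on the target box.
    cpu_job = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]
    disk_job = [
        sys.executable, "-c",
        "import os, tempfile; f = tempfile.TemporaryFile(); "
        "f.write(os.urandom(4 << 20)); f.close()",
    ]
    codes = run_together([cpu_job, disk_job])
    print("exit codes:", codes)
```

Any non-zero exit code in the list tells you which job fell over while the others were loading the machine.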

And good luck...

Gordon

--
The Mailing List for the Devon & Cornwall LUG
http://mailman.dclug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/linux_adm/list-faq.html