I’ve spent weeks searching for an answer and trying different fixes, but at best I’ve reduced the frequency of it happening and even that I’m dubious of, since it seems so random.
- journalctl has absolutely nothing at all from when it happens, except one time where it managed to log that the kernel lost contact with the GPU in the seconds before the system went down - after undervolting and underclocking the GPU that message hasn’t happened since.
- there’s no crash log from it either.
- memtest declared there were no problems with the RAM.
- I’ve been watching sysinfo and corectrl like a hawk and CPU, RAM, and VRAM usage is all well within normal levels when it happens, temperatures are low across the board.
- the same system has been 100% stable and completely fine running under heavier load for hours at a time in windows.
- I’ve followed AMD’s instructions for making sure the GPU drivers are what they should be for this, and the kernel is a version that’s supposed to be correct and stable for those drivers as well.
- specific compatibility settings that other people found to fix literally this exact problem may have, at most, reduced the frequency of the crashes but again, they’re so erratic it’s almost impossible to determine cause and effect here.
- I’ve tried disabling the integrated graphics both in the BIOS and through settings, because that can apparently cause instability, but that hasn’t helped.
I don’t know what else to look at or try at this point.
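For anyone chasing the same kind of crash, here is a rough sketch of one way to pull GPU-related kernel messages out of the previous boot's journal after a hard lockup. It assumes the journal is stored persistently (otherwise `journalctl -b -1` has nothing to show), and the keyword list is only a guess at what the driver might print, not an official list of amdgpu messages. The function names are made up for this example.

```python
import subprocess

# Assumed keywords, chosen for illustration only; adjust them to whatever
# your kernel actually logs when the GPU misbehaves.
GPU_KEYWORDS = ("amdgpu", "kfd", "gpu reset")


def filter_gpu_lines(lines, keywords=GPU_KEYWORDS):
    """Return the log lines that mention any keyword (case-insensitive)."""
    return [
        line for line in lines
        if any(keyword in line.lower() for keyword in keywords)
    ]


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot as a list of lines.

    Runs `journalctl -k -b -1`, which only works if the journal from the
    previous boot was kept on disk.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def last_gpu_messages(count=20):
    """Return the last `count` GPU-related kernel lines from the previous boot."""
    return filter_gpu_lines(previous_boot_kernel_log())[-count:]
```

Run after rebooting from a crash, something like `print("\n".join(last_gpu_messages()))` would show whatever the kernel managed to log about the GPU before the system went down. If the list is empty, either nothing was logged or the journal was not flushed to disk in time.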
Sorry, I should have been clearer: in linux I’m only using the GPU for ROCm, I’m not trying to get games running or anything. I just want to get its ROCm performance stable and then never touch anything for fear of breaking it.
ROCm sits on top of the kernel driver in the graphics driver stack. Switching out the kernel driver (i.e. AMDGPU for mesa) is a good place to start. Feel free to use the repository version of mesa if you’re not using it for gaming. Trust me, I’ve tried to get ROCm working on my own machine before.
When I looked at it, it was only talking about vulkan, opengl, and something mimicking directx 9 for compatibility, but I’ll keep it in mind, and if switching from ROCm 5.6 to 6.0 didn’t solve the problem I’ll try it. I didn’t find anything about using ROCm with mesa when I searched, but between google being useless and ROCm seemingly being the least talked about and documented thing ever, that’s probably not surprising.
It’s one of the most annoying things I’ve ever dealt with, and that’s purely because of how badly it’s documented.