I’ve spent weeks searching for an answer and trying different fixes, but at best I’ve reduced the frequency of it happening and even that I’m dubious of, since it seems so random.
- journalctl has absolutely nothing from when it happens, except one time where it managed to log that the kernel lost contact with the GPU in the seconds before the system went down. After undervolting and underclocking the GPU, that message hasn’t appeared since.
- there’s no crash log from it either.
- memtest declared there were no problems with the RAM.
- I’ve been watching sysinfo and corectrl like a hawk, and CPU, RAM, and VRAM usage are all well within normal levels when it happens; temperatures are low across the board.
- the same system has been 100% stable and completely fine running under heavier load for hours at a time in Windows.
- I’ve followed AMD’s instructions for making sure the GPU drivers are what they should be for this, and the kernel is a version that’s supposed to be correct and stable for those drivers as well.
- specific compatibility settings that other people found to fix literally this exact problem may have, at most, reduced the frequency of the crashes, but again, they’re so erratic it’s almost impossible to determine cause and effect here.
- I’ve tried disabling the integrated graphics both in the BIOS and through settings, because that can apparently cause instability, but that hasn’t helped.
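For anyone chasing the same thing, here is a rough sketch of how the journalctl check above could be scripted, so the kernel messages from the boot that crashed can be pulled up quickly after a reboot. It assumes the journal is persistent (otherwise the previous boot is simply not there), and the keyword pattern is only a guess at what might be relevant, not a list of known amdgpu messages. The function names are made up for illustration.

```python
import re
import subprocess

# Keywords are a guess at what might matter; adjust them to taste.
GPU_PATTERN = re.compile(r"amdgpu|drm|gpu reset|ring .* timeout", re.IGNORECASE)


def filter_gpu_lines(lines):
    """Keep only the log lines that mention something GPU-related."""
    return [line for line in lines if GPU_PATTERN.search(line)]


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot, or [] if unavailable."""
    try:
        result = subprocess.run(
            # -k: kernel messages only, -b -1: the boot before the current one
            ["journalctl", "-k", "-b", "-1", "--no-pager"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # journalctl is not installed on this machine
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Show the last few matching lines before the system went down.
    for line in filter_gpu_lines(previous_boot_kernel_log())[-40:]:
        print(line)
```

If this prints nothing at all, that matches what I’ve been seeing: the system goes down before anything gets written.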
I don’t know what else to look at or try at this point.
Linux Mint 21.3 with the 6.2.0-26 kernel, an RX 6800, and the drivers are whatever AMD’s ROCm 5.6 package installed for itself. System Reports says that’s amdgpu 6.3.6, but I’m having trouble lining that up to a revision number.
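A small sketch for collecting the kernel and amdgpu module details in one place, in case it helps when comparing against a compatibility table. It assumes the module exposes a version under `/sys/module/amdgpu/version`, which as far as I know only exists when the module declares one, so the script treats a missing file as "unknown" rather than an error. The helper names are hypothetical.

```python
import platform
from pathlib import Path


def read_text_or_none(path):
    """Return the stripped contents of a file, or None if it can't be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def driver_summary():
    """Collect the running kernel release and what sysfs says about amdgpu."""
    return {
        "kernel": platform.release(),
        "amdgpu_loaded": Path("/sys/module/amdgpu").is_dir(),
        # Assumption: only present when the module declares a version.
        "amdgpu_module_version": read_text_or_none("/sys/module/amdgpu/version"),
    }


if __name__ == "__main__":
    for key, value in driver_summary().items():
        print(f"{key}: {value}")
```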
Edit: looks like PyTorch supports up to ROCm 6.0 now, so I’ll try uninstalling 5.6 and installing 6.0, along with reinstalling PyTorch in the conda environment I’ve got it in.
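After a reinstall like that, a quick sanity check along these lines should show whether the PyTorch in the environment is actually a ROCm build and can see the card. This is a minimal sketch: `rocm_status` is a name I made up, and it relies on ROCm builds of PyTorch exposing the GPU through the usual `torch.cuda` calls, with `torch.version.hip` set on ROCm builds and `None` otherwise.

```python
def rocm_status():
    """Report whether PyTorch is installed, ROCm-built, and can see a GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    gpu_visible = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means this is not a ROCm (HIP) build of PyTorch.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_visible": gpu_visible,
        "device_name": torch.cuda.get_device_name(0) if gpu_visible else None,
    }


if __name__ == "__main__":
    for key, value in rocm_status().items():
        print(f"{key}: {value}")
```

Run it inside the conda environment, not the system Python, or it will report on the wrong install.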
Try this: https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers
It is a repackage of the bleeding-edge Mesa drivers. Mesa is generally more stable than AMD’s drivers, and sometimes it performs better as well.
Unfortunately my main use case for Linux is ROCm, and that requires the drivers it installs for itself. After updating to ROCm 6.0 and some light testing, it hasn’t actively failed yet, but I won’t conclusively know if it’s fixed until it’s gone long enough with regular use without crashing, at which point I’m sure it’ll black screen the second I dare to feel relief and believe it to be fixed.
You can install mesa and ROCm at the same time. There should be a guide to it on the AMD website.
Sorry, I should have been clearer: in Linux I’m only using the GPU for ROCm, I’m not trying to get games running or anything. I just want to get its ROCm performance stable and then never touch anything for fear of breaking it.
ROCm sits on top of the kernel driver in the graphics driver stack. Switching out the kernel driver (i.e. AMDGPU for Mesa) is a good place to start. Feel free to use the repository version of Mesa if you’re not using it for gaming. Trust me, I’ve tried to get ROCm working on my own machine before.
When I looked at it, it was only talking about Vulkan, OpenGL, and something mimicking DirectX 9 for compatibility, but I’ll keep it in mind, and if switching from ROCm 5.6 to 6.0 didn’t solve the problem I’ll try it. I didn’t find anything about using ROCm with Mesa when I searched, but between Google being useless and ROCm seemingly being the least talked about and documented thing ever, that’s probably not surprising.
It’s one of the most annoying things I’ve ever dealt with, and that’s purely because of how badly it’s documented.