AI GoF disclaimer: I don’t expect this blog post to contribute to frontier AI gain-of-function research or I would refrain from publishing it. Please consider supporting Doom Debates to improve the quality of discourse around the risks of frontier AI research, and MIRI to try to mitigate the risks.
I have been wanting to experiment with open weights language models on the Talos II.
I have a gfx803 card that I always wanted to use for compute, but it is no longer supported by ROCm. I have made progress getting a gfx1201 card working on this machine, and I wanted to write up all the interesting error messages for reference.
I took a risk and bought a new GPU, the AMD Radeon AI Pro R9700 (ASRock Creator 32GB), without knowing if I could get it working with the Talos II mainboard, which is now seven years old.
First I realized my existing power supply did not have enough free connectors; I needed a new “modular” power supply for the GPU’s new-style 12V-2x6 power connector (which is actually a 16-pin connector: a 2 x 6 array of large power pins plus four small sense pins along the top). That prerequisite project was nerve-racking but successful. Physically, the card fit fine in the mainboard and EATX chassis.
With the latest Debian Trixie kernel driver, the card showed up as a PCIe device in lspci (validating the physical installation) but without displaying the card’s name. I figured the driver was not new enough to recognize the card’s product identifier. I read online that a Debian-derivative’s 6.17 kernel recognized the card on a different CPU architecture, so I temporarily enabled the Debian testing repository, installed linux-image, and rebooted. Now lspci displayed the card’s name, so that was progress. But as a side effect of the kernel upgrade, my virtual machines failed to start up. The libvirtd message was:
qemu-system-ppc64: Can't support 64 kiB guest pages with 4 kiB host pages with this KVM implementation
It turned out Debian ppc64le had changed the default page size from 64KiB to 4KiB. Debian, though, with its characteristic flexibility, still provided a 64KiB page-size linux-image variant. With that, the virtual machines worked again and the GPU continued to be recognized.
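For reference, this is roughly how to check the running kernel’s page size and switch to the 64KiB flavour. The exact metapackage name below is my guess based on Debian’s usual naming scheme, so verify it with apt search first:
$ getconf PAGESIZE                               # 65536 for a 64KiB-page kernel, 4096 for 4KiB
$ sudo apt install linux-image-powerpc64le-64k   # assumed name of the 64KiB page-size variant
$ sudo reboot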
Next I shifted to userspace; the Debian-packaged rocminfo segfaulted early during its initialization, so I looked upstream and found TheRock.
I had lots of initial trouble with TheRock’s CMake monorepo/subprojects; I am not yet sure what’s up with that, but I suspect it may be ppc64le-specific. That said, I was able to make progress by building individual subprojects one-by-one (this is probably a better approach anyway, at this stage of porting).
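As a rough illustration of the one-subproject-at-a-time approach (the path comes from the error messages later in this post; the CMake options here are just a sketch, not TheRock’s documented workflow):
$ cd /TheRock/rocm-systems/projects/rocminfo
$ cmake -B build -DCMAKE_PREFIX_PATH=/opt/rocm   # point find_package at the ROCm pieces already installed
$ cmake --build build -j"$(nproc)"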
Eventually I got amd-llvm bootstrapped, building it with a minimal configuration using Trixie’s gcc 14.2.0. Then I built amd-llvm with itself, in the TheRock-recommended configuration, except for the PowerPC and AMDGPU targets. Next I built rocminfo. It segfaulted in the same place as Debian’s package! Some debugging resulted in a patch to accommodate ppc64’s vDSO naming; that eliminated the segfault.
Then rocminfo ran and showed both CPUs as “Agents” 0 and 1, but there was no sign of the GPU.
I further debugged rocminfo and found it was traversing sysfs, and specifically the AMD Kernel Fusion Driver (kfd) topology. The card did not have an entry there.
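The kfd topology is easy to inspect by hand; each agent that rocminfo reports corresponds to a numbered node directory here, and the GPU simply had none (property names below are from memory, so double-check on your system):
$ ls /sys/class/kfd/kfd/topology/nodes/                                  # one directory per CPU/GPU node
$ grep simd_count /sys/class/kfd/kfd/topology/nodes/*/properties         # 0 for CPU-only nodes, >0 for GPUs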
I looked at dmesg and noticed:
[...] amdgpu 0033:03:00.0: amdgpu: Error parsing VCRAT
[...] kfd kfd: amdgpu: Error adding device to topology
[...] kfd kfd: amdgpu: Error initializing KFD node
[...] kfd kfd: amdgpu: device 1002:7551 NOT added due to errors
First I tried building an updated .deb of linux-firmware from its Git repository, to rule out the parsing error being caused by an outdated binary-only firmware blob. (This is my one disappointment with the ROCm stack; it would be great if the firmware and firmware toolchains were free software.) Rebooting with the new firmware produced the same result.
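As an aside, a quick way to see which firmware blobs the driver can request, and therefore which files an updated linux-firmware needs to ship, is modinfo:
$ modinfo amdgpu | grep '^firmware:'   # lists every firmware file the driver may ask for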
I looked at the kernel source for that driver and noticed extra debug printks. Debian helpfully enables the CONFIG_DYNAMIC_DEBUG kernel option. I tried dynamically reloading the amdgpu driver and various PCIe and GPU reset approaches, but I could not get the card back to its freshly-booted state; I would have to reboot to test each change.
I added amdgpu.dyndbg="+p" to the kernel command line, and that gave me some extra kfd messages; with those I narrowed down the failure to the IO link entry of the Virtual Component Resource Association Table (VCRAT).
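For completeness: with CONFIG_DYNAMIC_DEBUG these printks can also be toggled at runtime through debugfs, but since the failure happens at probe time during boot, the command-line parameter is what actually helps here. A sketch:
# at runtime (too late for probe-time failures, but handy otherwise)
$ echo 'module amdgpu +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
# at boot: append amdgpu.dyndbg="+p" to the kernel command line and reboot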
I re-reviewed dmesg and, earlier than the parsing error, there was another clue:
[...] amdgpu: IO link not available for non x86 platforms
That message was printed during the creation of the CPU VCRAT (in kfd_create_vcrat_image_cpu), in the #else branch of a platform-specific #ifdef. kfd_create_vcrat_image_gpu, however, did not have a corresponding #ifdef; “this could explain the subsequent parsing failure on the VCRAT IO link entry, on ppc64le, a non-x86 platform,” I thought.
It was time to recompile the Linux kernel. Debian makes this surprisingly easy; I followed the official instructions to build a custom kernel .deb with my attempted fix applied to the amdgpu.ko module. Another reboot and no more VCRAT parsing failure message in dmesg. That seemed like more progress. (Perhaps a more proper solution would be to add IO link support to ppc64le upstream; I don’t know if there is an equivalent POWER9 capability, hardware-wise. For my purposes, I have not yet needed an IO link.)
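The build went roughly like this; I am reconstructing it from memory and the Debian Kernel Handbook covers the authoritative steps, so treat the patch filename as a placeholder:
$ apt source linux                        # or a Git checkout of the Debian kernel source
$ cd linux-*/
$ patch -p1 < ~/kfd-vcrat-iolink.patch    # hypothetical name for my amdgpu/kfd change
$ cp /boot/config-"$(uname -r)" .config   # start from the running kernel's configuration
$ make olddefconfig
$ make -j"$(nproc)" bindeb-pkg            # produces installable linux-image .debs
$ sudo apt install ../linux-image-*.deb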
rocminfo still failed, albeit in a new way:
hsa api call failure at: /TheRock/rocm-systems/projects/rocminfo/rocminfo.cc:1329
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
The dmesg messages from the same moment were:
[...] amdgpu 0033:03:00.0: amdgpu: bo 00000000bdd46d97 va 0x0ffffffbfe-0x0ffffffc1d conflict with 0x0ffffffc00-0x0ffffffe00
[...] amdgpu: Failed to map VA 0xffffffbfe000 in vm. ret -22
[...] amdgpu: Failed to map bo to gpuvm
I analyzed the section of kernel driver code that generated those messages and noticed the use of AMDGPU_GPU_PAGE_SIZE in range calculations. It is hard-coded to 4096.
I had a hunch that the driver needed the kernel’s page size to match. I did a quick side quest to change all my virtual machines to use 4KiB pages, reconfigured my custom Debian kernel for 4KiB pages, and rebooted again.
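On ppc64le the page size is a compile-time kernel choice, so “reconfigured for 4KiB pages” means flipping the relevant config options before rebuilding (option names are from the powerpc Kconfig as I remember it; double-check against your kernel version):
$ grep 'PPC_.*K_PAGES' .config
# CONFIG_PPC_4K_PAGES is not set
CONFIG_PPC_64K_PAGES=y
# select CONFIG_PPC_4K_PAGES=y instead (e.g. via make menuconfig), rebuild,
# reboot, and confirm with getconf PAGESIZE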
Now the virtual machines loaded, and finally rocminfo showed the card’s information!
[...]
*******
Agent 3
*******
Name: gfx1201
Uuid: GPU-6413e1798933ffe0
Marketing Name: AMD Radeon Graphics
[...]
I think Debian’s decision to use 4KiB pages is sensible, as is amdgpu’s assumption of 4KiB pages, so I’m happy to have done this reconfiguration. I was only using 64KiB pages because that was the default when I first installed the operating system on the Talos II.
The rest of the process was a grind through TheRock subprojects with a bunch of build failure workarounds. The hardest one was fixing static_assert failures about __bf16, reported by clang, when building hipblaslt:
In file included from /TheRock/rocm-libraries/projects/hipblaslt/tensilelite/include/Tensile/DataTypes.hpp:42:
In file included from /opt/rocm/include/hip/hip_fp8.h:30:
In file included from /opt/rocm/include/hip/amd_detail/amd_hip_fp8.h:67:
/opt/rocm/include/hip/amd_detail/amd_hip_bf16.h:155:15: error: static assertion failed due to
requirement 'sizeof(__bf16) == sizeof(unsigned short)'
155 | static_assert(sizeof(__bf16) == sizeof(unsigned short));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/hip/amd_detail/amd_hip_bf16.h:155:30: note: expression evaluates to
'0 == 2'
155 | static_assert(sizeof(__bf16) == sizeof(unsigned short));
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
Debugging led me to a workaround; the root cause must be a configuration issue with how I built amd-llvm. It needs further investigation, but rebuilding amd-llvm with that hack meant that I could successfully build hipblaslt.
The other TheRock dependencies were easier to build; I continued until I had all the dependencies of llama.cpp as configured for ROCm, per this wiki entry.
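For completeness, the llama.cpp HIP build itself is the standard one from its build documentation; only the target (gfx1201) is specific to this card. Something along these lines, assuming the current GGML_HIP option name:
$ cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 -DCMAKE_BUILD_TYPE=Release
$ cmake --build build -j"$(nproc)"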
At this point llama.cpp built successfully and ran, with good performance!
Here is an example run with a simple prompt:
$ uname -srom
Linux 6.17.13 ppc64le GNU/Linux
$ build/bin/llama-cli -fa 1 -m ~/models/chatgpt-oss-20b/gpt-oss-20b-F16.gguf
LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32
Loading model...
[llama.cpp ASCII-art banner]
build : b7533-c18428423
model : gpt-oss-20b-F16.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> write hello world in common lisp
write hello world in common lisp
<|channel|>analysis<|message|>We need to answer: "write hello world in common lisp". The user wants a "Hello, world" program in Common Lisp. Likely they want code snippet. We should provide simple Common Lisp code: `(print "Hello, World!")` or `(format t "Hello, World~%")`. Provide explanation. Also maybe show how to run with SBCL. Let's produce a response.<|end|><|start|>assistant<|channel|>final<|message|>Here’s a minimal “Hello, World!” program in **Common Lisp**.
You can copy it into a file (e.g. `hello.lisp`) and run it with any Common Lisp implementation such as SBCL, CLisp, or CCL.
```lisp
;; hello.lisp
;; Common Lisp "Hello, World!" example
;; Print the string to the default output stream.
(format t "Hello, World!~%")
```
### How to run it
```bash
# Using SBCL
sbcl --script hello.lisp
# Using CLisp
clisp hello.lisp
# Using CCL
ccl hello.lisp
```
### What each part does
- `format` – a versatile printing function.
- `t` – the target output stream (`*standard-output*`).
- `"Hello, World!~%"` – the string to print, where `~%` inserts a newline.
- `--script` (SBCL) or similar options let the interpreter treat the file as a script and exit automatically after execution.
That’s all you need to see “Hello, World!” on your terminal in Common Lisp!
[ Prompt: 8.7 t/s | Generation: 103.1 t/s ]
>
I am very happy with this result! I will see how stable it is in the coming days.
Now that I have a working proof-of-concept, I will try to upstream some patches and ideally make the top-level TheRock build “just work” on ppc64le Debian.
Thank yous:
- ROCm and amdgpu teams for making TheRock and the Linux kernel drivers free software, portable and well-documented.
- Debian maintainers for a highly adaptable operating system.
- Raptor Computing Systems team for making the Talos II future-proof.
- #talos-workstation and #debian-ai participants for support and feedback.
- Rene Cheng for power supply advice.
- Matthew Tegelberg for editing.