Debian on Dell Kace M300

sudos

Re: Debian on Dell Kace M300
May 09, 2025 10:59AM

Registered: 10 years ago
Posts: 55

I have an interesting one. I figured it was best put in this thread, as it's an issue I have only seen on the M300 and not on the Pogo v4 running the same kernel.

I had the three M300 boxes running 6.10.11, and a few months ago one of them was giving me unhandled fault exceptions in dmesg. I chalked it up to something in the hardware going flaky, possibly a bad disk given that it always reared its head during an apt update, so I reinstalled that one and carried on. Today I went to update packages after leaving the boxes sitting idle in the meantime, and the same sort of exception came up on a different box.

[4683963.113411] Unhandled fault: alignment exception (0x001) at 0xffffff41
[4683963.124131] [ffffff41] *pgd=6fffd871, *pte=00000000, *ppte=00000000
[4683963.133249] Internal error: : 1 [#1653] PREEMPT ARM
[4683963.139043] Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_ondemand cpufreq_powersave sg marvell_cesa orion_wdt kirkwood_thermal uio_pdrv_genirq uio
[4683963.154973] CPU: 0 PID: 20157 Comm: php8.2 Tainted: G      D            6.10.11-kirkwood-tld-1 #1 211932710076a3f6d6304997ca04b9111b47c9c4
[4683963.168338] Hardware name: Marvell Kirkwood (Flattened Device Tree)
[4683963.175507] PC is at 0xe2025646
[4683963.179535] LR is at vfs_getattr_nosec+0xa0/0xcc
[4683963.185063] pc : [<e2025646>]    lr : [<802eadd8>]    psr: a0000013
[4683963.192231] sp : f0dade50  ip : f0dadf70  fp : fffff000
[4683963.198352] r10: 00000000  r9 : 838cb738  r8 : 000007ff
[4683963.204473] r7 : f0dade8c  r6 : 80000000  r5 : e08f2002  r4 : f0dadec8
[4683963.211903] r3 : 000007ff  r2 : ffffff41  r1 : f0dade8c  r0 : 814a3bb4
[4683963.219334] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[4683963.227374] Control: 0005397f  Table: 13b98000  DAC: 00000051
[4683963.234018] Register r0 information: non-slab/vmalloc memory
[4683963.240583] Register r1 information: 2-page vmalloc region starting at 0xf0dac000 allocated at kernel_clone+0xa4/0x28c
[4683963.252216] Register r2 information: non-paged memory
[4683963.258163] Register r3 information: non-paged memory
[4683963.264110] Register r4 information: 2-page vmalloc region starting at 0xf0dac000 allocated at kernel_clone+0xa4/0x28c
[4683963.275734] Register r5 information: non-slab/vmalloc memory
[4683963.282291] Register r6 information: non-slab/vmalloc memory
[4683963.288849] Register r7 information: 2-page vmalloc region starting at 0xf0dac000 allocated at kernel_clone+0xa4/0x28c
[4683963.300473] Register r8 information: non-paged memory
[4683963.306420] Register r9 information: slab ext4_inode_cache start 838cb680 pointer offset 184 size 712
[4683963.316570] Register r10 information: NULL pointer
[4683963.322254] Register r11 information: non-paged memory
[4683963.328289] Register r12 information: 2-page vmalloc region starting at 0xf0dac000 allocated at kernel_clone+0xa4/0x28c
[4683963.340000] Process php8.2 (pid: 20157, stack limit = 0xeb7b25c0)
[4683963.346994] Stack: (0xf0dade50 to 0xf0dae000)
[4683963.352247] de40:                                     80000000 f0dadec8 000007ff f0dadec8
[4683963.361334] de60: 00000001 00000800 00000000 93b7f000 000007ff 802eb640 00000000 00000000
[4683963.370429] de80: 00000000 ffffff9c 00000000 835e3d90 8214c990 e0d487ae 00000000 7eb8e358
[4683963.379516] dea0: 00000800 ffffff9c 000007ff 8010021c 81a7be80 00000000 00000006 802eba64
[4683963.388610] dec0: 000007ff 802bebd0 000007ff 00000000 00000000 00000000 00000000 00000000
[4683963.397696] dee0: 00201000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[4683963.406781] df00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[4683963.415868] df20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[4683963.424955] df40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[4683963.434042] df60: 00000000 00000000 00000000 00000000 ffffff9c e0d487ae 8010021c 93b7f000
[4683963.443136] df80: 00000800 802ebaf4 7eb8e358 7eb8e5c4 00000800 7eb8e358 7eb8e488 7eb8e5c8
[4683963.452231] dfa0: 0000018d 80100060 7eb8e358 7eb8e488 ffffff9c 7eb8e5c4 00000800 000007ff
[4683963.461318] dfc0: 7eb8e358 7eb8e488 7eb8e5c8 0000018d 7eb8e5c4 028fa4a0 007fdfc4 00000006
[4683963.470413] dfe0: 00000000 7eb8e298 ffffff9c 76815508 20000010 ffffff9c 00000000 00000000
[4683963.479503] Call trace:
[4683963.479514]  vfs_getattr_nosec from vfs_statx+0x7c/0x150
[4683963.489148]  vfs_statx from do_statx+0x40/0x84
[4683963.494503]  do_statx from sys_statx+0x4c/0x64
[4683963.499847]  sys_statx from ret_fast_syscall+0x0/0x44
[4683963.505802] Exception stack(0xf0dadfa8 to 0xf0dadff0)
[4683963.511754] dfa0:                   7eb8e358 7eb8e488 ffffff9c 7eb8e5c4 00000800 000007ff
[4683963.520841] dfc0: 7eb8e358 7eb8e488 7eb8e5c8 0000018d 7eb8e5c4 028fa4a0 007fdfc4 00000006
[4683963.529933] dfe0: 00000000 7eb8e298 ffffff9c 76815508
[4683963.535884] Code: 230a2268 6c636e69 20656475 70757422 (626f656c)
[4683963.548947] ---[ end trace 0000000000000000 ]---

I have no idea what I'm even looking at here. I got the machine rebooted and updated the kernel to 6.13.8 right after, and will be monitoring for any weirdness, but this seems to be something that only affects the M300s. Two of the three have now shown this issue. The third, which is now remote, has not shown any of it, but I was scared enough of it happening there that it got rebooted and upgraded to the latest kernel as well. Based on what's included in the trace, could this have something to do with the CPU power-saving stuff? Was this something fixed in a later kernel, and am I reading too much into it? I don't know, but at least I can drop this here and document it for anyone else, because it was quite the scare.
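
One part of the trace that can be read without much kernel knowledge is the "Code:" line. On a little-endian ARM system such as Kirkwood, each 32-bit word there is printed as a value, so reversing the byte order gives the bytes as they sit in memory. A minimal sketch, assuming ARM mode and a little-endian kernel (the function name decode_code_line is made up for illustration):

```python
import re
import struct


def decode_code_line(line: str) -> bytes:
    """Return the memory bytes behind an ARM oops 'Code:' line.

    Assumes a little-endian kernel and 32-bit words (ARM mode), as in the
    trace above. The word in parentheses is the one at the faulting PC.
    """
    # Only look at what follows "Code:" so the dmesg timestamp is ignored.
    words = re.findall(r"[0-9a-fA-F]{8}", line.split("Code:", 1)[1])
    # "<I" packs each value back into little-endian byte order.
    return b"".join(struct.pack("<I", int(w, 16)) for w in words)


line = "[4683963.535884] Code: 230a2268 6c636e69 20656475 70757422 (626f656c)"
print(decode_code_line(line))
```

For the line above, the bytes come out as printable ASCII: h", a newline, then #include "tupleob. That looks like a fragment of a C header rather than ARM instructions. Together with the PC value 0xe2025646 not being 4-byte aligned, one possible reading (a guess, not a diagnosis) is that the kernel jumped through a corrupted pointer into cached file data.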

When this happens it does in fact taint the running kernel, and the program that triggered it usually segfaults and drops me back to a prompt.
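
The taint state mentioned here can also be read as a number from /proc/sys/kernel/tainted. The sketch below decodes a few of the bits into the one-letter flags that show up in traces (the "D" in "Tainted: G      D" above is one of them). The table is only a subset of the flags, and the helper names decode_taint and read_taint are hypothetical:

```python
# Taint bits and their one-letter flags, following the kernel's
# Documentation/admin-guide/tainted-kernels.rst (only a subset is listed).
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    5: ("B", "bad page referenced or unexpected page flags"),
    7: ("D", "kernel died recently (oops or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module loaded"),
}


def decode_taint(value: int) -> list:
    """Return 'LETTER: meaning' strings for every known bit set in value."""
    found = []
    for bit, (letter, meaning) in sorted(TAINT_FLAGS.items()):
        if value & (1 << bit):
            found.append(f"{letter}: {meaning}")
    return found


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint value; only works on a Linux system."""
    with open(path) as f:
        return int(f.read().strip())


# 128 is bit 7 on its own, the state a box is in after an oops.
print(decode_taint(128))
```

On one of the boxes, decode_taint(read_taint()) would show which of these flags are currently set. A value of 0 means the kernel is not tainted.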

It seems to happen only after the machine has been running for a long while, or after a bout of very heavy load. One box runs my Owntone music server with a copy of all of my music on it. The remote one does nothing special aside from acting as an SSH SOCKS5 proxy for managing my parents' router and the devices on their network (meaning it sits idle most of the time). The third is the test-new-stuff box, and it seems to have this problem the most despite not running anything spectacular. It might have an nginx server running, and that might account for it, but that has yet to be seen. I did have plans to use it as a webserver in place of the box running in my basement, but since then my priorities have changed dramatically.

bodhi

Re: Debian on Dell Kace M300
May 09, 2025 04:02PM

Admin
Registered: 14 years ago
Posts: 19,726

sudos,

> I have no idea what I'm even looking at, here. I
> got the machine rebooted and updated the kernel
> to 6.13.8 right after and will be monitoring for any
> weirdness,

> Based on what's included in the trace, could
> this be something to do with the CPU power saving
> stuff? Was this something fixed in a later kernel
> and I'm reading too much into it? I don't know.
> but at least I can drop this here and document it
> for anyone else, because it was quite the scare.

It's a system call to get file statistics (statx). And it's *likely* that you are seeing something that has been fixed in a later kernel. There have been several commits related to this in mainline since Sept 2024 (about the time of 6.10.x). One of them was specific to alignment (but it was not stated as a bug fix).

I think it's better to run 6.13.8 for a while and see whether the problem has gone away before doing any further upgrade.

-bodhi
===========================
Forum Wiki
bodhi's corner (buy bodhi a beer)

sudos

Re: Debian on Dell Kace M300
May 14, 2025 11:13PM

Registered: 10 years ago
Posts: 55

Went into one of the boxes today to check on things and:

[333328.480877] BUG: Bad page state in process find  pfn:25b41
[333328.487203] page: refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x25b41
[333328.495247] memcg:20000
[333328.498490] flags: 0x0(zone=0)
[333328.502352] raw: 00000000 ef586528 ef586528 00000000 00000000 00000000 ffffffff 00000000
[333328.511266] raw: 00020000
[333328.514683] page dumped because: page still charged to cgroup
[333328.521241] Modules linked in: sg marvell_cesa orion_wdt kirkwood_thermal uio_pdrv_genirq uio
[333328.530626] CPU: 0 UID: 0 PID: 5378 Comm: find Not tainted 6.13.8-kirkwood-tld-1 #1 d786cb445a55a57f0ba1beeb464768952359b8d7
[333328.542683] Hardware name: Marvell Kirkwood (Flattened Device Tree)
[333328.549764] Call trace:
[333328.549776]  unwind_backtrace from show_stack+0x10/0x14
[333328.559166]  show_stack from dump_stack_lvl+0x54/0x5c
[333328.565042]  dump_stack_lvl from bad_page+0xd0/0x10c
[333328.570834]  bad_page from check_new_pages+0x9c/0xa8
[333328.576623]  check_new_pages from __rmqueue_pcplist+0xa4/0x430
[333328.583285]  __rmqueue_pcplist from get_page_from_freelist+0x2a4/0x7b0
[333328.590644]  get_page_from_freelist from __alloc_pages_noprof+0x144/0xa04
[333328.598266]  __alloc_pages_noprof from alloc_slab_page+0x24/0x5c
[333328.605102]  alloc_slab_page from new_slab+0xb4/0x2bc
[333328.610971]  new_slab from ___slab_alloc.constprop.0+0x330/0x408
[333328.617798]  ___slab_alloc.constprop.0 from __slab_alloc.constprop.0+0x34/0x74
[333328.625846]  __slab_alloc.constprop.0 from kmem_cache_alloc_lru_noprof+0x70/0x1c8
[333328.634157]  kmem_cache_alloc_lru_noprof from ext4_alloc_inode+0x18/0xf8
[333328.641692]  ext4_alloc_inode from alloc_inode+0x1c/0x9c
[333328.647831]  alloc_inode from iget_locked+0x64/0x178
[333328.653611]  iget_locked from __ext4_iget+0x11c/0xc5c
[333328.659480]  __ext4_iget from ext4_lookup+0x160/0x26c
[333328.665350]  ext4_lookup from __lookup_slow+0xcc/0x100
[333328.671313]  __lookup_slow from walk_component+0x80/0xcc
[333328.677444]  walk_component from path_lookupat+0x80/0x114
[333328.683669]  path_lookupat from filename_lookup+0x50/0xd8
[333328.689895]  filename_lookup from vfs_statx+0x60/0xc8
[333328.695774]  vfs_statx from do_statx+0x40/0x84
[333328.701030]  do_statx from sys_statx+0x78/0x90
[333328.706289]  sys_statx from ret_fast_syscall+0x0/0x44
[333328.712157] Exception stack(0xf2b9dfa8 to 0xf2b9dff0)
[333328.718020] dfa0:                   7ead25e0 7ead27d8 00000009 01efe870 00000900 000007ff
[333328.727029] dfc0: 7ead25e0 7ead27d8 004702d4 0000018d 00470300 00470300 00470008 00000007
[333328.736034] dfe0: 00000100 7ead2520 00000009 76e2f508
[333328.741893] Disabling lock debugging due to kernel taint

Interestingly this is the one that's running Owntone. I haven't connected anything to it yet since then and it's just been sitting there. Both the tester box and the remote box are completely unaffected.

This is on 6.13.8... I wonder what's causing it. Bad RAM? I really hope not.
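
For keeping an eye on several boxes, a small script can pick the two kinds of errors seen so far out of dmesg output. The following is a minimal sketch, not a proper monitoring tool: the function name find_faults is made up, the signature strings are copied from the two traces in this thread, and the bracketed dmesg timestamp is treated as roughly the seconds since boot.

```python
import re
from typing import Dict, List

# Signatures taken from the two traces posted in this thread.
SIGNATURES = ("Unhandled fault:", "BUG: Bad page state")
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_faults(dmesg_text: str) -> List[Dict[str, object]]:
    """Return one entry per log line that matches a known fault signature."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        seconds, message = float(m.group(1)), m.group(2)
        if any(sig in message for sig in SIGNATURES):
            # 86400 seconds per day; gives the approximate uptime at the fault.
            hits.append({
                "uptime_days": round(seconds / 86400, 1),
                "message": message,
            })
    return hits


sample = """\
[4683963.113411] Unhandled fault: alignment exception (0x001) at 0xffffff41
[4683963.124131] [ffffff41] *pgd=6fffd871, *pte=00000000, *ppte=00000000
[333328.480877] BUG: Bad page state in process find  pfn:25b41
"""
for hit in find_faults(sample):
    print(hit["uptime_days"], hit["message"])
```

With the two header lines from the traces above, this reports about 54.2 days of uptime for the alignment exception on 6.10.11 and about 3.9 days for the bad page state on 6.13.8. To run it against a live box, the text could come from the output of dmesg instead of the sample string.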

Edited 1 time(s). Last edit at 05/14/2025 11:14PM by sudos.

bodhi

Re: Debian on Dell Kace M300
May 15, 2025 03:54PM

Admin
Registered: 14 years ago
Posts: 19,726

sudos,

I don't think it's bad RAM.

> Interestingly this is the one that's running
> Owntone. I haven't connected anything to it yet
> since then and it's just been sitting there. Both
> the tester box and the remote box are completely
> unaffected.

Do you have swap on?

=====

FWIW, one of my several M300s is used in my distributed kernel build farm. Whenever I'm building kernels, this compile node maxes out at ~100% CPU persistently for a couple of hours, but it does not do much disk IO, only writes to RAM.

So I'm guessing this error is related to swap and the file system. It's good to run e2fsck on this rootfs and then make sure there is some swap space. The rule of thumb is 4x RAM for the swap file. But if the box never uses up much of its 2GB RAM, then a smaller swap file is OK. Linux does not behave well without swap.
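
The swap check can be scripted by reading /proc/meminfo. This is a minimal sketch with made-up helper names (parse_meminfo, swap_report), and the sample numbers are hypothetical rather than taken from any box in this thread:

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_report(meminfo: dict) -> str:
    """Summarise swap size relative to RAM and how much of it is in use."""
    ram, swap = meminfo["MemTotal"], meminfo["SwapTotal"]
    if swap == 0:
        return "no swap configured"
    used = swap - meminfo["SwapFree"]
    return f"swap is {swap / ram:.1f}x RAM, {used} kB in use"


# Hypothetical sample values for a box with about 2GB RAM and 2GB swap.
sample = """\
MemTotal:        2062848 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""
print(swap_report(parse_meminfo(sample)))
```

On a real box, the text would come from open("/proc/meminfo").read() instead of the sample string.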

-bodhi

sudos

Re: Debian on Dell Kace M300
May 15, 2025 06:33PM

Registered: 10 years ago
Posts: 55

Yep, it has 2GB of swap space, but it barely touches it, if ever, with the kinds of workloads it gets. It's there more for the sake of having some semblance of swap, so the machine doesn't complain if I do end up compiling something necessary in the future...
A forced e2fsck was the first thing I did, and it came back clean as a whistle. I don't think whatever this is has to do with the filesystem.
Might it be an idea to replace the swap partition with a generated swap file instead?

I did do some other digging, and it looks like earlier 6.13 kernels did in fact have a problem where flatpaks on Arch threw a similar "BUG: Bad page state in process x" error. It was big enough to get fixed very quickly, but I don't think it's necessarily related.

Looks like 6.13 is now EOL and 6.14.6 is the current stable at the time of writing. I wonder if it's been fixed there. The issue being, it might also not be.

bodhi

Re: Debian on Dell Kace M300
May 15, 2025 08:08PM

Admin
Registered: 14 years ago
Posts: 19,726

I've built kernel 6.14.6-kirkwood and am running it on a few boxes. Since 6.15 will be out in a couple of weeks, I think I'll release this 6.14.6 stable version (if nothing significant comes in on 6.14.6+ stable in the next couple of days).

-bodhi

Edited 1 time(s). Last edit at 05/15/2025 08:20PM by bodhi.
