2020-05-06

Beware of ahci module change from dynamic to built-in on Artix (Arch) Linux


Recently, after a routine system update, my Artix-based NAS server suddenly became unbootable. The boot process on this machine has never been very stable, so at first I assumed it was yet another "moody" day caused by the Marvell controller on the ADPE4S-PB daughterboard. In that case, however, the boot usually succeeds after several attempts. When it does not, the cause has occasionally been a failed initramfs generation during the update. In those situations I regenerated the initramfs by booting the Artix (previously Arch) installation system and chrooting into the main system, or by using the fallback initramfs boot option. This procedure requires reconnecting the system drive to the VIA-based controller. This time, nothing brought the system back to life, and I began to suspect that I was dealing with a new issue. My first guess was failing old hardware, but in the end it turned out to be a software problem. Once I had booted the system successfully through the integrated VIA controller, it became obvious that the module configuration was not being applied for some reason. I confirmed this with the lspci output, which did not show the AHCI driver bound to the Marvell controller.


A little investigation revealed that the ahci.ko.xz and libahci.ko.xz module files were missing from the new kernel's /usr/lib/modules//kernel/drivers/ata directory. They had been present there before the upgrade. Since I depend on a parameter specific to the AHCI driver, the reason for the failure was pretty clear. I still did not know why the modules were missing, but my first priority was to restore the system as quickly as possible. The first solution I came up with was to revert to the previous kernel. This was possible because the pacman package manager keeps older packages in its cache. Running "pacman -U /var/cache/pacman/pkg/linux-5.5.10.artix1-1-x86_64.pkg.tar.xz" downgraded the kernel and, fortunately, the system was bootable again.
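To tell apart "the module is missing" from "the driver is compiled into the kernel", a small check along these lines may help. It is only a sketch: the function name ahci_status is made up for illustration, and it assumes the kernel package installs the usual modules.builtin index next to the kernel/ tree.

```sh
#!/bin/sh
# ahci_status DIR (hypothetical helper, not part of any package)
# DIR is a kernel module directory, e.g. /usr/lib/modules/$(uname -r).
# Prints "builtin", "module" or "missing" for the ahci driver.
ahci_status() {
    dir=$1
    # modules.builtin lists drivers compiled into the kernel image,
    # one path per line, e.g. kernel/drivers/ata/ahci.ko
    if grep -q '/ahci\.ko$' "$dir/modules.builtin" 2>/dev/null; then
        echo builtin
    # Loadable modules are shipped as files (possibly compressed, .ko.xz)
    elif find "$dir/kernel/drivers/ata" -name 'ahci.ko*' 2>/dev/null | grep -q .; then
        echo module
    else
        echo missing
    fi
}

# Example call for the running kernel:
#   ahci_status "/usr/lib/modules/$(uname -r)"
```

On a kernel that ships ahci as a file the result should be "module"; "builtin" means that module options configured for a dynamic module no longer have any effect.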

The downgrade was supposed to be temporary, since I can't ignore upgrades forever. Initially, I assumed that the missing modules were a mistake, so I filed a bug report. However, it was soon closed with the explanation that the ahci modules are now built into the kernel and that this is not a bug. I believe this change happened with the 5.6 kernel series, since 5.5-based kernels were still working for me. I therefore tried to apply the equivalent of my dynamic module configuration to the kernel itself. Thanks to the good Arch Linux online documentation, it was easy to find the required information. This page describes how to pass module parameters to the kernel and this one describes the GRUB bootloader configuration. To set a parameter of a built-in module, "module.param_name=param_value" needs to be added to the kernel command line. Blacklisting is done by adding the "module_blacklist=module_name" parameter. In my specific case, I needed to pass marvell_enable=1 to the ahci driver and blacklist the pata_marvell module. So these were the steps I had to perform:
  • sudo vi /etc/default/grub
    • change GRUB_CMDLINE_LINUX line to:
    • GRUB_CMDLINE_LINUX="ahci.marvell_enable=1 module_blacklist=pata_marvell"
  • regenerate grub.cfg by running sudo grub-mkconfig -o /boot/grub/grub.cfg
It should be safe to keep both the GRUB configuration and the previous configuration for dynamic modules. They should not interfere with each other, and each should simply be ignored when it does not match the way the driver is shipped (built-in or dynamic).
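After rebooting, it is worth confirming that the parameters really reached the kernel. The following is a minimal sketch, assuming the running kernel's command line is readable from /proc/cmdline; the helper name cmdline_has is hypothetical.

```sh
#!/bin/sh
# cmdline_has PARAM [FILE] (hypothetical helper)
# Succeeds if PARAM appears as a whole, space-separated word in the
# kernel command line. FILE defaults to /proc/cmdline.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put every parameter on its own line, then match the whole line literally
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Example use after a reboot:
#   for p in ahci.marvell_enable=1 module_blacklist=pata_marvell; do
#       if cmdline_has "$p"; then echo "$p: present"; else echo "$p: MISSING"; fi
#   done
```

If the built-in driver exposes the parameter through sysfs, its current value should also be readable from /sys/module/ahci/parameters/marvell_enable, and "lspci -k" shows which kernel driver is bound to the Marvell controller.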

Unfortunately, this solution didn't work as well as expected. Though I was able to boot into the system, the success rate dropped to an unbearable level. Only 1 out of 5 to 7 attempts was even partially successful. By that I mean that the system booted and I could interact with it, but none of the attempts initialized all the hard drives correctly. Either the system drive or one of the other two hard drives in the LVM RAID failed. Thus, at best the RAID volume wasn't mounted, and at worst the system failed to boot. Quite often the system disk was still not recognized early in the boot process, which led to the rescue shell. Because of that, I was forced to revert to the old kernel again.

Considering that I didn't manage to fix the issue, I am not sure whether a workable and stable solution exists at this point. A manually built kernel with ahci configured as a module may help, but it would mean that I would need to track kernel upgrades myself. Moreover, building a Linux kernel is not as trivial as I wish it were, so it is not a viable option for me. As a temporary solution I can completely disable Linux kernel upgrades and keep the other software up to date. In the long term, however, this may force me to look for another distribution that still ships AHCI as a module, or even to consider a complete hardware upgrade. Only time will tell which one will be easier to implement.
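Holding the kernel back with pacman can be done through the IgnorePkg option in /etc/pacman.conf (for example "IgnorePkg = linux" in the [options] section) or, for a single run, with "pacman -Syu --ignore linux". The sketch below only checks whether such a hold is already configured. It assumes the kernel package is named "linux", as the cached file name above suggests, and the function name kernel_held is hypothetical.

```sh
#!/bin/sh
# kernel_held [CONF] (hypothetical helper)
# Succeeds if the package "linux" is listed on an active IgnorePkg line
# of CONF (default: /etc/pacman.conf). Glob patterns are not expanded.
kernel_held() {
    conf=${1:-/etc/pacman.conf}
    # Keep only uncommented IgnorePkg lines, drop everything up to "=",
    # split the package list into one name per line, match "linux" exactly
    grep -E '^[[:space:]]*IgnorePkg[[:space:]]*=' "$conf" 2>/dev/null |
        sed 's/^[^=]*=//' | tr -s ' \t' '\n\n' | grep -qx 'linux'
}

# Example use:
#   kernel_held || echo "kernel upgrades are not being held back"
```

With the hold in place, a regular "pacman -Syu" should skip the kernel while upgrading everything else.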

In conclusion, if an Arch Linux based system is used together with a Marvell 88SE6145 SATA controller (or any other Marvell controller that requires ahci module-specific configuration), I currently advise against upgrading to a 5.6.x kernel. One can experiment with kernel parameters as described above, but it is advisable to make a backup image of the system beforehand so that it can be easily restored. The downgrade path may not always succeed because of other dependencies and should not be relied on.
