Mem abort at bsa_is_domain_monitored+0x70/0x118
Hi,
We are trying to boot the Qualcomm RB3gen2 board with the ACS build below for SystemReady compliance: https://github.com/ARM-software/arm-systemready/tree/main/SystemReady-devicetree-band/prebuilt_images/v24.11_3.0.0-BET0 We are observing mem aborts in the following path. Could you please advise on the level 3 (L3) mapping from the bsa_is_domain_monitored perspective?
We have raised the issue at https://gitlab.arm.com/linux-arm/linux-acs/-/issues/4 as well. We are also raising it here in case linux-acs is not the right place for it.
[ 8.161560] Mem abort info:
[ 8.163260] cpu cpu4: EM: created perf domain
[ 8.166255] Unable to handle kernel paging request at virtual address ffff8000812bb9c0
[ 8.166258] Mem abort info:
[ 8.166259] ESR = 0x0000000096000007
[ 8.166260] EC = 0x25: DABT (current EL), IL = 32 bits
[ 8.166262] SET = 0, FnV = 0
[ 8.166262] EA = 0, S1PTW = 0
[ 8.166263] FSC = 0x07: level 3 translation fault
[ 8.166264] Data abort info:
[ 8.166264] ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[ 8.166265] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 8.166266] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 8.166267] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000179521000
[ 8.166269] [ffff8000812bb9c0] pgd=100000010023b003, p4d=100000010023b003, pud=100000010023c003, pmd=1000000103615003, pte=0000000000000000
[ 8.166312] CPU: 5 PID: 275 Comm: (udev-worker) Not tainted 6.10.14-yocto-standard #1
[ 8.166314] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 addons video mezz platform (DT)
[ 8.166315] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 8.166317] pc : bsa_is_domain_monitored+0x70/0x118
[ 8.166323] lr : bsa_is_domain_monitored+0x60/0x118
[ 8.166324] sp : ffff80008178b2e0
[ 8.166325] x29: ffff80008178b2e0 x28: 00000001a7382000 x27: 0000000000000001
[ 8.166327] x26: 00000001a7382000 x25: 0000000ffffe0000 x24: 0000000000000001
[ 8.166329] x23: ffffc9becaa17000 x22: ffff623c4026da60 x21: ffff8000812b8000
[ 8.166331] x20: ffff623c4026da60 x19: ffff623c411b0000 x18: ffff80008099d108
[ 8.166333] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 8.166335] x14: 0000000000000000 x13: ffff623c40d32000 x12: 0000000000000000
[ 8.166337] x11: 0000000000052820 x10: 0000000000000000 x9 : ffffc9bec7eeae48
[ 8.166340] x8 : ffff623ce7380d80 x7 : 0000000000000000 x6 : 0000000000000fff
[ 8.166342] x5 : 000000000000000c x4 : ffff623c42116a80 x3 : 0000000000000002
[ 8.166343] x2 : 0000000000000008 x1 : 0000000000000004 x0 : ffff623c411b0000
[ 8.166346] Call trace:
[ 8.166348] bsa_is_domain_monitored+0x70/0x118
[ 8.166350] __iommu_map+0xa4/0x270
[ 8.166353] iommu_map_sg+0xcc/0x1c0
[ 8.166355] iommu_dma_map_sg+0x348/0x510
[ 8.166357] __dma_map_sg_attrs+0xa4/0xb0
[ 8.166361] dma_map_sg_attrs+0x1c/0x40
[ 8.166364] sdhci_pre_dma_transfer+0xe0/0x178
[ 8.166368] sdhci_pre_req+0x44/0x60
[ 8.166370] mmc_blk_mq_issue_rq+0x418/0x948
[ 8.166372] mmc_mq_queue_rq+0x128/0x268
[ 8.166374] blk_mq_dispatch_rq_list+0x11c/0x740
[ 8.166377] __blk_mq_sched_dispatch_requests+0x4a0/0x5b0
[ 8.166380] blk_mq_sched_dispatch_requests+0x38/0x80
[ 8.166383] blk_mq_run_hw_queue+0x104/0x1b0
[ 8.166384] blk_mq_flush_plug_list.part.0+0x1dc/0x610
[ 8.166386] blk_mq_flush_plug_list+0x28/0x48
[ 8.166388] __blk_flush_plug+0x108/0x178
[ 8.166390] blk_finish_plug+0x48/0x68
[ 8.166392] read_pages+0x180/0x308
[ 8.166395] page_cache_ra_unbounded+0x10c/0x1f8
[ 8.166398] force_page_cache_ra+0xb0/0xf0
[ 8.166400] page_cache_sync_ra+0x54/0xc0
[ 8.166403] filemap_get_pages+0xcc/0x6f0
[ 8.166404] filemap_read+0xe4/0x388
[ 8.166405] blkdev_read_iter+0x80/0x178
[ 8.166408] vfs_read+0x288/0x338
[ 8.166411] ksys_read+0x80/0x128
[ 8.166413] __arm64_sys_read+0x28/0x40
[ 8.166416] invoke_syscall+0x54/0x130
[ 8.166419] el0_svc_common.constprop.0+0x4c/0x100
[ 8.166422] do_el0_svc+0x28/0x40
[ 8.166425] el0_svc+0x38/0xe8
[ 8.166428] el0t_64_sync_handler+0x128/0x138
[ 8.166430] el0t_64_sync+0x19c/0x1a0
[ 8.166433] Code: aa0003f3 b40001c0 f9442c15 b40000f5 (f95ce2a0)
[ 8.166434] ---[ end trace 0000000000000000 ]---
Thanks, Naina
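For anyone reading the abort above, the ESR value 0x0000000096000007 can be split into its fields by hand. The following is a minimal userspace sketch, not kernel code. The helper names (esr_ec, esr_il, esr_iss, esr_dfsc, dfsc_name) are made up for illustration, and the bit positions assume the ESR_ELx layout from the Arm architecture: EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and the data fault status code in ISS[5:0].

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers; bit positions assume the ESR_ELx layout. */
static unsigned int esr_ec(uint64_t esr)   { return (unsigned int)((esr >> 26) & 0x3f); }
static unsigned int esr_il(uint64_t esr)   { return (unsigned int)((esr >> 25) & 0x1); }
static uint32_t     esr_iss(uint64_t esr)  { return (uint32_t)(esr & 0x1ffffff); }
/* Only meaningful when EC reports a data abort. */
static unsigned int esr_dfsc(uint64_t esr) { return (unsigned int)(esr & 0x3f); }

/* Names only the translation-fault codes; everything else is "other". */
static const char *dfsc_name(unsigned int dfsc)
{
    switch (dfsc) {
    case 0x04: return "level 0 translation fault";
    case 0x05: return "level 1 translation fault";
    case 0x06: return "level 2 translation fault";
    case 0x07: return "level 3 translation fault";
    default:   return "other";
    }
}

/* Prints the fields and checks them against the values in the log above. */
static void esr_decode_example(void)
{
    const uint64_t esr = 0x0000000096000007ULL;

    printf("EC=0x%02x IL=%u ISS=0x%08x DFSC=0x%02x (%s)\n",
           esr_ec(esr), esr_il(esr), (unsigned int)esr_iss(esr),
           esr_dfsc(esr), dfsc_name(esr_dfsc(esr)));

    assert(esr_ec(esr) == 0x25);    /* DABT (current EL) */
    assert(esr_il(esr) == 1);       /* IL = 32 bits */
    assert(esr_iss(esr) == 0x7);    /* ISS = 0x00000007 */
    assert(esr_dfsc(esr) == 0x07);  /* level 3 translation fault */
}
```

Calling esr_decode_example() from a main() reproduces the EC, IL, ISS and FSC values that the kernel printed, which is a quick way to confirm that the fault is a read from an address with no level 3 page table entry (pte=0 in the log).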
Hi @NainaMehtaQUIC, kindly share the complete Linux logs for the above mem abort crash.
Thanks, ACS Team
Hi, @NainaMehtaQUIC
The debug image is available in the forked branch https://github.com/ajayswar-s/arm-systemready/tree/Partner_Image, inside the Debug_image folder. Due to GitHub space constraints, the systemready-dt_acs_live_image.wic.xz file has been split into two parts. To reconstruct the full image, please use the following command:
cat part_* > systemready-dt_acs_live_image.wic.xz
Kindly run this image and share the logs with us for further analysis.
Thanks, ACS Team
Hi,
Please find attached the kernel logs with debug image across 3 reboots.
Regards, Naina
Hi @NainaMehtaQUIC,
Thanks for the logs. Based on our analysis, the issue seems to be caused by missing drivers for some devices during Linux boot.
Because device initialization fails, the data read later is not valid, which results in the crash.
The ACS DT image has the configs in this file enabled by default: https://github.com/ARM-software/arm-systemready/blob/main/SystemReady-devicetree-band/Yocto/meta-woden/recipes-kernel/linux/files/systemready.cfg. Can you add the required configs to it, build a new DT image, and confirm whether the crash is still observed?
[ 6.190603] ath11k 17a10040.wifi: Adding to iommu group 7
[ 6.191483] qcom-spmi-adc5 c440000.spmi:pmic@0:adc@3100: error -EINVAL: adc get dt data failed
[ 6.191492] qcom-spmi-adc5 c440000.spmi:pmic@0:adc@3100: probe with driver qcom-spmi-adc5 failed with error -22
[ 6.192424] qcom_pmic_glink pmic-glink: Failed to create device link (0x180) with 1-001c
[ 6.193378] ath11k 17a10040.wifi: wcn6750 hw1.0
[ 6.197162] arm-smmu 3da0000.iommu: Stage-1: 48-bit VA -> 36-bit IPA
[ 6.197566] g_bsa_iommu_domain is empty
[ 6.201187] g_bsa_iommu_domain is empty
[ 6.205530] arm-smmu 3da0000.iommu: preserved 0 boot mappings
[ 6.206179] g_bsa_iommu_domain is empty
[ 6.209556] g_bsa_iommu_domain is empty
[ 6.210956] ath11k 17a10040.wifi: failed to setup msa resources
[ 6.211078] ath11k 17a10040.wifi: probe with driver ath11k failed with error -2
[ 6.211936] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060
Thanks, Chetan
Hi @NainaMehtaQUIC,
If a similar issue is seen after enabling all the required drivers for the system, please reopen the ticket.
Thanks, Chetan
Hi @chetan-rathore,
It seems that I don't have the permission to reopen the ticket.
We tried enabling multiple configs to make sure that the drivers were getting probed. However, we are still observing the mem abort in the bsa_is_domain_monitored path across different drivers. We are able to boot to the shell without https://gitlab.arm.com/linux-arm/linux-acs/-/raw/master/kernel/src/0001-BSA-ACS-Linux-6.10.patch.
We added a few debug logs, and the crash appears to come from the point marked below.
The board doesn't support SATA, and the abort is seen while dereferencing ata_shost_to_port(shost)->dev.
We wanted to check whether the call can be bypassed for cases where SATA support is not present.
static int bsa_get_sata_dev(void)
{
    int ret = -1;
    struct Scsi_Host *shost;
    struct ata_port *ap;
    struct scsi_device *sdev = NULL;
    unsigned int i = 0;

    do {
        pr_err("bsa_get_sata_dev loop iterator: %d", i);
        shost = scsi_host_lookup(i++);
        if (shost) {
            sdev = NULL;
            ap = ata_shost_to_port(shost);
            if ((ap == NULL) || (ap->dev == NULL)) /* ap is non-NULL, but dereferencing ap->dev causes the abort */
                continue; //Not a ATA port
            if ((ap->scsi_host == NULL) || (ap->scsi_host != shost))
                continue; //Not a valid ATA Port
            ...
        }
    } while(shost);
Thanks, Naina
@chetan-rathore , can you please reopen this?
Hi, Here is our analysis:
- Our SoC supports UFS, and it is the boot media. The UFS driver registers as a SCSI host (Scsi_Host) at index 0.
- bsa_get_sata_dev scans all the SCSI host devices and assumes that each one is an ATA host device.
- It first checks whether ap (the private scsi_host structure) is NULL and whether ap->dev is NULL. On our SoC this private structure pointer is ufs_hba, so ap->dev can point to any field in the ufs_hba structure, or it can be outside the structure boundary. In the crash case, it is non-NULL.
- We made the change below to access scsi_host in ap (which is the first member) and compare it with shost:

    do {
        shost = scsi_host_lookup(i++);
        if (shost) {
            sdev = NULL;
            ap = ata_shost_to_port(shost);
            if ((ap == NULL) || (ap->scsi_host != shost))
                goto cont; //Not a ATA port
            if (ap->dev == NULL)
                goto cont; //Not a valid ATA Port
            do {
                /* get the device connected to this host */
                sdev = __scsi_iterate_devices(shost, sdev);
                if (sdev) {
                    g_bsa_iommu_domain = iommu_get_domain_for_dev(ap->dev);
                    ret = 0;
                }
            } while(sdev);
cont:
            scsi_host_put(shost);
        }
    } while(shost);

    return ret;
}

With this change we are seeing:
[ 11.992655] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
It is better to check whether shost is an ATA port before accessing priv_struct.
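One way to picture the failure described above is a small userspace model of the hostdata area. This is only a sketch under stated assumptions: it assumes, as in mainline libata, that ata_shost_to_port() reads a pointer from the start of hostdata, and that a driver such as UFS places its private structure directly inside hostdata. All model_* names are hypothetical stand-ins and not kernel types.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct model_host {               /* stands in for struct Scsi_Host */
    unsigned int host_no;
    uintptr_t hostdata[];         /* driver-private area at the end of the host */
};

struct model_port {               /* stands in for struct ata_port */
    struct model_host *scsi_host;
    void *dev;
};

struct model_hba {                /* stands in for an embedded private struct */
    void *mmio_base;
    void *dev;
};

/* Allocates a host with priv_size bytes of zeroed private data. */
static struct model_host *model_host_alloc(unsigned int no, size_t priv_size)
{
    struct model_host *h;

    assert(priv_size >= sizeof(void *));
    h = calloc(1, sizeof(*h) + priv_size);
    if (h)
        h->host_no = no;
    return h;
}

/* ATA-style accessor: the first word of hostdata is taken as a port pointer. */
static struct model_port *model_shost_to_port(struct model_host *h)
{
    struct model_port *ap;

    memcpy(&ap, h->hostdata, sizeof(ap));
    return ap;
}

/* Embedded-style accessor: the private struct lives in hostdata itself. */
static struct model_hba *model_shost_to_hba(struct model_host *h)
{
    return (struct model_hba *)h->hostdata;
}

/* Shows that the ATA-style accessor returns the first field of the other struct. */
static void model_confusion_example(void)
{
    static int fake_regs;         /* stands in for a register mapping */
    struct model_host *h = model_host_alloc(0, sizeof(struct model_hba));

    assert(h != NULL);
    model_shost_to_hba(h)->mmio_base = &fake_regs;

    /* The "port" is non-NULL, yet it is not a port at all. */
    assert((void *)model_shost_to_port(h) == (void *)&fake_regs);
    free(h);
}
```

In this model the returned "port" pointer is non-NULL, so a NULL check on ap passes, and any later ap->dev or ap->scsi_host access reads memory that belongs to something else. That matches the observation that ap is non-NULL in the crash case, and it is why a check on the host type has to come before the private data is interpreted.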
Hi @NainaMehtaQUIC and @quic-bhaskarv,
Thanks for the analysis. We will review the changes proposed and get back asap.
Thanks, Chetan
@chetan-rathore any update?
Hi @quic-pansing,
We are working on this and will try to provide an update by next week.
Thanks, Chetan
Hi @quic-pansing and @NainaMehtaQUIC,
The changes seem fine, but can you confirm whether the issue was still observed with them? The previous comment indicates that the issue still persists:
With this change we are seeing: [ 11.992655] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
Hi @NainaMehtaQUIC and @quic-pansing,
Any updates on the above query?
@chetan-rathore we are working on this query. Will update.
We are debugging to find the root cause; it will take some time.
I wanted to check with you, @chetan-rathore and the ARM team, on the conditions that can be used to check whether a given Scsi_Host is an ATA port before calling the ata_shost_to_port function. We did not find any way to check whether the host is an ATA port. There are a few flags that we can check, but libata is a generic library implementation, and a module that depends on this library has complete flexibility in how it creates an ATA port. Can you check whether you know of, or have found, any mechanism to identify an ATA port before calling the function?
@chetan-rathore can you respond to Bhaskar's query?
Hi @quic-pansing and @quic-bhaskarv,
I agree that finding out whether the host is an ATA port is very tricky. We are trying a couple of flag checks; the same can be tried at your end.
Please note these changes are WIP.
@chetan-rathore @quic-pansing @quic-bhaskarv do you have an update? We are seeing the same issue on RB3gen2 ACS image.
With the function below, we are not seeing any crash. However, we are not able to boot to the shell. It seems there is some other issue, as we also tried removing 0001-BSA-ACS-Linux-6.10.patch.

static int bsa_get_sata_dev(void)
{
    int ret = -1;
    struct Scsi_Host *shost;
    struct ata_port *ap;
    struct scsi_device *sdev = NULL;
    unsigned int i = 0;

    do {
        shost = scsi_host_lookup(i++);
        if (shost) {
            if (!shost->hostt) {
                printk(KERN_WARNING "shost->hostt is NULL for SCSI Host %d\n", i);
                scsi_host_put(shost);
                continue;
            }
            //Step 2: Print the name of the SCSI Host (even if it's NULL)
            //Step 3: Check if the queuecommand is ata_scsi_queuecmd (libata-based)
            if (shost->hostt->queuecommand == ata_scsi_queuecmd) {
                //printk(KERN_WARNING "SCSI index %d, SCSI Host name: %s, This is ATA port\n", i, shost->hostt->name);
            } else {
                //printk(KERN_WARNING "SCSI index %d, SCSI Host name: %s, This is not ATA port\n", i, shost->hostt->name);
                goto cont;
            }
            sdev = NULL;
            ap = ata_shost_to_port(shost);
            if ((ap == NULL) || (ap->scsi_host != shost))
                goto cont; //Not a ATA port
            if (ap->dev == NULL)
                goto cont; //Not a valid ATA Port
            do {
                /* get the device connected to this host */
                sdev = __scsi_iterate_devices(shost, sdev);
                if (sdev) {
                    g_bsa_iommu_domain = iommu_get_domain_for_dev(ap->dev);
                    ret = 0;
                }
            } while(sdev);
cont:
            scsi_host_put(shost);
        }
    } while(shost);
    return ret;
}
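The template check used in the function above can be exercised outside the kernel with a small model. This is a sketch only: every demo_* name is hypothetical, and the real code compares shost->hostt->queuecommand against ata_scsi_queuecmd. The point it illustrates is the ordering: the private data is interpreted only after the host has been identified by the callback its template registered.

```c
#include <assert.h>
#include <stddef.h>

struct demo_host;

/* Stands in for the host template; only the callback used as a tag is kept. */
struct demo_template {
    const char *name;
    int (*queuecommand)(struct demo_host *host, void *cmd);
};

struct demo_port {
    struct demo_host *scsi_host;
};

struct demo_host {
    const struct demo_template *hostt;
    void *hostdata;  /* a port pointer for ATA-style hosts, unrelated data otherwise */
};

/* Two distinct callbacks; they return different values so they stay distinct. */
static int demo_ata_queuecmd(struct demo_host *host, void *cmd)
{
    (void)host; (void)cmd;
    return 0;
}

static int demo_ufs_queuecmd(struct demo_host *host, void *cmd)
{
    (void)host; (void)cmd;
    return 1;
}

/* Returns the port only for hosts registered with the ATA-style callback. */
static struct demo_port *demo_host_to_port(struct demo_host *host)
{
    struct demo_port *ap;

    if (!host || !host->hostt ||
        host->hostt->queuecommand != demo_ata_queuecmd)
        return NULL;              /* hostdata is never interpreted here */

    ap = host->hostdata;
    if (!ap || ap->scsi_host != host)
        return NULL;              /* back-pointer does not match */
    return ap;
}

/* One ATA-style host and one foreign host, checked against the lookup. */
static void demo_template_check_example(void)
{
    const struct demo_template ata_t = { "ata", demo_ata_queuecmd };
    const struct demo_template ufs_t = { "ufs", demo_ufs_queuecmd };
    struct demo_port port = { NULL };
    struct demo_host ata_host = { &ata_t, &port };
    int unrelated = 0;
    struct demo_host ufs_host = { &ufs_t, &unrelated };

    port.scsi_host = &ata_host;

    assert(demo_host_to_port(&ata_host) == &port);
    assert(demo_host_to_port(&ufs_host) == NULL);
}
```

The foreign host is rejected before its hostdata is read, which is the property the earlier version of the loop lacked. Whether comparing queuecommand is sufficient for every libata-based driver is not settled in this thread, so the model should be read as an illustration of the ordering rather than a complete test for ATA hosts.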
I think the better option here would be to use a newer kernel. 6.10 is pretty old, and rb3g2 is getting improvements regularly.
If this is somehow due to missing drivers, you can also try running this script [1] on a known good 6.10 (or newer) Linux environment to determine whether any modules are missing in the Yocto 6.10 image.
@chetan-rathore With v25.04_3.0.1, kernel 6.12 is not reaching the shell. Logs are attached. It looks like there is some change in v25.04_3.0.1 that is blocking the boot. With kernel 6.10, v25.04_3.0.1 is also not reaching the shell. Can you check? The issue looks to be on the UFS side: when the UFS configs are disabled, the device reaches the shell. Disabled configs: CONFIG_SCSI_UFS_CDNS_PLATFORM, CONFIG_SCSI_UFS_QCOM, CONFIG_SCSI_UFS_RENESAS, CONFIG_SCSI_UFS_TI_J721E.
@chetan-rathore @rajatgoyal47 Can you provide an update on this? It is blocking our validation.
After more trials and validation, we found that if we disable only CONFIG_SCSI_UFS_RENESAS, the device reaches the shell. Our current device does not use ufs-renesas.c.
Hi @obbardc and @quic-pansing,
We are working on removing the DMA patch dependency during Linux boot. The changes are under review, and we are planning to close them by this week.
I will share a new debug DT image to get feedback on the Linux boot part.
Thanks, Chetan
Thanks Chetan. I will raise a new case for the comment below to avoid confusion.
https://github.com/ARM-software/arm-systemready/issues/252#issuecomment-2934059711
Hi @quic-pansing,
As mentioned in ticket #402, can you share your observations on BSA Linux execution with the DT image at this location? https://github.com/chetan-rathore/arm-systemready/tree/Images/images/linux_porting
cc @NainaMehtaQUIC
Thanks, Chetan
The image seems to now boot fine for me, see attached log for more detail. I am not sure if this is a good run or not, due to not knowing about ACS internals.
iotil_rb3gen2_acs_17_june_2025.log
When will this fix be merged into the ACS image?
Hi @obbardc,
As recently noted in comment #306, the changes were made on the BSA Linux side, have already been merged, and should be included in the latest DT image build from source.
Thanks, Chetan
Hi @NainaMehtaQUIC , @quic-pansing,
Could you also share your feedback on this issue based on the image that was shared?
Thanks, Chetan
@chetan-rathore we shared our response at https://github.com/ARM-software/arm-systemready/issues/402#issuecomment-29757349. With the given patch, we see no issue and the device reaches the shell.
When will this patch be merged?