So I just posted a blog entry about debugging and lo and behold, not 30 minutes later I his an exception in my platform. I’ve done this tracing before, but this time I decided I’d document my discovery process and post it here as a real-world example. It turned out to be an interesting example that I’d not hit before, so all the more fun to post.
So here’s the background: I have a CE 4.2 PXA255-based system. When I go to sleep and then wake back up with a USB device (keyboard, mouse or mass torage device) inserted I get the following exception:
RaiseException: Thread=81a8db58 Proc=816e7490 ‘device.exe’
AKY=00000005 PC=03fd3f80 RA=800bd398 BVA=00000001 FSR=00000001
This exception only happens at wake – if the device is in during boot there’s no problem.
Nicely, when PB makes an image, it dumps the output into a PLG file in the same folder as your WCE and PBW files. This output provides the map of where everything gets assembled.
So let’s take a look at what my exception is telling me. I like to start with the return address (RA in the exception output) to see who the caller is causing the problem. From Sue’s entry, we can tell immediately it’s in the kernel because the address is after 80000000.
We need to find the offset (plus verify this assumption) so we’ll look at the build output.Here’s an excerpt from my myproject.plg file:
MODULES Section
Module Section Start Length psize vsize Filler
———————- ——– ——— ——- ——- ——- ——
nk.exe .text 800b9000 258048 255488 255440 o32_rva=00001000
nk.exe .pdata 800f8000 8192 8192 7872 o32_rva=0006a000
cplmain.cpl .text 800fa000 110592 109056 108976 o32_rva=00001000
cplmain.cpl .rsrc 80115000 126976 126464 126276 o32_rva=0001f000
We can see that our return address of 800bd398 is in nk.exe at offset 0x4398 (0x800db398-0x800b9000).
Now we look in our _FLATRELEASEDIR and pull out the MAP file for NK.EXE. This tells us the offsets for the beginnings of every function in the module (keep an eye on statics listed at the bottom of the map file). Here’s an excerpt from NK.MAP:
0001:000041f8 UndefException 8c5851f8 nk:armtrap.obj
0001:00004218 SWIHandler 8c585218 nk:armtrap.obj
0001:0000421c FIQHandler 8c58521c nk:armtrap.obj
0001:00004238 PrefetchAbortEH 8c585238 nk:armtrap.obj
0001:00004240 PrefetchAbort 8c585240 nk:armtrap.obj
0001:000044bc MD_CBRtn 8c5854bc nk:armtrap.obj
0001:00004578 CommonHandler 8c585578 nk:armtrap.obj
0001:00004584 SaveAndReschedule 8c585584 nk:armtrap.obj
My calculated offset of 0x4398 is after the start of PrefetchAbort and before MD_CBRtn, so my exception is being throw by PrefetchAbort. That surprises me – why the hell would PrefetchAbort be throwing an exception? Even better, why is my code in PrefetchAbort in the first place?
Let’s see where the exception is coming from exactly by chasing the program counter (PC=03fd3f80) from the exception. Again we trun to our PLG file. Here’s an excerpt:
Module coredll.dll at offset 01fff000 data, 03f71000 code
Module regenum.dll at offset 01ffd000 data, 03f61000 code
Module pm.dll at offset 01ffb000 data, 03f51000 code
Based on this we can see that our program counter address of 0x03fd3f80 is in coredll.dll at offset 0x62f80. This is odd as I’d expect it to be in phci.dll or device.exe. It appears that maybe we’re passing some invalid info to a Win32 API, which then causes it to throw an exception? Let’s do more digging.
We go back to our _FLATRELEASEDIR and open up COREDLL.MAP to find what function in there is the culprit. Here’s an excerpt:
0001:00062e40 __fp_mult_uncommon 10063e40 coredll_ALL:mul.obj
0001:00062f5c __rt_div0 10063f5c f coredll_ALL:__div0.obj
0001:00062f90 __ld12mul 10063f90 f coredll_ALL:tenpow.obj
Ahhh…now things come together. An offset of 0x62f80 puts us in a function called “__rt_div0”, which I’m going to assume is a runtime divide by zero exception. That explains why we see an RA leading to PrefetchAbort. Something in the platform code is causing a divide by zero error, which then causes PrefetchAbort to be called, which in turn enters __rt_div0.
So now what? I still don’t know what library or function caused the exception, and herein is a flaw in trying to track problems with addresses. In this case (and this is the first I’ve seen this happen) I’ve got almost nothing. I know it’s in the USB host driver simply because of the physical effects (happens only with USB devices, and when it happens the device stops working). I know it’s a divide by zero error. That’s all I get from this entire exercise – I wish it was a prettier case study that led me right to the line of code causing it, but hey, if it were simple everyone would do it.