More Debugging in CE

So I just posted a blog entry about debugging and lo and behold, not 30 minutes later I his an exception in my platform.  I’ve done this tracing before, but this time I decided I’d document my discovery process and post it here as a real-world example.  It turned out to be an interesting example that I’d not hit before, so all the more fun to post.

So here’s the background: I have a CE 4.2 PXA255-based system.  When I go to sleep and then wake back up with a USB device (keyboard, mouse or mass torage device) inserted I get the following exception:

RaiseException: Thread=81a8db58 Proc=816e7490 ‘device.exe’
AKY=00000005 PC=03fd3f80 RA=800bd398 BVA=00000001 FSR=00000001

This exception only happens at wake – if the device is in during boot there’s no problem.

Nicely, when PB makes an image, it dumps the output into a PLG file in the same folder as your WCE and PBW files.  This output provides the map of where everything gets assembled. 

So let’s take a look at what my exception is telling me.  I like to start with the return address (RA in the exception output) to see who the caller is causing the problem.  From Sue’s entry, we can tell immediately it’s in the kernel because the address is after 80000000. 

We need to find the offset (plus verify this assumption) so we’ll look at the build output.Here’s an excerpt from my myproject.plg file:

Module                 Section  Start     Length  psize   vsize   Filler
———————- ——– ——— ——- ——- ——- ——
nk.exe                 .text    800b9000  258048  255488  255440 o32_rva=00001000
nk.exe                 .pdata   800f8000    8192    8192    7872 o32_rva=0006a000
cplmain.cpl            .text    800fa000  110592  109056  108976 o32_rva=00001000
cplmain.cpl            .rsrc    80115000  126976  126464  126276 o32_rva=0001f000

We can see that our return address of 800bd398 is in nk.exe at offset 0x4398 (0x800db398-0x800b9000).

Now we look in our _FLATRELEASEDIR and pull out the MAP file for NK.EXE.  This tells us the offsets for the beginnings of every function in the module (keep an eye on statics listed at the bottom of the map file).  Here’s an excerpt from NK.MAP:

 0001:000041f8       UndefException             8c5851f8     nk:armtrap.obj
 0001:00004218       SWIHandler                 8c585218     nk:armtrap.obj
 0001:0000421c       FIQHandler                 8c58521c     nk:armtrap.obj
 0001:00004238       PrefetchAbortEH            8c585238     nk:armtrap.obj
 0001:00004240       PrefetchAbort              8c585240     nk:armtrap.obj
 0001:000044bc       MD_CBRtn                   8c5854bc     nk:armtrap.obj
 0001:00004578       CommonHandler              8c585578     nk:armtrap.obj
 0001:00004584       SaveAndReschedule          8c585584     nk:armtrap.obj

My calculated offset of 0x4398 is after the start of PrefetchAbort and before MD_CBRtn, so my exception is being throw by PrefetchAbort.  That surprises me – why the hell would PrefetchAbort be throwing an exception?  Even better, why is my code in PrefetchAbort in the first place?

Let’s see where the exception is coming from exactly by chasing the program counter (PC=03fd3f80) from the exception.  Again we trun to our PLG file.  Here’s an excerpt:

Module coredll.dll   at offset 01fff000 data, 03f71000 code
Module regenum.dll   at offset 01ffd000 data, 03f61000 code
Module pm.dll        at offset 01ffb000 data, 03f51000 code

Based on this we can see that our program counter address of 0x03fd3f80 is in coredll.dll at offset 0x62f80.  This is odd as I’d expect it to be in phci.dll or device.exe.  It appears that maybe we’re passing some invalid info to a Win32 API, which then causes it to throw an exception?  Let’s do more digging.

We go back to our _FLATRELEASEDIR and open up COREDLL.MAP to find what function in there is the culprit.  Here’s an excerpt:

 0001:00062e40       __fp_mult_uncommon         10063e40     coredll_ALL:mul.obj
 0001:00062f5c       __rt_div0                  10063f5c f   coredll_ALL:__div0.obj
 0001:00062f90       __ld12mul                  10063f90 f   coredll_ALL:tenpow.obj

Ahhh…now things come together.  An offset of 0x62f80 puts us in a function called “__rt_div0”, which I’m going to assume is a runtime divide by zero exception.  That explains why we see an RA leading to PrefetchAbort.  Something in the platform code is causing a divide by zero error, which then causes PrefetchAbort to be called, which in turn enters __rt_div0.

So now what?  I still don’t know what library or function caused the exception, and herein is a flaw in trying to track problems with addresses.  In this case (and this is the first I’ve seen this happen) I’ve got almost nothing.  I know it’s in the USB host driver simply because of the  physical effects (happens only with USB devices, and when it happens the device stops working).  I know it’s a divide by zero error.  That’s all I get from this entire exercise – I wish it was a prettier case study that led me right to the line of code causing it, but hey, if it were simple everyone would do it.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s