I’ve been working with a customer to perform a healthcheck on their Active Directory in order to (hopefully) mitigate the risk of issues as they migrate users and mailboxes between domains. One of the things that concerned me was that
dcdiag.exe – one of the Windows Server 2003 Support Tools that I was using as part of the healthcheck – was crashing part way through.
I was pretty stumped, so I used one of the support incidents on our Microsoft Premier Support contract… and as my expert colleagues in Fujitsu’s Enterprise Support team guided me through the troubleshooting process towards a resolution (which was obvious to anyone thinking clearly), I realised that I should have been able to work this through by myself.
Now that the issue is resolved, I’m kicking myself for effectively wasting an incident on what should have been straightforward, but that’s what happens when you spend so much time talking about technology and designing solutions and so little actually resolving problems (it probably also has something to do with spending so much time travelling and so little time sleeping). So, at the risk of embarrassing myself in years to come with a post that proves what an idiot I can be, I decided to post a little lesson on troubleshooting incidents like this, in the hope that someone else finds it useful…
- Don’t panic. OK, so you’re on a client site, on your own, and the customer is paying for your expertise but (as one of my customers taught me many years back – thank you Andy Cumiskay if you’re reading), an expert does not necessarily know all the answers. An expert knows how to analyse a situation and ask the right questions to find the answer. Stop and think.
- What are you doing? In my case I was running dcdiag /e /c /v /f:dcdiag.log and it was aborting. So, what was I actually asking the computer to do? Well, /e means for all servers in the enterprise – so what if I run the command against individual servers? Does it affect them all – and is there a pattern to the failure? /c means comprehensive – is there just a single test that’s failing? /v is verbose – that’s probably fine, and /f is for logging to a file, no problem there either. Using this method, the problem was narrowed down to a single domain controller.
- Could this be done another way? In my case, I was running the command from a remote server – what if I run it from the target computer? As it turned out, the problem existed whether the command was run locally or remotely.
- Having narrowed down the problem, look at the diagnostic evidence. At first, the errors in the event log didn’t seem to tell me much. Or did they? What about the version number of the faulting application? Does it match the version of the installed operating system? In my case the application log had an error message where the description read (in part): "Faulting application dcdiag.exe, version 5.2.3790.1830, faulting module ntdll.dll, version 5.2.3790.3959, fault address 0x0002caa2". So, ntdll.dll is the service pack 2 version (3959) and dcdiag.exe is at service pack 1 (1830) – i.e. not at the same service pack revision. If the event logs don’t give this much information, try looking at file version information in the file properties.
- Is an alternative version available? Google (or Windows Live Search, Yahoo!, Ask, etc.) is your friend. After downloading and installing the service pack 2 version of the Windows Server 2003 Support Tools, dcdiag.exe stopped crashing. Problem solved.
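
The version comparison in the event log step can be automated if you have a lot of crash entries to check. What follows is only a rough sketch in Python, not something I used on site: the function name compare_fault_versions, the regular expression and the SERVICE_PACKS lookup table are my own illustrative assumptions, and the table contains only the two revision numbers mentioned above. It pulls the two version strings out of a "Faulting application" description and flags a difference in the final revision number.

```python
import re

# Revision numbers taken from the event log message quoted above:
# 1830 is the service pack 1 revision, 3959 the service pack 2 revision.
# (Illustrative lookup table; extend it for other builds as needed.)
SERVICE_PACKS = {"1830": "SP1", "3959": "SP2"}

# Assumed message layout, based only on the excerpt quoted in this post.
FAULT_PATTERN = re.compile(
    r"Faulting application (?P<app>[^,\s]+), version (?P<app_ver>[\d.]+), "
    r"faulting module (?P<mod>[^,\s]+), version (?P<mod_ver>[\d.]+)"
)


def compare_fault_versions(description):
    """Return version details for a crash description, or None if it does not match."""
    match = FAULT_PATTERN.search(description)
    if match is None:
        return None
    # The last dotted component is the revision that changes with the service pack.
    app_rev = match.group("app_ver").split(".")[-1]
    mod_rev = match.group("mod_ver").split(".")[-1]
    return {
        "application": match.group("app"),
        "module": match.group("mod"),
        "application_level": SERVICE_PACKS.get(app_rev, app_rev),
        "module_level": SERVICE_PACKS.get(mod_rev, mod_rev),
        "mismatch": app_rev != mod_rev,
    }


message = (
    "Faulting application dcdiag.exe, version 5.2.3790.1830, "
    "faulting module ntdll.dll, version 5.2.3790.3959, "
    "fault address 0x0002caa2"
)
result = compare_fault_versions(message)
if result and result["mismatch"]:
    print(
        f"{result['application']} is at {result['application_level']} but "
        f"{result['module']} is at {result['module_level']}"
    )
```

For the message from my application log, this reports dcdiag.exe at SP1 and ntdll.dll at SP2, which is exactly the mismatch that pointed to the out-of-date Support Tools.
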
All it needed was a little logical thinking. Thanks to Richard and Alastair in Fujitsu Services’ Enterprise Support group – not just for the diagnosis but for reminding me how to solve problems.