Exchange Server best practice and preventative maintenance

Until fairly recently, Exchange was my main area of technical expertise, but since I joined Conchango, I’ve been working in other areas and my Exchange skills have become a little rusty. That was until a couple of nights back, when I attended a Microsoft TechNet UK event, where Paul Bowden (Exchange Product Manager) demonstrated the Microsoft Exchange Server Best Practices Analyzer tool (ExBPA) before Brett Johnson (one of Microsoft’s escalation engineers in the UK) talked about best practices of Exchange Server preventative maintenance.

Microsoft Exchange Server Best Practices Analyzer tool

Analysis of support incidents logged with Microsoft has shown that only 0.3% result in the generation of a hotfix and 60% are configuration errors. The ExBPA is a tool which analyses Exchange Server for the top configuration issues in a manner which is a hybrid of a proactive health check and reactive diagnosis.

ExBPA was not the first best practice analyser from Microsoft – that was the Microsoft SQL Server Best Practices Analyzer (SQLBPA), launched in May 2004 – and BPAs will eventually be produced for all Microsoft products within the Windows Server System.

The design principles used for the creation of ExBPA were:

Concentrate on performance, scalability and availability – whilst the ExBPA does examine some security mis-configurations, e.g. open relays or too many administrators, it does not look for the latest patch levels – the Microsoft Baseline Security Analyzer (MBSA) performs that function.
Make it easy to run – previous tools were not particularly easy to set up and ExBPA is designed on a 3-click principle (from startup to scan).
Don’t leave me hanging – i.e. don’t just provide a strange message and a link to a Microsoft knowledge base article – provide some useful information in relation to the tool’s findings.
Keep it up-to-date – ExBPA automatically downloads its web update packs, which are published every two weeks.
Work in all environments – ExBPA works from single server Microsoft Small Business Server implementations right through to enterprise Exchange Server deployments.

The ExBPA can be run against all versions of Exchange Server, although for versions prior to Exchange Server 2000 it does require that Active Directory and at least one Exchange 2000 or 2003 server are available. The tool is implemented in Visual C#, with an XML input/output data model and an XPath analysis engine. There are no server components and the tool is generally run from a Windows XP computer, collecting the data remotely. More architecture information is available in the ExBPA overview on the Microsoft website.

ExBPA is not a monitoring tool – that is Microsoft Operations Manager Server (MOM), for which there is an Exchange Server 2003 Management Pack. ExBPA provides a snapshot in time, looking for data in:

Active Directory.
DNS.
WINS.
Registry (there are over 1200 registry parameters for Exchange Server 2003).
IIS metabase.
Performance monitor.
Files on disk.
TCP/IP ports.

The first pass is a data collection and a subsequent pass is made on this for analysis against defined rules.
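The collect-then-analyse approach can be illustrated with a minimal Python sketch. ExBPA itself is written in C#, and the XML element names, attribute names and rules below are invented purely for illustration; they are not the real ExBPA schema or rule set. The sketch simply shows XPath-style rules being evaluated against previously collected XML data, with advice varying by server role.

```python
import xml.etree.ElementTree as ET

# Hypothetical collected data: element and attribute names are invented for
# illustration and do not reflect the real ExBPA output format.
COLLECTED = """
<scan>
  <server name="EX01" role="mailbox">
    <setting name="CircularLogging" value="on"/>
    <setting name="DatabaseVolumeCompressed" value="false"/>
  </server>
  <server name="EX02" role="bridgehead">
    <setting name="CircularLogging" value="on"/>
    <setting name="DatabaseVolumeCompressed" value="true"/>
  </server>
</scan>
"""

# Each hypothetical rule: (severity, role it applies to or None for any role,
# XPath evaluated relative to a <server> element, message).
RULES = [
    ("Error", "mailbox",
     "./setting[@name='CircularLogging'][@value='on']",
     "circular logging is enabled on a mailbox server"),
    ("Error", None,
     "./setting[@name='DatabaseVolumeCompressed'][@value='true']",
     "database files are on a compressed volume"),
]


def analyse(xml_text, rules=RULES):
    """Second pass: evaluate every rule against the collected XML."""
    root = ET.fromstring(xml_text)
    findings = []
    for server in root.findall("./server"):
        for severity, role, xpath, message in rules:
            applies = role is None or role == server.get("role")
            if applies and server.find(xpath) is not None:
                findings.append((severity, server.get("name"), message))
    return findings


for severity, name, message in analyse(COLLECTED):
    print(f"{severity}: {name}: {message}")
```

Because the rules are data rather than code, new checks could be shipped as an updated rule file without changing the engine, which is consistent with the web update pack approach described above.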

ExBPA understands a number of Exchange Server roles:

Small mailbox servers.
Large mailbox servers.
Clustered servers.
Front-end servers.
Bridgehead servers.

Advice is adjusted accordingly (e.g. circular logging off for mailbox servers but on for a bridgehead) and ExBPA reports on a variety of rule types:

Errors (something is causing, or is likely to cause a problem).
Warnings (something looks suspicious).
Non-default (something has been changed).
Time (something has changed within the last 5 days).
Information (something of interest about the environment).
Best practice (in ExBPA v2.0, due for release within the next 3 weeks).

When running, the ExBPA will automatically detect the closest global catalog (GC) server and the credentials of the currently logged-on user (although these can be modified if required). The type of scan can be set to one of three options (health check, connectivity test or baseline) and the network speed must be set, both to provide an estimate of the time left to run and to set appropriate thresholds for timeouts, etc. Once the elements of the organisation to be scanned are selected, the ExBPA will run (it is multi-threaded, using up to 25 threads). Following a successful analysis, a number of reports are available:

Critical issues list.
Full issues list.
Non-default settings.
Recently-changed settings.
Baseline.
Items of interest.
Summary view.
Detailed view.
Disabled items list.
Run time log.

Some of the success stories from using the ExBPA to identify issues include:

Incorrectly configured DNS server address causing poor performance (even with secondary and tertiary addresses in place, Exchange will always try to contact the primary DNS server first – if that is down, or the IP address is not correct, then that means that every lookup request will first be tried against an invalid entry, before the secondary DNS entry is attempted).
Poor performance due to placement of database files on compressed disk volumes (even though they were on a high performance SAN).
Circular logging enabled on a 12,000 user Exchange cluster (had been enabled prior to migration from the old servers to prevent excessive log generation, but was not disabled afterwards).
Incorrect memory configuration generating 9582 Event ID errors, leading to a server restart every two weeks.

ExBPA v1.0 was released in September 2004, with 1200 points collected using 800 rules. ExBPA v1.1 followed in December 2004, with some usability improvements and 1300 points using 900 rules. ExBPA v2.0 is due for release in March 2005 and will add:

Localisation for all languages in which Exchange Server is available.
Performance sampling and root cause analysis (how close to the limit is the server).
Administrative API support (when was the last backup).
Operational integration with MOM 2005.
Export in XML, HTML or CSV format.
New baseline logic.

ExBPA v3.0 is already being planned for release later in 2005, with new features including more rules and refinements, and a MAPI.NET collector.

The web update pack for the ExBPA is a 650KB XML file and just some of the elements that the ExBPA checks today are:

Active Directory (forest) – functionality level, Exchange schema extensions, default policy changes.
Active Directory (domain) – functionality level, renamed domains, FSMO availability, renamed/deleted/moved Exchange system containers and/or groups.
Active Directory Connector – state of the connector (overloaded/idle/newer version available/service pack level), connection agreements (orphaned/set never to run/missing server/one-way/out-of-date).
Exchange organisation – message size limits enforced, stray Exchange objects in LostAndFound, more than 10 Exchange administrators, ForestPrep version, mixed/native mode, Outlook Mobile Access (OMA) options, Exchange Archive Solution (EAS) options, unsolicited commercial e-mail (UCE) thresholds, recipient update service definitions, address list and offline address book (OAB) definitions.
Exchange administration groups – validity of legacyExchangeDN, policy containers intact.
Exchange routing groups – valid routing master, enumerate all connectors, recently changed connectors.
Exchange server – server name validity, fully qualified domain name (FQDN) and NetBIOS name resolution, service pack/rollup level, time synchronisation with Active Directory.
Cluster configuration (active and passive) – number of nodes, configuration discrepancies, temporary paths, quorum configuration, heartbeat configuration, DNS/WINS configuration, enumerates all resources and parameters, kerberos configuration.
Directory access – cache configuration and non-default parameters, cache efficiency, round trip times between Exchange and each domain controller (DC)/GC, hardware configuration of each DC/GC, GC to Exchange processor ratio.
Information store – extensible storage engine (ESE) cache configuration, virtual memory state, online maintenance window, checkpoint depth, circular logging state, log buffer configuration, log generation level, file system characteristics (compression/encryption), validity of legacyExchangeDN, database and logs on the same disk, content indexing state, non-default parameters in private/public GUID registry, database size, e-mail address on public folder stores, remote procedure call (RPC) compression/buffer packing settings, hard-coded TCP/IP ports and clashes with other Exchange ports, non-default and bad store process parameters.
Transport – main configuration parameters within Active Directory, cross-check of AD and metabase for consistency, non-default settings, file system characteristics (compression/encryption) for mailroot folders, SMTP stack verb validation, SMTP mail submission test, enumeration of transport event sinks, enumeration of MTA settings, detection of archive sink and configuration, non-default routing parameters.
System Attendant – service state, file system characteristics (compression/encryption) for message tracking folders, request for response (RFR) service, RFR/name service provider interface (NSPI) target server configuration, hard-coded TCP/IP ports.
Anti-virus support – product detection, configuration and patch level (product dependent).
Other installed applications – RPC client/server binding order, presence of LeakDiag, old versions of Simpler-Webb Exchange Resource Manager, ISA 2000 service pack level, presence of MOM agent.
Hardware configuration – BIOS less than one year old, processor configuration, physical memory installed, specific support for Dell, HP and IBM servers.
Disks – performance counters enabled, enumeration of physical/logical disks, enumeration of mount points, enumeration of disk controllers and driver levels, host bus adaptor (HBA) configuration, SAN multi-pathing software version.
File versions – verify 29 key Exchange binaries (present/not too old/hotfixes), check MAPI subsystem/presence of old rollups/presence of ESE API virus scanners.
Hotfixes – detect all installed hotfixes for Exchange Server 5.5/2000/2003 and Windows 2000 Server/Windows Server 2003, identify any updates installed within the last 5 days and the logon name of the user that performed the installation.
Network – enumeration of all network cards and check NIC connection status, DNS/WINS configuration, IP gateway settings, primary DNS and domain suffix.
Operating system – page table entry (PTE) levels, paged/non-paged pool configuration, CrashOnAuditFail configuration, HeapDeCommitFreeBlock threshold, temporary paths, SystemPages configuration, /3GB and /USERVA configuration, physical address extensions (PAE), version and SKU (i.e. Standard, Enterprise, etc.), Dr. Watson configuration, debug settings, Virtual PC/Virtual Server/VMware detection.

This is not an exhaustive list and changes with each web update pack.

Preventative Maintenance

As mentioned previously, 60% of Exchange incidents reported to Microsoft are traced back to people or process, and not the technology itself. Additionally, mission critical software needs to run on good hardware, as the reliability of the system is only as good as the reliability of each of its components. Microsoft claims that, of the 90% of Exchange support incidents that are performance related, 50% are due to hardware issues.

Microsoft also claims that 90% of Exchange administrators do not carry out any maintenance until disaster strikes. The cause of this has been identified as a number of areas, including:

Low understanding of the issues and problems.
No time, resource, or budget to address maintenance tasks.
Non-availability of test equipment (and high impact of testing in a production environment).
Assumption that the risk of pro-activity outweighs the risk of doing nothing.
Technically capable, but see preventative maintenance as boring and time consuming.

When configuring Exchange Server, there are a number of items to consider, discussed in the following paragraphs:

In general, hardware should be selected from the Microsoft Windows Server Catalog and configured in a consistent manner, using high-quality components, the same (recent) firmware and driver levels.

Error correcting code (ECC) memory should be used, and there is little return on investment above 4GB with Exchange Server.

Microsoft recommends the following disk layout, which separates transaction logs, data, and queues onto separate spindles for reasons of performance and data recovery:

[Figure: Recommended disk configuration for Exchange Server]

RAID 0 (disk striping) provides fast read and write times, with RAID 1 (disk mirroring) adding redundancy to form RAID 0 + 1. As an alternative to RAID 0 + 1, RAID 5 (disk striping with parity) may be used (requiring fewer disks), but this configuration is slower to write due to the need to write the parity data.

Disk caching should be disabled (to avoid database corruptions where a transaction may not be successfully written to the disk) and hardware RAID employed (software RAID is too resource intensive). Servers should be specified with enough free disk space to allow database maintenance to be performed (ideally 110% of the database file size) and disk compression should never be used on an Exchange server due to the effect on performance. The Microsoft JetStress and LoadSim tools should be used to test that the server is capable of providing the required performance levels.

The Windows Server operating system should be consistent, both in version and configuration. The maximum log size value for the event logs should be at least 16MB and set to overwrite as needed (to allow a reasonable amount of diagnostic information to be captured, but to avoid full log files). Dr Watson should be the default debugger (to allow capture of user dump information) – this would normally be the case but may not be if some development environments are installed on the same computer as Exchange Server (not recommended). Recovery options should be set and the /3GB switch selected in boot.ini if more than 1GB of RAM is installed (this provides a different memory split between the application and the kernel).

There should be at least two domain controllers for each domain, with at least one GC processor for each 4 Exchange processors (assuming all processors are of a roughly equivalent specification).
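As a worked example of the 1:4 ratio, the following Python sketch rounds up to the next whole GC processor. The function name and the rounding behaviour are my own assumptions; the ratio itself is the one quoted above, and it only holds if the processors are of roughly equivalent specification.

```python
import math


def min_gc_processors(exchange_processors, ratio=4):
    """Minimum GC processors for a given number of Exchange processors.

    Assumes one GC processor per `ratio` Exchange processors, rounded up.
    """
    if exchange_processors < 0:
        raise ValueError("exchange_processors must not be negative")
    return math.ceil(exchange_processors / ratio)


# Four dual-processor Exchange servers need at least two GC processors.
print(min_gc_processors(8))
```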

Exchange Servers should be configured with circular logging disabled (for most server configurations), a staggered information store maintenance window, and mailbox quotas configured (so the maximum database size is a known value). Permissions should be set by group (not user), a solid naming convention employed for all objects and the administrative notes fields should be used (incidentally, the use of these is a good way to check that the SRS is working where Exchange Server 5.5 servers are in use).

Microsoft has two core cluster configurations for its own Exchange servers:

Enterprise datacentres use a 7-node cluster with 4 active Exchange Server 2003 nodes, 1 passive Exchange Server 2003 node and 2 Windows Server 2003 (non-Exchange Server) nodes (for local backups).
Regional datacentres use a 5-node cluster with 3 active Exchange Server 2003 nodes, 1 passive Exchange Server 2003 node and 1 Windows Server 2003 node.

To reduce the number of drive letters used, mount points are employed, for example on Exchange virtual server 1:

This configuration has allowed Microsoft to perform a massive server consolidation exercise, removing 2 regional datacentres, 175 servers and 55 physical sites from the Exchange organisation, whilst doubling mailbox quotas.

When preparing for a Microsoft Exchange implementation there are a number of considerations:

A server configuration log can be used to enforce consistency and provide information for support staff. It should include firmware and BIOS revisions, installed software and version information, service packs, hotfixes, hardware, services, network configuration, repair and recovery information.
An operations log should be created.
Test accounts should be used and a production test server acquired.
Operations should be considered through the up-front planning of a maintenance window for patch management and other maintenance processes, generation of a backup/restoration plan and standardised tape formats, and finally generation of a recovery and troubleshooting plan with contingency, oversized storage (for database maintenance), spare parts available locally for immediate replacement and periodic server recovery drills.

Once Exchange has been deployed, the maintenance process comes into play.

Daily tasks:

Check event logs (and act on them).
Check backup logs.
Monitor performance.
Check disk space.
Check the badmail folders and queues.
Check for updates.
Test mail flow.
Back up the server.

Weekly tasks:

Compare the server against a baseline configuration.
Verify backed-up data with a restore (to a recovery server).

Monthly tasks:

ESEUTIL file dump.
ESEUTIL integrity check.
ISINTEG all tests default mode.

Ad-hoc tasks:

ESEUTIL defragmentation (every 12 months or after a large data move).
Full disaster recovery test.

Daily online backups should be used, even if the volume shadow copy service (VSS) is used. Online backups check database integrity through checksum verification and full online backups purge transaction logs at the conclusion of the backup. Even whilst the backup is taking place, users can still access mailboxes and public folders.

When monitoring Exchange, compare the recorded counters with a baseline and pay particular attention to:

Database\Log Record stalls/sec – average should be below 10 per second and maximum values should not be higher than 100 per second (indicates the number of log records that cannot be written because the buffers are full – note that Exchange Server 2000 defaults to 84 buffers whilst Exchange Server 2003 defaults to 512).
Database\Log Threads Waiting – average should be below 10 (indicates the number of threads waiting to complete an update to the database by writing their data to the log – if too high, the log may be a bottleneck).
MSExchangeIS\RPC Requests – should be below 30 at all times (indicates the number of MAPI requests being serviced by the Microsoft Exchange Information Store service – the default maximum is 100).
MSExchangeIS\RPC Average Latency – should be below 50ms at all times and should be in the 10-25ms range on a healthy server (averaged over the last 1024 packets and affects how long it takes for a user’s view to change in Outlook).
MSExchangeIS\RPC Operations/sec – should rise and fall with MSExchangeIS\RPC Requests (indicates how many RPC operations are being requested and actually responded to).
MSExchangeIS\Virus Scan Queue Length – if this is consistently high consider a hardware upgrade (indicates the number of outstanding requests queued for virus scanning).
MSExchangeIS Mailbox\Active Client Logons – this is server-specific but should be baselined and monitored (indicates the number of clients which performed any action within the last 10 minutes).
Paging File\% Usage – should remain below 50% – high values indicate that the paging file size should be increased or more RAM added to the server (indicates the amount of the paging file used).
Memory\Available MBytes – at least 50MB should be available at all times (indicates the amount of physical memory immediately available to a process).
Memory\Pages/sec – below 1000 at all times (indicates the rate at which pages are written to disk to resolve hard page faults).
Memory\Pool Nonpaged Bytes – no more than 100MB (indicates the amount of memory used for kernel objects which must remain in memory and cannot be written to disk).
Memory\Pool Paged Bytes – no more than 180MB, unless a backup or restoration is taking place (indicates the amount of memory used for kernel objects which can be written to disk when they are not in use).
Physical Disk\Avg. Disk sec/Read – average below 20ms and maximum below 100ms for the database volume, average below 5ms and maximum below 50ms for the transaction log volume, average below 10ms and maximum below 50ms for the SMTP queue volume (indicates the average time to read data from the disk).
Physical Disk\Avg. Disk sec/Write – average below 20ms and maximum below 100ms for the database volume, average below 10ms and maximum below 50ms for the transaction log volume, average below 10ms and maximum below 50ms for the SMTP queue volume (indicates the average time to write data to the disk).

Consider implementing management tools such as MOM to monitor these counters.
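For anyone without such a tool, the comparison of recorded counters against thresholds can be sketched in a few lines of Python. This is only an illustration: it covers a handful of the counters listed above, assumes the samples have already been exported from Performance Monitor into lists of numbers, and (as a simplification of mine) flags a value that is equal to a limit as well as one that exceeds it.

```python
from statistics import mean

# (average limit, maximum limit) taken from the guidance above;
# None means no limit of that kind was quoted for the counter.
THRESHOLDS = {
    r"Database\Log Record stalls/sec": (10, 100),
    r"Database\Log Threads Waiting": (10, None),
    r"MSExchangeIS\RPC Requests": (None, 30),
    r"MSExchangeIS\RPC Average Latency": (None, 50),
    r"Memory\Pages/sec": (None, 1000),
}


def find_breaches(samples, thresholds=THRESHOLDS):
    """Return (counter, kind, observed value) for each limit reached."""
    breaches = []
    for counter, values in samples.items():
        if counter not in thresholds or not values:
            continue  # unknown counter or nothing recorded
        avg_limit, max_limit = thresholds[counter]
        if avg_limit is not None and mean(values) >= avg_limit:
            breaches.append((counter, "average", mean(values)))
        if max_limit is not None and max(values) >= max_limit:
            breaches.append((counter, "maximum", max(values)))
    return breaches


# Hypothetical samples for two counters.
recorded = {
    r"Database\Log Threads Waiting": [2, 4, 30],
    r"MSExchangeIS\RPC Requests": [5, 12, 8],
}
for counter, kind, value in find_breaches(recorded):
    print(f"{counter}: {kind} of {value} reaches the limit")
```

Fixed limits like these are only a starting point; comparing against the server's own baseline remains the more useful check for server-specific counters such as Active Client Logons.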

ESEUTIL is the Exchange server database utility. Full syntax may be obtained by typing ESEUTIL /? at a command prompt but for maintenance purposes, there are four main options of interest:

Offline defragmentation (/d).
Integrity (/g).
File dump (/m).
Copy file (/y).

Because of the potential to cause damage with ESEUTIL, operations should normally be performed with restored data on a non-production server.
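To make the four options concrete, the Python sketch below builds (but deliberately does not run) an ESEUTIL command line for each one. The helper name and the restored_copy safeguard are hypothetical, and only the bare switches listed above are modelled; in practice each mode accepts further modifiers and options, so check ESEUTIL /? before relying on any of this.

```python
# Switches as listed above; further per-mode options are not modelled.
ESEUTIL_SWITCHES = {
    "defragment": "/d",
    "integrity": "/g",
    "file_dump": "/m",
    "copy_file": "/y",
}


def eseutil_command(operation, database_path, restored_copy=False):
    """Build an argument list for ESEUTIL without executing it.

    `restored_copy` is a hypothetical safeguard: the caller must confirm
    that the database is a restored copy on a non-production server.
    """
    if operation not in ESEUTIL_SWITCHES:
        raise KeyError(f"unknown operation: {operation}")
    if not restored_copy:
        raise ValueError("confirm the database is a restored copy first")
    return ["eseutil", ESEUTIL_SWITCHES[operation], database_path]


print(eseutil_command("integrity", r"D:\restore\priv1.edb", restored_copy=True))
```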

Offline defragmentation may be necessary if a large number of mailboxes have been deleted (e.g. following a migration, or if there is a high staff turnover), or following a hard database repair (ESEUTIL /p). It is only recommended if at least 30% of the space taken by the database will be recovered (Event ID 1221 in the application log after an online defragmentation will give a conservative estimate as to how much free space is in the database).
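The 30% guideline is simple arithmetic, sketched here in Python. The function name is my own, and the free space figure is assumed to have been read manually from the Event ID 1221 entry.

```python
def defrag_worthwhile(database_size_mb, free_space_mb, minimum_fraction=0.30):
    """True if the reported free space is at least 30% of the database size."""
    if database_size_mb <= 0:
        raise ValueError("database_size_mb must be positive")
    return free_space_mb / database_size_mb >= minimum_fraction


# A 40,000MB database reporting 13,000MB free is 32.5% recoverable.
print(defrag_worthwhile(40000, 13000))
```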

Unless a temporary path is specified as an option, offline defragmentation requires free disk space of at least 110% of the database size to be available, and requires the streaming database to reside on the same path.
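A quick way to test the 110% requirement before starting is sketched below in Python, using the standard library's shutil.disk_usage to read the free space on a volume. The function names are hypothetical, and the sketch assumes the database size and the volume to be used for the working copy are already known.

```python
import shutil


def required_free_bytes(database_size_bytes, percent=110):
    """Free space needed for an offline defragmentation, rounded up."""
    return (database_size_bytes * percent + 99) // 100


def has_defrag_headroom(database_size_bytes, working_path, percent=110):
    """True if the volume holding `working_path` has enough free space."""
    free = shutil.disk_usage(working_path).free
    return free >= required_free_bytes(database_size_bytes, percent)


# A database of 20,000,000,000 bytes needs 22,000,000,000 bytes free.
print(required_free_bytes(20_000_000_000))
```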

An integrity check may be necessary to perform a dry run of the repair function – i.e. to validate the checksum for each 4KB page in the database. Problems that a repair would address are written to a database.integ.raw file which logs all pages in the database, not just those with problems. An integrity check may abort prematurely if problems are of such a nature that a repair is required before some parts of the database can be checked but this does not necessarily mean that a repair would fail. Unless options are specified, an integrity check requires 20% free space.

A file dump allows the viewing of the header information for database, streaming database, checkpoint, online backup patch or transaction log files. The header information can be used to validate that a series of transaction log files forms a matched set and that all files are undamaged, to view space allocation inside the databases, or to view metadata for one or more tables within the database file. An example use of this would be to read the state of an unmounted store (i.e. clean or dirty), to provide some diagnosis as to why the store stopped, prior to mounting the store (which would attempt a soft recovery).

If a database repair is required, this is a last resort, which will strip out orphaned database pages, possibly resulting in data loss. Multiple runs may be required until the entire database is repaired.

A copy file operation simply provides a quick method of copying databases between servers.

ISINTEG is a utility to search an offline information store for integrity weaknesses. Unlike ESEUTIL (which focuses on the physical database), ISINTEG is concerned with the logical structure. It has two modes:

Default mode – in which the tool runs the specified tests and reports its findings.
Fix mode – where options are specified to run tests and attempt a fix where possible.

For maintenance work, default mode is used. Unless addressing a particular issue in the database, the alltests option is typically the most effective course to follow.

In order to run ISINTEG, the Microsoft Exchange Information Store service must be started but the database to be checked dismounted. ISINTEG can be run against remote servers, but not against raw database files or backups.
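As a rough illustration of the two modes, this Python sketch assembles an ISINTEG command line without running it. It assumes the commonly documented form of isinteg -s servername [-fix] -test testname; the helper name is my own, and the exact syntax should be confirmed against the tool's own help output for the Exchange version in use.

```python
def isinteg_command(server, tests=("alltests",), fix=False):
    """Build an ISINTEG argument list; default mode unless fix=True."""
    if not tests:
        raise ValueError("at least one test name is required")
    command = ["isinteg", "-s", server]
    if fix:
        command.append("-fix")  # fix mode: attempt repairs where possible
    command += ["-test", ",".join(tests)]
    return command


# Default mode with all tests, as suggested for routine maintenance.
print(" ".join(isinteg_command("EX01")))
```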