Windows 2008 R2 DFS-R Troubleshooting Notes

Distributed File System (DFS) Replication can sometimes have problems replicating or recovering its database. The notes below are meant as a very basic guide to troubleshooting your DFS replication issues.

Event IDs

Event ID 4412: The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

This event ID indicates that DFS is operating properly. Don’t be deceived, however. If you’re having DFS replication problems, you may still see this event ID intermittently, just not as often as you should. Check the event logs on the other replication members to see how many of these entries to expect. If a server logs far fewer 4412 entries than its partners, that server is probably having replication issues. It may also take some time for these entries to return to their normal rate after the database recovers and you get event ID 2214.
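The comparison across members can be sketched in a few lines of Python. This is only an illustration: the server names and counts are made up, the counts are assumed to have been gathered by hand from each member's event log over the same time window, and the 50% cutoff is an arbitrary assumption, not a Microsoft guideline.

```python
from statistics import median


def lagging_members(counts, threshold=0.5):
    """Return members whose 4412 count is well below the group's median.

    counts: dict mapping server name -> number of 4412 events seen in
            the same time window on that server (collected manually).
    threshold: hypothetical cutoff; a member is flagged when its count
               is below threshold * median.
    """
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(name for name, n in counts.items() if n < threshold * baseline)


# Hypothetical counts for three replication members.
counts = {"FS01": 120, "FS02": 131, "FS03": 9}
print(lagging_members(counts))
```

A member that shows up in the result is worth a closer look, but a low count alone proves nothing; it only tells you where to start reading the logs.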

Event ID 2212: The DFS Replication service has detected an unexpected shutdown on volume C:. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. The service has automatically initiated a recovery process. The service will rebuild the database if it determines it cannot reliably recover. No user action is required.

Something happened to the server or the DFS Replication service. Either the server was forcibly shut down (i.e., shutdown /f was run from a command prompt), the power button was held long enough to force a cold boot, or the dfsrs.exe process was terminated in Task Manager. I’m sure there are other things that could cause this, but you get the idea. Now you have to wait for event ID 2214.

Event ID 2214: The DFS Replication service successfully recovered from an unexpected shutdown on volume C:. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. No user action is required.

Whew! Good news. The DFS database recovered and can now resume replication with the other members.
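The relationship between 2212 and 2214 boils down to a tiny state machine: a 2212 opens a recovery and the next 2214 closes it. A minimal Python sketch of that reasoning, assuming you have already pulled the event IDs for a single volume out of the DFS Replication log in chronological order (the function name and state labels are made up for illustration):

```python
def recovery_state(event_ids):
    """Classify the DFSR database state from event IDs, oldest first.

    Assumes the list has already been filtered to one server and one
    volume. Event IDs other than 2212 and 2214 (4412, for example)
    do not change the state.
    """
    state = "no shutdown recorded"
    for event_id in event_ids:
        if event_id == 2212:
            # Unexpected shutdown detected; recovery has started.
            state = "recovering"
        elif event_id == 2214:
            # Recovery finished; replication can resume.
            state = "recovered"
    return state


# A 2212 followed only by 4412 entries: still recovering, keep waiting.
print(recovery_state([4412, 2212, 4412, 4412]))
```

Note that a second 2212 after a 2214 puts the server back into "recovering", which is exactly what happens if the service is killed again mid-recovery.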

 

Dos and Don’ts

Do: Run chkdsk in read-only mode (that is, without /F or /R) on the DFS share volume. If chkdsk finds errors, run chkdsk /F or chkdsk /R to fix the disk. DFS uses the NTFS USN Journal to track changes, and chkdsk can determine if there are errors in the USN Journal that need to be repaired. These errors may cause DFS replication issues and prevent the database from recovering. Running chkdsk will not trigger DFS replication.

Do: Check the size of your staging quota. DFS shares can grow in size, especially if used with roaming profiles or any other storage repository containing dynamically changing files. Microsoft recommends having the staging quota set as close as possible to the actual size of the data being replicated to maintain replication performance. There are minimum staging quota size recommendations out there that don’t come close to what is needed for large DFS shares. At a minimum, set the staging quota to 25% of the size of the data being replicated.
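As a quick worked example of the 25% rule of thumb above, here is a small Python sketch. The function name is made up, the 25% figure is the minimum suggested in these notes rather than an official formula, and the result is in megabytes because that is the unit the staging quota is configured in.

```python
import math


def min_staging_quota_mb(replicated_bytes, fraction=0.25):
    """Minimum staging quota in MB for a replicated folder.

    replicated_bytes: total size of the replicated data, in bytes.
    fraction: share of the data size to reserve for staging; 0.25 is
              the rule of thumb from these notes.
    """
    return math.ceil(replicated_bytes * fraction / (1024 * 1024))


# A hypothetical 400 GB replicated folder.
size_bytes = 400 * 1024**3
print(min_staging_quota_mb(size_bytes))  # 102400 MB, i.e. 100 GB
```

Treat the result as a floor. If the disk space is available, moving closer to the full size of the replicated data follows the recommendation quoted above.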

Do: Determine if Remote Differential Compression (RDC) is necessary for your replication group. RDC is useful for smaller DFS share implementations and for larger file sizes. It is also recommended for replication between branches with slow WAN connections. If your DFS share is large, contains lots of small files, or replicates only between LAN servers, RDC may actually slow replication down.

Do: WAIT… yup, just wait. Depending on the size of data being replicated and the number of files, it can take DFS several hours, even a day or more, to recover its database. You may see some 4412 log entries, but until you see 2214 in the log the database is not fully recovered. Just leave everything (yes, everything!) alone and wait it out.

Don’t: Do not restart the server (forcibly or otherwise); do not end the dfsrs.exe process in Task Manager, even if the DFS Replication service remains in a “Stopping” state for an extended period of time; do not attempt to delete the server as a member of the DFS replication group; do not modify the server’s connections in the DFS Management console. Restarting the server or ending the dfsrs.exe process will only cause more 2212 warnings to be logged and make DFS start its recovery all over again. Deleting the server from the replication group and adding it back will also merely prompt the database to try to recover again. A brief review of the debug logs (located in C:\Windows\debug) will reveal journal wrap notices. These are normal while the database tries to recover. Just… don’t do anything with anything.

Don’t: Do not try to rebuild the replication group. Again, all you will be doing is restarting the database recovery process on a database that isn’t working to begin with.

Alternative: If DFS simply won’t recover, an alternative is to create a new replication group, remove the server from the old group, and join it to the new one. Yes, this starts the database rebuild process all over again, but it’s an entirely new database with new connection and recovery information. This prevents DFS from attempting the rebuild process on a possibly corrupt database.

 
