On a Windows Server 2012 R2 Hyper-V cluster I had an issue whereby backups were failing with VSS errors.
When running VSSAdmin List Writers I received the response: “Waiting for responses. These may be delayed if a shadow copy is being prepared.”
To resolve this I restarted the COM+ Event System service. Following this, VSSAdmin List Writers failed to return anything.
The COM+ Event System service has a number of dependent services, which are affected when it is restarted. I noticed that the COM+ System Application service hadn’t automatically restarted. I started the service, then restarted the services below. Following this, VSSAdmin List Writers returned the expected list of writers.
- EqualLogic VSS Requestor – Specific to my use of EqualLogic storage
- EqualLogic VSS Service – Specific to my use of EqualLogic storage
- Hyper-V Virtual Machine Management
- Microsoft Software Shadow Copy Provider
- Volume Shadow Copy
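The same sequence can be scripted. This is a minimal sketch rather than the exact steps I ran: it assumes the standard short service names (EventSystem, COMSysApp, vmms, swprv, VSS) for the display names listed above, and it matches the EqualLogic services by a display-name pattern because their short names aren't given here.

```powershell
# Restart COM+ Event System. -Force is needed because other services depend on it.
Restart-Service -Name EventSystem -Force

# COM+ System Application did not come back on its own, so start it explicitly.
Start-Service -Name COMSysApp

# EqualLogic services: matched by display name (assumed pattern; adjust to your install).
Get-Service -DisplayName 'EqualLogic VSS*' -ErrorAction SilentlyContinue |
    Restart-Service -Force

# Hyper-V Virtual Machine Management, Microsoft Software Shadow Copy Provider,
# and Volume Shadow Copy.
Restart-Service -Name vmms, swprv, VSS -Force

# Confirm that the writers are listed again.
vssadmin list writers
```

Run this from an elevated PowerShell session on the affected node.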
When trying to run the Optimize Hosts wizard within SCVMM 2012 R2 I received the error “Dynamic Optimization Cannot Be Performed At This Time” and “Object reference not set to an instance of an object”.
The Application Event Log on the SCVMM server contained a Windows Error Reporting event from the same time. Opening the event showed a link to the error log.
Opening the error log showed that the error was related to a logical network issue on the cluster. This cluster has a converged network switch to which all virtual machines (VMs) connect. However, two additional logical networks are mapped to the switch to enable the migration of VMs which were connected to logical networks of a different name on a legacy cluster.
I found that some VMs were connected to the “Hyper-V External Access” logical network, rather than the ConvergedNetworkSwitch. Changing the network mapping of the affected VMs to ConvergedNetworkSwitch enabled me to run the Dynamic Optimization wizard.
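To find VMs that are still mapped to one of the legacy logical networks, something like the following could be run from the VMM command shell. It is a sketch under assumptions: the server name vmm01 is hypothetical, and the property names (LogicalNetwork, VMNetwork, VirtualNetwork) are what I would expect on the adapter objects in VMM 2012 R2, so check them with Get-Member first.

```powershell
# Connect to the VMM server (hypothetical name).
Get-SCVMMServer -ComputerName 'vmm01' | Out-Null

# List every virtual network adapter with the networks it is mapped to.
Get-SCVirtualMachine | ForEach-Object {
    $vm = $_
    Get-SCVirtualNetworkAdapter -VM $vm |
        Select-Object @{ Name = 'VM'; Expression = { $vm.Name } },
                      LogicalNetwork, VMNetwork, VirtualNetwork
} | Sort-Object LogicalNetwork | Format-Table -AutoSize
```

Any VM whose logical network differs from the rest of the cluster is a candidate for remapping.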
Following an issue with the storage used by our Hyper-V cluster, one node in our five-node cluster became partially unresponsive. The virtual machines (VMs) running on the unresponsive node were automatically moved to other cluster nodes and service was resumed within a couple of minutes. At first everything appeared to be fine, but within a few minutes our monitoring system started to report connectivity issues to the VMs that had failed over.
I RDP’d onto one of the VMs that was having connectivity issues, but found the connection kept dropping out, so I connected to the console through System Center Virtual Machine Manager (SCVMM). I found I was unable to ping any server on the physical network. I took a look at the event log on one of the virtual hosts and saw the error below:
Port 'BF392932-9AE4-453A-8E13-26671BB556D9' was prevented from using MAC address '00-14-22-18-7F-DC' because it is pinned to port 'SCVMM-C26227E3-D6AB-4818-B8BF-4CCF923C'.
The error message implied that another VM was using the MAC address of the VM that was having connectivity issues. As the VM had a dynamic MAC address that was managed by SCVMM, I knew that couldn’t be the case. I decided to reboot the unresponsive cluster node. After waiting 30 minutes for the node to shut down, I killed the power via a DRAC. As soon as I killed the power to the node, the MAC address errors in the event log disappeared and all the VMs resumed normal connectivity. I believe the unresponsive cluster node was keeping some kind of lock on the MAC addresses of the VMs that had been running on it when it became unresponsive. Killing the power to the node freed the locks, enabling connectivity to resume.
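To check how widespread the MAC address errors are, the event log on a host can be searched for the message text. This is a rough sketch that assumes the events land in the System log; substitute whichever log you saw the error in.

```powershell
# Search recent events for the "pinned MAC address" message shown above.
# Assumption: the event is written to the System log on the Hyper-V host.
Get-WinEvent -LogName System -MaxEvents 5000 |
    Where-Object { $_.Message -like '*was prevented from using MAC address*' } |
    Select-Object TimeCreated, Id, ProviderName, Message |
    Format-List
```

Running it again after the unresponsive node has been powered off should show that no new events are being logged.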
Our five-node Hyper-V cluster is connected to a Dell MD3000i, which provides virtual machine storage using Cluster Shared Volumes (CSV). The MD3000i has dual storage controllers for redundancy, but recently both storage controllers rebooted within a minute of each other. Looking at the Storage section of Failover Cluster Manager showed that half the CSVs had a status of Redirected Access. This blog post, http://blogs.technet.com/b/askcore/archive/2010/12/16/troubleshooting-redirected-access-on-a-cluster-shared-volume-csv.aspx, suggested that the first thing to try was to “Turn off redirected access for this Cluster shared volume”. Unfortunately, this didn’t work. I looked in Disk Management on each of the five nodes, which looked as below:
Each node should be able to see all the disks, but only the first node, the one on the far left, could see all five disks. I live migrated the virtual machines off each node and rebooted each node one at a time. Once completed, Disk Management looked like this:
Every node could now see all five disks. I checked Failover Cluster Manager and all the CSVs had returned to Online status automatically.
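The Redirected Access status can also be checked from PowerShell instead of Failover Cluster Manager. This is a minimal sketch, assuming the FailoverClusters module is available on the node and that each CSV exposes its volume details through the SharedVolumeInfo property.

```powershell
Import-Module FailoverClusters

# Show each CSV with its owner, state, mount path, and redirected access flag.
Get-ClusterSharedVolume | ForEach-Object {
    $csv = $_
    $csv.SharedVolumeInfo |
        Select-Object @{ Name = 'Name';  Expression = { $csv.Name } },
                      @{ Name = 'Owner'; Expression = { $csv.OwnerNode } },
                      @{ Name = 'State'; Expression = { $csv.State } },
                      FriendlyVolumeName, RedirectedAccess
} | Format-Table -AutoSize
```

Running it before and after each node reboot shows whether the volumes are leaving redirected access.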
Following a power down of our Hyper-V cluster, about half the virtual machines (VMs) would not start. Looking in Failover Cluster Manager showed the VMs couldn’t start because the machine files were missing. I looked at the Storage summary, but the cluster shared volumes were present and had a status of online:
This was odd: Failover Cluster Manager was saying the VM files were missing, but the disks were online. I expanded the cluster shared volumes (CSV):
If you look at the CSV paths, you’ll see the path for Cluster Disk 1 is C:\ClusterStorage\Volume1 and the path for Cluster Disk 3 is C:\ClusterStorage\. The machines wouldn’t start because the Cluster Disk 3 path was incorrect. It should have been C:\ClusterStorage\Volume2.
I right-clicked on Cluster Disk 3, chose “Move this shared volume to another node” and selected one of the other cluster nodes. When the move had completed, the path was correct and I was able to start the VMs.
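The same check and move can be done from PowerShell. This is a sketch, not what I ran at the time: the node name HV-NODE2 is hypothetical, and it assumes the CSV path is exposed as FriendlyVolumeName under SharedVolumeInfo.

```powershell
Import-Module FailoverClusters

# Show the mount path currently reported for the problem disk.
(Get-ClusterSharedVolume -Name 'Cluster Disk 3').SharedVolumeInfo |
    Select-Object FriendlyVolumeName

# Move ownership of the CSV to another node (hypothetical node name).
Move-ClusterSharedVolume -Name 'Cluster Disk 3' -Node 'HV-NODE2'

# Check the path again; it should now end in Volume2.
(Get-ClusterSharedVolume -Name 'Cluster Disk 3').SharedVolumeInfo |
    Select-Object FriendlyVolumeName
```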
You must import the FailoverClusters module to use the cluster cmdlets in PowerShell. Open PowerShell, then type: Import-Module FailoverClusters
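As a quick illustration, the module can be imported and its cmdlets listed to confirm it loaded. This assumes the Failover Clustering feature (or its management tools) is installed on the machine.

```powershell
# Load the cluster cmdlets into the current session.
Import-Module FailoverClusters

# List the cmdlets the module provides, to confirm it loaded.
Get-Command -Module FailoverClusters | Sort-Object Name
```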
Two virtual disks failed on the storage used by our Windows Server 2008 R2 Hyper-V cluster. I wasn’t able to recover the virtual disks, so I needed to remove them from the cluster. I had to use a different method for each disk.
The cluster uses Cluster Shared Volumes, so using Failover Cluster Manager, I opened Cluster Shared Volumes, right-clicked on the first disk and selected “Remove from Cluster Shared Volumes” (this option was only available while the cluster was attempting to bring the disk online). The disk was then listed as Failed under Available Storage, in the Storage section of Failover Cluster Manager. I was then able to right-click on the disk and select Delete.
The second disk was continuously listed as Failed in Cluster Shared Volumes, so I wasn’t able to select the “Remove from Cluster Shared Volumes” option. I had to remove this disk from Cluster Shared Volumes using PowerShell. I opened PowerShell and imported the Failover Cluster module by typing Import-Module FailoverClusters. Next, I typed Remove-ClusterSharedVolume "Cluster Disk 4" to remove the failed disk from Cluster Shared Volumes. The disk was then listed as Failed under Available Storage, in the Storage section of Failover Cluster Manager. I then typed Remove-ClusterResource "Cluster Disk 4" to remove the disk from the cluster.
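Put together, the PowerShell steps for the second disk look roughly like this. The two Get- calls are extra checks I've added for illustration; the disk name is the one from my cluster.

```powershell
Import-Module FailoverClusters

$disk = 'Cluster Disk 4'

# Optional check: confirm the CSV and its state before removing it.
Get-ClusterSharedVolume -Name $disk | Format-List Name, State, OwnerNode

# Remove the failed disk from Cluster Shared Volumes; it moves to Available Storage.
Remove-ClusterSharedVolume -Name $disk

# Optional check: the disk should now appear as an ordinary cluster resource.
Get-ClusterResource -Name $disk

# Remove the disk from the cluster. This prompts for confirmation unless -Force is added.
Remove-ClusterResource -Name $disk
```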