
File Level Restore - Select Destination Is Empty

In a few cases, when you try to perform a File Level Restore on a Windows server, the "Select Destination" window is empty.


This can happen for a few reasons. If this is happening on ESXi 6.0, it implies that the vSphere Guest API is being used instead of the VIX API. We need to modify the config.xml to use the VIX API only.

Make sure the values in the config.xml files are what is expected. Run the below commands:

# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/config.xml
# egrep "vmmgrflags|mountmgr" /usr/local/avamarclient/bin/MountPoint/config.xml

The above two commands should display the below output:

<vmmgrflags>2</vmmgrflags>
<mountmgrflags>3221225476</mountmgrflags>

If you see this, skip to the "Other possible solutions" section. If not, edit these two files and make sure the values reflect the output above.
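If you prefer not to hand-edit the XML, sed can make the change. This is a sketch (it assumes the flags elements are on their own lines as shown above), so back up both files first:

# cp /usr/local/avamarclient/bin/config.xml /usr/local/avamarclient/bin/config.xml.bak
# sed -i 's|<vmmgrflags>.*</vmmgrflags>|<vmmgrflags>2</vmmgrflags>|' /usr/local/avamarclient/bin/config.xml
# sed -i 's|<mountmgrflags>.*</mountmgrflags>|<mountmgrflags>3221225476</mountmgrflags>|' /usr/local/avamarclient/bin/config.xml

Repeat the same two sed commands against /usr/local/avamarclient/bin/MountPoint/config.xml.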

Then restart the FLR service using:
# service vmwareflr restart

Then re-try the FLR operation. 

Other possible solutions:
In the most recent VDP releases, this value is already set to use the VIX API only, yet you might still run into the same issue. In that case, the likely causes are:

> Outdated VMware Tools on the Windows machine. Update VMware Tools, restart the VM, and then re-attempt the FLR task.
> The FLR URL is missing from the browser's trusted sites list. Add it and retry.

If the issue still persists, open a case with VMware Support.

Hope this helps!


VDP Restore Fails: Virtual Machine Must Be Powered Off To Restore

There are a few cases where an in-place restore of a virtual machine fails, complaining that the virtual machine is not powered off. The virtual machine is indeed powered off; however, VDP does not recognize this. You will see something like:


There might be two possible causes.

1. The MCS cache might not be updated with the power state.
Run the below command:
# mccli vmcache show --name=/vc-domain-name/VirtualMachines/<vm-name> | grep -i "power status"

Example command:
# mccli vmcache show --name=/cartman.southpark.local/VirtualMachines/Test-1 | grep -i "power status"

The output you should ideally see on a powered off VM is:
Power Status         poweredOff

In the above case, you might see:
Power Status         poweredOn

If this is the issue, then update the MCS VM cache by issuing the below command:
# mccli vmcache sync --name=/vc-domain-name/VirtualMachines/<vm-name>

Example command:
# mccli vmcache sync --name=/cartman.southpark.local/VirtualMachines/Test-1

Then the power state should be updated and the in-place restore should work.

2. The MCS cache might be updated, but Tomcat is out of sync with MCS.

In many cases, MCS and Tomcat fall out of sync. When this happens, the CLI shows one set of results and the GUI says otherwise. To sync them up, simply restart the Tomcat service by issuing:
# emwebapp.sh --restart

Note that after restarting the Tomcat service, it will take a while for the appliance to reconnect in the web client, as it has to rebuild the cache.
If you tail vdr-server.log, located at /usr/local/avamar/var/vdr/server_logs/vdr-server.log, the below logging indicates that the connection has completed successfully:

2018-01-25 09:05:57,566 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state
2018-01-25 09:05:57,567 INFO  [Timer_PhoneHomeCollect]-schedule.PhonehomeCollectTask: Writing Phome data to location  /usr/local/avamar/var/vdr/phonehome/vdp_state
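For reference, the tail itself is simply:

# tail -f /usr/local/avamar/var/vdr/server_logs/vdr-server.log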

The restore should then work successfully. 

If it still continues to fail, then try performing a restore to a different location as a workaround. 

Hope this helps!

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In a few scenarios, when you run a test recovery or a planned migration, the SRM service crashes. This might happen with one specific recovery plan or with any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
-->
--> Backtrace:
-->
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]
-->

The timedFunc back-trace is seen when "Wait for VMware Tools" is set to 0 minutes and 0 seconds.

A few lines above this back trace, you will see the faulty VM that caused the crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM MoRef ID is given here. To map this ID to a VM name, you will need to go to the vCenter MOB page.

The way I correlated this is:
1. Login to MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Respective datacenter under "Child Entity"
4. Then under vmFolder group-v4 (vm)
5. Expand childEntity and this will list out all the VMs in that vCenter.

In my case, the childEntity output showed that vm-323 corresponded to a VM named CentOS7.2.

> Then navigate to the recovery plans in SRM
> Select the affected recovery plan this VM is part of > Related Objects > Virtual Machines
> Right-click this VM and select Configure Recovery

Here, Wait for VMware Tools was set to a 0,0 timeout. We had to change this to a valid non-zero value.


After this, the recovery plan completed fine without crashing the SRM service. This should be fixed in newer SRM releases, which do not let you set a 0 timeout.

Hope this helps!

SRM Plugin Not Available In Web Client

Today, while working on a fresh 6.1.1 SRM deployment, we were unable to see the Site Recovery Manager plugin in the web client. The first thing we do in this case is go to the Managed Object Browser page and check whether the SRM extension is registered successfully. The URL for the MOB page is https://vcenter-ip-or-fqdn/mob

Here we browse further to content > ExtensionManager. Under the properties section, we should have an SRM extension, which is com.vmware.vcDr by default. If you have installed SRM with a custom identifier, you would see something like com.vmware.vcDr-<your-custom-identifier-name>.
In our case, the extension was available.

Next, looking at the web client logs, in our case a vCenter appliance, we noticed the following:

[2018-02-06T12:00:13.283+03:00] [ERROR] vc-extensionmanager-pool-81  70000046 100002 200001 com.vmware.vise.vim.extension.VcExtensionManager Package com.vmware.vcDr-custom was not installed!
Error downloading https://SRM-Local-IP:9086/srm-client.zip. Make sure that the URL is reachable then logout/login to force another download. java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)

So the vCenter was unable to pull the plugin from that URL, and as a result the SRM plugin folder was missing from the plugin package directory. On the vCenter appliance, this directory is:
/etc/vmware/vsphere-client/vc-packages/vsphere-client-serenity

Here you should have a "com.vmware.vcDr-<version-ID>" folder, which in our case was missing. So we had to manually place this package in that location.

To fix this:
1. Navigate to the URL from the log in a browser: https://SRM-Local-IP:9086/srm-client.zip
This will prompt you to download the plugin zip file. Download it and copy it to the above mentioned vsphere-client-serenity location via WinSCP.

2. Now we have to manually create this plugin folder. There are a few catches to this.

If you are using default plugin identifier for SRM, then the naming convention would be:
com.vmware.vcDr-<srm-version-string>

If you are using custom identifier for SRM, then the naming convention would be:
com.vmware.vcDr-customName-<srm-version-string> 

How do you find this exact SRM version string?

A) Go back to the MOB page where you left off in ExtensionManager. Click the com.vmware.vcDr extension. This will open a new page.

B) Here click on the client under VALUE 

C) Now you can see the version string and its value. In 6.1.1 SRM, for example, the version string is 6.1.1.1317

So the plugin folder now will be:
Default:
com.vmware.vcDr-6.1.1.1317

Custom:
com.vmware.vcDr-custom-6.1.1.1317

3. Copy the zip file into this folder and then extract it. The outcome would be a plugin-package.xml and a plugins folder.

4. Restart the web client service for the vCenter. The exact command varies between 6.0 and 6.5 vCenter; see the sketch after these steps.

5. Log back in to the web client once it loads up, and you should have the plugin.
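On a vCenter Server Appliance, the restart generally looks like the below; this is a sketch, so verify the exact service name on your version with service-control --list (on a Windows vCenter, restart the VMware vSphere Web Client service from services.msc instead):

# service-control --stop vsphere-client
# service-control --start vsphere-client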

Hope this helps!

Unable To Connect VDP To vCenter: "login returned a response status of 204 No Content"

When connecting a newly deployed VDP or an existing VDP to the web client, you might run into the following error:

This is a very generic message, and if you have a look at the web client logs, you will notice the following back trace:

[2018-02-08T09:03:57.295Z] [WARN ] http-bio-9090-exec-5         70000222 100008 200003 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client:  flex.messaging.MessageException: com.sun.jersey.api.client.UniformInterfaceException : POST https://10.10.0.12:8543/vdr-server/auth/login returned a response status of 204 No Content
        at flex.messaging.services.remoting.adapters.JavaAdapter.invoke(JavaAdapter.java:444)
        at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(JavaAdapterEx.java:50)
        at flex.messaging.services.RemotingService.serviceMessage(RemotingService.java:183)


Caused by: com.sun.jersey.api.client.UniformInterfaceException: POST https://10.10.0.12:8543/vdr-server/auth/login returned a response status of 204 No Content
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:609)
        at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:586)
        at com.emc.vdp2.api.impl.BaseApi.convertToFlexException(BaseApi.java:184)

Looking further into the vdr-server.log, you will notice this:

2018-02-08 10:04:44,850 ERROR [http-nio-8543-exec-9]-rest.AuthenticationService: Failed To Get VDR Info
java.lang.NullPointerException
        at com.emc.vdp2.common.appliance.ApplianceServiceImpl.getApplianceState(ApplianceServiceImpl.java:47)
        at com.emc.vdp2.services.VdrInfoServiceImpl.getVdrInfo(VdrInfoServiceImpl.java:171)
        at com.emc.vdp2.services.ApplianceInfoServiceWS.getVdrInfo(ApplianceInfoServiceWS.java:50)
        at com.emc.vdp2.rest.AuthenticationService.login(AuthenticationService.java:87)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)

The next piece of the stack trace might vary, but if you see the above messages, you are bound to see a trace similar to the below:

2018-02-08 10:04:44,727 INFO  [http-nio-8543-exec-9]-rest.AuthenticationService: Logging into appliance with type: vdp
2018-02-08 10:04:44,768 INFO  [http-nio-8543-exec-9]-connection.Mcsdk10StubManager: ServiceInstanceMoref desc=Service Id: urn:uuid:9FBE7B2DFEF05750401518080690404 name=urn:uuid:9FBE7B2DFEF05750401518080690404 value=SERVICE
2018-02-08 10:04:44,771 INFO  [http-nio-8543-exec-9]-connection.McAccessManager: Creating new mcsdk stub handler for connection key: [2091248218, Service Id: urn:uuid:9FBE7B2DFEF05750401518080690404] on Thread: [http-nio-8543-exec-9]
2018-02-08 10:04:44,849 ERROR [http-nio-8543-exec-9]-db.ApplianceStateDAO: ApplianceStateDAO.getApplianceState failed to execute ApplianceState query.
java.sql.SQLException: ERROR: relation "appliance_state" does not exist Query: select * from appliance_state Parameters: []
        at org.apache.commons.dbutils.AbstractQueryRunner.rethrow(AbstractQueryRunner.java:320)
        at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:349)
        at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:305)

Right after it initiates authentication, it queries the vdr database. In this case, the appliance_state table is missing from vdrdb.

To connect to vdrdb on VDP, run:
# psql -p 5555 -U admin vdrdb

Type \d to list all tables. You should see 26 relations here:

                          List of relations
 Schema |                  Name                   |   Type   | Owner
--------+-----------------------------------------+----------+-------
 public | appliance_state                         | table    | admin
 public | compatibility                           | table    | admin
 public | container_group_membership              | table    | admin
 public | container_group_membership_id_seq       | sequence | admin
 public | email_report_settings                   | table    | admin
 public | entity_display_path                     | table    | admin
 public | entity_display_path_id_seq              | sequence | admin
 public | esx_hosts                               | table    | admin
 public | esx_hosts_id_seq                        | sequence | admin
 public | group_app_client_targets                | table    | admin
 public | group_app_client_targets_id_seq         | sequence | admin
 public | identity                                | table    | admin
 public | identity_id_seq                         | sequence | admin
 public | job_migration_history                   | table    | admin
 public | job_migration_history_id_seq            | sequence | admin
 public | locked_backup_retentions                | table    | admin
 public | mc_activity_monitor                     | table    | admin
 public | mc_replication_activity_monitor         | table    | admin
 public | user_log                                | table    | admin
 public | user_log_id_seq                         | sequence | admin
 public | v_vm_group_membership_by_container      | view     | admin
 public | vcenter_event_monitor                   | table    | admin
 public | vdp_migration_history                   | table    | admin
 public | vdp_migration_history_id_seq            | sequence | admin
 public | vm_group_membership_by_container        | table    | admin
 public | vm_group_membership_by_container_id_seq | sequence | admin
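A quick way to test for the specific missing table non-interactively (a sketch using psql's -c option):

# psql -p 5555 -U admin vdrdb -c "select count(*) from appliance_state;"

If the table is missing, this returns the same 'relation "appliance_state" does not exist' error seen in the log.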

If you are missing one or more tables, the vdr service does not initialize and the connection fails.
To recreate the missing tables, open a case with VMware Support. I had to fix these tables manually; if someone has a better way, I'm open to suggestions.

After recreating the tables, restart the Tomcat service using:
# emwebapp.sh --restart
That's it!

VDP Expired MCSSL, Reports 7778, 7779, 7780, 7781, 9443 As Vulnerable In Nessus Scan

In one of my cases, ports 7778, 7779, 7780, 7781, and 9443 were reported as vulnerable in a Nessus scan of a VDP 6.1.6 appliance. All of these are MCS Java-based ports, and you can confirm this by running:
# netstat -nlp | grep <enter-port>

To check your MCS SSL certificate validity, run the below command:
# /usr/java/default/bin/keytool -list -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme

The output:

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 2 entries

mcssl, Feb 1, 2008, PrivateKeyEntry,
Certificate fingerprint (SHA1): F1:61:A7:FE:36:A9:E9:7E:DB:92:AE:89:05:52:13:B6:3C:FA:55:A7
vcenterrootca, Jan 8, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): F0:46:B4:00:B8:52:24:6E:A2:94:6B:17:CE:83:23:49:54:9A:3A:49

Then export the certificate to the root directory:
# /usr/java/default/bin/keytool -exportcert -v -alias mcssl -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme -file /root/mcssl.cer -rfc

The output:
Certificate stored in file </root/mcssl.cer>

Then read the certificate:
# /usr/java/default/bin/keytool -printcert -v -file /root/mcssl.cer

The output:

Owner: CN=Administrator, OU=Avamar, O=EMC, L=Irvine, ST=California, C=US
Issuer: CN=Administrator, OU=Avamar, O=EMC, L=Irvine, ST=California, C=US
Serial number: 47a25760
Valid from: Fri Feb 01 00:18:56 CET 2008 until: Mon Jan 29 00:18:56 CET 2018
Certificate fingerprints:
MD5: 61:42:FC:CD:FC:CB:6E:59:CC:48:5E:D9:71:05:F0:B4
SHA1: F1:61:A7:FE:36:A9:E9:7E:DB:92:AE:89:05:52:13:B6:3C:FA:55:A7
SHA256: B4:E6:71:77:58:9B:58:64:E2:F7:3A:A0:2A:07:F8:7B:2E:CA:1B:22:2B:C3:98:A8:90:F8:D8:7A:8E:0A:EE:F9
Signature algorithm name: SHA1withDSA
Version: 1

Because of this expired certificate, the Java ports are flagged as vulnerable. To fix this, you will have to regenerate the certificate. The process would be:

1. Backup existing keystore:
# cp /usr/local/avamar/lib/rmi_ssl_keystore ~root/rmi_ssl_keystore_backup-`date -I`

2. Regenerate the mcssl:
# /usr/java/latest/bin/keytool -genkeypair -v -alias mcssl -keyalg RSA -sigalg SHA256withRSA -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme -keypass changeme -validity 3650 -dname "CN=`hostname -f`, OU=Avamar, O=EMC, L=Irvine, S=California, C=US" -keysize 2048

This generates a SHA256 certificate that is valid for 10 years.

3. Update the permissions on the rmi_ssl_keystore
# chmod 444 /usr/local/avamar/lib/rmi_ssl_keystore

4. Update owners for the keystore:
# chown root:admin /usr/local/avamar/lib/rmi_ssl_keystore

5. Switch to the admin user and restart MCS:
# mcserver.sh --stop 
# mcserver.sh --start --verbose

6. Verify all vCenter Connections are OK:
# mccli server show-services
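You can also confirm the new certificate's validity window with the same keytool syntax used earlier:

# /usr/java/default/bin/keytool -list -v -alias mcssl -keystore /usr/local/avamar/lib/rmi_ssl_keystore -storepass changeme | grep -i valid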

That should be it. When you re-run the scan, these ports should no longer be flagged as vulnerable.

Hope this helps!

vSphere Replication Jobs Fail Due to NFC_NO_MEMORY

There might be instances where replication runs into an Error state or an RPO violation state with NFC errors. If you click the vCenter object in the web client and navigate to the Summary tab, you can view the list of issues; highlighting the vSphere Replication issues will show the NFC errors.

You will notice the below in the logs.
Note: The GID and other values will be different for each environment.

On the source ESX host where the affected virtual machine is hosted, you will notice the below in vmkernel.log:

2018-02-09T12:07:02.728Z cpu2:3055234)Hbr: 2998: Command: INIT_SESSION: error result=Failed gen=-1: Error for (datastoreUUID: "4723769b-f34bce3e"), (diskId: "RDID-0aaaa0e1-66e1-447f-97f5-19072c00d01e"), (hostId: "host-575"), (pathname: "Test-VM/hbrdis$
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 3007: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6)
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 4570: Failed to establish connection to [10.254.2.37]:31031(groupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6): Failure

In hbrsrv.log, located under /var/log/vmware on the vSphere Replication appliance, you will notice:

2018-02-09T13:12:17.024+01:00 warning hbrsrv[7FF152B01700] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrClientOpen: received unexpected message 4 from server
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-525.
2018-02-09T13:12:17.024+01:00 verbose hbrsrv[7FF152B01700] [Originator@6876 sub=HostPicker] AffinityHostPicker forgetting host affinity for context '[] /vmfs/volumes/4723769b-f34bce3e/Test-VM2'
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main] HbrError for (datastoreUUID: "4723769b-f34bce3e"), (hostId: "host-525"), (pathname: "Test-VM2/Tes-VM2.vmdk"), (flags: retriable, pick-new-host) stack:
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [0] Class: NFC Code: 8
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [1] NFC error: NFC_SESSION_ERROR
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [2] Code set to: Host unable to process request.
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [3] Set error flag: retriable
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [4] Set error flag: pick-new-host
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [5] Can't open remote disk /vmfs/volumes/4723769b-f34bce3e/Test-VM2/Test-VM2.vmdk

Now, run the below command to check whether one host is affected or several:
# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | awk '{ $1="";print}' | sort -u

This will give you the list of host MoIDs, neatly sorted:

 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-660.
 info hbrsrv[7FF1531E6700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-352.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-390.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-487.

Then use each host ID to find the corresponding host name from the vCenter MOB page.

Then, on the affected host, you will see this in hostd.log:

2018-02-09T12:17:21.339Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.399Z info hostd[4B040B70] [Originator@6876 sub=Nfcsvc] PROXY connection to NFC(useSSL=0): found session ticket:[N9VimShared15NfcSystemTicketE:0x4c224d24]
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Successfully initialized nfc callback for a  write to the socket to be invoked on a separate thread
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Plugin started
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50331560 and allocating 119 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcProcessStreamMsg: fssrvr failed with NFC error code = 5
2018-02-09T12:17:21.448Z error hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Read error from the nfcLib: NFC_NO_MEMORY (done=yep)

To fix this, you will need to increase the hostd NFC memory on the affected target ESX host.

1. SSH to the host and open the below file:
/etc/vmware/hostd/config.xml

Back up the file before editing, then look for the following snippet:

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>50331648</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

So here, change the value of maxMemory from 50331648 (48 MB) to 62914560 (60 MB); a sed one-liner for this edit is sketched after these steps.

So after edit:

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>62914560</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

2. Restart the hostd service using:
# /etc/init.d/hostd restart

3. Then initiate a force sync on the replication and it should resume successfully.
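If you prefer to script the edit from step 1, a sed one-liner can do it. This is a sketch: it assumes the stock maxMemory value shown above and that the busybox sed on your ESXi build supports in-place editing (-i).

# cp /etc/vmware/hostd/config.xml /etc/vmware/hostd/config.xml.bak
# sed -i 's#<maxMemory>50331648</maxMemory>#<maxMemory>62914560</maxMemory>#' /etc/vmware/hostd/config.xml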

Hope this helps!

VDP Cannot Connect To Web Client "MIME media type application/octet-stream was not found"

Yet another issue. You will see this generic message in the web client when you connect to the VDP plugin:

As usual, we check the web client logs on the vCenter for more information. In this case, below was the back trace:

[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
[2018-01-24T15:02:43.924Z] [ERROR] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.sun.jersey.api.client.ClientResponse The registered message body readers compatible with the MIME media type are:
application/octet-stream ->
com.sun.jersey.core.impl.provider.entity.ByteArrayProvider
com.sun.jersey.core.impl.provider.entity.FileProvider

[2018-01-24T15:02:43.924Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 com.emc.vdp2.api.impl.ActionApi Caught UniformInterfaceException [POST https://192.168.246.10:8543/vdr-server/auth/login returned a response status of 400 Bad Request], recieved HTTP response: [400] com.sun.jersey.api.client.UniformInterfaceException: POST https://192.168.246.10:8543/vdr-server/auth/login returned a response status of 400 Bad Request
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:688)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:560)
at com.emc.vdp2.api.impl.ActionApi.connectVDR(ActionApi.java:40)
at sun.reflect.GeneratedMethodAccessor2344.invoke(Unknown Source)

[2018-01-24T15:02:43.933Z] [WARN ] -0:0:0:0:0:0:0:1-9090-exec-9 70467127 102381 201761 org.springframework.flex.core.DefaultExceptionLogger The following exception occurred during request processing by the BlazeDS MessageBroker and will be serialized back to the client: flex.messaging.MessageException: com.sun.jersey.api.client.ClientHandlerException : A message body reader for Java class com.emc.vdp2.model.error.VdrError, and Java type class com.emc.vdp2.model.error.VdrError, and MIME media type application/octet-stream was not found
at flex.messaging.services.remoting.adapters.JavaAdapter.invoke(JavaAdapter.java:412)
at com.vmware.vise.messaging.remoting.JavaAdapterEx.invoke(JavaAdapterEx.java:72)
at flex.messaging.services.RemotingService.serviceMessage(RemotingService.java:180)
at flex.messaging.MessageBroker.routeMessageToService(MessageBroker.java:1472)

The issue is due to a corrupted server.xml in the Tomcat configuration. Depending on the corruption, either none of the vCenter users can connect, or only domain users face the issue.

To fix this:
1. Download the patched server.xml file from this link here:
https://github.com/happycow92/Patches

2. Copy the file to /root on the VDP via WinSCP.

3. Backup the original file:
# cp -p /usr/local/avamar-tomcat/conf/server.xml ~/server.xml.bak

4. Remove the old server.xml file:
# rm -f /usr/local/avamar-tomcat/conf/server.xml

5. Replace the patch file in the conf path:
# cp -p ~/server.xml /usr/local/avamar-tomcat/conf

6. Update permissions and ownership:
# chown root:root /usr/local/avamar-tomcat/conf/server.xml && chmod 644 /usr/local/avamar-tomcat/conf/server.xml

7. Restart the Tomcat service:
# emwebapp.sh --restart

The connection will take a while now since it needs to rebuild the cache, but it will be successful.

Hope this helps.


Unable To Send Scheduled Email Reports In VDP "com.vmware.vim25.ManagedObjectNotFound"

While I was working on a 6.1.5 VDP deployment, test emails were successful; however, the scheduled email reports kept failing, and the below back trace was seen in vdr-server.log:

2018-02-14 09:00:00,157 ERROR [VDP-email-report-timer-task]-schedule.EmailReportTimerTask: Failed to send the email summary report. Reason: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.Manage
dObjectNotFound; nested exception is:
        com.vmware.vim25.ManagedObjectNotFound
java.lang.RuntimeException: java.rmi.RemoteException: VI SDK invoke exception:com.vmware.vim25.ManagedObjectNotFound; nested exception is:
        com.vmware.vim25.ManagedObjectNotFound
        at com.vmware.vim25.mo.ManagedObject.retrieveObjectProperties(ManagedObject.java:158)
        at com.vmware.vim25.mo.ManagedObject.getCurrentProperty(ManagedObject.java:179)
        at com.vmware.vim25.mo.ManagedEntity.getName(ManagedEntity.java:99)
        at com.emc.vdp2.common.converter.VmClientConverterImpl.getRealName(VmClientConverterImpl.java:113)
        at com.emc.vdp2.common.converter.VmClientConverterImpl.convert(VmClientConverterImpl.java:62)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addVmClientToBackupJob(BackupJobConverterImpl.java:329)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addClientsToBackupJob(BackupJobConverterImpl.java:280)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(BackupJobConverterImpl.java:258)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:181)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:204)
        at com.emc.vdp2.services.BackupQueryServiceWS.getAllVmBackupJobs(BackupQueryServiceWS.java:80)
        at com.emc.vdp2.services.BackupQueryServiceWS.getAllVmBackupJobs(BackupQueryServiceWS.java:64)
        at com.emc.vdp2.email.EmailSummaryReport.createReport(EmailSummaryReport.java:295)
        at com.emc.vdp2.email.EmailSummaryReport.createReport(EmailSummaryReport.java:260)
        at com.emc.vdp2.email.EmailSummaryReport.createAndSendReport(EmailSummaryReport.java:142)
        at com.emc.vdp2.schedule.EmailReportTimerTask.run(EmailReportTimerTask.java:105)
        at java.util.TimerThread.mainLoop(Unknown Source)
        at java.util.TimerThread.run(Unknown Source)

The part of the back trace we are interested in is this:

com.emc.vdp2.common.converter.BackupJobConverterImpl.addClientsToBackupJob(BackupJobConverterImpl.java:280)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.addTargetItemsToBackupJob(BackupJobConverterImpl.java:258)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:181)
        at com.emc.vdp2.common.converter.BackupJobConverterImpl.convert(BackupJobConverterImpl.java:204)

The backup job to backup group conversion was hitting an exception. So, out of curiosity, I ran the below command:
# mccli group show --recursive=true

This showed two backup jobs, whereas the GUI reported one backup job.

So to fix this, I had to restart the tomcat service using:
# emwebapp.sh --restart

Post this, the GUI showed the right number of backup jobs and the scheduled email reports were sent successfully.

If you run into something similar hope this helps!

SRM Service Crashes Due To CR/LF Conversion

In a PostgreSQL-backed SRM deployment, the service might crash if the Carriage Return/Line Feed (CR/LF) conversion option is enabled in the ODBC DSN. The back trace does not tell much; even with trivia logging enabled, I could not make much out of it. This is what I saw in vmware-dr.log:

2018-02-20T01:05:26.359Z [02712 verbose 'Replication.Folder'] Reconstructing folder '_replicationRoot':'DrReplicationRootFolder' from the database
2018-02-20T01:05:26.468Z [02712 panic 'Default'] 
--> 
--> Panic: TerminateHandler called
--> Backtrace:
--> 
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
--> backtrace[05] vmacore.dll[0x001D9FD0]
<<<<SHORTENED BACKTRACE>>>>
--> backtrace[40] vmacore.dll[0x00065FEB]
--> backtrace[41] vmacore.dll[0x0015BC50]
--> backtrace[42] vmacore.dll[0x001D2A5B]
--> backtrace[43] MSVCR90.dll[0x00002FDF]
--> backtrace[44] MSVCR90.dll[0x00003080]
--> backtrace[45] KERNEL32.DLL[0x0000168D]
--> backtrace[46] ntdll.dll[0x00074629]
--> [backtrace end]

The last opID is 02712, and even searching on it with trivia logging enabled did not give much information.

Apparently, the CR/LF conversion option in the ODBC driver causes some kind of truncation, which crashes the SRM service.


This setting is available under:
ODBC 64-bit > System DSN > SRM DSN (Configure) > Datasource > Page 2

Uncheck this option, and the service should then start successfully.

Hope this kind of helps!

VDP Tomcat Service Crashes With "line 1, column 0, byte 0"

While working with one of my colleagues the other day, there was an issue where the Tomcat service would not start up.

root@vdp:/home/admin/#: dpnctl status
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: emt status: down.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: axionfs status: down.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: enabled.
dpnctl: INFO: avinstaller status: down.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

The core services seemed to be fine; it was just the EM Tomcat that was unresponsive. When we tried to restart the Tomcat service with the below command as the "root" user of VDP, it failed:
# emwebapp.sh --start

Error trace:

syntax error at line 1, column 0, byte 0:
Identity added: /home/admin/.ssh/dpnid (/home/admin/.ssh/dpnid)
^
<entry key="clean_emdb_cm_queue_info_days" value="365" />
at /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi/XML/Parser.pm line 187

If we ran the alternative command to start the service, it would fail too, and dpnctl.log would have a similar error trace.
# dpnctl start emt

Error trace:

2018/02/22-17:11:42 syntax error at line 1, column 0, byte 0:
2018/02/22-17:11:42 Identity added: /home/admin/.ssh/dpnid (/home/admin/.ssh/dpnid)
2018/02/22-17:11:42 tomcatctl: ERROR: problem running command "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/emwebapp.sh --start" - exit status 255

So it looks like the emserver.xml file had been corrupted, and we just needed to restore this particular file from a previous EM flush.
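To confirm the corruption before restoring, you can try validating the XML first (a sketch; it assumes xmllint is available on the appliance):

# xmllint --noout /space/avamar/var/em/server_data/prefs/emserver.xml

Any parser error printed here confirms the file is malformed.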

To perform this, the steps would be:

1. List the latest 5 available EM backups using:
# avtar --backups --path=/EM_BACKUPS --count=5 | less

The output would be:

Date      Time    Seq       Label           Size     Plugin    Working directory         Targets
---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
2018-02-21 08:00:05   213                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-20 08:00:05   212                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-19 08:00:08   211                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-18 08:00:04   210                          57K Linux    /usr/local/avamar     var/em/server_data
2018-02-17 08:00:11   209                          57K Linux    /usr/local/avamar     var/em/server_data

2. Next, choose a label number and restore the EM flush to a temp directory. The command would be:
# avtar -x --labelnum=209 --path=/EM_BACKUPS --target=/tmp/emback

A successful restore would end with a similar snippet:

avtar Info <5259>: Restoring backup to directory "/tmp/emback"
avtar Info <5262>: Restore completed
avtar Info <7925>: Restored 55.55 KB from selection(s) with 55.55 KB in 10 files, 6 directories
avtar Info <6090>: Restored 55.55 KB in 0.01 minutes: 306.5 MB/hour (56,489 files/hour)
avtar Info <7883>: Finished at 2018-02-22 21:26:53 GST, Elapsed time: 0000h:00m:00s
avtar Info <6645>: Not sending wrapup anywhere.
avtar Info <5314>: Command completed (1 warning, exit code 0: success)

3. Rename the older emserver.xml file using:
# mv /space/avamar/var/em/server_data/prefs/emserver.xml /space/avamar/var/em/server_data/prefs/emserver.xml.old

4. Copy the restored file to the actual location using:
# cp -p /tmp/emback/var/em/server_data/prefs/emserver.xml /space/avamar/var/em/server_data/prefs/

5. Verify the permissions on the files using:
# ls -lh  /space/avamar/var/em/server_data/prefs/

The permissions should be:

-rw------- 1 admin admin 9.4K Jul 26  2017 emserver.xml
-rw------- 1 admin admin 9.3K Feb 22 09:09 emserver.xml.old
-r-------- 1 admin admin 4.5K Jul 26  2017 preferences.dtd

6. Restart the tomcat service using:
# emwebapp.sh --start

It should now start up successfully. Hope this helps!

VDP Backup Fails With Time Out - End

So you might be aware that a backup job in VDP can run up to 24 hours, after which it ideally times out with a "Timed Out - End" status. If this is what you are facing, then you can extend the backup window using this KB article here

In some cases, this might not be the situation: you might not hit the 24-hour window, but the backup still fails with "Timed Out - End".

ID Status Error Code Start Time Elapsed End Time Type Progress Bytes New Bytes Client Domain
9151917120002509 Timed Out - End 0 2018-02-21 00:01 IST 02h:59m:31s 2018-02-21 03:00 GMT Scheduled Backup 179.9 GB 15.4% Test-VM /vcenter.southpark.local/VirtualMachines

The backup logs will have the same logging that one would expect to see when the time out issue occurs.

The avtar.log will have something similar to:

2018-02-21T03:00:23.253-01:00 avtar Info <7061>: Canceled by '3016-vmimagew' - exiting...
2018-02-21T03:00:23.258-01:00 avtar Info <9772>: Starting graceful (staged) termination, cancel request (wrap-up stage)
2018-02-21T03:00:23.347-01:00 [avtar] ERROR: <0001> backstreamdir::childdone error merging history stream data during phase_do_hidden while processing directory 'VMFiles/2'
2018-02-21T03:00:23.392-01:00 [avtar] ERROR: <0001> backstreamdir::childdone error merging history stream data during phase_do_hidden while processing directory 'VMFiles'
2018-02-21T03:00:23.438-01:00 avtar Info <7202>: Backup CANCELED, wrapping-up session with Server

The backup.log will have:

2018-02-21T03:00:23.169-01:00 avvcbimage Info <9740>: skipping 2048 unchanged sectors in PAX stream at sector offset 241468416
2018-02-21T03:00:23.196-01:00 avvcbimage Info <9772>: Starting graceful (staged) termination, MCS cancel (wrap-up stage)
2018-02-21T03:00:23.212-01:00 avvcbimage Info <19692>: Cancelled by MCS with timeout=0 sec
2018-02-21T03:00:23.647-01:00 avvcbimage Info <40654>: isExitOK()=169
2018-02-21T03:00:23.647-01:00 avvcbimage Info <16022>: Cancel detected(externally cancelled by Administrator), isExitOK(0).
2018-02-21T03:00:23.657-01:00 avvcbimage Info <16041>: VDDK:VixDiskLib: VixDiskLib_Close: Close disk.


The avagent.log will have the following:

2018-02-21T03:00:23.137-01:00 avagent Info <5964>: Requesting work from 127.0.0.1
2018-02-21T03:00:23.162-01:00 avagent Info <8636>: Starting CTL cancel of workorder "BackupTest-BackupTest-1519171200009".
2018-02-21T03:00:31.454-01:00 avagent Info <6688>: Process 127319 (/usr/local/avamarclient/bin/avvcbimage) finished (code 169: externally cancelled by Administrator)
2018-02-21T03:00:31.454-01:00 avagent Warning <6690>: CTL workorder "BackupTest-BackupTest-1519171200009" non-zero exit status 'code 169: externally cancelled by Administrator'

When I checked the backup window duration, it was set to only 3 hours. Not sure how that changed.

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule show --name=/vcenter.southpark.local/VirtualMachines/BackupTest | grep "Backup Window Duration" 
Backup Window Duration 3 hours and 0 minutes

So we had to flip it back to 24 hours:

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule edit --name=/vcenter.southpark.local/VirtualMachines/BackupTest --duration=24:00
0,22214,Schedule modified

root@vdp:/usr/local/avamar/var/vdr/server_logs/#: mccli schedule show --name=/vcenter.southpark.local/VirtualMachines/BackupTest | grep "Backup Window Duration" 
Backup Window Duration 24 hours and 0 minutes
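If you suspect other jobs have the same problem, you can list every schedule in one pass and check the durations (hedged; verify these flags against the mccli reference for your build):

# mccli schedule show --domain=/ --recursive=true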

The backup should no longer time out before the 24-hour window. Hope this helps!

SRM CentOS 7.4 IP Customization Fails

If you are using SRM 6.0.x or 6.1.x and you try a test failover of a CentOS 7.4 machine with IP customization, the Customize IP step of the recovery fails with the message:

The guest operating system '' is not supported


In the vmware-dr.log on the DR site SRM, you will notice the following:

2018-03-02T02:10:43.405Z [01032 error 'Recovery' ctxID=345cedf opID=72d8d85a] Plan 'CentOS74' failed: (vim.fault.UnsupportedGuest) {
-->    faultCause = (vmodl.MethodFault) null, 
-->    property = "guest.guestId", 
-->    unsupportedGuestOS = "", 
-->    msg = ""
--> }

This is because CentOS 7.4 is not part of the supported guest list in the imgcust binaries of the 6.0 release. For CentOS 7.4 customization to work, SRM needs to be on a 6.5 release. In my case, I upgraded vCenter to 6.5 Update 1 and SRM to 6.5.1, after which the test recovery completed without issues.

If there is no plan for an immediate upgrade of your environment, but you would still like the customization to complete, use this workaround.

If you look at the redhat-release file
# cat /etc/redhat-release

The contents are:
CentOS Linux release 7.4.1708 (Core)

So you remove this line and add:
Red Hat Enterprise Linux Server release 7.0 (Maipo)
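Put together, the edit on the CentOS 7.4 VM would look like this (a sketch; keep a backup so you can revert once you upgrade SRM):

# cp /etc/redhat-release /etc/redhat-release.bak
# echo 'Red Hat Enterprise Linux Server release 7.0 (Maipo)' > /etc/redhat-release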

Since RHEL 7.0 is supported by imgcust in 6.0, the test recovery completes fine. Hope this helps!

Creating A VMFS Volume From Command Line

An alternative to formatting a new VMFS volume from the GUI is to create it from an SSH session on the ESXi host.

The process is quite simple; follow the steps below:

1. Make sure the device is presented to the ESXi host and visible. If not, perform a Rescan Storage and check whether the device appears.

2. You can get the device identifier from the SSH of the ESX by navigating to:
# cd /vmfs/devices/disks

In my case, the device I was interested in was mpx.vmhba1:C0:T3:L0

3. Next, we need to create a partition on this device. We no longer use fdisk on ESXi, as it is deprecated, so we will use partedUtil.

So we will create a partition (number 1) at an offset of 128 sectors. The partition type identifier is 0xfb, which is a VMFS partition; 0xfb in decimal is 251. Along with this, we will specify the ending sector.

To calculate the ending sector:
The disk has 512 bytes per sector. In my case, the device is 12 GB.
12 GB is 12,884,901,888 bytes.
Dividing this by 512 gives 25,165,824 sectors.

Do not use the full sector count, as partedUtil might complain about an out-of-bounds sector value; use one number less.
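You can sanity-check the math with shell arithmetic, and partedUtil can also report the usable sector range for the device directly (a sketch; the second command's output format may vary by ESXi build):

# echo $((12 * 1024 * 1024 * 1024 / 512))
25165824
# partedUtil getUsableSectors /vmfs/devices/disks/mpx.vmhba1:C0:T3:L0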

The command would then be:
# partedUtil set /vmfs/devices/disks/device-name "1 128 <ending-sector> 251 0"

Sample command:
# partedUtil set /vmfs/devices/disks/mpx.vmhba1:C0:T3:L0 "1 128 25165823 251 0"

A successful output would be:
0 0 0 0
1 128 25165823 251 0


4. Next, you format a VMFS volume using the vmkfstools -C command. 

The command would be:
# vmkfstools -C <vmfs-version> -b <block-size> -S <name-of-datastore> /vmfs/devices/disks/<device-name>:<partition-number> 

So the command for me would be (for a VMFS5 partition with a 1 MB block size):
# vmkfstools -C vmfs5 -b 1m -S Test /vmfs/devices/disks/mpx.vmhba1:C0:T3:L0:1

A successful output would be:
Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
Creating vmfs5 file system on "mpx.vmhba1:C0:T3:L0:1" with blockSize 1048576 and volume label "Test".
Successfully created new volume: 5aa7d4e8-1e99a608-f609-000c292cd901

Now, back in the GUI, just refresh the storage section and the volume will be visible to the host.

Hope this helps!

Unable To Make Changes On A Virtual Machine - The operation is not allowed in the current state of the datastore

Recently, while working on a case, we noticed that we were unable to make changes to any virtual machine on a particular NetApp NFS datastore. We could not add disks, increase existing VMDKs, or create virtual machines on that datastore.

The error we received was:



When we logged in to the host directly via the host UI client, we were able to perform all of the above changes. This pointed to an issue with the vCenter Server rather than the datastore.

So, looking into the vpxd.log on the vCenter, this is what we saw:

2018-03-16T09:11:48.485Z info vpxd[7FB8F56ED700] [Originator@6876 sub=Default opID=VmConfigFormMediator-applyOnMultiEntity-93953-ngc:70007296-f3] [VpxLRO] -- ERROR task-252339 -- vm-44240 -- vim.VirtualMachine.reconfigure: vim.fault.InvalidDatastoreState:
--> Result:
--> (vim.fault.InvalidDatastoreState) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    datastoreName = "dc1_vmware01"
-->    msg = ""
--> }
--> Args:
-->
--> Arg spec:
--> (vim.vm.ConfigSpec) {
-->    changeVersion = "2018-03-15T08:04:06.266218Z",
-->    name = <unset>,

To fix this, we had to change the thin_prov_space_flag from 1 to 0 in the vCenter Server database. In my case, the vCenter was an appliance (the process remains more or less the same for a Windows-based vCenter).

The fix:

1. Always take a snapshot of the vCenter Server before making any changes within it.

2. Stop the vCenter server service using:
# service-control --stop vmware-vpxd

3. Connect to the vCenter database using:
# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U vc

The password for the vCenter DB can be found in:
/etc/vmware-vpx/vcdb.properties

Run the below query to list the datastores known to this vCenter:
select * from vpx_datastore where name='<enter-your-datastore-name>';

The output would be something like:
   11 | dc1_vmware01        | ds:///vmfs/volumes/7a567de9-3e3c0969/                   | 5277655814144 | 1745005346816 | NFS  |      | <obj xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="urn:vim25" versionId="6.5" xsi:type="DatastoreCapability"><directoryHierarchySupported>true</directoryHierarchySupported><rawDiskMappingsSupported>false</rawDiskMappingsSupported><perFileThinProvisioningSupported>true</perFileThi
nProvisioningSupported><storageIORMSupported>true</storageIORMSupported><nativeSnapshotSupported>true</nativeSnapshotSupported><topLevelDirectoryCreateSupported>true</topLevelDirectoryCreateSupported><seSparseSupported>true</seSparseSupp
orted><vmfsSparseSupported>true</vmfsSparseSupported><vsanSparseSupported>false</vsanSparseSupported></obj> |             2 |            0 |                        30 |                0 |                    1 | automatic
  |                        90 |                  1

Too much information here, so we can filter it using the below query:
select id,thin_prov_space_flag from vpx_datastore;

Now you can see:

VCDB=> select id,thin_prov_space_flag from vpx_datastore;
  id  | thin_prov_space_flag
------+----------------------
 5177 |                    0
 6449 |                    0
 5178 |                    0
   12 |                    0
  795 |                    0
  149 |                    0
  793 |                    0
   11 |                    1

Now we need to change the thin_prov_space_flag from 1 to 0 for id 11.

So run this query:
update vpx_datastore set thin_prov_space_flag=0 where id=<enter-your-id>;
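For the example above (id 11), the exchange would look like this (illustrative):

VCDB=> update vpx_datastore set thin_prov_space_flag=0 where id=11;
UPDATE 1
VCDB=> select id,thin_prov_space_flag from vpx_datastore where id=11;
  id  | thin_prov_space_flag
------+----------------------
   11 |                    0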

Quit the database view using \q

Start the vCenter Server service using:
# service-control --start vmware-vpxd

Log back in to the vCenter, and now you should be able to make the necessary changes.

Hope this helps!


Embedded Replication Server Disconnected In vSphere Replication 5.8

A vSphere Replication server comes with an embedded replication service that manages all the replication traffic and vR queries, in addition to the option of deploying add-on servers. In 5.8 or older vSphere Replication servers, there are scenarios where this embedded replication server is displayed as disconnected. While it is disconnected, replications fall into an RPO violation state, as the replication traffic cannot be managed.

In the hbrsrv.log on the vSphere replication appliance, located under /var/log/vmware, we see the below:

repl:/var/log/vmware # grep -i "link-local" hbrsrv*

hbrsrv-402.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

hbrsrv.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

This is seen when the VMs being replicated are on an ESX host that has an IPv6 link-local address enabled while the host is actually using IPv4 addressing.

The logs here speak in terms of host MoIDs, so you can find the host name from the vCenter MOB page, https://<vcenter-ip>/mob

To navigate to the host MoID section:

Content > group-d1 (Datacenters) > (Your datacenter) under childEntity > group-xx under hostFolder > domain-xx (Under childEntity) > locate the host ID

Then, using this host name, disable IPv6 on the referenced ESX host (a CLI alternative is sketched after these steps):
> Select the ESXi 
> Select Configuration
> Select Networking
> Edit Settings for vmk0 (Management) port group
> IP Address, Un-check IPv6
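Alternatively, IPv6 can be disabled host-wide from the CLI (a sketch; note this turns off IPv6 for the entire host, not just vmk0, and a reboot is still required):

# esxcli network ip set --ipv6-enabled=false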

Then reboot that ESX host. Repeat the steps for the remaining ESX hosts, and finally reboot the vSphere Replication appliance.

Now there should be no more link-local logging in hbrsrv.log, and the embedded server should reconnect, allowing the RPO syncs to resume.

Hope this helps!

Maintenance Task Fails On VDP When Connected To Data Domain

There are many instances where the maintenance tasks fail on VDP. This article is specific to VDP integrated with Data Domain, and in particular to environments where the DD OS version is 6.1 or above.

The checkpoint and HFS check tasks were completing without issues:
# dumpmaintlogs --types=cp | grep "<4"

2018/03/19-12:01:04.44235 {0.0} <4301> completed checkpoint maintenance
2018/03/19-12:04:17.71935 {0.0} <4300> starting scheduled checkpoint maintenance
2018/03/19-12:04:40.40012 {0.0} <4301> completed checkpoint maintenance

# dumpmaintlogs --types=hfscheck | grep "<4"

2018/03/18-12:00:59.49574 {0.0} <4002> starting scheduled hfscheck
2018/03/18-12:04:11.83316 {0.0} <4003> completed hfscheck of cp.20180318120037
2018/03/19-12:01:04.49357 {0.0} <4002> starting scheduled hfscheck
2018/03/19-12:04:16.59187 {0.0} <4003> completed hfscheck of cp.20180319120042

The garbage collection task was the one failing:
# dumpmaintlogs --types=gc --days=30 | grep "<4"

2018/03/18-12:00:22.29852 {0.0} <4200> starting scheduled garbage collection
2018/03/18-12:00:36.77421 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR
2018/03/19-12:00:23.91138 {0.0} <4200> starting scheduled garbage collection
2018/03/19-12:00:41.77701 {0.0} <4202> failed garbage collection with error MSG_ERR_DDR_ERROR

ddrmaint.log, located under /usr/local/avamar/var/ddrmaintlogs, had the following entries:

Mar 18 12:00:31 VDP01 ddrmaint.bin[14667]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 18 12:00:34 VDP01 ddrmaint.bin[14667]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

Mar 19 12:00:35 VDP01 ddrmaint.bin[13409]: Error: gc-finish::remove_unwanted_checkpoints: Failed to retrieve snapshot checkpoints: LSU: avamar-1488469814 ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 19 12:00:39 VDP01 ddrmaint.bin[13409]: Info: gc-finish:[phase 4] Completed garbage collection for data-domain.home.local(1), DDR result code: 0, desc: Error not set

It was basically failing to retrieve the checkpoint list from the Data Domain.
The get-checkpoint-list operation was failing as well:

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:16:50 VDP01 ddrmaint.bin[27852]: Error: <4750>Datadomain get checkpoint list operation failed.

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::execute_cplist: Failed to retrieve snapshot checkpoints from LSU: avamar-1488469814, ddr: data-domain.home.local(1), DDR result code: 5009, desc: I/O error

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: cplist::body - auto checkpoint list failed result code: 0

Mar 20 11:17:50 VDP01 ddrmaint.bin[28021]: Error: <4750>Datadomain get checkpoint list operation failed.

From the mTree LSU of this VDP server, we noticed that the checkpoints were not being expired:
# snapshot list mtree /data/col1/avamar-1488469814

Snapshot Information for MTree: /data/col1/avamar-1488469814
----------------------------------------------
Name                Pre-Comp (GiB)   Create Date         Retain Until   Status
-----------------   --------------   -----------------   ------------   ------
cp.20171220090039         128533.9   Dec 20 2017 09:00
cp.20171220090418         128543.0   Dec 20 2017 09:04
cp.20171221090040         131703.8   Dec 21 2017 09:00
cp.20171221090415         131712.9   Dec 21 2017 09:04
.
cp.20180318120414         161983.7   Mar 18 2018 12:04
cp.20180319120042         162263.9   Mar 19 2018 12:01
cp.20180319120418         162273.7   Mar 19 2018 12:04
cur.1515764908            125477.9   Jan 12 2018 13:49
-----------------   --------------   -----------------   ------------   ------
Snapshot Summary
-------------------
Total:          177
Not expired:    177
Expired:          0

Due to this, all the recent checkpoints on VDP were invalid:
# cplist

cp.20180228120038 Wed Feb 28 12:00:38 2018 invalid --- ---  nodes   1/1 stripes     76
.
cp.20180318120414 Sun Mar 18 12:04:14 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120042 Mon Mar 19 12:00:42 2018 invalid --- ---  nodes   1/1 stripes     76
cp.20180319120418 Mon Mar 19 12:04:18 2018 invalid --- ---  nodes   1/1 stripes     76

In this case, the VDP version was 6.1.x and the Data Domain OS version was 6.1:
# ddrmaint read-ddr-info --format=full

====================== Read-DDR-Info ======================
System name        : xxx.xxxx.xxxx
System ID          : Bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4
DDBoost user       : ddboost
System index       : 1
Replication        : True
CP Backup          : True
Model number       : DDxxx
Serialno           : Cxxxxxxxx
DDOS version       : 6.1.0.21-579789
System attached    : 1970-01-01 00:00:00 (0)
System max streams : 16

DD OS 6.1 is not supported with VDP 6.1.x; 6.0.x is the last DD OS version supported for VDP.

So if your DD OS is on 6.1.x, the choices would be:
> Migrate the VDP to Avamar Virtual Edition (Recommended)
> Rollback DD OS to 6.0.x

Hope this helps!

SRM Test Recovery Fails: "Failed to create snapshots of replica devices"

When using SRM with array-based replication, a test recovery operation takes a snapshot of the replica LUN, presents it, and mounts it on the ESX server to bring up the VMs on an isolated network.

In many instances, the test recovery fails at the crucial step: taking a snapshot of the replica device. The GUI would mention:

Failed to create snapshots of replica devices 

In this case, always look into the vmware-dr.log on the recovery-site SRM. In my case, I noticed the below snippet:

2018-04-10T11:00:12.287+01:00 error vmware-dr[16896] [Originator@6876 sub=SraCommand opID=7dd8a324:9075:7d02:758d] testFailoverStart's stderr:
--> java.io.IOException: Couldn't get lock for /tmp/santorini.log
--> at java.util.logging.FileHandler.openFiles(Unknown Source)
--> at java.util.logging.FileHandler.<init>(Unknown Source)
=================BREAK========================
--> Apr 10, 2018 11:00:12 AM com.emc.santorini.log.KLogger logWithException
--> WARNING: Unknown error: 
--> com.sun.xml.internal.ws.client.ClientTransportException: HTTP transport error: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
=================BREAK========================
--> Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at sun.security.ssl.InputRecord.handleUnknownRecord(Unknown Source)
--> at sun.security.ssl.InputRecord.read(Unknown Source)
--> at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

2018-04-10T11:00:12.299+01:00 error vmware-dr[21512] [Originator@6876 sub=AbrRecoveryEngine opID=7dd8a324:9075:7d02:758d] Dr::Providers::Abr::AbrRecoveryEngine::Internal::RecoverOp::ProcessFailoverFailure: Failed to create snapshots of replica devices for group 'vm-protection-group-45026' using array pair 'array-pair-2038': (dr.storage.fault.CommandFailed) {
-->    faultCause = (dr.storage.fault.LocalizableAdapterFault) {
-->       faultCause = (vmodl.MethodFault) null, 
-->       faultMessage = <unset>, 
-->       code = "78814f38-52ff-32a5-806c-73000467afca.1049", 
-->       arg = <unset>
-->       msg = ""
-->    }, 
-->    faultMessage = <unset>, 
-->    commandName = "testFailoverStart"
-->    msg = ""
--> }
--> [context]

So here the SRA attempts to establish a connection with RecoverPoint over HTTP, which is disabled from RecoverPoint 3.5.x onwards; we need to allow RecoverPoint and SRM to communicate over HTTPS instead.
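
As an aside, if you need to dig this snippet out of a large log bundle, a quick filter from an admin command prompt is below. The log directory shown is the default for SRM 6.x on Windows and is an assumption; adjust it to your install:

findstr /i "testFailoverStart" "C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\vmware-dr-*.log"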

On the SRM, perform the below:

1. Open CMD in admin mode and navigate to the below location:
c:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\array-type-recoverpoint

2. Then run the below command:
"c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\perl-5.14.2\bin\perl.exe" command.pl --useHttps true

In SRM 6.5 I have seen the path be external\perl\perl\bin\perl.exe instead, so verify the correct path for the second command before running it.

You should ideally see an output like:
Successfully changed to HTTPS security mode

3. Perform this on both SRM sites. 

On the RPA, perform the below:

1. Log in to each RPA with the boxmgmt account.

2. [2] Setup > [8] Advanced Options > [7] Security Options > [1] Change Web Server Mode 
(option number may change)

3. You will then be presented with this message:
Do you want to disable the HTTP server (y/n)?

4. Answer y to disable the HTTP server, and repeat this on both the production and recovery RPA clusters. 

Restart the SRM service on both sites and re-run the test recovery; it should now complete successfully. 
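
If you want to first confirm that the RPA now answers on HTTPS (TCP 443), a minimal probe from PowerShell on the SRM server is below; this assumes PowerShell 4.0 or later, and rpa-cluster.home.local is a hypothetical RPA management address:

Test-NetConnection -ComputerName rpa-cluster.home.local -Port 443

TcpTestSucceeded : True in the output means the HTTPS listener is reachable.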

Hope this helps. 

Upgrading vCenter Appliance From 6.5 to 6.7

As you know, vSphere 6.7 is now GA, and this article walks through upgrading an embedded-PSC deployment of the 6.5 vCenter Server Appliance to 6.7. Once you download the 6.7 VCSA ISO installer, mount it on a local Windows machine and run the UI installer for Windows to begin the upgrade.

You will be presented with the below choices:


We will be going with the Upgrade option. As with the earlier upgrade path, the process deploys a new 6.7 VCSA, migrates data and configuration from the older 6.5 appliance, and powers down the old server once the upgrade succeeds.


Accept the EULA to proceed further.


In the next step we connect to the source appliance, so provide the IP/FQDN of the source 6.5 vCenter Server.


Once Connect To Source goes through, you will be asked to enter the SSO credentials and the details of the ESX host where the 6.5 VCSA is running.


The next step is to provide information about the target 6.7 appliance. Select the ESX host where the target appliance should be deployed.


Then provide the inventory display name for the target vCenter 6.7 along with a root password.


Select the appliance deployment size for the target server. Make sure this matches or exceeds the size of the source 6.5 server.


Then select the datastore where the target appliance should reside.


Next, we provide a set of temporary network details for the 6.7 appliance. The appliance will inherit the old 6.5 network configuration after a successful migration.


Review the details and click Finish to begin the Stage 1 deployment process.


Once Stage 1 is done, you can click Continue to proceed with Stage 2.



In Stage 2 we perform a data copy from the source vCenter appliance to the target deployed in Stage 1.


Provide the details to connect to the source vCenter Server.


Select the type of data to be copied over to the destination vCenter server. In my case, I just want to migrate the configuration data.


Join the CEIP if desired and proceed further.


Review the details and click Finish to begin the data copy.


The source vCenter will be shut down once the data copy completes.


The data migration takes a while to complete and runs in three stages.


And that's it. If all goes well, the migration is complete and you can access your new vCenter at its usual URL.
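
On a side note, if you prefer a scripted upgrade over the UI, the same ISO also ships a CLI installer (vcsa-cli-installer) that drives both stages from a JSON template. A minimal sketch is below, assuming the ISO is mounted as E: on a Windows machine and you have filled in the bundled upgrade template; verify the exact flags with vcsa-deploy upgrade --help for your build:

E:\vcsa-cli-installer\win32\vcsa-deploy.exe upgrade --accept-eula C:\upgrade\embedded_vCSA_on_ESXi.json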

Hope this helps.

SRM Service Fails To Start: "Could not initialize Vdb connection Data source name not found and no default driver specified"

In a few cases, you might come across a scenario where the Site Recovery Manager service does not start, and in the Event Viewer you will notice the following backtrace for the vmware-dr service.

VMware vCenter Site Recovery Manager application error.
class Vmacore::Exception "DBManager error: Could not initialize Vdb connection: ODBC error: (IM002) - [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] vmware-dr.exe[0x00107621]
backtrace[05] MSVCR120.dll[0x00066920]
backtrace[06] MSVCR120.dll[0x0005E36D]
backtrace[07] ntdll.dll[0x00092A63]
backtrace[08] vmware-dr.exe[0x00014893]
backtrace[09] vmware-dr.exe[0x00015226]
backtrace[10] windowsService.dll[0x00002BF5]
backtrace[11] windowsService.dll[0x00001F24]
backtrace[12] sechost.dll[0x00005ADA]
backtrace[13] KERNEL32.DLL[0x000013D2]
backtrace[14] ntdll.dll[0x000154E4]
[backtrace end]  

There are no logs generated in vmware-dr.log, and the ODBC connection test completes successfully too. 

However, when you open the vmware-dr.xml file located under C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config and search for the tag <DBManager>, you will notice the <dsn> name is incorrect.
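
A quick way to check the DSN currently configured, from an admin command prompt (the path is the default install location):

findstr /i "<dsn>" "C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config\vmware-dr.xml"

The name returned must exactly match a 64-bit System DSN defined in the ODBC Data Source Administrator (odbcad32.exe).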

Upon providing the right DSN name within the <dsn> </dsn> tags, you will then notice a new backtrace when you attempt to start the service again:

VMware vCenter Site Recovery Manager application error.
class Vmacore::InvalidArgumentException "Invalid argument"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] listener.dll[0x0000BCBC]

What I suspect is that something has gone wrong with the vmware-dr.xml file, and the fix is to re-install the SRM application while pointing it at the existing database. 

Post this, the service starts successfully. Hope this helps. 
