
Configuring Data Domain Virtual Edition And Connecting It To A VDP Appliance

Recently, while browsing through my VMware repository, I came across EMC Data Domain Virtual Edition. Before this, the only times I had worked on EMC DD cases were over WebEx sessions dealing with integration/backup issues between VDP and Data Domain. My expertise on Data Domain is currently limited, but with this appliance edition in the lab you can expect more updates on VDP-DD issues.

First things first: this article is about deploying the virtual edition of Data Domain and configuring it for VDP. You can consider this the initial setup. I am going to skip certain parts, such as the OVF deployment, assuming these are already familiar to you.

1. The Data Domain OVA can be acquired from the EMC download site. It contains an .ovf file, a .vmdk file and an .mf (manifest) file.
A prerequisite is to have forward and reverse DNS configured for your DD appliance. Once this is in place, use either the Web Client or the vSphere Client to complete the deployment. Do not power on the appliance after deployment.

2. Once the deployment is complete, if you go to Edit Settings on the Data Domain appliance, you can see that there are two hard disks, 120 GB and 10 GB. These are the system drives, and you will need to add an additional drive (500 GB) for storing the backup data. At this point, go ahead and perform an "Add Device" operation and add a new vmdk of 500 GB in size. Once the drive is added successfully, power on the Data Domain appliance.

3. The default credentials are username: sysadmin, password: abc123. At first login, a script runs automatically to configure networking. I have not included this in the screenshots as it is a fairly simple task. It prompts for the IPv4 settings: IP address, subnet mask, default gateway, DNS server address, hostname and domain name. IPv6 is optional and can be skipped if not required. The prompt will also ask whether you would like to change the password for sysadmin.
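Once the network script completes, it does not hurt to verify what was configured and how the new 500 GB disk enumerated before moving on. A quick check using standard DD OS commands (in my lab the new vmdk showed up as dev3, which is why that device name is used in the next step):
# net show settings
# disk show state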

4. The next step is to add the 500 GB disk that was created for this appliance. Run the following command in an SSH session to the Data Domain appliance to add the 500 GB disk to the active tier.
# storage add dev3
This would be seen as:


5. Once the additional disk is added to the active tier, you then need to create a file system on this. Run the following command to create a file system:
# filesys create
This would be seen as:


This will provision the storage and create the FS. The progress will be displayed once the prompt is acknowledged with "yes" as above.


6. Once the file system is created, the next part is to enable the file system, which again is done by a pretty simple command:
# filesys enable

Note that this is not a full Linux OS, so most Linux commands will not work on a Data Domain. The Data Domain runs its own operating system, DD OS, which supports only its own set of commands.

The output for enabling file system would be seen as:


Perform a restart of the data domain appliance before proceeding with the next steps. 

7. Once the reboot is completed, open a browser and go to the following link: 
https://data-domain-ip/ddem
If you see an unable to connect message in the browser, then go back to the SSH and run the following command:
# admin show
The output which we need would be:

Service   Enabled   Allowed Hosts
-------   -------   -------------
ssh       yes       -
scp       yes       (same as ssh)
telnet    no        -
ftp       no        -
ftps      no        -
http      no        -
https     no        -
-------   -------   -------------

Web options:
Option            Value
---------------   -----
http-port         80
https-port        443
session-timeout   10800

So the ports are already set; however, the services are not enabled. We need to enable the HTTP and HTTPS services here. The commands are:
sysadmin@datadomain# adminaccess enable http

HTTP Access:    enabled
sysadmin@datadomain# adminaccess enable https

HTTPS Access:   enabled
This should do the trick, and you should now be able to log in to the Data Domain manager from the browser.


The login username is sysadmin, with either the default password (if you chose not to change it during the SSH setup) or the newly configured one.

8. Now, VDP connects to the Data Domain via a DD Boost user. You need to create a ddboost user with admin rights and enable DD Boost. This can be done from the GUI or from the command line; the command line is much easier.

The first step would be to "add" a ddboost user:
# user add ddboost-user password VMware123!
The next step would be to "set" the created ddboost user:
# ddboost set user-name ddboost-user
The last step would be to enable ddboost:
# ddboost enable 
You can also check ddboost status with:
# ddboost status

This will automatically be reflected in the GUI. The DD Boost settings in the Data Domain web manager are under Data Management > DD Boost > Settings. So, if you choose not to add the user from the command line, you can do the same from here.


9. Now, you need to grant this ddboost user admin rights; otherwise, you will get the following error when connecting the Data Domain to the VDP appliance:


To grant the ddboost user admin rights from the web GUI, navigate to System Settings > Access Management > Local Users. Select the ddboost user, click Edit, choose admin from the Management Role drop-down and click OK.


10. Post this, there is one more small configuration required for connecting the VDP appliance. This would be the NTP settings. From the data domain web GUI navigate to System Settings > General Configuration > Time and Date Settings and add your NTP Server. 

11. With this completed, now we will go ahead and connect this data domain to the VDP appliance. 
Login to the vdp-configure page at:
https://vdp-ip:8543/vdp-configure
12. Click Storage > Gear icon > Add Data Domain


13. Provide the primary IP of the Data Domain, the ddboost user name and password, and select Enable Checkpoint Copy.

14. Enter the community string for the SNMP configuration.


Once the data domain addition completes successfully, you should be able to see the data domain storage details under the "Storage" section:


If you SSH into the VDP appliance and navigate to /usr/local/avamar/var, you will see a file called ddr_info which has the information regarding the data domain. 
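Since the VDP appliance is a regular SUSE Linux box, a plain cat is enough to take a quick look at its contents (the path is the one mentioned above):
# cat /usr/local/avamar/var/ddr_info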

This is the basic setup for connecting virtual edition of data domain to a VDP appliance. 


Understanding VDP Backups In A Data Domain Mtree

This is going to be a high-level view of how to find out which backups on the Data Domain relate to which clients on the VDP. Previously, we saw how to deploy and configure a Data Domain and connect it to a vSphere Data Protection 5.8 appliance. I am going to discuss what an MTree is and some details around it, as this is needed for the next article, which will cover migrating VDP from 5.8/6.0 to 6.1.

In a Data Domain file system, an MTree is created under the Avamar ID node of the respective appliance to store the VDP backup files, checkpoint data and Data Domain snapshots.

Now, on the data domain appliance, the below command needs to be executed to display the mtree list. 
# mtree list
The output is seen as:

Name                             Pre-Comp (GiB)   Status
------------------------------   --------------   ------
/data/col1/avamar-1475309625               32.0   RW
/data/col1/backup                           0.0   RW
------------------------------   --------------   ------

So the Avamar node ID is 1475309625. To confirm this is the same MTree node created for your VDP appliance, run the following on the VDP appliance:
# avmaint config --ava | grep -i "system"
The output is:
  systemname="vdp58.vcloud.local"
  systemcreatetime="1475309625"
  systemcreateaddr="00:50:56:B9:54:56"

The system create time is nothing but the Avamar ID. Using these two commands, you can confirm which VDP appliance corresponds to which MTree on the Data Domain.

Now, I mentioned earlier that the MTree is where your backup data, VDP checkpoints and other related files reside. The next question is how to see the directories under this MTree. To list them, run the following command on the Data Domain appliance.
 # ddboost storage-unit show avamar-<ID>
The output is:

Name                Pre-Comp (GiB)   Status   User
-----------------   --------------   ------   ------------
avamar-1475309625             32.0   RW       ddboost-user
-----------------   --------------   ------   ------------
cur
ddrid
VALIDATED
GSAN
STAGING
  • The cur ("current") directory is where all your backup data is stored.
  • The VALIDATED folder is where the validated checkpoints reside.
  • The GSAN folder contains the checkpoints that were copied over from the VDP appliance. Remember, in the previous article we checked the "Enable Checkpoint Copy" option; this is what copies the daily checkpoints generated on VDP over to the Data Domain. Why is this required? We will look into this in more detail during the migrate operation.
  • The STAGING folder is where in-progress backup jobs are saved. Once a backup job completes successfully, its data is moved to the cur directory. If the backup job fails, it remains in the STAGING folder and is cleared out during the next GC on the Data Domain.
Now, as mentioned before, DD OS does not provide the full set of Linux commands, which is why you have to enter SE (System Engineering) mode and enable the bash shell to get the commands needed to browse and modify directories.

Please note: this is meant to be handled by an EMC technician only. All the information I am displaying here is purely from my lab. Try this at your own risk. If you are uncomfortable, stop now and involve EMC support.

To enable the bash shell, we first have to enter SE mode. The SE password is your system serial number, which can be obtained with the below command:
# system show serialno
The output is similar to:
Serial number: XXXXXXXXXXX

Enter SE mode using the below command and provide the serial number as the password when prompted:
 # priv set se
Once the password is accepted, you can see the prompt change from sysadmin@data-domain to SE@data-domain.

Now, we need to enable the bash shell for your data domain. Run these commands in the same order:

1. Display the OS information using:
# uname
You will see:
Data Domain OS 5.5.0.4-430231

2. Check the file system status (fi st is the abbreviated form of filesys status):
# fi st
You will see:
The filesystem is enabled and running.

3. Run the below command to show the filesystem space:
 # filesys show space
You will see

Active Tier:
Resource           Size GiB   Used GiB   Avail GiB   Use%   Cleanable GiB*
----------------   --------   --------   ---------   ----   --------------
/data: pre-comp           -       32.0           -      -                -
/data: post-comp      404.8        2.0       402.8     0%              0.0
/ddvar                 49.2        2.3        44.4     5%                -
----------------   --------   --------   ---------   ----   --------------
 * Estimated based on last cleaning of 2016/10/04 06:00:58.

4. Press "Ctrl+C" three times and then type shell-escape
This enters you to the bash shell and you will see the following screen.

*************************************************************************
****                            WARNING                              ****
*************************************************************************
****   Unlocking 'shell-escape' may compromise your data integrity   ****
****                and void your support contract.                  ****
*************************************************************************
!!!! datadomain YOUR DATA IS IN DANGER !!!! #

Again, proceed at your own risk and 100^10 percent, involve EMC when you do this. 

You saw the mtree was located at the path /data/col1/avamar-ID. The data partition is not mounted by default and needs to be mounted and unmounted manually. 

To mount the data partition run the below command:
# mount localhost:/data /data
The command returns to the prompt without showing any output. Once the partition has been mounted successfully, you can use your regular Linux commands to browse the MTree.
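When you are done browsing, remember to unmount the data partition again; a plain umount from the same bash shell, mirroring the mount command above, should do it:
# umount /data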

So, a cd to the /data/col1/avamar-ID will show the following:

drwxrwxrwx  3 ddboost-user users 167 Oct  1 01:46 GSAN
drwxrwxrwx  3 ddboost-user users 190 Oct  2 02:40 STAGING
drwxrwxrwx  9 ddboost-user users 563 Oct  3 20:33 VALIDATED
drwxrwxrwx  4 ddboost-user users 279 Oct  2 02:40 cur
-rw-rw-rw-  1 ddboost-user users  40 Oct  1 01:43 ddrid

As mentioned before, the "cur" directory has all your successfully backed-up data. If you change directory to cur and do an "ls", you will find the following:

drwxrwxrwx  4 ddboost-user users 229 Oct  2 07:30 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

Now, this is the Client ID (CID) of the client (VM) that was successfully backed up by VDP.
To find which client on VDP corresponds to which CID on the Data Domain, there are two simple commands.

To understand this, I presume you have a fair idea of what MCS and GSAN are on vSphere Data Protection. The GSAN node is responsible for storing all of the actual backup data when you use local VMDK storage. If your VDP is connected to a Data Domain, the GSAN holds only the metadata of the backups and not the actual backup data (which resides on the Data Domain).
The MCS, in brief, waits for the work order and calls avagent and avtar to perform the backup. If the MCS detects that a Data Domain is connected, it talks to the DD using the public/private key pair (also called the SSH keys) to perform the regular maintenance tasks.

So first, we will run the avmgr command (avmgr is GSAN-specific and will not work if GSAN is not running) to display the client ID on the GSAN node. The command is:
# avmgr getl --path=/VC-IP/VirtualMachines
The output is:

1  Request succeeded
1  RHEL_UDlVr74uB7JdXN8jgjRLlQ  location: 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d      pswd: 0d0d7c6b09f2a2234c108e4f0647c277e8bf2562

The location value, 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d, is the Client ID on the GSAN for the client RHEL (a virtual machine).

Then, we will run the mccli command (mccli is MCS-specific and needs the MCS to be up and running) to display the client ID on the MCS server. The command is:
# mccli client show --domain=/VC-IP/VirtualMachines --name="Client_name"
For example,
# mccli client show --domain=/192.168.1.1/VirtualMachines --name="RHEL_UDlVr74uB7JdXN8jgjRLlQ"
The output is fairly detailed; the particular line we are interested in is:
CID                      5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

So, we see the client ID on data domain = client ID on the GSAN = client ID on the MCS

Here, if the client ID on GSAN does not match the client ID on MCS, then your full VM restores and File Level Restores will not work. In case of a mismatch, the CID has to be corrected to get the restores working.

Now, back to the data domain end, we were under the cur directory, right? Next, I will change directory to the CID

# cd 5890c0677a03211b49a9cf08bf1dcebd2d7cd77d

I will then do another "ls" to list the sub directories under it, and you may or may not notice the following:

drwxrwxrwx  2 ddboost-user users 1.2K Oct  2 02:55 1D21C9327C2E4C6
drwxrwxrwx  2 ddboost-user users 1.4K Oct  2 07:30 1D21CB99431214C

If you see one folder, which is a sub client ID, it means only one backup has been executed and completed successfully for this virtual machine. If you see multiple folders, multiple backups have completed for this VM.

To find out which backup was done first and which were the subsequent ones, we have to query the GSAN since, as mentioned, the GSAN holds the metadata of the backups.

Hence, on the VDP appliance, run the below command:
# avmgr getb --path=/VC-IP/VirtualMachines/Client-Name --format=xml
For example:
# avmgr getb --path=/192.168.1.1/VirtualMachines/RHEL_UDlVr74uB7JdXN8jgjRLlQ --format=xml
The output will be:

<backuplist version="3.0">

  <backuplistrec flags="32768001" labelnum="2" label="RHEL-DD-Job-RHEL-DD-Job-1475418600010" created="1475418652" roothash="505f1aba07f19d64df74670afa59ed39a3ece85d" totalbytes="17180938240.00" ispresentbytes="0.00" pidnum="1016" percentnew="0" expires="1476282600" created_prectime="0x1d21cb99431214c" partial="0" retentiontype="daily,weekly,monthly,yearly" backuptype="Full" ddrindex="1" locked="1"/>
  
  <backuplistrec flags="16777217" labelnum="1" label="RHEL-DD-Job-1475401181065" created="1475402150" roothash="22dc0dddea797d909a2587291e0e33916c35d7a2" totalbytes="17180938240.00" ispresentbytes="0.00" pidnum="1016" percentnew="0" expires="1476265181" created_prectime="0x1d21c9327c2e4c6" partial="0" retentiontype="none" backuptype="Full" ddrindex="1" locked="0"/>
</backuplist>

Looks confusing? Maybe, let's look at specific fields:

The labelnum field shows the order of the backups: labelnum=1 is the first backup, 2 the second, and so on.

roothash is the hash value of the backup job. The next time an incremental backup runs, the existing hashes are checked and DD Boost backs up only the new hashes. The atomic hashes are then combined to form one unique root hash, so the root hash of each backup is unique.

created_prectime is the main field we need. This is what we referred to earlier as the sub client ID.
For labelnum=1, we see the sub CID is 0x1d21c9327c2e4c6
For labelnum=2, we see the sub CID is 0x1d21cb99431214c
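If you just want the label-to-sub-CID mapping without reading the full XML, a quick filter works; this is simply plain GNU grep on the VDP appliance, reusing the same example path as above:
# avmgr getb --path=/192.168.1.1/VirtualMachines/RHEL_UDlVr74uB7JdXN8jgjRLlQ --format=xml | grep -o 'labelnum="[0-9]*"\|created_prectime="[^"]*"'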

Now, let's go further into the sub CID. For example, if I cd into 1D21C9327C2E4C6 (the directory corresponding to 0x1d21c9327c2e4c6) and perform an "ls", I will see the following:

-rw-rw-rw-  1 ddboost-user users  485 Oct  2 02:40 1188BE924964359A5C8F5EAEF552E523FBA83566
-rw-rw-rw-  1 ddboost-user users 1.1K Oct  2 02:40 140A189746A6EC3C49D24EA43A7811205345F1F4
-rw-rw-rw-  1 ddboost-user users 3.8K Oct  2 02:40 2CE724F2760C46CB67F679B76657C23606C06869
-rw-rw-rw-  1 ddboost-user users 2.5K Oct  2 02:40 400206DF07A942C066971D84F0CF063D2DE50F08
-rw-rw-rw-  1 ddboost-user users 1.0M Oct  2 02:55 4F50E1E506477801D0A566DEE50E5364B0F04BF0
-rw-rw-rw-  1 ddboost-user users  451 Oct  2 02:55 79DDA236EEEF192EED66CF605CD710B720A41E1F
-rw-rw-rw-  1 ddboost-user users 1.1K Oct  2 02:55 AFB6C8621EB6FA86DD8590841F80C7C78AC7BEEC
-rw-rw-rw-  1 ddboost-user users 1.9K Oct  2 02:40 B17DD9B7E8B2B6EE68294248D8FA42A955539C4C
-rw-rw-rw-  1 ddboost-user users  16G Oct  2 02:55 B212DB46684FFD5AFA41B87FD71A44469B04A38C
-rw-rw-rw-  1 ddboost-user users   15 Oct  2 02:40 D2CFFD87930DAEABB63EAEAA3C8C2AA9554286B5
-rw-rw-rw-  1 ddboost-user users 9.4K Oct  2 02:40 E2FF0829A0F02C1C6FA4A38324A5D9C23B07719B
-rw-rw-rw-  1 ddboost-user users 3.6K Oct  2 02:55 ddr_files.xml

Now, there is a main record file called ddr_files.xml. This file holds the information about what each of the other files in this directory is for.

So if I take the first hex name and grep for it in ddr_files.xml, I see the following:
# grep -i 1188BE924964359A5C8F5EAEF552E523FBA83566 ddr_files.xml
The output of interest is:
clientfile="virtdisk-descriptor.vmdk"

So this is the vmdk descriptor file that was backed up.

Similarly,
# grep -i 400206DF07A942C066971D84F0CF063D2DE50F08 ddr_files.xml
The output of interest is:
clientfile="vm.nvram"

And one more example:
# grep -i 4F50E1E506477801D0A566DEE50E5364B0F04BF0 ddr_files.xml
The output of interest is:
clientfile="virtdisk-flat.vmdk"

So if the VM file IDs are not populated correctly in ddr_files.xml, then again your restores will not work. Engage EMC to get this corrected, because, I am stressing again, do not fiddle with this in your production environment.

That's pretty much it for this. If you have questions feel free to comment or in-mail. The next article is going to be about Migrating VDP 5.8/6.0 to 6.1 with a data domain.

vSphere Data Protection /data0? Partitions Are 100 Percent.

VDP can be connected to a Data Domain or use a local deduplication store to hold all the backup data. This article specifically discusses the case where VDP is connected to a Data Domain. As far as the deployment process goes, a VDP with a Data Domain attached to it still has local data partitions as well. The sda device holds the OS partitions, and sdb, sdc and so on are the data partitions (Hard disk 1, 2, 3 and so on).

These partitions, data01, data02, ... (grouped as data0?) contain the metadata of the backups stored on the Data Domain. So, if you cd to /data01/cur and do an "ls", you will see the metadata stripe files.

0000000000000000.tab 
0000000000000008.wlg 
0000000000000012.cdt 
0000000000000017.chd
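A quick way to see how full these partitions are is the usual df, run on the VDP appliance (the same check we will use later to confirm the cleanup):
# df -h | grep data0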

Before we get into the cause of this issue, let's have a quick look at what a retention policy is. When you create a backup job for a client or a group of clients, you define a retention policy for the restore points. The retention policy determines how long restore points are kept after a backup. The default is 60 days and can be adjusted as needed.

Once a restore point reaches its expiration date, it is deleted. Then, during the maintenance window, Garbage Collection (GC) is executed, which performs the space reclamation. If you run status.dpn, you will notice a "Last GC" entry along with the amount of space that was reclaimed.

Space reclamation by GC is done only on the data0? partitions. So, if your data0? partitions are at 100 percent, there are a few possible explanations.

1. The retention period for all backups is set to "Never Expire", which is not recommended.
2. The GC was not executed at all during the maintenance window. 

If your backups are set to never expire, go ahead and set an expiration date for them; otherwise your data0? partitions will frequently hit 100 percent space usage.

To check if your GC was executed successfully or not, run the below command:
# status.dpn
The output to look at is the "Last GC" line. You will either see an error here, such as DDR_ERROR, or find that the last GC ran weeks ago.

Also, if you log in to the vdp-configure page, you may notice that the maintenance services are not running. If that is the case, the space reclamation task will not run, and if the space reclamation task is not running, the metadata for expired backups is never cleared.

To understand why this happens, let's have a basic look at how the MCS talks to the Data Domain. The MCS runs on the VDP appliance. If there is a Data Domain attached to the appliance, the MCS queries the Data Domain via the DD SSH keys.

This means there is a private/public key pair for this connection: the VDP appliance holds the private key (ddr_key), and the Data Domain needs to hold the matching public key. With this key pair in place, the MCS does not need password authentication to connect to the Data Domain.

You can do a simple test to see if this is working by performing the below steps:

1. On the VDP appliance, start an ssh-agent shell and add the private key.
# ssh-agent bash
# ssh-add ~admin/.ssh/ddr_key
2. Once the key is added, you can log in to the Data Domain directly from the VDP shell without a password. This is how the MCS works too.
# ssh sysadmin@192.168.1.200
Two outcomes here: 

1. If there is no prompt to enter a password, it will directly connect you to the Data Domain console, and we are good to go. 

2. It will prompt you to enter a passphrase and/or a password to login to Data Domain. If you run into this issue, then it means that the SSH public keys for VDP are not loaded / unavailable on the Data Domain end.

For this issue, we will most likely be hitting Outcome (2).

How to verify public key availability on data domain end:

1. On the VDP appliance run the following command to list the public key:
# cat .ssh/ddr_key.pub
The output would be similar to:

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAw7XWjEK0jVPrT0z6JDmdKUDLfvvoizdzTpWPoCWNhJ/LerUs9L4UkNr0Q0mTK6U1tnlzlQlqeezIsWvhYJHTcU8rh
yufw1/YZLoGeA0tsHl6ruFAeCIYuf5+mmLXluPhYrjGMdsDa6czjIAtoA4RMY9WjAtSOPX3L2B73Wf3BScigzC/D83aX8GnaldwQU88qkfmhN+dpy2IdxiFm4
hnK+2m4XMtveBTq/8/7medeBTMXYYe7j7DVffViU4DizeEpGj2TBxHIe2dGe0epFDDc9wpa8W5a/XPOeiz4WelHfKtqS1hYUpFEQWXUOngwjDPpqG+6k1t
1HoOp/+OVC3lGw== admin@vmrel-ts2014-vdp

2. On the Data Domain end, run the following command:
# adminaccess show ssh-keys user <ddboost-user>
You can enter your custom ddboost user, or sysadmin if that account itself was promoted to be the ddboost user.

In our case, the above-mentioned public key will not be present in this list.
The DD has its own keys and the VDP has its private key, but the VDP's public key is not available on the Data Domain end, which leads to a password prompt when connecting over SSH from the VDP to the DD. Because of this, the GC will not run, as the MCS ends up waiting for a manual password entry.

To fix this:

1. Copy the public key of the VDP appliance obtained from the "cat" command mentioned earlier. Copy the entire string, starting from and including ssh-rsa through to the trailing user@hostname (admin@vmrel-ts2014-vdp in my case).
Make sure no extra spaces are copied, else this will not work.

2. Login to DD with sysadmin and run the following command:
# adminaccess add ssh-keys user <ddboost-user>
You will see a prompt like below:

Enter the key and then press Control-D, or press Control-C to cancel.

Then, paste the copied key and press Ctrl+D (you will see the "SSH key accepted" message):

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAw7XWjEK0jVPrT0z6JDmdKUDLfvvoizdzTpWPoCWNhJ/LerUs9L4UkNr0Q0mTK6U1tnlzlQlqeezIsWvhYJHTcU8rh
yufw1/YZLoGeA0tsHl6ruFAeCIYuf5+mmLXluPhYrjGMdsDa6czjIAtoA4RMY9WjAtSOPX3L2B73Wf3BScigzC/D83aX8GnaldwQU88qkfmhN+dpy2IdxiFm4
hnK+2m4XMtveBTq/8/7medeBTMXYYe7j7DVffViU4DizeEpGj2TBxHIe2dGe0epFDDc9wpa8W5a/XPOeiz4WelHfKtqS1hYUpFEQWXUOngwjDPpqG+6k1t
1HoOp/+OVC3lGw== admin@vmrel-ts2014-vdp
SSH key accepted.
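As an alternative to copying and pasting the key (and risking stray spaces), you can pipe it from the VDP appliance in one go. This is a hedged sketch; it assumes your DD OS build reads the key from standard input exactly as the interactive prompt above does:
# ssh sysadmin@192.168.1.200 adminaccess add ssh-keys user <ddboost-user> < ~admin/.ssh/ddr_key.pub
You will be prompted once for the sysadmin password, after which you should see the same "SSH key accepted" message.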

3. Now test the login from VDP to DD using ssh sysadmin@192.168.1.200 and you should be directly connected to the data domain.

Even though we have re-established MCS connectivity to the DD, we now have to manually run a garbage collection to force-clear the expired metadata.

You first have to stop the backup scheduler and the maintenance service, else you will receive the below error when trying to run GC:
ERROR: avmaint: garbagecollect: server_exception(MSG_ERR_SCHEDULER_RUNNING)

To stop the backup scheduler and maintenance service:
# dpnctl stop maint
# dpnctl stop sched
Then, run the below command to force start a GC:
# avmaint garbagecollect --timeout=<how many seconds should GC run> --ava
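A concrete example from my lab (the timeout value of 7200 seconds is arbitrary; size it to your maintenance window), and remember to bring the services back once the GC has finished:
# avmaint garbagecollect --timeout=7200 --ava
# dpnctl start maint
# dpnctl start sched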
4. Run df -h again; the space usage should be reduced considerably, provided all the backups have a sensible retention policy set.


**If you are unsure about this process, open a ticket with VMware to drive this further**

VDP Reports Incorrect Information About Protected Clients

When you connect to vSphere Data Protection in the web client, switch to the Reports tab and select Unprotected Clients, you will see a list of VMs that are not protected by VDP. When I say not protected by VDP, I mean they are not added to any backup job on that particular appliance.

In some cases, you will see a virtual machine still listed under the Unprotected Clients section even though the VM is already added to a backup job. This mostly occurs when the virtual machine is renamed. When a rename is done on the virtual machine, the backup job picks up the new name, but the Unprotected Clients list under the Reports tab does not.

Here is the result of a small test.

1. I have a backup job called "Windows" and a VM called "Windows_With_OS" is added under it. 


2. In the Unprotected Client section, you can see that this "Windows_With_OS" VM is not listed as it is already protected. 


3. Now, I will re-name this virtual machine in my vSphere Client to "Windows_New"


4. Back in the vSphere Data Protection, you can see the name is updated in the backup job list, but not in the Reporting Tab.


You can see that Windows_New is now coming up under Unprotected Clients even though it is already protected. (Ignore the vmx file name as this is renamed for other purposes)


This is an incorrect report; the VDP appliance should sync these naming changes with vCenter automatically. Restarting the services, the proxy, or even the entire appliance will not fix this report.

This can also be confirmed from the virtual machine name recorded in MCS and GSAN. To check this:

1. Open an SSH/PuTTY session to the VDP appliance. Login as admin and elevate to root.
2. Run the below command:
# mccli client show --recursive=true


If you observe here, the MCS still reports the old virtual machine name (mccli returns MCS-related information only).

3. If you check what the GSAN shows, run the below command:
# avmgr getl --path=/vCenter-IP/Virtual-Machine-Domain
The vCenter IP and VM domain can be found from the above mccli command which in my case is /192.168.1.1/VirtualMachines. The output is:


The avmgr command returns GSAN-related information only, and it too shows the Client ID registered against the VM's older name.

So the VDR server naming is out of sync with what the MCS and GSAN hold.

The solution:

You will have to force sync the naming changes between the Avamar server and the vCenter Server. To do this, you will need the proxycp.jar file which can be downloaded from here

A brief word about proxycp.jar: it is a Java archive containing a set of built-in commands that can be used to automate or run specific tasks from the command line. Some of these tasks would otherwise require changes in multiple locations and numerous files, and proxycp.jar does them for you by running the required commands.

1. Once you download the proxycp.jar file, open a WinSCP session to the VDP appliance and copy the file into /root or, preferably, /tmp.

2. Then SSH into the VDP appliance, change directory to where the proxycp.jar file is, and run the following command:
# java -jar proxycp.jar --syncvmnames
The output:


The In Sync column was false for the renamed virtual machine, and the "syncvmnames" switch updated this value.

3. Now if I go back to the Unprotected Clients list, this VM is no longer listed, and running the mccli and avmgr commands mentioned earlier shows the updated name.

If something is a bit off for you in this case, feel free to comment.

MCS Fails To Start On VDP. ERROR: gsan rollbacktime: xxxxxxx does not match stored rollbacktime: xxxxxxxx

Recently while working on a case, I came across the following issue. The MCS service was not coming up on a newly deployed VDP with existing drives. If I tried to start the MCS manually, the error received during this process was:

root@vdp58:#: dpnctl start mcs

Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-26291
dpnctl: ERROR: error return from "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start" - exit status 1
dpnctl: ERROR: 1 error seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

And if I tailed the log that was displayed during the start attempt:
tail -f /tmp/dpnctl-mcs-start-output-26291

The actual error message was displayed:
ERROR: gsan rollbacktime: 1475722913 does not match stored rollbacktime: 1475722911

This occurs when GSAN has rolled back to a particular checkpoint but the MCS has not. 
Since these are not on the same rollbacktime the MCS service will not start. 

There are a couple of fixes available for this, and I recommend trying them in the following order.

Fix 1:
Restore MCS

Run the below command to begin the MCS restore:
# dpnctl start mcs --force_mcs_restore
In most cases, this too fails. For me, it did, with the error:

root@vdp58:#: dpnctl start mcs --force_mcs_restore

Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Restoring MCS data...
dpnctl: ERROR: 1 error seen in output of "[ -r /etc/profile ] && . /etc/profile ; echo 'Y' | /usr/local/avamar/bin/mcserver.sh --restore --id='root' --hfsport='27000' --hfsaddr='192.168.1.203' --password='*************'"
dpnctl: ERROR: MCS restore did not succeed, so not restarting MCS
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

If this worked for you and the MCS is restored and started successfully, then stop here. Else, move further. 

Fix 2:
Restore MCS to an older Flush

Basically, your MCS data is regularly backed up, and this is what is called an MCS flush. It protects the MCS from server or hardware failures.
The MCS flushes its data to the Avamar server every 60 minutes as part of the system checkpoints. This is why I recommend rolling back to an MCS flush that has a valid local checkpoint on that VDP server. The older the MCS flush you roll back to, the more MCS data is lost.

The local checkpoints in my case were:

root@vdp58:#: cplist

cp.20161020033059 Thu Oct 20 09:00:59 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161020033339 Thu Oct 20 09:03:39 2016   valid --- ---  nodes   1/1 stripes     25

To list your MCS Flush, run the below command:
avtar --archives --path=MC_BACKUPS
The output is similar to:

   Date      Time    Seq       Label           Size     Plugin    Working directory         Targets
 ---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
 2016-10-20 15:25:20   372                      369201K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 14:45:20   371                      368582K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 13:45:18   370                      367645K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 12:45:17   369                      366716K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 11:45:19   368                      365779K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 10:45:17   367                      364842K Linux    /usr/local/avamar     var/mc/server_data
 2016-10-20 09:45:17   366                      363762K Linux    /usr/local/avamar     var/mc/server_data

Here the numbers 372, 371, ... are the MCS flush labels. The list goes all the way back to the day the VDP appliance was deployed.

I will rollback my appliance to Label 366

The command would be:
mcserver.sh --restore --labelnum=<flush_ID>
In my case:
mcserver.sh --restore --labelnum=366
This starts a small interactive script where you need to accept the restore and provide the VDP IP to proceed. Sample output:

root@vdp58:#: mcserver.sh --restore --labelnum=366

mcserver.sh must be run as admin, please login as admin and retry
root@vdp58:/usr/local/avamar/var/log/#: su admin
admin@vdp58:/usr/local/avamar/var/log/#: mcserver.sh --restore --labelnum=366
=== BEGIN === check.mcs (prerestore)
check.mcs                        passed
=== PASS === check.mcs PASSED OVERALL (prerestore)
--restore will modify your Administrator Server database and preferences.
Do you want to proceed with the restore Y/N? [Y]: y
Enter the Avamar Server IP address or fully qualified domain name to
restore from (i.e. dpn.your_company.com): 192.168.1.203
Enter the Avamar Server IP port to restore from [27000]:

The port defaults to 27000. After this, you will see lengthy logging from the mcsrestore task, which makes changes to your MCS database.

If the restore to an older flush completes successfully, then start the MCS using:
mcserver.sh --start --verbose
This started the MCS successfully for me.

Now, I have also worked on a case where the mcserver restore to an older flush completed with errors/warnings, causing mcserver.sh --start to fail with the same error:

ERROR: gsan rollbacktime: 1475722913 does not match stored rollbacktime: 1475722911

You can try rolling back to an even older MCS flush and see how that goes, but the chances of the MCS ever coming up are slim.

So if this fails, move to the next step:

Fix 3:
Update the MCS Database Manually. 

The last fix for this is to manually update the MCS database with the correct rollbacktime.

**This is a very tricky fix, and is not a best practice or a recommended method. If you are running a lab environment, then go ahead and try this. If you have production data at stake, stop! Involve EMC to check for other alternatives**

With that out of the way, the final fix would be in the order.

1. Connect to the MCS database. 

VDP is a SUSE box and it runs a PostgreSQL database, so the command is the standard psql connection string:
psql -p 5555 -U admin mcdb

The port for the MCS database is 5555.
We connect as the admin user because we want to make changes to the MCS database. If you only want a read-only view, use the "viewuser" account to connect to "mcdb".
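For a purely read-only session, the equivalent connection using that viewuser account would be:
psql -p 5555 -U viewuser mcdb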

2. Once you connect, you see the following message:

admin@vdp58:#: psql -p 5555 -U admin mcdb

Welcome to psql 8.3.23, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

3. Run \d to list the MCS tables. The one we are interested in is "property_value"

4. Run the below query to list all the contents of this table:
select * from property_value;
The output is similar to:

     property       |            value
---------------------+------------------------------
 morning_cron_start  | -1
 evening_cron_start  | -1
 mcsnmp_cron_start   | 1
 clean_db_cron_start | 3
 rollbacktime        | 1475722911
 systemid            | 1476126720@00:50:56:B9:3E:6D
 hfscreatetime       | 1476126720
 systemname          | vdp58.vcloud.local
 restoredFlushTime   | 2016-10-10 19:45:00 PDT
 license_period_day  | 14
 license_buffer_pct  | 10
(11 rows)

The row we are interested in is rollbacktime. Here we see the rollbacktime is 1475722911, which does not match the GSAN rollbacktime of 1475722913.

5. To update this, run the below query:
update property_value set value = <GSAN_rollbacktime> where property = 'rollbacktime';
So my query would look like:
update property_value set value = 1475722913 where property = 'rollbacktime';
Verify that the rollbacktime property is now updated with the correct GSAN rollbacktime.
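A quick way to verify is to re-run the select, filtered on that row (plain SQL against the table shown above):
select * from property_value where property = 'rollbacktime';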

6. Switch to the admin user on the VDP appliance (su admin) and then start the MCS using:
mcserver.sh --start --verbose
This should start the MCS, since we have force-synced the rollbacktime.


If this does not work, then I do not know what else will. 

VDP Stuck In A Configuration Loop

There have been a few cases logged with VMware where the newly deployed VDP appliance gets stuck in a configuration loop. Not to worry, there is now a fix for this. 

A little insight into what this is: we deploy a VDP (6.1.2 in my case) as an OVA template. The deployment goes through successfully, and we power on the VDP appliance, which also completes successfully. Then we go to the https://vdp-ip:8543/vdp-configure page and run through the configuration wizard. Everything goes well here too; the configuration wizard completes and requests a reboot of the appliance. Once the appliance is rebooted, it makes certain changes to the appliance, configures alarms and initializes the core services, and a task called "VDP: Configure Appliance" is initiated. Here, this task gets stuck somewhere around 45 to 70 percent. The appliance boots up completely, but when you go back to the vdp-configure page you will notice that it takes you through the configuration wizard again. You can get as far as the configure storage section, where you will receive an error because the appliance is already configured with the storage. And no matter which browser you use or how many times you access the vdp-configure page, you are taken back to the configuration wizard. This ends up as an infinite loop.

This issue is almost exclusively seen on the vCenter 5.5 U3e release. This is because VDP uses the JSAFE/BSAFE Java libraries, and these do not play well with the vCenter SSL ciphers in 5.5 U3e. To fix this, we switch from JSAFE to the Java JCE libraries on the VDP appliance.

Before we get to the fix, check the vdr-server.log from the time of the issue (/usr/local/avamar/var/vdr/server_logs) to verify the following:

2016-10-29 01:15:40,676 INFO  [Thread-7]-vi.ViJavaServiceInstanceProviderImpl: vcenter-ignore-cert ? true
2016-10-29 01:15:40,714 WARN  [Thread-7]-vi.VCenterServiceImpl: No VCenter found in MC root domain
2016-10-29 01:15:40,714 INFO  [Thread-7]-vi.ViJavaServiceInstanceProviderImpl: visdkUrl = https:/sdk
2016-10-29 01:15:40,715 ERROR [Thread-7]-vi.ViJavaServiceInstanceProviderImpl: Failed To Create ViJava ServiceInstance owing to Remote VCenter connection error
java.rmi.RemoteException: VI SDK invoke exception:java.lang.IllegalArgumentException: protocol = https host = null; nested exception is:
        java.lang.IllegalArgumentException: protocol = https host = null
        at com.vmware.vim25.ws.WSClient.invoke(WSClient.java:139)
        at com.vmware.vim25.ws.VimStub.retrieveServiceContent(VimStub.java:2114)
        at com.vmware.vim25.mo.ServiceInstance.<init>(ServiceInstance.java:117)
        at com.vmware.vim25.mo.ServiceInstance.<init>(ServiceInstance.java:95)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.createViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:297)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.createViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:159)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.createViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:104)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.createViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:96)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.getViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:74)
        at com.emc.vdp2.common.vi.ViJavaServiceInstanceProviderImpl.waitForViJavaServiceInstance(ViJavaServiceInstanceProviderImpl.java:212)
        at com.emc.vdp2.server.VDRServletLifeCycleListener$1.run(VDRServletLifeCycleListener.java:71)
        at java.lang.Thread.run(Unknown Source)

Caused by: java.lang.IllegalArgumentException: protocol = https host = null
        at sun.net.spi.DefaultProxySelector.select(Unknown Source)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
        at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(Unknown Source)
        at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(Unknown Source)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(Unknown Source)
        at com.vmware.vim25.ws.WSClient.post(WSClient.java:216)
        at com.vmware.vim25.ws.WSClient.invoke(WSClient.java:133)
        ... 11 more

2016-10-29 01:15:40,715 INFO  [Thread-7]-vi.ViJavaServiceInstanceProviderImpl: Retry ViJava ServiceInstance Acquisition In 5 Seconds...
2016-10-29 01:15:45,716 INFO  [Thread-7]-vi.ViJavaServiceInstanceProviderImpl: vcenter-ignore-cert ? true
2016-10-29 01:15:45,819 WARN  [Thread-7]-vi.VCenterServiceImpl: No VCenter found in MC root domain

To fix this:

1. Discard the newly deployed appliance completely. 
2. Deploy the VDP appliance again. Go through the ova deployment and power on the appliance. Stop here, do not go to the vdp-configure page.

3. To enable the Java JCE library we need to add a particular line in the mcsutils.pm file under the $prefs variable. The line is exactly as below:

. "-Dsecurity.provider.rsa.JsafeJCE.position=last "

4. vi the following file;
# vi  /usr/local/avamar/lib/mcsutils.pm
The original content would look like:

my $rmidef = "-Djava.rmi.server.hostname=$rmihost ";
   my $prefs = "-Djava.util.logging.config.file=$mcsvar::lib_dir/mcserver_logging.properties "
             . "-Djava.security.egd=file:/dev/./urandom "
             . "-Djava.io.tmpdir=$mcsvar::tmp_dir "
             . "-Djava.util.prefs.PreferencesFactory=com.avamar.mc.util.MCServerPreferencesFactory "
             . "-Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl "
             . "-Djavax.net.ssl.keyStore=" . MCServer::get( "rmi_ssl_keystore" ) .""
             . "-Djavax.net.ssl.trustStore=" . MCServer::get( "rmi_ssl_keystore" ) .""
             . "-Dfile.encoding=UTF-8 "
             . "-Dlog4j.configuration=file://$mcsvar::lib_dir/log4j.properties ";  # vmware/axis

After editing it would look like:

 my $rmidef = "-Djava.rmi.server.hostname=$rmihost ";
   my $prefs = "-Djava.util.logging.config.file=$mcsvar::lib_dir/mcserver_logging.properties "
             . "-Djava.security.egd=file:/dev/./urandom "
             . "-Djava.io.tmpdir=$mcsvar::tmp_dir "
             . "-Djava.util.prefs.PreferencesFactory=com.avamar.mc.util.MCServerPreferencesFactory "
             . "-Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl "
             . "-Djavax.net.ssl.keyStore=" . MCServer::get( "rmi_ssl_keystore" ) .""
             . "-Djavax.net.ssl.trustStore=" . MCServer::get( "rmi_ssl_keystore" ) .""
             . "-Dfile.encoding=UTF-8 "
             . "-Dsecurity.provider.rsa.JsafeJCE.position=last "
             . "-Dlog4j.configuration=file://$mcsvar::lib_dir/log4j.properties ";  # vmware/axis

5. Save the file
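Before moving on, a quick sanity check that the new line actually made it into the file (a simple grep against the same path edited above):
# grep -n "JsafeJCE.position" /usr/local/avamar/lib/mcsutils.pm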
6. There is no point restarting the MCS with mcserver.sh --restart, as the VDP appliance is not yet configured and hence the core services are not yet initialized.
7. Reboot the appliance.
8. Once the appliance has booted up, go to the vdp-configure page and begin the configuration; this should avoid the configuration loop issue.

That's it.


Migrating VDP From 5.8 and 6.0 To 6.1.x With Data Domain

You cannot upgrade a vSphere Data Protection appliance from 5.8.x or 6.0.x to 6.1.x due to the difference in the underlying SUSE Linux version: the earlier versions of vSphere Data Protection used SLES 11 SP1 while 6.1.x uses SLES 11 SP3, so we have to perform a migration instead.

This article only discusses migrating a VDP appliance from 5.8.x or 6.0.x with a Data Domain attached. For a VDP appliance without a Data Domain, you would choose the "Migrate" option in the vdp-configure wizard during the setup of the new 6.1.x appliance. However, that is not the path we follow when the destination storage is an EMC Data Domain; the migration of a VDP appliance with a Data Domain is done through a process called checkpoint restore. Let's discuss the steps below.

For this instance let's consider the following setup:
1. A vSphere Data Protection 5.8 appliance
2. A virtual edition of the EMC Data Domain appliance (the process is the same for a physical appliance as well)
3. The 5.8 VDP was deployed as a 512GB deployment.
4. The IP address of this VDP appliance was 192.168.1.203
5. The IP address of the Data Domain appliance is 192.168.1.200

Pre-requisites:
1. In point (3) above you saw that the 5.8 VDP appliance was set up with 512 GB of local drives. The first question here is: why have local drives at all when the backups reside on the Data Domain?
A vSphere Data Protection appliance with a Data Domain still has local VMDKs to store the metadata of the client backups. The actual client data is deduplicated and stored on the DD appliance, while the metadata of each backup is stored under the /data0?/cur directories on the VDP appliance. So, if your source appliance was a 512 GB deployment, the destination has to be equal to or greater than the source deployment.

2. The IP address, DNS name, domain and all other networking configuration of the destination appliance should be the same as the source's.

3. It is best to keep the same password on the destination appliance during the initial setup process.

4. On the source appliance, make sure Checkpoint Copy is enabled. To verify this, go to the https://vdp-ip:8543/vdp-configure page, select the Storage tab, click the gear icon and click Edit Data Domain; the first page displays this option. If it is not checked, the checkpoints on the source appliance are not copied over to the Data Domain, and you will not be able to perform a checkpoint restore.

The migration process:
1. Open an SSH session to the source VDP appliance and run the below command to get the checkpoint list:
# cplist

The output would be similar to:
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25
cp.20161011033312 Tue Oct 11 09:03:12 2016   valid --- ---  nodes   1/1 stripes     25

Make a note of this output.

2. Run the below command to obtain the Avamar System ID:
# avmaint config --ava | grep -i "system"
The output would be similar to:
  systemname="vdp58.vcloud.local"
  systemcreatetime="1476126720"
  systemcreateaddr="00:50:56:B9:3E:6D"

Make a note of this output as well. 1476126720 is the Avamar system ID; it is used to determine which MTree this VDP appliance corresponds to on the Data Domain.
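If you want to cross-check this on the Data Domain side before going further, the MTree listing (the same command used earlier in this series) should show a matching entry:
# mtree list
Look for /data/col1/avamar-1476126720 in the output.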

3. Run the below command to obtain the hashed Avamar root password (why we need this will be explained later):
# grep ap /usr/local/avamar/etc/usersettings.cfg
The output would be similar to:
password=6cbd70a95847fc58beb381e72600a4cb33d322cc3d9a262fdc17acdbeee80860a285534ab1427048

4. Power off the source appliance

5. Deploy VDP 6.1.x appliance via the OVF template, provide the same networking details during the ova deployment and power on the 6.1.x appliance once the ova deployment completes successfully.

6. Go to the https://vdp-ip:8543/vdp-configure page and complete the configuration process for the new appliance. As mentioned above, during the "Create Storage" section in the wizard specify the local storage space, either equal to or greater than the source VDP appliance system. Once the appliance configuration completes, it will reboot the new 6.1.x system.

7. Once the reboot has completed, open an SSH session to the 6.1.x appliance and run the below command to list the available checkpoints on the Data Domain.
# ddrmaint cp-backup-list --full --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --ddr-password=<ddboost-password>

Sample command from my lab:
# ddrmaint cp-backup-list --full --ddr-server=192.168.1.200 --ddr-user=ddboost-user --ddr-password=VMware123!
The output would be similar to:
================== Checkpoint ==================
 Avamar Server Name           : vdp58.vcloud.local
 Avamar Server MTree/LSU      : avamar-1476126720
 Data Domain System Name      : 192.168.1.200
 Avamar Client Path           : /MC_SYSTEM/avamar-1476126720
 Avamar Client ID             : 200e7808ddcde518fe08b6778567fa4f397e97fc
 Checkpoint Name              : cp.20161011033032
 Checkpoint Backup Date       : 2016-10-11 09:02:07
 Data Partitions              : 3
 Attached Data Domain systems : 192.168.1.200

The MTree name and the checkpoint name are the parts we need. avamar-1476126720 is the Avamar MTree on the Data Domain; we obtained this system ID earlier in this article. The checkpoint cp.20161011033032 is the checkpoint on the source VDP appliance that was copied over to the Data Domain.

8. Now, we will perform a cprestore to this checkpoint. The command to perform the cprestore is:
# /usr/local/avamar/bin/cprestore --hfscreatetime=<avamar-ID> --ddr-server=<data-domain-IP> --ddr-user=<ddboost-user-name> --cptag=<checkpoint-name>

Sample command from my lab:
# /usr/local/avamar/bin/cprestore --hfscreatetime=1476126720 --ddr-server=192.168.1.200 --ddr-user=ddboost-user --cptag=cp.20161011033032
Here, 1476126720 is the Avamar system ID and cp.20161011033032 is a valid checkpoint. Do not roll back to a checkpoint that is not valid. If the checkpoint is not validated, you will have to run an integrity check on the source VDP appliance to generate a valid checkpoint and have it copied over to the Data Domain system.

The output would be:
Version: 1.11.1
Current working directory: /space/avamar/var
Log file: cprestore-cp.20161011033032.log
Checking node type.
Node type: single-node server
Create DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@192.168.1.200 nfs add /data/col1/avamar-1476126720/GSAN 192.168.1.203 "(ro,no_root_squash,no_all_squash,secure)"
Execute: ssh ddboost-user@192.168.1.200 nfs add /data/col1/avamar-1476126720/GSAN 192.168.1.203 "(ro,no_root_squash,no_all_squash,secure)"
Warning: Permanently added '192.168.1.200' (RSA) to the list of known hosts.
Data Domain OS
Password:

Enter the Data Domain password when prompted. Once the password is authenticated, the cprestore starts; it copies the metadata of the backups for the selected checkpoint onto the 6.1.x appliance.

The output would be similar to:
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.chd' -> '/data01/cp.20161011033032/0000000000000015.chd'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/0000000000000019.wlg' -> '/data02/cp.20161011033032/0000000000000019.wlg'
[Thu Oct  6 08:24:44 2016] (22497) 'ddnfs_gsan/cp.20161011033032/data01/0000000000000015.wlg' -> '/data01/cp.20161011033032/0000000000000015.wlg'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000014.wlg' -> '/data03/cp.20161011033032/0000000000000014.wlg'
[Thu Oct  6 08:24:44 2016] (22498) 'ddnfs_gsan/cp.20161011033032/data02/checkpoint-complete' -> '/data02/cp.20161011033032/checkpoint-complete'
[Thu Oct  6 08:24:44 2016] (22499) 'ddnfs_gsan/cp.20161011033032/data03/0000000000000016.chd' -> '/data03/cp.20161011033032/0000000000000016.chd'

This continues until all the metadata has been copied over. The duration of the cprestore process depends on the amount of backup data. Once the process is complete, you will see the below messages.

Restore data01 finished.
Cleanup restore for data01
Changing owner/group and permissions: /data01/cp.20161011033032
PID 22497 returned with exit code 0
Restore data03 finished.
Cleanup restore for data03
Changing owner/group and permissions: /data03/cp.20161011033032
PID 22499 returned with exit code 0
Finished restoring files in 00:00:04.
Restoring ddr_info.
Copy: 'ddnfs_gsan/cp.20161011033032/ddr_info' -> '/usr/local/avamar/var/ddr_info'
Unmount NFS path 'ddnfs_gsan' in 3 seconds
Execute: sudo umount "ddnfs_gsan"
Remove DD NFS Export: data/col1/avamar-1476126720/GSAN
ssh ddboost-user@192.168.1.200 nfs del /data/col1/avamar-1476126720/GSAN 192.168.1.203
Execute: ssh ddboost-user@192.168.1.200 nfs del /data/col1/avamar-1476126720/GSAN 192.168.1.203
Data Domain OS
Password:
kthxbye

Once the data domain password is entered, the cprestore process completes with a kthxbye message.

9. Run the # cplist command on the 6.1.x appliance and you should notice that the checkpoint displayed in the cp-backup-list output is now listed among the 6.1.x checkpoints:

cp.20161006013247 Thu Oct  6 07:02:47 2016   valid hfs ---  nodes   1/1 stripes     25
cp.20161011033032 Tue Oct 11 09:00:32 2016   valid rol ---  nodes   1/1 stripes     25

cp.20161006013247 is the 6.1.x appliance's own local checkpoint, and cp.20161011033032 is the source appliance's checkpoint copied over from the Data Domain during the cprestore.

10. Once the restore is complete, we need to perform a rollback to this checkpoint. So first, you will have to stop all core services on the 6.1.x appliance using the below command:
# dpnctl stop
11. Initiate the force rollback using the below command:
# dpnctl start --force_rollback

You will see the following output:
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Action: starting all
Have you contacted Avamar Technical Support to ensure that this
  is the right thing to do?
Answering y(es) proceeds with starting all;
          n(o) or q(uit) exits
y(es), n(o), q(uit/exit):

Select yes (y) to initiate the rollback. The next set of output you will see is:

dpnctl: INFO: Checking that gsan was shut down cleanly...
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
Here is the most recent available checkpoint:
  Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)
A rollback was requested.
The gsan was shut down cleanly.

The choices are as follows:
  1   roll back to the most recent checkpoint, whether or not validated
  2   roll back to the most recent validated checkpoint
  3   select a specific checkpoint to which to roll back
  4   restart, but do not roll back
  5   do not restart
  q   quit/exit

Choose option 3 and the next set of output you will see is:

Here is the list of available checkpoints:

     2   Thu Oct  6 01:32:47 2016 UTC Validated(type=full)
     1   Tue Oct 11 03:30:32 2016 UTC Validated(type=rolling)

Please select the number of a checkpoint to which to roll back.

Alternatively:
     q   return to previous menu without selecting a checkpoint
(Entering an empty (blank) line twice quits/exits.)

In the earlier cplist output, you will notice that cp.20161011033032 has a time-stamp of Oct 11. So choose option (1) and the next output you will see is:
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
You have selected this checkpoint:
  name:       cp.20161011033032
  date:       Tue Oct 11 03:30:32 2016 UTC
  validated:  yes
  age:        -7229 minutes

Roll back to this checkpoint?
Answering y(es)  accepts this checkpoint and initiates rollback
          n(o)   rejects this checkpoint and returns to the main menu
          q(uit) exits

Verify that this is indeed the intended checkpoint and proceed with yes (y) upon confirmation. The GSAN and MCS rollback begins and you will notice this in the console:

dpnctl: INFO: rolling back to checkpoint "cp.20161011033032" and restarting the gsan succeeded.
dpnctl: INFO: gsan started.
dpnctl: INFO: Restoring MCS data...
dpnctl: INFO: MCS data restored.
dpnctl: INFO: Starting MCS...
dpnctl: INFO: To monitor progress, run in another window: tail -f /tmp/dpnctl-mcs-start-output-24536
dpnctl: WARNING: 1 warning seen in output of "[ -r /etc/profile ] && . /etc/profile ; /usr/local/avamar/bin/mcserver.sh --start"
dpnctl: INFO: MCS started.

**If this process fails, open a ticket with VMware support. I cannot provide the troubleshooting steps for this as they are confidential. If needed, add a note in your support ticket asking the assigned engineer to run a check past me.**

If the rollback goes through successfully, you might be presented with an option to restore the Tomcat database.

Do you wish to do a restore of the local EMS data?

Answering y(es) will restore the local EMS data
          n(o) will leave the existing EMS data alone
          q(uit) exits with no further action.

Please consult with Avamar Technical Support before answering y(es).

Answer n(o) here unless you have a special need to restore
  the EMS data, e.g., you are restoring this node from scratch,
  or you know for a fact that you are having EMS database problems
  that require restoring the database.

y(es), n(o), q(uit/exit):

I would choose no if my database is not causing issues in my environment. Post this, the remaining services will be started. The output:

dpnctl: INFO: EM Tomcat started.
dpnctl: INFO: Resuming backup scheduler...
dpnctl: INFO: Backup scheduler resumed.
dpnctl: INFO: AvInstaller is already running.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

That should be pretty much it. When you log in to the https://vdp-ip:8543/vdp-configure page, you should see the Data Domain automatically listed in the Storage tab. If not, open a support ticket with VMware.

There are a couple of post-migration steps:
1. If you are using the internal proxy, un-register the proxy and re-register it from the VDP configure page.
2. External proxies (if used) will be orphaned, so you will have to delete the external proxies, change the VDP root password and re-add the external proxies.
3. If you are using guest-level backups, then the agents for SQL, Exchange and SharePoint have to be re-installed. 
4. If this appliance is replicating to another VDP appliance, then the replication agents need to be re-registered. Run the below 4 commands in this order to perform this:
# service avagent-replicate stop
# service avagent-replicate unregister 127.0.0.1 /MC_SYSTEM
# service avagent-replicate register 127.0.0.1 /MC_SYSTEM
# service avagent-replicate start

And that should be it...


Avigui.html Shows Err_Connection_Refused in Avamar Virtual Edition 7.1

Recently I started deploying and testing the EMC Avamar Virtual Edition, and one of the first issues I ran into was with the configuration. The deployment of the appliance is pretty simple. The Avamar Virtual Edition 7.1 download is a 7zip file, which when extracted provides the OVF file. Using the Deploy OVF Template option, I was able to get this appliance deployed. Post this, as per the installation guide of AVE (Avamar Virtual Edition), I added the data drives, configured the networking for the appliance and rebooted after a successful configuration. 

However, when trying to access https://avamar-IP:8543/avi/avigui.html, I received the Err_Connection_Refused message. No matter what I tried, I was unable to get into the actual configuration GUI to initialize the services. 

It turns out there are a couple of steps I had to run. There is a package called AviInstaller.pl which is responsible for package installations, so this had to be installed. To do this, SSH into the Avamar appliance as root (default password changeme) and browse to the below directory:
# cd /usr/local/avamar/src/
Run the aviInstaller bootstrap with the below command:
# ./avinstaller-bootstrap-version.sles11_64.x86_64.run

Once this runs, log back into the same avigui.html URL and we should be able to see the below login screen.
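
If the page still refuses to load, a quick port check from the same SSH session confirms whether anything is listening yet (assuming the default 8543 port from the URL above):
# netstat -anp | grep 8543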
That's pretty much it.

VDP Status Error: Data Domain Storage Status Is Not Available

Today when I logged into my lab to do a bit of testing on my vSphere Data Protection 6.1 appliance, I noticed the following in the VDP plugin in the Web Client:

It was stating, "The Data Domain storage status is not available. Unable to get system information."
So, I logged into the vdp-configure page, switched to the Storage tab and noticed a similar error message.

So the first instinct was to go ahead and edit the Data Domain settings to try re-adding it. But that failed too, with the below message.


When a VDP is configured with a Data Domain, this configuration error logging will be in the ddrmaint logs located under /usr/local/avamar/var/ddrmaintlogs

Here, I noticed the following:

Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Error: get-system-info::body - DDR_Open failed: 192.168.1.200, DDR result code: 5040, desc: calling system(), returns nonzero
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Error: <4780>Datadomain get system info failed.
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Info: ============================= get-system-info finished in 62 seconds
Oct 29 01:54:27 vdp58 ddrmaint.bin[30277]: Info: ============================= get-system-info cmd finished =============================
Oct 29 01:55:06 vdp58 ddrmaint.bin[30570]: Warning: Calling DDR_OPEN returned result code:5040 message:calling system(), returns nonzero

Well, this tells only part of the problem: the appliance is unable to fetch the Data Domain system information. So the next thing is to check the vdr-configure logs located under /usr/local/avamar/var/vdr/server_logs
All operations done in the vdp-configure page are logged in the vdr-configure logs.

Here, the following was seen:

2016-10-29 01:54:28,564 INFO  [http-nio-8543-exec-9]-services.DdrService: Error Code='E30973' Message='Description: The file system is disabled on the Data Domain system. The file system must be enabled in order to perform backups and restores. Data: null Remedy: Enable the file system by running the 'filesys enable' command on the Data Domain system. Domain:  Publish Time: 0 Severity: PROCESS Source Software: MCS:DD Summary: The file system is disabled. Type: ERROR'

This is a much more detailed error. So, I logged into my Data Domain system using the "sysadmin" credentials and ran the below command to check the status of the file system:
# filesys status

The output was:
The filesystem is enabled and running.

The Data Domain was reporting that the file system was already up and running. Perhaps it was in a non-responding / stale state. So, I re-enabled the file system using:
# filesys enable

Post this, the Data Domain automatically connected to the VDP appliance and the correct status was displayed.
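
As a side note, both checks can also be run remotely from the VDP appliance over SSH, the same way the migration commands earlier in this blog talk to the Data Domain. This is just a sketch: 192.168.1.200 is the lab Data Domain used in these posts and you will be prompted for the sysadmin password.
# ssh sysadmin@192.168.1.200 filesys status
# ssh sysadmin@192.168.1.200 filesys enable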

VDP / EMC Avamar Console: Login Incorrect Before Entering Password

Today, while deliberately simulating failed root login attempts, I ran into an issue. I had a VDP 6.1.2 appliance installed and opened a console to it. When prompted for the "root" credentials, I simply went ahead and entered a wrong password multiple times. Then, at one point, as soon as I entered "root" at the VDP login prompt, it complained Login Incorrect.


Even before entering the password, the login was failing. If I SSH into the appliance as root, I am able to access the VDP. If I use admin to log in to the VDP, either via console or SSH, it works and I can sudo to root from there.

From messages.log the following was recorded during incorrect root login from the VM Console:

Nov  6 09:58:40 vdp58 login[26056]: FAILED LOGIN 1 FROM /dev/tty2 FOR root, Authentication failure

If I check which terminal I am on in the SSH session of the VDP using the # tty command, I see that the terminal used is /dev/pts/0.
And when I logged into the VDP console as admin, switched to root with sudo and checked the terminal name, as expected, it was /dev/tty2.

The Fix:

1. Change your directory to:
# cd /etc
2. Edit the securetty file using vi
# vi securetty
The contents of this file were "console" and "tty1".
Go ahead and add the terminal from which you are trying to access the appliance as root, in my case tty2, and save the file.
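
After the edit, the file simply has the new terminal appended to the original two entries:
# cat /etc/securetty
console
tty1
tty2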

Post this, I was able to login as root from the VDP console.

Also, applies to EMC Avamar Virtual Edition.


vSphere 6.5: What is vCenter High Availability

In 6.0 we had the option to provide high availability for the Platform Services Controller by deploying redundant PSC nodes in the same SSO domain and using a manual repoint command or a load balancer to switch to a new PSC if the current one was down. However, for the vCenter node there was no such option, and the only way to have HA for the vCenter node was to either configure Fault Tolerance or place the vCenter virtual machine in an HA-enabled cluster.

Now with the release of vSphere 6.5, there is a new, much-awaited feature that provides redundancy or high availability for your vCenter node too. This is VCHA, or the vCenter High Availability feature.

The design of VCHA is somewhat similar to your regular clustering mechanism. Before we get to how it works, here are a few prerequisites for VCHA:

1. Applicable to vCenter Server Appliance only. Embedded VCSA is currently not supported.
2. Three unique ESXi hosts. One for each node (Active, Passive and Witness)
3. Three unique datastores to contain each of these nodes.
4. Same Single Sign On Domain for Active and Passive nodes
5. One public IP to access and use vCenter
6. Three private IPs in a subnet different from that of the public IP. These will be used for internal communication to check node state.

vCenter High Availability (VCHA) Deployment:
There are three nodes deployed once your vCenter is configured for high availability: the Active node, the Passive node and the Witness (Quorum) node. The Active node is the one that has the public IP vNIC in the up state. This public IP is used to access and connect to your vSphere Web Client for management purposes.

The second node is the Passive node, which is an exact clone of the Active node. It has the same memory, CPU and disk configuration as the Active node. The public IP vNIC will be down for this node and the vNIC used for the private IP will be up. The private network between the Active and Passive nodes is used for cluster operations. The Active node has its database and files updated regularly, and this data has to be synced to the Passive node; this synchronization happens over the private network.

The third node, also called the quorum node, acts as a witness. This node is introduced to avoid the split-brain scenario which arises due to a network partition. In the case of a network partition we cannot have two active nodes up and running, and the quorum node decides which node stays active and which has to become passive.

vPostgres replication is used to replicate the database between the Active and Passive nodes, and this is synchronous replication. The vCenter files are replicated using native Linux rsync, which is asynchronous replication.

 What happens during a failover?

When the Active node goes down, the Passive node becomes active and assumes the public IP address. The VCHA cluster enters a degraded state since one of the nodes is down. The failover is not transparent and there will be an RTO of ~5 minutes.

Also, your cluster can enter a degraded state when your Active node is still running healthily but either the Passive or the Witness node is down. In short, if any one node in the cluster is down, then the VCHA cluster is in a degraded state. More about VCHA states and deployment will come in a later article.

Hope this was helpful.

vSphere 6.5: Installing vCenter Appliance 6.5

With the release of vSphere 6.5, the installation of the vCenter appliance just got a whole lot easier. Earlier, we required the Client Integration Plugin to be available, and the deployment was done through a browser. As we know, the Client Integration Plugin had multiple compatibility issues. Well, the Client Integration Plugin is no longer used. The deployment is now done via an ISO which contains an installation wizard that can be executed on Windows, Mac or Linux.

The vCenter Server Appliance deployment consists of two stages:
1. Deploying VCSA 6.5
2. Configuring VCSA 6.5

Deploying VCSA 6.5

Download the vCenter Server appliance from the link here. Once the download is complete, mount the ISO on any machine and run the installer. You should see the below screen.


We will be choosing the Install option as this is a fresh deployment. The description then shows that there are two steps involved in the installation. The first step will deploy a vCenter Server appliance and the second step will be configuring this deployed appliance.


Accept the EULA


Choose the type of deployment that is required. I will be going with an embedded Platform Services Controller deployment.


Next, choose the ESXi host where you would like to have this vCenter appliance deployed and provide the root credentials of the host for authentication.


Then, provide a name for the vCenter appliance VM that is going to be deployed and set the root password for the appliance.


Based upon your environment size, select the sizing of the vCenter appliance.


Select the datastore where the vCenter appliance files need to reside.


Configure the networking of the vCenter appliance. Have a valid IP that resolves both forward and reverse before this step to prevent any failures during installation.
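
A quick forward and reverse lookup from the machine running the installer saves a failed deployment later; the hostname and IP below are just hypothetical placeholders for your appliance:
nslookup vcsa65.lab.local
nslookup 192.168.1.210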


Review and finish the deployment, and the progress for stage 1 begins.


Upon completion, you can click Continue to proceed with configuring the appliance. If you close this window, you need to log in to the VCSA appliance management page at https://vcenter-IP:5480 to continue with the configuration. In this scenario, I will choose the Continue option to proceed further.

Configuring VCSA 6.5


The stage 2 wizard begins at this point. The first section is to configure NTP for the appliance and enable Shell access for the same.


Here, we will specify the SSO domain name, the SSO password and the site name for the appliance.
In the next step, if you would like to enable the Customer Experience Improvement Program (CEIP), you can do so; else you can skip it and proceed to completion.


Once the configuration wizard is completed the progress for Stage 2 begins.


Once the deployment is complete, you can log in to the Web Client (https://vCenter-IP:9443/vsphere-client) or the HTML5 client (https://vCenter-IP/ui). The HTML5 client is available only with the vCenter Server appliance.

vSphere 6.5: Login To Web Client Fails With Invalid Credentials

So, today I was making certain changes to my password policies on vSphere 6.5 and ran into an interesting issue. I had created a user in the SSO domain (vmware.local) called happycow@vmware.local and tried to log in to the Web Client with this user. However, the login failed with the error: Invalid Credentials.


In the vmware-sts-idmd logs located at C:\ProgramData\VMware\vCenterServer\logs\sso, the following was noticed:

[2016-11-16T12:51:22.541-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a ERROR] [IdentityManager] Failed to authenticate principal [happycow@vmware.local]. User password expired. 
[2016-11-16T12:51:22.542-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a INFO ] [IdentityManager] Authentication failed for user [happycow@vmware.local] in tenant [vmware.local] in [115] milliseconds with provider [vmware.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider] 
[2016-11-16T12:51:22.542-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a ERROR] [ServerUtils] Exception 'com.vmware.identity.idm.PasswordExpiredException: User account expired: {Name: happycow, Domain: vmware.local}' 
com.vmware.identity.idm.PasswordExpiredException: User account expired: {Name: happycow, Domain: vmware.local}
at com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider.checkUserAccountFlags(VMwareDirectoryProvider.java:1378) ~[vmware-identity-idm-server.jar:?]
at com.vmware.identity.idm.server.IdentityManager.authenticate(IdentityManager.java:3042) ~[vmware-identity-idm-server.jar:?]
at com.vmware.identity.idm.server.IdentityManager.authenticate(IdentityManager.java:9805) ~[vmware-identity-idm-server.jar:?]
at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323) ~[?:1.8.0_77]
at sun.rmi.transport.Transport$1.run(Transport.java:200) ~[?:1.8.0_77]
at sun.rmi.transport.Transport$1.run(Transport.java:197) ~[?:1.8.0_77]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_77]
at sun.rmi.transport.Transport.serviceCall(Transport.java:196) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683) ~[?:1.8.0_77]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

The account and its password were coming up as expired. I was able to log in to the Web Client with the default SSO user account without issues.

This issue occurs when the SSO password expiration lifetime has a larger value than the maximum value permitted.

Under Administration > Configuration > Policies, the password expiration was set to 36500 days. KB 2125495, which is similar to this issue, mentions that this value should be less than 999999.

I changed this value to 3650 days (10 years) and the other SSO users were able to log in. The same behavior is seen on 6.0 as well, with a different error: Authentication Failure.

vSphere 6.5: Installing vCenter Server Appliance Via Command Line

Along with the GUI method of deploying the vCenter appliance, there is a command-line path as well, which I would say is quite fun and easy to follow. There is a set of pre-defined templates available on the vCenter 6.5 appliance ISO. Download and mount the VCSA 6.5 ISO and browse to

CD Drive:\vcsa-cli-installer\templates\install

You will see the following list of templates:

You can choose the required template from here for deployment. I will be going with the embedded VCSA being deployed on an ESXi host. So my template will be embedded_vCSA_on_ESXi.json

Open the required template using Notepad. The template has a list of details that you need to fill out. In my case, it was to provide ESXi host details, appliance details, networking details for the appliance and Single Sign-On details. The file looks somewhat similar to the below image, post edit:
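
For reference, here is an abbreviated, hypothetical sketch of the kind of values the embedded_vCSA_on_ESXi.json template asks for. Every value below (hostnames, IPs, passwords) is a placeholder, and the exact key names can differ between installer builds, so always start from the template shipped on your ISO rather than copying this:

{
    "new.vcsa": {
        "esxi": {
            "hostname": "esxi01.lab.local",
            "username": "root",
            "password": "esxi-root-password",
            "deployment.network": "VM Network",
            "datastore": "datastore1"
        },
        "appliance": {
            "thin.disk.mode": true,
            "deployment.option": "small",
            "name": "vcsa65"
        },
        "network": {
            "ip.family": "ipv4",
            "mode": "static",
            "ip": "192.168.1.210",
            "prefix": "24",
            "gateway": "192.168.1.1",
            "dns.servers": ["192.168.1.2"],
            "system.name": "vcsa65.lab.local"
        },
        "os": {
            "password": "appliance-root-password",
            "ssh.enable": true
        },
        "sso": {
            "password": "sso-administrator-password",
            "domain-name": "vsphere.local",
            "site-name": "Default-Site"
        }
    },
    "ceip": {
        "settings": {
            "ceip.enabled": false
        }
    }
}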


Save the file as .json onto your desktop. Next, you will have to call this file using the vcsa-deploy.exe utility. 
In the CD drive, browse to vcsa-cli-installer\win32, lin64 or mac depending on the OS you are running this from, and run the below command from PowerShell:

vcsa-deploy.exe install --no-esx-ssl-verify --accept-eula --acknowledge-ceip "Path to the json file"

If there are errors in the edited file, the installer displays what the error is and which line of the file contains it. The first step is template verification, and if it completes successfully, you should see the below message:


The next step, which starts automatically, is the appliance deployment, and you will see the below task in progress:


Once the appliance is deployed, the final stage is configuring the services. At this point you will see the below task in progress:


That's pretty much it; you can go ahead and log in to the Web Client (Flash / HTML5).

Hope this helps!


Dude, Where's My Logs?

Well, what do we have here? Let's say we created one backup job in the vSphere Data Protection appliance and added 30 virtual machines to it. Next, we started the backup for this job and let it run overnight. The next morning when you log in, you see that 29 VMs completed successfully and 1 VM failed to back up. 


You log in to the SSH of the VDP, browse to /usr/local/avamarclient/var and notice that there are 30 logs for the same backup job with different IDs, which makes no sense at first glance. You don't know which log file contains the failed VM's backup. 


Here's a sneaky way to get this done.

I have created 3 VMs: Backup-Test, Backup-Test-B and Backup-Test-C
I have a backup Job called Job-A

I initiated a manual backup and all 3 VMs completed successfully. If I go to the backup job logs folder, I see the following:

-rw-r----- 1 root root 1.7K Nov 18 04:33 Job-A-1479423789022-02ef0cfc8a738a3e98c44fdbef3354ae1ce86a4b-1016-vmimagel-xmlstats.log
-rw-r----- 1 root root  15K Nov 18 04:34 Job-A-1479423789022-02ef0cfc8a738a3e98c44fdbef3354ae1ce86a4b-1016-vmimagel.alg
-rw-r----- 1 root root  40K Nov 18 04:34 Job-A-1479423789022-02ef0cfc8a738a3e98c44fdbef3354ae1ce86a4b-1016-vmimagel.log
-rw-r----- 1 root root  18K Nov 18 04:33 Job-A-1479423789022-02ef0cfc8a738a3e98c44fdbef3354ae1ce86a4b-1016-vmimagel_avtar.log
-rw-r----- 1 root root  383 Nov 18 04:33 Job-A-1479423789022-3628808a321ddac3a6e0d1eaae3446ad996d9d43-3016-vmimagew-xmlstats.log
-rw-r----- 1 root root  15K Nov 18 04:33 Job-A-1479423789022-3628808a321ddac3a6e0d1eaae3446ad996d9d43-3016-vmimagew.alg
-rw-r----- 1 root root  40K Nov 18 04:33 Job-A-1479423789022-3628808a321ddac3a6e0d1eaae3446ad996d9d43-3016-vmimagew.log
-rw-r----- 1 root root  19K Nov 18 04:33 Job-A-1479423789022-3628808a321ddac3a6e0d1eaae3446ad996d9d43-3016-vmimagew_avtar.log
-rw-r----- 1 root root 1.7K Nov 18 04:33 Job-A-1479423789022-a573f30f993e9a58420c69564cf2e16135540c49-1016-vmimagel-xmlstats.log
-rw-r----- 1 root root  15K Nov 18 04:33 Job-A-1479423789022-a573f30f993e9a58420c69564cf2e16135540c49-1016-vmimagel.alg
-rw-r----- 1 root root  41K Nov 18 04:33 Job-A-1479423789022-a573f30f993e9a58420c69564cf2e16135540c49-1016-vmimagel.log
-rw-r----- 1 root root  18K Nov 18 04:33 Job-A-1479423789022-a573f30f993e9a58420c69564cf2e16135540c49-1016-vmimagel_avtar.log

Looking at this, neither you nor I have any idea which log here contains the logging for Backup-Test-C.

Step 1 of the sneaky trick:

Run the following command:
# mccli activity show --verbose=true

The output:

0,23000,CLI command completed successfully.
ID               Status                   Error Code Start Time           Elapsed     End Time             Type             Progress Bytes New Bytes Client        Domain                               OS            Client Release Sched. Start Time    Sched. End Time      Elapsed Wait Group                                      Plug-In              Retention Policy Retention Schedule                 Dataset                                    WID                 Server Container
---------------- ------------------------ ---------- -------------------- ----------- -------------------- ---------------- -------------- --------- ------------- ------------------------------------ ------------- -------------- -------------------- -------------------- ------------ ------------------------------------------ -------------------- ---------------- --------- ------------------------ ------------------------------------------ ------------------- ------ ---------
9147942378903409 Completed w/Exception(s) 10020      2016-11-18 04:33 IST 00h:00m:22s 2016-11-18 04:33 IST On-Demand Backup 64.0 MB        <0.05%    Backup-Test   /vc65.happycow.local/VirtualMachines windows9Guest 7.2.180-118    2016-11-18 04:33 IST 2016-11-19 04:33 IST 00h:00m:11s  /vc65.happycow.local/VirtualMachines/Job-A Windows VMware Image Job-A            D         Admin On-Demand Schedule /vc65.happycow.local/VirtualMachines/Job-A Job-A-1479423789022 Avamar N/A

9147942378902609 Completed                0          2016-11-18 04:33 IST 00h:00m:21s 2016-11-18 04:33 IST On-Demand Backup 64.0 MB        <0.05%    Backup-Test-C /vc65.happycow.local/VirtualMachines rhel7_64Guest 7.2.180-118    2016-11-18 04:33 IST 2016-11-19 04:33 IST 00h:00m:21s  /vc65.happycow.local/VirtualMachines/Job-A Linux VMware Image   Job-A            DWMY      Admin On-Demand Schedule /vc65.happycow.local/VirtualMachines/Job-A Job-A-1479423789022 Avamar N/A

9147942357262909 Completed w/Exception(s) 10020      2016-11-18 04:29 IST 00h:00m:43s 2016-11-18 04:30 IST On-Demand Backup 64.0 MB        <0.05%    Backup-Test   /vc65.happycow.local/VirtualMachines windows9Guest 7.2.180-118    2016-11-18 04:29 IST 2016-11-19 04:29 IST 00h:00m:12s  /vc65.happycow.local/VirtualMachines/Job-A Windows VMware Image Job-A            DWMY      Admin On-Demand Schedule /vc65.happycow.local/VirtualMachines/Job-A Job-A-1479423572598 Avamar N/A

9147942378903009 Completed                0          2016-11-18 04:33 IST 00h:00m:22s 2016-11-18 04:34 IST On-Demand Backup 64.0 MB        <0.05%    Backup-Test-B /vc65.happycow.local/VirtualMachines centosGuest   7.2.180-118    2016-11-18 04:33 IST 2016-11-19 04:33 IST 00h:00m:31s  /vc65.happycow.local/VirtualMachines/Job-A Linux VMware Image   Job-A            DWMY      Admin On-Demand Schedule /vc65.happycow.local/VirtualMachines/Job-A Job-A-1479423789022 Avamar N/A

Again, all too confusing? Well, we need to look at two fields here: the Job ID field and the Work Order ID field. 

The Job ID field is the one highlighted in red and the Work Order ID is the one highlighted in orange.

The Work Order ID will match the first Name-ID in the log directory, but still this will not be helpful if there are too many VMs in the same backup Job as they will all have the same Work Order ID. 

Step 2 of the sneaky trick:

We will use the Job ID to view the logs. The command would be:
# mccli activity get-log --id=<Job-ID> | less

The first thing you will see as the output is:

0,23000,CLI command completed successfully.
Attribute Value

Followed by tons of blank spaces and dashes. 

Step 3 of the sneaky trick

The first event recorded in any backup log announces where the logging is being written, and this event ID is 5008.

So as soon as you run the get-log command, search for this Event ID and you will be directly taken to the start of the client's backup logs. You will see:

2016-11-18T04:33:29.903-05:-30 avvcbimage Info <5008>: Logging to /usr/local/avamarclient/var/Job-A-1479423789022-a573f30f993e9a58420c69564cf2e16135540c49-1016-vmimagel.log

Not only can you view the logs from here, you also get the complete log file name.
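
If all you need is the log file name for a particular client, the two steps can be chained into a single command; the ID below is the Job ID of Backup-Test-C from the output above:
# mccli activity get-log --id=9147942378902609 | grep "<5008>"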

That's it. Sneaky tricks in place. 

VDP Backups Stuck In "Waiting-Client" State

I was recently working on a case where VDP 5.5.6 never started its backup jobs. When I selected the backup job and clicked Backup Now, it showed the job had started successfully; however, no task was created at all for it.

And when I log in to the SSH of the VDP appliance to view the progress of the backup, the state is "Waiting-Client". So, in SSH, the below command is executed to view the backup status:
# mccli activity show --active

The output:

ID               Status            Error Code Start Time           Elapsed     End Time             Type             Progress Bytes New Bytes Client   Domain
---------------- ----------------- ---------- -------------------- ----------- -------------------- ---------------- -------------- --------- -------- --------------------------
9147944954795109 Waiting-Client    0          2016-11-18 07:12 IST 07h:40m:23s 2016-11-19 07:12 IST On-Demand Backup 0 bytes        0%        VM-1 /10.0.0.27/VirtualMachines
9147940920005609 Waiting-Client    0          2016-11-17 20:00 IST 18h:52m:51s 2016-11-18 20:00 IST Scheduled Backup 0 bytes        0%        VM-2 /10.0.0.27/VirtualMachines

The backups were always stuck in this status and never moved further. If I look at the vdr-server.log, I do see the job has been issued to MCS:

2016-11-18 14:33:06,630 INFO  [http-nio-8543-exec-10]-rest.BackupService: Executing Backup Job "Job-A"

However, if I look at the MCS log, mcserver.log, I see the job is not executed by MCS because MCS thinks the server is in a read-only state:

WARNING: Backup job skipped because server is read-only

If I run status.dpn, I see the Server Is In Full Access state. I checked the dispatcher status using the below command:

# mcserver.sh --status

You will have to be in admin mode to run the mcserver.sh script. The output of this script was:

Backup dispatching: suspended

This is a known issue on the 5.5.6 release of VDP. 

To fix this:

1. Cancel any existing backup jobs using the command:
# mccli activity cancel --id=<job-id>

The Job ID is the first column in the above mccli activity show output.

2. Browse to the below location:
# cd /usr/local/avamar/var/mc/server_data/prefs/

3. Open the mcserver.xml file in a vi editor.

4. Locate the parameter "stripeUtilizationCapacityFactor" and edit its value to 2.0 (a quick grep check is shown after these steps).
**Do not change anything else in this file at all. Only this value needs to be changed.**

5. Save the file and restart the MCS using the below command:
# dpnctl stop mcs
# dpnctl start mcs

6. Run the mcserver.sh to check the dispatcher status:
# mcserver.sh --status | grep -i dispatching

This time the output should be:

Backup dispatching: running

7. Run the backup again and it should start immediately now.
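
As referenced in step 4, a quick grep against the same file is an easy way to confirm the parameter's value before and after the edit:
# grep stripeUtilizationCapacityFactor /usr/local/avamar/var/mc/server_data/prefs/mcserver.xml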

If this does not work, raise a support request with VMware. Hope this helps.

VDP 6.1.3 First Look

So over the last week, multiple products from VMware have gone live, and the most awaited one was vSphere 6.5.
With vSphere 6.5, the add-on VMware products have to be on their compatible versions. This post is specifically dedicated to vSphere Data Protection 6.1.3. I will try to keep it short and mention the changes I have seen after deploying it in my lab. There are already release notes for this, which cover most of the fixed issues and known issues in the 6.1.3 release.

This article covers only the issues that are not included in the release notes.

Issue 1: 

While using the internal proxy for this appliance, the vSphere Data Protection configure page comes up blank for the name of the proxy, the ESXi host where the proxy is deployed and the datastore where the proxy resides.

The vdr-configure log does not have any information on this, and a reconfigure / disable and re-enable of the internal proxy does not fix it.

Workaround:
A reboot of the appliance populates the right information back in the configure page.

Issue 2:

This is an intermittent issue, seen only during a fresh deployment of the 6.1.3 appliance. The vSphere Data Protection plugin is not available in the Web Client. The VDP plugin version for this release is com.vmware.vdp2-6.1.3.70, and this folder is not created in the vsphere-client-serenity folder on the vCenter Server. The fix is similar to the one in the link here.

Issue 3:

vDS Port-Groups are not listed during deployment of external proxies. The drop-down shows only standard switch port-groups.

Workaround:
Deploy the external proxy on a standard switch and then migrate it to a distributed switch. The standard switch port group you create does not need any uplinks, as the proxy will be migrated to the vDS soon after the deployment is completed.

Issue 4:

A VM which is added to a backup job shows up as unprotected after the VM is renamed. VDP does not sync its naming with the vCenter inventory automatically.

Workaround:
Force sync the vCenter - Avamar Names using the proxycp.jar utility. The steps can be found in this article here.

Issue 5:

Viewing the logs for a failed backup from the Job Failure tab does not return anything. The below message is seen while trying to view the logs:


This was seen in all versions of 6.x, even if none of the mentioned reasons are true. EMC has acknowledged this bug; however, there is no fix for it currently.

Workaround:
View the logs from the Task Failure tab or the command line, or gather a log bundle from the vdp-configure page.


Fixed Issue: Controlling concurrent backup jobs. (Not contained in Release Notes)

In VDP releases prior to 6.1.3 and after 6.0.x, the vdp-configure page provided an option to control how many backup jobs should run at a time. The throttling was set under the "Manage Proxy Throughput" section. However, this never worked for most deployments.

This is fixed in 6.1.3

The test:

Created a backup job with 5 VMs.
Set the throughput to 1. 5 iterations of backup were executed - Check Passed
Set the throughput to 2. 3 iterations of backup were executed - Check Passed

Set 2 external proxies and throughput to 1
The throughput would be 2 x 1 = 2
3 Iterations of backups were executed - Check Passed.

Re-registered the appliance. Same test - Check Passed
vMotioned the appliance. Same test - Check Passed
Rebooted the VDP. Same test - Check Passed


I will update this article as and when I come across new issues / fixes that are NOT included in the vSphere Data Protection 6.1.3 release notes.

VDP SQL Agent Backup Fails: Operating System Error 0x8007000e

While the SQL agent was configured successfully for the SQL box, the backups were failing continuously for this virtual machine. Upon viewing the backup logs, located at C:\Program Files\avp\var\, the following error logging was observed.

2016-11-13 21:00:27 avsql Error <40258>: sqlconnectimpl_smo::execute Microsoft.SqlServer.Management.Common.ExecutionFailureException: An exception occurred while executing a Transact-SQL statement or batch. ---> System.Data.SqlClient.SqlException: BackupVirtualDeviceSet::SetBufferParms: Request large buffers failure on backup device '(local)_SQL-DB-1479099600040-3006-SQL'. Operating system error 0x8007000e(Not enough storage is available to complete this operation.).

BACKUP DATABASE is terminating abnormally.

   at Microsoft.SqlServer.Management.Common.ConnectionManager.ExecuteTSql(ExecuteTSqlAction action, Object execObject, DataSet fillDataSet, Boolean catchException)

   at Microsoft.SqlServer.Management.Common.ServerConnection.ExecuteWithResults(String sqlCommand)

   --- End of inner exception stack trace ---

   at Microsoft.SqlServer.Management.Common.ServerConnection.ExecuteWithResults(String sqlCommand)

   at SMOWrap.SMO_ExecuteWithResults(SMOWrap* , UInt16* sql_cmd, vector<std::basic_string<unsigned short\,std::char_traits<unsigned short>\,std::allocator<unsigned short> >\,std::allocator<std::basic_string<unsigned short\,std::char_traits<unsigned short>\,std::allocator<unsigned short> > > >* messages)

This is caused by SQL VDI timeouts. The default timeout for a VDI transfer is 10 seconds. 

To fix this:

1. Browse to the below directory:
C:\Program Files\avp\var

2. Add the below two lines to the avsql.cmd file. In my case, this file did not exist, so I had to create it as a text file and save it with a .cmd extension using the "All Files" save type (a quick verification from the command prompt is shown after these steps):

--max-transfer-size=65536
--vditransfertimeoutsecs=900

If the SQL virtual machine is a pretty huge deployment, then it would be necessary to increase the vditransfertimeoutsecs parameter. 

3. Run the backup again and it should complete successfully this time. 
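
As mentioned in step 2, once the file is saved you can verify from a command prompt that the flags went in as plain text (and that Notepad did not sneak in a hidden .txt extension):
type "C:\Program Files\avp\var\avsql.cmd"
The output should show exactly the two lines added above.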

Hope this helps.

VDP 6.1: Unable To Expand Storage

So, there have been a few tricky issues going on with expanding VDP storage drives. This section talks specifically about the OS kernel not picking up the partition extents.

A brief intro about what's going on here. From vSphere Data Protection 6.x onward, the dedup storage drives can be expanded. If your backup data drives are running out of space and you do not wish to delete restore points, this feature allows you to extend your data partitions. To do this, we log in to the https://vdp-ip:8543/vdp-configure page, go to the Storage tab and select the Expand Storage option. The wizard successfully expands the existing partitions. Post this, if you run df -h from the SSH of the VDP, it should pick up the expanded sizes. In the issue discussed here, either none of the partitions are expanded or a few of them report inconsistent information. 

So, in my case, I had a 512 GB VDP deployment, which by default deploys 3 data drives of ~256 GB each. 

Post this, I expanded the storage to 1 TB, which would ideally have 3 drives of ~512 GB each. In my case the expansion in the wizard completed successfully; however, the data drives were inconsistent when viewed from the command line. In the GUI (Edit Settings of the VM), the correct information was displayed.



When I ran df -h, the below was seen:

root@vdp58:~/#: df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        32G  5.8G   25G  20% /
udev            1.9G  152K  1.9G   1% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
/dev/sda1       128M   37M   85M  31% /boot
/dev/sda7       1.5G  167M  1.3G  12% /var
/dev/sda9       138G  7.2G  124G   6% /space
/dev/sdb1       256G  2.4G  254G   1% /data01
/dev/sdc1       512G  334M  512G   1% /data02
/dev/sdd1       512G  286M  512G   1% /data03

The sdb1 was not expanded to 512 GB whereas the data partitions sdc1 and sdd1 were successfully extended. 

If I run fdisk -l, I see that the partitions have been extended successfully for all three data0? mounts with the updated space. 

**If you run the fdisk -l command and do not see the partitions updated, then raise a case with VMware**

Disk /dev/sdb: 549.8 GB, 549755813888 bytes
255 heads, 63 sectors/track, 66837 cylinders, total 1073741824 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1  1073736404   536868202   83  Linux

Disk /dev/sdc: 549.8 GB, 549755813888 bytes
255 heads, 63 sectors/track, 66837 cylinders, total 1073741824 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1  1073736404   536868202   83  Linux

Disk /dev/sdd: 549.8 GB, 549755813888 bytes
255 heads, 63 sectors/track, 66837 cylinders, total 1073741824 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1  1073736404   536868202   83  Linux

If this is the case, run the partprobe command. This makes the SUSE kernel aware of the partition table changes. Post this, run df -h to verify whether the data drives now show the correct size. If yes, stop here. If not, proceed further.

**Make sure you do this with a help of VMware engineer if this is a production environment**

If the partprobe does not work, then we will have to grow the xfs volume. To do this:

1. Power down the VDP appliance gracefully
2. Change the data drives from Independent Persistent to Dependent
3. Take a snapshot of the VDP appliance
4. Power On the VDP appliance
5. Once the appliance is booted successfully, stop all the services using the command:
# dpnctl stop
6. Grow the mount point using the command:
# xfs_growfs <mount point>
In my case:
# xfs_growfs /dev/sdb1
If successful, you will see the below output: (Ignore the formatting)

root@vdp58:~/#: xfs_growfs /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=16776881 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=67107521, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32767, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 67107521 to 134217050

Run df -h and verify if the partitions are now updated.

root@vdp58:~/#: df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        32G  5.8G   25G  20% /
udev            1.9G  148K  1.9G   1% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
/dev/sda1       128M   37M   85M  31% /boot
/dev/sda7       1.5G  167M  1.3G  12% /var
/dev/sda9       138G  7.1G  124G   6% /space
/dev/sdb1       512G  2.4G  510G   1% /data01
/dev/sdc1       512G  334M  512G   1% /data02

/dev/sdd1       512G  286M  512G   1% /data03

If yes, then stop here.
If not, then raise a support request with VMware, as this would need an engineering fix.
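
Either way, once you are done, remember to bring the core services (stopped in step 5) back up and clean up the snapshot taken in step 3:
# dpnctl start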

Hope this helps. 
