Troubleshooting deploys using the Operation’s Graph (AFC tree)
While a deploy is in progress, navigate to the Job Queue in the Admin Dashboard under Advanced
Search for the job that carries the ID of the infrastructure being deployed
Right-click the Graph number and open it in a new window. This will show the AFC graph. You can use the + and - on the right to zoom in and out. Each AFC step is coloured and labelled by status:
AFC steps which have passed are green and marked Success
AFC steps which are running are orange and marked Running
AFC steps which are retrying are orange and marked Retrying; they will go back to Running
AFC steps which have failed are red and marked Failed
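The status legend above can be written down as a small lookup, which may be handy when noting which steps of a graph need attention. This is only a sketch: the `(name, status)` pair format and the `failed_steps` helper are assumptions made for illustration and are not part of any Metalsoft API.

```python
# Status label -> colour, as described for the AFC graph view.
STATUS_COLOURS = {
    "Success": "green",
    "Running": "orange",
    "Retrying": "orange",
    "Failed": "red",
}


def failed_steps(steps):
    """Return the names of steps marked Failed.

    `steps` is a list of (name, status) pairs written down by hand from
    the graph view (hypothetical format).
    """
    for _, status in steps:
        if status not in STATUS_COLOURS:
            raise ValueError(f"unknown status: {status!r}")
    return [name for name, status in steps if status == "Failed"]


observed = [
    ("osImageBuild", "Success"),
    ("osImageTransfer", "Failed"),
]
print(failed_steps(observed))
```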
To see the details of an AFC step, click on that step. From here you can see its status and, if it has failed, the reason for the failure
In some instances, you can simply retry the job by clicking Retry in the top right. Avoid Skip and the advanced steps unless you know what you are doing and have found the reason for the issue
To help narrow down the problem, the name shown for the AFC step indicates what the step is doing (in this example, server_start_cleanup_via_oob)
An AFC can fail for various reasons, including network issues and hardware failures.
Below are some guidelines to assist in troubleshooting failed AFC steps.
Please contact Metalsoft if any of the following AFCs fail. Do not skip these without Metalsoft guidance:
Callback AFCs such as “ip_provision_callback”
AFCs such as “pushDNSDatacenterAgent”, “updatePowerDNS”, “regenerateAndPushDHCPAndTFTP”, “push_iSNSDatacenterAgent”, “awaitPushConfigurationIntoAgents”, “setDHCPStatusAnsibleInstances”
A failure of the AFCs above may indicate an issue with the Agents VM (Site Controller VM)
Please contact your networking team if any of the following AFCs fail. Do not skip these:
AFCs relating to switches, such as “switch_device_provision”
AFCs relating to switches, such as “networkSetupStartedEvent”
If AFCs such as the following fail, please investigate the OS template and the Variables and Secrets used by the deploy:
“materializeInstancesOSAssetVariablesAndSecrets”
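The escalation rules above can be summarised as a lookup from AFC name to first action. The following is a minimal sketch, not a Metalsoft tool: the `triage` function is hypothetical, and because the guidance lists the AFC names as examples, the sets below should not be read as exhaustive.

```python
# AFC names copied from the guidance above; the source gives them as
# examples, so these sets are not exhaustive.
CONTACT_METALSOFT = {
    "ip_provision_callback",
    "pushDNSDatacenterAgent",
    "updatePowerDNS",
    "regenerateAndPushDHCPAndTFTP",
    "push_iSNSDatacenterAgent",
    "awaitPushConfigurationIntoAgents",
    "setDHCPStatusAnsibleInstances",
}
CONTACT_NETWORKING = {"switch_device_provision", "networkSetupStartedEvent"}
CHECK_OS_TEMPLATE = {"materializeInstancesOSAssetVariablesAndSecrets"}


def triage(afc_name: str) -> str:
    """Hypothetical helper: return the first action suggested for a failed AFC."""
    if afc_name in CONTACT_METALSOFT:
        return "Contact Metalsoft; do not skip (possible Agents VM issue)"
    if afc_name in CONTACT_NETWORKING:
        return "Contact your networking team; do not skip"
    if afc_name in CHECK_OS_TEMPLATE:
        return "Investigate the OS template and the Variables and Secrets"
    return "No specific guidance listed; read the failure reason on the AFC step"


print(triage("updatePowerDNS"))
print(triage("switch_device_provision"))
```

Exact-name matching is used deliberately: guessing at naming patterns (for example, treating every name ending in a particular suffix as a callback AFC) would go beyond what the guidance states.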
If the deploy is not required:
You can use Kill AFC Running Process and skip from “applyDefaultRaidProfile” onwards to clear the deploy. You can then delete the failed instance array from the infrastructure.
If “applyDefaultRaidProfile” fails:
Please connect to the iDRAC of the server and check whether RAID is being set on the server. If the RAID job has completed, you should be able to retry the event
If “osImageBuild” fails:
This could be an issue with either NFS or the Agents VM
If “osImageTransfer” fails:
This could be an issue with NFS, the network, the Agents VM or the iDRAC. Start by checking whether the image can be mounted on the iDRAC of the server. If it can, retry this step; if not, contact your networking team to troubleshoot further
If “bootGeneratedOSImageFromVirtualMedia” fails:
This normally suggests the iDRAC has lost its connection to NFS. The step can be retried, but if it fails again, retry “osImageBuild”, then “osImageTransfer”, then “bootGeneratedOSImageFromVirtualMedia”, in that order
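The retry order for a failed “bootGeneratedOSImageFromVirtualMedia” step can be sketched as follows. This assumes a hypothetical `retry` callback that retries the named AFC step and reports whether it succeeded (in practice, an operator clicking Retry in the graph view and noting the result); it is an illustration of the ordering only, not an existing Metalsoft API.

```python
from typing import Callable, List

BOOT_STEP = "bootGeneratedOSImageFromVirtualMedia"

# Order to follow when a plain retry of the boot step fails again.
RECOVERY_ORDER = ["osImageBuild", "osImageTransfer", BOOT_STEP]


def recover_virtual_media_boot(retry: Callable[[str], bool]) -> List[str]:
    """Return the steps attempted, in order.

    `retry` is a hypothetical callback: it retries the named step and
    returns True if the step then succeeds.
    """
    attempted = [BOOT_STEP]
    if retry(BOOT_STEP):
        return attempted  # a single retry was enough

    # The retry failed again: rebuild, re-transfer, then boot once more.
    for step in RECOVERY_ORDER:
        attempted.append(step)
        if not retry(step):
            break  # stop and investigate this step before going further
    return attempted
```

For example, `recover_virtual_media_boot(lambda step: True)` attempts only the boot step, since the first retry succeeds. If any step in the recovery order fails, the sketch stops there so that step can be investigated using the guidance above.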