The operations graph#

The Asynchronous Function Graph (AFC) is the central coordination mechanism of MetalSoft.

It is a queue of operations that need to be executed. Some tasks have dependencies: other tasks that must finish before they can start. If a task fails, it is retried a number of times, depending on its specification.
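The queue-with-dependencies idea can be sketched in a few lines. This is only an illustration: the `Task` class, its field names, and the task names are hypothetical and do not reflect the actual MetalSoft schema.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for one AFC entry; not the real MetalSoft schema."""
    name: str
    depends_on: set[str] = field(default_factory=set)
    max_retries: int = 3
    call_count: int = 0


def ready_tasks(tasks: list[Task], finished: set[str]) -> list[str]:
    """Return names of unfinished tasks whose dependencies have all finished."""
    return [
        t.name
        for t in tasks
        if t.name not in finished and t.depends_on <= finished
    ]


# Hypothetical task names, chosen only for illustration.
queue = [
    Task("allocate_server"),
    Task("configure_network", depends_on={"allocate_server"}),
    Task("update_dns", depends_on={"configure_network"}, max_retries=10),
]

print(ready_tasks(queue, finished=set()))                # ['allocate_server']
print(ready_tasks(queue, finished={"allocate_server"}))  # ['configure_network']
```

A task becomes eligible only once every name in its `depends_on` set appears in the finished set, which is the behavior described above.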

an example AFC list

Notice the call count, max retries, duration, memory usage, and so on. The AFC graph is also a log of all operations executed by the system.

The AFC Graph#

Since tasks have dependencies, the tasks and their dependencies form a directed acyclic graph (DAG) that can be visualized using the AFC Group:

an example AFC graph
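To make the DAG property concrete, the sketch below orders a small dependency map with Python's standard `graphlib` module. The task names are hypothetical, and this is not how MetalSoft itself schedules tasks; it only shows that a dependency graph without cycles always admits a valid execution order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "allocate_server": set(),
    "configure_network": {"allocate_server"},
    "deploy_os": {"allocate_server"},
    "update_dns": {"configure_network", "deploy_os"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency map is a DAG (has no cycles)."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # False: a and b wait on each other
```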

You will notice that the graph is much longer than expected, even for empty infrastructures. In most cases the tasks are called but do nothing. This is because the graph is similar in most situations, and each task determines for itself whether it has anything to execute. This is normal and has no impact on the performance of the graph.
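A task that decides for itself whether there is work to do might look like the following sketch. The function name and the VLAN example are assumptions made for illustration, not an actual MetalSoft task.

```python
def ensure_vlans(desired: set[int], existing: set[int]) -> str:
    """Hypothetical task: create only the VLANs that are missing."""
    missing = desired - existing
    if not missing:
        # The task was called, but there is nothing to execute.
        return "nothing to do"
    existing |= missing  # stand-in for the real provisioning work
    return f"created {sorted(missing)}"


existing_vlans = {10}
print(ensure_vlans({10, 20}, existing_vlans))  # created [20]
print(ensure_vlans({10, 20}, existing_vlans))  # nothing to do
print(ensure_vlans(set(), set()))              # nothing to do (empty infrastructure)
```

Because the check is cheap, calling such a task on an empty infrastructure costs little, which is consistent with the note above that these calls do not affect performance.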

Troubleshooting failed operations#

The AFC should be the go-to page when an issue is reported. Given an infrastructure ID, a server ID, or a user ID, the admin can find the respective AFC graph and identify the failed task.

When an operation fails, it shows up as thrown_error instead of returned_success. Depending on the task definition (the max_retries parameter), the task is automatically retried after the minimum interval until it succeeds or max_retries is reached.
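The retry behavior can be approximated with the loop below. It is a minimal sketch under stated assumptions: the function name, the `min_interval` parameter, and the interpretation of `max_retries` as the number of additional attempts after the first call are all hypothetical. Only the two status strings come from the text above.

```python
import time


def run_with_retries(func, max_retries=3, min_interval=0.0):
    """Call func until it succeeds or max_retries extra attempts are used up.

    Returns (status, call_count, result_or_error). Assumes max_retries counts
    the attempts made after the first call, which may differ from MetalSoft.
    """
    call_count = 0
    while True:
        call_count += 1
        try:
            return "returned_success", call_count, func()
        except Exception as error:
            if call_count > max_retries:
                return "thrown_error", call_count, error
            time.sleep(min_interval)  # wait the minimum interval before retrying


attempts = {"n": 0}


def flaky():
    """Hypothetical operation that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("server not reachable yet")
    return "ok"


print(run_with_retries(flaky, max_retries=5))  # ('returned_success', 3, 'ok')
```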

Some operations that depend on external factors (such as DNS updates or waiting for server access) are expected to fail a number of times; this is considered normal behavior.

This is what a successful execution of a task looks like:

an example AFC call

This is what a failed task looks like:

![an example AFC call](/assets/advanced/ .png)

Typically, analyzing the error messages reveals the cause of the issue. In this case it is reason: getaddrinfo ENOTFOUND, meaning the host could not be resolved. Fixing the issue (e.g., adding the host to the DNS zone or, as in this case, fixing the typo in the stage that requested the URL) and hitting Retry continues the provisioning process of the infrastructure that called this step from where it left off. The Retry button does not check the max_retries parameter.
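Before hitting Retry after a name-resolution error, it can help to confirm that the host now resolves from the machine in question. The helper below is a hypothetical convenience, not part of MetalSoft; it uses Python's standard `socket.getaddrinfo`, which raises `socket.gaierror` when resolution fails.

```python
import socket


def resolves(hostname: str, port: int = 443) -> bool:
    """Return True if the system resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False


# The .invalid top-level domain is reserved and should never resolve.
for host in ("127.0.0.1", "no-such-host.invalid"):
    print(host, "resolves" if resolves(host) else "does not resolve")
```

Note that the result reflects the resolver of the machine running the check, which may differ from the resolver used by the component that reported the error.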

The application is designed to let the system recover from failed tasks in most situations. However, in some edge cases this might not be possible, or the fix might be to perform the task manually. The stage can then be ‘skipped’ to allow other tasks to continue, if possible. Which approach applies depends on the exact cause of the issue being investigated.
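The effect of skipping a stage on the tasks that depend on it can be sketched as follows. The `pending` and `skipped` state names and the task names are assumptions for illustration; only thrown_error and returned_success appear in the AFC itself as described above.

```python
def unblocked(deps, states):
    """Return pending tasks whose dependencies all succeeded or were skipped."""
    settled = {
        name
        for name, state in states.items()
        if state in ("returned_success", "skipped")
    }
    return sorted(
        name
        for name, needs in deps.items()
        if states[name] == "pending" and needs <= settled
    )


# Hypothetical two-task graph: the second task waits on the first.
deps = {"update_dns": set(), "wait_for_ssh": {"update_dns"}}
states = {"update_dns": "thrown_error", "wait_for_ssh": "pending"}

print(unblocked(deps, states))   # []: the failed task blocks its dependent
states["update_dns"] = "skipped"  # the admin fixed the issue manually and skipped
print(unblocked(deps, states))   # ['wait_for_ssh']
```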