MetalSoft Scalability Guide

MetalSoft is a complex software system with many moving parts, and its scalability depends on both internal and external factors. This guide covers the elements that affect the performance of a MetalSoft deployment.

Overall architecture

The following describes a typical deployment of MetalSoft in an enterprise environment:

  • The MetalSoft Global Controller is a Kubernetes application made up of many microservices connected through an internal Kafka server and other means. Its performance is directly tied to the performance of the underlying Kubernetes cluster.

  • The MetalSoft Site Controller is a VM running at every site. The performance of the VM is directly tied to the performance of the node hosting it.

  • Between the Site Controller and the Global Controller there is an HTTPS connection that may cross a firewall.

  • Between the Site Controller and the target servers there may be a firewall.

  • Between the Global Controller and external systems there may also be a firewall.

Scalability considerations

There are multiple dimensions to be considered:

  1. Number of concurrent users accessing the system (performance numbers are given below)

  2. Number of concurrent operations (deployments, registrations, cleanups, etc.)

  3. Total number of devices under management (servers, switches, etc.)

  4. Total number of sites

Concurrent users performance considerations

The number of concurrent users the system can serve scales with the performance of the Kubernetes cluster. Storage IOPS is the most important factor, followed by CPU performance.

The following are typical results of running the JMeter benchmark against a deployment with the recommended settings. The test was configured with 100 concurrent threads (users).

[Figure: Benchmark performance]

This is to be interpreted as follows:

Given a MetalSoft deployment with the recommended settings, 100 users can use the application at the same time and expect a response time of around 1 second.
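To relate the figures above to a request rate, one option is Little's Law for a closed system: throughput is the number of concurrent users divided by the sum of response time and think time. The sketch below is only an illustration. The function name and the think-time parameter are assumptions, and this guide does not state whether the benchmark threads pause between requests.

```python
def steady_state_throughput(concurrent_users: int,
                            response_time_s: float,
                            think_time_s: float = 0.0) -> float:
    """Estimate requests per second using Little's Law for a closed system.

    think_time_s is a hypothetical pause between a user's requests; the
    guide does not say what value (if any) the benchmark used.
    """
    cycle_time_s = response_time_s + think_time_s
    if cycle_time_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return concurrent_users / cycle_time_s


# 100 users, ~1 s response time, no think time (assumed).
no_pause = steady_state_throughput(100, 1.0)

# Same users, but each pauses 4 s between requests (assumed value).
with_pause = steady_state_throughput(100, 1.0, think_time_s=4.0)
```

Under the zero-think-time assumption, 100 users at about 1 second per response corresponds to roughly 100 requests per second. Real users who pause between actions generate proportionally less load.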

Concurrent deployments performance considerations

The number of deployments or other operations the system can perform at the same time depends on the type of operation:

  1. Server deployments depend on:

  • disk 4K random read/write IOPS during the image-building process (primary consideration)

  • OOB network performance

  • connectivity between the Global Controller and the Site Controller

With the recommended settings, the system should be able to handle around 100 parallel operations. Disk space is also an important consideration: at peak, image packaging consumes almost 6x the size of the ISO, so there must be enough free space for all concurrent builds. The system balances tasks automatically and proceeds with some tasks while others are still waiting, which minimizes wait times.
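The disk space requirement above can be estimated with simple arithmetic. The following is a minimal sketch, not part of MetalSoft: the function names are hypothetical, the 6x multiplier is used as a conservative upper bound for the "almost 6x" figure, and the 2 GiB ISO size is an assumed example value.

```python
import math


def peak_build_space_gib(iso_size_gib: float,
                         concurrent_builds: int,
                         peak_multiplier: float = 6.0) -> float:
    """Estimate peak disk space needed when all builds peak together.

    peak_multiplier defaults to 6.0, an upper bound for the "almost 6x"
    figure quoted in this guide.
    """
    if iso_size_gib < 0 or concurrent_builds < 0:
        raise ValueError("sizes and counts must be non-negative")
    return iso_size_gib * peak_multiplier * concurrent_builds


def max_concurrent_builds(free_space_gib: float,
                          iso_size_gib: float,
                          peak_multiplier: float = 6.0) -> int:
    """Return how many builds fit in the given free space at peak usage."""
    per_build_gib = iso_size_gib * peak_multiplier
    if per_build_gib <= 0:
        raise ValueError("ISO size and multiplier must be positive")
    return math.floor(free_space_gib / per_build_gib)


# Assumed example: a 2 GiB ISO and 100 parallel builds.
needed_gib = peak_build_space_gib(2.0, 100)

# Assumed example: how many such builds fit in 100 GiB of free space.
fits = max_concurrent_builds(100.0, 2.0)
```

This is a worst-case estimate, since it assumes every build reaches its peak at the same moment. Because the system staggers tasks, actual consumption may be lower.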

Total number of devices

The total number of devices under management influences the system’s performance because it increases the amount of traffic on the monitoring system. This traffic is asynchronous and should not affect the overall performance of the MetalSoft application; however, it may saturate the network connection between the Global Controller and the Site Controllers.
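A rough way to check whether monitoring traffic could approach the capacity of the link between a Site Controller and the Global Controller is to multiply the device count by the size and frequency of each sample. The sketch below is only an illustration: the function name, the payload size, and the polling interval are all assumptions, since this guide does not state what MetalSoft actually sends.

```python
def monitoring_bandwidth_mbps(devices: int,
                              bytes_per_sample: int,
                              interval_s: float) -> float:
    """Estimate average monitoring traffic in megabits per second.

    bytes_per_sample and interval_s are hypothetical inputs; measure them
    on a real deployment before relying on the result.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    bits_per_second = devices * bytes_per_sample * 8 / interval_s
    return bits_per_second / 1_000_000


# Assumed example: 1000 devices, 1500 bytes per sample, one sample a minute.
estimate_mbps = monitoring_bandwidth_mbps(1000, 1500, 60)
```

Comparing the estimate against the available bandwidth of each site's uplink gives a first indication of headroom. Traffic bursts and protocol overhead are not modeled here.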

Total number of sites

Each Site Controller opens a single HTTPS connection to the Global Controller, so the number of supported sites should be at least 1000 and likely many more. The component that takes on this load is the ingress controller (Traefik).