Servers overview

MetalSoft supports full lifecycle management of physical servers from Dell, HPE, Lenovo and Supermicro.

Lifecycle stages

A server goes through the following stages during its lifetime:

  1. Zero Touch Discovery: The system detects new equipment on the network, assigns IPs to it, logs in and configures the BMC users. It supports either factory passwords or pre-configured passwords for a given serial number.

  2. Registration (Enrollment): The system discovers hardware components (including GPUs), assigns server types to servers with identical configuration, and determines which switch and port each network link is connected to. It configures the BMC, BIOS and TPM chip. It also securely erases the drives during this process.

  3. Provisioning: The system configures the RAID, deploys the operating system, and configures the network and storage settings of the OS to match the switch network configuration.

  4. Firmware upgrades: The system coordinates custom catalog-based firmware upgrades depending on the operating system being deployed, the server type and the vendor.

  5. Cleanup/decommissioning: When a server is released back to the pool or decommissioned, the system erases the hard drives by using the SED drive’s encryption key change facility or, if that is not available, by writing zeros to the drives.

General guidelines when registering servers

Important

Server registration should only start after switch registration is functioning correctly, either via ZTP or manually, because the system interacts with switches during server registration.

Warning

Note that the normal registration process is destructive: the server drives are erased. Use the Unmanaged servers process if the data on the servers needs to be preserved.

When adding new equipment straight from the factory, the best option is to set up the zero touch provisioning process, as it is configured once and then reused without further input apart from racking and cabling the equipment. However, it is equally fine to add servers that already have a configured BMC IP, username and password. The system will take over the BIOS and other BMC configuration, so the final result is the same.

After the registration process has finished, the servers will be in the “unavailable” state and powered off, and their switch ports will be shut down. This represents a successful registration. When adding many servers, it is important to check their assigned server types.

Determining cabling and other hardware issues

MetalSoft assigns server types automatically based on the total CPU core count, RAM quantity and number of disks. Thus an M.40.256.12 could signify a server with a single CPU with 40 threads (20 hyper-threaded cores), 256 GB of RAM and 12 disks.

Note that the system will also append a v2, v3, v4… suffix to the server type if it detects a delta on other parameters such as a different number of NICs or GPUs.
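The naming scheme above can be split into its parts programmatically, which helps when comparing many servers. The following is a minimal sketch, not MetalSoft code: the function parse_server_type and the ServerType tuple are hypothetical, and the way the v2, v3… suffix is attached to the name (here accepted as ".v2", "_v2", "-v2" or "v2") is an assumption, since this page does not specify the separator.

```python
import re
from typing import NamedTuple, Optional


class ServerType(NamedTuple):
    family: str             # leading letter(s), e.g. "M"
    cpu: int                # CPU count field, e.g. 40
    ram_gb: int             # RAM in GB, e.g. 256
    disks: int              # number of disks, e.g. 12
    variant: Optional[int]  # 2 for a "v2" suffix, None when there is no suffix


# Assumed layout: <family>.<cpu>.<ram>.<disks> plus an optional variant suffix.
# The separator before "v<N>" is a guess, so several forms are accepted.
_PATTERN = re.compile(r"^([A-Za-z]+)\.(\d+)\.(\d+)\.(\d+)(?:[._-]?v(\d+))?$")


def parse_server_type(name: str) -> ServerType:
    """Split a server type name into its numeric fields (hypothetical helper)."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised server type: {name!r}")
    family, cpu, ram, disks, variant = match.groups()
    return ServerType(family, int(cpu), int(ram), int(disks),
                      int(variant) if variant else None)
```

With this sketch, parse_server_type("M.40.256.12") yields 40, 256 and 12 for the three numeric fields and no variant, while a name carrying a v2 suffix yields a variant of 2.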

The system will also use LLDP and actively send packets through network links to the switch (from the server) to determine which switch port each server port is connected to. If the LLDP packets fail to arrive at the switch port, the NIC port will show up as “not connected” and the system will assign a v2 server type to an otherwise identical server. This makes it easy to spot cabling issues.

If all servers being registered in a batch are supposed to be the same, the server types should all be the same. Investigate all outliers and re-register them after the issue has been resolved. Usually this requires a cable replacement or a NIC card to be re-seated.
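When a batch is large, the outliers can be found mechanically once the assigned server types have been exported. The sketch below assumes the types are already available as a mapping from server name to server type; the server names and the find_outliers helper are illustrative and not part of MetalSoft.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_outliers(server_types: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return the most common server type and the servers that differ from it."""
    if not server_types:
        raise ValueError("no servers given")
    # The type shared by most servers is taken as the expected one.
    expected, _ = Counter(server_types.values()).most_common(1)[0]
    outliers = sorted(name for name, stype in server_types.items()
                      if stype != expected)
    return expected, outliers


# Made-up batch: one server received a different type than the rest.
batch = {
    "rack1-srv01": "M.40.256.12",
    "rack1-srv02": "M.40.256.12",
    "rack1-srv03": "M.40.256.12.v2",  # e.g. a NIC port reported as not connected
    "rack1-srv04": "M.40.256.12",
}
expected, outliers = find_outliers(batch)
print(expected, outliers)
```

Taking the majority type as the reference is a design choice that only works when most of the batch registered correctly; if the batch is split evenly, compare against the type you expect instead.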

We recommend using the CLI to more quickly spot differences between servers as well as cabling issues, especially if operating with more than a few servers.

For example, one can quickly detect differences in hardware by passing the --show-hardware parameter to the server list command.

Registering servers that cannot be touched

If certain servers need to be added to the infrastructure and the MetalSoft inventory but cannot be touched in any way, they can be registered as Unmanaged servers.

Troubleshooting a stuck server registration process

The server registration process can take up to 30 minutes, and the duration varies considerably with the hardware. On older hardware the system might boot a utility operating system on the server rather than just use the BMC.

If the process takes more than 1 hour, investigate the AFC graph by clicking the graph link on the server overview page, or check the Events section.

The most common issues are:

  1. A communication error between the agent and the server, usually due to a network or firewall misconfiguration. Verify that all the agent-related firewall ports are open.

  2. A stuck BMC chip. This typically requires a hard reset by unplugging the server completely from power, waiting a few minutes and plugging it back in.
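For the first issue, a quick TCP reachability test run from the host where the agent runs can help narrow down a firewall problem. This is a generic sketch using only the Python standard library. It is not a MetalSoft tool, the helper names are hypothetical, and the list of ports to test must come from your own deployment's firewall requirements, since this page does not list them.

```python
import socket
from typing import Dict, Iterable


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host: str, ports: Iterable[int],
                timeout: float = 3.0) -> Dict[int, bool]:
    """Test each TCP port on the host and report which ones accept connections."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, check_ports("192.0.2.10", [443]) returns a dictionary with one entry per port, where the address and the port are placeholders. The check covers TCP only and reflects the network path from the machine it runs on, so a port that shows as open from a workstation may still be blocked for the agent.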