Troubleshooting guide

The following are some scenarios that you might encounter while operating MetalSoft in production:

  1. Server not receiving DHCP replies

    If the servers are powered on and PXE is enabled, but the server gets no reply from the DHCP server on any interface, check the following:

    a. Check whether the server has been added to the list in the servers table. If so, a request has been received from the server but the DHCP process is stuck midway.

    b. Check in the DHCP tab whether the DHCP server receives Discover packets. If not, check that the server interfaces are connected to the correct switch ports and that those ports are configured in VLAN 5.

    c. Check the DHCP relay setting on the switch and verify the destination IP. It should be the IP of the DC agent. In this example the record is:

    dhcp relay server-address 172.16.10.6
    

    d. Log in to the router and run tcpdump on the input interface:

    tcpdump -i ens224 port 67 or port 68 -e -n -vv
    

    You should see both Discover and Offer packets. If you only see Discover, the DHCP traffic reaches the router and the agent replies, but the replies do not reach the server.

    e. Try to ping the VSI from the router:

    root@bsirouter-hpe01:~# ping 172.16.0.1
    PING 172.16.0.1 (172.16.0.1) 56(84) bytes of data.
    64 bytes from 172.16.0.1: icmp_seq=1 ttl=255 time=0.918 ms
    ^C
    --- 172.16.0.1 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.918/0.918/0.918/0.000 ms
    

    If there is no reply, check the static route on the router that points to the quarantine VSI:

    ip r add 172.16.0.0/16 via 172.16.10.2 dev ens224
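
The Discover/Offer comparison from the tcpdump step can be scripted. The following is a minimal sketch, not part of MetalSoft: count_dhcp_types is a hypothetical helper that counts DHCP message types in verbose tcpdump text read from standard input. It assumes tcpdump prints the message type on an option line containing "DHCP-Message" followed by ": Discover" or ": Offer"; the exact wording can vary between tcpdump versions.

```bash
# Count DHCP Discover and Offer messages in verbose tcpdump output on stdin.
# Assumption: each packet's message type appears on a line such as
#   "DHCP-Message Option 53, length 1: Discover"
# Example (captures 20 packets, then prints the summary):
#   tcpdump -i ens224 -c 20 -n -vv port 67 or port 68 | count_dhcp_types
count_dhcp_types() {
  awk '
    /DHCP-Message/ && /: Discover/ { discover++ }
    /DHCP-Message/ && /: Offer/    { offer++ }
    END { printf "discover=%d offer=%d\n", discover, offer }
  '
}
```

A result with discover above zero and offer at zero corresponds to the "only Discover" case described above.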
    
  2. Discovery kernel (BDK) booted but server stuck in registering

    a. The BDK might not be able to reach the DC agents from the quarantine network. Check the router’s routing table.

    b. The datacenter agent might not be able to download assets from the master repository. Check that the repoURLMaster is reachable from the datacenter agent machine.

    The datacenter config object has two records, repoURLRoot and repoURLRootQuarantineNetwork. Both must be reachable by the agents (from the agents’ machine) and must have valid SSL certificates. During registration, servers pull the assets from the datacenter agent, not directly from the repository. The tftp/http agent pulls assets from these repositories at runtime and caches them.

    Pull any asset from the repository on the dc agents machine:

    root@uk-metalsoft-poc-dcagents01  ~$ curl https://<repo>/extra/BSIAgent/releases/agentVersions.json
    

    c. The BDK does not resolve a DNS record: <repo> Name or service unknown. Check the DNSServers record in the datacenter configuration. Make sure it lists an authoritative DNS server that will resolve the provided record.

    d. To log in to the discovery kernel during an auto-registration process, set the BDK auto-reboot flag on the server and then use the BMC KVM facility to log in to the BDK.

    e. Check the agent’s logs. They are available at /opt/Agent_log/TFTP*.log

    f. Check the AFC logs. They show errors such as OOB interface not reachable after configuration.
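
The repository reachability check in step b can be repeated for several URLs at once. This is a sketch under assumptions: check_repo_urls is a hypothetical helper, the 10-second timeout is arbitrary, and the URLs to pass are whatever repoURLRoot and repoURLRootQuarantineNetwork contain in your datacenter configuration. Because curl verifies certificates by default, an invalid SSL certificate is reported as a failure as well.

```bash
# Report whether each URL given as an argument can be fetched over HTTP(S).
# --fail makes curl exit non-zero on HTTP errors (4xx/5xx); certificate
# verification stays enabled, so a bad certificate also counts as a failure.
# Example (run on the DC agents machine):
#   check_repo_urls "https://<repo>/extra/BSIAgent/releases/agentVersions.json"
check_repo_urls() {
  local url rc=0
  for url in "$@"; do
    if curl --fail --silent --show-error --max-time 10 --output /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}
```

The function returns a non-zero status if any URL fails, so it can also be used in a script condition.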

  3. DC agent configuration uses different settings than those configured in the datacenter object

Reinitialize the agents after any change in the datacenter configuration.

    cd /opt/
    cd agents/
    export DCNAME="https://api.poc.metalsoft.io/api/url?rqi=br.
    docker-compose down && docker-compose up -d

The agent configuration is available at /opt/BSIAgentsVolume.

Check if the datacenter agents are running:

    docker exec -it dc-agents supervisorctl status
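
To spot stopped agents quickly, the status output can be filtered. The sketch below assumes the usual supervisorctl status layout, where each line starts with the program name followed by its state; not_running is a hypothetical helper name.

```bash
# Print the names of supervisor programs whose state is not RUNNING.
# Reads "supervisorctl status" text on stdin, where each line starts with
# "<name> <STATE> ...". Returns 1 if any program is not running.
# Example (-it is dropped because the output is piped, not interactive):
#   docker exec dc-agents supervisorctl status | not_running
not_running() {
  awk 'NF >= 2 && $2 != "RUNNING" { print $1; bad = 1 } END { exit bad }'
}
```

An empty result with exit status 0 means every listed program reported RUNNING.
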
  4. Volume template replication does not start

    a. If a volume_templates_pre_replicate_to_all_storages() AFC fails with "Timed out while waiting for handshake", check the route towards the storage in-band connection:

    root@bsirouter:~$ ip route | grep 100.96
    100.96.0.0/16 via 172.16.10.2 dev ens224 
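
Several checks in this guide come down to confirming that the router has a static route for a given prefix. A minimal sketch of that check follows; has_route is a hypothetical helper that matches only an exact prefix and gateway in "ip route" output, so it will not notice a broader route that happens to cover the destination.

```bash
# Succeed if "ip route" text on stdin has a route for PREFIX via GATEWAY.
# Example, using the prefix and gateway shown above:
#   ip route | has_route 100.96.0.0/16 172.16.10.2 || echo "route missing"
has_route() {
  local prefix=$1 gateway=$2
  awk -v p="$prefix" -v g="$gateway" '
    $1 == p && $2 == "via" && $3 == g { found = 1 }
    END { exit !found }
  '
}
```

To see which route the kernel would actually choose for a single destination, "ip route get" followed by the address is an alternative.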
    
  5. Provisioning stuck in waitForSSHInstances stage

Note that it is normal for this stage to retry a number of times. However, if this takes more than a few minutes for an iSCSI boot, check the following:

a. Log in to the server using the KVM interface. If the server seems stuck in the PXE stage, cycling through interfaces, while the DHCP tab shows the DHCP server sending DHCP offers to the server, check that the router is able to reach the server SAN interface:

```bash
root@bsirouter:~$ ip r | grep 100.64
100.64.0.0/22 via 172.16.10.2 dev ens224 proto static 
```

b. If the server seems to have booted but the root password that is supposed to be configured on the server doesn’t work, the server’s WAN interface is not reachable via the router, and the DHCP logs show “offers”, then most likely the DHCP replies from the agent do not reach the server on the WAN interface. Check the route towards the dummy subnet:

```bash
root@bsirouter:~$ ip r | grep 172.24.4
172.24.4.0/22 via 172.16.10.2 dev ens224 proto static 
```

c. Check the ACLs on the switch port. They might block traffic from any of the working subnets. In the following example, traffic to 172.16.0.0/16 is blocked by ACL 3399 rule 30 (deny ip destination 172.16.0.0 0.15.255.255), which prevents the server from reaching the datacenter agent and initializing itself with cloud-init.

```
<HP5900-H1060>display ip routing-table 10.255.226.2

Summary Count : 2

Destination/Mask    Proto  Pre  Cost         NextHop         Interface
0.0.0.0/0           Static 1    0            172.16.10.1     GE1/0/48
10.255.226.0/30     Direct 0    0            10.255.226.1    Vlan100

<HP5900-H1060>display current-configuration interface vlan100
#
interface Vlan-interface100
description Vlan-interface100_329_WAN
ip address 172.24.4.1 255.255.255.252
ip address 10.255.226.1 255.255.255.252 sub
packet-filter filter route
packet-filter 3399 inbound
packet-filter ipv6 3399 inbound
dhcp select relay
dhcp relay information circuit-id string Vlan-interface100_329_WAN
dhcp relay information enable
dhcp relay information remote-id sysname
dhcp relay server-address 172.16.10.6
ipv6 address FD1F:8BBB:56B3:2000::1/64
#
return

<HP5900-H1060>display acl 3399
Advanced ACL  3399, named IPV4_WAN_GENERIC, 6 rules,
ACL's step is 10
rule 10 deny ip destination 10.0.0.0 0.255.255.255
rule 20 deny ip destination 100.64.0.0 0.63.255.255
rule 30 deny ip destination 172.16.0.0 0.15.255.255
rule 40 deny ip destination 169.254.0.0 0.0.255.255
rule 50 deny ip destination 192.168.0.0 0.0.255.255
rule 65534 permit ip

<HP5900-H1060>
```
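
To work out which ACL rule a given destination hits, the wildcard match can be reproduced by hand. The sketch below assumes the usual wildcard-mask semantics, where bits set in the wildcard are ignored and all other bits must equal the rule's address; ip_to_int and acl_matches are hypothetical helper names.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed if ADDRESS matches an ACL entry given as NETWORK plus WILDCARD mask.
# Bits set in the wildcard are ignored; all other bits must be equal.
# Example, the datacenter agent address against rule 30:
#   acl_matches 172.16.10.6 172.16.0.0 0.15.255.255 && echo "denied by rule 30"
acl_matches() {
  local addr net wild care
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  wild=$(ip_to_int "$3")
  care=$(( ~wild & 0xFFFFFFFF ))
  (( (addr & care) == (net & care) ))
}
```

With the listing above, the relay target 172.16.10.6 matches rule 30, which is why packets from the server towards the datacenter agent are dropped on this interface.
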
  6. Server boots the BDK during a local install but does nothing afterwards, especially after a very long provisioning process

This might be caused by:

  1. The BDK is unable to reach the agent machine. Log in to the BDK and check that the agent machine is reachable.

  2. The agent did not serve the next stage of the install process. Log in to the BDK machine and check the /tmp/preos* log files. If there is a mention of expiry, use the DHCP regenerate button in the interface to regenerate the DHCP configuration, then reboot.

This is because the DHCP server caches the configuration URL, and if a provisioning takes a long time (hours) the URL might expire to prevent replay attacks.
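
The expiry check on the pre-OS logs can be done with a single grep. This is a sketch under assumptions: find_expiry_logs is a hypothetical helper, the BDK is assumed to provide a shell with grep, and the exact wording of the expiry message is not known here, so the pattern simply looks for any word starting with "expir".

```bash
# List pre-OS log files that mention an expiry (case-insensitive).
# DIR defaults to /tmp, where the preos* log files are located.
# Assumption: the relevant log lines contain a word starting with "expir".
# Returns non-zero if no file matches.
find_expiry_logs() {
  local dir=${1:-/tmp}
  grep -il 'expir' "$dir"/preos* 2>/dev/null
}
```

If the function prints a file name, inspect that file before regenerating the DHCP configuration and rebooting.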