Skip to content

AI Ops overview

ELI (now at version 2.0) is MetalSoft’s AI component. It is designed to support admins and users in day-to-day activities, reduce toil and improve efficiency. As an consolidated system, MetalSoft leverages clean cross-function data and enables agents with capabilities unmatched by other solutions.

CapabilityDescriptionStrategic Value
Tenant Infrastructure Operation and RightsizingTenants can use Natural Language to describe what they need. This is an agentic system in the sense that it can implement a multi-step process to implement the desired intentThis, connected with Service Now or a Service Desk system could implement the user’s requirements optimally.
Admin Operation and TroubleshootingAdmins can use the AI assistant to quickly retrieve information from the Database or from devicesThis helps reduce toil and learning curves.
Root Cause AnalsyisIf enabled ELI can perform automatic troubleshooting when an event is matched.This helps expedite ticket resolution, MTTR and generally reduce toil.

ELI offers multiple services:

  1. A user-level chat interface that can be used to build or modify tenant infrastructures.
  2. An admin-level chat interface that can be used to perform troubleshooting and generally to interact with both the MetalSoft database and with MetalSoft-managed equipment (readonly for the moment)
  3. Automatic RCA (Root Cause Analysis) that, if enabled, will perform a troubleshooting session that goes beyond the traditional methods of RCA and actively plans and executes a troubleshooting process.

There are multiple types of access ELI uses:

  1. ELI in the infrastructure editor uses the user’s access level. It is no different than using the API key to access the API. Only the respective functions will be accesable.
  2. ELI in the Admin uses the Admin user’s access.
  3. ELI in the troubleshooting process uses readonly admin access.

An important advantage of using MetalSoft in conjunction with AI is that our System maintains accurate System of Record for all relevant information such as that the AI can use directly:

MCP ToolDescriptionStrategic Value to an agent
Customer intentStores the desired end-state for all infrastructure elements.Critical: Detects drift and suggests automated fixes.
Inventory dataProvides a real-time, accurate catalog of available hardware and resources.Optimizes resource allocation for specific use cases.
Topology informationMapping of physical cabling and high-level logical interconnects.Visualizes dependencies for rapid troubleshooting.
Log & Event AccessReal-time stream of equipment logs and system events.Ingests live telemetry to diagnose incidents.
MetalSoft MetadataAccess to server profiles, network policies, and system states.Performs impact analysis before executing actions.
Live Protocol DataDirect access to live switch info (LLDP, BGP, etc.).Provides ground truth for protocol-level repair.

This data access allows the Agent to make proper planning in the case of troubleshooting, compare with what is actually expected, determine anomalies and more.

It does not however have access to:

  1. Credentials
  2. PII data

MetalSoft provides safe access to equipment to the agents by interposing our deterministic services between the agent and the devices. The agents never have access to the credentials of the devices or access to execute arbitrary commands. For example to be able to retrieve the logs of a switch the Agent calls an MCP tool called “retrieve logs from switch x”.

Generally the agents have only read-only access from an Admin perspective. The agents do not have access to:

  1. Deploy tenant infrastructures. It can design them but the human must press the deploy button.
  2. Deploy underlay or other switch changes.
  3. Delete servers or effect changes on the servers, storage or other equipment at the Admin level.

ELI can be connected to any model that is thinking and can use tools such as those provided by OpenAI or Google. ELI expects an OpenAI-compatible interface.

ELI can be deployed fully on-prem, air-gapped environments, leveraging an open-source model such as qwen3:235b or Mistral. At the minimum 2x H100 NVIDIA GPUs are required. MetalSoft provides all of the the required components for a fully dedicated setup but a shared Enterprise-wide AI model inference setup can also be leveraged.

Inside the Infrastructure editor you can use the new Agent manage, optimize, rightsize infrastructure elements using natural language. For example:

  1. “Create a kubernetes cluster for 100 cores, select the appropriate server types”
  2. “Create a server group with the smallest available bare metal server”
  3. “Create a Terraform manifest for implementing this infrastructure”

Everything that can be done via the UI can also be done via the new agent.

The Admin agent has access to a subset of the functionality available via the API.

These skills can be combined in any way in complex prompts such as: get me all the vlans configured on ports where LLDP neighbors show a switch or what is the mac address of the Management0 switch connected to switch 15?

Eli will execute a multi-stage process to retrieve the needed information and finally combine it into a single result etc. Eli is able to iterate of elements, do complex logic and more.

prompt example: what switch ports are admin up on switch 5? output:

The following switch ports are admin up on switch 5:
ethernet1/1/2
ethernet1/1/13:2
ethernet1/1/13:4
ethernet1/1/14:2
ethernet1/1/14:4
ethernet1/1/15:4
ethernet1/1/7
ethernet1/1/11
ethernet1/1/15:1
ethernet1/1/15:2
ethernet1/1/15:3
ethernet1/1/5
ethernet1/1/6
ethernet1/1/8
ethernet1/1/12
ethernet1/1/13:1
ethernet1/1/14:1
ethernet1/1/1
ethernet1/1/13:3
ethernet1/1/14:3

Getting the VLANs configurations on a switch

Section titled “Getting the VLANs configurations on a switch”

prompt example: get all vlans configured on switch 15

Ports with VLAN 5 configured on switch 15: Eth1/2/2, Eth1/1/1, Eth1/2/3, Eth1/2/4, Eth1/4/2, Eth1/3/3, Eth1/1/2, PortChannel256, Eth1/4/1, Eth1/3/4, Eth1/1/3, Eth1/2/1, Eth1/3/1, Eth1/1/4, Eth1/4/4, Eth1/3/2, Eth1/4/3

another prompt example:is there any port that is down configured with vlan 5 on switch 15?

Analysis result
The following down ports are configured with VLAN 5 on switch 15:
Eth1/1/3
Eth1/3/3
Eth1/2/1
Eth1/4/2
Eth1/1/2
Eth1/2/3
Eth1/2/4
Eth1/3/2
Eth1/1/1
Eth1/3/4
Eth1/4/3
Eth1/2/2
Eth1/4/4
Eth1/3/1
Eth1/1/4
Eth1/4/1

prompt example:get the bgp sessions of switch 5 output:

BGP Sessions for switch 5:
Session 1: Neighbor Name: 192.168.122.1 Enabled: true Active: true Uptime Seconds: 1203847756 State: established Last Error: Reset by peer

prompt example: is there any lldp neighbour of type switch on switch 15?

LLDP Neighbors Analysis Results
The following LLDP neighbors of type 'switch' were found on switch 15:
Neighbor Name: sw1-dell-s5232f
Neighbor Chassis MAC Address: c4:5a:b1:48:e4:85
Neighbor Name: US_CHG_QTS01_01_MJ40_ML43_01
Neighbor Chassis MAC Address: d8:94:03:96:5c:fb
Neighbor Name: sw1-dell-s5232f
Neighbor Chassis MAC Address: c4:5a:b1:48:e4:85
Neighbor Name: US_CHG_QTS01_01_MJ40_MGMT_01
Neighbor Chassis MAC Address: d0:7e:28:ec:3b:e8
Neighbor Name: US_CHG_QTS01_01_MJ40_ML43_01
Neighbor Chassis MAC Address: d8:94:03:96:5c:fb
  1. Getting a switch log prompt examples: get the switch logs of switch 5 output:
Thu Jun 27 22:25:26 2024 informational - LACP_PORT_UNGROUPED port-channel30 : ethernet1/1/4
Thu Jun 27 22:25:26 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Thu Jun 27 22:25:53 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Thu Jun 27 22:25:54 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Thu Jun 27 22:25:55 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Thu Jun 27 22:26:10 2024 informational - VLT_PORT_CHANNEL_UP 30 is up
Thu Jun 27 22:26:11 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Thu Jun 27 22:26:11 2024 informational - LACP_PORT_GROUPED port-channel30 : ethernet1/1/4
Thu Jun 27 22:26:11 2024 informational - LACP_PORT_UNGROUPED port-channel30 : ethernet1/1/4
Thu Jun 27 22:26:11 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Thu Jun 27 22:26:11 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Thu Jun 27 22:26:11 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Thu Jun 27 22:26:12 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Thu Jun 27 22:26:27 2024 informational - VLT_PORT_CHANNEL_UP 30 is up
Thu Jun 27 22:30:13 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Thu Jun 27 22:30:15 2024 informational - IFM_OSTATE_UP :port-channel30
Thu Jun 27 22:30:15 2024 informational - LACP_PORT_GROUPED port-channel30 : ethernet1/1/4
Thu Jun 27 22:30:15 2024 informational - VLT_PORT_CHANNEL_UP 30 is up
Fri Jun 28 01:29:05 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Fri Jun 28 01:29:06 2024 informational - IFM_OSTATE_DN :port-channel30
Fri Jun 28 01:29:06 2024 informational - LACP_PORT_UNGROUPED port-channel30 : ethernet1/1/4
Fri Jun 28 01:29:06 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Fri Jun 28 01:29:34 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Fri Jun 28 01:29:34 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Fri Jun 28 01:29:35 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Fri Jun 28 01:29:50 2024 informational - VLT_PORT_CHANNEL_UP 30 is up
Fri Jun 28 01:29:51 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Fri Jun 28 01:29:51 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Fri Jun 28 01:29:51 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Fri Jun 28 01:29:52 2024 informational - IFM_OSTATE_DN :ethernet1/1/4
Fri Jun 28 01:29:53 2024 informational - IFM_OSTATE_UP :ethernet1/1/4
Fri Jun 28 01:30:08 2024 informational - VLT_PORT_CHANNEL_UP 30 is up
Fri Jun 28 01:33:50 2024 informational - VLT_PORT_CHANNEL_DOWN 30 is down
Fri Jun 28 01:33:52 2024 informational - LACP_PORT_GROUPED port-channel30 : ethernet1/1/4
Fri Jun 28 01:33:52 2024 informational - IFM_OSTATE_UP :port-channel30
Fri Jun 28 01:33:53 2024 informational - VLT_PORT_CHANNEL_UP 30 is up

Prompt example: is there any error in the log for switch 15?

Note that this part is subject to change.

The topology information is available to the AI component in the form of a graph with nodes. Note that the AI might confuse lldp-based toplogy with the known topology. When questioned if switch 5 is connected to server 15 it will try an lldp query first. To force it to use the known topology use explicit prompts:

prompt: based on the topology what are the peers of switch 15?

ELI can also be automatically invoked when an equipment-generated syslog event is triggered. To enable this behaviour check the “Send to ELI” checkbox in the Global Configurations -> Device Alerts for the rules that need to invoke the agent.

The Alert emails that will sent our will then contain the output of the investigation. Note that this will delay somewhat the alert from being received with perhaps a few minutes.