christoph ender's

blog

friday the 10th of february, 2023

icinga cluster check

If all satellites of a non-master zone go offline at once (for example, because the only connection to the zone has gone down), there are initially no notifications, since there's no entity left that could relay messages to the parent/master zone.

This is where Icinga's “cluster-zone” check comes in:

apply Service "satellite-zone-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s
  vars.cluster_zone = "child-zone-name"
  assign where match("master*", host.name)
}

The cluster-zone check can be assigned to the master/parent nodes of a child zone. It checks whether the child zone can still relay messages to the parent, and complains if it can't.
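For context, here is a minimal sketch of the zone layout such a check assumes. The endpoint names and the zone name "master" are made up for illustration; only "child-zone-name" is taken from the example above. The value of vars.cluster_zone has to match the name of the child Zone object.

```
// zones.conf (sketch; host names are hypothetical)
object Endpoint "master1.example.org" {
  host = "master1.example.org"
}

object Zone "master" {
  endpoints = [ "master1.example.org" ]
}

object Endpoint "satellite1.example.org" {
  host = "satellite1.example.org"
}

// The child zone names the master zone as its parent.
// This zone name is what vars.cluster_zone refers to.
object Zone "child-zone-name" {
  endpoints = [ "satellite1.example.org" ]
  parent = "master"
}
```

With a layout like this, the Host object for master1.example.org matches the "master*" pattern in the assign rule, so the check runs on the master and looks at its connection to the child zone.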

Since agents are implemented as single endpoints in their own zone, the cluster-zone check can also be applied to agents of a zone:

apply Service "agent-health" {
  check_command = "cluster-zone"
  display_name = "agent-health-" + host.name
  vars.cluster_zone = host.name
  assign where host.zone == "current-zone" && host.vars.agent_endpoint && !match("master*", host.name)
}

The agent check is applied to all non-master, non-satellite hosts in a zone which have an agent assigned. As with the zone-based check, this check will complain when an assigned agent cannot relay messages any more.
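A sketch of what an agent host matching that rule might look like. The host name, address and the explicit zone attribute are assumptions for illustration (in many setups the zone is derived from the zones.d directory instead). The important part is that the agent's Zone object carries the same name as the Host object, since the check uses host.name as vars.cluster_zone.

```
// Hypothetical agent; endpoint and zone share the host's name
object Endpoint "agent1.example.org" {
}

object Zone "agent1.example.org" {
  endpoints = [ "agent1.example.org" ]
  parent = "current-zone"
}

object Host "agent1.example.org" {
  check_command = "hostalive"
  address = "192.0.2.10"            // placeholder address
  zone = "current-zone"             // assumption: zone set explicitly
  vars.agent_endpoint = name        // marks the host as having an agent
}
```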