Icinga cluster check
If all satellites of a non-master zone go offline at once (for example, because the only connection to the zone has gone down), no notifications are sent at first, since no entity is left that could relay messages to the parent/master zone.
This is where Icinga's “cluster-zone” check comes in:
apply Service "satellite-zone-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s
  vars.cluster_zone = "child-zone-name"
  assign where match("master*", host.name)
}
The cluster-zone check can be assigned to the master/parent nodes of a child zone. It checks whether the child zone is still connected and able to relay messages to the parent, and reports a problem if it is not.
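Besides the plain connection state, the cluster-zone check command also accepts optional thresholds for replay log lag via the custom variables cluster_lag_warning and cluster_lag_critical. The following is a minimal sketch of a variant of the rule above that sets them; the threshold values are illustrative assumptions, not recommendations from the original setup:

```
// Variant of the zone health rule with optional log lag thresholds.
// It is meant to replace the earlier rule, not to be loaded alongside it.
apply Service "satellite-zone-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s
  vars.cluster_zone = "child-zone-name"

  // Assumed example values: warn when the child zone lags more than
  // one minute behind, go critical after five minutes.
  vars.cluster_lag_warning = 1m
  vars.cluster_lag_critical = 5m

  assign where match("master*", host.name)
}
```

Without these variables the check only reports on the connection state, which is enough for the "zone went offline" case described here.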
Since agents are implemented as single endpoints in their own zone, the cluster-zone check can also be applied to agents of a zone:
apply Service "agent-health" {
  check_command = "cluster-zone"
  display_name = "agent-health-" + host.name
  vars.cluster_zone = host.name
  assign where host.zone == "current-zone" && host.vars.agent_endpoint && !match("master*", host.name)
}
The agent check is applied to all non-master, non-satellite hosts in a zone that have an agent assigned. As with the zone-based check, it reports a problem when the assigned agent can no longer relay messages.
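The agent rule relies on each agent having a zone named exactly like its host object, because vars.cluster_zone is set to host.name. A minimal sketch of the objects this assumes is shown below; the host name agent1.example.com and the address 192.0.2.10 are hypothetical placeholders, and host.vars.agent_endpoint is the custom variable the rule above filters on:

```
// Agent endpoint and its single-endpoint zone, typically in zones.conf.
// The zone name must match the host name for the agent-health rule to work.
object Endpoint "agent1.example.com" {
  host = "192.0.2.10"  // hypothetical address
}

object Zone "agent1.example.com" {
  endpoints = [ "agent1.example.com" ]
  parent = "current-zone"
}

// Host object, assumed to live under zones.d/current-zone/ so that
// host.zone evaluates to "current-zone".
object Host "agent1.example.com" {
  check_command = "hostalive"
  address = "192.0.2.10"

  // Marks the host as having an agent; matched by the apply rule.
  vars.agent_endpoint = name
}
```

With this layout the rule creates one agent-health service per agent host, each checking the connection to that agent's own zone.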