Prometheus

Scout can integrate with Prometheus Node Exporter, review the Configuration section for guidance on configuring the Scout integration.

Alerts

If you use the Prometheus Alert Manager you can define alerts as below, of course you will need to tweak to your specific environment but this might be a good starting point.

- alert: HeartbeatCritical
  expr: choria_machine_nagios_watcher_last_run_seconds{name="heartbeat"} < time() - 150
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.name }} checks are not active on {{ $labels.instance }}"
    description: "Node {{ $labels.instance }} {{ $labels.name }}"
    value: WARNING

- alert: CheckUnknown
  expr: choria_machine_nagios_watcher_status==3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} in WARNING state on {{ $labels.instance }}"
    description: "Node {{ $labels.instance }} {{ $labels.name }}"
    value: WARNING

- alert: CheckCritical
  expr: choria_machine_nagios_watcher_status==2
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.name }} in CRITICAL state on {{ $labels.instance }}"
    description: "Node {{ $labels.instance }} {{ $labels.name }}"
    value: CRITICAL

- alert: CheckWarning
  expr: choria_machine_nagios_watcher_status==1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.name }} in WARNING state on {{ $labels.instance }}"
    description: "Node $labels.instance $labels.name"
    value: WARNING

Dashboards

A basic dashboard is available on the Grafana Plugins Repository.

Scout Checks

Data

Here is a sample of the data that will be published to Prometheus:

# HELP choria_machine_nagios_start_time Time the Choria Machine subsystem started in unix seconds
# TYPE choria_machine_nagios_start_time gauge
choria_machine_nagios_start_time 1594911782
# HELP choria_machine_nagios_watcher_status Choria Nagios Check Status
# TYPE choria_machine_nagios_watcher_status gauge
choria_machine_nagios_watcher_status{name="check_mailq",status="OK"} 0
# HELP choria_machine_nagios_watcher_last_run_seconds Choria Nagios Check Time
# TYPE choria_machine_nagios_watcher_last_run_seconds gauge
choria_machine_nagios_watcher_last_run_seconds{name="check_mailq"} 1594915444
# HELP choria_machine_nagios_watcher_checks_count Choria Nagios Check Count
# TYPE choria_machine_nagios_watcher_checks_count counter
choria_machine_nagios_watcher_checks_count{name="check_mailq"} 14