Alerts and Notifications in Netdata
Introduction to Alerting
Netdata provides a distributed, real-time health monitoring framework that evaluates conditions against your metrics and executes actions on state transitions. You can configure notifications as one of these actions.
Unlike traditional monitoring systems, Netdata evaluates alerts at multiple levels simultaneously: on the edge (Agents) and at aggregation points (Parents). Netdata Cloud then deduplicates the resulting transitions. This allows your teams to implement different alerting strategies at different infrastructure levels.
Understanding Alerts in Netdata
Netdata alerts function as component-level watchdogs. You attach them to specific components/instances (network interfaces, database instances, web servers, containers, processes) where they evaluate metrics at configurable intervals.
To simplify your configuration, you can define alert templates once and apply them to all matching components. The system matches instances by host labels, instance labels, and names, allowing you to define the same alert multiple times with different matching criteria.
Each alert provides a name, value, unit, and status - making them easy to display in dashboards and send as meaningful notifications regardless of your infrastructure's complexity.
Where Your Alerts Run
Your alerts evaluate at the edge. Every Netdata Agent and Parent runs alerts on the metrics it processes and stores (enabled by default, but you can disable alerting at any level). When you stream metrics to a Parent, the Parent evaluates its own alerts on those metrics independently of the child's alerts. Each Agent maintains its own alert configuration and evaluates alerts autonomously. Metric streaming doesn't propagate alert configurations or transitions to Parents.
┌─────────┐     Metrics      ┌──────────┐     Metrics      ┌──────────┐
│  Child  │ ───────────────> │ Parent 1 │ ───────────────> │ Parent 2 │
│  Agent  │     of child     │  Agent   │    of child +    │  Agent   │
└────┬────┘                  └────┬─────┘     Parent 1     └────┬─────┘
     │                            │                             │
     │ Evaluates alerts on        │ Evaluates alerts on         │ Evaluates alerts on
     │ local metrics              │ child + local metrics       │ all streamed + local
     │                            │                             │
     ▼                            ▼                             ▼
   Alerts                       Alerts                        Alerts
Alert Actions and Notifications
Your Netdata Agents treat notifications as actions triggered by alert status transitions. Agents can dispatch notifications or perform automation tasks like scaling services, restarting processes, or rotating logs. Actions are shell scripts or executable programs that receive all alert transition metadata from Netdata.
When you claim Agents to Netdata Cloud, they send their alert configurations and transitions to Cloud, which deduplicates them (merging multiple transitions from different Agents for the same host). Netdata Cloud triggers notifications centrally through its integrations (Slack, Microsoft Teams, Amazon SNS, PagerDuty, OpsGenie).
Netdata Cloud's intelligent deduplication works by:
- Consolidating multiple Agents reporting the same alert
- Prioritizing highest severity: CRITICAL > WARNING > CLEAR
- Creating unique keys: Alert name + Instance + Node
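The three rules above can be sketched in a few lines of Python. This is an illustrative model only, not Netdata Cloud's actual implementation: the `Transition` type, its field names, and the `deduplicate` helper are assumptions made for the example.

```python
from dataclasses import dataclass

# Severity ranking taken from the list above: CRITICAL > WARNING > CLEAR.
SEVERITY = {"CLEAR": 0, "WARNING": 1, "CRITICAL": 2}


@dataclass(frozen=True)
class Transition:
    """Hypothetical shape of an alert transition reported by one Agent."""
    alert: str      # alert name
    instance: str   # component instance, e.g. a disk or an interface
    node: str       # node the alert is about
    status: str     # CLEAR, WARNING or CRITICAL
    reporter: str   # Agent or Parent that reported the transition


def deduplicate(transitions):
    """Keep one status per (alert, instance, node), preferring the highest severity."""
    merged = {}
    for t in transitions:
        key = (t.alert, t.instance, t.node)
        current = merged.get(key)
        if current is None or SEVERITY[t.status] > SEVERITY[current]:
            merged[key] = t.status
    return merged


# Two reporters disagree about web1; only one reports web2.
reports = [
    Transition("disk_space_usage", "/data", "web1", "WARNING", "web1"),
    Transition("disk_space_usage", "/data", "web1", "CRITICAL", "parent1"),
    Transition("disk_space_usage", "/data", "web2", "CLEAR", "parent1"),
]
result = deduplicate(reports)
```

In this sketch the two reports for web1 collapse into a single CRITICAL entry, while web2 keeps its own key and status.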
Your Agents and Netdata Cloud trigger actions independently using their own configurations and integrations.
This design enables you to:
- Maintain team independence: Different teams run their own Parents with custom alerts
- Implement edge intelligence: Critical alerts trigger automations directly on nodes
- Scale naturally: Alert evaluation distributes with your infrastructure
- Mix strategies: Combine edge, regional, and central alerting
Quick Example
Web Server (Child):
- Alert: system CPU > 80% triggers scale out
- Alert: process X memory > 90% restarts process X
DevOps Parent:
- Alert: Response time > 500ms across all web servers
- Alert: Error rate > 1% for any service
SRE Parent:
- Alert: Anomaly detection on traffic patterns
- Alert: Capacity planning thresholds
Netdata Cloud:
- Receives all alert transitions
- Deduplicates overlapping alerts
- Shows CRITICAL if any instance reports CRITICAL
- Provides unified view for incident response
Each level operates independently while Netdata Cloud provides a coherent, deduplicated view of your entire infrastructure's health (when all agents connect directly to Cloud).
Managing Alert Configuration
You configure Netdata alerts in 3 layers:
- Stock Alerts: Netdata provides hundreds of alert definitions in the stock health.d/ directory to detect common issues. Don't edit these directly - updates will overwrite your changes.
- Your Custom Alerts: Create your own definitions in your Netdata config directory under health.d/.
- Dynamic UI Configuration: Use Netdata dashboards to edit, add, enable, or disable alerts on any node through the streaming transport.
Config paths vary by install prefix. Run sudo ./edit-config health.d/<file> from your Netdata config directory to resolve the correct user path automatically, or check the [directories] section of netdata.conf (keys health config and stock health config) for exact locations.
Managing Notification Configuration
You can configure notifications for any infrastructure node at 3 levels:
| Level | What It Evaluates | Where Notifications Come From | Use Case | Documentation |
|---|---|---|---|---|
| Netdata Agent | Local Metrics | Netdata Agent | Edge automation | Agent integrations |
| Netdata Parent | Local and Children Metrics | Netdata Parent | Edge automation | Agent integrations |
| Netdata Cloud | Receives Transitions | Netdata Cloud | Web-hooks, role/room based | Cloud integrations |
When using Parents and Cloud with default settings, you may receive duplicate email notifications. Agents send emails by default when an MTA exists on their systems. Disable email notifications on Agents and Parents when using Cloud by setting SEND_EMAIL="NO" in health_alarm_notify.conf, edited with sudo ./edit-config health_alarm_notify.conf from your Netdata config directory.
Best Practices for Large Deployments
Central Alerting Strategy
When you:
- Don't need edge automation (no scripts reacting to alerts)
- Use highly available Parents for all nodes
- Use Netdata Cloud (at least for Parents)
Follow these steps:
- Disable health monitoring on child nodes
- Share the same alert configuration across Parents (use git repo or CI/CD)
- Disable Parent notifications (SEND_EMAIL="NO" in health_alarm_notify.conf)
- Keep only Cloud notifications
This emulates traditional monitoring tools where you configure alerts centrally and dispatch notifications centrally.
Edge Flexible Alerting Strategy
When you:
- Need edge automation (scale out, restart processes)
- Use Parents
- Use Cloud for all nodes
Follow these steps:
- Disable stock alerts on children (set enable stock health configuration to no in the [health] section of netdata.conf; edit with sudo ./edit-config netdata.conf)
- Configure only automation-required alerts on children
- Keep stock alerts on Parents but disable notifications (SEND_EMAIL="NO")
- Keep only Cloud notifications
This enables edge automation on children while maintaining central alerting control and deduplicated Cloud notifications.
Set Up Alerts via Netdata Cloud
- Connect your nodes to Netdata Cloud
- Navigate to: Space → Notifications
- Choose your integration (Slack, Amazon SNS, Splunk)
- Configure alert severity filters
Set Up Alerts via Netdata Agent
- Open notification config:
sudo ./edit-config health_alarm_notify.conf
- Enable your method (example: email):
SEND_EMAIL="YES"
DEFAULT_RECIPIENT_EMAIL="you@example.com"
- Verify your system can send mail (sendmail, SMTP relay)
- Restart the agent:
sudo systemctl restart netdata
Core Alerting Concepts
Netdata supports two alert types:
- Alarms: Attach to specific instances (specific network interface, database instance)
- Templates: Apply to all matching instances (all network interfaces, all databases)
Alert Lifecycle and States
Your alerts produce more than threshold checks. Each generates:
- A value: Combines metrics or other alerts using time-series lookups and expressions
- A unit: Makes alerts meaningful ("seconds", "%", "requests/s")
- A name: Identifies the alert
This enables sophisticated alerts like:
- out of disk space time: 450 seconds - Predicts when disk fills based on current rate
- 3xx redirects: 12.5 percent - Calculates redirects as percentage of total
- response time vs yesterday: 150% - Compares current to historical baseline
Alert States
Your alerts exist in one of these states:
| State | Description | Trigger |
|---|---|---|
| CLEAR | Normal - conditions exist but not triggered | Warning and critical conditions evaluate to zero |
| WARNING | Warning threshold exceeded | Warning condition evaluates to non-zero |
| CRITICAL | Critical threshold exceeded | Critical condition evaluates to non-zero |
| UNDEFINED | Cannot evaluate | No conditions defined, or value is NaN/Inf |
| UNINITIALIZED | Never evaluated | Alert just created |
| REMOVED | Alert deleted | Child disconnected, agent exit, or health reload |
Alerts transition freely between states based on:
- Calculated value (including NaN, Inf, or valid numbers)
- Warning/critical conditions (evaluation results)
- External events (disconnections, reloads, exits)
Key behaviors:
- Alerts can jump directly from CLEAR to CRITICAL (no WARNING required)
- WARNING and CRITICAL conditions evaluate independently
- Alerts return to the appropriate state when data becomes available
- CRITICAL takes precedence when both conditions are true
Alert Evaluation Process
1. Calculate Value
Your alerts perform complex calculations:
   lookup              calc            warn,crit           status
┌──────────┐       ┌──────────┐       ┌──────────┐      ┌───────────┐
│ Database │       │Expression│       │ Warning  │      │  Execute  │
│  Query   │──────>│Processor │──────>│ Critical │ ───> │ Action on │
│(optional)│ $this │(optional)│ $this │  Checks  │      │Transition │
└──────────┘       └──────────┘       └──────────┘      └───────────┘
Examples:
# Simple threshold
calc: $used
# Result: $this = latest value of dimension 'used'
# Time-series lookup
lookup: average -1h of used
# Result: $this = average of 'used' over last hour
# Combined calculation
lookup: average -1h of used
calc: $this * 100 / $total
# Result: $this = percentage of hourly average vs total
# Baseline comparison
lookup: average -1h of used
calc: $this * 100 / $average_yesterday
# Result: $this = percentage vs yesterday's average
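To make the combined calculation concrete, here is a rough Python model of what lookup and calc do together. The sample values and the `lookup_average` helper are made up for illustration; the real lookup queries the Netdata database rather than a Python list.

```python
def lookup_average(samples):
    """Model of 'lookup: average': the mean of the samples in the query window."""
    return sum(samples) / len(samples)


# Hypothetical 'used' samples from the last hour and a fixed 'total'.
used_last_hour = [40.0, 50.0, 60.0]
total = 200.0

# lookup: average -1h of used
this = lookup_average(used_last_hour)   # $this is now the hourly average

# calc: $this * 100 / $total
this = this * 100 / total               # $this is now a percentage of total
```

The point of the sketch is the hand-off: calc receives the lookup result as $this and overwrites it with its own result.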
2. Evaluate Conditions
After calculating, check conditions:
# Simple conditions
warn: $this > 80
crit: $this > 90
# Flapping prevention
warn: ($status >= $WARNING) ? ($this > 50) : ($this > 80)
crit: ($status == $CRITICAL) ? ($this > 70) : ($this > 90)
# Complex conditions
warn: $this > 80 AND $rate > 10
crit: $this > 90 OR $failures > 5
3. Determine State
Each condition evaluates to:
- NaN or Inf → UNDEFINED
- Non-zero → RAISED
- Zero → CLEAR
Final status:
- Critical RAISED → CRITICAL (priority)
- Warning RAISED → WARNING
- Either CLEAR → CLEAR
- Both missing/UNDEFINED → UNDEFINED
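The rules above can be written as a small decision function. This is a sketch of the logic as described in this section, not code taken from Netdata; the function names are hypothetical, and `None` stands in for a condition that is not defined.

```python
import math


def evaluate_condition(value):
    """Map a condition's numeric result to UNDEFINED, RAISED or CLEAR."""
    if value is None or math.isnan(value) or math.isinf(value):
        return "UNDEFINED"
    return "RAISED" if value != 0 else "CLEAR"


def final_status(warn_value, crit_value):
    """Combine the warning and critical results; critical has priority."""
    warn = evaluate_condition(warn_value)
    crit = evaluate_condition(crit_value)
    if crit == "RAISED":
        return "CRITICAL"
    if warn == "RAISED":
        return "WARNING"
    if "CLEAR" in (warn, crit):
        return "CLEAR"
    return "UNDEFINED"
```

For example, `final_status(1, 1)` yields CRITICAL because the critical check is tested first, and `final_status(None, 0)` yields CLEAR because one defined condition evaluated to zero.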
Evaluation Timing
Alert evaluation runs independently from data collection:
Data Collection         Alert Evaluation
       │                        │
       ▼ every 1s               ▼ configurable interval
   [Metrics] ──────────> [Alert Engine]
                                │
                                ▼
                         Query metrics,
                         Calculate values,
                         Check conditions
- Default interval: Query window duration (with lookup) or manual setting required
- Configurable: Use every for custom intervals
- Constrained: Cannot evaluate faster than data collection frequency
Anti-Flapping Mechanisms
Netdata prevents alert flapping through:
1. Hysteresis
warn: ($status < $WARNING) ? ($this > 80) : ($this > 50)
Triggers at 80, clears at 50, preventing flapping between 50-80.
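A short simulation shows the effect of this expression. It is a simplified model that tracks only CLEAR and WARNING; the status codes come from the Built-in Variables table, and the sample values are invented.

```python
CLEAR, WARNING = 1, 2  # status codes as listed under Built-in Variables


def warn_condition(status, this):
    """warn: ($status < $WARNING) ? ($this > 80) : ($this > 50)"""
    return this > 80 if status < WARNING else this > 50


def simulate(values):
    """Feed successive values through the warn expression and record each status."""
    status = CLEAR
    history = []
    for this in values:
        status = WARNING if warn_condition(status, this) else CLEAR
        history.append(status)
    return history


history = simulate([70, 85, 60, 55, 45, 70])
```

The values 60 and 55 keep the alert in WARNING because it was already raised, while the final 70 does not raise it again because the alert had cleared at 45.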
2. Dynamic Delays
Alerts transition immediately in dashboards but notifications use exponential backoff.
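The idea of exponential backoff can be illustrated as follows. This is a generic sketch, not Netdata's notification code: the function, its parameter names (`base`, `multiplier`, `max_delay`) and the numbers are assumptions chosen for the example.

```python
def notification_delays(base, multiplier, max_delay, transitions):
    """Return the delay (seconds) applied to each successive transition of a flapping alert."""
    delays = []
    delay = base
    for _ in range(transitions):
        delays.append(min(delay, max_delay))  # never wait longer than the cap
        delay *= multiplier                   # grow the delay for the next transition
    return delays


delays = notification_delays(base=10, multiplier=2, max_delay=60, transitions=5)
```

Each additional transition waits longer before a notification goes out, up to the cap, so a flapping alert produces progressively fewer notifications.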
3. Duration Requirements
lookup: average -10m of used
warn: $this > 80
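Because the condition tests a 10-minute average, a short spike above 80 does not trigger the alert; usage must stay high long enough to pull the whole window's average above 80.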
Multi-Stage Alerts
Create dependent alerts:
# Stage 1: Baseline
template: requests_average_yesterday
on: web_log.requests
lookup: average -1h at -1d
every: 10s
# Stage 2: Current
template: requests_average_now
on: web_log.requests
lookup: average -1h
every: 10s
# Stage 3: Compare
template: web_requests_vs_yesterday
on: web_log.requests
calc: $requests_average_now * 100 / $requests_average_yesterday
units: %
every: 10s
warn: $this > 150 || $this < 75
crit: $this > 200 || $this < 50
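The arithmetic of the third stage can be checked with a small Python model. The helper names are hypothetical, and the inputs stand in for the values the first two templates would produce.

```python
def requests_vs_yesterday(average_now, average_yesterday):
    """calc: $requests_average_now * 100 / $requests_average_yesterday"""
    return average_now * 100 / average_yesterday


def classify(this):
    """Apply the crit and warn expressions from stage 3; crit wins when both are true."""
    if this > 200 or this < 50:
        return "CRITICAL"
    if this > 150 or this < 75:
        return "WARNING"
    return "CLEAR"
```

With yesterday's average at 100 requests/s, a current average of 120 stays CLEAR, 160 raises WARNING, and 40 raises CRITICAL because traffic has dropped below half of the baseline.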
Available Variables
Variables resolve in order (first match wins):
1. Built-in Variables
| Variable | Description | Value |
|---|---|---|
| $this | Current calculated value | Result from lookup/calc |
| $after | Query start timestamp | Unix timestamp |
| $before | Query end timestamp | Unix timestamp |
| $now | Current time | Unix timestamp |
| $last_collected_t | Last collection time | Unix timestamp |
| $update_every | Collection frequency | Seconds |
| $status | Current status code | -2 to 3 |
| $REMOVED | Status constant | -2 |
| $UNINITIALIZED | Status constant | -1 |
| $UNDEFINED | Status constant | 0 |
| $CLEAR | Status constant | 1 |
| $WARNING | Status constant | 2 |
| $CRITICAL | Status constant | 3 |
2. Dimension Values
| Syntax | Description | Example |
|---|---|---|
| $dimension_name | Last normalized value | $used |
| $dimension_name_raw | Last raw collected value | $used_raw |
| $dimension_name_last_collected_t | Collection timestamp | $used_last_collected_t |
template: disk_usage_percent
on: disk.space
calc: $used * 100 / ($used + $available)
units: %
3. Chart Variables
calc: $used > $threshold # If chart defines 'threshold'
4. Host Variables
warn: $connections > $max_connections * 0.8 # If host defines 'max_connections'
5. Other Alerts
# Alert 1
template: cpu_baseline
calc: $system + $user
# Alert 2
template: cpu_check
calc: $system
warn: $this > $cpu_baseline * 1.5
6. Cross-Context References
template: disk_io_vs_iops
on: disk.io
calc: $reads / ${disk.iops.reads}
units: bytes per operation
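The "first match wins" order can be pictured as a chain of lookups. The sketch below models only that ordering: the `resolve` helper and the dictionaries are hypothetical stand-ins for scopes 1 to 5, and cross-context references are left out.

```python
def resolve(name, *scopes):
    """Return the value from the first scope that defines name (first match wins)."""
    for scope in scopes:
        if name in scope:
            return scope[name]
    raise KeyError(name)


# Hypothetical contents of each scope, listed in resolution order.
builtin = {"this": 42.0, "status": 1}
dimensions = {"used": 30.0, "available": 70.0}
chart_vars = {"threshold": 90.0, "used": -1.0}  # this 'used' is shadowed by the dimension
host_vars = {"max_connections": 500}
other_alerts = {"cpu_baseline": 12.5}

scopes = (builtin, dimensions, chart_vars, host_vars, other_alerts)
used = resolve("used", *scopes)
```

Here `used` resolves to the dimension value, because dimensions are consulted before chart variables.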
Variable Resolution and Label Scoring
When alerts reference variables matching multiple instances, Netdata uses label similarity scoring:
- Collect candidates with matching names
- Score by labels - count common labels
- Select best match - highest label overlap
Example: Alert on disk.io (labels: device=sda, mount=/data) references ${disk.iops.reads}:
- disk.iops for sda (labels match) → Score: 2
- disk.iops for sdb (no match) → Score: 0
Result: Uses sda's value
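The scoring in this example can be reproduced with a few lines of Python. This is a simplified model of the idea, not Netdata's implementation; the function names and the sdb labels are assumptions.

```python
def label_score(alert_labels, candidate_labels):
    """Count the label key/value pairs shared by the alert's instance and a candidate."""
    return sum(
        1 for key, value in alert_labels.items()
        if candidate_labels.get(key) == value
    )


def best_match(alert_labels, candidates):
    """Pick the candidate name with the highest label overlap."""
    return max(candidates, key=lambda name: label_score(alert_labels, candidates[name]))


alert_labels = {"device": "sda", "mount": "/data"}
candidates = {
    "disk.iops for sda": {"device": "sda", "mount": "/data"},
    "disk.iops for sdb": {"device": "sdb", "mount": "/backup"},  # hypothetical labels
}
chosen = best_match(alert_labels, candidates)
```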
Missing Data Handling
During lookups with missing data:
- All values NULL: $this becomes NaN
- Some values exist: Ignores NULL, continues calculation
- Dimension doesn't exist: $this becomes NaN
This handles intermittent collection, dynamic dimensions, and partial outages.
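The three cases can be modeled with an average that skips gaps. In this sketch `None` inside the list stands for a NULL sample and `None` in place of the list stands for a dimension that does not exist; both conventions, and the helper name, are assumptions for illustration.

```python
import math


def lookup_average(values):
    """Average the samples that exist; return NaN when there is nothing to average."""
    if values is None:              # dimension doesn't exist
        return math.nan
    present = [v for v in values if v is not None]
    if not present:                 # all values NULL
        return math.nan
    return sum(present) / len(present)  # some values exist: NULLs are ignored
```

A NaN result then leads to the UNDEFINED state, as described under Alert States.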
Evaluation Frequency
Determine frequency by:
- With lookup: Defaults to window duration
lookup: average -5m # Evaluates every 5 minutes
- Without lookup: Set explicitly
every: 10s
calc: $system + $user
- Custom interval: Override default
lookup: average -1m
every: 10s # Check every 10s despite 1m window
Constraints:
- Cannot exceed data collection frequency
- High frequency impacts performance
- Use larger intervals with unaligned for efficiency
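The interval rules in this section can be summarized as a small function. It is a sketch of the rules as stated here, with all durations in seconds; the function name is hypothetical, and clamping to the collection frequency is just one way to model the first constraint.

```python
def evaluation_interval(lookup_window=None, every=None, update_every=1):
    """Return the seconds between evaluations under the rules above."""
    if every is not None:
        interval = every              # explicit 'every' overrides the default
    elif lookup_window is not None:
        interval = lookup_window      # default: the lookup window duration
    else:
        raise ValueError("without a lookup, 'every' must be set explicitly")
    # An alert cannot be evaluated faster than its data is collected.
    return max(interval, update_every)
```

For instance, `evaluation_interval(lookup_window=60, every=10)` returns 10, matching the "Check every 10s despite 1m window" case.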
Troubleshooting Your Alerts
Netdata Assistant
The Netdata Assistant provides AI-powered troubleshooting when alerts trigger:
- Click the alert in your dashboard
- Press the Assistant button
- Receive customized troubleshooting tips
The Assistant window follows you through dashboards for easy reference while investigating.
Missing or No Stock Alerts
If your node has no stock alerts (the built-in alerts that ship with Netdata), check these common causes in order. The commands below use default install paths; adjust them if your Netdata was installed with a non-standard prefix (see Managing Alert Configuration above).
1. Health monitoring disabled entirely
When enabled = no is set in the [health] section of netdata.conf, the Agent stops evaluating all alerts.
Check:
grep 'enabled' /etc/netdata/netdata.conf
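No output means the setting uses its default value (yes), so health monitoring is enabled. Note that this grep matches enabled in any section of netdata.conf, so only a match inside the [health] section applies here.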
Restore: set enabled = yes in the [health] section (or remove the line to use the default), then restart the Agent:
sudo systemctl restart netdata
2. Stock health configuration disabled
The enable stock health configuration = no setting in the [health] section of netdata.conf disables all stock alerts while keeping custom alerts active.
Check:
grep 'enable stock health configuration' /etc/netdata/netdata.conf
Restore: set enable stock health configuration = yes (or remove the line to use the default), then restart the Agent — netdatacli reload-health does not reload netdata.conf:
sudo systemctl restart netdata
3. File shadowing
If a file in your user config directory has the same filename as a stock file (e.g., both contain cpu.conf), the stock file is completely ignored — only the user copy is loaded. If the user copy contains only a subset of the original alerts, the rest are missing.
This is different from overriding individual alerts by name. With file shadowing, you must include all alerts you want from that file. See Alert Configuration Ordering for the conceptual explanation.
Check — compare filenames between your user and stock directories:
comm -12 <(ls /etc/netdata/health.d/ | sort) <(ls /usr/lib/netdata/conf.d/health.d/ | sort)
Restore: if the user copy is no longer needed, remove it from your user health config directory:
sudo rm /etc/netdata/health.d/<filename>.conf
sudo netdatacli reload-health
If you need a modified version, ensure it includes all desired alerts from the stock file.
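If you prefer a script over the comm one-liner, the same filename comparison can be done in Python. The sketch assumes the default paths used above and treats a missing directory as empty; the `shadowed_files` helper is a hypothetical name.

```python
import os


def shadowed_files(user_dir, stock_dir):
    """Return the filenames present in both directories, sorted."""
    user = set(os.listdir(user_dir)) if os.path.isdir(user_dir) else set()
    stock = set(os.listdir(stock_dir)) if os.path.isdir(stock_dir) else set()
    return sorted(user & stock)


if __name__ == "__main__":
    # Default paths from the commands above; adjust for your install prefix.
    for name in shadowed_files("/etc/netdata/health.d",
                               "/usr/lib/netdata/conf.d/health.d"):
        print(name)
```

Every filename it prints is a user file that fully replaces the stock file of the same name.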
4. Dynamic UI/API configuration override
Editing an alert through the Cloud dashboard or Agent UI creates a dynamic configuration (DynCfg) override that replaces the file-based definition. The override persists even if the underlying file changes.
Restore: use the Reset to default option in the UI for each affected alert, or remove the dynamic config via the API. See Overriding Stock Alerts for full override documentation.
Verify stock alerts are active
After any fix, confirm the number of active alerts:
curl -s "http://localhost:19999/api/v1/alarms?all" | jq '.alarms | to_entries[].value.name' | sort -u | wc -l
A healthy Netdata Agent typically has hundreds of stock alerts. If the count is very low, one of the causes above may still apply.
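The same count can be computed in Python, which is convenient inside a larger health-check script. The sketch relies only on the response shape implied by the jq filter above (an alarms object whose values carry a name); the sample payload and its alert names are invented for illustration.

```python
import json


def count_alert_names(payload):
    """Count distinct alert names in an /api/v1/alarms?all response body."""
    alarms = json.loads(payload).get("alarms", {})
    return len({entry["name"] for entry in alarms.values()})


# Invented sample in the shape implied by the jq filter; not real Agent output.
sample = json.dumps({
    "alarms": {
        "system.cpu.10min_cpu_usage": {"name": "10min_cpu_usage", "status": "CLEAR"},
        "disk_space._.disk_space_usage": {"name": "disk_space_usage", "status": "CLEAR"},
        "disk_space._data.disk_space_usage": {"name": "disk_space_usage", "status": "WARNING"},
    }
})
unique_names = count_alert_names(sample)
```

To run it against a live Agent, pass the body returned by `urllib.request.urlopen("http://localhost:19999/api/v1/alarms?all").read()` to `count_alert_names`.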
Community Resources
Visit our Alerts Troubleshooting space for complex issues. Get help through GitHub or Discord. Share your solutions to help others.
Customizing Alerts
Tune alerts for your environment by adjusting thresholds, writing custom conditions, silencing alerts, and using statistical functions.