# Netdata Integration

### Overview

[Netdata](https://www.netdata.cloud/) is an open-source, real-time infrastructure monitoring tool that collects thousands of metrics at per-second resolution across servers, containers, databases, and applications.

This integration listens for alerts dispatched by the Netdata Agent via its **custom webhook notification** method and turns them into actionable alerts on the platform. Alerts are automatically resolved when Netdata sends a `CLEAR` notification for the same alarm.

### Integration Flow

1. The Netdata Agent continuously evaluates health rules against the metrics it collects.
2. When an alarm's state transitions (for example `CLEAR → WARNING`, `WARNING → CRITICAL`, or `CRITICAL → CLEAR`), the Agent invokes `alarm-notify.sh`.
3. `alarm-notify.sh` calls the `custom_sender()` function that you configure in `health_alarm_notify.conf`.
4. That function builds a JSON payload from the alarm context and posts it to the platform endpoint.
5. The platform classifies the event by `status`: any value other than `CLEAR` opens or updates an alert, and `CLEAR` resolves it.
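
The classification rule in step 5 can be sketched as a small shell function. This only illustrates the rule as described above and is not the platform's implementation; the function name `classify_status` and the `resolve`/`alert` labels are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the rule "CLEAR resolves, everything else opens
# or updates". classify_status and its output labels are hypothetical names.
classify_status() {
    local status="$1"
    if [ "${status}" = "CLEAR" ]; then
        echo "resolve"
    else
        echo "alert"
    fi
}

# Walk through the states sent during a typical alarm lifecycle.
for s in WARNING CRITICAL CLEAR; do
    printf '%s -> %s\n' "${s}" "$(classify_status "${s}")"
done
```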

### Webhook Payload Schema

```json
{
  "name": "string",
  "chart": "string",
  "family": "string",
  "status": "WARNING | CRITICAL | CLEAR | UNDEFINED | UNINITIALIZED",
  "old_status": "string",
  "value": "string",
  "old_value": "string",
  "value_string": "string",
  "old_value_string": "string",
  "units": "string",
  "info": "string",
  "when": "unix timestamp",
  "host": "string",
  "unique_id": "string",
  "alarm_id": "string",
  "event_id": "string",
  "duration": "string",
  "non_clear_duration": "string",
  "classification": "string",
  "component": "string",
  "type": "string",
  "severity": "string",
  "calc_expression": "string",
  "total_warnings": "string",
  "total_critical": "string",
  "src": "string",
  "goto_url": "string",
  "image_url": "string"
}
```

### Setup

#### Step 1 — Create an Alert Source on the Platform

1. Navigate to **Sources → Add Source**.
2. Search for **Netdata** and select it.
3. Give the source a name and click **Save**.
4. Copy the **ITOC360 URL** and **Token** — you will need them in Step 3. The full webhook URL has the form:

```
https://api.itoc360.app/functions/v1/events?token=<x-itoc360-token>
```

#### Step 2 — Access the Netdata Configuration

Netdata's notification settings live in `health_alarm_notify.conf`. How you reach this file depends on how you run Netdata.

**If Netdata runs in Docker:**

Copy the default configuration out of the container so you can edit it on your host:

```bash
docker exec netdata cat /usr/lib/netdata/conf.d/health_alarm_notify.conf > ./health_alarm_notify.conf
```

Open the file in your preferred editor. You will push the edited file back into the container in Step 4.

**If Netdata runs on a host (apt/yum install):**

Use the `edit-config` helper so the file ends up in the right location:

```bash
cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config health_alarm_notify.conf
```

#### Step 3 — Add the Custom Sender Block

Append the following block to the **end** of `health_alarm_notify.conf`. Replace the webhook URL with the one you copied in Step 1.

```bash
#------------------------------------------------------------------------------
# itoc360 custom webhook notifications
#------------------------------------------------------------------------------

SEND_CUSTOM="YES"
DEFAULT_RECIPIENT_CUSTOM="sysadmin"

custom_sender() {
    local webhook_url="https://api.itoc360.app/functions/v1/events?token=<x-itoc360-token>"

    local payload=$(cat <<EOF
{
  "name": "${name}",
  "chart": "${chart}",
  "family": "${family}",
  "status": "${status}",
  "old_status": "${old_status}",
  "value": "${value}",
  "old_value": "${old_value}",
  "src": "${src}",
  "duration": "${duration}",
  "non_clear_duration": "${non_clear_duration}",
  "units": "${units}",
  "info": "${info}",
  "when": "${when}",
  "host": "${host}",
  "unique_id": "${unique_id}",
  "alarm_id": "${alarm_id}",
  "event_id": "${event_id}",
  "calc_expression": "${calc_expression}",
  "total_warnings": "${total_warnings}",
  "total_critical": "${total_critical}",
  "classification": "${classification}",
  "component": "${component}",
  "type": "${type}",
  "severity": "${severity}",
  "value_string": "${value_string}",
  "old_value_string": "${old_value_string}",
  "image_url": "${image_url}",
  "goto_url": "${goto_url}"
}
EOF
)

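    # docurl is the curl wrapper provided by alarm-notify.sh; it prints the HTTP status code.
    local httpcode
    httpcode=$(docurl \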
        --request POST \
        --header "Content-Type: application/json" \
        --data "${payload}" \
        "${webhook_url}")

    if [ "${httpcode}" = "200" ] || [ "${httpcode}" = "201" ] || [ "${httpcode}" = "204" ]; then
        info "sent custom webhook notification for: ${host} ${chart}.${name} is ${status}"
        sent=$((sent + 1))
    else
        error "failed to send custom webhook notification for: ${host} ${chart}.${name} is ${status}, http code ${httpcode}"
    fi

    return 0
}
```

`SEND_CUSTOM="YES"` turns on the custom notification method, and `DEFAULT_RECIPIENT_CUSTOM="sysadmin"` ensures alerts routed to the `sysadmin` role reach this integration. Adjust the recipient role if your setup routes alerts elsewhere.
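
One caveat with the heredoc in `custom_sender()`: values are interpolated verbatim, so a double quote or backslash inside a field such as `${info}` would produce invalid JSON. A minimal sketch of an escaping helper follows, assuming bash. `json_escape` is a hypothetical name, not part of Netdata, and it covers only the most common characters rather than every control character.

```bash
# Hypothetical helper (not part of Netdata): escape a string for use inside
# a JSON string literal. Handles backslash, double quote, newline, CR, and tab.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}      # double quote
    s=${s//$'\n'/\\n}    # newline
    s=${s//$'\r'/\\r}    # carriage return
    s=${s//$'\t'/\\t}    # tab
    printf '%s' "${s}"
}

# Example: an alarm description that contains double quotes.
sample_info='disk "sda" is 95% full'
printf '{"info": "%s"}\n' "$(json_escape "${sample_info}")"
```

If you adopt something like this, define the helper above `custom_sender()` in the same file and write the affected fields in the heredoc as, for example, `"info": "$(json_escape "${info}")"`.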

Save the file.

#### Step 4 — Apply the Configuration

**If Netdata runs in Docker:**

```bash
docker cp ./health_alarm_notify.conf netdata:/etc/netdata/health_alarm_notify.conf
docker exec netdata chown netdata:netdata /etc/netdata/health_alarm_notify.conf
docker restart netdata
```

**If Netdata runs on a host:**

Changes made via `edit-config` take effect on the next alarm evaluation, but a restart forces a clean reload:

```bash
sudo systemctl restart netdata
```

#### Step 5 — Verify the Integration

Netdata ships with a built-in test command that dispatches three sample alarms in sequence (`WARNING`, `CRITICAL`, and `CLEAR`):

```bash
# Docker
docker exec -u netdata netdata bash -c \
  'export NETDATA_ALARM_NOTIFY_DEBUG=1 && /usr/libexec/netdata/plugins.d/alarm-notify.sh test'

# Host install
sudo su -s /bin/bash netdata
export NETDATA_ALARM_NOTIFY_DEBUG=1
/usr/libexec/netdata/plugins.d/alarm-notify.sh test
```

You should see three `# OK` lines in the output, one per alarm. On the platform, under the source you created in Step 1, a single alert should open on the `WARNING` event, stay open through the `CRITICAL` event, and automatically resolve when the `CLEAR` event arrives.
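
To check the result non-interactively, the `# OK` lines can be counted from the command output. This is a minimal sketch that assumes the test output is piped in on standard input; `count_ok_lines` is a hypothetical helper name, and the demonstration input is synthetic.

```bash
# Hypothetical helper: count lines that begin with "# OK" on standard input.
count_ok_lines() {
    grep -c '^# OK' || true   # grep exits non-zero on zero matches; still print 0
}

# Synthetic input for demonstration; real test output contains more lines.
printf '# OK\nother line\n# OK\n# OK\n' | count_ok_lines

# Usage against a Docker install (stderr is merged into the pipe):
#   docker exec -u netdata netdata \
#     /usr/libexec/netdata/plugins.d/alarm-notify.sh test 2>&1 | count_ok_lines
```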

### Sample Payloads

The following payloads were captured during integration testing.

**ALERT — WARNING state:**

```json
{
  "name": "test_alarm",
  "chart": "test.chart",
  "family": "",
  "status": "WARNING",
  "old_status": "CLEAR",
  "value": "100",
  "old_value": "90",
  "src": "/usr/libexec/netdata/plugins.d/alarm-notify.sh",
  "duration": "1",
  "non_clear_duration": "1",
  "units": "units",
  "info": "this is a test alarm to verify notifications work",
  "when": "1776694259",
  "host": "3b60a2cac47d",
  "unique_id": "1",
  "alarm_id": "1",
  "event_id": "1",
  "severity": "WARNING",
  "classification": "Test",
  "value_string": "new value",
  "old_value_string": "old value",
  "goto_url": "https://registry.my-netdata.io/registry-alert-redirect.html?host=3b60a2cac47d&chart=test.chart&alarm=test_alarm&alarm_status=WARNING"
}
```

**RESOLVE — CLEAR state:**

```json
{
  "name": "test_alarm",
  "chart": "test.chart",
  "family": "",
  "status": "CLEAR",
  "old_status": "CRITICAL",
  "value": "100",
  "old_value": "90",
  "src": "/usr/libexec/netdata/plugins.d/alarm-notify.sh",
  "duration": "1",
  "non_clear_duration": "3",
  "units": "units",
  "info": "this is a test alarm to verify notifications work",
  "when": "1776694259",
  "host": "3b60a2cac47d",
  "unique_id": "1",
  "alarm_id": "1",
  "event_id": "3",
  "severity": "Recovered from CRITICAL",
  "classification": "Test",
  "value_string": "new value",
  "old_value_string": "old value",
  "goto_url": "https://registry.my-netdata.io/registry-alert-redirect.html?host=3b60a2cac47d&chart=test.chart&alarm=test_alarm&alarm_status=CLEAR"
}
```

### Field Mapping Reference

| Payload Field                           | Description                                                                                      |
| --------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `status`                                | Alarm state — any value other than `CLEAR` opens or updates an alert, `CLEAR` resolves it        |
| `host`                                  | Hostname of the Netdata Agent — combined with `alarm_id` and `name` to fingerprint the alert     |
| `alarm_id`                              | Stable identifier for the alarm across state transitions                                         |
| `name`                                  | Name of the alarm rule — shown in the alert title                                                |
| `chart`                                 | Chart the alarm is attached to — shown in the alert title                                        |
| `info`                                  | Human-readable description of why the alarm fired                                                |
| `value` / `units`                       | Current metric value and its unit of measurement                                                 |
| `old_value` / `old_status`              | Previous metric value and alarm state                                                            |
| `severity`                              | Descriptive severity string, which may include escalation context (e.g. `Escalated to CRITICAL`) |
| `classification` / `component` / `type` | Alarm taxonomy fields set in the alarm definition                                                |
| `when`                                  | Unix timestamp of the state change                                                               |
| `duration` / `non_clear_duration`       | How long the alarm has been in its current state, and how long since it was last clear           |
| `calc_expression`                       | Expression that was evaluated to produce the current value                                       |
| `total_warnings` / `total_critical`     | Total warning and critical alarms currently active on the host                                   |
| `goto_url`                              | Direct link to the alarm in the Netdata dashboard                                                |
| `unique_id` / `event_id`                | Identifiers used internally by Netdata for this specific transition                              |

### Priority Mapping

The platform maps Netdata's `status` field to an internal priority level.

| Netdata `status` | Platform Priority                                           |
| ---------------- | ----------------------------------------------------------- |
| `CRITICAL`       | CRITICAL                                                    |
| `WARNING`        | MEDIUM                                                      |
| `UNDEFINED`      | LOW                                                         |
| `UNINITIALIZED`  | LOW                                                         |
| `CLEAR`          | LOW (this event resolves the alert rather than opening one) |

The `severity` field in the payload (for example `Escalated to CRITICAL` or `Recovered from CRITICAL`) is a human-readable description that Netdata composes at notification time. The platform does not use it for priority routing — `status` is the authoritative field.
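
The mapping in the table can be expressed as a simple lookup. The sketch below is illustrative only: `priority_for_status` is a hypothetical name, and the `UNKNOWN` fallback is an assumption, since the table does not say how unlisted values are handled.

```bash
# Illustrative lookup mirroring the priority table.
# The UNKNOWN fallback for unlisted statuses is an assumption.
priority_for_status() {
    case "$1" in
        CRITICAL)                      echo "CRITICAL" ;;
        WARNING)                       echo "MEDIUM" ;;
        UNDEFINED|UNINITIALIZED|CLEAR) echo "LOW" ;;
        *)                             echo "UNKNOWN" ;;
    esac
}

# Print the priority for every status listed in the table.
for s in CRITICAL WARNING UNDEFINED UNINITIALIZED CLEAR; do
    printf '%-14s %s\n' "${s}" "$(priority_for_status "${s}")"
done
```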

### RESOLVE Detection

The platform resolves an alert when the Netdata Agent sends a payload with `"status": "CLEAR"`. This is the standard notification that `alarm-notify.sh` emits when an alarm's condition no longer holds.

The resolve event is matched to the original alert using a fingerprint derived from `host`, `alarm_id`, and `name`. As long as those three fields stay the same across the firing and resolving notifications — which is the default Netdata behavior — the fingerprint will match and the alert will be closed.
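
The matching idea can be sketched by joining the three fields into a single key. The platform's actual fingerprint format is not documented on this page, so `alert_key` and the `|` separator are assumptions made purely for illustration; a real system would likely hash the result.

```bash
# Illustrative only: join host, alarm_id, and name into one matching key.
# alert_key and the "|" separator are assumptions, not the platform's format.
alert_key() {
    local host="$1" alarm_id="$2" name="$3"
    printf '%s|%s|%s' "${host}" "${alarm_id}" "${name}"
}

# Values taken from the WARNING and CLEAR sample payloads in this guide.
firing_key=$(alert_key "3b60a2cac47d" "1" "test_alarm")
clear_key=$(alert_key "3b60a2cac47d" "1" "test_alarm")

if [ "${firing_key}" = "${clear_key}" ]; then
    echo "keys match: the CLEAR event resolves the open alert"
fi
```

Note that `event_id` and `status` differ between the two sample payloads, which is why they are left out of the key.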

### Troubleshooting

**No payload arrives on the platform after a test:**

Check the Netdata error log for delivery failures:

```bash
# Docker
docker exec netdata tail -50 /var/log/netdata/error.log

# Host
sudo tail -50 /var/log/netdata/error.log
```

Look for `failed to send custom webhook notification` lines — they include the HTTP status code returned by the platform endpoint.
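
To summarize which status codes were returned, the matching lines can be filtered with `sed`. This sketch relies on the error message format used by the `custom_sender()` block in Step 3; `failed_http_codes` is a hypothetical helper name, and the demonstration line is synthetic.

```bash
# Hypothetical helper: read log lines on standard input and list the distinct
# HTTP status codes reported by the custom_sender() error message.
failed_http_codes() {
    sed -n 's/.*failed to send custom webhook notification.*http code \([0-9][0-9]*\).*/\1/p' \
        | sort -u
}

# Synthetic line built from the error format in Step 3, for demonstration.
printf '%s\n' \
    'failed to send custom webhook notification for: myhost test.chart.test_alarm is WARNING, http code 401' \
    | failed_http_codes

# Usage on a host install:
#   sudo tail -n 200 /var/log/netdata/error.log | failed_http_codes
```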

**Test runs but no `# OK` for the custom sender:**

Confirm the configuration was reloaded. For Docker, make sure you ran `docker restart netdata` after copying the file in. For host installs, confirm `SEND_CUSTOM="YES"` appears in the active file and that the file belongs to the `netdata` user:

```bash
sudo ls -la /etc/netdata/health_alarm_notify.conf
```

**Alerts are created but never resolve:**

The most common cause is a custom alarm that does not transition back to `CLEAR` cleanly — for example, if the alarm rule is deleted before the condition recovers. Check the Netdata dashboard's alerts view to confirm the alarm itself is actually clearing.


