Monitoring the Health of KurrentDB Clusters Part 3

A healthy KurrentDB cluster is vital for the reliability of your systems, but a dashboard only helps when someone is looking at it. Alerting closes that gap.

Having covered metrics-based monitoring in Part 1 and log-based monitoring in Part 2, this part turns those recommendations into Grafana-managed alert rules, with notifications delivered to Slack and email.

Implementing the Alerts with Grafana

The monitoring sections in the previous parts define what to watch and why; this section provides end-to-end instructions for implementing those recommendations as Grafana-managed alert rules. Dashboards answer the question “what is happening right now?” Alerting answers the more important operational question: “who finds out, and how quickly, when something goes wrong?”

All alert evaluation is performed by Grafana itself; no separate Prometheus Alertmanager deployment is required. The prerequisites below give the full checklist. The intended audience is the same as for the rest of this document: platform engineers, site reliability engineers, and database administrators operating KurrentDB clusters

Alerting glossary

Prometheus – An open-source monitoring system that collects (“scrapes”) metrics from HTTP endpoints at a regular interval and stores them as time series. KurrentDB exposes metrics in Prometheus format natively

PromQL – The Prometheus Query Language, used to select and aggregate time series. The alert rules in this section are written in PromQL

Scrape interval – How frequently Prometheus collects metrics from each target, typically 15 seconds

Grafana-managed alerting – Grafana’s built-in alert evaluation engine and notification system (also called unified alerting). Grafana evaluates queries against the data source on a schedule and delivers notifications directly; no external Alertmanager is required

Alert rule – A query plus a condition, an evaluation schedule, and a pending period. When the condition holds for the full pending period, the rule fires

Pending period – The length of time a condition must remain true before the alert transitions from Pending to Firing. This suppresses alerts on brief transient spikes

Contact point – A Grafana notification destination, such as a Slack channel or an email address, together with the credentials needed to deliver to it

Notification policy – The Grafana routing tree that matches alert labels (for example severity=critical) to contact points, and controls grouping and timing of notifications

Gauge / Counter – The two most common Prometheus metric types. A gauge is a value that can go up or down (disk used bytes); a counter only increases (total events written)

RecentMax – A KurrentDB-specific gauge type whose value is the maximum of recent measurements, sized by the ExpectedScrapeIntervalSeconds setting. It guarantees that short spikes (such as a garbage collection pause) are captured by at least one scrape

Google App Password – A 16-character credential generated by a Google account for use by applications that authenticate over SMTP. Gmail does not accept regular account passwords from third-party applications; an App Password (which requires 2-Step Verification) is mandatory for Grafana’s SMTP integration

Solution overview

The pipeline has four stages. Each KurrentDB node exposes metrics on its /metrics endpoint over HTTPS. Prometheus scrapes every node on a fixed interval and stores the resulting time series. Grafana evaluates the alert rules defined in this section against the Prometheus data source on a one-minute schedule. When a rule fires, Grafana’s notification policy routes it by severity label to the appropriate contact points: Slack for all alerts, and email (delivered through Gmail SMTP) for critical alerts

The same Prometheus data source powers the KurrentDB Cluster Summary and KurrentDB Panels dashboards, so the series your alert rules evaluate are exactly the series your operators see on the dashboards during an incident. The dashboards are maintained in the open-source repository at https://github.com/kurrent-io/Kurrent-Grafana

Slack and Gmail-based email are the worked examples in this section. Grafana contact points integrate with many other notification technologies, including Microsoft Teams, PagerDuty, Opsgenie, Discord, Telegram, and generic webhooks for anything else. The alert rules, labels, and notification policy are identical regardless of the delivery technology; only the contact point configuration in Steps 1 through 3 changes

One detail in this section intentionally differs from the dashboards. The dashboards’ single-stat disk and CPU panels divide using the PromQL scalar() function, which is only correct when a single instance is selected. The alert expressions here use label matching instead, so they evaluate correctly per node and per disk mount on multi-node clusters

Prerequisites

Confirm each of the following before you begin:

KurrentDB 26.1 is deployed, with the default metrics configuration. All metrics referenced in this section are enabled by default. Two metricsconfig.json values are worth verifying: ExpectedScrapeIntervalSeconds should match your Prometheus scrape interval (the default of 15 matches a 15-second scrape and sizes the measurement window of RecentMax metrics), and the QueueLabels section should group StorageReaderQueue under the Readers label, as the shipped configuration does
Prometheus is scraping the /metrics endpoint of every node, with the job name kurrentdb (the node-availability rule references this job label). A minimal scrape configuration is shown below
Grafana version 10 or later is running, the Prometheus data source is connected, and you have an account with Admin (or at minimum Editing and Alerting) permissions. The KurrentDB Cluster Summary and KurrentDB Panels dashboards are imported

scrape_configs:
  - job_name: kurrentdb
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets: ['node1:2113', 'node2:2113', 'node3:2113']

To verify that the required series are present, open the Explore view in Grafana (or the Prometheus expression browser), select the Prometheus data source, and confirm that each of the following queries returns data:

kurrentdb_sys_disk_bytes
kurrentdb_sys_mem_bytes
kurrentdb_proc_cpu
kurrentdb_gc_pause_duration_max_seconds
kurrentdb_queue_queueing_duration_max_seconds{name="Readers"}
kurrentdb_cache_hits_misses_total{cache="stream-info"}
kurrentdb_projection_progress
kurrentdb_projection_status
kurrentdb_persistent_sub_parked_messages
kurrentdb_incoming_grpc_calls_total
kurrentdb_elections_count_total
kurrentdb_checkpoints
up{job="kurrentdb"}

Persistent subscription series appear only once at least one persistent subscription exists; projection state size series appear only when a projection’s state exceeds half of the configured limit, so their absence is normal on a healthy cluster
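The same check can be scripted against the Prometheus HTTP API instead of pasting each query by hand. The following is a minimal sketch, not part of the original procedure: the address in PROMETHEUS_URL is a placeholder, and the helper names (fetch_json, series_count, missing_series) are hypothetical. It uses the standard /api/v1/query instant-query endpoint and reports which queries return no series.

```python
import json
import urllib.parse
import urllib.request

# Placeholder address (an assumption); replace with your Prometheus server.
PROMETHEUS_URL = "http://prometheus:9090"

# The verification queries listed in the text above.
REQUIRED_QUERIES = [
    "kurrentdb_sys_disk_bytes",
    "kurrentdb_sys_mem_bytes",
    "kurrentdb_proc_cpu",
    "kurrentdb_gc_pause_duration_max_seconds",
    'kurrentdb_queue_queueing_duration_max_seconds{name="Readers"}',
    'kurrentdb_cache_hits_misses_total{cache="stream-info"}',
    "kurrentdb_projection_progress",
    "kurrentdb_projection_status",
    "kurrentdb_persistent_sub_parked_messages",
    "kurrentdb_incoming_grpc_calls_total",
    "kurrentdb_elections_count_total",
    "kurrentdb_checkpoints",
    'up{job="kurrentdb"}',
]


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def series_count(query, base_url=PROMETHEUS_URL, fetch=fetch_json):
    """Return the number of series an instant query yields."""
    params = urllib.parse.urlencode({"query": query})
    body = fetch(base_url + "/api/v1/query?" + params)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return len(body["data"]["result"])


def missing_series(queries, base_url=PROMETHEUS_URL, fetch=fetch_json):
    """Return the queries that produced no series at all."""
    return [q for q in queries if series_count(q, base_url, fetch) == 0]
```

Calling missing_series(REQUIRED_QUERIES) against a live server returns the queries that need attention. As noted above, an empty result for the persistent subscription series is expected until a subscription exists.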

Step 1: Create a Slack bot token

Grafana’s Slack contact point accepts either an incoming webhook URL or a bot API token. The bot token is recommended: a single credential can post to multiple channels, and Grafana can update and thread its messages. The following steps create a Slack app and produce the token

Browse to https://api.slack.com/apps while signed in to the workspace that should receive alerts. Select Create New App, then From scratch. Name the app (for example, Grafana Alerts), select the workspace, and select Create App
In the app’s left sidebar, open OAuth & Permissions. Under Scopes, in the Bot Token Scopes section, select Add an OAuth Scope and add chat:write. To allow the app to post to public channels without being invited to each one, also add chat:write.public
Scroll to the top of OAuth & Permissions and select Install to Workspace, then Allow on the consent screen. Depending on workspace settings, a Slack administrator may need to approve the installation
After installation the page displays a Bot User OAuth Token beginning with xoxb-. Copy it; this value is entered into Grafana in Step 3
For every private channel that should receive alerts (and for public channels if chat:write.public was not granted), invite the bot from within Slack: type /invite @Grafana Alerts in the channel. A missing invitation is the most common cause of a not_in_channel error when testing the contact point

Treat the token as a password: it grants posting rights in the workspace. Store it in your secrets management system, and if it is ever exposed, regenerate it from the same OAuth & Permissions page
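Before entering the token into Grafana, it can be useful to confirm that it posts to the intended channel. The sketch below is one way to do that with Slack's chat.postMessage Web API method; the function names and the hint texts are illustrative assumptions, and the token and channel values are placeholders.

```python
import json
import urllib.request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token, channel, text):
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def explain_slack_response(body):
    """Turn Slack's JSON reply into a short, actionable message."""
    if body.get("ok"):
        return "delivered"
    error = body.get("error", "unknown_error")
    # Illustrative hints for a few common error codes.
    hints = {
        "not_in_channel": "invite the bot to the channel",
        "channel_not_found": "check the channel name or ID",
        "invalid_auth": "check or regenerate the bot token",
        "missing_scope": "add the chat:write scope and reinstall the app",
    }
    return f"{error}: {hints.get(error, 'see the Slack API documentation')}"


def post_test_message(token, channel, text="Grafana alert path test"):
    """Send one test message and describe the outcome."""
    request = build_slack_request(token, channel, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return explain_slack_response(json.load(response))
```

For example, post_test_message(token, "#kurrentdb-alerts") returns "delivered" on success. Read the token from your secrets store or an environment variable rather than placing it in the script.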

Step 2: Configure Grafana SMTP for Gmail

Gmail does not accept regular account passwords from third-party applications over SMTP. Grafana must authenticate with a Google App Password, which requires 2-Step Verification on the sending account

On the Google account that will send alert email, enable 2-Step Verification at https://myaccount.google.com/security if it is not already enabled. App Passwords are unavailable without it
Browse to https://myaccount.google.com/apppasswords, enter a name such as Grafana SMTP, and select Create. Google displays a 16-character password exactly once; copy it immediately. Google Workspace note: an administrator may need to allow App Passwords for the organization, and for higher email volumes Workspace administrators may prefer the Google SMTP relay service (smtp-relay.gmail.com); the steps below use the standard smtp.gmail.com route
Locate grafana.ini. On Linux package installations the path is /etc/grafana/grafana.ini; on standalone installations, edit custom.ini in the conf directory of the Grafana installation (never edit defaults.ini)
Find the [smtp] section, uncomment it, and set the values shown below

[smtp]
enabled = true
host = smtp.gmail.com:587
user = alerts@yourdomain.com
password = """abcd efgh ijkl mnop"""
from_address = alerts@yourdomain.com
from_name = Grafana KurrentDB Alerts
skip_verify = false
startTLS_policy = MandatoryStartTLS

The triple quotes around the password prevent the # and ; characters from being interpreted as comments. The spaces Google displays inside the App Password may be kept or removed; both forms authenticate. Port 587 with STARTTLS is the standard pairing; if outbound traffic to port 587 is blocked, port 465 (implicit TLS) is the fallback

If Grafana runs in Docker or Kubernetes, set the equivalent environment variables instead of editing the file, via the compose file or a Kubernetes Secret:

GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp.gmail.com:587
GF_SMTP_USER=alerts@yourdomain.com
GF_SMTP_PASSWORD=abcdefghijklmnop
GF_SMTP_FROM_ADDRESS=alerts@yourdomain.com

Restart Grafana to load the change: sudo systemctl restart grafana-server, or restart the container or pod
If a later contact point test fails, the Grafana server log (/var/log/grafana/grafana.log) records the SMTP error. A 535 Username and Password not accepted response means the App Password is wrong or organization policy is blocking it; a connection timeout means egress to port 587 is blocked
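To separate credential problems from Grafana configuration problems, the App Password can be exercised outside Grafana. The following is a minimal sketch using Python's standard smtplib, assuming the same host and port as the configuration above; the function names are hypothetical and the addresses are placeholders.

```python
import smtplib
import ssl
from email.message import EmailMessage


def normalize_app_password(displayed):
    """Remove the spaces Google shows between the four-character groups."""
    return displayed.replace(" ", "")


def build_test_message(sender, recipient):
    """Build a plain-text test email."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Grafana SMTP credential test"
    message.set_content("If you can read this, the App Password works over SMTP.")
    return message


def send_test_message(user, app_password, recipient,
                      host="smtp.gmail.com", port=587):
    """Send one message over STARTTLS on the submission port.

    smtplib raises SMTPAuthenticationError when the server rejects the
    credentials, which corresponds to the 535 response described above.
    """
    context = ssl.create_default_context()
    with smtplib.SMTP(host, port, timeout=15) as server:
        server.starttls(context=context)
        server.login(user, normalize_app_password(app_password))
        server.send_message(build_test_message(user, recipient))
```

If send_test_message succeeds from the Grafana host but the contact point test still fails, the problem is more likely in grafana.ini or the environment variables than in the Google account.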

Step 3: Create the contact points

Slack contact point

In Grafana’s left navigation, open Alerting, then Contact points, and select + Create contact point
Name it kurrentdb-slack. Choose the Slack integration
Paste the xoxb- bot token from Step 1 into the Token field, and enter the destination channel (for example #kurrentdb-alerts) in the Recipient field
Select Test and confirm the test message arrives in the channel, then select Save contact point

Email contact point

Select + Create contact point again. Name it kurrentdb-email and choose the Email integration
Enter the recipient address or distribution list (for example oncall@yourdomain.com). Multiple addresses are separated with semicolons
Select Test and confirm the message is delivered, then select Save contact point. If the test fails, review the SMTP troubleshooting guidance at the end of Step 2

Step 4: Configure the notification policy

The notification policy routes alerts by label. The alert rules in Step 5 attach a severity label of either warning or critical; the policy below sends everything to Slack and additionally sends critical alerts to email

Open Alerting, then Notification policies
Edit the Default policy and set its contact point to kurrentdb-slack. This is the catch-all: any alert that matches no more specific route is delivered to Slack
Select + New child policy (also labeled New nested policy in some versions). Add the matcher severity = critical, set the contact point to kurrentdb-email, and enable Continue matching subsequent sibling nodes. Then add a sibling policy directly below it with the same severity = critical matcher and the contact point kurrentdb-slack. Both routes are needed because the default policy's contact point applies only to alerts that match no child policy; with the email route alone, critical alerts would reach email but not Slack
Review the grouping settings on the default policy. Grouping by alertname and instance (the defaults plus instance) means one notification per condition per node rather than one per time series, which matters for the Gmail sending limits discussed under Operational considerations
Optionally, add a further child policy matching team = application routed to the channel or address owned by your application development team. Rule 11 (parked messages) attaches this label, because parked messages are application-owned failures, as the Persistent Subscription Parked Messages section explains
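The effect of the grouping setting reviewed in the steps above can be illustrated with a small model. This sketch is not Grafana code: group_alerts is a hypothetical helper that buckets firing alert instances by the chosen labels, and the alert names and label values are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Four firing alert instances (illustrative data only).
FIRING = [
    {"alertname": "Disk utilization", "instance": "node1:2113",
     "disk": "/var/lib/kurrentdb"},
    {"alertname": "Disk utilization", "instance": "node1:2113",
     "disk": "/var/lib/kurrentdb/index"},
    {"alertname": "Disk utilization", "instance": "node2:2113",
     "disk": "/var/lib/kurrentdb"},
    {"alertname": "CPU utilization", "instance": "node1:2113"},
]

per_node = group_alerts(FIRING, ["alertname", "instance"])
per_rule = group_alerts(FIRING, ["alertname"])

print(len(FIRING), "alert instances")            # 4
print(len(per_node), "groups by rule and node")  # 3
print(len(per_rule), "groups by rule only")      # 2
```

Each group produces one notification, so the two disk series on node1 arrive as a single message when grouping by alertname and instance.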

Step 5: Create the alert rules

Fifteen rules implement the recommendations from the metrics-based monitoring sections; several have two variants for different platforms or severity tiers. The table below summarizes them; the subsections that follow give the exact expression and settings for each, while the operational rationale is in the corresponding monitoring section of the previous parts.

General procedure

Repeat the following for each rule, substituting the values from the subsections below

Open Alerting, then Alert rules, and select + New alert rule
Enter the rule name
Under Define query and alert condition, select the Prometheus data source and paste the rule’s expression into query A; if the query editor opens in Builder mode, switch it to Code mode to paste the expression directly. Each expression contains its own comparison, so no separate Grafana threshold is needed; see the note on comparison semantics after these steps
In the expressions area beneath the query, Grafana adds a Reduce and a Threshold expression by default. Delete the Threshold expression and keep Reduce (function: Last), set as the alert condition. Alternatively, remove the comparison from the PromQL and express it in Grafana’s Threshold expression instead; both approaches are valid, but use one style consistently across all rules
Under Set evaluation behavior, place all rules in one folder (for example KurrentDB) and one evaluation group (for example kurrentdb-1m) evaluating every 1m. Set the pending period to the value listed for the rule
Under Labels, add severity with the listed value (and any additional labels the rule specifies). These labels drive the routing configured in Step 4
Under Annotations, add the listed summary. Label values interpolate with the {{ $labels.name }} syntax
Select Save rule and exit

A note on comparison semantics. PromQL comparisons come in two forms, and the rules in this section deliberately use both. The filter form (for example > 0.90) returns only the series where the condition holds, carrying the original sample value, and Grafana fires on any non-zero result. That works when the matched value is necessarily non-zero, but fails when it can be 0: kurrentdb_projection_running == 0 matches with a sample value of 0, which the alert engine treats as not firing, and kurrentdb_projection_progress < 0.99 returns the progress value itself, which is 0 for a fully stalled projection – the worst case is exactly the one that cannot fire. The bool modifier (== bool 0, < bool 0.99) instead returns every series, with value 1 where the condition holds and 0 where it does not, so the rule fires reliably and a healthy system reports a true Normal state with data. Rules 6, 7, 8, and 13 use bool for this reason. When writing your own variants, use bool for any comparison against zero or a lower bound
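The difference between the two comparison forms can be made concrete with a toy model. The sketch below is a simplification, not PromQL or Grafana internals: it represents an instant vector as a dictionary, the function names are hypothetical, and fires() encodes the "any non-zero value" condition described in the note above.

```python
def filter_compare(series, predicate):
    """Filter form: keep only matching series, with their original values."""
    return {name: value for name, value in series.items() if predicate(value)}


def bool_compare(series, predicate):
    """bool form: keep every series, with 1 where the condition holds, else 0."""
    return {name: (1.0 if predicate(value) else 0.0)
            for name, value in series.items()}


def fires(result):
    """Simplified firing condition: at least one non-zero value."""
    return any(value != 0 for value in result.values())


# Projection progress on a 0-to-1 scale (illustrative values).
progress = {"stalled-projection": 0.0, "healthy-projection": 1.0}


def below_99_percent(value):
    return value < 0.99


filtered = filter_compare(progress, below_99_percent)
as_bool = bool_compare(progress, below_99_percent)

print(filtered, fires(filtered))  # {'stalled-projection': 0.0} False
print(as_bool, fires(as_bool))    # {'stalled-projection': 1.0, 'healthy-projection': 0.0} True
```

The filter form does match the stalled projection, but the value it carries is 0, so nothing fires. The bool form converts the match into 1, and the healthy series remains visible with a value of 0.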

Rule 1: Disk utilization

kurrentdb_sys_disk_bytes{kind="used"}
/ ignoring(kind) kurrentdb_sys_disk_bytes{kind="total"} > 0.90

Pending period: 5m. Severity: critical. Summary: Disk {{ $labels.disk }} on {{ $labels.instance }} is over 90% full

The metric carries one series per mount point (the disk label) and per node, so this rule fires independently for each mount on each node; the ignoring(kind) clause matches the used series to its corresponding total series. KurrentDB cannot write new chunks or complete scavenges if the data disk fills; treat this alert as a page. If your index is on its own mount, as the Disk Utilization section recommends, add this variant for the tighter index headroom requirement, substituting your index mount point:

kurrentdb_sys_disk_bytes{kind="used", disk="/var/lib/kurrentdb/index"}
/ ignoring(kind) kurrentdb_sys_disk_bytes{kind="total", disk="/var/lib/kurrentdb/index"} > 0.40

Pending period: 5m. Severity: warning. Summary: Index disk on {{ $labels.instance }} is over 40% full; index merges need headroom
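To see why label matching evaluates correctly per node and per mount, the following sketch imitates what the ignoring(kind) division does: it pairs each used sample with the total sample whose remaining labels are identical. This is an illustration with made-up byte counts, not the PromQL engine; ratio_ignoring_kind is a hypothetical helper.

```python
def ratio_ignoring_kind(samples):
    """Divide each 'used' sample by the 'total' sample with the same other labels.

    samples is a list of (labels, value) pairs. Pairs with a zero total are
    skipped here for simplicity.
    """
    used, total = {}, {}
    for labels, value in samples:
        # The matching key is every label except 'kind'.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != "kind"))
        if labels["kind"] == "used":
            used[key] = value
        elif labels["kind"] == "total":
            total[key] = value
    return {key: used[key] / total[key]
            for key in used if key in total and total[key] > 0}


# Two nodes with one data mount each (illustrative values, in bytes).
SAMPLES = [
    ({"instance": "node1:2113", "disk": "/var/lib/kurrentdb", "kind": "used"}, 475e9),
    ({"instance": "node1:2113", "disk": "/var/lib/kurrentdb", "kind": "total"}, 500e9),
    ({"instance": "node2:2113", "disk": "/var/lib/kurrentdb", "kind": "used"}, 200e9),
    ({"instance": "node2:2113", "disk": "/var/lib/kurrentdb", "kind": "total"}, 500e9),
]

ratios = ratio_ignoring_kind(SAMPLES)
breaching = {key: ratio for key, ratio in ratios.items() if ratio > 0.90}
for key, ratio in breaching.items():
    print(dict(key), f"{ratio:.0%}")
```

Only node1 exceeds the 90% threshold in this data, and the result keeps the instance and disk labels that the summary annotation interpolates.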

Rule 2: Memory utilization above 85%

1 - (kurrentdb_sys_mem_bytes{kind="free"}
/ ignoring(kind) kurrentdb_sys_mem_bytes{kind="total"}) > 0.85

Pending period: 10m. Severity: warning. Summary: Memory on {{ $labels.instance }} above 85% for 10 minutes

The 85% threshold and baselining guidance are covered in the Memory Utilization section

Rule 3: CPU utilization above 80%

On Linux and other Unix-like systems, use the process CPU gauge:

kurrentdb_proc_cpu > 80

On Windows, use the system-wide gauge (the same series used by the CPU panel on the Cluster Summary dashboard) instead:

kurrentdb_sys_cpu > 80

Pending period: 10m. Severity: warning. Summary: CPU on {{ $labels.instance }} above 80% for 10 minutes

Create whichever variant matches your platform (both, in mixed environments). The 10-minute pending period prevents pages on short bursts such as scavenges or index merges
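How a pending period filters out short bursts can be shown with a small state model. This is a simplified sketch of the Normal, Pending, and Firing progression, assuming one evaluation per minute as in the evaluation group above; alert_states is a hypothetical function, not a Grafana API.

```python
def alert_states(condition_by_eval, pending_evals):
    """Return the alert state after each evaluation.

    condition_by_eval: one boolean per evaluation (True = threshold breached).
    pending_evals: pending period expressed in evaluation intervals, for
    example 10 for a 10m pending period evaluated every 1m.
    """
    states = []
    consecutive = 0
    for breached in condition_by_eval:
        consecutive = consecutive + 1 if breached else 0
        if consecutive == 0:
            states.append("Normal")
        elif consecutive > pending_evals:
            # The condition has now held for the full pending period.
            states.append("Firing")
        else:
            states.append("Pending")
    return states


# A three-minute CPU burst never leaves Pending.
print(alert_states([True, True, True, False], pending_evals=10))
# ['Pending', 'Pending', 'Pending', 'Normal']

# Sustained load starts firing ten minutes after the first breach.
sustained = alert_states([True] * 12, pending_evals=10)
print(sustained[9], sustained[10])  # Pending Firing
```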

Rule 4: Garbage collection pause above threshold

max by(instance) (kurrentdb_gc_pause_duration_max_seconds) > 8.0

Pending period: 5m. Severity: warning. Summary: GC pause on {{ $labels.instance }} exceeded 8 s

The metric is a RecentMax gauge labeled with a range (for example 16-20 seconds); the max by(instance) aggregation collapses that label so the rule evaluates one value per node. The 8-second threshold leaves margin under the default heartbeat timeout; the Garbage Collection Pauses section explains the election risk behind that ceiling and the re-baselining guidance for 26.x ServerGC

Rule 5: Reader queue duration

max by(instance) (kurrentdb_queue_queueing_duration_max_seconds{name="Readers"}) > 0.5

Pending period: 10m. Severity: warning. Summary: Reader queue on {{ $labels.instance }} exceeded 0.5 s for 10 minutes

The Readers label comes from the QueueLabels grouping in the default metricsconfig.json; the same pattern can be cloned for the other queue groups (workers, projections, subscriptions), keeping a steady ceiling per group. The IOPS section explains how growing reader queues indicate IOPS exhaustion and the remediation

Rule 6: Cache hit ratio below 80%

rate(kurrentdb_cache_hits_misses_total{cache="stream-info",kind="hits"}[10m])
/ ignoring(kind) (rate(kurrentdb_cache_hits_misses_total{cache="stream-info",kind="hits"}[10m])
+ ignoring(kind) rate(kurrentdb_cache_hits_misses_total{cache="stream-info",kind="misses"}[10m])) < bool 0.80

Pending period: 15m. Severity: warning. Summary: Stream-info cache hit ratio on {{ $labels.instance }} below 80%

Tuning guidance, including the StreamInfoCacheCapacity parameter and its memory and GC trade-off, is in the Cache Hit Ratio section
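The arithmetic behind the hit-ratio expression can be worked through with two counter readings taken ten minutes apart. The sketch below is a simplification with made-up values: unlike the real rate() function, it does not handle counter resets or extrapolation, and the function names are hypothetical.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter across the window."""
    return (last - first) / window_seconds


def hit_ratio(hits_first, hits_last, misses_first, misses_last,
              window_seconds=600):
    """Return hits / (hits + misses) over the window, or None with no traffic."""
    hits = per_second_rate(hits_first, hits_last, window_seconds)
    misses = per_second_rate(misses_first, misses_last, window_seconds)
    total = hits + misses
    if total == 0:
        # No cache lookups in the window, so the ratio is undefined.
        return None
    return hits / total


# 45,000 hits and 15,000 misses in ten minutes: 75/s and 25/s.
ratio = hit_ratio(1_000_000, 1_045_000, 200_000, 215_000)
print(ratio)         # 0.75
print(ratio < 0.80)  # True
```

A ratio of 0.75 is below the 0.80 threshold, so the bool comparison would return 1 for this node.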

Rule 7: Projection progress below 99%

kurrentdb_projection_progress < bool 0.99

Pending period: 10m. Severity: warning. Summary: Projection {{ $labels.projection }} on {{ $labels.instance }} below 99% progress

The gauge is scaled 0 to 1, so the 99% threshold is 0.99. Progress legitimately dips during catch-up after a node restart or a burst of writes; the 10-minute pending period absorbs normal catch-up so the rule fires only on sustained lag. The large-database precision caveat in the Projection Progress section means this rule is necessary but not sufficient, which is one reason Rule 8 exists alongside it

Rule 8: Projection in a state other than Running

The projection status metric exposes one series per state (Running, Faulted, Stopped), with the series for the current state set to 1. Two rules implement the Faulted-versus-Stopped distinction described in the Stopped Projections section: a fast, critical page on Faulted, and a slower warning when a projection is simply not running

Rule 8a, projection Faulted:

kurrentdb_projection_status{status="Faulted"} == bool 1

Pending period: 1m. Severity: critical. Summary: Projection {{ $labels.projection }} on {{ $labels.instance }} is Faulted

Rule 8b, projection not in the Running state:

kurrentdb_projection_running == bool 0

Pending period: 5m. Severity: warning. Summary: Projection {{ $labels.projection }} on {{ $labels.instance }} is not Running

Projection metrics are emitted by the node currently running the projection subsystem, so these series move between instances after a failover. Do not pin projection rules to a specific instance

Rule 9: Projection state size approaching the limit

The state-size series appears only once a projection’s state exceeds half of the configured limit, so the mere presence of the series is the warning condition:

kurrentdb_projection_state_size > 0

Pending period: 5m. Severity: warning. Summary: Projection {{ $labels.projection }} state exceeds 50% of the size limit

Add a critical tier when state passes 75% of the configured limit:

kurrentdb_projection_state_size
/ on() group_left max(kurrentdb_projection_state_size_bound{bound="LIMIT"}) > 0.75

Pending period: 5m. Severity: critical. Summary: Projection {{ $labels.projection }} state exceeds 75% of the size limit

The limit, the consequences of exceeding it, and the remediation are covered in the Projection State Size section

Rule 10: Persistent subscription lag

(kurrentdb_persistent_sub_last_known_event_number
- kurrentdb_persistent_sub_checkpointed_event_number) > 1000

Pending period: 10m. Severity: warning. Summary: Subscription {{ $labels.group_name }} on {{ $labels.event_stream_id }} is more than 1,000 events behind

This expression covers subscriptions on named streams; for subscriptions on the $all stream, use the kurrentdb_persistent_sub_last_known_event_commit_position and kurrentdb_persistent_sub_checkpointed_event_commit_position pair instead, with a much larger threshold since those values are byte positions in the transaction log rather than event numbers. The 1,000-event threshold is a starting point; calibrate it against the subscription’s normal throughput

Rule 11: Parked messages

kurrentdb_persistent_sub_parked_messages > 0

Pending period: 15m. Severity: warning. Summary: Subscription {{ $labels.group_name }} on {{ $labels.event_stream_id }} has parked messages

A companion rule on the age of the oldest parked message catches the parked-and-forgotten case directly:

kurrentdb_persistent_sub_oldest_parked_message_seconds > 3600

Pending period: 15m. Severity: warning. Summary: Oldest parked message on {{ $labels.group_name }} is over an hour old

Add the label team = application to both rules, because parked messages are application-owned failures, as the Persistent Subscription Parked Messages section explains. The 15-minute pending period is deliberate: a single retried message should not page anyone, but a parked count that persists means nobody is replaying them

Rule 12: Failed gRPC calls

rate(kurrentdb_incoming_grpc_calls_total{kind="failed"}[5m]) > 0.1

Pending period: 5m. Severity: warning. Summary: gRPC failures on {{ $labels.instance }} above 0.1 per second

Transient client disconnects produce occasional failures, so a small non-zero rate floor prevents flapping. A companion rule on kind="deadline-exceeded" with the same shape is worth adding; the Failed gRPC Calls section explains what that failure kind indicates

Rule 13: Node down

up{job="kurrentdb"} == bool 0

Pending period: 2m. Severity: critical. Summary: KurrentDB node {{ $labels.instance }} is not responding to scrapes

The up series is generated by Prometheus itself for every scrape target, so this rule works even when the node exposes no metrics at all. Together with Rule 14, it covers the node-availability expectation described in the Node Status section

Rule 14: Election occurred

increase(kurrentdb_elections_count_total[15m]) > 0

Pending period: 0m. Severity: warning. Summary: A leader election occurred on the cluster in the last 15 minutes

Elections are expected during deployments and planned failovers, and unexpected the rest of the time. An unexpected election could be a symptom of several other conditions: long garbage collection pauses (Rule 4), an unresponsive node (Rule 13), or network problems between nodes. The zero pending period is intentional; the counter increase is already a discrete, completed event

Rule 15: Replication lag

max(kurrentdb_checkpoints{name="writer"})
- on() group_right kurrentdb_checkpoints{name="replication"} > 100000000

Pending period: 15m. Severity: warning. Summary: Follower {{ $labels.instance }} is more than 100 MB behind the leader

The expression subtracts each node’s replication checkpoint from the cluster-wide maximum writer checkpoint, yielding the byte distance each follower trails the leader. The on() group_right clause is required: max() produces a single series with no labels, so without it the subtraction cannot vector-match the per-node replication checkpoints and the rule returns no data – the same one-to-many matching pattern Rule 9’s critical tier uses with group_left. The threshold is bytes of transaction log, so set it relative to your write rate (100 MB shown here), and note the legitimate-lag cases described in the Replication Lag section; the 15-minute pending period means the rule fires on lag that persists or keeps growing rather than on a transient dip
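The one-to-many subtraction can be illustrated with plain dictionaries. The sketch below uses made-up checkpoint positions and a hypothetical helper name; it mirrors the shape of the expression (one cluster-wide maximum minus each node's replication checkpoint) rather than reproducing PromQL evaluation.

```python
THRESHOLD_BYTES = 100_000_000  # the 100 MB starting point used by the rule


def replication_lag(writer_checkpoints, replication_checkpoints):
    """Return each node's byte distance behind the highest writer checkpoint."""
    leader_position = max(writer_checkpoints.values())
    return {node: leader_position - position
            for node, position in replication_checkpoints.items()}


# Illustrative positions in bytes of transaction log.
writer = {"node1:2113": 5_000_000_000,
          "node2:2113": 4_999_000_000,
          "node3:2113": 4_850_000_000}
replication = {"node1:2113": 5_000_000_000,
               "node2:2113": 4_999_000_000,
               "node3:2113": 4_850_000_000}

lag = replication_lag(writer, replication)
lagging = {node: gap for node, gap in lag.items() if gap > THRESHOLD_BYTES}
print(lagging)  # {'node3:2113': 150000000}
```

The result keeps one entry per node, which is what lets the summary annotation name the follower that is behind.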

Step 6: Verify the configuration

Open Alerting, then Alert rules, and expect two different healthy states. Rules whose comparison filters inside the query (most of them, for example the disk rule’s > 0.90) return no series at all while the system is healthy, so Grafana reports No data for every one of them: on each such rule, set Alert state if no data or all values are null to Normal. The bool-modified rules (6, 7, 8, and 13) genuinely return 0-valued series while healthy, so they display Normal with data; leave their no-data handling at the default, because for them a disappearing series is itself a meaningful signal
Rule 13 (Node Down) is what makes the NoData-to-Normal setting safe on the filter-style rules: written with bool, it is the backstop that fires when a target vanishes from scraping entirely, so a silently disappearing series elsewhere cannot mask a dead node
Prove the pipeline end to end: edit one low-risk rule and temporarily lower its threshold so it must fire (for example, change the disk rule’s comparison to > 0.01). Wait out the pending period and confirm the notification arrives in Slack; for a rule labeled critical, confirm the email arrives as well
Restore the original threshold and confirm the alert resolves. If the contact point’s Send resolved option is enabled, a resolution notification is delivered
Open the KurrentDB Panels dashboard alongside a firing test alert and confirm operators can pivot from the notification to the corresponding panel: the GC pause, queue duration, cache hit ratio, projection, persistent subscription, and replication panels chart the same series these rules evaluate

Operational considerations

Gmail sending limits. Consumer Gmail accounts are limited to roughly 500 outbound messages per day, and Google Workspace accounts to roughly 2,000. Configure grouping in the notification policy (Step 4) so that an incident produces one grouped email rather than one message per firing series; a flapping condition across a multi-node cluster can exhaust the daily quota quickly without grouping

App Password lifecycle. Google revokes App Passwords automatically when the account password changes or 2-Step Verification is reset. Email alerting can therefore stop silently after a routine account security event. The Slack contact point provides a parallel notification channel that covers this gap, which is a strong reason to configure both

Credential hygiene. Both the Slack bot token and the Google App Password should live in your secrets management system. In containerized deployments, supply them through environment variables sourced from secrets rather than baking them into images or committed configuration

Application-owned alerts. The parked-message rules carry the team = application label so the notification policy can route them to the team that owns subscriber logic. Keeping the routing aligned with ownership shortens time to resolution and keeps the database on-call rotation focused on conditions it can act on

Version control. Grafana-managed alert rules are stored in Grafana’s database. Once tuned, export them (Alerting, Alert rules, Export, or the provisioning API) as provisioning YAML and commit the file to version control alongside your dashboard JSON. This restores the auditable, reviewable artifact you would otherwise have with Prometheus-native rule files
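The export can also be scripted so that the committed file is refreshed on a schedule or in CI. The following is a hedged sketch: it assumes the alert-rule export endpoint of Grafana's alerting provisioning HTTP API at /api/v1/provisioning/alert-rules/export with a format query parameter, which should be confirmed against the documentation for your Grafana version. The function names, the Grafana URL, and the token are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed export path; verify against your Grafana version's API reference.
EXPORT_PATH = "/api/v1/provisioning/alert-rules/export"


def build_export_request(grafana_url, token, export_format="yaml"):
    """Build (but do not send) a request for all alert rules in file format."""
    query = urllib.parse.urlencode({"format": export_format})
    url = grafana_url.rstrip("/") + EXPORT_PATH + "?" + query
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})


def export_alert_rules(grafana_url, token, destination):
    """Download the export and write it to destination; return bytes written."""
    request = build_export_request(grafana_url, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    with open(destination, "wb") as handle:
        handle.write(body)
    return len(body)
```

A call such as export_alert_rules("https://grafana.example.com", token, "kurrentdb-alert-rules.yaml") writes the file to commit. The token would typically belong to a Grafana service account and be read from your secrets store.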

Threshold tuning. The thresholds in this section are starting points. Review them against two weeks of production telemetry on the Kurrent dashboards: in particular, baseline the GC pause maximum, queue durations, subscription lag, and replication lag for your workload, and tighten or relax the pending periods to match your operational tolerance for paging

Panel-linked alerts. As an alternative to free-standing rules, an alert can be created from the Alert tab of any dashboard panel; Grafana pre-fills the panel’s query and annotates the panel when the alert fires. The evaluation engine is identical. Some teams prefer this style because the alert and its visualization stay in sync; the free-standing style used in this section keeps all rules in one folder for easier export

Conclusion

In this part, we turned the monitoring recommendations from earlier in the series into a working alerting pipeline. By creating Slack and email contact points, routing alerts by severity through a notification policy, and implementing the fifteen PromQL alert rules covering everything from disk and CPU utilization to projection health, persistent subscription lag, node availability, and replication lag, you give your operators a way to learn about problems the moment they emerge rather than the moment a user reports them. The supporting practices (handling no-data states correctly, grouping notifications to stay within Gmail's sending limits, keeping credentials in a secrets manager, and exporting your tuned rules to version control) will keep that pipeline dependable and auditable over the long term.

Proactive alerting is the natural complement to proactive monitoring: the dashboards show you what is happening, and these rules make sure you find out about it in time to act.
