Monitoring the Health of KurrentDB Clusters Part 2

A healthy KurrentDB cluster is vital for the performance and reliability of your systems. Proactive monitoring helps you detect and resolve potential problems before they escalate into outages.
This two-part series will guide you through the essential aspects of monitoring a KurrentDB cluster. In the previous part, we dived into metrics-based monitoring, covering key performance indicators from IOPS and memory to CPU and disk utilization. In this second and final part, we’ll explore log-based monitoring and share general tips for maintaining cluster health.
Note that this guide is based on KurrentDB Server version 25.0. For the full and latest version of this guide, please visit the Monitoring Best Practices page in the Kurrent documentation.
Log-based Monitoring
The following items may appear in the KurrentDB log files and are worth monitoring.
Projection State Size
The maximum projection state size is 16 MB. Projection states exceeding this size will fail to checkpoint and enter a faulted state. Once the projection state reaches 8 MB in size - 50% of the limit - the server will begin logging messages such as the following:
"messageTmpl": "Checkpoint size for the Projection {projectionName} is greater than 8 MB. Checkpoint size for a projection should be less than 16 MB. Current checkpoint size for Projection {projectionName} is {stateSize} MB."
Customers should alert on this log message and reduce the size of the projection state as quickly as possible to avoid exceeding the maximum. In newer versions of KurrentDB, this can also be monitored through metrics.
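As an illustration of alerting on this message, here is a minimal Python sketch that scans JSON log lines for the checkpoint size warning. It assumes each log line is a single JSON object, that the template is stored under "messageTmpl", and that "projectionName" and "stateSize" appear as top-level properties (as "elapsed" and "spaceSaved" do in the scavenge sample later in this article). The 75% escalation threshold and the function name are hypothetical choices, not part of KurrentDB.

```python
import json

MAX_STATE_MB = 16          # limit stated in this guide
ESCALATE_FRACTION = 0.75   # hypothetical: escalate at 75% of the limit


def projection_size_alerts(lines):
    """Yield (projection_name, size_mb, severity) for checkpoint size warnings."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue
        template = event.get("messageTmpl", "")
        if not template.startswith("Checkpoint size for the Projection"):
            continue
        size = event.get("stateSize")  # assumed property name
        if size is None:
            continue
        size_mb = float(size)
        escalate = size_mb >= MAX_STATE_MB * ESCALATE_FRACTION
        yield event.get("projectionName"), size_mb, "critical" if escalate else "warning"
```

Feeding the generator from a tailed log file and forwarding each tuple to an alerting system is left to the surrounding tooling.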
Long Index Merge
While not directly a cause for concern, organizations may wish to monitor the duration of index merges to ensure their regular maintenance scripts are completing during scheduled windows, and are not impacting performance of their solutions. Index merge times are logged with messages such as the following:
"PTables merge finished in 15:32:13.5846974 ([128000257, 128000194, 128000135, 128000181, 128000211, 46303330995] entries merged into 46943331973)."
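To compare merge times against a maintenance window, the duration has to be extracted from the message text. The sketch below assumes the duration follows the [d.]hh:mm:ss[.fffffff] layout seen in the sample above; the function name is illustrative only.

```python
import re
from datetime import timedelta

# Optional day part, then hh:mm:ss, then an optional fractional part.
MERGE_RE = re.compile(
    r"PTables merge finished in (?:(\d+)\.)?(\d{1,2}):(\d{2}):(\d{2})(\.\d+)?"
)


def merge_duration(message):
    """Return the merge duration as a timedelta, or None if not a merge message."""
    match = MERGE_RE.search(message)
    if match is None:
        return None
    days, hours, minutes, seconds, fraction = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds) + float(fraction or 0),
    )
```

For the sample message above, this gives a duration of roughly 15.5 hours, which could then be compared against the length of a scheduled window.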
Cluster Node Version Mismatch
Except during rolling upgrades, cluster nodes should be running the exact same version of KurrentDB. If they are not, the servers will log a warning about this condition, which should be corrected, with messages such as the following:
{
"@t": "2024-04-08T12:15:53.1465960+01:00",
"@mt": "MULTIPLE ES VERSIONS ON CLUSTER NODES FOUND [ (Unspecified/node0.acme.com:2112,24.2.0), (Unspecified/node4.acme.com:2112,23.10.0.0), (Unspecified/node3.acme.com:2112,23.10.0.0), (Unspecified/node2.acme.com:2112,23.10.0.0), (Unspecified/node1.acme.com:2112,23.10.0.0) ]",
"@l": "Warning",
"@i": 498346887,
"SourceContext": "EventStore.Core.Services.Gossip.ClusterMultipleVersionsLogger",
"ProcessId": 77265,
"ThreadId": 17
}
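When this warning fires, it helps to see at a glance which nodes are on which version. The following sketch parses the "@mt" text from the sample above and groups nodes by version. The regular expression is derived only from that sample's layout, so treat it as an assumption that may need adjusting for other versions.

```python
import re

# Matches entries such as "(Unspecified/node0.acme.com:2112,24.2.0)".
NODE_RE = re.compile(r"\(\s*[^/()]+/([^,()]+),\s*([^()\s]+)\s*\)")


def nodes_by_version(message):
    """Map each reported version to the list of nodes running it."""
    groups = {}
    for match in NODE_RE.finditer(message):
        node, version = match.group(1), match.group(2)
        groups.setdefault(version, []).append(node)
    return groups
```

A result with more than one key outside of a planned rolling upgrade is the situation to alert on.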
Certificate Expiration Warnings
When SSL certificates are expiring soon, cluster nodes will log a warning. Organizations should alert on this warning to ensure that certificates are renewed before expiration, avoiding communication failures and cluster-down scenarios. Messages such as the following are logged:
{
"@t": "2024-10-13T02:43:38.9403418+00:00",
"@mt": "Certificates are going to expire in {daysUntilExpiry:N1} days",
"@r": [
"9.6"
],
"@l": "Warning",
"@i": 40508766706,
"daysUntilExpiry": 9.602558584001157,
"SourceContext": "EventStore.Core.ClusterVNode",
"ProcessId": 88375,
"ThreadId": 21
}
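A simple way to act on this warning is to read "daysUntilExpiry" from the event and escalate as the date approaches. The sketch below assumes one JSON object per log line with the fields shown in the sample above. The 7-day paging threshold and both function names are hypothetical.

```python
import json


def certificate_expiry_days(line):
    """Return days until expiry from a certificate warning line, else None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if not event.get("@mt", "").startswith("Certificates are going to expire"):
        return None
    days = event.get("daysUntilExpiry")
    return float(days) if days is not None else None


def expiry_severity(days, page_below=7.0):
    """Hypothetical policy: page when under `page_below` days, otherwise open a ticket."""
    return "page" if days < page_below else "ticket"
```

With the sample event above (about 9.6 days remaining), this policy would open a ticket rather than page.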
General Health Tips
Below are some general health tips.
Scavenge Regularly
Scavenging removes deleted events and streams, and should be done regularly. However, neither a missed scavenge nor a scavenge that removes little data is, by itself, an indicator of poor cluster health. Log entries report how much space a scavenge reclaimed, and this value can in fact be negative if the scavenge did not remove a large number of events.
{
"@host": "cbcns65o0aek8srtnlsg-1.mesdb.eventstore.cloud",
"@i": 211285875,
"ProcessId": 3870715,
"ScavengeId": "f8f18dd4-2db7-4785-ac30-03c5fa30c8d7",
"SourceContext": "EventStore.Core.Services.Storage.StorageScavenger",
"ThreadId": 54,
"elapsed": "40.08:40:28.6613320",
"messageTmpl": "SCAVENGING: Scavenge Completed. Total time taken: {elapsed}. Total space saved: {spaceSaved}.",
"spaceSaved": -11710416
}
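Because a negative "spaceSaved" is not an error, any tooling that reads this event should report it rather than alarm on it. Here is a minimal sketch, assuming one JSON object per line with the fields shown above. The function name and the returned dictionary keys are illustrative, and the unit of "spaceSaved" is not stated in the log, so the value is passed through unchanged.

```python
import json


def scavenge_summary(line):
    """Summarize a 'Scavenge Completed' event, or return None for other lines."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if "Scavenge Completed" not in event.get("messageTmpl", ""):
        return None
    saved = event.get("spaceSaved")
    return {
        "scavenge_id": event.get("ScavengeId"),
        "elapsed": event.get("elapsed"),  # kept as the raw string from the log
        "space_saved": saved,
        # A non-positive value is informational only, not a failure.
        "reclaimed_space": saved is not None and saved > 0,
    }
```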
Install Patch Versions
Kurrent releases patch versions from time to time that contain important fixes. It is strongly recommended to keep up to date with the most recent patch version.
Few Errors / Warnings
During normal operation, KurrentDB produces few errors or warnings, so logs should generally be clean. Customers may wish to monitor for spikes in the rate of errors and warnings emitted into the log files.
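One way to watch for such spikes is to count warning and error events per minute and compare each bucket against a baseline. The sketch below assumes JSON log lines that carry the level in "@l" and an ISO 8601 timestamp in "@t", as in the samples earlier in this article. Events without an "@l" field are ignored. The threshold of 10 per minute is a hypothetical value that would need tuning per cluster.

```python
import json
from collections import Counter

NOISY_LEVELS = ("Warning", "Error", "Fatal")


def level_counts_per_minute(lines):
    """Count Warning/Error/Fatal events per (minute, level) bucket."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        level = event.get("@l")
        timestamp = event.get("@t", "")
        if level in NOISY_LEVELS and len(timestamp) >= 16:
            # "2024-04-08T12:15:53..." -> "2024-04-08T12:15"
            counts[(timestamp[:16], level)] += 1
    return counts


def spikes(counts, threshold=10):
    """Return the buckets whose count meets or exceeds the (hypothetical) threshold."""
    return {bucket: n for bucket, n in counts.items() if n >= threshold}
```

Bucketing on the timestamp prefix keeps the sketch short. It assumes all nodes log with the same UTC offset, so a real pipeline would parse and normalize timestamps first.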
KurrentDB Health Endpoint
When a KurrentDB node is alive and healthy, it returns a response document at the following URL: https://localhost:2113/health/live?liveCode=200
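A basic liveness probe can poll this URL and treat anything other than the expected status as unhealthy. The sketch below uses only the Python standard library. It assumes that the status code to expect matches the "liveCode" value in the URL, which is an inference from the query string rather than something this guide states. It also uses default TLS verification, so a node with a self-signed certificate would be reported as not live unless the CA is trusted on the probing host.

```python
import http.client
import urllib.request


def node_is_live(url, expected_status=200, timeout=5.0):
    """Return True if a GET on `url` answers with `expected_status`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except (OSError, http.client.HTTPException):
        # Covers connection failures, timeouts, TLS errors, and HTTP error statuses.
        return False


if __name__ == "__main__":
    print(node_is_live("https://localhost:2113/health/live?liveCode=200"))
```

Running the probe against every node, rather than through a load balancer, makes it easier to tell which node is unhealthy.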
Conclusion
In this second part of our series, we’ve explored log-based monitoring and general health tips for maintaining a robust KurrentDB cluster. By keeping an eye on critical log entries such as projection state sizes, index merge durations, version mismatches, and certificate expirations, you can proactively address potential issues before they impact your services. Additionally, following best practices like regular scavenging, staying updated with patch versions, and ensuring clean logs will contribute to the overall health and performance of your KurrentDB cluster.
Further Reading
Please see the following Kurrent Academy resources for more information