Event Threshold for Seq - A super powered threshold monitoring app for Seq
"Tell us when events fall below a threshold, between X and Y times..."
Seq has an innate ability to alert based on a simple count of events in a signal, using dashboard widgets with alerts. We use that for a number of alerts, including detection of possible upstream outages - when we receive less traffic (measured in log entries) than normal from upstream over a given interval, we can send an alert that there's a problem.
The challenge comes when you want to measure only between specific times. For example, you have logs that you can derive a count of files transferred from, and you want to measure that a scheduled transfer between 4:00am and 4:30am had at least 100 files, and alert if it falls below that.
Which - of course - was a requirement that arose. Our logging, monitoring, and alerting is now so robust that the business wants more. "Can Seq tell us if ..." is reasonably common - and the answer is generally yes. There's a massive amount of information that can be derived from Seq, even from logs that are less than ideal in structure. The question is always - how can I do this within the current features and capabilities? If you come up short, you're probably going to need an app.
In saying that, the Seq Reporter console app goes quite a way to making these kinds of requests fairly mundane. A scheduled report is certainly an option to fill many requirements ... but it's not structured to provide alerts based on counting logs over defined intervals. In that instance, you probably need a Seq app.
Requirements
So, to outline the requirements that arose:
- Monitor for specific events and count them over an interval
- Configurable start and end time
- Configurable suppression interval before another threshold error alert can be raised
- Allow days of week, days of month, and public holidays to be configured, for fine-grained control over when thresholds should be monitored
- Configurable alert message and description
- Configurable alert level - Verbose, Debug, Information, Warning, Error, or Fatal
- Add tags to use in alerting
- Add priority level and responders for scenarios such as OpsGenie alerting
The list of requirements starts to look a lot like Event Timeout. The major difference is that Event Timeout is primarily structured to look for events that did not happen at all. While it maintains counts of events, this is based on making a positive match; if that match doesn't occur, raise an alert.
Event Timeout is a powerful app with a lot of configurability, but I wanted this to be a separate entity with similar features. The logic in figuring out that an event didn't happen and then alerting, versus counting events and alerting if it's under (or over) a threshold, is different enough that I didn't want to try to shoehorn yet another feature into Event Timeout.
So the answer was, firstly, to use the Event Timeout app as our basic structure, and adapt the settings and logic where necessary.
Seq.App.EventThreshold
Event Threshold is the result of that adaptation. It benefits quite considerably from the Event Timeout implementation, as you might see from the feature list:
- Configurable start and end time
- Invert the threshold (eg. make it Greater Than or Equal To)
- Threshold configuration (default 100)
- Optionally log the event count at end of each interval
- Configurable threshold measuring interval
- Configurable suppression interval for alerts
- Configurable log level for threshold violations
- Configurable priority, responders, tags
- Day of Week, Include Day(s) of Month, Exclude Day(s) of Month, Public Holidays
- Configurable property matching for up to 4 properties
- Configurable alert message and description
Each feature of Event Timeout was evaluated for benefit to Event Threshold. It's useful to be able to configure properties to evaluate towards the threshold count - so the Property 1 - 4 matching was retained.
Equally, we already know that our thresholds are different over weekends than weekdays, and that there may be specific times of the month where we want further differentiation - so all the day of week/day of month features make the cut. It's not much of a stretch to consider that public holidays may also be important, so we retain the Abstract API Holidays implementation.
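To make the day filtering concrete, here is a minimal Python sketch of how such a check could work. It is not the app's actual code: the function name `should_monitor`, its parameters, and the order in which the filters are applied are all assumptions made for illustration, and the holiday set stands in for whatever the Abstract API Holidays lookup returns.

```python
from datetime import date

def should_monitor(day, days_of_week, include_days=None, exclude_days=None, holidays=None):
    """Return True if thresholds should be monitored on `day`.

    Hypothetical helper: days_of_week uses Python's convention (Monday=0),
    include_days/exclude_days are sets of day-of-month numbers, and
    holidays is a set of dates. The precedence of the checks is assumed.
    """
    if day.weekday() not in days_of_week:
        return False
    if include_days and day.day not in include_days:
        return False
    if exclude_days and day.day in exclude_days:
        return False
    if holidays and day in holidays:
        return False
    return True

WEEKDAYS = {0, 1, 2, 3, 4}  # Monday to Friday
```

With this shape, a weekday instance and a weekend instance can watch the same signal with different thresholds simply by passing different `days_of_week` sets.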
Event Threshold inherently uses repeating intervals, so the "Timeout interval" becomes a threshold measuring interval, and we drop the "Repeating timeouts" and "Repeat timeout suppression" features - they don't belong in Event Threshold.
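The repeating measuring interval can be pictured roughly as follows. This is an illustrative Python sketch only, not the app's implementation: the class `IntervalCounter` and its methods are hypothetical names, and it assumes the default "less than or equal to" comparison.

```python
from datetime import datetime, timedelta, timezone

class IntervalCounter:
    """Count events and evaluate the count each time an interval closes."""

    def __init__(self, window_start, interval, threshold):
        self.interval = interval
        self.interval_end = window_start + interval
        self.threshold = threshold
        self.count = 0

    def record(self):
        # Called once for every event that matches the instance's criteria.
        self.count += 1

    def tick(self, now):
        """Close every interval that has ended by `now`.

        Returns a list of (interval_end, count) pairs for intervals whose
        count was at or below the threshold.
        """
        violations = []
        while now >= self.interval_end:
            if self.count <= self.threshold:
                violations.append((self.interval_end, self.count))
            self.count = 0  # each interval starts counting from zero
            self.interval_end += self.interval
        return violations
```

Because the counter resets at every boundary, there is no separate "repeat" setting to configure; the next interval simply begins.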
The net effect is that I can define instances like:
- Watch @Message for any value between 4am and 4:30am and alert if there are less than or equal to 100 events in 10 minutes
- Watch @Message for any value between 4:30am and 5:00am and alert if there are greater than or equal to 100 events in 5 minutes
- Watch @Message for any value AND StructuredProperty1 for "Test" AND StructuredProperty2 for "Test2", between 5:00am and 6:00am and alert if there are less than 100 events in 30 minutes.
The specificity that is possible means that you can have multiple instances watching the same signal for different criteria, and that you can ensure that you only count the properties that you want.
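As a rough illustration of the property matching in the instances above, the sketch below counts an event only when every configured property matches. It is written in Python for brevity and is not the app's code: the function `matches`, the dictionary representation of an event, and the use of exact equality are all assumptions, and `None` is used here to mean "any value".

```python
def matches(event, criteria):
    """Return True if `event` satisfies every entry in `criteria`.

    `criteria` maps a property name to an expected value, or to None
    when any value is acceptable (the property must still be present).
    """
    for name, expected in criteria.items():
        if name not in event:
            return False
        if expected is not None and event[name] != expected:
            return False
    return True

# Hypothetical criteria mirroring the third instance above.
criteria = {
    "@Message": None,
    "StructuredProperty1": "Test",
    "StructuredProperty2": "Test2",
}
```

Only events for which `matches` returns True would count towards the threshold, which is what lets several instances share a signal without counting each other's events.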
The configuration is very similar to Event Timeout - you get the power to decide how you want your threshold instance to work.
The Result!
We wind up with an app that, like Event Timeout, is forward-looking and uses UTC to calculate the next start event. You can configure it for start and end times up to 24 hours, and use any threshold monitoring interval from 1 second to 24 hours. It benefits from all the work done on Event Timeout, and even led to some minor improvements to both apps for edge cases found during development.
You can see the results of the configuration shown above in the screenshot below. We were looking for events with @Message matching any value over 10 minutes (600 seconds), and alerting if the count fell below the threshold.
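A forward-looking calculation of the monitoring window might look something like this Python sketch. It is an assumption-laden illustration rather than the app's logic: the function `next_window` is hypothetical, and it assumes the configured start and end times have already been converted to UTC.

```python
from datetime import datetime, time, timedelta, timezone

def next_window(now_utc, start, end):
    """Return (start, end) of the current or next monitoring window in UTC.

    If `end` is not after `start`, the window is treated as crossing
    midnight. Candidate windows starting yesterday, today and tomorrow
    are checked in order, and the first one that has not ended wins.
    """
    for offset in (-1, 0, 1):
        day = now_utc.date() + timedelta(days=offset)
        start_dt = datetime.combine(day, start, tzinfo=timezone.utc)
        end_dt = datetime.combine(day, end, tzinfo=timezone.utc)
        if end_dt <= start_dt:
            end_dt += timedelta(days=1)
        if now_utc < end_dt:
            return start_dt, end_dt
    raise AssertionError("unreachable: tomorrow's window always ends after now")
```

Working entirely in UTC sidesteps daylight saving surprises when working out when the next window begins.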
You can see the alert being raised below, which could then be fired off to email, OpsGenie, Jira ... any Seq alerting app, in short.
Adding the ability to invert the threshold criteria is useful, if you want to measure exceeding a threshold rather than falling under the threshold. Simply put, it changes the calculation from "<= threshold" to ">= threshold". I don't show that here - but it's simple logic that makes Event Threshold as versatile as you need.
The net effect is that if you want to get people out of bed because you fell below or exceeded a given volume of events - you can, and it allows you to be as specific as you need to avoid false positives.
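The inverted comparison described above boils down to a single switch. Here it is as a tiny Python sketch; the function name `violates` and its parameters are hypothetical, with the default threshold of 100 taken from the feature list.

```python
def violates(count, threshold=100, invert=False):
    """Return True if `count` should raise a threshold alert.

    Default: alert when count <= threshold.
    Inverted: alert when count >= threshold.
    """
    return count >= threshold if invert else count <= threshold
```

Note that a count exactly equal to the threshold raises an alert in both modes, since both comparisons are inclusive.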
Get Event Threshold for Seq!
Event Threshold has certainly benefitted from the prior work on Event Timeout. It meant that I could get it up and running quickly and easily with a similar and familiar structure that provided the power and capability needed to allow a single app to fill multiple needs with different instances. We will make use of this for a number of scenarios - including, simply, the measurement of different thresholds on weekdays and weekends.
I hope that others will benefit from Event Threshold too. You can install it in your Seq instance using the NuGet package ID Seq.App.EventThreshold, or use the fanciest links in the known universe below: