EventLog().EntryWritten is bad and should feel bad - the v1.1.2 update for Seq Client for Windows Logins

MattMofDoom June 24th, 2021 20:58

Why aren't new logins showing up???
Cache drains and cache stalls
Re-examining assumptions
EntryWritten is bad and should feel bad
Switching to EventLogWatcher().EventRecordWritten
Results
v1.1.2 now available!

Why aren't new logins showing up???

I mentioned in my previous post that the Windows event log can iterate through all log entries and send them to the EventLog().EntryWritten handler repeatedly. I was handling that with an ad-hoc implementation of a 'cache' that aimed to ensure we only examined recent events and de-duplicated.

This behaviour was showing up on a Windows 2016 server with a fairly busy event log, but I was also wrong in my assumption that the log was large - a subsequent check indicated it was "only" ~40MB (subsequently expanded to ~80MB). The log was overwriting events as needed, and it appears that this is the main cause of the iteration.

Over time, though, I observed that Seq.Client.WindowsLogins 'stopped' detecting new logins. Initially I assumed this was either a problem with the cache (possibly unsafe thread handling or a deadlock), or with the period of new events - essentially, that the event handler was gradually taking longer and longer to send new events, so examining the past 10 minutes was inadequate for monitoring.

Cache drains and cache stalls

Once I added a counter to observe the number of items in the cache, I could see what appeared to be a "drain" of the cache. Essentially - the cache would travel along and happily build a list of "seen" events ... and at some point, suddenly drop to 0 and never recover. This correlated to the behaviour of not detecting new logins, so I assumed that correlation is causation.

I went through quite a few iterations of possible solutions - locks, switching to a System.Runtime.Caching.MemoryCache implementation, sliding cache and log window, async handling, and eliminating the log window altogether. I also tried adding a watchdog to detect if events hadn't been received and to therefore 'restart' the EventLog().EntryWritten handler.

The biggest "success" I had was when I switched to using the Microsoft.Extensions.Caching.Memory.MemoryCache implementation. Here, I encountered the exact opposite behaviour - I observed the cache "stalling" (hitting an upper count and then staying at that figure indefinitely). Again - no new logins being captured.

Re-examining assumptions

At this stage I felt I'd hit a wall, so I aimed to reset the assumptions I'd built as I worked on the problem, and start from first principles, by taking a step back and examining what I knew.

EventLog().EntryWritten reprocesses/resends old events on a server with a busy security log that overwrites events as needed
We attempt to solve this by only allowing recent events and caching events that we have seen
The issue does not occur on a different server at a different OS version that has lower activity
On the server where the reprocessing behaviour is seen, after several hours, we can observe the cache either
1. growing and then suddenly draining to 0 and never growing, or
2. growing and then stalling - never growing further or shrinking
Locking or deadlocking does not appear to be a cause, if you assume that MemoryCache is generally threadsafe.
After this occurs, new logins are not detected and sent to Seq
Heartbeat events continue to be sent to Seq
Restarting the service temporarily allows new logins to be detected and sent to Seq

I may have missed some aspects of my analysis from the list, but it's a reasonable summary.

EntryWritten is bad and should feel bad

So starting from the first item - why does EventLog().EntryWritten do that? Is it a known behaviour? Have others encountered it?

The short answer is yes - but there wasn't a lot to go off. I certainly found instances of people observing that behaviour, but often with no solution, or the solution being simply references to the Microsoft documentation that didn't appear to help. The most concrete thing that I could find was that it appears to relate to old event record ids being removed from the list and therefore causing Windows to re-send all events to the handler.

More searching could perhaps get me to a definitive statement on the behaviour - but having more context, I contemplated that perhaps I shouldn't be accepting this behaviour. Hence, some options;

Switch to a periodic "poll" of the event log using EventLog().Entries and looking at events that matched event id 4624 with recent TimeGenerated properties. If I did this every 10 seconds, or perhaps every minute, then I could iterate through a list and raise matching properties.
Switch to an EventLogReader() with a similar polling approach with a query that selects only event id 4624 with the Success Audit keyword.
Switch to an EventLogWatcher() with a query that selects only event id 4624 with the Success Audit keyword, and use the EventLogWatcher().EventRecordWritten handler

Each of these approaches would likely have an effect on the problem, but the EventLogWatcher() option afforded the most "similar" solution to the existing implementation, and meant it was most likely to send new logins to Seq in the shortest time possible.

Switching to EventLogWatcher().EventRecordWritten

I'll second the opinion of a number of posts I read - there's not a lot of meat to the Microsoft documentation for EventLogWatcher(). Nonetheless, I managed to work out an implementation and pivot the code over - it's not really difficult.

                _eventLog = new EventLogQuery("Security", PathType.LogName,
                    "*[System[band(Keywords,9007199254740992) and (EventID=4624)]]");
                _watcher = new EventLogWatcher(_eventLog);
                _watcher.EventRecordWritten += OnEntryWritten;
                _watcher.Enabled = true;

Not too far different from the EventLog().EntryWritten implementation, although it does give us the advantage that an EventRecord is passed, rather than an EventLogEntry that we then also need to implement an EventLogReader() query to extract our properties from.

With that advantage in mind, the handler becomes relatively "simple". The code really only needs to extract the properties and log them. I opted to make the handler async, and send the HandleEventLogEntry() worker off to a separate thread to avoid blocking new events; the method itself has nothing async to call upon, so I've just used a Task.Run() delegate. There's nothing we need to wait for in that worker; firing it off to a separate thread will result in only a few outcomes that will either be logged or update a counter. I don't think it's particularly expensive in CPU time, but I want to finish the handler's tasks as quickly as possible.

        private static async void OnEntryWritten(object sender, EventRecordWrittenEventArgs args)
        {
            try
            {
                //Ensure that events are new and have not been seen already. This addresses a scenario where event logs can repeatedly pass events to the handler.
                if (args.EventRecord != null && args.EventRecord.TimeCreated >= ServiceStart &&
                    !EventList.Contains(args.EventRecord.RecordId))
                    await Task.Run(() => HandleEventLogEntry(args.EventRecord));
                else if (args.EventRecord != null && args.EventRecord.TimeCreated < ServiceStart)
                    _oldEvents++;
                else if (args.EventRecord == null)
                    _emptyEvents++;
            }
            catch (Exception ex)
            {
                Log.Exception(ex).Add("Failed to handle an event log entry: {Message:l}", ex.Message);
            }
        }

Results

The result is ... so far so good!

I added new counters to the heartbeats in this release, which give me the ability to construct a Seq dashboard. These are:

ItemCount - the number of items in the cache at the time of a heartbeat
LogonsDetected - the total count of interactive logons detected while the service has been running
NonInteractiveLogons - logins that were determine to be "non-interactive" (eg. not a human being)
UnhandledEvents - events for which properties were not extracted and logged
OldEvents - events with a time that is older than the last service start time
EmptyEvents - events sent to the handler that are null (no event)

Based on testing, the last 3 should always be 0. We don't seem to hit the event reprocessing problem with this implementation, so we don't have any "old events". The EventLogWatcher() appears to reliably send valid event records, so we don't have "empty events". And we always extract the properties from events sent to our worker, so there are no "unhandled events".

The above dashboard screenshot shows measurements of two servers over several hours, since v1.1.2 was started on each server. In the top two graphs and the table;

we can see the cache being populated and trimming as entries expire
we can see that logins are being detected correctly when they occur (so far)
we can see that our counters reflect what we expect - the most consistently growing number is the non-interactive logins as scheduled tasks, processes, and network access occurs. I don't bother graphing that since it only increases.

The bottom two graphs reflect the Event Timeout for Seq instances that are monitoring our heartbeats. I can use the "Successfully matched" log entries to show heartbeats over time, and I can watch for Error logs from each instance to reflect when a timeout has occurred - the service is stopped or server is down.

The fact that we are never seeing old events means that the cache implementation appears to not strictly be necessary anymore - however I do like it for the purpose of health and statistics. It gives us a picture of overall volume in terms of activity, and it's only a cache of event record ids that is relatively memory efficient, so I'm leaving it in for now at least.

v1.1.2 now available!

I'm well past the period in which these issues have been showing up at this stage, and while I'm still monitoring, this appears to be behaving exactly as expected - I'm happy with it.

You can download Seq.Client.WindowsLogin v1.1.2 from the below fancy link (which is now fixed to point to the right repo)!

Seq Client for Windows Logins Latest Version

Seq Client for Windows Logins Total Downloads

EventLog().EntryWritten is bad and should feel bad - the v1.1.2 update for Seq Client for Windows Logins

Table of Contents

Why aren't new logins showing up???

Cache drains and cache stalls

Re-examining assumptions

EntryWritten is bad and should feel bad

Switching to EventLogWatcher().EventRecordWritten

Results

v1.1.2 now available!

Comments

You may also like:

Seq Client for Windows Logins v1.1.3 - Improved support for OpsGenie and Jira integrations!

NOTE: Updates to Event Timeout and Event Threshold may cause instances to stop

Updates to Event Timeout, Event Threshold, and Event Schedule for Seq ... and introducing Lurgle.Dates!