Low latency local event flags
Posted: Wed Nov 22, 2023 12:31 pm
I'm pleased to report that VMS local event flags are low-latency enough to wake an event loop thread more quickly than the other mechanisms, as I'd hoped. I was able to show this with a recent commit to my fork of the libuv (cross-platform async I/O) repo, where I'm adding OpenVMS optimizations:
https://github.com/jhamby/vms-libuv
I haven't started to add the actual sys$qio / async AST callback completion code yet, but the same async event wakeup mechanism is used for worker thread synchronous event completions, which is how file I/O is handled on nearly all platforms. I also needed to merge poll() events for nonblocking socket I/O (until I add native VMS socket support to replace the generic sockets code) with any file descriptors the user of the library may have passed in. Finally, I needed to handle timed callbacks, which the POSIX code path implements by passing the earliest timer deadline for that thread as the timeout argument to poll().
Long story short: atomic variables can really limit scalability, and if you don't believe me, Fedor Pikus has several enlightening and entertaining C++ talks on the topic that show just how much of a slowdown they can be in SMP configurations, especially if they're on the same cache line. My strategy was to maximize the advantage from each technique that I had to use anyway, to amortize the penalty of each synchronization type.
I have a poll() thread, managed by the event loop thread, that watches one additional fd: one end of a socketpair() over which it receives single-byte commands telling it to pause, poll(), or exit. In pause mode, it waits on only its command fd. When the libuv event loop thread changes the fds or events that it cares about, it tells the poll thread to pause.
This way, there's no shared atomic variable between the poll thread and event loop thread managing the desired state. That's one atomic variable bottleneck avoided. For timing, the poll() thread always waits with no timeout (-1), since the event loop thread can wake it on demand through its command socket. When the event loop wants to sleep with a timeout, it creates a timer on a second local event flag (since sys$setimr clears the specified event flag which could otherwise reset pending events on the shared async event flag), then sleeps on the OR of the two flags. On wakeup, the timer is canceled in case it was still pending.
The pair of event flags has to be in the same cluster, so if they aren't, a third flag is allocated and swapped with the mismatched flag, which is then freed. At startup, event flags 1-23 are explicitly freed, since the default behavior of lib$get_ef() is to treat them as reserved, for compatibility with programs that used them without reserving them. Subtracting the 8 system-reserved event flags, that enables up to 28 (or 27 if event flag 0 isn't available) uv_loop_t event loops per process, which is likely enough, considering the point of this architecture is to multiplex a large number of events onto a smaller number of event loop threads, possibly only one.
In order to use a single event flag for async worker thread callbacks, poll() thread callbacks, and AST completion callbacks, I made a single atomic variable with flag bits into which each source ORs its type, setting the event loop's event flag only if the previous value was 0. If my logic is all correct, any mismatch in the ordering of raising the event flag and atomically setting the flag bits should, in the worst case, cause the event loop to wake up too early and find no events, then go back to sleep until the next wakeup.
One interesting data point I discovered is that Linux and FreeBSD both have reliable and fast implementations of libuv, while the latest Solaris 11 release available without a support contract can't even complete the async1 benchmark I've been using to test inter-thread wakeup latency. Solaris has its own "completion port" API and apparently it doesn't work very well. FreeBSD and macOS use kqueues and Linux uses io_uring if available, falling back to epoll(). The failure of libuv's benchmark to complete on Solaris 11 doesn't make me eager to recommend it to anyone who could be using Linux or FreeBSD instead.
BTW, I haven't tried to build or test my work in progress on either Itanium or Alpha, due to the lack of a C11-compatible _Atomic keyword or <stdatomic.h> header file. On x86-64, using CXX, I was able to use <atomic> and its C11-compatible function prototypes that expand into compiler builtins by adding necessary casts to make the code C++-compatible.