Low latency local event flags

Post by jhamby » Wed Nov 22, 2023 12:31 pm

I'm pleased to report that VMS local event flags are low-latency enough to wake up an event loop thread more quickly than other mechanisms, as I'd hoped. I was able to show this with a recent commit to my fork of the libuv (cross-platform async I/O) repo, where I'm adding OpenVMS optimizations:

https://github.com/jhamby/vms-libuv

I haven't started to add the actual sys$qio / async AST callback completion code yet, but the same async event wakeup mechanism is used for worker thread synchronous event completions, which is how file I/O is handled on nearly all platforms. I also needed to merge poll() events for nonblocking socket I/O (until I add native VMS socket support to replace the generic sockets code) and for any file descriptors the user of the library may have passed in, and to handle timed callbacks, which the POSIX code path implements by passing the earliest timer event for the thread as the timeout argument to poll().
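
For reference, here's a rough sketch of that POSIX-path timeout calculation (illustrative names, not the actual libuv internals): the earliest pending timer deadline for the loop simply becomes poll()'s millisecond timeout.

Code: Select all

#include <limits.h>
#include <stdint.h>

/* next_due is the earliest pending timer deadline in ms,
   or UINT64_MAX if the loop has no active timers. */
static int compute_poll_timeout(uint64_t now_ms, uint64_t next_due)
{
    if (next_due == UINT64_MAX)
        return -1;                      /* no timers: block indefinitely */
    if (next_due <= now_ms)
        return 0;                       /* already due: don't block */
    uint64_t diff = next_due - now_ms;
    return diff > INT_MAX ? INT_MAX : (int)diff;
}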

Long story short: atomic variables can really limit scalability, and if you don't believe me, Fedor Pikus has several enlightening and entertaining C++ talks on the topic that show just how much of a slowdown they can cause in SMP configurations, especially when they share a cache line. My strategy was to maximize the advantage from each technique I had to use anyway, to amortize the penalty of each synchronization type.

I have a poll() thread managed by the event loop thread, with one additional fd (one end of a socketpair()) on which it receives single-byte commands telling it to pause, poll, or exit. In pause mode, it waits on only its command fd. When the libuv event loop thread changes the fds or events that it cares about, it tells the poll thread to pause.
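
Here's a minimal sketch of that command loop (the command values and structure are illustrative, not the actual vms-libuv code):

Code: Select all

#include <poll.h>
#include <unistd.h>

/* Single-byte commands sent over one end of the socketpair(). */
enum { CMD_PAUSE = 'p', CMD_POLL = 'g', CMD_EXIT = 'x' };

/* fds[0] is always the command socket; fds[1..nfds-1] are whatever the
   event loop currently cares about. */
static void poll_thread(int cmd_fd, struct pollfd *fds, nfds_t nfds)
{
    int paused = 1;
    for (;;) {
        fds[0].fd = cmd_fd;
        fds[0].events = POLLIN;

        /* In pause mode, watch only the command fd.  The timeout is
           always -1: the event loop wakes us on demand. */
        int n = poll(fds, paused ? 1 : nfds, -1);

        if (n > 0 && (fds[0].revents & POLLIN)) {
            char cmd;
            if (read(cmd_fd, &cmd, 1) == 1) {
                if (cmd == CMD_EXIT)  return;
                if (cmd == CMD_PAUSE) paused = 1;
                if (cmd == CMD_POLL)  paused = 0;
            }
        }
        /* ...otherwise report any ready fds back to the event loop... */
    }
}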

This way, there's no shared atomic variable between the poll thread and event loop thread managing the desired state. That's one atomic variable bottleneck avoided. For timing, the poll() thread always waits with no timeout (-1), since the event loop thread can wake it on demand through its command socket. When the event loop wants to sleep with a timeout, it creates a timer on a second local event flag (since sys$setimr clears the specified event flag which could otherwise reset pending events on the shared async event flag), then sleeps on the OR of the two flags. On wakeup, the timer is canceled in case it was still pending.
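
That sleep-with-timeout sequence could look roughly like this (a sketch assuming the two flags are already in the same cluster, not the actual vms-libuv code):

Code: Select all

#define __NEW_STARLET
#include <starlet.h>
#include <gen64def.h>

static void sleep_with_timeout(unsigned int async_ef, unsigned int timer_ef,
                               GENERIC_64 *delta)  /* negative = relative time */
{
    const unsigned __int64 timer_id = 1;  /* reqidt, used later by sys$cantim */

    /* sys$setimr clears timer_ef itself, so any pending wakeup on the
       shared async_ef is left untouched. */
    sys$setimr(timer_ef, delta, 0, timer_id, 0);

    /* Wait for async_ef OR timer_ef.  Both must be in the same cluster;
       the mask is indexed by flag number within that cluster. */
    unsigned int mask = (1u << (async_ef % 32)) | (1u << (timer_ef % 32));
    sys$wflor(async_ef, mask);

    /* Cancel the timer in case the async flag is what woke us. */
    sys$cantim(timer_id, 0);
}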

The pair of event flags has to be in the same cluster, so if they aren't, a third flag is allocated and swapped with the mismatched flag, which is then freed. At startup, event flags 1-23 are explicitly freed, since the default behavior of lib$get_ef() is to reserve them for compatibility with programs that used them without reserving them. After subtracting the 8 system-reserved event flags, that allows up to 28 (or 27 if event flag 0 isn't available) uv_loop_t event loops per process, which is likely enough, considering the point of this architecture is multiplexing a large number of events onto a smaller number of event loop threads, possibly only one.
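
The allocation logic might look roughly like this (a sketch of what's described above, with error handling simplified; not the actual code):

Code: Select all

#include <lib$routines.h>

/* Flags 1-23 are reserved by default for old programs that used them
   without reserving them, so release them once at startup. */
static void release_legacy_efs(void)
{
    for (unsigned int ef = 1; ef <= 23; ef++)
        lib$free_ef(&ef);
}

/* Allocate two local event flags in the same 32-flag cluster. */
static int get_ef_pair(unsigned int *ef_a, unsigned int *ef_b)
{
    if (!(lib$get_ef(ef_a) & 1)) return 0;
    if (!(lib$get_ef(ef_b) & 1)) { lib$free_ef(ef_a); return 0; }

    if (*ef_a / 32 != *ef_b / 32) {
        /* Mismatched clusters: allocate a third flag (which should land
           in the same cluster as the later allocation), swap it in for
           the mismatched flag, and free the old one. */
        unsigned int ef_c;
        if (!(lib$get_ef(&ef_c) & 1)) {
            lib$free_ef(ef_a);
            lib$free_ef(ef_b);
            return 0;
        }
        lib$free_ef(ef_a);
        *ef_a = ef_c;
    }
    return 1;  /* success */
}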

In order to use a single event flag for async worker thread callbacks, poll() thread callbacks, and AST completion callbacks, I made a single atomic variable with flag bits; each source ORs its type into it, and sets the event loop's event flag only if the previous value was 0. If my logic is all correct, any mismatch in the ordering of raising the event flag and atomically setting the flag bits should, in the worst case, cause the event loop to wake up too early, find no events, and go back to sleep until the next wakeup.
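
The wakeup path looks roughly like this (a sketch with illustrative names, assuming C11 atomics):

Code: Select all

#include <stdatomic.h>
#include <starlet.h>

#define PENDING_WORKER  0x1u  /* worker thread completed a request */
#define PENDING_POLL    0x2u  /* poll() thread has ready fds       */
#define PENDING_AST     0x4u  /* AST completion callback fired     */

static _Atomic unsigned int pending_events;

/* Called from any event source: OR in the source's type bit.  Only the
   source that transitions the word away from 0 sets the event flag, so
   redundant sys$setef calls are avoided. */
static void signal_event(unsigned int source_bit, unsigned int loop_ef)
{
    unsigned int prev = atomic_fetch_or(&pending_events, source_bit);
    if (prev == 0)
        sys$setef(loop_ef);
}

/* Called by the event loop after it wakes: grab and clear all pending
   source bits in one atomic exchange. */
static unsigned int consume_events(void)
{
    return atomic_exchange(&pending_events, 0u);
}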

One interesting data point I discovered is that Linux and FreeBSD both have reliable and fast implementations of libuv, while the latest Solaris 11 release available without a support contract can't even complete the async1 benchmark I've been using to test inter-thread wakeup latency. Solaris has its own "completion port" API and apparently it doesn't work very well. FreeBSD and macOS use kqueues and Linux uses io_uring if available, falling back to epoll(). The failure of libuv's benchmark to complete on Solaris 11 doesn't make me eager to recommend it to anyone who could be using Linux or FreeBSD instead.

BTW, I haven't tried to build or test my work in progress on either Itanium or Alpha, due to the lack of a C11-compatible _Atomic keyword or <stdatomic.h> header file. On x86-64, using CXX, I was able to use <atomic> and its C11-compatible function prototypes that expand into compiler builtins by adding necessary casts to make the code C++-compatible.


Re: Low latency local event flags

Post by jhamby » Fri Nov 24, 2023 2:13 pm

Two follow-ups: Solaris 11.4.42.111.0 has a perfectly reliable completion port API, but it's very slow, about as slow as I'm seeing with my local event flag version on VMS, given the same 6-CPU, 16 GB RAM VM.

The libuv port is using the Solaris completion port API as a fancy version of poll() with more steps. The API makes you reload the fds you care about after they trigger an event, which seems inefficient as far as data structures go. So the benchmark didn't fail outright; it timed out after 60 seconds.

That brings me to the other follow-up: there's a bug in the VMS implementation of poll() that's causing it to wait 10x the duration specified. The benchmark should also have timed out for me on VMS, but it was apparently waiting for 600 seconds: calling poll() with a timeout of 6000 waits for 60 seconds, not 6.


pustovetov
VSI Expert

Re: Low latency local event flags

Post by pustovetov » Sat Nov 25, 2023 5:57 am

jhamby wrote:
Fri Nov 24, 2023 2:13 pm
That brings me to the other follow-up: there's a bug in the VMS implementation of poll() that's causing it to wait 10x the duration specified. The benchmark should also have timed out for me on VMS, but it was apparently waiting for 600 seconds: calling poll() with a timeout of 6000 waits for 60 seconds, not 6.
Nope. The poll() routine has no timeout issues:

Code: Select all

#include <poll.h>
#include <stdio.h>

#define __NEW_STARLET
#include <starlet.h>
#include <gen64def.h>

int main()
{
    GENERIC_64 begin_time;
    sys$gettim(&begin_time, 1);    /* start time, in 100 ns units */

    poll(NULL, 0, 6000);           /* no fds to watch: just sleep 6000 ms */

    GENERIC_64 end_time;
    sys$gettim(&end_time, 1);

    /* convert the 100 ns delta to milliseconds */
    printf("%i ms\n", (int)((end_time.gen64$q_quadword - begin_time.gen64$q_quadword) / 10000));
}
....
$ cc test
$ link test
$ r test
6029 ms
The poll() function is just slow if you use it for anything other than sockets. (OpenVMS is not Linux. For good performance, we have asynchronous sys$qio, but that is not a POSIX-compatible way of doing things.)
On Linux, one system call is enough to poll, for example, five pipes. On VMS, we spend at best about 10 system calls (5 sys$qio + 4 sys$cancel + sys$schdwk + sys$hiber), and right now we're spending a lot more... I'm working to improve the situation. My latest patch speeds up the polling of pipes on IA64 by about 20 times, but on x86 it did not show a significant improvement. ;(
P.S. Our TCP/IP stack has a special sys$qiow(EFN$C_ENF, channel, IO$_SENSEMODE | IO$M_MORE...) for select(), so we spend one system call to poll/select sockets, just like on Linux.
