Friday, August 17, 2018

spinlock - Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

I wrote some lock-free code that works fine with local
reads, under most conditions.



Does local spinning on a memory read necessarily imply I
have to ALWAYS insert a memory barrier before the spinning
read?



(To validate this, I managed to produce a reader/writer
combination which results in a reader never seeing the
written value, under certain very specific conditions--dedicated
CPU, process attached to CPU, optimizer turned all the way up,
no other work done in the loop--so the arrows do point in that
direction, but I'm not entirely sure about the cost of spinning
through a memory barrier.)
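
For concreteness, here is a minimal sketch of the kind of reader/writer pair I mean (not my actual test program; the names and structure here are made up). With the optimizer turned up and neither a barrier nor volatile, the compiler is free to read value once and keep it in a register, so the reader can spin forever without ever seeing the writer's store:

#include <pthread.h>
#include <stdio.h>

static int value = 0;            /* shared flag, deliberately not volatile */

static void *reader( void *arg )
{
    (void) arg;
    /* At -O2 this load can be hoisted out of the loop, so the loop body
       never touches memory again and the store below is never observed. */
    while ( value == 0 )
        ;
    printf( "reader saw value = %d\n", value );
    return NULL;
}

static void *writer( void *arg )
{
    (void) arg;
    value = 1;                   /* plain store, nothing forces visibility */
    return NULL;
}

int main( void )
{
    pthread_t r, w;
    pthread_create( &r, NULL, reader, NULL );
    pthread_create( &w, NULL, writer, NULL );
    pthread_join( w, NULL );
    pthread_join( r, NULL );     /* may never return when built with -O2 */
    return 0;
}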



What is the cost of spinning through a memory barrier if
there is nothing to be flushed in the cache's store buffer?
i.e., all the process is doing (in C) is




while ( 1 ) {
    __sync_synchronize();
    v = value;
    if ( v != 0 ) {
        /* ... something ... */
    }
}


Am I correct to assume that it's free and that it won't encumber
the memory bus with any traffic?



Another way to put this is to ask: does a memory barrier do
anything more than flush the store buffer, apply the pending
invalidations to the cache, and prevent the compiler from
reordering reads/writes across its location?
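
To make the question concrete, here is a sketch of my own (assuming GCC on x86-64) that separates the compiler-only part of a barrier from the full thing. Both variants stop the compiler from caching or reordering the read; only the second also emits a serializing instruction:

static int value;

/* Compiler barrier only: no instruction is emitted, but GCC may not move
   or cache memory accesses across this point. */
static inline int read_after_compiler_barrier( void )
{
    __asm__ __volatile__ ( "" ::: "memory" );
    return value;
}

/* Full barrier: same compiler effect, plus a hardware fence
   (mfence or a LOCK'ed instruction on x86). */
static inline int read_after_full_barrier( void )
{
    __sync_synchronize();
    return value;
}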






Disassembling, __sync_synchronize() appears to translate into:




lock orl
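
(For anyone who wants to reproduce this, a minimal way to see what the builtin turns into; the file name and flags are just an example, and the exact instruction depends on the compiler version and target:)

/* barrier.c: compile with  gcc -O2 -S barrier.c  and look at barrier.s,
   or  gcc -O2 -c barrier.c && objdump -d barrier.o  */
void barrier_only( void )
{
    __sync_synchronize();
}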


From the Intel manual (similarly nebulous for the neophyte):



Volume 3A: System Programming Guide, Part 1 -- 8.1.2 Bus Locking


Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.

[...]

For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).


My translation: "when you say LOCK, this would be expensive, but we're
only doing it where necessary."







@BlankXavier:



I did test this: if the writer does not explicitly push the write out of the store buffer and it is the only process running on that CPU, the reader may never see the effect of the write. I can reproduce it with a test program, but as I mentioned above it happens only with one specific test, with specific compilation options and dedicated core assignments. My algorithm works fine; it was only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road.



I think by default simple writes are WB (write-back) writes, which means they don't get flushed out immediately, while reads on the same core still see the most recent value (I believe that's called "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual all the different write/memory types (UC, WC, WT, WB, WP), Intel Vol. 3A, 11-10; I'm still learning about them.
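
Roughly, the writer side I am describing looks like this (a sketch with made-up names, using GCC's legacy __sync builtins, not my real code). The LOCK'ed CAS both publishes the value and acts as a full barrier, so the store does not sit in the store buffer:

static int value = 0;

static void publish( int new_value )
{
    int old;
    do {
        old = value;
    } while ( !__sync_bool_compare_and_swap( &value, old, new_value ) );
}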



My uncertainty is on the reader's side. I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache, and I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to ensure the optimizer keeps the read in the compiled loop) will check the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough, i.e. it could read an old cache line which still appears valid because an invalidation for it is still queued (that sounds a bit incoherent to me too, but then how do invalidation queues work?), then an atomic read would be necessary, and my question is: in that case, will it have any impact on the bus? (I think probably not.)
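
By "an atomic read" I mean something along these lines (a sketch using the __sync builtins; whether it is actually needed is exactly what I'm asking). Adding zero returns the current value through a LOCK'ed read-modify-write, which on x86 typically becomes a lock xadd:

static inline int atomic_read( int *p )
{
    /* LOCK'ed read-modify-write that changes nothing but returns the
       current value and acts as a full barrier. */
    return __sync_fetch_and_add( p, 0 );
}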




I'm still reading my way through the Intel manual, and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment; I think this is the best way to really get a feel for how this works.
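
As a starting point for that experiment, something like the following (a sketch, assuming GCC extended asm on x86-64) pins down exactly which load instruction the reader executes, so the plain and fenced variants can be compared directly:

static inline int plain_load( int *p )
{
    int v;
    /* a single movl, with a "memory" clobber so the compiler re-reads
       on every iteration */
    __asm__ __volatile__ ( "movl (%1), %0" : "=r"(v) : "r"(p) : "memory" );
    return v;
}

static inline int fenced_load( int *p )
{
    int v;
    /* the same load preceded by mfence, to compare the cost of spinning
       through a hardware barrier */
    __asm__ __volatile__ ( "mfence\n\tmovl (%1), %0" : "=r"(v) : "r"(p) : "memory" );
    return v;
}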

