[A different perspective to http://parallel.cc]
So the genesis of WT was the realization (after working on it for a year or more) that the maximum speed-up on SMP for VCS was going to be ~4x. Two decades on, Synopsys doesn’t seem to be doing much better than that (at 5x):
https://news.synopsys.com/2016-03-24-Synopsys-Unveils-Breakthrough-Parallel-Simulation-Performance-Technology-for-VCS
(the gate level engine was just bad when I worked on it in the 1980s, looks like someone fixed it).
So why does an embarrassingly parallel problem do so badly on SMP? Two main reasons: poor cache performance (one stalled CPU can stall everybody), and the von Neumann bottleneck (made worse by cache coherency over a globally shared data bus). A secondary reason is that logic simulation twiddles individual bits, but bus transfers are usually cache-line sized at a minimum, and that turns into a latency issue.
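To make the granularity mismatch concrete, here is a minimal sketch (the function and names are my own illustration, not VCS internals) of gate-level-style bit twiddling: each net is one bit in a packed word, yet any access to it drags a whole cache line (typically 64 bytes, i.e. 512 nets) across the shared bus.

```c
#include <stdint.h>

/* Illustrative sketch, not simulator code: nets packed one bit per
 * position in a word.  Evaluating one 2-input NAND touches 2 bits,
 * but the memory system still moves an entire cache line to deliver
 * them -- the source of the latency problem described above. */
static int eval_nand(uint64_t nets, int a, int b)
{
    int va = (int)((nets >> a) & 1);
    int vb = (int)((nets >> b) & 1);
    return !(va & vb);   /* NAND of the two selected net bits */
}
```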
So what’s the fix? Starting with the obvious, it would be good to never miss cache, so let’s go for manycore so there is a lot more cache, and throw out the coherency requirement. Given that L2 caches are not that big, that’s still going to mean congestion on the shared data bus, so you probably want to ditch that too and just use local communication between the cores.
So now you have a machine which is a 2D array of cores with local memory and only nearest-neighbor communication, which nicely matches what you see on a silicon IC if you consider it as a 2D compute surface. However, it does have the niggling problem that you are still sharing data between cores (which is a coherency issue), so the final step is to drop that in favor of message passing.
OK, so problem solved, but how are you going to build that? You can check the dead parallel processors list for how it has gone in the past. Most of those failed because the ratio of on-chip memory to off-chip memory is bad, unless you die-stack it, and die-stacking doesn’t go well with hot CPUs. The second problem is how you write code to run on it: it’s NUMA, not SMP, nobody has a good flow for that, and how many of your friends are GPU programmers?
So let’s step back for a moment and consider that the C code for VCS-MT (or Cheetah) is actually embarrassingly parallel in itself, only the SMP machine is bad. So if you look at it the right way, maybe some small change to how it runs is all that is needed.
Message-passing/Call-Return Equivalence
The best-defined thing in code is how you make routine calls: where the arguments go (in registers) and how values come back. That allows you to view the call and return operations as message passing, and on machines that don’t use a stack directly to do it (most modern CPUs) it’s completely invisible to other threads. Viewed that way, it becomes fairly obvious that you can replumb how your calls get made so that the message doesn’t have to start and finish in the same place, and you can manipulate it en route; that’s a mechanism we can use to turn regular code into message-passing code.
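A minimal sketch of the call/return-as-message view (the struct and names here are my own illustration, not WT’s actual mechanism): the caller packs its arguments into a message, a dispatcher — which could in principle run anywhere — unpacks and invokes, and the return value travels back the same way.

```c
/* Illustrative only: a routine call reified as a message.  The caller
 * builds the "message" (function plus arguments, which would normally
 * travel in registers); wherever dispatch() runs, the call/return
 * contract looks unchanged to the caller. */
typedef struct {
    long (*fn)(long, long);   /* where the message is going */
    long a, b;                /* the arguments */
} call_msg;

static long add(long x, long y) { return x + y; }

static long dispatch(call_msg m)
{
    return m.fn(m.a, m.b);    /* the result is the "return message" */
}
```

The point is that nothing in `add` changes: only the plumbing between call site and callee does.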
So now we have the mechanism to create threads that move from core to core as necessary, by intercepting calls and relocating where execution continues (YouTube), and it works without changing the code; only the linker and runtime need to be modified.
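The interception idea can be sketched as follows, with hypothetical names (`bound_fn`, `trampoline`, `migrate_to` are my inventions): the linker routes calls through a trampoline that moves the thread to the callee’s home core before continuing, so the application code itself is untouched.

```c
/* Hypothetical sketch of call interception.  In a real runtime
 * migrate_to() would reschedule the thread on another core; here it
 * just records the move so the control flow is visible. */
static int current_core = 0;

static void migrate_to(int core) { current_core = core; }

typedef struct {
    int home_core;            /* where this code's data lives */
    long (*fn)(long);
} bound_fn;

static long trampoline(bound_fn f, long arg)
{
    if (current_core != f.home_core)
        migrate_to(f.home_core);   /* execution continues elsewhere */
    return f.fn(arg);              /* the callee is none the wiser */
}

static long twice(long x) { return 2 * x; }
```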
More details can be found in the patent.
What else does that fix?
So while the original goal was accelerating circuit simulation, that code pattern matches some other kinds of code, neural networks being top of the list, along with certain kinds of database search. For database search, threads may wander off and never come back; returning is not essential to the paradigm.
It also works as a methodology for programming NUMA machines, which is why I sometimes call it “virtualizing SMP” – moving threads instead of data means I don’t need the shared data infrastructure, but I can use the same programming paradigm.
Although the initial concept was to move work to data (simulations being models of static hardware), you can move threads on multiple criteria. GPUs use banks of cores performing specific tasks, and you can emulate that with WT by binding code to particular cores, such that threads will go to a given core in order to execute particular code. That also applies to special-purpose cores for things like indexing and security.
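One way to picture the binding (the table layout and names are my invention, not WT’s): a small table mapping kinds of code to a designated core, consulted whenever a thread calls into that code.

```c
#include <string.h>

/* Invented illustration of code-to-core binding: calls into "crypto"
 * code always land on a security core, "index" code on an indexing
 * core, mimicking a GPU's banks of special-purpose units. */
typedef struct { const char *code; int core; } binding;

static const binding table[] = {
    { "crypto", 7 },   /* security core */
    { "index",  5 },   /* indexing core */
};

static int core_for(const char *code)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].code, code) == 0)
            return table[i].core;
    return 0;          /* unbound code can run anywhere */
}
```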
Distance is no object. WT is not restricted to multi-core, it can be used to turn a distributed environment into something that looks like a simple SMP-like programming environment.
Wandering threads or walking cores might work very nicely for “self-organizing nets.” It is a lot easier to think about a core taking a new spot in a 2D net than to intercept and re-route data between places where you are and places where you want to be, especially if data has to follow you around. That would fit very well with a CPU that has all its data on chip ( http://www.scholarpedia.org/article/Kohonen_network ).