Instance Balancer and spreading equally across multiple sockets?

Started by MPB, July 25, 2024, 10:24:35 AM


MPB

I'm interested in Process Lasso for a project I'm working on, specifically the "Instance Balancer" feature. I installed the trial version and tested it with "notepad.exe", and it works well enough that I can see the potential.

In essence, we have an executable that is not multi-threaded, so it works best on a single core, and it will use 100% of that CPU while it's running. Currently we're launching the multiple instances using "start /affinity" from the command line, which is cumbersome and requires carefully tracking which cores are already in use so that each additional launch lands on a different CPU.

While "Instance Balancer" alone seems like a great way to spread the load across multiple cores exactly the way we want, it's worth mentioning that this is on a dual socket server, and we also want to try to spread the process loads across both sockets. Memory access is another consideration here, so having all of the processes huddled on just one socket and hammering the memory of that NUMA node isn't ideal.

So, the question is: is there any feature or setting I'm missing that will tell Instance Balancer to spread the utilization equally over all of the NUMA nodes? When testing, it seemed to assign affinity in simple numerical order, so my "notepad" instances started with CPU 0, then 2, 4, 6, etc. (I told it not to use the hyper-threaded siblings). On my dual 14-core Xeon I was hoping for an option that would start with 0, then 28, then 2, then 30, etc.
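
To put it another way, the order I was hoping for is just a round-robin between the two nodes over the non-HT cores. A rough sketch of what I mean (assuming this box enumerates node 0 as CPUs 0-27 and node 1 as 28-55, with the odd numbers being the HT siblings; other machines may number things differently):

    // Sketch: build the order 0, 28, 2, 30, 4, 32, ... by alternating between
    // the two NUMA nodes and skipping the hyper-threaded siblings.
    // Core counts and numbering are hard-coded for illustration only.
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int cpusPerNode = 28;                // logical CPUs per socket (HT on)
        std::vector<int> order;
        for (int c = 0; c < cpusPerNode; c += 2)   // += 2 skips the HT sibling
        {
            order.push_back(c);                    // node 0
            order.push_back(c + cpusPerNode);      // node 1
        }
        for (int cpu : order) printf("%d ", cpu);  // 0 28 2 30 4 32 ... 26 54
        printf("\n");
    }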

Spreading the load out is also better from a thermal perspective: if we loaded up all of the cores on socket #1 first, that CPU could get thermally throttled while socket #2 twiddles its thumbs waiting for something to do.

Any thoughts on how to accomplish this? The alternative we're looking at is redesigning the executable itself to either accept a parameter telling it which core to run on, or give it the smarts to pick a suitably idle core on its own, but considering Process Lasso is pretty reasonably priced, I figured it might be a good solution.
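
For what it's worth, the "pass in a parameter" route is probably a small change on our side. A rough sketch of the idea (Windows-specific; the "--cpu" argument name is just made up for illustration, and this only covers a single processor group of up to 64 logical CPUs, which is fine for this box):

    // Sketch: pin the current process to one logical CPU passed on the command
    // line, e.g. "ourapp.exe --cpu 28". Names here are illustrative only.
    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main(int argc, char* argv[])
    {
        if (argc >= 3 && strcmp(argv[1], "--cpu") == 0)
        {
            int cpu = atoi(argv[2]);
            DWORD_PTR mask = (DWORD_PTR)1 << cpu;   // one bit per logical processor
            if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
                fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        }

        // ... the existing single-threaded work would go here ...
        return 0;
    }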

Jeremy Collake

That sounds like a great use case for the Instance Balancer!

Unfortunately, as you discovered, it has no awareness of processor sockets or NUMA nodes.

We can, however, add such an option. I believe it would be generally useful. We'll consider this for addition early next week and I'll reply again here with any news.


Software Engineer. Bitsum LLC.

MPB

Thanks, that would be a nice addition.

The algorithm to handle the spreading (at least as far as I'm concerned) could be as simple as assigning cores in a "first available, then last available" order, and so on. It seems like right now it assigns affinity in incremental order like I mentioned (0, 2, 4, 6, etc.), but say we had an 8-core CPU just as an example: it could assign the cores as 0, 6, 2, 4 (skipping the HT/SMT cores). Or on my 56-core system, 0, 54, 2, 52, 4, 50, etc.

On a 2-socket system that would spread the load. It wouldn't spread it equally on 4- or 8-socket systems, but at least it's a simple solution for the most likely use cases.
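
To illustrate, a quick sketch of that first/last ordering (even CPUs only, with my 56 logical CPUs hard-coded just for the example):

    // Sketch: "first available, then last available" over the even (non-HT)
    // logical CPUs, producing 0, 54, 2, 52, 4, 50, ...
    #include <cstdio>
    #include <deque>
    #include <vector>

    int main()
    {
        std::deque<int> pool;
        for (int c = 0; c < 56; c += 2)            // 0, 2, 4, ..., 54
            pool.push_back(c);

        std::vector<int> order;
        bool fromFront = true;
        while (!pool.empty())
        {
            if (fromFront) { order.push_back(pool.front()); pool.pop_front(); }
            else           { order.push_back(pool.back());  pool.pop_back();  }
            fromFront = !fromFront;
        }
        for (int cpu : order) printf("%d ", cpu);  // 0 54 2 52 4 50 ...
        printf("\n");
    }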

One other option for core assignments is simply to randomize it. It won't be deterministic, but it may be the best and easiest way to really spread the load no matter the underlying architecture. Throw the available cores in an array and pick one at random for the next assignment. :)
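
Something like this, basically (shuffle once up front and then hand the CPUs out in that order; just a sketch with my core count hard-coded):

    // Sketch: shuffle the available (non-HT) CPUs and assign them in that order.
    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main()
    {
        std::vector<int> cpus;
        for (int c = 0; c < 56; c += 2)            // even CPUs only; 56 is just this box
            cpus.push_back(c);

        std::mt19937 gen(std::random_device{}());
        std::shuffle(cpus.begin(), cpus.end(), gen);

        for (int cpu : cpus) printf("%d ", cpu);   // e.g. 18 42 6 30 ...
        printf("\n");
    }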

The OS can be queried to get the mapping of cores to NUMA nodes, so the more complicated solution would be to do that, figure out "okay, I've got 2 sockets of 28 cores each" or maybe "I have 4 sockets of 20 cores each," and then write a routine that spaces the core assignments out over however many sockets there are.
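
On Windows that query is pretty straightforward. Roughly this (just a sketch of the API calls involved):

    // Sketch: ask Windows which logical CPUs belong to which NUMA node.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        ULONG highestNode = 0;
        if (!GetNumaHighestNodeNumber(&highestNode))
            return 1;

        for (ULONG node = 0; node <= highestNode; ++node)
        {
            GROUP_AFFINITY affinity = {};
            if (GetNumaNodeProcessorMaskEx((USHORT)node, &affinity))
            {
                // Each set bit in affinity.Mask is a logical CPU on this node
                // (within processor group affinity.Group).
                printf("Node %lu, group %u, mask 0x%llx\n",
                       node, affinity.Group, (unsigned long long)affinity.Mask);
            }
        }
        return 0;
    }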

Whatever the case, it'll be nice to see. I have a feeling that even single-socket systems would benefit a little from spreading the load out across the physical package. I don't know if there's a common method of mapping logical cores to their physical placement on the die, but if you've seen those x-ray images of a CPU die, you'll see the tiles that make up the individual cores, and if you could spread the high-usage threads onto different physical sections, I can't help but think that would help with thermal dissipation. Then again, for all I know there's no consistent mapping of logical cores (as exposed to the OS) to the physical tiles, or if there is, I don't know what the relationship is (are cores 1 and 3 next to each other, or cores 1 and 2?). :)

MPB

On a related note, it used to be that core #0 was the only one that could handle interrupts, so if possible, core 0 should be excluded from sticky assignments. I'm not sure that's still the case, but I still try to adhere to it by skipping the first core whenever possible. The old servers I worked on (ProLiant) would usually have some documentation buried deep that mentioned it (and on multi-socket systems, it would spell out that the first socket was the one handling interrupts, etc.).

I could be wrong, and maybe newer OSes and systems spread interrupt handling around and I'm just an old fogey for still paying attention to my first core. :)

Jeremy Collake

Quote from: MPB on Yesterday at 10:12:23 AM
One other option for core assignments is simply to randomize it.

That's likely to be the initial solution, for the reasons you mentioned: it achieves broad compatibility without the complexity and will work well enough most of the time.

To clarify, Process Lasso as a product is aware of processor topologies: the mapping of cores to logical processors, cache associativity, NUMA nodes, processor groups, SMT/HT pairs, and all that. That information just isn't currently used by the Instance Balancer, but it definitely could be.
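
For reference, the kind of Windows query involved looks roughly like this; an illustrative sketch only, not our actual code:

    // Sketch: enumerate physical cores and their SMT/HT siblings via
    // GetLogicalProcessorInformationEx.
    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
        std::vector<char> buffer(len);
        if (!GetLogicalProcessorInformationEx(
                RelationProcessorCore,
                reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data()),
                &len))
            return 1;

        for (DWORD offset = 0; offset < len; )
        {
            auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(
                buffer.data() + offset);
            bool smt = (info->Processor.Flags & LTP_PC_SMT) != 0;
            // Each set bit in the mask is a logical CPU belonging to this core.
            printf("Core (SMT: %s), group %u, mask 0x%llx\n",
                   smt ? "yes" : "no",
                   info->Processor.GroupMask[0].Group,
                   (unsigned long long)info->Processor.GroupMask[0].Mask);
            offset += info->Size;
        }
        return 0;
    }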

As for avoiding core 0, we'll consider an option for that too while we're working on this feature. It is still a thing.
Software Engineer. Bitsum LLC.