Instance Balancer and spreading equally across multiple sockets?

Started by MPB, July 25, 2024, 10:24:35 AM

Previous topic - Next topic

MPB

I'm interested in Process Lasso for a project I'm working on, specifically the "Instance Balancer" feature. I installed the trial version and was just testing it with "notepad.exe", and it works well enough that I can see the potential.

In essence, we have an executable that is not multi-threaded, so it works best on a single core, and it uses 100% of that core while it's running. Currently we're launching multiple instances using "start /affinity" options from the command line, which is cumbersome and requires carefully tracking which cores are in use so additional launches land on a different CPU.
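For anyone following along: `start /affinity` takes a hexadecimal bitmask of allowed logical processors, so launching one instance per core means computing a fresh mask for each launch. A small sketch of that bookkeeping (the `app.exe` name is just a placeholder):

```python
# Sketch: build the hex bitmask that Windows' "start /affinity" expects.
# Each logical processor corresponds to one bit; core N -> bit N.
def affinity_mask(core: int) -> str:
    return format(1 << core, "x")

# One launch per even-numbered core (skipping SMT siblings), e.g.:
#   start /affinity 1 app.exe    -> core 0
#   start /affinity 4 app.exe    -> core 2
for core in (0, 2, 4, 6):
    print(f"start /affinity {affinity_mask(core)} app.exe")
```

The tracking burden comes from having to remember which bits you've already handed out across launches, which is exactly what the Instance Balancer automates.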

While "Instance Balancer" alone seems like a great way to spread the load across multiple cores exactly the way we want, it's worth mentioning that this is on a dual socket server, and we also want to try to spread the process loads across both sockets. Memory access is another consideration here, so having all of the processes huddled on just one socket and hammering the memory of that NUMA node isn't ideal.

So, the question is: is there any feature or setting I'm missing that will tell Instance Balancer to spread the utilization equally over all of the NUMA nodes? When testing, it seemed to assign affinity merely in numerical order, so my "notepad" instances would start with CPU 0 and then 2, 4, 6, etc. (I told it not to use hyper-threads). On my dual 14-core Xeon, I was hoping for an option that would start with 0, then 28, then 2, then 30, and so on.

It's also more efficient from a thermal perspective: if we loaded up all of the cores on socket #1 first, that CPU could get thermally throttled while socket #2 is twiddling its thumbs waiting for something to do.

Any thoughts on how to accomplish this? The alternative we're looking at is redesigning our executable itself to either pass in a parameter telling it what core to run on, or give it the smarts to pick a suitable idle core on its own, but considering Process Lasso is pretty reasonably priced I figured it might be a good solution.

Jeremy Collake

That sounds like a great use case for the Instance Balancer!

Unfortunately, as you discovered, it has no awareness of processor sockets or NUMA nodes.

We can, however, add such an option. I believe it would be generally useful. We'll consider this for addition early next week and I'll reply again here with any news.


Software Engineer. Bitsum LLC.

MPB

Thanks, that would be a nice addition.

The algorithm to handle the spreading (at least as far as I'm concerned) could be as simple as assigning cores "first available, then last available," and so on. Right now it seems to assign affinity in incremental order like I mentioned (0, 2, 4, 6, etc.), but on an 8-core CPU, for example, it could assign the cores as 0, 6, 2, 4 (skipping the HT/SMT cores). Or on my 56-core system: 0, 54, 2, 52, 4, 50, etc.

On a 2-socket system that would spread the load. It wouldn't spread it equally on 4 or 8 socket systems but at least it's a simple solution for the most likely use cases.

One other option for core assignments is simply to randomize it. It won't be deterministic, but it may be the best and easiest way to really spread the load no matter the underlying architecture. Throw the available cores in an array and pick one at random for the next assignment. :)
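The randomized version is even simpler — essentially drawing from a bag of free cores. A sketch, with a made-up free-core list:

```python
import random

def pick_random_core(available):
    """Pick and remove a random core from the list of currently free cores."""
    core = random.choice(available)
    available.remove(core)
    return core

free = [0, 2, 4, 6]
first = pick_random_core(free)
# "first" is one of 0/2/4/6 and has been removed from the free list
```

Statistically this spreads instances across sockets regardless of how the cores are numbered, which is the appeal: no topology knowledge needed.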

The OS can be queried for the mapping of cores to NUMA nodes, so the more complicated solution would be to do that, figure out "okay, I've got 2 sockets of 28 cores each" or "I have 4 sockets of 20 cores each," and then write a routine that spaces the core assignments out over however many sockets there are.
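Once that node-to-core mapping is in hand (on Windows it can be queried via APIs like `GetLogicalProcessorInformationEx`), the spread itself could be a plain round-robin across nodes. A minimal sketch with a hypothetical two-socket mapping:

```python
from itertools import cycle

# Hypothetical mapping as queried from the OS: NUMA node -> its physical cores
nodes = {
    0: [0, 2, 4, 6],      # socket 1
    1: [28, 30, 32, 34],  # socket 2
}

def round_robin(nodes):
    """Yield cores alternating across NUMA nodes: 0, 28, 2, 30, ..."""
    iters = [iter(cores) for cores in nodes.values()]
    for it in cycle(iters):
        try:
            yield next(it)
        except StopIteration:
            return  # simplistic: stop once any node's cores are exhausted

order = list(round_robin(nodes))
print(order)  # [0, 28, 2, 30, 4, 32, 6, 34]
```

Unlike the first/last trick, this generalizes cleanly to 4- or 8-socket systems, since it cycles over however many nodes the OS reports.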

Whatever the case, it'll be nice to see. I have a feeling that even single-socket systems would benefit a little from spreading the load across the physical package. I don't know if there's a common method of mapping logical cores to their physical placement on the die, but if you've seen those die shots of a CPU, you'll see the tiles that make up the multiple cores, and if you could spread high-usage threads onto different physical sections, I can't help but think that would help with thermal dissipation. Then again, for all I know, there's no consistent mapping of logical cores (as exposed to the OS) to the physical tiles — or if there is, I don't know what the relationship looks like (are cores 1 and 3 next to each other? or cores 1 and 2?)  :)

MPB

On a related note, I don't know if this is still the case, but it used to be that core #0 was the only one that could handle interrupts, so if possible, core 0 should be excluded from sticky affinity assignments. I still try to adhere to that by skipping the first core whenever possible. The old servers I worked on (HP ProLiant) usually had some documentation buried deep that mentioned it (and on multi-socket systems, it would spell out that the first socket was the one handling interrupts, etc.).

I could be wrong, and maybe newer OS and systems will spread INT handling and I'm just an old fogey for still paying attention to my first core. :)

Jeremy Collake

Quote from: MPB on July 26, 2024, 10:12:23 AMOne other option for core assignments is simply to randomize it.

That's likely to be the initial solution for the reasons you mentioned: It achieves broad compatibility without the complexity and will work well enough most of the time.

To clarify, Process Lasso as a product is aware of processor topologies: mappings to logical processors, cache associativity, NUMA nodes, groups, SMT/HT pairs, and all that. The information just isn't currently used by the Instance Balancer, but it definitely could be.

As for avoiding core 0, we'll consider an option for that too while we're working on this feature. It is still a thing.
Software Engineer. Bitsum LLC.

Jeremy Collake

Process Lasso v14.3.0.21 beta adds the random core selection option to the Instance Balancer rules. You'll see the checkbox when you click the "Show Advanced" button.

The "reserved cores" setting will allow you to exclude CPU 0, as it will come off the beginning.

You can get on the beta channel by clicking "Updates / Include Betas", then "Check Now". To get off the beta channel, reinstall the latest release version from https://bitsum.com and then uncheck that option.

If you try it, let me know how it goes!
Software Engineer. Bitsum LLC.

MPB

That sounds great, I'll give it a try in the next few days when I have a chance. :)

Forgand

Quote from: Jeremy Collake on July 29, 2024, 03:02:37 PMProcess Lasso v14.3.0.21 beta adds the random core selection option to the Instance Balancer rules. You'll see the checkbox when you click the "Show Advanced" button.

The "reserved cores" setting will allow you to exclude CPU 0, as it will come off the beginning.

You can get on the beta channel by clicking "Updates / Include Betas", then "Check Now". To get off the beta channel, reinstall the latest release version from https://bitsum.com and then uncheck that option.

If you try it, let me know how it goes!

I would love to try this new feature.

MPB

My initial tests of it went okay. I was using Notepad to test and see how the affinity was being randomized. I think I saw some issues where additional instances at one point no longer had an affinity, but when I tried to recreate that with another app (just cmd.exe) I wasn't seeing it, so I'm not totally sure.

Unfortunately my trial ran out once I got back from vacation so I may have to try this on a different system to retest how that's working.

I probably also need to play more with getting the initial program to have that setting as well as any child processes. I know I can do 2 rules to cover my bases there (like one for Notepad and another for the child processes of it, since Win11 Notepad actually kicks off a child process, sometimes). In that case the child process is also "notepad.exe" so it wouldn't matter, but in the case of "cmd.exe" it kicks off a child process of conhost, so in my test scenario I'd want that to have the affinity setting as well.

I didn't know if there's a simple "one entry" rule that would set the instance balancer for the process *and* its child processes, or if it's just better to have "cmd.exe" and then also "childof:cmd.exe" ... I thought a regex might be the way to do it, but I didn't get far enough into reading about the different fields to know whether a simple "cmd.exe" would match both cases anyway.