So Sup Forums

Why couldn't parallel single-threading be a thing? Like 2 cores -> 1 thread for better single-threaded performance?

Didn't intel buy a company that was working on something like that?

Because you can't put butter on toast before buying toast.

Wasn't Bulldozer trying to do that, originally?
I remember seeing some CPU-Z screenshots showing 8C/4T.

No, Bulldozer had modules with 2 cores in each module.

It's called RISC-V and it's gonna be years before you see any decent designs.

On some motherboards for FX processors you can disable one core per module so each module works as 1 core instead of two, making 4 cores and slightly increasing single-threaded performance.

>RISC-V
I don't think that's the company I'm thinking of

It wouldn't, actually.

You'd get the same effect running a program with 1 thread per core, like 2C/4T vs 4C/4T; it just slightly frees up the shared FPU.

Because you have to guarantee data safety in a thread, you fucking mongoloid. Any time you have to wait for cores to communicate through the uncore and back while hoping for a cache hit, you are wasting CPU cycles. In the strictest sense, a "core" already runs multiple instructions in a single clock: independent instructions are extracted automatically via out-of-order execution, which works together with pipelining to minimize stalling, but it relies on the extremely fast caches and not the much slower DRAM.
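A minimal sketch of what that in-core parallelism looks like from the source side (C++, made-up numbers):

#include <cstdio>

int main() {
    // volatile keeps the compiler from folding the arithmetic away,
    // so the core actually executes the operations below.
    volatile int p = 3, q = 5, r = 7, s = 11;
    int a = p * q;  // independent
    int b = r * s;  // independent: an OoO core can issue this alongside a
    int c = a + b;  // data-dependent: must wait for both a and b
    std::printf("%d\n", c);
    return 0;
}

No second core, and no uncore round trip, is needed to overlap the two independent multiplies; only the final add has to wait.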

No, a Bulldozer module was basically 2 integer processing units and one shared floating-point unit, so roughly 1 integer unit and ~0.5 FPU per core. It was shit.

RISC-V is an ISA. Not really related to this.

OP, there actually was a company that produced a working experimental chip that basically did this, sharing resources between "cores" to boost single-threaded performance. IIRC the cores even emulated x86, but they could only get it up to something like 300MHz when Sandy Bridge or Ivy Bridge was already mainstream and hitting 4.5GHz easily with a Hyper 212+. It showed up briefly in the news and virtually disappeared afterwards, with no updates I've heard of to date. If such a thing were feasible for real-world use, everyone would just make ultra-wide cores in the first place and not bother with high core counts; and if making one wide core isn't feasible, then making smaller cores that dynamically combine their resources on one thread is even less feasible.

Okay user, let's make this simple to understand.

There's a small tunnel.
Only one digger can dig.
Only one digger can fit in the tunnel.
The digger doesn't have to worry about the shovel, dirt, or fatigue because they don't exist, so he's active 24/7.
Sending in another digger doesn't make the digging faster.

It could add latency for error checking, yes; maybe it wouldn't be good for gaming, but it could be good for certain applications.

So the idea would be to have this done at the hardware level and not require, say, Windows to do anything.

Like if you used an R7 chip at 8C/4T, Windows would only see the 4 threads and assign tasks to them; the CPU would do all the work of splitting up the load and speeding it up.

This is pointless since it adds latency. Most serial code (in games) executes quickly but is requested frequently, so splitting a serial task across 2 cores is pointless, and impossible anyway.

If the serial task stalled the thread for a longer duration, there might be an incentive for this.

Like what? If you can guarantee thread safety, then you can just run it in two threads. If you couldn't, the latency difference between this technique and some sort of signaling like a semaphore would be almost negligible, if not literally worse.
OS scheduling IS NOT running the same thread over two cores. Do not conflate scheduling with this concept. Scheduling takes, say, three separate threads running simultaneously on two cores and provides a set of techniques to deliver optimal latency (via preemption and data-structure techniques/heuristics) and throughput (via load balancing).
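For illustration, a minimal sketch of that semaphore-style signaling (C++20, values made up): the second thread cannot start its half of the "split" work until the first signals, so the work stays serialized and you pay the signaling cost on top.

#include <cstdio>
#include <semaphore>
#include <thread>

int intermediate = 0;           // value handed from thread A to thread B
std::binary_semaphore ready{0}; // starts unavailable: B must wait for A

int main() {
    std::thread a([] {
        intermediate = 42 * 2;  // first half of the "split" task
        ready.release();        // signal: intermediate is safe to read
    });
    std::thread b([] {
        ready.acquire();        // blocks until A has finished
        std::printf("%d\n", intermediate + 1);  // second half
    });
    a.join();
    b.join();
    return 0;
}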

You have two calculations to do:

1: x = 7 + 13
2: x + 5 = ?

How do you do 2 before you know x? This is why your idea is bunk.

So how does one core do it? One task at a time? Why couldn't 2 cores work on the problem one right after the other through the shared cache?

You would need to get the result of 1. out of the core into a shared cache and then fetch it back into the 2nd core. When there is no shared cache, you have to use RAM instead. This takes around 100 clock cycles when using cache and 1000 when using RAM: insane amounts of time spent with the CPU doing nothing but waiting. Also, this still wouldn't be parallel, since 2. depends on 1.

The way it is done IRL is that one core does 1. and 2. sequentially, refeeding the result of 1. into the ALU and then doing step 2. The time until step 2. can begin is around 4 clock cycles max.
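A rough sketch of that round trip in code (C++, using the thread's own 7 + 13 example): the release/acquire pair below is exactly the "out of core 1, through the shared cache, into core 2" trip, versus a single core just reusing the value from a register a few cycles later.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{-1};  // the shared-cache handoff slot

int main() {
    std::thread core1([] {
        x.store(7 + 13, std::memory_order_release);  // step 1.
    });
    std::thread core2([] {
        int v;
        // Spin until step 1.'s result arrives from the other core.
        while ((v = x.load(std::memory_order_acquire)) < 0) {}
        std::printf("%d\n", v + 5);                  // step 2.
    });
    core1.join();
    core2.join();
    return 0;
}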

But even if it adds latency, for compute tasks that can't use a lot of threads it would be a useful technology.

Like if you could have the CPU cores alternate their cycles: when one is switching, the other is working, something like that.

How is it useful? It just takes 2 cores longer to do 1 thread than 1 core alone, because the thread has to swap between them. Sure, you could alternate, but it's still going to be slower because of the handoff overhead. You fundamentally cannot complete linear tasks in a non-linear way. Unless you have some magic solution to this, your idea is bunk.

Continuation:
There are also other things like the program counter (which tells the CPU which instruction of the program executes next), and this would need to be transferred over and synced with the second core in this so-called "inverse multithreading" concept.

The only setup where two or more cores acting as one could somewhat work is with applications that could also be programmed to run in parallel. Heuristics in such a hypothetical processor's microcode could recognize certain patterns in these applications and make them scale somewhat across more cores on the fly. This would yield more throughput at the cost of latency, but it would ultimately be slower than a program that is properly coded to scale across several normal cores.

So how did that one company Intel bought do it? They had something like that working.

I've never heard about this, unless you got a link

You're proposing an idea where you can break up

A->B->C->D

where each step depends on the previous one, into

A->B
C->D

running in parallel.

That's either going to give you nonsense when you try to compute it, or you will just waste CPU cycles trying to somehow predict how the program will execute.
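In code form (C++, with hypothetical stages f/g/h standing in for the arrows above): every step reads the previous step's output, so a second core assigned to C->D would just sit and wait.

#include <cstdio>

// Hypothetical stand-ins for the A->B->C->D stages above.
int f(int v) { return v + 1; }
int g(int v) { return v * 2; }
int h(int v) { return v - 3; }

int main() {
    int A = 10;
    int B = f(A);  // needs A
    int C = g(B);  // needs B: cannot start anywhere before B exists
    int D = h(C);  // needs C
    std::printf("%d\n", D);
    return 0;
}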

It can be; it is a thing. CSMT is a viable concept. The only caveat is that no one has demoed a high-clocking design.

Soft Machines was the company, and VISC was their architecture. Intel bought them up. They'll definitely make use of that IP at some point in the near future.

He's talking about Soft Machines' VISC. VISC does NOT do what you think it does. The design lets a single thread see a larger pool of virtual resources. It is not a pure hardware solution, and it does not solve the problem of data parallelism (or the lack thereof). What it can do is devote more resources to extracting ILP; for instance, instead of taking a risky hit from the branch predictor, one core could run the more likely branch while the other runs the less likely one. But this comes at a severe cost: not only a translation layer, but also the resource allocation and data safety that all have to run under the hood. The heuristics to do this for significantly more complex computational tasks would be incredibly costly and not worth it.
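To make the branch example concrete, here's a rough software analogy (C++; likely_path/unlikely_path are hypothetical stand-ins, not anything from the actual VISC design): both paths get computed and the wrong one is thrown away, so you pay for both.

#include <cstdio>
#include <future>

// Hypothetical stand-ins for the two sides of a hard-to-predict branch.
int likely_path(int x)   { return x * 2; }
int unlikely_path(int x) { return x - 2; }

int main() {
    int x = 21;
    // Run both paths eagerly, as if on two cores.
    auto a = std::async(std::launch::async, likely_path, x);
    auto b = std::async(std::launch::async, unlikely_path, x);
    bool cond = (x % 2 == 1);               // the branch being "predicted"
    int result = cond ? a.get() : b.get();  // keep one, discard the other
    std::printf("%d\n", result);
    return 0;  // both computations were paid for; that's the severe cost
}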