Most computer scientists are familiar with Amdahl’s Law. Boiled down, it says the fastest your program can run is determined by the slowest of its parallel threads (assuming you can decompose it into parallel threads).
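Stated as a formula, that ceiling falls out directly (a minimal sketch; the function name and example numbers are illustrative, not from the original text):

```python
# Amdahl's Law: overall speedup is capped by the fraction of the
# program that cannot be parallelised, no matter the thread count.
def amdahl_speedup(serial_fraction, n_threads):
    """Predicted speedup for a program whose non-parallelisable part is
    `serial_fraction` (0..1) when spread across `n_threads` threads."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With 10% serial code, even unlimited threads cannot beat 10x:
print(amdahl_speedup(0.1, 8))     # well short of 8x
print(amdahl_speedup(0.1, 1024))  # approaching the 10x ceiling
```

The same arithmetic is why the longest-running thread dominates: shaving time off the shorter threads moves the total not at all.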
However, it says little about the decomposition itself, and it assumes your threads have equal resources (an SMP environment). Looked at from a NUMA perspective with heterogeneous computing, what it tells you is that, given a decomposition into parallel threads, the one worth the most compilation effort is the longest, and you can probably save energy by running the shorter threads on smaller cores (in environments like ARM’s big.LITTLE).
Amdahl’s Law also skips over the communication overhead incurred by the decomposition. SMP architectures with globally shared resources don’t cope well when there is a lot of it, and performance gains tail off quickly as the thread count rises. You need to spend as much (or more) effort compiling the communications as the code itself. In the larger picture, one may find that inter-thread communication latency is the major block to absolute speed – pushing even a single byte of data through a coherent cache layer takes a very long time.
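The tail-off is easy to see if you bolt a communication term onto the basic formula (a sketch only; the linear per-thread overhead model and all numbers here are illustrative assumptions, not from the original text):

```python
# Amdahl's Law extended with an illustrative communication cost that
# grows with thread count, modelling shared-resource contention.
def speedup_with_comms(serial_fraction, n_threads, comms_cost):
    """`comms_cost` is the communication overhead added per extra
    thread, as a fraction of the single-thread runtime (assumed model)."""
    parallel = (1.0 - serial_fraction) / n_threads
    overhead = comms_cost * (n_threads - 1)
    return 1.0 / (serial_fraction + parallel + overhead)

# Gains tail off, then reverse, as the thread count grows:
for n in (2, 8, 32, 128):
    print(n, round(speedup_with_comms(0.05, n, 0.002), 2))
```

Even a small per-thread cost eventually dominates, which is why adding cores to a shared coherent fabric stops paying off.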
Altogether, that says if you want to run your code as fast as possible (without boiling oceans) you need a platform with a variety of cores and a programmable low-latency communication fabric – almost anything other than an Intel x86 SMP processor. Why don’t we have that? Well, mostly because all our compiler tools and software methodology target x86 SMP machines, and very little else. ARM’s big.LITTLE architecture headed in the right direction, but failed to add low-latency core-to-core communication, despite ARM being a favorite with the FPGA community.