I have not been able to make as much progress on moss over the last 14 days or so as I would like, in large part because of limited time due to getting some exciting work across the finish line at $dayjob. However, whenever I find myself limited in the amount of “hands on keyboard” time I have to spend on a project, I try to invest more of my idle brain time (IBT, if you will) thinking about the next stages of development, as well as higher level long-term goals. The question that was floating around during my IBT this week was whether it is possible for the design of a competitive modern microprocessor to fit in one person’s head.

This question is really applicable to any system of sufficient complexity, but it just so happens that the processors we use today are some of the most complex systems in the proverbial “stack” of modern computing. If you talk to many folks, they’ll explain that this complexity is essential — that is, the end result would be diminished if any of the complexity was reduced. I trust this assessment from folks with vastly more experience in this realm than my current day self, but as with most things, I still want to verify.

For much of the complexity in modern microprocessors, there are clear performance advantages. Techniques such as cache hierarchies, instruction pipelining (superpiplining), and multiple issuance (superscalar), all of which add more complexity in terms of both the number of transistors and the logical model of operation, can lead to significant improvements. However, the benefit of some instruction scheduling techniques are more ambiguous. Two commonly used approaches are out-of-order execution (OoO) and speculative execution. The former refers to the ability for a processor to “rearrange” instructions in order to saturate pipelines and avoid costly cache misses. The latter refers to the ability of a processor to “guess” the outcome of a branch and speculatively execute the instructions that would follow if the guess was in fact correct. There is a great description of both, and many other microprocessor concepts, in the Lighterra 90-Minute Modern Microprocessor Guide.

In most modern processors, these techniques are employed in tandem, providing maximum freedom to exploit available resources. For example, instructions from a speculatively executed branch may be rearranged such that they are intermingled with those prior to the branch, potentially unlocking superior scheduling opportunities. However, because the implementation of these features comes with a complexity cost, it is worth evaluating whether the unlocked performance could be achieved via less expensive means.

The alternatives to these techniques generally rely on static optimization performed by the compiler. In my recent research, I read Discerning the Dominant Out-of-Order Performance Advantage: Is it Speculation or Dynamism? (McFarlin, Tucker, Zilles), which gives an excellent breakdown of the sources of performance advantages offered by OoO processors over static scheduling.

We perceive, however, that there are two potential sources for the performance advantage between the hardware OOO schedules and compiler-generated schedules as executed by in-order machines:

  1. the OOO has few constraints other than true dependencies and functional unit availability to constrain schedule generation, whereas existing ISAs prevent the expression of some speculative schedules by compilers.
  2. OOO can execute the same instructions in different schedules at different points of the execution, reacting to the specific events observed during each execution. We’ll call this advantage dynamism.

- McFarlin, Tucker, Zilles (2013)

The big idea of the paper is that if the majority of the performance advantages offered by out-of-order processors can be attributed to speculation (1) rather than dynamism (2), then it is possible that an in-order processor could achieve comparable performance if its ISA was able to fully express the same speculation primitives to the compiler. This is a logical train of thought as it isolates the capability (dynamism) of out-of-order machines that is fundamentally not available to static optimization techniques.

Without diving too deeply into the methodology employed in the paper (you should go read it!), the authors essentially simulate the behavior of an out-of-order processor, extract the schedules it generates, then see if those same schedules could be accomplished at compile time. While the results are nuanced, the conclusion is that much of the benefits offered by out-of-order scheduling could be achieved by relatively less expensive alternatives.

The performance of our simulated machine with these harvested schedules suggests that the dynamic flexibility of a full-blown OOO processor is (expensive) over-kill and shows us some of the architectural features a more statically-scheduled processor needs to achieve similar performance.

- McFarlin, Tucker, Zilles (2013)

The conclusion is interesting on its own, but it is also worth noting the year in which the paper was published. Since 2013, we have seen a slew of microarchitectural attacks based on transient execution. The Meltdown and Spectre vulnerabilities revealed in 2017 illustrated how the complexity of modern day microprocessors, namely speculative and out-of-order execution capabilities, open the door to leaking sensitive data between processes. In the following years, many more vulnerabilities have been disclosed, with the mitigations ranging from compiler updates to microcode patches. In some cases, these mitigations have caused significant performance degradation, and in others, full mitigation has not been possible.

To be clear, I don’t think that the analysis by McFarlin, Tucker, and Zilles, or the impacts of the vulnerabilities disclosed over the past few years, provide conclusive evidence that dropping out-of-order or speculative execution capabilities would be better for the industry. However, it does lead me to question whether some of the microprocessor complexity we have deemed essential in the name of ever improving performance may actually be accidental. I fully acknowledge my bias here: as someone who is attempting to design a competitive processor on my own, I want it to be the case that simpler implementations may actually out perform more complex ones when all facets are considered.

Perhaps the most likely answer is that there is enough essential complexity that a competitive modern microprocessor design cannot fit in one person’s head, but it is still true that the microprocessors we have today are much more complex than they need to be. If so, I hope efforts such as mine play a small part is leading us closer to systems that are more simple, more secure, and when it is all said and done, more performant.

If you have experience in this space and want to weigh in, I would love to hear from you!