An Analysis of Computer Architectures For Exploiting Parallelism
Kiran Anna
Harshavardhan Devireddy
Junayra Jamal
Fritz Feuerbacher
Amalendu Roy
University of Central Florida
CDA5106
- Advanced Computer ArchitectureFall 1999
Table Of Contents
Introduction:
*Expoliting Parallelism:
*Architecture Types:
*Wide issue super scalar:
*Multithreading:
*Simutaneous Multithreading:
*Difference between SMT and Dynamically Scheduled Superscalar:
*Simultaneous Multithreading Processor Architecture:
*SMT Implementation Parameters:
*Fetch Unit:
*"2.8" Fetching:
*Icount heuristic:
*ITAG:
*Difference Between Superscalar, Fine-grained multithreading, and SMT:
*The Bottlenecks of the Simultaneous Multithreading Architecture:
*Multiprocessor:
*Multiprocessors vs. SMT:
*Speculative Execution Using Multiprocessors:
*Conclusion:
*
The authors of this paper are convinced that simultaneous multithreading for use in exploiting paralellism in applications is the best alternative for future performance gains. Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads in a single cycle [1]. Specifically, we would like to exploit instruction level and thread level parallelism.
Simultaneous multithreading is a clever way of utilizing current and future processor design. As processors approach a billion transistors on a chip, researchers continue to debate over the best way to utilize them [3]. We can add more memory (either cache or primary memory) to the chip, but the performance gain from memory alone is limited [3]. Another approach is to increase the level of system integration, bringing support functions like graphics accelerators and I/O controllers on chip [3]. This still gives only marginal performance improvements.
The key to large performance gains is to exploit as much parallelism as possible on the chip. All programs exhibit a certain level of instruction level parallelism. This means that certain parts of the program can have its instructions executed in parallel, i.e. multiple instructions executed in the same clock cycle. Although there is usually always some instruction level parallelism available for exploitation in every program, most programs do not exhibit large amounts of these types of instructions. This means that much of the wide-issue hardware present on today’s superscalar processors will go to waste.
In order to compensate for low levels of instruction level parallelism in programs, we can build a processor that can take advantage of both instruction level parallelism as well as thread level parallelism. This involves running multiple threads at the same time and having instructions from each thread being executed in the same clock cycle. Simply adding more wide-issue super scalar processors is not sufficient to achieve this because these extra processors will be wasted when there is not sufficient thread level parallelism to utilize them [3]. The processor needs to be designed such that it can utilize both types of parallelism simultaneously. Because simultaneous multithreaded processors successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater [3].
A simultaneous multithreaded processor combines hardware features of wide-issue superscalar and multithreaded processors. The result is a processor that can issue multiple instructions from multiple threads each cycle. This achieves better performance for a large variety of workloads [3]. The SMT processor can be built with only minimal hardware complexity and is a straight forward extension of conventional dynamically scheduled superscalar processors [3].
If our goal is to exploit parallelism we must first find out were this parallelism can be found and why its not being exploited now. To do this we have defined two different kinds of parallelism waste that occurs in a proicessor. These are horizontal and vertical waste [3].
The difference between horizontal and vertical waste is shown in the figure below. The figure shows an example execution sequence. Each row represents the issue slots for a single execution cycle. A filled box indicates that the processor found an instruction to execute in that issue slot (functional unit) on that cycle. An empty box (no shading) means that the slot went unused on that cycle. The unused slots are characterized as horizontal and vertical waste. Horizontal waste occurs when some, but not all of the issue slots in a cycle can be used. This typically occurs because of poor instruction level parallelism. Vertical waste occurs when a complete cycle goes unused. This can be caused by a long latency instruction that inhibits further instruction issue[3]. Vertical waste can be eliminated by exploiting thread level parallelism.
Figure 1 omitted.
In figure 1, notice how the in the execution sequence, a single program is being executed so it is trying to find enough instructions to utilize all functional units on each cycle. When it cannot find enough, whole cycles can go wasted.
We have chosen to look at four different architecture types and the associated methods used to exploit parallelism on that platform. We first look at the most comman type, wide issue superscalar, and look at ways this architecture exploits parallelism. Next, we take a look at multithreading and why it is the natural progression from single threaded superscalar. We then take a close look at simultaneous multithreading which not only issues multiple instructions like the superscalar, but also issues from multiple threads, a natural extension of the superscalar and multithreading technique. Finally, we discuss thread level speculation using a multiprocessor architecture. This is a natural extension of the multiprocessor architecture.
The standard RISC architectures issue a maximum of one instruction per clock cycle. It so happens that in many clock cycles even a single instruction is not issued. This is because of various dependencies among the instructions and due to various memory latencies. So effectively the average number of instructions issued per clock cycle in these machines is always less than 1. These machines have no way to exploit the instruction level parallelism available in the programs.
On the contrary, the wide issue superscalar machines issue more than one instruction per clock cycle. These machines have the necessary hardware resources to fetch, check the dependencies and issue more than one instruction per clock cycle. The basic idea of issuing several instructions per clock cycle is to exploit the instruction level parallelism available in the code and thus improve the performance.
Examples of wide issue superscalar machines are:
These machines issue have the necessary hardware to issue from 4 to upto 8 instructions per clock cycle. Thus the ILP is exploited better in these architectures compared to previous architectures.
But, as always our programs & applications don’t have the necessary ILP to make full use of this architecture. Here we use the terms vertical waste and horizontal waste to explain the utilization of the hardware resources of these machines. The wide issue superscalar machines introduce both vertical and horizontal waste.That means we are wasting a lot of hardware resources. The horizontal waste is caused when some but not all issue slots in a cycle have been used. This is caused by poor instruction level parallelism. The vertical wastage is caused when a cycle goes completely unutilized. This is caused by a instruction having a long latency like for example a memory access which inhibits further issue of instructions[3].
This type architecture is definitely gives better performance than the single issue architectures. Previously we can issue at a maximum 1 instruction per clock cycle. But a 8 issue super scalar machine (alpha 21164) when tested on a SPEC benchmark gave a result that on an average 1.5 instructions are issued per clock cycle, which is comparatively much better than the single issue superscalar[9].But the hardware resources are highly underutilized. The processor utilization is something around 19%[9]. This is both good news and bad news. The good news is this machine is much better than the single issue superscalar processor. The bad news is the hardware resources are highly underutilized. There is a good news associated with this ie. this leaves a lot of scope for improvement. So in the next parts of our survey we look at architectures which try to reduces this wastage and improve the performance better.
The limits of the wide issue superscalar processor show up in the very low utilization of the processor’s functional units. This would not be a problem if many specific causes of wasted cycles could be found so that specific solutions could be implemented to eliminate them. There are no dominent causes to be found in a general context [9]. Instruction scheduling attacks several important segments of wasted bandwidth, but this is far from bringing the superscalar up to acceptable utilization. Most general programs do not have a sufficient amount of independent instructions to keep the processor busy. Thus, we have a lot of latency when there are no instructions to issue because of dependences. Multithreading is a way of hiding this latency in a generalized way. That is, we can modify a wide issue superscalar processor so that when a certain threads instruction is executing on a functional unit, and no further instructions for that thread can execute because of a depence, we can swap to a different thread which has instructions that are ready to be executed by the remaining functional units, and start executing. You can see that this does a great job of eliminating the vertical waste explained earlier. That is, a cycle which would otherwise go completely unused, can now be used for useful work. But, you should also notice that this does nothing to eliminate the horizontal waste. You see, multithreaded processors can issue multiple instructions in a single cycle, but they cannot issue multiple instructions from multiple threads in a single cycle. In fact, traditional multithreaded processors convert a lot of the vertical waste into horizontal waste. Fine-grain multithreaded architecture provides a maximum speedup (increase in instruction throughput) of only 2.1 over single-threaded execution (from 1.5 IPC to 3.2) [9]. This is why a new approach is needed to eliminate these wasted functional units. This new design is known as simulataneous multithreaded architecture and is discussed in the next section. Figure 2 below shows how multithreading can eliminate most vertical waste by finding useful instructions to execute from different independent threads.
Figure 2 omitted.
Difference between SMT and Dynamically Scheduled Superscalar:
Simultaneous Multithreading (SMT): It is a technique that permits several independent threads to issue multiple functional units each cycle. In the most general case, the binding between thread and functional unit is completely dynamic. The objective of Simultaneous Multithreading is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread. Simultaneous multithreading combines the multiple-issue-per-instruction features of modern superscalar processors with the latency-hiding ability challenges from these architectures, e.g., achieving high register file bandwidth, supporting high memory access demands, meeting large forwarding requirements, and scheduling instructions onto functional units [1].
Simultaneous multithreading combines hardware features of wide issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions in each cycle; and like multithreaded processors it contains hardware state for several programs (or threads). The result is a processor that can issue multiple instructions from multiple threads each cycle, achieving better performance for a variety of workloads. For a mix of independent programs (multithreading), the overall throughput of the machine is improved. Similarly, programs that are parallelizable, either by a complier or a programmer, reap the same throughput benefits, resulting in program speedup. Finally, a single-threaded program that must execute alone will have all machine resources available to it and will maintain roughly the same level of performance as when executing on a single-threaded, wide-issue processor. Simultaneous multithreading adds minimal hardware complexity to, and, in fact, is a straightforward extension of, conventional dynamically scheduled superscalar (it’s described in the following paragraph).
Dynamically Scheduled Superscalar: Superscalar processors issue varying numbers of instruction per clock cycle and may be either statically scheduled by the compiler or dynamically scheduled using techniques based on Scoreboarding and Tomasulo’s Algorithms. The Mips R1000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture [4]. It fetches and decodes four instructions per clock cycle and dynamically issues them to five fully pipelined, low latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instruction graduate in order upon completion. Although execution is out of order, the processor still provides sequential memory consistency and precise exception handling. A dynamically scheduled superscalar processor includes complex hardware that dynamically reorders instruction execution based on operand availability. This hardware immediately adapts whenever cache misses delay instructions. The Mips R1000 processor looks ahead up to 32 instructions to find possible parallelism. This instruction window is large enough to hide most of the latency for refill s from the secondary cache. However, it can hide only a fraction of main memory latency, which is typically much longer.
Simultaneous Multithreading Processor Architecture:
In this section we present the architecture of a Simultaneous multithreading processor. The throughput gains provided by simultaneous multithreading are possible without adding undue complexity to conventional superscalar processor design.
The SMT architecture is derived from a high performance, out of order, superscalar architecture (Figure 3, without the extra program counter) which represents a projection of current superscalar design trends 3-5 years into the future. This superscalar processor fetches up to eight instructions per cycle; fetching is controlled by a conventional system of branch target buffer, branch prediction, and subroutine return stacks. Fetched instructions are then decoded and passed to register renaming logic, which maps logical
Figure 3 omitted.
registers onto a pool of physical registers, removing false dependences. Instructions are then placed in one of two instruction queues. Those instruction queues are similar to ones used by the MIPS R1000 [20] and the HP-8000 [21], and in this case holding instructions until they are issued. Instructions are issued to the functional units out-of-orders when their operands are available. After completing execution, instructions are retired in-order, freeing physical registers that are no longer needed.
The SMT architecture described here is a straightforward extension to the above conventional superscalar design. The hardware necessary to support simultaneous multithreading on that architecture are:
SMT Implementation Parameters:
The processor contains 3 floating-point functional units and 6 integer units; four of the six integer units also execute load and store instruction. The peak issue bandwidth out of the two instruction queues is therefore nine; however, the throughput of the machine is bounded by the peak fetch and decode bandwidths, which are eight instructions per cycle. All functional units are completely pipelined. Table 1 shows the instruction latencies, which are derived from the Alpha 2164 [8]. A 32-entry integer instruction queue (which handles integer instructions and all load/store operation) and 1 320entry floating point queue, not significantly larger than HP PA-8000 [21, see paper 3].
|
Instruction Class |
Latency |
|
Integer multiply |
8, 16 |
|
Conditional move |
2 |
|
Compare |
0 |
|
All other integer |
1 |
|
FP divide |
17, 30 |
|
All other FP |
4 |
|
Load (cache hit) |
1 |
Table 1.
Instruction Latencies of a typical SMT’s CPU instructions|
|
I Cache |
D Cache |
L2 |
L3 |
|
Size |
32KB |
32KB |
256KB |
2MB |
|
Associativity |
DM |
DM |
4-way |
DM |
|
Line Size |
64 |
64 |
64 |
64 |
|
Banks |
8 |
8 |
8 |
1 |
|
Transfer time |
1 cycle |
1 |
1 |
4 |
|
Access/Cycle |
Var (1-4) |
4 |
1 |
¼ |
|
Cache fill time |
2 cycles |
2 |
2 |
8 |
|
Latency to next level |
6 |
6 |
12 |
62 |
Table 2.
Details of cache hierarchyThe caches (Table 2) are multi-ported by interleaving them into banks, and it’s a look-up free caches and TLBs. TLB misses requires two full memory accesses and no execution resources. Each cycle, one thread is given control of the fetch unit, chosen from among those not stalled for an instruction cache (I cache) miss. If we fetch from multiple threads, we never attempt to fetch from threads that conflict (on an I cache bank) with each other, although they may conflict with other I cache activity (cache fill).
SMT’s implementation parameters will, of course, change as technologies shrink. For the implementation of a typical SMT, the CPU parameters are:
For the memory subsystem, the parameters are:
The fetch unit is one of the important factors in the processor design. In a conventional processor, the fetch unit secures instructions from a single thread into the execution unit. Performance issues revolve around maximizing the number of useful instructions that can be fetched (by minimizing branch miss predictions, for example) and fetching independent instructions quickly enough to keep functional units busy [3].
In an SMT processor, the fetch unit is shared among the available multiple threads. The high level of inter thread competition for the fetch unit can be exploited to increase fetch throughput without increasing the fetch bandwidth. Tullsen, et al., [6] suggested two ways to increase the fetch throughput which is not possible in single threaded processors: (1) the fetch unit can fetch instructions from multiple threads simultaneously increasing the fetch bandwidth utilization.(2) It can be selective about which thread or threads to fetch instructions from. Since all threads cannot provide equally useful instructions in a particular cycle, fetching from thread(s) that will provide the best instructions will benefit the SMT processor.
Tullsen, et al,. [6] showed that higher maximum throughput can be achieved by splitting the fetch over multiple threads and 2.8 Fetching as the best fetching scheme. He also showed that Icount as the best heuristic that selects the best threads to fetch. That paper also showed that ITAG as the best scheme that eliminates the conditions that block the fetch unit.
In this scheme, the fetch unit has eight program counters, one for each thread context. In every cycle, it selects two different threads from a group of threads that have not incurred instruction cache misses earlier and reads eight instruction block for each thread. Since SMT processor cannot issue more than eight instructions, it chooses a subset of these instructions for decoding. It takes instructions from the first thread until it comes across a branch instruction or the end of cache line and the rest are filled in from the second thread to make a total of eight. This scheme involves only minor hardware additions. Two multiple addresses are driven to each cache bank, then multiplexed before decoder to select one cache index per cycle. Since the cache banks are single ported, bank-conflict logic is needed to ensure that each address targets a separate bank. Two hit/miss calculations per cycle are required. It has two cache output buses, each eight instructions wide. In addition, logic to select and combine instructions is necessary and additional pipe stage might be needed.
The 2.8 fetching scheme improved the maximum throughput by 10% without comprising single-thread performance [6]. This is achieved by combining the partitioning of the fetch bandwidth over multiple threads and making that partition flexible. However, the high throughput of this scheme adds more pressure on the instruction queues (IQ) leading to IQ- full conditions.
The processor's efficiency is affected by the quality of the instructions fetched [6]. Icount heuristic is used to select the best thread in an SMT processor. It gives highest priority to the threads that have lowest instructions in the decode, renaming and queue pipeline stages. It improves the performance in several ways. It replenishes the IQ with instructions from fast moving threads and therefore, avoids the blocking of instructions in the IQ that depend on long latency instructions. It maintains a fairly even mix of instructions among the available threads thereby, increasing inter thread parallelism. Finally, it avoids thread starvation as it selects threads that have fewer instructions in the pipeline.
Icount feedback heuristic adds a very low hardware cost to the system. It uses a counter to keep track of instruction count in each thread. Therefore, it requires a small amount additional logic to increment (decrement) counters in each thread when instructions enter the decode stage and to select two smallest counter value threads. The Icount heuristic showed a throughput gain of 2.5 over the unmodified superscalar [6]. The improved performance is because it dynamically biases toward threads that will use processor resources efficiently.
A fetch unit is blocked when the IQ is full or there is a miss in the instruction cache. ITAG scheme is used to prevent the blocking of the fetch unit. when a thread is selected for fetching but experiences a cache miss, It loses the oppurtunity to be fetched that cycle. In this scheme, Icache tags are looked one cycle early. Only non missing threads are chosen for fetch and at the same time cache miss accesses are started . This adds an additional stage to the front of the pipeline, increasing misfetch and mispredict penalties. This scheme requires an one or more additional ports on the I cache tags, so that potential replacement threads can be looked up at the same time [6]. ITAG scheme showed improved performance slightly over Icount 2.8 scheme [6]. It is not an effective scheme when there are few threads.
Difference Between Superscalar, Fine-grained multithreading, and SMT:
The differences between superscalar, fine-grained multithreading, and simultaneous multithreading architectures can be better understood by studying the behavior of issue slots in each processor. Figure 4 shows a set of four issue slots for a single execution cycle. A filled box indicates that the processor has an instruction to execute in that issue slot on that cycle. An empty box denotes an unused slot. Tullsen et al, [3] characterized the unused slots as horizontal or vertical waste.
Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It generally occurs because of poor instruction-level parallelism. Vertical waste occurs when all the issue slots in a cycle go unused. This can be caused by a long latency instruction (such as memory access) that inhibits further instruction issue [3].
Figure 5 omitted.
Figure (5a) shows a sequence from a conventional superscalar processor [3]. Superscalar processor executes a single program or thread. It tries to issue multiple instructions each cycle from this program. Due to low instruction-level parallelism, it may not find sufficient number of instructions to issue thereby, causing horizontal and vertical wastes.
Figure (5b) shows a sequence from a multithreaded architecture, such as Tera [3]. In multithreaded processors, each thread has a separate hardware state. Multithreaded processors execute only one thread in any given cycle. In the next cycle, it switches to a different thread context and executes instructions from the new thread. It eliminates vertical waste, but cannot eliminate horizontal waste. It is still limited by instruction-level parallelism as it executes only one thread each cycle.
Figure (5c) shows that the SMT processor fetches instructions from all threads for execution each cycle. It exploits both instruction-level parallelism and thread-level parallelism. It can execute only one thread if it has highest instruction-level parallelism or it can execute multithreads if each thread has a low instruction-level parallelism. Thus, it recovers issue slots lost to both horizontal and vertical wastes in the above two processors.
The Bottlenecks of the Simultaneous Multithreading Architecture:
Fetch Bandwidth: The fetch throughput is still a significant bottleneck for SMT. The Branch frequency and the PC alignment problem prevent it from fully utilizing the fetch bandwidth.
Branch Prediction: SMT puts more pressure on the branch prediction hardware but it is more tolerant of branch misprediction. Statistics shows that with one thread running, on average 16% of the instructions we fetch and 10% of them are down to the wrong path. But with eight thread running only 9% of the instructions we fetch and and 4% instructions we execute are wrong path. Perfect branch prediction boosts throughput by 25% at 1 thread, 15% at 4 threads, and 9% at 8 threads. So, we can see the decreasing efficiency of the branch prediction hardware.
Register File Size: The numbers registers required by SMT is a very significant issue. Setting the number of excess registers to infinite instead of 100 improves 8-thread performance by 2%. Interestingly lowering it to 90 reduces performance by 1% , and to 80 by 3% and 70 by 6%. So, we can clearly see that there is no sharp drop-off point.
As chip densities continue to rise, single-chip multiprocessors will provide an obvious means of achieving parallelism with the available real estate. On the organizational level, multiprocessors and simultaneous multithreading processors are very similar. Both have multiple register sets, multiple functional units, and high issue bandwidth on a single chip. The key difference is in the way those resources are partitioned and scheduled. The multiprocessor statically partitions resources, devoting a fixed number of resources to each thread. The simultaneous multithreading processor allows the partitioning to change every cycle [9]. It should be obvious that the job of scheduling is more complex on an SMT processor than a multiprocessor. But, in other areas, the simultaneous multithreading model requires fewer resources than the multiprocessor to acheive the desired level of performance. The following table 3 shows the results of experiments conducted in [9] on various SMT/MP configurations.
|
Purpose of Test |
Common Elements |
Specific Configuration |
Throughput (instructions/cycle) |
|
Unlimited FU’s: equal total issue bandwidth, equal number of register sets (processor or threads) |
Test A: FU’s = 32 Issue BW = 8 Reg Sets = 8 Test B: FU’s = 16 Issue BW = 4 Reg Sets = 4 Test C: FU’s = 16 Issue BW = 8 Reg Sets = 4 |
SMT: 8 thread, 8 issue MP : 8 1-issue procs
SMT: 4 thread, 4 issue MP : 4 1-issue procs
SMT: 4 thread, 8 issue MP : 4 2-issue procs |
6.64 5.13
3.40 2.77
4.15 3.44 |
|
Unlimited FU’s: Test A, but limit SMT to 10 FU’s |
Test D: Issue BW = 8 Reg Sets = 4 |
SMT: 8 thread, 8-issue, 10 FU MP : 8 1-issue procs, 32 FU |
6.36 5.13 |
|
Unequal Issue BW: MP has up to four times the total issue BW |
Test E: FU’s = 32 Reg Sets = 8 Test F: FU’s = 16 Reg Sets = 4 |
SMT: 8 thread, 8-issue MP : 8 4-issue procs
SMT: 4 thread, 8-issue MP : 4 4-issue procs |
6.64 6.35
4.15 3.72 |
|
FU Utilization: equal FU’s, equal issue BW, unequal Reg Sets |
Test G: FU’s = 8 Issue BW = 8 |
SMT: 8 thread, 8-issue MP : 2 4-issue procs |
5.30 1.94 |
Table 3.
Results of SMT vs. Multiprocessors [9].
These comparisons show that simultaneous multithreading out performs single-chip multiprocessing in a variety of configurations because of the dynamic partitioning of functional units. More important, SM requires many fewer resources (functional units and instruction issue slots) to achieve a given performance level [9]. For example, a single 8-thread, 8-issue SMT processor with 10 functional units is 24% faster than the 8 processor, single-issue multiprocessor which has identical issue bandwidth but requires 32 functional units (Test D). To equal the throughput of that 8-thread, 8-issue SMT, a multiprocessor system requires eight 4-issue processors, which consume 32 functional units and 32 issue slots per cycle (Test E) [9]. Some further advantages of SMT over multiprocessors which are not directly apparent from the previous results are:
Speculative Execution Using Multiprocessors:
Researchers have come up with an alternative way of exploiting parallelism in general programs using multiprocessors. This method is called thread level speculation. This form of exploitation uses single chip multiprocessors which have support for rollback. The computation model is such that a total order is maintained to eliminate any errors in executing sequential code among the multiple processors. The speculative execution model works like this: first, this model requires that the compiler being used provides support for speculative execution. The compiler creates speculative regions made up of sequential tasks with id's numbered 1-n. These tasks are assigned to processors in order. When a processor finishes taski and commits, it is then able to start work on taski+p. Each task has a state, either current or speculative. The processor executing the task with the lowest id "owns" the current state and all other tasks have a speculative state. Only tasks owning the current state can commit, which means that all other tasks before it must already have commited any lower numbered tasks. All read and write operations executed by a processor in speculation mode are directed to the processor's state, which may be current or speculative. There are two types of writes: a normal write operation which may trigger a RAW hazard on another processor, and a synch-write which is used for synchronization primitives that simply writes to a location[7]. Write operations may trigger a RAW hazard if a higher numbered task has read a value that was not yet written by a lower numbered task. This is the only time a RAW hazard can occur. If a RAW hazard occurs, taski and all tasks after it are rolled back and restarted.
A task can issued three forms of commit statements. In order for a task to commit, it must wait for all previous tasks to commit_and_advance. This simple commit operation has the effect of forcing serialization at the commit point. At the end of each task, a commit_and_advance operation occurs. A terminate-speculation operation also has commit semantics and waits for all previous tasks to complete. After waiting, all speculative states of subsequent tasks are flushed and the other processors are given notice to stop operating in speculative mode[7]. All the primitives are summarized in table 4 below.
|
Operations |
Semantics |
|
Start_Speculation |
Start the speculative mode of execution. tp = n Current state = state of processor with task 0. |
|
Read |
Read from p's state. |
|
Synch_Write |
Write to p's state. |
|
Write |
Write to p's state. If task t > tp has previously read this data, flush speculative states of task t and beyond, and restart the tasks. |
|
Commit |
Wait until p's state becomes the current state. |
|
Commit_And_Advance |
Commit. tp = tp + P Current state = speculate state of processor executing task tp + 1 |
|
Terminate_Speculation |
Commit. Flush speculative states of subsequent tasks. Signal to stop speculation mode for all processors. |
Table 4.
Speculative execution model primitives[7]This method of exploiting paralellism from general programs looks promising, however the hardware requires additional support for the model primitives. The biggest problem with this model is that researchers have not been able to find enough speculative code to exploit. This current research in this paper focused entirely on loop level speculation. Loop iterations are the traditional target of parallelization and an obvious candidate for thread level speculation. Each iteration of a loop can be turned into a speculative thread that runs in parallel with the other iterations of that loop. The degree of parallelism available in a loop is governed by the presence of data dependences that cross loop boundaries. If the iterations operate on disjoint sets of data, the degree of parallelism can be equal to the number of iterations in the loop[8]. Since thread level speculation requires compiler support as well as hardware support, it does not seem to me to offer as much flexability as simultaneous multithreading, which still utilizes the processors resources the most efficiently and without special compiler support.
The main objective of SMT is to substantially increase processor utilization for both long memory latencies and limited available parallelism per thread. Simultaneous multithreading combines the multiple-issue-per-instruction features of modern superscalar processors with the latency hiding ability of multithreaded architecture. SMT is an evolutionary design that attacks multiple sources of waste in Wide Issue Processors. SMT uses instruction-level and thread-level parallelism to effectively increase effective processor utilization for both multiprogramming and parallel workload. The cost of SMT is an important factor to consider. In SMT one will see that we are increasing the hardware complexity and as a result the cost will increase. But if we observe we will see that the transistor density per chip is increasing day by day. Furthermore the SMT architecture will be implemented in a time period of 2 to 3 years later. So, by that time the cost wouldn’t be that much of a problem. Therefore certainly SMT is an excellent building block for future technologies and machine designs.
|
INSTRUCTION THROUGHPUT EXECUTING A MULTIPROGRAMMING WORKLOAD AND A PARALLEL WORKLOAD |
|||||||||
|
MULTIPROGRAMMING PARALLEL WORKLOAD WORKLOAD Threads SS FGMT SMT SS MP2 MP4 FGMT SMT |
|||||||||
|
1 |
2.7 |
2.6 |
3.1 |
|
3.3 |
2.4 |
1.5 |
3.3 |
3.3 |
|
2 |
|
3.3 |
3.5 |
|
4.3 |
2.6 |
4.1 |
4.7 |
|
|
4 |
|
3.6 |
5.7 |
|
|
4.2 |
4.2 |
5.6 |
|
|
8 |
|
2.8 |
6.2 |
|
|
|
3.5 |
6.1 |
|
Table 5.
Workload comparisonAccording to table 5 above, we see that the superscalar’s instruction averaged 2.7 instruction per cycle for multiprogramming workload. For parallel workload it has a slightly higher average through of 3.3. So we can conclude that SMT executed the multiprogramming workload is 2.3 times faster and parallel workload is 1.9 times faster (at eight threads). The superscalar’s inability to exploit more instruction level and thread level parallelism causes this. Fine-grained multithreaded architecture eliminates vertical waste, as a result it provides speedup over the superscalar as high as 1.3 on both workloads. Interestingly the maximum speedup occurred at 4 threads. That is a significant drawback, because with additional threads the performance fells. According to the table SMT was able to get higher instruction throughput and better program speedup than the fine-grained multithreaded processor. SMT obtained better speedups than the multiprocessors MP2 and MP4, not only at the maximum thread capability, that is 8 for SMT, 4 for MP2 and 2 for MP4, but also for a given number of threads. According to the table from our research paper SMT’s throughput reached 6.1 instructions per cycle, compared with 4.3 for MP2 and 4.2 for MP4.