In this post, we will explore how the heterogeneous design of the MIPS Warrior I-class I6500 CPU delivers significant benefits in terms of performance and low power consumption.
High-performance processors typically employ techniques such as deep, multi-issue pipelines, branch prediction and out-of-order processing to maximise performance, but this can have consequences for power efficiency.
If some of a workload's tasks can be parallelised, then partitioning them across a number of efficient CPUs will deliver a high-performance, power-efficient solution. To accomplish this, CPU vendors have provided multi-core clusters, and operating system and application developers have designed their software to exploit these capabilities.
Picking up multiple threads
Even with out-of-order execution, CPUs running typical workloads spend the majority of their time waiting for access to the memory system. The newly introduced MIPS I6500 includes support for multi-threading, where each thread appears to the software as a separate processor, so one thread can make progress while another waits on memory. Depending on the application, adding a second thread to a CPU typically adds 40% to the overall performance for an additional area of around 10%. The MIPS I6500 can accommodate up to six CPUs, each with up to four threads, giving up to 24 threads in a single cluster.
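To put the 40% and 10% figures above in perspective, here is a minimal sketch that expresses them as performance per unit of silicon area. The function name and the idea of a single "performance per area" ratio are our own simplification for illustration, not a metric taken from the I6500 documentation.

```python
def perf_per_area(perf_gain: float, area_gain: float) -> float:
    """Relative performance per unit area after adding a feature.

    Both arguments are fractional increases over the baseline CPU,
    e.g. 0.40 for +40% performance and 0.10 for +10% area.
    """
    return (1.0 + perf_gain) / (1.0 + area_gain)


# Figures quoted in the text for adding a second thread to a CPU.
second_thread = perf_per_area(perf_gain=0.40, area_gain=0.10)
print(f"Performance per area vs. single-threaded baseline: {second_thread:.2f}x")
```

Under these numbers the multi-threaded CPU does roughly 1.27 times as much work per unit of area as the single-threaded baseline, although the real gain depends on the application.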
The MIPS I6500 cluster adds support for hardware coherency at the system level and interfaces to the system using AMBA ACE, a coherent bus interface for compatibility with popular coherent NoC products such as NetSpeed’s Gemini and Arteris’ Ncore. With the NetSpeed Gemini product, 64 MIPS I6500 clusters provide up to 384 processors running an impressive 1,536 threads.
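The processor and thread counts quoted above follow directly from the per-cluster maximums. A small sketch of the arithmetic, using a hypothetical `ClusterConfig` type of our own naming:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Maximum configuration of one cluster, per the figures in the text."""
    cpus_per_cluster: int = 6
    threads_per_cpu: int = 4

    def threads_per_cluster(self) -> int:
        return self.cpus_per_cluster * self.threads_per_cpu


def system_totals(clusters: int, cfg: ClusterConfig = ClusterConfig()) -> tuple[int, int]:
    """Return (total CPUs, total hardware threads) for a multi-cluster system."""
    cpus = clusters * cfg.cpus_per_cluster
    return cpus, cpus * cfg.threads_per_cpu


print(ClusterConfig().threads_per_cluster())  # threads in one full cluster
print(system_totals(64))                      # (CPUs, threads) across 64 clusters
```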
With many consumer products, the processing performance that is needed can vary considerably over time. Peak performance is generally required only occasionally. For most of the time, lower performance is sufficient, and this can be provided by a simpler, more power-efficient processor that significantly reduces total energy consumption.
Sharing a common view
To move a task from one processor to another requires each processor to share the same instruction set and the same view of system memory. This is accomplished through shared virtual memory (SVM). Any pointer in the program must continue to point to the same code or data and any dirty cache line in the initial processor’s cache must be visible to the subsequent processor.
Figure 1: Memory moves when transferring between clusters
Figure 2: Smaller, faster memory movement when transferring within a cluster
Maintaining cache coherency between processors
Cache coherency can be managed through software. This requires that the initial processor (CPU A) flush its cache to main memory before transferring to the subsequent processor (CPU B). CPU B then has to fetch the data and instructions back from main memory. This process can generate many memory accesses and is therefore time consuming and power hungry; this impact is magnified as the energy to access main memory is typically an order of magnitude higher than fetching from cache. To combat this, I6500 CPU clusters support hardware cache coherency, minimising these power and performance costs.
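The cost difference can be sketched with a toy energy model. Everything numeric here is an assumption for illustration: the energy units are arbitrary, the 10x ratio simply encodes the "order of magnitude" mentioned above, and the line counts are made up. The model also ignores instruction fetches and any snoop that misses and has to go to memory.

```python
# Hypothetical relative energy units, chosen only to illustrate the argument.
E_CACHE = 1.0   # one line served from another CPU's cache via a snoop
E_MEM = 10.0    # one line written to or read from main memory (~10x a cache access)


def software_managed_energy(dirty_lines: int, reused_lines: int) -> float:
    """CPU A flushes its dirty lines, then CPU B refetches what it needs from memory."""
    write_back = dirty_lines * E_MEM
    refetch = reused_lines * E_MEM
    return write_back + refetch


def hardware_coherent_energy(reused_lines: int) -> float:
    """CPU B's misses are served by snooping CPU A's cache; no flush is needed."""
    return reused_lines * E_CACHE


sw = software_managed_energy(dirty_lines=100, reused_lines=400)
hw = hardware_coherent_energy(reused_lines=400)
print(f"software-managed: {sw:.0f} units, hardware-coherent: {hw:.0f} units")
```

With these made-up inputs the software-managed migration costs 5000 units against 400 for the hardware-coherent one. The exact ratio is not meaningful, but the shape of the result is: every line that would otherwise make a round trip through main memory is served from a cache instead.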
When a task is moved from CPU A to CPU B in a system using I6500 clusters, CPU B will access cache lines that reside in the local caches of CPU A. Hardware cache coherency tracks the location of these cache lines and ensures that the correct data is accessed by snooping the caches where necessary.
Another benefit of an I6500 cluster is found within the cluster. In a typical heterogeneous system, the high-performance processors reside in one cluster, while the smaller, high-efficiency processors reside in another. Transferring a task between these different types of processor means that both the level 1 and level 2 caches of the new processor are cold. Warming those takes time and requires the previous cache hierarchy to remain active during the transition phase.
The MIPS I6500 is different. We call this difference ‘Heterogeneous Inside’, meaning that the I6500 supports a heterogeneous mix of processor types, allowing both high-performance and power-optimised processors in the same cluster. Transferring a task from one type of processor to another is now much more efficient: only the level 1 cache is cold, and the cost of snooping into the previous level 1 cache is much lower, so the transition time is much shorter.
Mixing CPUs with dedicated accelerators
CPUs are general-purpose machines. Their flexibility enables them to tackle any task, but the price that they pay for this is efficiency. Thanks to its optimisations, the PowerVR GPU is able to process larger, highly parallel computational tasks with very high performance and good power efficiency, in exchange for some reduction in flexibility compared to CPUs. It is also bolstered by a well-supported software development ecosystem, with environments such as OpenCL.
The specialisation provided by dedicated hardware accelerators offers a mix of performance with power efficiency that is several orders of magnitude better than a CPU, but with far less flexibility.
However, accelerators are ideal for operations that occur frequently, as this maximises the potential performance and power efficiency gains. Specialised computational elements such as those for audio and video processing, as well as Neural Network processors used in machine learning, use similar mathematical operations. An example of such an operation that is used extensively in these areas is the vector dot product (the sum of the products of corresponding elements in each vector). A specialised accelerator for this function working with the CPU can offer a very significant performance boost along with a big energy saving, while retaining the flexibility of the CPU for the remaining functionality.
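For reference, the dot product described above is only a few lines of code, which is part of why it is such a natural target for dedicated hardware. The sketch below is plain Python for clarity, not accelerator code. The FIR filter is our own choice of example to show how audio-style processing reduces to repeated dot products.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the products of corresponding elements of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def fir_filter(coeffs: Sequence[float], samples: Sequence[float]) -> list[float]:
    """Each output sample is one dot product of the coefficients with a sliding window."""
    taps = len(coeffs)
    outputs = []
    for n in range(taps - 1, len(samples)):
        # Most recent sample first, so window[k] corresponds to samples[n - k].
        window = samples[n - taps + 1 : n + 1][::-1]
        outputs.append(dot(coeffs, window))
    return outputs


print(dot([1, 2, 3], [4, 5, 6]))             # 1*4 + 2*5 + 3*6
print(fir_filter([0.5, 0.5], [1, 3, 5]))     # two-tap moving average
```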
Hardware acceleration can be coupled to the CPU by adding Single Instruction, Multiple Data (SIMD) capabilities with floating point Arithmetic Logic Units (ALUs). However, while data is being processed through the SIMD unit, the CPU effectively acts as a Direct Memory Access (DMA) controller to move the data, and CPUs make very inefficient DMA controllers.
Conversely, a heterogeneous system essentially provides the best of both worlds. It contains dedicated hardware accelerators that, coupled with a number of CPUs, offer the greater energy efficiency of dedicated hardware while retaining much of the flexibility provided by CPUs.
These energy savings and performance boosts depend on the proportion of time that the accelerator is doing useful work. Work packages appropriate for the accelerator come in a wide range of sizes: you might expect a small number of large tasks, but many smaller tasks.
Figure 3: Minimum function size to justify switching costs reduces with reduction in switching costs
There is a cost in transferring the processing between a CPU and the accelerator, and this sets a minimum size for tasks that will save power or boost performance when offloaded. For smaller tasks, the energy consumed and time taken to transfer the task exceed the energy or time saved by using the accelerator.
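The break-even point can be sketched with a simple linear cost model. This is an assumption on our part, with hypothetical names and numbers: the CPU and the accelerator each have a fixed cost per work item, and offloading adds a one-off switching cost. "Cost" can be read as either time or energy.

```python
def break_even_items(switch_cost: float, cpu_cost_per_item: float,
                     acc_cost_per_item: float) -> float:
    """Task size above which offloading is cheaper than staying on the CPU."""
    saving_per_item = cpu_cost_per_item - acc_cost_per_item
    if saving_per_item <= 0:
        return float("inf")  # the accelerator never pays for itself
    return switch_cost / saving_per_item


def should_offload(items: int, switch_cost: float, cpu_cost_per_item: float,
                   acc_cost_per_item: float) -> bool:
    """True when CPU cost exceeds switching cost plus accelerator cost."""
    cpu_total = items * cpu_cost_per_item
    acc_total = switch_cost + items * acc_cost_per_item
    return cpu_total > acc_total


# Hypothetical costs in arbitrary units.
print(break_even_items(switch_cost=1000, cpu_cost_per_item=10, acc_cost_per_item=1))
print(break_even_items(switch_cost=500, cpu_cost_per_item=10, acc_cost_per_item=1))
print(should_offload(112, 1000, 10, 1), should_offload(111, 1000, 10, 1))
```

Halving the switching cost halves the break-even size in this model. Because small tasks are far more numerous than large ones, each reduction in switching cost makes a disproportionately large number of tasks worth offloading.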
Reducing data transfer cost
To reduce this transfer cost, the I6500 features Shared Virtual Memory with hardware cache coherency. This addresses much of the cost of transferring a task, as it eliminates the copying of data and the flushing of caches.
To reduce the time and energy costs even further, other techniques are needed. The HSA Foundation has developed an environment to support the integration of heterogeneous processing elements in a system that extends beyond CPUs and GPUs. The HSA system provides an intermediate language called HSAIL, which offers a common compilation path to heterogeneous Instruction Set Architectures (ISAs) and greatly simplifies system software development. It also defines User Mode Queues.
These queues enable tasks to be scheduled, and enable signals to trigger tasks on other processing elements, allowing sequences of tasks to execute with very little overhead between them.
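The idea of queues plus signals can be illustrated with a toy model. This is a conceptual sketch only: `Signal`, `Packet` and `UserModeQueue` are hypothetical names of our own and do not correspond to the HSA runtime API or its packet format. The model shows only the mechanism, in which a task on one processing element waits on a signal that another element clears on completion.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Signal:
    """A counter that a producer decrements and a consumer waits on."""
    value: int = 1

    def decrement(self) -> None:
        self.value -= 1

    def is_clear(self) -> bool:
        return self.value <= 0


@dataclass
class Packet:
    """One unit of work, optionally gated on one signal and completing another."""
    work: Callable[[], None]
    wait_on: Optional[Signal] = None
    completion: Optional[Signal] = None


class UserModeQueue:
    """In-order queue that a processing element drains without a system call."""

    def __init__(self) -> None:
        self._packets: deque[Packet] = deque()

    def enqueue(self, packet: Packet) -> None:
        self._packets.append(packet)

    def drain(self) -> int:
        """Run packets in order until the head is blocked; return how many ran."""
        executed = 0
        while self._packets:
            head = self._packets[0]
            if head.wait_on is not None and not head.wait_on.is_clear():
                break
            self._packets.popleft()
            head.work()
            if head.completion is not None:
                head.completion.decrement()
            executed += 1
        return executed


# Usage: the accelerator's "filter" task must not start until the CPU's "decode" is done.
log: list[str] = []
decoded = Signal()
cpu_queue, accelerator_queue = UserModeQueue(), UserModeQueue()
accelerator_queue.enqueue(Packet(lambda: log.append("filter"), wait_on=decoded))
cpu_queue.enqueue(Packet(lambda: log.append("decode"), completion=decoded))

blocked = accelerator_queue.drain()  # head is still waiting on the signal
cpu_queue.drain()                    # runs "decode" and clears the signal
accelerator_queue.drain()            # now runs "filter"
print(log)
```

In this sketch the hand-off between the two elements is just a counter update, with no operating system involvement between tasks.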
IO Coherency Units
The MIPS I6500 supports a number of IO Coherency Units (IOCUs) in addition to the CPUs in the cluster. These IOCUs feature Memory Management Unit (MMU) functionality to map transactions to the physical address space and pass through the shared level 2 cache in the same way as transactions from the CPUs, maintaining coherency with all of the local caches. For many systems this capability enables the effective integration of accelerators with the CPUs.
For larger tasks in an accelerator, the bandwidth for the Direct Memory Access (DMA) engine feeding the accelerator can be large and may even be limited by the capacity of the memory system. This will result in the accelerator’s transactions queuing at the memory system and adding queuing latency to the CPU transactions that share the same port and path. Accelerators working with large data sets can have a low dependency between transactions allowing the accelerator to pre-fetch data. This leads to a high tolerance to memory latency that is not shared by the CPUs.
Figure 4: Heterogeneous system diagram with I6500, PowerVR GPU and accelerator
The MIPS I6500 with its coherent bus interface enables an accelerator to use a separate, independent path to system memory yet still retain the coherency with the CPU cluster. We call this ‘Heterogeneous Outside’.
Heterogeneous systems offer the opportunity to significantly increase system performance and reduce system power consumption, enabling systems to continue to scale beyond the limitations imposed by ever shrinking process geometries. Multi-threaded, heterogeneous and coherent CPU clusters such as the MIPS I6500 have the ideal characteristics to lie at the heart of these systems.
As such they are well placed to efficiently power the next generation of devices in many markets, such as advanced driver assistance systems (ADAS) and autonomous vehicles, networking, drones, industrial automation, security, video analytics, machine learning and many others.