Introduction
System-on-Chip (SoC) architectures for many-core processors rely on Network-on-Chip (NoC) interconnects to link cores, caches, and memory controllers. Two common NoC topologies are the 2D mesh and the 2D torus. In parallel, SoCs employ memory interleaving techniques to boost memory performance by spreading memory accesses across multiple memory banks or controllers. This report explores how NoC topology (especially mesh vs. torus) relates to memory interleaving strategies, explaining how interleaving enhances memory throughput and how it is implemented on different NoC topologies. We also survey the evolution of these concepts in academic research and describe current industry practices (from ARM, Intel, AMD, and NoC IP vendors) with technical examples and references.
NoC Topologies: Mesh vs. Torus
A NoC provides an on-chip communication fabric connecting IP blocks (CPU cores, caches, memory controllers, etc.) via routers. In a 2D mesh, nodes are arranged in a grid with each node connected to its immediate neighbors in the north, south, east, and west directions. Mesh networks are planar and have no wrap-around connections at the edges. In contrast, a 2D torus extends a mesh by linking the opposite edges, so each node has neighbors in all four directions with the network edges “wrapped around”. This wrap-around in a torus reduces the maximum and average path length between nodes compared to a mesh (since there are no edge boundaries), improving overall communication bandwidth and reducing latency for distant node communication. Both topologies have been widely studied in on-chip networks and parallel computers due to their regular structure and scalability. In practice, mesh topologies have been more common in commercial many-core chips (e.g. Intel’s mesh in Skylake-SP Xeons, ARM’s CMN mesh in Neoverse cores), whereas torus topologies are more often seen in larger-scale multiprocessor networks or academic prototypes, because the wrap-around links increase routing complexity and wiring overhead on chip. Nonetheless, torus NoCs remain of interest for their potential performance benefits, and NoC IP generators (like Arteris FlexNoC) even support automatic topology generation for meshes, rings, and tori.
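The path-length advantage of the torus is easy to quantify. The sketch below compares the average hop count between all node pairs on a square mesh and on a torus of the same size, assuming minimal (shortest-path) routing; on the torus, each dimension behaves like a ring, so the distance is the shorter of the direct and wrap-around routes.

```python
def ring_dist(a, b, k):
    """Shortest distance between positions a and b on a k-node ring (one torus dimension)."""
    d = abs(a - b)
    return min(d, k - d)

def avg_hops(k, torus=False):
    """Average hop count over all ordered node pairs of a k x k grid,
    assuming minimal dimension-order routing."""
    total, pairs = 0, 0
    nodes = [(x, y) for x in range(k) for y in range(k)]
    for (x1, y1) in nodes:
        for (x2, y2) in nodes:
            if torus:
                total += ring_dist(x1, x2, k) + ring_dist(y1, y2, k)
            else:
                total += abs(x1 - x2) + abs(y1 - y2)  # Manhattan distance on a mesh
            pairs += 1
    return total / pairs

# For an 8x8 network: mesh averages 5.25 hops, torus only 4.0.
mesh_avg = avg_hops(8)
torus_avg = avg_hops(8, torus=True)
```

For an 8×8 network the torus cuts the average distance by roughly 24%, which is exactly the kind of reduction that benefits core-to-memory-controller traffic under full interleaving.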
Mesh vs. Torus and Memory Access: In a mesh NoC, the physical placement of memory controllers (MCs) on the grid and the distance from cores can lead to non-uniform memory access latencies – a core located in one corner of the mesh will incur more router hops to reach a controller on the opposite corner than to a nearer controller. A torus can alleviate some of these distance issues by providing multiple wrap-around paths, effectively reducing worst-case hop counts. The improved path diversity and shorter diameters of a torus can help avoid congestion hot spots when memory traffic is heavy, by routing around the ring connections. However, both topologies require careful traffic management and memory address mapping to fully utilize their bandwidth. This is where memory interleaving comes into play: by distributing memory accesses across multiple controllers or banks, one can prevent any single region of the NoC from becoming a bottleneck due to concentrated traffic.
Memory Interleaving for Enhanced Memory Performance
Memory interleaving is a classic technique to improve memory bandwidth and latency by splitting memory into multiple modules that can be accessed in parallel. Instead of storing consecutive addresses entirely in one memory bank (or channel), interleaving stripes memory addresses across multiple banks or memory controllers in a fixed pattern. This means that sequential or nearby addresses reside in different physical memory units, allowing multiple memory accesses to proceed concurrently. As a result, a processor can issue back-to-back memory requests without waiting for the previous one to finish, since each request goes to a different memory unit that can operate independently. Interleaving was originally used in large-scale and vector computers to overcome slow memory speeds by overlapping accesses to multiple memory banks. In modern SoCs, interleaving is crucial for multichannel DRAM systems: for example, dual-channel memory doubles the data path (128 bits instead of 64 bits) and allows two memory transactions to occur in parallel, effectively feeding the processor with up to 2× the data per cycle. More generally, with N channels or controllers interleaved, the peak memory bandwidth can approach N times that of a single channel (assuming ideal load balancing).
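The overlap effect can be illustrated with a toy timing model (the cycle counts are assumptions for illustration, not vendor data): each bank needs a fixed number of busy cycles per request, requests issue one per cycle, and a request must wait for its target bank to be free. All requests hitting one bank serialize; round-robin striping across banks lets them overlap.

```python
def stream_cycles(num_requests, t_busy, num_banks, interleaved):
    """Cycles to complete a stream of sequential requests under a toy model:
    one request issued per cycle, each bank busy for t_busy cycles per request."""
    free = [0] * num_banks              # cycle at which each bank is next free
    finish = 0
    for i in range(num_requests):
        # Without interleaving every sequential request lands on bank 0;
        # with interleaving consecutive addresses rotate across banks.
        bank = (i % num_banks) if interleaved else 0
        start = max(i, free[bank])      # wait for issue slot and for the bank
        free[bank] = start + t_busy
        finish = max(finish, free[bank])
    return finish

serial = stream_cycles(16, t_busy=8, num_banks=4, interleaved=False)
overlapped = stream_cycles(16, t_busy=8, num_banks=4, interleaved=True)
# 16 requests at 8 cycles each: 128 cycles serialized, but only 35 when
# striped across 4 banks, since the banks service requests concurrently.
```

The serialized stream pays the full bank busy time per request; the interleaved stream approaches the 4× bandwidth bound the text describes, limited only by pipeline fill.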
From a performance standpoint, interleaving maximizes utilization of all memory resources. It reduces idle time for memory devices and increases parallelism, which is especially beneficial for throughput-oriented workloads. Dell’s server documentation concisely explains that when memory is interleaved, “contiguous memory accesses go to different memory banks” and therefore subsequent accesses need not wait for the previous one to complete. With all DIMMs or channels in one interleave set (a uniform memory region spanning them), the total memory bandwidth is increased since “the distribution of information is divided across several channels… and the total memory bandwidth is increased”. In practice, most systems see maximum memory performance when all memory controllers/channels are interleaved into one unified address space, so that any given memory region is spread across all channels. This ensures that every memory access load is balanced and utilizes the full width of the memory system. Conversely, if interleaving is not used (or multiple disjoint interleave sets are created), some memory regions would reside on only a subset of controllers, potentially leaving bandwidth on other controllers unused and creating NUMA (Non-Uniform Memory Access) effects where some addresses are “faster” to access than others. Thus, from a pure bandwidth perspective, a single interleaved pool is ideal for most general-purpose workloads. (There are scenarios where partial or no interleaving is preferred, such as explicit partitioning of memory for real-time isolation or NUMA-aware software optimizations – these will be touched upon later.)
Interleaving Granularity: A key design parameter is the interleaving granularity, i.e. the size of the address blocks alternated between memory units. This can range from very fine-grained (e.g. every 64-byte cache line alternating controllers) to coarse (e.g. 4KB pages or larger). Fine-grained (cache-line or sub-page) interleaving tends to maximize load balancing and parallelism, since even small memory regions engage all channels. However, very small stripes can incur overheads: for instance, successive cache lines going to different controllers might increase the number of row open/close operations in each DRAM (reducing row-buffer locality) and also require more frequent controller switching for a streaming access pattern. Coarser interleaving (e.g. at page level) keeps each page in one controller (better locality) but sacrifices parallelism for a single large data stream. Many systems choose an intermediate granularity, such as 1KB or 4KB chunks, to balance these trade-offs. For example, AMD’s programmable NoC (from its Xilinx division) allows interleaving across 2 or 4 controllers with configurable granularity (e.g. 1KB stripes). In that scheme, “alternate 1KB regions go to different DDR controllers,” and the NoC hardware will even split a burst transfer if it crosses a 1KB boundary so that each portion is sent to the appropriate controller. This ensures that no single AXI transaction spans two controllers, simplifying coherence and ordering. Generally, the interleave granularity is aligned to typical access sizes (cache lines or pages) to avoid splitting too many requests.
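The stripe-selection and burst-splitting rule can be sketched as follows (a simplification, assuming 1KB stripes across 2 controllers in the spirit of the AMD/Xilinx description; real hardware also handles AXI burst semantics and reordering):

```python
STRIPE = 1024          # interleave granularity in bytes (assumed 1 KiB)
NUM_CTRL = 2           # number of controllers in the interleave set

def controller_of(addr):
    """The stripe index, modulo the controller count, selects the owner:
    alternate 1 KiB regions go to different controllers."""
    return (addr // STRIPE) % NUM_CTRL

def split_burst(addr, length):
    """Split a burst at stripe boundaries into per-controller segments,
    so that no single transaction ever spans two controllers."""
    segments = []
    while length > 0:
        room = STRIPE - (addr % STRIPE)       # bytes left in the current stripe
        chunk = min(length, room)
        segments.append((controller_of(addr), addr, chunk))
        addr += chunk
        length -= chunk
    return segments

# A 256-byte burst starting 128 bytes before a stripe boundary is split
# into two 128-byte segments destined for different controllers.
segs = split_burst(0x0380, 0x100)
```

A burst that stays inside one stripe passes through unsplit; only boundary-crossing transfers pay the splitting cost, which is why the granularity is chosen well above the typical access size.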
Mapping Interleaved Memory onto NoC Topologies
When implementing memory interleaving in an SoC, the NoC plays a central role in routing memory requests to the correct memory controller based on the address. In a multi-controller system with a flat unified address space, each physical address is mapped to a specific memory controller, often by a simple modulo or hash of certain address bits. For instance, in a system with two controllers interleaved, a particular address bit (or bits) might determine which controller holds that address. The on-chip interconnect must decode those address bits and forward the request to the corresponding controller’s port. This functionality can be integrated into the NoC routers or the memory request initiators. In an ARM mesh interconnect, for example, each fully coherent Home Node (HN-F, which incorporates a system cache slice and snoop filter) owns a portion of the physical address space; a hashing scheme may be used to distribute addresses evenly across the HN-F nodes (which correspond to cache slices or memory ports). In the AMD/Xilinx NoC mentioned earlier, each NoC Master Unit (NMU) at a cache/CPU interface performs address interleaving as configured: “the NoC manages interleaving at each NoC entry point (NMU)… arranged in a strided fashion such that alternate 1K regions go to different DDR controllers”. The result is that half of the memory region’s addresses map to one controller and half to the other, effectively making two physical controllers behave like one larger, higher-bandwidth memory from the software’s perspective.
Load Balancing and NoC Traffic: Interleaving naturally balances memory traffic load across multiple controllers. In a 2D mesh NoC, this means that requests from all cores are distributed across the chip rather than funneling into a single memory controller node. For example, a 64-core mesh might have 4 memory controllers placed at four quadrants of the chip; with address interleaving (or hashing) the memory requests of each core will statistically spread to all four controllers, preventing any one quadrant’s controller (and the routes leading to it) from becoming a hot spot. A concrete case is the Tilera Tile64 manycore (which used a mesh NoC): it had 4 on-chip memory controllers and employed a controller-interleaved page placement so that no 64KB page was serviced by only one controller. In fact, on Tile64 the hardware used bits of the physical address to select the memory controller, with the effect that one could not allocate more than 64KB of contiguous physical memory on a single controller – larger allocations automatically spanned multiple controllers. This striping scheme ensured that memory traffic was evenly divided among the 4 controllers (each with its own DRAM channels), significantly boosting achievable memory bandwidth. The design also implemented an address hashing technique to spread accesses among DRAM banks and reduce bank conflicts, illustrating that interleaving can be applied hierarchically (among controllers, and among banks within each controller). The overall impact was a much higher sustainable memory throughput and smoother memory access latency distribution, since each core sees an average memory latency that is an aggregate of near and far controllers.
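The hierarchical scheme can be sketched in a few lines. This is a hedged illustration in the spirit of the Tile64 description above: address bits just above the 64KB page offset select one of 4 controllers, and an XOR hash of higher bits spreads accesses across banks within a controller. The specific bit positions are illustrative assumptions, not the actual Tile64 layout.

```python
PAGE_BITS = 16   # 64 KiB pages: bits [17:16] select one of 4 controllers

def mc_of(paddr):
    """Bits just above the 64 KiB page offset select the controller, so any
    contiguous region larger than 64 KiB necessarily spans controllers."""
    return (paddr >> PAGE_BITS) & 0x3

def bank_of(paddr):
    """XOR-fold higher address bits into the bank index to hash accesses
    across 8 DRAM banks, reducing conflicts for power-of-two strides."""
    b = (paddr >> 13) & 0x7           # naive (linear) bank index
    return b ^ ((paddr >> 18) & 0x7)  # permuted by higher-order bits

# Four consecutive 64 KiB pages land on four different controllers,
# then the pattern wraps around.
owners = [mc_of(i << 16) for i in range(5)]
```

Note how the two levels compose: the controller selection balances traffic across the NoC, while the bank hash balances traffic inside each controller's DRAM.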
When comparing mesh and torus topologies, interleaving works conceptually the same way – by address partitioning – but the topological differences influence the performance of the interconnect under that traffic. In a mesh, if memory controllers are at the periphery, interleaved traffic means every core will at times need to send requests to distant edge controllers, traversing multiple hops. This can introduce noticeable latency for those accesses and consume NoC bandwidth. A fully interleaved mesh thus behaves somewhat like a distributed shared memory with uniform distribution but non-uniform physical distances (i.e. an on-chip NUMA, where the “NUMA-ness” is hidden from software by the flat address space). By contrast, a torus can mitigate some extremes: because of wrap-around links, the effective distance between any core and any controller is shorter on average, and there are typically multiple minimal paths to a given controller. This can reduce worst-case latency and avoid saturating any single path. In other words, a torus can more gracefully handle the all-to-all traffic pattern that full interleaving induces. Academic analyses have noted that a torus offers higher bisection bandwidth and lower average distance than a mesh, which directly benefits memory traffic traveling across the chip. Thus, if one were to map the same memory interleaving scheme onto a torus NoC, it would generally yield lower memory access latencies under load, thanks to the topology’s richer connectivity. The trade-off is increased hardware complexity and potentially more power used in those extra links.
Local vs Interleaved Mapping: An alternative to pure interleaving is to assign each core or region primarily to a “local” memory controller (like a NUMA partition) to minimize hop count, and only use remote controllers when local memory is full. This NUMA approach was explored in research and is even configurable in some systems (e.g., “cluster-on-die” mode in Intel processors, or BIOS options to turn off interleaving across sockets). However, it places the burden on software/OS to handle non-uniform regions. Awasthi et al. (PACT 2010) point out that simply allocating all data to the nearest MC might not be optimal due to load imbalance – some controllers could become overwhelmed while others are idle. They proposed adaptive runtime mechanisms to migrate or replicate pages between controllers, achieving significant performance gains (17–35%) over static first-touch or static interleaving policies. This highlights that the optimal strategy can depend on workload and contention: interleaving uniformly optimizes throughput, whereas localized allocation optimizes latency for certain access patterns. Many modern NoC-based SoCs therefore support flexible interleaving modes. For instance, the AMD Infinity Fabric (used in Epyc server CPUs) can be configured in BIOS either as a single memory domain interleaving across all controllers or as multiple NUMA domains where each die’s controller mostly serves its local cores. In AMD’s older Magny-Cours architecture (two dies in one package), the system could be run in an interleaved memory mode so that legacy OSes saw a unified node, at the cost of some cross-die latency. Ultimately, balancing NoC distance against memory parallelism is a key design decision, and both academia and industry have developed solutions (such as sophisticated page-mapping algorithms, or hashed interleaving that takes physical distance into account) to get the best of both worlds.
Evolution in Academic Literature
Techniques for memory interleaving have a long history in computer architecture. Academic literature from as early as the 1970s and 1980s discussed interleaved memory to supply multiple words per cycle to high-performance processors (e.g., in vector supercomputers). As multiprocessors emerged, researchers noted the benefits of interleaving to allow concurrent accesses from multiple CPUs and to reduce memory bank contention. Lamport (1979) famously described the requirements for a multiprocessor’s memory system to provide a coherent view despite operations completing out of program order – an issue that becomes trickier when buffering and interleaving are used to overlap memory accesses. By the 1990s, cache coherence protocols and non-uniform memory access (NUMA) architectures were active research areas; interleaving was a basic assumption in many cache-coherent NUMA designs to stripe addresses across memory modules in different nodes for load balancing.
With the rise of on-chip multiprocessors (CMPs) in the 2000s, academic focus shifted to on-chip networks and distributed cache/memory organizations. One notable thread was the introduction of distributed shared last-level caches (banks of L2/L3 across the chip) and the concept of Non-Uniform Cache Access (NUCA). Huh et al. (2007) studied a 16-core CMP with 256 L2 banks connected via a network, comparing different address-to-bank mapping policies. A simple static interleaving of addresses to cache banks provided uniform load spreading, though at the cost of some remote bank accesses; they found that more dynamic policies could outperform static interleaving by keeping frequently accessed lines in closer banks. This mirrors the tension between uniform interleaving and locality that also applies to distributing main memory accesses.
As on-chip networks evolved, researchers examined various NoC topologies (mesh, torus, flattened butterfly, etc.) and their impact on memory access. Dally and Towles (2001) advocated packet-switched on-chip networks and discussed how regular topologies like meshes can be used to connect processors and memories in a tiled fashion for scalability. The mesh/torus comparison has been revisited often: a recent work on NoC topology design notes that “the Torus is like [a] mesh but with wrap-around connections, reducing average path length and improving bandwidth”, reaffirming why a torus might benefit memory traffic patterns. However, many academic NoC prototypes (e.g., MIT RAW, TRIPS, Tilera) stuck with meshes for simplicity. The Tilera Tile64 (2008) is an academic-inspired commercial design that we mentioned; an academic study on manycore memory allocation noted that Tile64 uses a 64KB page size and a controller-interleaved placement, meaning no single 64KB page stays in one controller. That study (Mueller et al., 2016) was examining OS-level memory allocators for manycores and had to account for Tilera’s fixed interleaving when managing memory blocks.
Another research direction looked at page coloring and permutation-based interleaving to mitigate row buffer conflicts. Zhang et al. (MICRO 2000) proposed a permutation-based page interleaving scheme that spreads out pages such that accesses have a higher chance of hitting open DRAM rows and exploiting bank-level parallelism. This indicates that beyond simply striping addresses, how you interleave (linear vs. hashed vs. permutation) can impact the efficiency of memory access – a consideration both in research and in some industry controllers (which often XOR address bits to hash across banks).
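The benefit of permutation over linear interleaving is easy to demonstrate. The sketch below follows the general idea (XOR the bank index with higher "tag" bits); the bit positions are illustrative assumptions, not the exact scheme from the paper. Addresses a large power-of-two stride apart all collide on one bank under linear mapping but spread across all banks under the permutation.

```python
NUM_BANKS = 8
BANK_SHIFT = 12        # linear mapping: bits [14:12] select the bank

def bank_linear(addr):
    """Conventional interleaving: low-order bits above the column select the bank."""
    return (addr >> BANK_SHIFT) % NUM_BANKS

def bank_permuted(addr):
    """Permutation-based interleaving: XOR the bank index with higher-order
    bits, so strided streams no longer pile onto a single bank."""
    tag = (addr >> 20) % NUM_BANKS
    return bank_linear(addr) ^ tag

# Eight addresses spaced 1 MiB apart: linear mapping sends every one to
# bank 0 (a worst-case bank conflict); the permutation uses all 8 banks.
stride = 1 << 20
linear_banks = {bank_linear(i * stride) for i in range(8)}
permuted_banks = {bank_permuted(i * stride) for i in range(8)}
```

The same trick appears in industry controllers as "bank XOR hashing"; it costs only a few XOR gates in the address decode path.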
Academic interest continues in how to optimally place memory controllers in a NoC and assign addresses. For example, Balasubramonian’s group (Awasthi et al. 2010) highlighted that with multiple on-chip controllers and a large flat address space, the system inherently becomes NUMA – some memory addresses are “near” (served by a close controller) and some “far”. They argued that intelligent data placement or migration is required because neither pure first-touch (locality-only) nor pure round-robin interleaving (uniform-only) is universally best. Their adaptive first-touch and page migration policies were an early example of hardware/software cooperative management to get both low latency and high bandwidth. Subsequent research has built on these ideas, exploring everything from memory networks (treating memory itself as a network of banks) to machine-learning-based page allocation in NUMA systems.
In summary, the academic evolution has moved from simple interleaving for bandwidth (in early multiprocessors) to more nuanced strategies that consider on-chip distances and contention. The interplay of NoC topology and memory placement is now a recognized aspect of manycore design. As core counts and memory channels increase (e.g., 100+ core chips with 8 or more memory channels), researchers have proposed using sophisticated hashing or even runtime page scheduling to map addresses to controllers in a way that minimizes NoC congestion and queuing delays at controllers. We see academic concepts like these being adopted in industry in various forms (hash-based interleaving, QoS-aware memory scheduling, etc.).
Industry Practices and Implementations
Leading SoC and CPU vendors have incorporated memory interleaving and NoC topology considerations into their designs, often documented in whitepapers or technical manuals:
- ARM (Mesh NoC with Distributed Home Nodes): ARM’s high-performance cache-coherent interconnects (CCI, then CCN, and now CMN series) use mesh topologies for scalability. In ARM’s CMN-600/700 mesh, up to dozens of HN-F nodes (home nodes for cache/memory) are placed throughout the mesh. ARM employs hashing (“striping”) of addresses across these HN-F nodes to distribute traffic. The CMN-700, for instance, supports striping across a non-power-of-2 number of memory controllers, indicating a flexible hashing mechanism to evenly map addresses even if, say, 10 or 12 controllers are used. The aim is to avoid any load imbalance in memory requests. Official ARM documentation provides System Address Map (SAM) programming examples where interleaving across chips or controllers can be configured (e.g., enabling 4KB interleaving across local and remote memory nodes) – ensuring that not just on-chip controllers but even memory across chiplets can be unified for software transparency. Products like Ampere’s Altra (80-core ARM Neoverse-N1) indeed feature an ARM mesh interconnect with memory striping and hashing; Ampere’s public documentation notes the use of a coherent mesh where addresses are interleaved across memory interfaces to maximize bandwidth (the Altra has 8 DDR controllers accessed via the mesh). The Neoverse CMN-700 specifically expanded the number of memory controller ports from 16 to 40 to accommodate designs with enormous memory bandwidth (e.g., mixing DDR5 and HBM memory). Such designs crucially rely on interleaving to manage traffic to those many controllers. ARM’s documentation and the Socrates configuration tool allow designers to choose interleaving granularity and hashing algorithms to optimize performance for their SoC.
- Intel (Mesh on Client/Server CPUs): Intel transitioned from ring buses to a mesh NoC in its Skylake-SP (Xeon Scalable) processors in 2017. In that mesh, cores and LLC slices are arranged in a grid, and multiple memory controllers (MCs) sit along the mesh as well (e.g., up to six MCs for six DDR4 channels in Xeon). Intel Xeon processors expose a NUMA view by default (each socket is a NUMA node), but within a socket, the OS typically sees a flat memory space interleaved across all on-die controllers. The on-chip mesh and the memory controllers work together to make this transparent. Intel’s hardware uses an address hashing scheme called HAM (Hash Address Mode) in some generations to reduce hotspotting: it XORs a few address bits to more uniformly distribute accesses across the memory channels and ranks. Furthermore, Intel provides BIOS options for sub-NUMA clustering (SNC) on some Xeon models, which essentially partitions the mesh and groups half the controllers with half the cores to create two NUMA domains per socket. With SNC disabled, memory is fully interleaved across all controllers; in SNC mode, interleaving is only within each half, improving local latency at the expense of peak bandwidth for each domain. This is a practical example of industry toggling between interleaved vs. localized memory mapping to suit different workload needs. Another Intel example is the many-core Knights Landing (Xeon Phi) processor: it had a high-bandwidth 2D mesh connecting 72+ cores and 6 memory controllers (along with MCDRAM stacks). Intel’s tuning guides recommended using quadrant/SNC modes to manage latency, but when those are off, the default was an all-to-all interleaving to use all memory channels.
- AMD (Infinity Fabric and Chiplet Memory Interleaving): AMD’s Epyc processors have a modular design with multiple die “chiplets”, each containing cores and a portion of the total memory controllers. The Infinity Fabric serves as a coherent interconnect between these dies. By default, each die manages the memory directly attached to it (NUMA domains), but AMD supports memory interleaving across dies (sometimes called Memory Interleaving or Memory Addressing modes in BIOS). For instance, in the older Opteron Magny-Cours (which packaged two dies in one chip), the system could be configured such that memory addresses alternate between the two dies’ controllers, creating a single contiguous memory space for the OS. This helped “scale performance with non-NUMA aware code” by balancing memory traffic, albeit at the cost of remote memory latency. In modern EPYC, one can choose “Channel Interleaving” (spreading addresses across the channels on a die) and “Die Interleaving” (spreading across dies). AMD’s platform guidelines often recommend keeping memory fully interleaved across all channels per socket for maximum bandwidth, unless specific NUMA optimizations are required. On-die, AMD’s designs (like Zen 2/3) typically have multiple memory controllers (two per IO die in Epyc), and those controllers interleave at a 256-byte or 512-byte granularity across the channels. AMD’s documentation confirms the benefits: “Memory interleaving makes the participating memory controllers appear as one large pool… Memory traffic is balanced across the controllers in hardware and software does not need to determine how to place data”. This quote from AMD underscores the industry’s goal: make multiple controllers look like a single high-bandwidth memory to simplify software and maximize performance.
AMD (via Xilinx) also uses interleaving in its FPGA-oriented NoC as discussed, showing the concept’s broad applicability from CPUs to configurable SoCs.
- SoC Interconnect IP (Arteris, Sonics, etc.): Dedicated NoC IP providers have long recognized the importance of multichannel memory interleaving. Sonics Inc. introduced an “Interleaved Multichannel Technology (IMT)” in 2008 as part of its on-chip interconnect offerings. Sonics IMT could manage up to 8 external DRAM channels and provided user-controlled interleaving with hardware load balancing. It was designed to be transparent to software, presenting a unified address space and automatically dividing memory transactions among the channels. A Sonics whitepaper noted that simply having two channels without a good interleaving scheme often required burdensome software tweaks to split traffic, whereas their hardware IMT evenly divided traffic and even allowed asymmetric channel configurations with partial interleaving. By splitting memory bursts across multiple channels, Sonics claimed to eliminate wasted bandwidth that occurs when single-channel DDR transfers larger bursts than the typical data object size (e.g., 64-byte cache lines vs. 128-byte DDR bursts). The interleaving ensured that those large bursts actually fetch useful data from multiple channels in parallel. Similarly, Arteris IP in its FlexNoC product line supports advanced memory interleaving features. The latest Arteris FlexNoC 4 (aimed at AI and automotive SoCs) explicitly touts “HBM2 and multichannel memory support – ideal integration with HBM2 multichannel memory controllers with 8 or 16 channel interleaving”. This indicates that Arteris can automatically handle the address mapping for up to 8 or 16 channels of wide HBM memory, which often sits on-package. The ability to interleave across a non-power-of-two number of channels (like 6 or 10) is also important for real designs and is a feature in these IPs.
These commercial NoC IP solutions provide designers with configurable options: for example, one can select the interleave stride (cache line, 128B, 256B, etc.), the addressing scheme (linear vs. XOR hash), and whether to interleave at all or keep controllers separate. Both Sonics and Arteris emphasize that their solutions operate with low overhead and transparency, meaning they handle reordering and splitting such that from the CPU’s perspective, it is just accessing a bigger, faster memory. They also support mixing interleaved and non-interleaved regions — for instance, some critical memory might be fixed to a specific controller (for latency or security reasons), while bulk memory is interleaved for bandwidth.
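A mixed address map of this kind amounts to a per-region decode step before the interleave calculation. The sketch below is a hypothetical configuration (the region bounds, stride, and port numbers are made-up illustration values, not any vendor's defaults): bulk DRAM is striped across four ports, while a smaller region is pinned to one controller.

```python
# Hypothetical region table: (base, size, mode, config)
REGIONS = [
    (0x0000_0000, 0x8000_0000, "interleaved", {"ports": [0, 1, 2, 3], "stride": 256}),
    (0x8000_0000, 0x1000_0000, "fixed",       {"port": 2}),  # e.g. real-time/secure carve-out
]

def route(addr):
    """Return the memory-controller port a physical address is routed to."""
    for base, size, mode, cfg in REGIONS:
        if base <= addr < base + size:
            if mode == "fixed":
                return cfg["port"]                 # whole region owned by one controller
            stripe = (addr - base) // cfg["stride"]
            return cfg["ports"][stripe % len(cfg["ports"])]
    raise ValueError(f"address {addr:#x} not mapped")

# Bulk memory rotates across ports 0..3 every 256 bytes; the carve-out
# always resolves to port 2 regardless of offset.
ports = [route(a) for a in (0x000, 0x100, 0x200, 0x300, 0x8000_0000)]
```

In real IP the region table is a synthesis-time or boot-time configuration, but the decode order is the same: match the region first, then apply that region's interleave rule.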
In the GPU and high-performance accelerator domain, similar principles apply. GPUs have many memory channels (e.g., 6 or 8 GDDR/HBM channels) and they uniformly interleave across them to maximize throughput – this is typically done at a fine granularity (often at 256-byte or 512-byte boundaries) since GPU workloads stream through large memory regions. NoC topologies in GPUs vary (some use crossbar-like interconnects on-die, others a mesh for very large GPUs). NVIDIA’s recent GPUs, for example, use a hybrid ring+mesh interconnect and incorporate memory partitioning across HBM stacks – again using address hashing to distribute accesses evenly. Although details are proprietary, the concept is analogous to the SoC practices described above.
To conclude, industry practice embraces memory interleaving as a fundamental technique to boost memory performance, and the NoC topology is the backbone that makes it work in a scalable way. Mesh and torus NoCs provide the routing infrastructure to connect many distributed memory controllers; interleaving (striping addresses) is the scheme that maps the memory onto that infrastructure efficiently. Over the years, both academic research and industry implementations have converged on a few key themes:
- Use interleaving (possibly with intelligent hashing) to maximize bandwidth and balance load across controllers.
- Be mindful of NoC topology and latency; if needed, allow some NUMA or clustering options to reduce average distance when bandwidth is less critical.
- Incorporate flexibility in the interconnect IP so designers can choose interleaving strategies per memory region or subsystem (as seen in ARM’s and Arteris’s offerings).
- Ensure that all of this is abstracted from software unless software explicitly wants to manage it – the goal is typically to make multiple memory channels appear as one “big fast memory” to the programmer.
Both mesh and torus NoCs can successfully support interleaved memory with careful design. As core counts and memory channels continue to grow (with chiplet-based systems, 3D-stacked memory like HBM, etc.), these techniques are more critical than ever. Future academic work is likely to keep influencing industry – for example, research on machine-learning-guided page placement or new topologies (like 3D meshes) could further improve how we map and move data on-chip. The interplay of topology and memory interleaving will remain a rich area of optimization for SoC architects aiming to squeeze the most performance out of every byte transferred across the chip.
References:
- Hennessy, J. L., & Patterson, D. A. Computer Architecture: A Quantitative Approach (5th ed.) – discusses memory interleaving in the context of improving bandwidth (multiple words per cycle).
- Dell Technologies, Memory Population Rules for 3rd Gen Intel Xeon Scalable – explains memory interleaving benefits for bandwidth by using all DIMMs/channels in one set.
- AMD (Xilinx), PG313 Network-on-Chip Architecture – describes two/four-controller interleaving presenting a unified address space, with 1KB stripes alternated across controllers and automatic load balancing in hardware.
- Sonics Inc., Press Release (2008) – introduces the IMT interleaving technology for on-chip memory controllers, dividing traffic evenly among up to 8 DRAM channels and operating transparently to software.
- Arteris IP, FlexNoC 4 Announcement (2018) – highlights support for HBM2 and multi-channel memory with 8- or 16-channel interleaving, and automated mesh/torus topology generation for AI SoCs.
- Awasthi, M. et al. (PACT 2010) – “Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers”; discusses a flat address space across multiple on-chip MCs causing NUMA effects and proposes adaptive page allocation to improve on naive interleaving or first-touch, yielding up to 35% speedup.
- Mueller, F. et al. (ARCS 2016) – “Reducing NoC and Memory Contention for Manycores”; notes Tilera Tile64’s 4 MCs with 64KB pages controller-interleaved, and uses address hashing to increase bank-level parallelism.
- Chips and Cheese tech blog, AMD Magny Cours and HyperTransport (2025) – describes how AMD allowed interleaving memory across two dies to present a unified memory space for software, improving performance for code not optimized for NUMA.
- ScienceDirect (J. of Supercomputing, 2025) – notes that a torus network’s wrap-around links reduce average path length versus a mesh, which can improve memory access latency and network bandwidth.
- AnandTech, Arm Neoverse V1/N2 and CMN-700 (2021) – details the ARM CMN-700 mesh, supporting up to 40 memory controllers and anticipating usage of both DDR and HBM memory with adequate interleaving/hashing to manage traffic.
- Patterson, D. A., & Hennessy, J. L. – “Memory Systems and Interleaving” (in earlier editions) – foundational explanation of memory bank interleaving and its use in pipeline and vector processors (not directly cited above, but classic textbook treatment).