Memory interleaving is a technique where consecutive blocks of the physical address space are distributed (“striped”) across multiple memory controllers or home nodes, creating a unified memory region that spans them. Typical ARM Network-on-Chip (NoC) interconnects (e.g. Arm’s CMN-600/700 Coherent Mesh) support configurable interleaving granularity in power-of-two sizes. Common options range from cache-line scale (64B or 128B) up to page-scale (4KB or more)[1]. For example, an ARM CMN configuration might stripe addresses at a 256-byte granularity across three or more home nodes[2], meaning each 256B block of addresses goes to a different node in a round-robin fashion. Smaller granularity means more frequent switching between memory controllers, whereas larger granularity means each controller handles larger contiguous address chunks.
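As a minimal sketch of this striping, the mapping from a physical address to its home node can be modeled with integer arithmetic. The function name and the plain block-index modulo below are illustrative assumptions only; real interconnects such as Arm's CMN typically hash several address bits rather than applying a simple modulo.

```python
def home_node(addr: int, stripe_bytes: int, num_nodes: int) -> int:
    """Illustrative round-robin mapping of a physical address to a node.

    Real NoCs (e.g. Arm CMN) usually hash high-order address bits; this
    plain modulo captures only the round-robin intuition.
    """
    # Index of the stripe-sized block this address falls in, then
    # rotate through the available nodes.
    return (addr // stripe_bytes) % num_nodes

# With a 256B stripe across 3 home nodes, consecutive 256B blocks
# rotate through nodes 0, 1, 2, 0, ...
mapping = [home_node(a, 256, 3) for a in (0, 256, 512, 768)]
```

The same function with `stripe_bytes=1024` and `num_nodes=2` reproduces the two-controller, 1KB-stripe layout in the figure below.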
Figure: Illustration of interleaving across two memory controllers with a 1KB stripe size. Alternate 1KB address regions (0–1KB, 1–2KB, 2–3KB, etc.) are mapped to different controllers, forming a unified interleaved address space[3]. In the diagram, Memory Controller 0 (white) handles the 0–1KB, 2–3KB, 4–5KB, … segments while Memory Controller 1 (gray) handles the 1–2KB, 3–4KB, 5–6KB, … segments. This automatic distribution balances traffic across controllers without software having to manage placement.
The interleaving granularity is typically chosen based on system goals. Fine-grained interleaving (e.g. 256B or 1KB stripes) maximizes parallelism by spreading even small memory accesses across controllers, while coarse interleaving (e.g. 4KB stripes) keeps whole blocks (like OS pages) on a single controller. ARM’s NoC hardware allows these modes to be configured to suit the workload; for instance, the 3-SN and 6-SN striping modes in CMN hash addresses across three or six memory-side (SN-F) nodes at 256B granularity in order to distribute load evenly[2].
AXI Burst Transactions and Interleave Boundaries
AXI (Advanced eXtensible Interface) is a burst-based protocol, and AXI masters can issue bursts consisting of multiple data beats. However, the AXI specification imposes a key rule: a burst must not cross a 4KB address boundary[4]. The rationale is that a burst crossing a 4KB boundary could span two different slaves (e.g. a different memory controller or peripheral), which the specification treats as “an impractical situation” and disallows[4]. In practice this means every AXI burst must fit within a single 4KB-aligned address window.
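The 4KB rule is easy to check arithmetically: a burst crosses the boundary exactly when its first and last bytes fall in different 4KB windows. A small sketch (the function name is an illustrative choice, not from any spec):

```python
def crosses_4kb(start: int, burst_bytes: int) -> bool:
    """True if the byte range [start, start + burst_bytes) spans a 4KB boundary."""
    # Compare the 4KB window index of the first and last byte of the burst.
    return start // 4096 != (start + burst_bytes - 1) // 4096

# A burst that exactly fills one 4KB-aligned window is legal, while even a
# small burst straddling a 4KB boundary is not.
```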
When memory interleaving is used with a granularity smaller than 4KB, a single contiguous AXI burst can still target multiple controllers internally, even though it stays within a 4KB region (and thus isn’t illegal by AXI rules). For example, with 1KB interleaving across two controllers, a 2KB linear burst starting at an aligned address will span two 1KB stripes: half of its addresses belong to “Controller A” and half to “Controller B”. The AXI protocol itself has no knowledge of this split, since the interleaved controllers present one unified address space to the master. It falls to the NoC/interconnect logic to handle the split transparently.
Burst Splitting at the AXI Level: NoC interconnects are designed to chop or split bursts that cross an interleaving boundary so that each portion can be routed to the appropriate memory controller. In our example of a 2KB burst with 1KB stripes, the interconnect (e.g. at the NoC’s master interface unit) will split the single burst into two transactions – one for the first 1KB to Controller A, and one for the second 1KB to Controller B. More generally, if a burst transaction crosses an interleave boundary, the interconnect hardware “chops” the transaction at that boundary[3]. This ensures each sub-burst stays entirely within one memory target. The ARM CoreLink NoC architectures (and similarly, the NoC in Xilinx/AMD Versal) implement this behavior at the NoC entry point. “If a burst transaction is sent to an NMU (NoC Master Unit) and crosses an interleave boundary…the transaction is chopped at the interleave boundary,” so that a single AXI transaction never spans two interleaved regions[5]. The master device still perceives it as one continuous burst overall, but under the hood it has been divided into multiple AXI transfers on the memory side. The AXI write or read responses for the sub-transactions are coordinated such that the original ordering is preserved and the master’s expectations are met (e.g. the data beats return in sequence).
For very fine interleaving (256B, 512B, etc.), even moderate-size bursts will be split into many pieces. Consider a 256-byte interleaving: a burst of 1KB (1024 bytes) would be divided into 4 chunks mapped to alternating controllers. The interconnect would issue 4 sub-bursts (each 256B) to the controllers in turn. Conversely, with a coarse 4KB interleaving, that same 1KB burst stays entirely on one controller (no split needed). In fact, with 4KB stripes, any legal AXI burst (which cannot exceed 4KB by rule) will always remain on a single controller. Thus, 4KB interleaving effectively avoids burst splitting, aligning with the AXI boundary rule by design.
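The chopping behavior described above can be sketched as follows. This is a simplified model that assumes a plain round-robin stripe-to-controller mapping (real hardware may hash addresses), and the function name is hypothetical:

```python
def chop_burst(start: int, length: int, stripe: int, num_ctrls: int):
    """Split a burst into sub-bursts, each confined to one interleave stripe.

    Returns (sub_start, sub_length, controller) tuples -- a rough model of
    what an NoC master unit does at interleave boundaries.
    """
    subs = []
    addr, remaining = start, length
    while remaining > 0:
        # Take at most the bytes left in the current stripe.
        chunk = min(remaining, stripe - addr % stripe)
        subs.append((addr, chunk, (addr // stripe) % num_ctrls))
        addr += chunk
        remaining -= chunk
    return subs

# A 1KB burst over 256B stripes and 2 controllers becomes four 256B
# sub-bursts on alternating controllers; with a 4KB stripe the same
# burst passes through unsplit.
```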
NoC Packetization and Data Splitting
On-chip networks (such as ARM’s CMN) transport transactions using packets and flits internally. A high-level AXI or CHI transaction may be broken into smaller packets for routing efficiency or protocol reasons. Interleaving granularity influences how the NoC packetizes and routes the data:
- Single-Controller Case: If an AXI burst is contained within one interleaved chunk (e.g. a 512-byte burst with 1KB interleave, or any burst under 4KB with 4KB interleave), the NoC can treat it as one transaction targeted to a single home node. The request travels to that node, and the data payload may be sent in one or multiple packets (depending on size). For example, if the NoC’s data packet payload is 64 bytes (commonly the size of a cache line), a 512B read might be delivered as 8 data packets of 64B each, all returning from the same target.
- Cross-Controller Case: If a burst spans two or more interleaved regions, the NoC generates multiple request packets – one per target region. Each packet carries the address range and length pertaining to its region. These packets can be sent in parallel into the mesh network, each heading to a different memory controller node. The data responses will likewise come back as separate packet streams from each controller, which the interconnect will interleave or concatenate back to fulfill the original AXI burst stream. Notably, the packet-level data splitting corresponds to the interleaving: finer granularity causes the NoC to split the data at finer boundaries, potentially creating more, smaller packets. In the earlier 2KB burst example (1KB stripes), two parallel read request packets would be issued. Each yields ~1KB of data, which might come back as a sequence of packets (e.g. 16×64B packets from Controller A and 16×64B from Controller B, in an interwoven fashion).
Internally, ARM’s coherent interconnect protocol (CHI) often operates on cache-line units, so large bursts are naturally segmented. In fact, the NoC may deliberately fragment bursts into cache-line-sized chunks for transport. For instance, the CMN-700 documentation notes that a remote read burst may be “cracked… into 64B chunks” when forwarded to a home node[6]. This means even if an AXI master issues a long burst, the NoC will handle it as a series of 64-byte packets on the wire. Smaller interleaving granularities (256B, 512B) align well with such chunking – multiple 64B packets will simply be directed round-robin to different controllers. With larger granularity, the entire burst’s packets all go to the same controller (until a 4KB boundary is reached).
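The 64B chunking can be combined with the stripe mapping in the same sketch style. Packet headers and flit framing are omitted, and the round-robin modulo is again an assumption standing in for the interconnect's real hashing:

```python
def packetize(start: int, length: int, stripe: int, num_nodes: int,
              chunk_bytes: int = 64):
    """Break a burst into cache-line-sized data packets, each tagged with
    the home node its address stripe maps to (headers omitted)."""
    return [(addr, (addr // stripe) % num_nodes)
            for addr in range(start, start + length, chunk_bytes)]

# A 512B read over a 256B stripe and 2 nodes: eight 64B packets, four
# destined for node 0 and four for node 1.
pkts = packetize(0, 512, 256, 2)
```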
It’s important to note that packetization overhead increases with the number of splits. Each sub-transaction carries its own header and routing info. So, a finely interleaved burst that becomes many small packets incurs more header overhead and potentially more coordination logic (to merge responses) than one large packet stream. However, the NoC is optimized for this scenario, with dedicated network interface units (NIUs) or RN-I/RN-D bridge components handling the splitting and reassembly seamlessly.
Impact of Granularity on Performance and Bandwidth
The choice of interleaving size involves trade-offs between parallelism and overhead. Fine-grain interleaving (e.g. 256B or 1KB): This maximizes the number of memory controllers that can be engaged by a single high-bandwidth request stream. It allows more requests to reach different channels in parallel, thereby increasing the achievable memory throughput[7]. In multi-channel memory systems, the interleaving granularity largely determines how many simultaneous accesses can occur – a finer stripe means an access pattern will hop to the next channel more frequently, keeping all channels busy for a sustained sequential access[7]. In other words, fine granularity improves memory-level parallelism and tends to yield higher bandwidth utilization. Studies have shown that very fine interleaving can significantly outperform coarse interleaving in bandwidth-heavy workloads. For example, one research work demonstrated that using a 128B stripe (as opposed to a 4KB stripe) can nearly double effective memory bandwidth in worst-case scenarios[8]. The smaller stripes ensure that even within one OS memory page, data is spread across multiple controllers, preventing any single controller from becoming a bottleneck[9].
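The bandwidth intuition can be made concrete with an idealized back-of-the-envelope model: over a given sequential window, only the channels whose stripes fall inside that window can contribute. The function, the 10 GB/s figure, and the window size are all illustrative assumptions; real sustained bandwidth also depends on queue depths, DRAM timing, and access patterns.

```python
import math

def sustained_bw_gbps(stripe: int, window: int, channels: int,
                      per_channel_gbps: float) -> float:
    """Idealized sustained bandwidth for a sequential stream over one
    address window: only the channels the window touches contribute."""
    touched = min(channels, math.ceil(window / stripe))
    return touched * per_channel_gbps

# For a 4KB sequential window over 4 channels of 10 GB/s each, a 4KB
# stripe keeps only one channel busy, while a 256B stripe engages all
# four -- the parallelism argument for fine-grained interleaving.
```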
However, fine interleaving isn’t free of drawbacks. The increased number of sub-transactions and network packets adds some overhead (extra packet headers, more ACK/NACK handling, etc.), which can slightly increase latency for a given burst. The interconnect must also merge or coordinate multiple responses – this is well within design capabilities, but it adds complexity. Additionally, because fine interleaving distributes even small blocks across all controllers, it means all controllers (and their attached DRAM banks) are active for most memory operations. This can reduce locality (e.g. consecutive cache lines might reside in different DRAM channels, potentially opening multiple DRAM rows) and may increase power usage since all memory channels are engaged. There is also an architectural consideration: extremely fine granularity (e.g. 128B) is smaller than the typical 4KB memory page, which means the operating system cannot direct pages to specific channels – every page is automatically spread across channels[9]. This yields great load balancing, but it removes any software control over channel usage (for NUMA or QoS purposes) and requires that the number of channels be a power of two for the address bit striping to evenly cover all combinations[10].
Coarse-grain interleaving (e.g. 4KB): This effectively assigns entire pages (or large blocks) to a single controller. The benefit is simplicity and locality – an OS page resides wholly in one memory controller, which can be advantageous for page-based allocation or if certain processors are affinity-biased to certain controllers. It minimizes the splitting of AXI bursts: as noted, a 4KB stripe avoids any burst-level splits under the AXI rules. This can slightly reduce overhead and keep transactions atomic on the network. The downside is a potential loss of parallelism. A single streaming access will saturate only one controller until it moves to the next 4KB page. If a workload frequently accesses large contiguous regions, one controller might handle most of the traffic while others sit idle, until a 4KB boundary is crossed. In high-bandwidth scenarios, this can underutilize available memory bandwidth – performance can degrade when coarse interleaving prevents parallel channel usage, especially if the memory controllers individually become bottlenecked[7]. Empirical analyses have shown that coarse interleaving (page-sized or larger) can suffer as core counts and memory demands increase, whereas fine interleaving keeps more channels busy and delivers higher sustained throughput[7].
Medium granularity (e.g. 1KB or 2KB): These offer a compromise. For instance, 1KB stripes split only those bursts that exceed 1KB or that happen to straddle a 1KB boundary, while successive 1KB regions still alternate across controllers, balancing sequential traffic well below page scale. Many common cache-coherent transactions (like 64B or 128B cache line fills) won’t notice a difference between 1KB and 4KB interleaving – they’ll just hit one controller. But larger DMA bursts or long runs of consecutive cache lines will spread across controllers once they cross a stripe boundary, improving concurrency. In practice, SoC designers often choose an interleave size that matches typical burst lengths or memory access patterns to balance efficiency. For example, if most bursts are 64B–256B, a 256B stripe might be unnecessarily fine (causing splits for bursts just over 256 bytes); a 1KB stripe would ensure most such bursts stay unsplit while still load-balancing within a page. On the other hand, if the system frequently issues 1KB+ cache refills or larger DMA transfers, using 256B or 512B stripes can ensure those are split and serviced concurrently by multiple controllers for better bandwidth.
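The stripe-size trade-off discussed above reduces to counting stripe boundaries inside a burst. A sketch (the function name is illustrative):

```python
def num_splits(start: int, burst_bytes: int, stripe: int) -> int:
    """Number of sub-transactions a burst is chopped into for a given stripe."""
    # Stripe index of the first and last byte of the burst.
    first = start // stripe
    last = (start + burst_bytes - 1) // stripe
    return last - first + 1

# A 300B burst at offset 0 is chopped in two by a 256B stripe but stays
# whole with a 1KB stripe -- one reason to match the stripe size to
# typical burst lengths.
```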
Conclusion and Key Takeaways
In ARM-based NoC systems, interleaving granularity has a direct impact on how data is segmented and routed through the interconnect. Fine granularity (256B–1KB) causes the NoC to split bursts into multiple packetized transfers that engage several memory controllers at once, boosting parallel throughput at the cost of a bit more protocol overhead. Coarse granularity (2KB–4KB) keeps bursts intact on a single controller (up to the 4KB AXI limit), simplifying transactions but potentially leaving performance on the table when one channel becomes a bottleneck. The AXI protocol’s 4KB burst boundary rule underpins these behaviors: interleaving of 4KB or larger aligns with the rule to avoid splits, whereas sub-4KB interleaving relies on the interconnect to transparently chop bursts at boundaries[5][4].
Overall, the trade-off is between maximum memory parallelism and transaction overhead/complexity. Industry practice and documentation (Arm’s CMN technical references, Xilinx Versal NoC guides, etc.) highlight that interleaving across controllers can “2x or 4x the bandwidth” available to a single request stream[3], which is a huge benefit for memory-intensive workloads. Academic studies further reinforce that finer interleaving yields higher effective bandwidth utilization in multi-channel memory systems[8]. Designers must balance this against considerations like power, typical access size, and system software needs. In summary, smaller interleaving sizes generally improve throughput by enabling packet-level data splitting across the NoC, while larger sizes favor simplicity and locality by keeping AXI bursts intact. The optimal choice depends on the SoC’s performance targets and workload characteristics, but the mechanism is fundamentally the same: interleaving granularity dictates how the NoC divides and conquers memory transactions across the chip.
Sources: The analysis above is based on ARM CMN-600/700 technical documentation, which details supported interleaving modes and internal hashing/striping mechanisms[1][2], as well as an AMD/Xilinx NoC user guide illustrating burst chopping at interleave boundaries[3]. The AXI specification’s 4KB rule is noted in ARM’s developer materials[4]. An academic study on multi-channel memory systems was referenced to quantify performance impacts of different interleaving granularities[7][8]. These sources collectively underpin the discussion of packet and burst splitting behaviors in modern ARM-based NoC designs.
[1] [2] [6] Arm Neoverse CMN-700 Technical Reference Manual Addendum (108055_0301_01_en)
https://www.scribd.com/document/845143330/arm-neoverse-cmn-700-trm-addendum-108055-0301-01-en
[3] [5] Memory Interleaving - 1.1 English - PG313
https://docs.amd.com/r/en-US/pg313-network-on-chip/Memory-Interleaving
[4] What is 4KB address boundary in AXI protocol? - SystemVerilog - Verification Academy
https://verificationacademy.com/forums/t/what-is-4kb-address-boundary-in-axi-protocol/33510
[7] [8] [9] [10] upcommons.upc.edu
https://upcommons.upc.edu/bitstream/handle/2117/11379/05642060.pdf