Demystifying Memory Hierarchy – A Programmer's Guide

Imagine you asked a friend to retrieve a book for you from their extensive home library. To save time searching shelf-by-shelf, they might first glance at their desk, check the books stacked by their favorite chair, then scan the genres they peruse most often before digging into the farthest, dustiest corners.

In computers, the memory hierarchy works similarly – hardware and system policies decide which data stays close to the processor and which is relegated to cheaper, slower media farther away. As a programmer, understanding these layers unlocks optimization opportunities. This guide unravels the mystery of memory hierarchy from circuits to code.

Why Locality Matters

To achieve high performance within finite memory budgets, computers exploit locality – the tendency of software to access related information repeatedly over short time intervals. The idea was formalized decades ago (Peter Denning's working set model of the late 1960s is the landmark treatment) and is usually expressed as two principles: temporal and spatial locality.

Temporal locality occurs because programs iterate through loops, recurse through data structures, invoke common utilities, and retain intermediate results needed for further calculations. Information used recently is likely to be used again in the near future.

Likewise, spatial locality emerges because instructions execute sequentially and data is organized into arrays, objects, and pointer-linked structures. Nearby memory addresses and related register contents often contribute together toward some collective purpose.

Observe below a simple C function demonstrating both forms of locality:

// Sum array elements: 'total' shows temporal locality (it is reused every
// iteration), while reading values[i] in order shows spatial locality.
#include <stddef.h>

int sum(const int *values, size_t count) {
  int total = 0;                 // Stays in a register for the whole loop
  for (size_t i = 0; i < count; i++) {
    total += values[i];          // Sequential memory access
  }
  return total;
}

Skilled programmers intentionally craft algorithms to concentrate locality. But even absent such optimization, common patterns emerge statistically through real workloads.

Climbing the Pyramid

Memory hierarchy ranks storage technologies across multiple levels, balancing capacity, speed, and price. Higher tiers sit closest to the processor and leverage locality to reduce perceived latency. Think of this architecture as a pyramid with the CPU at its apex:

Registers – These small on-chip storage cells feed the arithmetic logic units (ALUs) directly, with effectively single-cycle access. Though their combined capacity is only kilobits, registers managed intelligently through compiler allocation and function calling conventions deliver tremendous bandwidth to the hottest values.

Cache – A few megabytes of fast (a few to a few dozen cycles) static RAM (SRAM) hold copies of recently used data from main memory, cutting the number of trips to DRAM. Modern processors implement multiple levels (L1, L2, etc.) of cache, often with dedicated varieties for instructions and data.
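
Programmers rarely manage these caches directly, but they can at least discover their geometry. On Linux with glibc, for instance, sysconf exposes the cache parameters the system reports; a minimal sketch (these _SC_ names are glibc extensions and may return 0 or -1 where unsupported):

// Query reported cache geometry at run time (Linux/glibc extension).
#include <stdio.h>
#include <unistd.h>

int main(void) {
  long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); // bytes per L1 line
  long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);     // total L1 data cache
  long l2_size = sysconf(_SC_LEVEL2_CACHE_SIZE);
  long l3_size = sysconf(_SC_LEVEL3_CACHE_SIZE);
  printf("L1d line %ld B, L1d %ld B, L2 %ld B, L3 %ld B\n",
         l1_line, l1_size, l2_size, l3_size);
  return 0; // Unsupported fields come back as 0 or -1
}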

RAM – Gigabytes of affordable dynamic RAM (DRAM), supplied on memory modules, form the primary workspace holding program state such as variables, metadata, and buffers. While slower than cache (on the order of 100 cycles), it is large enough to host the OS kernel, applications, and their active datasets.

Secondary Storage – Disk drives and solid state drives (SSDs) provide persistent but slow access to bulk data and archived programs, reached through file systems and virtual memory abstractions. Networks now transparently tier and replicate such external storage via SAN and cloud interfaces for both capacity and availability.
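
As a sketch of how the virtual memory abstraction exposes this tier to ordinary code, a POSIX program can map a file into its address space and let the operating system page it in from storage on demand (error handling is abbreviated, and data.bin is a hypothetical input file):

// Map a file into memory; pages are faulted in from storage on first touch
// and cached in RAM by the OS thereafter (POSIX systems).
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
  int fd = open("data.bin", O_RDONLY);   // Hypothetical input file
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0) return 1;

  const unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (p == MAP_FAILED) return 1;

  long total = 0;
  for (off_t i = 0; i < st.st_size; i++)
    total += p[i];                       // Each new page is demand-paged in

  printf("byte sum = %ld\n", total);
  munmap((void *)p, st.st_size);
  close(fd);
  return 0;
}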

Let's explore each tier further, including some historical perspective on the advances that enabled modern computing.

Tiny Registers – Huge Performance

The earliest electronic computers built on the architectural heritage of tabulating machines. These incorporated processor registers analogous to notched wheels or banks of mechanical latches spanning a few bytes – sufficient state for mainly fixed arithmetic operations.

CDC's 1964 supercomputers (notably the 6600) grew register capacity somewhat but still spent most of their transistor budget on other innovations such as pipelined functional units. Competing designs likewise kept hardware registers minimal, leaning on the compiler to juggle main memory references judiciously through an optimized instruction set.

Such frugality prevailed for roughly a decade, until designs like DEC's KL10 demonstrated that more on-chip state improved performance even at marginally higher cycles per instruction (CPI). Specialized units such as floating point accelerators and vector register files further validated dedicating more silicon to storage near the execution logic.

This trend continued through the RISC revolution of the mid 1980s. Reduced Instruction Set Computers explicitly simplified processor implementation to enable higher clock rates and greater throughput, and their large general purpose register files fit that philosophy: with simple, fixed-length instructions, compilers could keep operands in registers instead of repeatedly touching memory.

Today even microcontrollers embed dozens of general purpose registers, augmented by specialized varieties (floating point, SIMD, etc.) depending on complexity. Compilers sustain register locality readily as modern software continues trending away from complex control flow and toward predictable data parallelism.
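
One small illustration of working with the register allocator rather than against it: when an accumulator lives behind a pointer that might alias other stores, many compilers must reload it from memory on every iteration; lifting it into a local variable lets it stay in a register. The function names below are purely illustrative:

// Hypothetical histogram plus running sum. In the first version, writes
// through 'out' might alias '*sum', so the compiler may reload and store it
// each iteration; the second keeps the accumulator in a local (a register
// candidate) and writes it back once.
#include <stddef.h>

void tally_slow(const int *in, size_t n, int *out, int *sum) {
  for (size_t i = 0; i < n; i++) {
    out[in[i]]++;
    *sum += in[i];          // Possible alias with out[] forces memory traffic
  }
}

void tally_fast(const int *in, size_t n, int *out, int *sum) {
  int local = *sum;         // Hot value lifted toward the register file
  for (size_t i = 0; i < n; i++) {
    out[in[i]]++;
    local += in[i];
  }
  *sum = local;
}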

Caching Out

Given the steep premium on ultra low latency memory, only the fastest technologies merit integration into the processor die itself. Yet even rigorously optimized software interleaves bursts of memory activity with stalls waiting on disk, network, and user interaction.

The idea has deep roots: the 1962 Atlas machine's virtual memory paged between fast core and slow drum storage, Maurice Wilkes proposed small "slave memories" in 1965, and IBM's System/360 Model 85 (1968) shipped the first commercial CPU cache. In each case, inserting a small high speed buffer in front of a larger backing store delivered a major leap in real world responsiveness by extracting maximal utility from still exorbitantly expensive fast memory.

Over subsequent decades, adding caches at the boundary between distinct storage tiers proved reliably effective, especially as growing density from silicon fabrication enabled successively larger SRAM pools from the mid 1980s onward. Leading microprocessors soon layered multilevel caches, each level slower but larger as requests funnel farther from the execution pipelines.

Today's CPU architectures commonly implement three levels of cache, often customized by purpose. The smallest, fastest L1 caches sit adjacent to the scalar ALUs or vector engines; being closest, they cut latency the most but sacrifice capacity and associativity. Mid-size L2 caches sit farther out, sometimes acting as victim caches, trading a few extra cycles for a larger pool of pricier SRAM. Finally, large shared L3 configurations favor capacity, energy efficiency, and throughput over raw speed.

Beyond incremental size and associativity tweaks, designers also added logic enhancing specific cache behaviors for their workloads:

  • Mitigating contention via partitioning or duplication
  • Augmenting hit rates through prefetching and prediction tables
  • Reducing writes via dirty bits enabling lazy flushes
  • Accelerating replacement via pseudo-LRU or randomization
  • Concentrating locality via guided placement and pinning

Multicore parallelism introduced additional concerns, balancing coherence overhead against the cost of duplicated data. But the overall theme remained the same: keep quickly accessible memory matched to the working set actively feeding the registers.
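
One concrete instance of that coherence cost is false sharing: independent per-thread counters packed into a single cache line force cores to invalidate each other's copies on every update. A minimal sketch of the usual remedy, assuming the common (but not universal) 64-byte line size:

// Padding per-thread counters to separate cache lines avoids false sharing.
// Assumes 64-byte lines; requires C11 for alignas.
#include <stdalign.h>

#define NUM_THREADS 8

struct padded_counter {
  alignas(64) long value;        // Each counter begins on its own cache line
};

static struct padded_counter counters[NUM_THREADS];

void count_event(int thread_id) {
  counters[thread_id].value++;   // No invalidation ping-pong between cores
}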

Breaking the DRAM Barrier

Once the one-transistor dynamic RAM cell of the late 1960s opened a scalable path to affordable silicon based main memory, capacity reliably doubled every year or two for decades. DRAM's leaky capacitor cells do require periodic refresh cycles to maintain stored bits, but that overhead stayed small against the order-of-magnitude density gains delivered by manufacturing improvements.

By the mid 1990s, individual SDRAM DIMMs could hold hundreds of megabytes, straining the bandwidth of the motherboard's Front Side Bus (FSB). Signaling changes, including Double Data Rate (DDR) transfers, multi-channel architectures, and point-to-point interconnects, helped push access times below 100 nanoseconds while bandwidth kept climbing.

However, DRAM access latency soon failed to keep pace with the gains achieved inside the processor through architectural innovations such as speculative execution and instruction level parallelism (ILP). Intelligent memory controllers therefore emerged to orchestrate scheduling policies that minimize delay for priority requests, including:

  • Low latency modes for inter-cache transfers
  • Reordering schemes emphasizing row locality
  • Preemptive staging of predicted data structure traversal
  • Traffic analysis adjusting burst length and channel concurrency dynamically

External memory interfaces likewise adapted to bridge the widening performance gap between the processor and off-chip technologies.

While hardware and firmware largely hide these intricacies behind boot-time memory training, understanding the mechanisms remains valuable for any programmer mapping high performance algorithms onto the hierarchy.

Beyond the Horizon

Moore's Law has slowed markedly for conventional silicon CMOS transistors after decades of exponential density scaling. Nevertheless, new memories may someday redraw hierarchy boundaries by offering distinct price and capability tradeoffs complementary to existing technologies. Some emerging candidates include:

MRAM – Magnetic RAM writes bits by polarizing tiny ferromagnetic stacks. Inherently non-volatile, MRAM promises near-DRAM speeds with practically unlimited endurance. Its radiation resilience also makes it an intriguing replacement in spacecraft for mechanical drives and error prone flash.

FeRAM – Tiny capacitors etched into a ferroelectric crystal lattice yield extraordinarily fast, low energy writes. Though its density lags other technologies, FeRAM suits small, frequently updated stores, and its ultra low latency directly aids applications such as high resolution neural signal processing that require extensive random state updates.

PCM – By adjusting the programming current injected into Phase Change Memory's chalcogenide glass, the material switches between amorphous and crystalline states, mapping bits to measurable differences in resistance. Closer to a flash replacement than a DRAM displacement, PCM provides improved endurance at similar capacity.

STT-RAM – Applying spin transfer torque (STT) flips the magnetization of a magnetic tunnel junction, the element storing each bit in STT-RAM arrays. Performance far outpaces other non-volatile options while avoiding the brute force write power typical of PCM, and the cells sustain rapid state changes without wearing out prematurely, making them attractive for write heavy roles.

Myriad other emerging memories like RRAM, CBRAM, and memristors each attack specific niches, hoping to augment aging incumbent technologies. Their battleground spans every level: registers, buffers, caches, main memory, mass storage, and beyond. One day they may radically reshape the hierarchy, once their scaling, cost, and complexity barriers sufficiently diminish. Until then, engineers keep cleverly adapting yesterday's technologies to contemporary workload needs.

Programming for Memory

Given that compilers now readily translate high level code into efficient machine instructions, software engineers rarely consider memory hierarchy nuances within application logic. However, those pursuing truly high performance still benefit from several targeted optimizations:

  • Insert prefetch intrinsics to manually initiate caching of upcoming data higher in the hierarchy
  • Lay out critical data structures for spatial locality rather than scattered pointer chasing
  • Localize heap allocations using first-touch policies or NUMA-aware hints
  • Mask latency by increasing thread parallelism and pipelining I/O operations
  • Determine blocking sizes that balance reuse against cache pollution (see the tiling sketch after this list)
  • Prefer iterative relaxation techniques over direct methods in sparse linear algebra
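
As an example of the blocking point above, here is a sketch of a cache-blocked matrix transpose; the tile size is only an illustrative starting point to be tuned against the actual cache capacities:

// Cache-blocked (tiled) transpose of an n-by-n matrix stored row-major.
// Working on BxB tiles keeps both the source rows and the destination
// columns resident in cache while they are reused.
#include <stddef.h>

#define B 32   // Tile edge; tune so roughly 2*B*B elements fit in L1/L2

void transpose_blocked(const double *src, double *dst, size_t n) {
  for (size_t ii = 0; ii < n; ii += B)
    for (size_t jj = 0; jj < n; jj += B)
      for (size_t i = ii; i < ii + B && i < n; i++)
        for (size_t j = jj; j < jj + B && j < n; j++)
          dst[j * n + i] = src[i * n + j];   // Writes stay within one tile
}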

Skilled coders additionally recognize and avoid anti-patterns undermining hardware hierarchy efficacy:

  • Prematurely spilling transient values from registers back into memory
  • Excessive pointer indirection that frustrates prefetching and dependence analysis
  • Incoherent strides while traversing multidimensional arrays (contrast the two traversal orders sketched after this list)
  • Overwhelming cache capacity with unused interim workspace variables
  • Polling flags, semaphores, and queues instead of using event triggers
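
To make the stride anti-pattern concrete, the two functions below sum the same row-major matrix. The first walks memory sequentially, getting several elements per cache line; the second jumps a full row per access and touches a new line almost every iteration:

// Same data, two traversal orders: row order is cache friendly for a C
// array, column order wastes most of every cache line fetched.
#include <stddef.h>

double sum_row_order(const double *m, size_t n) {
  double total = 0.0;
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      total += m[i * n + j];   // Stride of one element
  return total;
}

double sum_column_order(const double *m, size_t n) {
  double total = 0.0;
  for (size_t j = 0; j < n; j++)
    for (size_t i = 0; i < n; i++)
      total += m[i * n + j];   // Stride of n elements
  return total;
}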

Finally, issuing asynchronous memory operations via intrinsics and judiciously overlapping them with independent work often yields the biggest latency hiding wins, as sketched below.
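
As one example of that idea, GCC and Clang expose a __builtin_prefetch intrinsic that can request a cache line a fixed distance ahead while independent work proceeds; the distance of 16 elements here is only an illustrative guess to tune per platform:

// Software prefetch overlapped with independent work (GCC/Clang builtin).
#include <stddef.h>

long sum_with_prefetch(const long *data, size_t n) {
  long total = 0;
  for (size_t i = 0; i < n; i++) {
    if (i + 16 < n)
      __builtin_prefetch(&data[i + 16], 0, 1);  // read hint, low reuse expected
    total += data[i];                           // Useful work hides the latency
  }
  return total;
}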

While compilers automatically restructure code to enhance locality, only programmers with a working mental model of the memory hierarchy truly unlock the hardware's potential. Doing so maximizes throughput while minimizing energy wasted on unnecessary transfers or on pipelines idling while they wait for data.

Conclusion

Over half a century, from hand wired core memory to today's stacked, high bandwidth DRAM, the relentless march toward affordable memory at scale has fueled computer advancement. Today GPUs pack thousands of streaming processors into a single device thanks to high speed memory bandwidth that is now economically feasible, and flash storage affords nearly boundless archival capacity once available only to national institutions.

Yet the best practices for economizing on memory remain largely universal throughout this progress. Architectures exploit locality by organizing storage into predictably accessed levels, and programmers in turn arrange code to concentrate data usage and maximize cache reuse. Together they balance a diverse set of memories to achieve both throughput and efficiency at manageable complexity.

So while hardware and algorithms relentlessly progress, memory hierarchy design eternally links them together through a hierarchical chain of locality. In computing, some ideas never fade away or lose relevance.
