Introduction
I originally wrote this in college, a couple of years ago. It described how solid state drives would improve computer performance, and I wanted to update the article now that multi-core processors and solid state drives (SSDs) have become a reality.
Motivation
I learned about CPU caches a couple of weeks ago in class and how they improve performance. I want to know how well they work in real life, and in particular how they help with hard drive accesses, since the hard drive is the primary bottleneck in a computer (the I/O and memory buses are bottlenecks too, but less significant ones). I also want to find a way to mitigate page faults (misses), especially when loading an application. While an application is being loaded and linked, the CPU is mostly idle as you wait (and often wait some more) for the app to appear on your screen in a usable state.
Background
The basic idea of caching is to hold commonly used values in memory close to the CPU for quick access. Why? Because the values in main memory or on the hard disk drive (HDD) are far away and take a very long time to retrieve. Imagine a CPU pipeline with 5-20 stages, with an instruction advancing one stage every clock tick; a single instruction then takes 5-20 ticks to execute. That is not very good, but in reality a different instruction occupies each stage of the pipeline, which allows one instruction to complete every clock tick. That is nice, but in reality there are also dependencies that cause delays, and there are the dreaded memory accesses, which are very frequent and cause the CPU to stall (execute noop instructions) for long periods of time.

Also, the larger your memory gets, the more expensive it becomes. In real life the cost is prohibitive, so cheaper (slower) alternatives are used. Hard drives are the largest, cheapest and slowest. On-die SRAM is very fast and relatively expensive. Main memory sits somewhere in between. SSDs are much faster than traditional drives, but still nowhere near as fast as main memory. For this reason caches are placed between your main memory and your CPU (or, more generally, between your hard drive and your CPU).
A cache hits (finds the right value) a certain percentage of the time, depending on its size and configuration (associativity, line size, number of entries). Remember, though, that a larger number of entries and higher associativity generally mean slower access.
Hardware Specifications
Processor
The processor used as a reference is a 2.8GHz Wolfdale-based quad core from Intel. It has two 32kB 2-way associative L1 caches with a 3-cycle latency and a 6144kB 24-way associative L2 cache with a 15-cycle latency. This processor has no L3 cache, and putting one on the motherboard would probably (just a guess) not improve performance. But we'll put an imaginary 24MB L3 cache on the die and see if it changes performance, assuming it has a 30-cycle latency. All caches use a full LRU replacement policy.
The quad core is actually two dual cores, which are in turn built from two single cores each. Because of this we assume the cores do not fight over the bus for access to memory or to cache. This is probably not true, but I don't have any information on the architecture. There would surely be contention for the hard drive, but we can keep things simple by assuming this test is for single-thread performance and that the hard drive is accessed infrequently anyway.
The miss ratio of the L1 should be around 0.01 and that of the L2 around 0.00625. We will assume the L3 misses with a ratio of 0.0002. (See here.)
Memory
The memory will be 2GB of DDR2-800 with a peak transfer rate of 6400MB/s. With 3-4-3-9 timings at 400MHz, this should give us about a 10-cycle delay in memory clocks. With a CPU at 2.8GHz (7x faster than the memory clock), our delay should be 70 CPU cycles. We will assume main memory misses (page faults) with a ratio of 0.00001.
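As a rough sketch of that conversion in Python (assuming, as above, that a typical access costs roughly CAS + tRCD + tRP memory clocks):

    # Rough conversion of DDR2-800 timings into CPU cycles.
    # Assumed model: one access costs roughly CAS + tRCD + tRP memory clocks.
    cas, trcd, trp = 3, 4, 3            # from the 3-4-3-9 timings
    mem_clock_hz = 400e6                # DDR2-800 I/O clock
    cpu_clock_hz = 2.8e9                # reference CPU
    mem_cycles = cas + trcd + trp       # ~10 memory clocks per access
    print(mem_cycles * cpu_clock_hz / mem_clock_hz)   # 70.0 CPU cycles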
Hard Drive
The reference hard drive will be the MTRON MSP 7000 64GB SSD. This is the fastest hard drive currently available and boasts a 0.1ms (100μs) seek time. A 100μs delay on a 2.8GHz CPU is 280,000 CPU cycles (0.0001 * 2.8x10^9). The hard drive, of course, never misses (unless it is broken).
Equation
This equation calculates the average delay, in cycles, for a single data access:
d = L1A*(1-L1M) + L1M*[L2A*(1-L2M) + L2M*[L3A*(1-L3M) + L3M*[MMA*(1-MMM) + MMM*SSA]]]
d = average delay
L1A = access time to L1 cache
L1M = miss ratio for L1 cache
L2A = access time to L2 cache
L2M = miss ratio for L2 cache
L3A = access time to L3 cache
L3M = miss ratio for L3 cache
MMA = access time to main memory
MMM = miss ratio for main memory
SSA = hard drive access time
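Plugging in the numbers from the hardware section gives the delay directly; here is a quick Python sketch of that calculation (the variable names simply mirror the equation):

    # Average delay per data access, using the values from the hardware section.
    L1A, L1M = 3, 0.01          # L1 access time (cycles), L1 miss ratio
    L2A, L2M = 15, 0.00625      # L2 access time (cycles), L2 miss ratio
    L3A, L3M = 30, 0.0002       # imaginary L3 access time (cycles), miss ratio
    MMA, MMM = 70, 0.00001      # main memory access time (cycles), miss ratio
    SSA = 280000                # SSD access time in CPU cycles (100 microseconds at 2.8GHz)

    d = L1A*(1-L1M) + L1M*(L2A*(1-L2M) + L2M*(L3A*(1-L3M) + L3M*(MMA*(1-MMM) + MMM*SSA)))
    print(round(d, 2))          # ~3.12 cycles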
Results
Average read time in cycles with no L3: 3.12
Average read time in cycles with L3: 3.12
Average read time in cycles with no L3 (and 5ms Raptor): 3.12
The same in every case. How can this be? Real-life applications show how much faster a system feels with an SSD. Perhaps it is because this is a best-case scenario. It appears that with very low miss rates, the L1 access time is the primary limiting factor.
I wanted to find out which of these values really has an impact on performance, so I started varying each value, from one extreme to the other.
I first tried varying the HDD access time from 1,000 cycles to 100 million cycles. This had no effect whatsoever, so apparently this level of buffering keeps the drive from altering the average access time; but a single page fault at 100 million cycles would be an eternity for a CPU, and many page faults over and over would be crippling. Now we see why good software is so important: if the file system, OS and application are wasteful, efficiency drops off precipitously.

Varying the memory access time MMA made close to no difference; the many memory benchmark reviews on computer/gamer websites confirm that buying faster memory makes close to no difference. Changing the memory hit rate had no effect either. Increasing the L2 cache access time by 5 cycles changed the result by only 1.6%, and varying the L2 miss rate had no effect. Varying L1M caused the first substantial change in performance: doubling the miss rate to 0.02 increased the average access time by about 4%. Varying the L1 access time caused the average access time to increase almost linearly. Clearly, the L1 cache is the bottleneck in this design, and both its miss rate and its access time are crucial. This is obvious from the equation, where nearly all of the final answer comes from the first term; in fact, a first approximation can be had simply by computing L1A*(1-L1M).
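To reproduce this kind of sensitivity test, the equation can be wrapped in a function and one parameter varied at a time. A small Python sketch (the function name and default values are mine, taken from the numbers above):

    # Wrap the delay equation in a function so individual parameters can be swept.
    def avg_delay(L1A=3, L1M=0.01, L2A=15, L2M=0.00625,
                  L3A=30, L3M=0.0002, MMA=70, MMM=0.00001, SSA=280000):
        return (L1A*(1-L1M)
                + L1M*(L2A*(1-L2M)
                + L2M*(L3A*(1-L3M)
                + L3M*(MMA*(1-MMM) + MMM*SSA))))

    # Sweep the drive latency from 1,000 to 100,000,000 CPU cycles: the average
    # barely moves, even though a single 100-million-cycle fault is an eternity.
    for ssa in (1000, 100000, 10000000, 100000000):
        print(ssa, round(avg_delay(SSA=ssa), 2))

    # Doubling the L1 miss rate, by contrast, is immediately visible.
    print(round(avg_delay(L1M=0.02), 2))    # roughly 4% higher than the baseline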
Conclusions
As we can see, modern memory systems are highly optimized; not much more can be squeezed out with a cleverer design. Does the SSD help? Yes, but it is because of inadequacies in the software that improving the worst case (a page fault and the subsequent access to the drive) improves performance. Right now SSDs have sub-millisecond access times. This needs to be improved all the way down to sub-microsecond access times. Opening an app takes forever, and these days nobody has the patience.
How long does it take to execute the average instruction? There is one memory read for every instruction (every instruction must itself be fetched from memory), and memory instructions make one additional access. If memory instructions are 33% of all instructions, then the total average time for an instruction to complete (in a full pipeline) is:
td = 2*d*ma + d*(1-ma)
td = total average time for an instruction to be completed in a full pipeline
d = value from above (3.12 cycles)
ma = memory access ratio (0.33)
This gives 4.15 cycles per instruction, so the best-case scenario yields about 24.1% utilization of the CPU logic (1/4.15).
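The same arithmetic in a couple of lines of Python, reusing d = 3.12 from the results above:

    d = 3.12                      # average delay per access, from the results above
    ma = 0.33                     # fraction of instructions that also access data in memory

    td = 2*d*ma + d*(1-ma)        # every instruction is fetched; memory instructions pay twice
    print(round(td, 2))           # ~4.15 cycles per instruction
    print(round(100/td, 1))       # ~24.1% utilization of the CPU logic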
Other Considerations
I did not take into consideration the following when performing the analysis: