Reliability, Availability and Serviceability

Reliability, Availability and Serviceability (RAS) describe a product's ability to deliver its intended functionality: essentially, how long and how well it does so.

Definitions
Reliability: the ability (i.e. probability) of a device to provide correct outputs over a given period in a specific operating environment.
Availability: the likelihood (i.e. probability) that a device is available for use at a given time.
Serviceability: how easily and effectively a device can be repaired.
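
The first two definitions are probabilities, so they can be estimated numerically. The sketch below is one common way to do that, not the only one. It assumes a constant failure rate (an exponential lifetime model) and uses mean time between failures (MTBF) and mean time to repair (MTTR), neither of which is defined on this page. The function names and the example figures are hypothetical.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running t_hours without a failure.

    Assumes a constant failure rate of 1 / mtbf_hours (exponential model).
    """
    return math.exp(-t_hours / mtbf_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the device is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical device: fails on average every 10,000 h and takes 4 h to repair.
print(reliability(1_000, 10_000))  # about 0.905
print(availability(10_000, 4))     # about 0.9996
```

Serviceability enters through MTTR: a device that is quicker to repair has a lower MTTR and therefore a higher availability for the same MTBF.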

Design Guidelines

  1. Identify Flows
    1. Map the flow of device outputs (e.g. functions)
    2. Identify all failure modes along the flow
    3. Identify single points of failure
  2. Determine or calculate the expected time between failures at both the component level and the system level
    1. Component lifespan
    2. Component aging
    3. Derate components for sufficient margin
  3. Mitigation
    1. Implement redundancy as needed
    2. Implement high-reliability hardware as needed
  4. Detection
    1. Sampling of output
    2. Fault injection
    3. Monitoring hardware (e.g. self detection)
    4. Diagnostics
    5. Alerting and reporting failures
  5. Recovery and Repair
    1. Can the user accept reduced capacity (e.g. running slower, wider error margins, etc.)?
    2. Automated failover
    3. Service (online)
    4. Service (offline)
  6. Repeat at all levels of the stack (memory is used as an example)
    1. Memory cell
    2. Memory row
    3. Memory silicon and packaging
    4. Memory PCB
    5. Memory controller or CPU
    6. Memory bus
    7. OS memory manager
    8. OS process manager
    9. Process
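
Steps 1 to 3 can be illustrated with a small calculation. The sketch below is a simplification: it assumes component failures are independent and, for the MTBF helper, that each component has a constant failure rate. The function names, the example flow and every number in it are hypothetical.

```python
import math
from typing import Sequence


def series_reliability(parts: Sequence[float]) -> float:
    """Every part must work, so each part is a single point of failure."""
    return math.prod(parts)


def parallel_reliability(parts: Sequence[float]) -> float:
    """Redundant parts: the group fails only if every part fails."""
    return 1.0 - math.prod(1.0 - r for r in parts)


def series_mtbf(mtbfs: Sequence[float]) -> float:
    """Failure rates add in series, so system MTBF = 1 / sum(1 / MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


# Hypothetical flow: memory controller -> memory bus -> memory module.
flow = [0.99, 0.999, 0.95]
print(series_reliability(flow))  # about 0.9396

# Mitigation: duplicate the weakest component (the memory module).
redundant_module = parallel_reliability([0.95, 0.95])
print(series_reliability([0.99, 0.999, redundant_module]))  # about 0.9865

# System-level MTBF from per-component MTBFs, in hours.
print(series_mtbf([100_000, 500_000, 50_000]))  # about 31,250
```

The series result is always lower than the reliability of the weakest part, which is why single points of failure are identified first and redundancy is applied where it raises the system figure the most.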

Resources
Reliability engineering
Availability