Disk failure probability on a large number of disks

Introduction

We all either have experienced or will experience a drive failure. It does not matter whether we talk about HDDs, SSDs, or PCIe disks: any storage drive will fail eventually. But how probable is it really, especially if we have many drives? A single drive is pretty reliable and has a huge MTBF, so the result might surprise you.

The case

According to datasheets, the probability that a single business-class HDD or SSD fails within a year is around 1% (though some observe real-world rates closer to 10%, depending on age, environment, usage, and many other factors).

If we take that optimistic 1% probability for a single drive to fail per year, how probable is it that at least one drive will fail within a year if you have 100, 200, or 300 disks?

For 10 drives, the probability is about 10% (9.6%, to be precise)

For 100 drives, the probability is 63%

For 200 drives, the probability is 87%

For 300 drives, the probability is 95%

For 1000 drives, the probability is 99.996%, so at least one drive will almost certainly fail within one year

So, if you have many disks, it is almost certain that at least one of them will fail within a year. The drives can sit in a storage system or in 500 laptops; it does not matter. We are not discussing RAID levels here, just the drives themselves, wherever they are.

If you have 300 drives, a probability of 95% does not mean that after one year 285 of your drives will have failed and 15 will still work. With a 1% failure rate you would expect only about 3 of the 300 drives to fail. The 95% means that if you find 100 companies, each with a storage system of 300 drives, about 95 of them will tell you they had at least one failure within the last year, and only about 5 will tell you they did not have any disk failures.
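The "100 companies" reading can be checked with a small simulation. This is only a sketch under assumptions that are mine, not measured data: every drive fails independently with the same 1% yearly probability, and the function name `simulate_companies`, the number of simulated companies, and the seed are all made up for illustration.

```python
import random


def simulate_companies(companies: int = 2000, drives: int = 300,
                       annual_failure_rate: float = 0.01, seed: int = 1) -> float:
    """Return the fraction of simulated companies with at least one failed drive.

    Assumption: each drive fails independently with the same yearly probability.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    with_failure = 0
    for _ in range(companies):
        # One simulated year for one company: did any of its drives fail?
        if any(rng.random() < annual_failure_rate for _ in range(drives)):
            with_failure += 1
    return with_failure / companies


if __name__ == "__main__":
    share = simulate_companies()
    print(f"Companies with at least one failed drive: {share:.1%}")
```

The printed share should land close to 95%, and it moves a little when you change the seed or the number of simulated companies.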

The math

How did I get those numbers?

The formula and logic behind that probability are simple. The probability that at least one drive will fail is the opposite of the probability that all drives will remain healthy. A single drive has a 99% chance of staying healthy for a year (written as 0.99, where 100% equals 1.0). Assuming drives fail independently of one another, the probability that two drives both remain healthy for a year is 0.99*0.99 = 0.98, which means 98%. For all 100 drives to remain healthy for one year, the chance is 0.99*0.99*0.99*…*0.99 = 0.99^100 = 0.37, which means only 37%. Surprised?

That means the probability that at least one drive will fail is the complement of that probability: 100% - 37% = 63%.

The same method works for 200 or 300 drives, or for a different single-drive failure probability. I have deliberately left out formal math notation to keep this example simple and not to frighten readers who are not comfortable with formulas.
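For readers who prefer code to formulas, the same calculation fits in a few lines of Python. Treat it as a minimal sketch: the function name `prob_at_least_one_failure` is my own, and it assumes that all drives share the same yearly failure probability and fail independently of each other.

```python
def prob_at_least_one_failure(drives: int, annual_failure_rate: float = 0.01) -> float:
    """Chance that at least one of `drives` drives fails within a year.

    Assumption: drives fail independently, all with the same yearly probability.
    """
    # Probability that every single drive stays healthy for the whole year.
    all_healthy = (1.0 - annual_failure_rate) ** drives
    # "At least one fails" is the complement of "all stay healthy".
    return 1.0 - all_healthy


if __name__ == "__main__":
    for n in (10, 100, 200, 300, 1000):
        print(f"{n:>5} drives: {prob_at_least_one_failure(n):.1%}")
```

To try the more pessimistic real-world figure, pass it as the second argument, for example `prob_at_least_one_failure(100, 0.10)`.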

Real world

This is a purely theoretical approach, presented as simply as possible to make one key point: with lots of drives, the chance that at least one will fail is huge. It would be very interesting to see how that corresponds to real-world cases. If you have a storage system with lots of drives, or lots of laptops/desktops in your company, please drop a line here saying how many drives there are and whether you have experienced a drive failure within the last 365 days. The larger the sample, the more reliable the results will be.

Conclusion

The more drives you have, the more probable it is that at least one of them will fail. Redundancy through one of the redundant RAID levels is mandatory. The most popular are RAID10, RAID5, and RAID1; RAID0 does not provide any redundancy. It is also advisable to have a hot spare drive for the array.
