
Why Buy Value Storage?

It’s a matter of weighing the costs and benefits — and it all comes down to data value. 

  • Are the data (or the instrument time) valuable, or are you storing things that aren’t really crucial? 

  • How expensive is it to recreate your data in researcher and instrument time? Is it even possible?

  • How much of your own time and payroll are you willing to dedicate to the real and inevitable overhead?

There are users at the University with a variety of storage needs. This article addresses a question posed by one of our users: disks are cheap ($30/TB outright). Why not just store data using a bunch of these inexpensive disks?

There's a lot of subtlety to the question. Do you want to store a single monolithic existing dataset, or will you be accumulating five years of research? Knowing the instrumentation would also help in giving specific guidance.

Without all of the specifics in hand, I can only resort to generalities, but those can be very useful. They mostly center on questions of data lifecycle. The points I will make stem from spending my career in experimental physics with a strong computing emphasis, working with enormous datasets (petabyte scale).

In no particular order:

Returnability

Drives that you purchase are yours, for better or worse, and your elasticity is limited: you will have to use your own facilities to store them, manage them, and dispose of them. In contrast, the university’s value options don’t lock you in; that capacity can easily be ceded back and used by others if you are not interested in continuing to use (and pay for) it.

Redundancy

Instrument time and researcher time are often expensive. If the data are worth taking and keeping, they’re usually worth protecting.

HDD failure rates are significant, and often trace to bad batches. I once opened an under-desk storage appliance holding 4 spinning disks from the same batch and disassembled them. On each, the read/write head had shredded the magnetic media into fine black dust.

This kind of failure is not a catastrophe on our systems, because they are managed and instrumented, and we can replace failing disks before the whole array becomes irretrievable.

Failure Probability

One real problem you may encounter is that the new disks you buy for data collection will probably be plugged in, used to collect the data, and then unmounted and set aside. This usage pattern puts them squarely in the “infant mortality” part of the infamous bathtub curve of disk failures, which greatly increases the chances of data loss.

In the Swift object storage system, we can handle three full hard disk failures at a time without data loss — and without a rebuild time that is noticeable to the end user. The standard NAS value storage runs in RAID6, which can manage two simultaneous disk failures. 
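To put rough numbers on the difference redundancy makes, here is a toy sketch of the failure math, assuming independent disk failures within a single repair window. The 12-disk group size and the 5% per-window failure probability are illustrative assumptions, not our actual configuration, and real failures cluster in bad batches, so treat the results as optimistic.

    from math import comb

    def p_data_loss(n_disks, tolerance, p_fail):
        """P(more than `tolerance` of `n_disks` fail), simple binomial model."""
        return sum(comb(n_disks, k) * p_fail**k * (1 - p_fail)**(n_disks - k)
                   for k in range(tolerance + 1, n_disks + 1))

    p = 0.05  # assumed per-disk failure probability within one repair window
    print(f"single bare disk (tolerates 0): {p_data_loss(1, 0, p):.1e}")
    print(f"RAID6, 12 disks  (tolerates 2): {p_data_loss(12, 2, p):.1e}")
    print(f"Swift, 12 disks  (tolerates 3): {p_data_loss(12, 3, p):.1e}")

Even under this toy model, going from two-disk to three-disk tolerance on the same shelf cuts the probability of data loss by roughly an order of magnitude.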

Security

Even if your data aren’t sensitive, the disks are. In data center deployments, we treat the disk shelves with great care — slow movements, cautious extraction and replacement. Spinning media do not like to be jostled. If you are collecting data via swapping disks into a dock or an appliance attached to your instruments, that’s a risky operation. 

Even more risky is keeping these disks in drawers or a cabinet. Say you have 500TB: that’s 125 4TB HDDs. You’re looking at a lot of equipment to manage, and every swap potentially compromises the data a disk contains. Even smaller piles of disks (10 or 15) have similar problems.

If you are planning on keeping the drives safe by putting them in a server chassis in a rack… well, that’s significant added expense. It costs power, floor space, cooling, and personnel. All of that needs to be added to your cost comparison. Disks are the cheap part.

Data on HDDs are not completely stable, either: magnetic media degrade over time, and disks on a shelf don’t undergo the constant data integrity checks that a managed system performs.
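For readers who haven’t seen one, here is a minimal sketch of the kind of integrity check a managed system runs routinely and a disk in a drawer never gets. The directory name, the manifest file, and the choice of SHA-256 are illustrative; this is not a description of our production tooling.

    import hashlib, json, pathlib

    def sha256(path, chunk=1 << 20):
        """Hash a file in 1 MiB chunks so large datasets don't exhaust RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    # At archive time: record a checksum for every file under the data root.
    root = pathlib.Path("dataset")  # hypothetical data directory
    manifest = {str(p): sha256(p) for p in root.rglob("*") if p.is_file()}
    pathlib.Path("manifest.json").write_text(json.dumps(manifest, indent=2))

    # Months later: re-hash and compare to catch silent corruption.
    saved = json.loads(pathlib.Path("manifest.json").read_text())
    corrupted = [p for p, digest in saved.items() if sha256(p) != digest]
    print("corrupted files:", corrupted or "none")

Run the first half at archive time and the second half whenever you want reassurance; any file whose hash has drifted has suffered silent corruption.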

That’s leaving aside the fact that large hard drives are easily converted to quick cash. The same folks who will steal a laptop (with all of your valuable data) are quite willing to snag a few HDDs. Even worse: the theft isn’t necessarily going to be noticed immediately. It’s just not good data security to keep your drives in a common lab space without cameras and good access control.

The datacenters we run for the University are carefully secured and managed, and live servers rapidly notice removed disks.

Accessibility

So you need to get back to a dataset. You look it up in the spreadsheet, fetch the disk, plug it in, hope it works, and hope nobody accidentally mislabeled or misfiled or mishandled (dropped!) the disk. What if your analysis spans multiple disks? Headaches. Do you need to clean out a subset of data? That’s going to completely screw up your organizational scheme, especially if it’s an inconvenient distribution of files across a variety of disks. Even graduate student hours cost money; this all adds up fast.

Compare that to having a single, clean, enormous data pool. Speeds are great, organization is flexible, and there’s no need to be in the lab swapping disks for a last-minute run to reprocess a different dataset or recreate your plots for publication. This is especially valuable when you’re out of town — or even just working from home.

Value storage is accessible anywhere on Grounds at excellent speeds. The Rivanna cluster will be one place among many that can use that storage well.

Sharing

Review all of the accessibility arguments above, with an additional wrinkle: someone else is handling the disk directly, and may make data-destroying errors. You can’t lock that user away from data modification because permissions go out the window with physical access to the drive. There’s simply no way to guarantee read-only use. In addition, it’s strictly single-user. Your data speeds are pretty good, but copies and transfers are sheer drudgery. Getting the data onto a central computing resource might involve extended, manual transfers.

A centrally managed system, in contrast, allows for safe sharing, permissions management and revocation, and can be cited as a long-term archival solution in an academic environment where data and results preservation are becoming more and more of an issue.

Management

When things go wrong with your purchased disks, you have few places to turn. Recovery, if it’s even possible, is expensive or time-consuming or both!

The managed value storage deals with failures silently — you don’t spend time or anxiety on it. We handle it. Worst case, you have a few hours’ downtime, in which you and the grad students can be productive elsewhere (even if it’s just before the results are due, there’s always something to do.) Usually, however, the data are simply… there.

So now we come to dollars and cents. 

Let's assume you want 50 TB. If your requirements are smaller, the numbers scale just like you'd expect.

HDDs:

At $110-140 per 4TB drive (settling at $120 for convenience), 50TB works out to 12.5 drives, so we clock in at roughly $1,500 for the drives.

That’s $300 per year over 5 years. 

(This includes no redundancy, failure replacement, ancillary equipment, or management overhead. On the other side, it does not assume a decrease in price per TB over time, either. That $300/year figure is the “base” that the multipliers below refer to.)

NAS value storage:

Three possible assumptions: 

  • storing 50TB for 5 years: $45/TB/year * 50TB = $2,250/year (7.5x base)
  • assuming linear growth over 5 years up to 50TB = $1,125/year avg. (3.75x base)
  • assuming 10TB added per year over 5 years = $1,350/year avg. (4.5x base)

Swift object storage:

Three possible assumptions: 

  • storing 50TB for 5 years: $30/TB/year * 50TB = $1,500/year (5x base)
  • assuming linear growth over 5 years up to 50TB = $750/year avg. (2.5x base)
  • assuming 10TB added per year over 5 years = $900/year avg. (3x base)
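As a sanity check, here is a short Python script that reproduces the arithmetic above. The per-TB rates, the drive price, and the three usage scenarios are taken straight from this article; nothing else is assumed.

    YEARS, TARGET_TB = 5, 50
    DRIVE_TB, DRIVE_COST = 4, 120          # $120 per 4TB drive, from the text

    base = (TARGET_TB / DRIVE_TB) * DRIVE_COST / YEARS   # 12.5 drives -> $300/year
    print(f"bare HDDs: ${base:,.0f}/year")

    for name, rate in [("NAS value storage", 45), ("Swift object storage", 30)]:
        flat = rate * TARGET_TB                    # full 50TB for all 5 years
        linear = rate * TARGET_TB / 2              # linear growth averages 25TB
        stepped = rate * sum(10 * y for y in range(1, YEARS + 1)) / YEARS  # avg 30TB
        for label, cost in [("flat", flat), ("linear", linear), ("stepped", stepped)]:
            print(f"  {name}, {label}: ${cost:,.0f}/year ({cost / base:g}x base)")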

So to repeat: it all comes down to data value. 

  • Are your data (or the instrument time) valuable, or are you storing things that aren’t really crucial? 

  • How expensive is it to recreate your data in researcher and instrument time? Is it even possible?

  • How much of your own time are you willing to dedicate to the real and inevitable overhead?