Silent data corruption

Alternate title:  Apple’s file system engineers are sadly naive.

I was quite disappointed to see that APFS isn’t even trying to provide data integrity.  Data integrity is kind of step 0 of any file system, and checksums or ECC are pretty much standard in modern & leading-edge file systems.  APFS doesn’t want to be one of those, it seems.

Case in point why this matters:

I have a bunch of old backup drives, because drives are cheap and until recently I could just buy a new one once the current one filled, instead of ever deleting a backup.  Periodically I go back through these old backup drives and do some basic integrity checks (S.M.A.R.T. bad block scans, file system checks, etc).

I also run a comparison of key data between those backups and the current versions on my computer, for files which generally shouldn’t change or disappear – e.g. photos, videos, key documents, etc.
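That comparison can be sketched roughly as below.  This is a minimal illustration and not the exact script I run: the function names (`sha256_of`, `compare_trees`) and the idea of walking the backup tree and hashing each file against its counterpart are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(backup_root, current_root):
    """Compare every file under backup_root with the same relative path under current_root.

    Returns (missing, differing): relative paths that no longer exist in the
    current tree, and relative paths whose contents no longer match.
    """
    backup_root, current_root = Path(backup_root), Path(current_root)
    missing, differing = [], []
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue
        rel = backup_file.relative_to(backup_root)
        current_file = current_root / rel
        if not current_file.is_file():
            missing.append(rel.as_posix())
        elif sha256_of(backup_file) != sha256_of(current_file):
            differing.append(rel.as_posix())
    return missing, differing
```

A mismatch only says the two copies differ, not which one is damaged (and an intentional edit looks the same as corruption), so each hit still needs a manual look before anything gets overwritten.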

And today I found that at least half a dozen valuable personal videos (and a few photos) were corrupt in the versions on my computer.  Luckily, the versions in the ancient backups were still good, so I could replace the corrupt ones.

This corruption was completely silent, until my ‘paranoid’ and time-consuming checks discovered it.

It’s far from the first time.  A failing drive years back corrupted a huge portion of my music library – silently, as far as the file system & OS were concerned.  Periodically I’ve discovered photos (of which I have huge numbers – the majority of my data) which have become corrupt at some indeterminate point.  And I’ve of course had file system [metadata] corruption occur many times, sometimes requiring complete erasure of the disk, and recovery or rebuilds from backup (a few times I’ve had to use data recovery software, where backups weren’t available).

Most, if not all, of these issues would have been discovered by even the most trivial file integrity protections, in the file system.

The notion that modern disks somehow magically protect against all silent data corruption is abject poppycock.  They’re more likely to suffer from it than older disks – a byproduct of higher densities and market demand for cheaper, crappier storage products.

And the implicit assertion that Apple’s file system driver, and kernel overall, are somehow completely free of bugs… is just batshit crazy.

Addendum

Since Apple aren’t interested in protecting anyone’s valuable personal data, I’m on the look-out for other options.  Manual use of shasum is one, for now, but a more streamlined and fool-proof system would be better.  Alas, none seems to exist[1. There is chkbit, but it relies on MD5… probably acceptable for this use case, but needless in the face of decades of better hash algorithms.  And it’s written in JavaScript.  Ew.].  Yet.
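As a stopgap, the manual shasum routine could be wrapped up along these lines.  This is only a sketch under some assumptions: the names `write_manifest` and `verify_manifest` are made up for the example, the manifest is kept outside the tree it describes, and file names are assumed to contain no newlines or backslashes (shasum escapes those, and this sketch does not).  The manifest uses the "digest, two spaces, path" layout, so it should also be checkable with `shasum -a 256 -c` when run from the root directory.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Record a digest for every file under root; returns the number of files recorded."""
    root = Path(root)
    lines = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            # "<hex digest>  <relative path>", one file per line.
            lines.append(f"{file_sha256(p)}  {p.relative_to(root).as_posix()}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def verify_manifest(root, manifest_path):
    """Re-hash each file listed in the manifest; returns a list of (path, problem) tuples."""
    root = Path(root)
    problems = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif file_sha256(target) != expected:
            problems.append((rel, "changed"))
    return problems
```

The idea is to run `write_manifest` once after importing files that should never change, then run `verify_manifest` periodically.  An empty result means every recorded file still matches.  Files added after the manifest was written are not checked until the manifest is regenerated.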

Yay! An actual outage! I’m a real blog now!

I woke up this morning to find that my website – this one – had gone down only a few minutes earlier.  The host’s website, Gandi.net, was acting flaky and not letting me log in either, so I figured it was a widespread issue on their end.

A few hours later, Gandi.net was working again, but my site wasn’t.  Sad panda.

Long story short, I used up all the disk space – and by “I”, I mean something – still haven’t figured out what, yet.  Apparently when you use up all the space, that simply kills the VM without any notification (their dashboard for my VM still claimed it was running just fine, no problems detected, which was obvious crap).

That said, their tech support identified the problem quickly and were ultimately able to rectify things for me (after first suggesting I delete some stuff myself, which I tried only to find that when your Gandi VM is wedged in this state, you can’t log in via SSH nor delete anything via SFTP, and those are your only two means for deleting any files…).

As far as I can recall, this is the first time my site’s actually been down in the ~four years I’ve hosted with them (other than a few errors on my part when messing with WordPress etc).

Why I cancelled Backblaze

This is the feedback I sent to Backblaze shortly before I cancelled my account with them.

For additional context – the restore failure I alluded to was basically that:

  1. Over the course of more than a week and repeated attempts, they were unable to restore 99.7% of my data.
  2. They sent me 685 spammy emails telling me the restore failed.  Six hundred and eighty-five.
  3. Their tech support was at least fairly open, and admitted to the problem without fuss, but were unable to actually do anything to get the data back.  Which is, after all, the most important thing.

So, the departing ‘support’ ticket I filed with them (#167833):

Maybe this’ll help your future would-be customers.

The main reason is that when I tried to actually restore data a month or two ago, I was unable to. Epic fail on your part. (Support request #162743, FYI)

That alone is a deal-breaker. The lacklustre customer support and idiotic email spam bugs add icing on that horrible cake.

There are other reasons too, however:

• There’s no secure way to restore. You require me to provide my private key password to your web site. So many ways that can go wrong. I want something more akin to Crashplan’s ability to restore through a local app [once it’s given the private key password]. I should never, *ever* have to transmit my private key password over the internet.

• 30 day inactivity window. I recently travelled away from home for nearly 30 days, and realised that if I’d been gone a little longer, you would have thrown out all my backups. If I’m still paying you, you should still be retaining my backups. (and since *all* my drives are external, including my boot drive, this applies to *all* my data. Even if my boot drive weren’t external, the vast majority of my valuable data is on [permanently connected] external drives)

• 30 day restore window. I’m somewhat on the fence with this one, but other backup services offer retention horizons much longer, or alternative schemes where you have up to N (typically 30) *versions*, regardless of how old those are. Both are preferable to a fixed time window. Since the vast majority of my data is write-once, or thereabouts, I don’t actually have multiple versions of most, but for those that do it’d be comforting to know that I could go back months or years to the prior version(s). I’d be willing to pay extra for this.

• The Backblaze daemon takes an unduly long time to notice new files. Even if I manually tell it to hurry up (i.e. option-click ‘Restore options…’) it still sometimes doesn’t notice new files. I see good reason to not be too hyperactive with backups – it’s true I don’t need every minute’s version of some random file I’m working in – but most of my data is photos & videos, which are import-once-and-never-change (or maybe delete, later). I’d really like to just see Backblaze immediately start backing up newly imported photos & videos as soon as they hit the disk.

I’ve realised that I need all these things, and as it happens Crashplan offers them, so I’m switching.

To your advantage, uploads are much faster than many of the alternative services I tested (particularly Crashplan), and I otherwise like your native app and its relatively minimal system impact. So I’m a little sad to see that go. But simple, fast uploads are quite pointless if, when all is said and done, they’re essentially going to /dev/null.