Media File Integrity for NAS

I'm looking for a tool to help me preserve the integrity of my media collection. I have a backup scheme, but this weekend I discovered that a handful of my files differed in checksum from the backup. It was a huge PITA working out which was the "right" file in each case, and I'd rather not go through that hassle again. (The corruption turned out to be due to an ungraceful shutdown, by the way.)

The features I'm after (assuming the tool works by checksumming; there's a rough sketch of what I mean just after this list):
* works recursively
* checksum data doesn't have to go into a file in every subfolder, but can instead just be one file at the root (with relative paths of course)
* checksum data can be UPDATED in place (i.e. one command removes the entries for any deleted files and generates and adds entries for any new files) instead of having to be regenerated from scratch
* checksum data (along with this tool) can be used on different systems with (almost) zero extra configuration (i.e. no tools intended for "full system" integrity)
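To be concrete, here's roughly the behaviour I'm imagining. It's a throwaway Python sketch, not a real tool, and the manifest name 'checksums.sha256' is just a placeholder:

#!/usr/bin/env python3
# Throwaway sketch: walk a tree and write ONE SHA-256 manifest at the root,
# with paths stored relative to that root.
import hashlib
import os
import sys

MANIFEST = "checksums.sha256"  # placeholder name, not any real convention

def file_sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root="."):
    root = os.path.abspath(root)
    entries = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel == MANIFEST:
                continue  # don't checksum the manifest itself
            entries[rel] = file_sha256(os.path.join(root, rel))
    with open(os.path.join(root, MANIFEST), "w") as out:
        for rel in sorted(entries):
            out.write(f"{entries[rel]}  {rel}\n")

if __name__ == "__main__":
    build_manifest(sys.argv[1] if len(sys.argv) > 1 else ".")

The '<hash>  <relative path>' lines are the same shape sha256sum produces, so on another machine the manifest could be verified with plain 'sha256sum -c' from the root, no special tooling needed.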

I've tried out 'cfv' and 'cksfv'. 'cfv' works recursively and lets you generate one single file, but has no functionality (that I can see) to "update" an existing .sfv. 'cksfv' doesn't let you put everything in one single file - the checksums have to be spread across all the subfolders - and it also can't "update".

I looked into "aide" but this is obviously meant for "full system" use, and seems to need non-trivial configuration to even get working, but its database approach seems promising.

I'm at a loss. I struggle to believe that I'm the first person with this issue, so there must be something others use daily with success. Any ideas?

P.S. I'm on GNU/Linux, of course.

>there must be something others use daily with success
Good hard drives and proper shutdowns

Thanks for your very constructive comment.

Something similar would be SnapRAID, but without the need for a parity disk, just the checksums.

ZFS. Just use it.

Fuck, read this:

Meme.

Literally use fucking ZFS, you idiot, it does exactly what you want.

btrfs

>P.S. I'm on GNU/Linux, of course.
Then it's easy. Just do what others have said ITT and use a good checksumming filesystem like ZFS or Btrfs.
Data integrity is the sort of thing that should be handled by the filesystem.

ZFS is all of that in a nutshell.
And it's available on Linux with only minor issues.

ZFS's RAM requirements are too high for it to be useful to me, and it has "lock-in" properties that make me not want to use it.

ZFS has no minimum RAM requirement; you only need large amounts of RAM if you're going to be using the entire array constantly.

btrfs
But really ZFS

Well, I said ZFS or Btrfs, but I really meant Btrfs.
ZFS is not even really open source.

ZFS only has high RAM requirements if you use block-based deduplication or a very large L2ARC device.

Otherwise, it uses free memory for caching like every other filesystem.

linuxatemyram.com/

Hmm. I've been using XFS for a while now because of its reputed performance benefits with large files (and indeed almost all of my files are 'large'). I use Btrfs for my /, but it never occurred to me to use it for media - I've always been put off by what people say about its reliability, though admittedly I've never had the chance to pursue those claims very far.

If you don't want to change filesystems, look at git-annex. It keeps file metadata in a git repo and can track multiple copies (including backups).

If it detects file corruption, it can recover the file from a backup or another replica.
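
The core idea is content-addressed storage: the actual bytes live under a path derived from their checksum, and a symlink sits where the file used to be. A toy illustration of the idea, nothing to do with git-annex's real code ('.objects' is just a made-up store directory):

#!/usr/bin/env python3
# Toy content-addressed store: move a file under a hash-derived name
# and leave a symlink behind where it used to be.
import hashlib
import os
import sys

def add_to_store(path, store_dir=".objects"):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    obj = os.path.join(store_dir, digest)
    if os.path.exists(obj):
        os.remove(path)       # identical content is already stored
    else:
        os.rename(path, obj)  # assumes the store is on the same filesystem
    os.symlink(os.path.relpath(obj, os.path.dirname(path) or "."), path)

if __name__ == "__main__":
    for p in sys.argv[1:]:
        add_to_store(p)

With that layout, checking for corruption is just re-hashing an object and comparing against its own filename, and restoring it means copying the object with that name from another copy of the store.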

>git-annex uses git to index files but does not store them in the git history. Instead a Symbolic link representing and linking to the probably large file is committed. git-annex manages a content-addressable storage for the files under its control. A separate git branch logs the location of every file. Thus users can clone a git-annex repository and then decide for every file whether to make it locally available.
Sounds interesting, but still really over-engineered.

I might just end up using cfv and accept making .sfv files for each subfolder. At least there's an option to check which files aren't described by the current checksum set.

Part of the problem is that my server only has 2GB of RAM. Well, "server": I'm using an ARM board for convenience and power efficiency.

I think it's clear that I just want a checksumming database tool, without all the extra bells, whistles, and complexity aimed at vastly different use-cases.
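
Honestly, an "update" pass along these lines would probably cover it (still a rough sketch under the same assumptions as before, with 'checksums.sha256' as a placeholder manifest name): drop entries for files that are gone, hash files that aren't listed yet, and leave everything else alone.

#!/usr/bin/env python3
# Toy "update" pass over a root-level manifest: remove entries for files
# that no longer exist, add entries for files that aren't listed yet, and
# deliberately leave existing entries untouched so later corruption still
# shows up as a mismatch.
import hashlib
import os
import sys

MANIFEST = "checksums.sha256"  # placeholder name

def file_sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def update_manifest(root="."):
    root = os.path.abspath(root)
    manifest_path = os.path.join(root, MANIFEST)
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            for line in f:
                digest, _, rel = line.rstrip("\n").partition("  ")
                old[rel] = digest
    new = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel == MANIFEST:
                continue
            new[rel] = old.get(rel) or file_sha256(os.path.join(root, rel))
    with open(manifest_path, "w") as out:
        for rel in sorted(new):
            out.write(f"{new[rel]}  {rel}\n")

if __name__ == "__main__":
    update_manifest(sys.argv[1] if len(sys.argv) > 1 else ".")

Verification after that is just 'sha256sum -c checksums.sha256' run from the root, or an equivalent loop that re-hashes and compares.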