babarchive - Manage babarchives, checksummed directory trees that can be validated.
babarchive_check_all, babarchive_prep_one, babarchive.cron
Babarchive is a system to manage babarchives, checksummed directory trees that can be validated.
It is designed to preserve digital archives of static content that are intended to last for decades or more. Its goal is to detect random corruption of data at rest and errors introduced during data copies.
A babarchive is a directory tree with a .shasum.fsdb and ls-lR in the root, as well as .shasum files in each subdirectory.
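To make the per-directory layout concrete, here is a minimal Python sketch that writes a .shasum file into each directory of a tree. It is an illustration under stated assumptions, not the code babarchive_prep_one uses: the function names are hypothetical, the line format is assumed to be the "digest, two spaces, file name" form printed by shasum(1), SHA-1 is assumed because it is the shasum(1) default, and the root-level .shasum.fsdb and ls-lR files are not produced.

```python
import hashlib
import os

CHECKSUM_NAME = ".shasum"  # per-directory file name, taken from the text above


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_dir_checksums(root):
    """Write a .shasum file into every directory under root.

    Assumed line format: "<hex digest>  <file name>", as printed by shasum(1).
    Returns the number of files checksummed.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        lines = []
        for name in sorted(filenames):
            if name == CHECKSUM_NAME:
                continue  # do not checksum the checksum file itself
            digest = file_digest(os.path.join(dirpath, name))
            lines.append(f"{digest}  {name}\n")
            count += 1
        with open(os.path.join(dirpath, CHECKSUM_NAME), "w") as out:
            out.writelines(lines)
    return count
```

If the assumed line format holds, each directory could then be checked by hand with "shasum -c .shasum" run from inside that directory.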
Babarchives are managed with two tools:
babarchive_prep_one(1) creates (prepares) a new babarchive. babarchive_check_all(1) tracks all archives on the local system and incrementally re-validates them.
In addition, babarchive_check_one(1) validates a specific archive.
The script babarchive.cron(1) runs babarchive_check_all(1) and is intended to be run daily from cron (perhaps with anacron).
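One way to picture incremental re-validation from a daily cron job is a scheduler that checks the least recently validated archives first, within a daily budget. The Python sketch below is a hypothetical illustration only: the function name, the byte budget, and the (path, size, last-checked) record are assumptions, and this text does not say how babarchive_check_all actually chooses what to check.

```python
def pick_for_today(archives, budget_bytes):
    """Choose which archives to re-validate in one daily run.

    archives: list of (path, size_bytes, last_checked_epoch) tuples.
    Archives are considered oldest-checked first and selected while they
    fit in the remaining budget. The oldest archive is always selected,
    even if it exceeds the budget, so that no archive is starved forever.
    """
    chosen = []
    remaining = budget_bytes
    for path, size, _last_checked in sorted(archives, key=lambda a: a[2]):
        if size <= remaining or not chosen:
            chosen.append(path)
            remaining -= size
    return chosen
```

After a run, the caller would record the current time as the new last-checked value for each selected archive, so the next daily run moves on to the others.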
The overall idea: when saving data, you don’t get what you deserve, you get what you verify.
Babarchive provides end-to-end checks on the validity of a directory tree.
We assume that you run babarchive_check_all(1) regularly on your file server. It detects the following problems or events:
It saves two copies of checksums so partial corruption can be partially recovered (by hand), and standard tools can be used to assist in partial recovery or verification.
Babarchive has been in use since 1998 for archives of media files and academic datasets. It has detected two silent bit-flips on disks over that time period. As of 2016 it protects more than 100 TB of data at USC/ISI.
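The value of keeping two copies of each checksum can be sketched as follows: when the file and the two stored digests do not all agree, the pattern of agreement suggests which of the three is damaged. This Python fragment is an illustration under assumptions (the function name and result strings are hypothetical, and SHA-1 is assumed as in shasum(1)); it is not babarchive's recovery procedure, which the text describes as manual.

```python
import hashlib


def classify(data, digest_a, digest_b):
    """Compare file contents against two independently stored hex digests.

    Returns a short label naming the most likely damaged item.
    """
    actual = hashlib.sha1(data).hexdigest()
    if actual == digest_a == digest_b:
        return "ok"
    if actual == digest_a:
        return "checksum copy B damaged"  # file agrees with copy A only
    if actual == digest_b:
        return "checksum copy A damaged"  # file agrees with copy B only
    if digest_a == digest_b:
        return "file damaged"  # both copies agree with each other, not the file
    return "file and checksums disagree"  # nothing agrees; needs manual review
```

With only one stored checksum, a mismatch cannot say whether the file or the checksum was corrupted; the second copy is what breaks the tie.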
There are of course many alternatives, and many are good. Here are our thoughts on some of them.
Fancy file systems. File systems are great, and some modern ones (ZFS and btrfs)
checksum files on disk. We still like end-to-end verification, since data sometimes
lives on different file systems.
Off-line storage. Off-line storage has a place and can be cheaper.
Babarchive began as a way to validate off-line optical media.
We prefer on-line storage so validation can be automated.
RAID. RAID is great, but RAID is not perfect. We have lost data due to double failures during recovery.
Backup (including off-site) and verify.
Cloud backup. Cloud backup is great.
However, for some data, off-site storage may not be acceptable.
In addition, you are outsourcing your reliability to someone else.
We think you should still verify.
Super-archival storage. Some have proposed very durable storage,
such as etching bits in titanium that can be read by a microscope.
While perhaps appropriate for specific cases (a locked vault where
active data management is impossible),
we strongly prefer on-line storage as far cheaper and, with revalidation,
more reliable.
Finally, if you care about reading your data decades from now, we strongly encourage you to think about the data formats you choose.
Copyright (C) 2001-2016 by John Heidemann. License GPLv2 (only).
This program is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
babarchive(8), babarchive_check_all(1), babarchive_check_one(1), babarchive_prep_one(1), shasum(1).
J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, Vol. 2 (No. 4), pp. 277-288, November, 1984. [http://dx.doi.org/10.1145/357401.357402]