Our organization is subject to data retention requirements over very long periods (50 years or more). Do your organizations face similar constraints? If so, what long-term retention strategy have you implemented for these regulated data, and what technologies or solutions do you use to ensure their durability and compliance?
Sounds like pharma requirements... I know a European pharma company using the PDF/A format on AWS Glacier to meet this requirement. It is extremely cheap to store, but retrieval is quite expensive and slow. The business case is that, from experience, less than 1% of the documents will ever need to be retrieved, and a 2-5 day retrieval time is acceptable when needed.
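For anyone evaluating that approach, here is a rough sketch of what the store and restore calls look like with boto3. The bucket and key names are made-up placeholders, and this only builds the request parameters; the commented usage shows where the real AWS calls would go.

```python
# Hedged sketch: archiving documents to S3 Glacier Deep Archive with boto3,
# and requesting a bulk restore when (rarely) needed.
# Bucket/key names below are hypothetical.

def archive_params(bucket, key, body):
    # DEEP_ARCHIVE is the cheapest S3 storage class; retrieval is slow.
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": "DEEP_ARCHIVE",
    }

def restore_params(bucket, key, days=7):
    # "Bulk" is the cheapest retrieval tier, which fits the
    # "2-5 day retrieval is acceptable" business case above.
    return {
        "Bucket": bucket,
        "Key": key,
        "RestoreRequest": {
            "Days": days,
            "GlacierJobParameters": {"Tier": "Bulk"},
        },
    }

# Usage (requires real AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(**archive_params("records-bucket", "batch-001.pdf", data))
#   s3.restore_object(**restore_params("records-bucket", "batch-001.pdf"))
```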
I have no direct experience, but I know that regulatory obligations in the nuclear industry require that design, build, and operational data and documentation be securely held for the full lifecycle of a nuclear installation, including decommissioning and thereafter, so roughly 75 years or more. There are also requirements that the data and its format be periodically inspected (i.e., can you still read PDFs?) and that storage media remain accessible and available, so that when a medium itself becomes obsolete the data and documents can be migrated onto current media. There are specialist data management solutions used by the nuclear industry to handle all of this, and they may cover life sciences or other industry needs too.
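The "periodically inspected" part is usually implemented as a fixity check: record a cryptographic hash of every file at ingest, then re-hash on a schedule and flag anything that no longer matches. A minimal sketch (the directory layout and function names are my own, not from any specific product):

```python
import hashlib
import pathlib

def fixity_manifest(root):
    """Record a SHA-256 digest for every file under root.

    The manifest is stored alongside the archive and used later
    to detect bit rot or silent corruption.
    """
    manifest = {}
    for p in sorted(pathlib.Path(root).rglob("*")):
        if p.is_file():
            rel = str(p.relative_to(root))
            manifest[rel] = hashlib.sha256(p.read_bytes()).hexdigest()
    return manifest

def verify(root, manifest):
    """Return the set of files whose current digest no longer matches."""
    current = fixity_manifest(root)
    return {name for name in manifest if current.get(name) != manifest[name]}
```

Run `verify` on each inspection cycle and before/after every media migration, so corruption is caught while a clean copy still exists.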
Consider storing all data as plain text if possible, and using immutable data stores. Fluree is one example of such a data store; Datomic is another. These are niche technologies, so few IT teams will have heard of them, and even fewer will have experience with them, but those that do seem to enjoy working with them. Interestingly, both products have open-source cores and are written in Clojure, a Lisp that runs on the Java Virtual Machine.
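To illustrate the idea (this is a toy sketch of the append-only principle behind stores like Datomic and Fluree, not their actual APIs): facts are never overwritten, every write appends a new transaction, and reads can ask for the value "as of" any past transaction.

```python
class ImmutableStore:
    """Toy append-only store: writes never overwrite, history is queryable."""

    def __init__(self):
        self._log = []  # list of (tx, key, value) tuples, append-only

    def assert_fact(self, key, value):
        # Each write gets a new transaction id; nothing is mutated in place.
        tx = len(self._log)
        self._log.append((tx, key, value))
        return tx

    def as_of(self, key, tx):
        # Latest value for key at or before transaction tx.
        for t, k, v in reversed(self._log):
            if k == key and t <= tx:
                return v
        return None

store = ImmutableStore()
t1 = store.assert_fact("status", "draft")
t2 = store.assert_fact("status", "final")
# store.as_of("status", t1) still answers "draft" even after the update.
```

For long-retention regulated data, this property is the attraction: the full history survives by construction, rather than depending on backup discipline.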
Plain text (UTF-16 offers a reasonable compromise between the range of characters supported and the space required to store them) has a good chance of being readable by something even a hundred years from now.
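The space side of that compromise is easy to measure: UTF-8 is smaller for Latin-script text, while UTF-16 is smaller for scripts such as CJK. A quick check (using UTF-16-LE to exclude the 2-byte byte-order mark that plain "utf-16" prepends):

```python
def byte_cost(text):
    """Bytes needed to store text in UTF-8 vs UTF-16 (little-endian, no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

# Latin text: 1 byte/char in UTF-8, always 2 in UTF-16.
print(byte_cost("archive"))   # {'utf-8': 7, 'utf-16-le': 14}

# CJK text: 3 bytes/char in UTF-8, 2 in UTF-16.
print(byte_cost("日本語"))     # {'utf-8': 9, 'utf-16-le': 6}
```

Whichever encoding is chosen, the archival-critical step is recording that choice (and the byte order) alongside the data, so a future reader does not have to guess.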
The history of long-term data storage is not encouraging. NASA lost data from the Apollo program, its premier program, and it was gone. When the US Census Bureau tried to retrieve data from the 1960 census, it had to borrow a computer from the Smithsonian and then track down and bring out of retirement the engineers who could configure and run it.
Thinking of burning everything to CD? The last manufacturer of optical disc drives for personal computers quit the market in 2025; the units left on shelves are all that remain commercially available. CDs also degrade over time; the material is not a multi-decade storage medium. The same goes for USB sticks.
Data formats change every decade. I worked on systems built in the 1960s, the '80s, and onward. It was an amazing education, but it has kept my head spinning over the years on how to achieve long-term storage.
Microsoft has been researching century-plus data storage under Project Silica. It is beyond your needs, but it is interesting.
Here is a nice mental exercise for your team: if you find an old PC or laptop in the closet, you can fire it up and log in (or break in). That XP system in the closet will start Office and run all of its installed applications. The web browser will open and you can access the web. It isn't safe, but it works.
With most software delivered as cloud SaaS in 2025, jump ahead to finding your 2025 PC in the year 2035. You blow off the dust, power it up, and log in... oh wait, you don't have your MFA to get in, and those credentials are not stored locally. You break in, but Office 365 is cloud-based and doesn't work.
I've scratched my head on this one!
Text-based data are best stored as plain text, as Frank states below. The harder problem is more complex data: engineering drawings, math equations, images, etc.
What a wonderful project you are tasked with. You will remember this one for years to come!