Long-term storage of data
Happy New Year, doubly, and welcome to the January issue of Hacker Chronicles!
I hope the start of 2023 has been good to you. We had storms and enormous amounts of rain in California. The water is great for the snowpack in the Sierra Nevada and for our reservoirs, but it caused a lot of damage in rural areas. 21 people died. 😢 One of the beaches we like to go to, Capitola Beach, was among the hardest hit.
Since then it’s calmed down and now California is green.
This issue of my newsletter features a topic I find fascinating and which plays a key role in the plot of my novel Identified, namely long-term storage of data.
/John
Writing Update
One of my new year’s resolutions is to finish my sequel. The project name is Submerged but I’ll be sure to let you vote on the final title just like last time.
The definition of “finished” is a bit fluid. I’ll have Draft 0 when I’ve written the final sentence. Then I’ll fix all my TODOs, which will get me to Draft 1. I’ll print a copy at this point, read it, take notes in the margin, and work through all those notes to produce Draft 2. Then I’ll do a few dedicated passes on specific things like character voice, descriptions, conflict, and any loose ends. At that point I will have Draft 3 and the novel is ready to be read by another human for the first time.
Five brave alpha readers will then read it and provide feedback. I’ll most probably do a round of professional developmental editing for Draft 3 too. Addressing all feedback will get me to Draft 4, also known as the Beta Read Copy. This may be as far as I get during 2023.
Ideally, I’d also do the beta reader round. That’ll be about 20 people reading and providing feedback through questionnaires. Once their feedback is addressed, I’ll do one more professional developmental edit, make changes, pay for copy edit, make changes, pay for proofreading, make changes, and publish. Cover design will be commissioned in parallel.
It’ll be exciting to do this for the second time, not least to see how much I’ve learned since the first time around.
This is where I am right now:
Long-Term Storage
I have always loved history. That might sound superficial but I got delayed a year in high school partly because of my love for history.
I chose social studies because they gave me the most hours of history class. After the first year I realized that what would get me to the kind of career I wanted, astronomy or computers, was STEM, and I got permission to start over.
I remember my mom being really supportive, telling the school to make it happen. I’ll be ever grateful for that.
Learning History Through Human-Made Things
A key thing for historians is longevity of information. I mean information in a broad sense, including stories and sacred writings. Some prime examples:
- The cave paintings in Lascaux, 17,000 years old.
- Hieroglyphs, 5,000 years old.
- Oracle bones, 3,600 years old.
- Vedic Sanskrit hymns, 3,500 years old.
- The Dead Sea Scrolls, 2,300 years old.
- The Rosetta Stone, 2,200 years old.
- Runestones, 1,600 years old.
The cave paintings are spectacular for their age but the other cases used script to encode information which then stood the test of time. Those writings have allowed us to learn what people were thinking thousands of years ago. Sure, through some significant filters such as history written by victors, but still.
There are also tragic cases of destruction of information. The burning of parts of the Great Library of Alexandria and the Nazis’ book burnings come to mind.
Digital Storage
With everyday use of computers, we are producing mind-blowing amounts of written information per day: social media, text messages, emails, blogs, and notes. Most of this information ends up on servers, which enables us to access it on multiple devices.
“On servers” doesn’t tell the whole story though. Information is typically stored on either spinning hard drives (HDDs) or in flash memory (SSDs). But magnetic tapes are still a thing and now have a capacity of 45 TB compressed per cartridge.
Modern magnetic tape by Fujifilm.
So-called hot storage is quickly accessible, whereas cold storage can take days to retrieve. Cold Storage is not a bad title for a novel.
Digital Data Protection
Digital data, just like information on paper, can be partially or wholly destroyed. On those drives, there’s something called bit rot where ones and zeroes are flipped to their opposite, effectively corrupting the data they encode.
There are two strategies to make data survive bit rot or other forms of destruction: duplication and error correcting code, or ECC.
Backup is a form of duplication, ideally with journaling so that you can recover a file in the state it was in a week, a month, or a year ago.
An error correcting code is extra data padded on. It can be used to detect exactly how information has been corrupted and fix it. The code is smaller in size than the data it’s protecting so it’s not a case of duplication. If the corruption is too large, ECC cannot help fix it and only duplication works.
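To make the idea concrete, here is a minimal Python sketch of one classic error correcting code, Hamming(7,4). It is only an illustration and not what any particular drive uses; the function names are made up for this example. Three check bits are added to every four data bits, and together they can pinpoint and repair any single flipped bit.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword with 3 parity bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three check results spell out the position of a single flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


original = [1, 0, 1, 1]
stored = hamming74_encode(original)

rotted = list(stored)
rotted[4] ^= 1  # simulate bit rot: one bit flips on disk

recovered, position = hamming74_decode(rotted)
```

Here `recovered` equals `original` again and `position` reports which bit was repaired. If two bits flip in the same codeword, this small code repairs the wrong bit and returns bad data, which is the "corruption is too large" case where only a duplicate can help.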
One way to get a feel for ECC is a checksum, an extra digit at the end for basic validation purposes. A checksum can only detect an error, not fix it, but the principle of adding a little redundant data is the same. In everyday life you may come across such a checksum based on the Luhn algorithm. It is used for health care provider numbers in the US, South African ID numbers, Swedish national identification numbers, and Greek Social Security Numbers.
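For the curious, the Luhn check digit can be sketched in a few lines of Python. The function names are my own for illustration, and the sketch assumes the input is a string of digits only, without spaces or dashes.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk from right to left, doubling every second digit,
    # starting with the rightmost digit of the payload.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check a full number whose last digit is the Luhn check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


check = luhn_check_digit("7992739871")   # commonly used textbook example
ok = luhn_is_valid("79927398713")
typo = luhn_is_valid("79927398813")      # one digit mistyped
```

The check digit catches any single mistyped digit, but it cannot say which digit is wrong. That is the difference between detecting an error and correcting it.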
Data Correctness
You may assume that humans can spot when data has been corrupted. For instance, we can tell that a file is not possible to open at all or we can see or hear that something is wrong such as garbled text or audio glitches.
But lots of data doesn’t make sense to humans regardless of whether it’s corrupt or intact. It’s just ones and zeroes that a computer can interpret to perform something. The prime example is encrypted data. If such data is corrupted, there is no way for humans to tell without decrypting it and checking the data in its plaintext form.
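A computer can still notice the corruption without decrypting anything, provided a fingerprint of the data was saved while it was known to be good. The sketch below is an assumption-laden illustration of that idea using SHA-256 from Python's standard library. The random bytes stand in for real encrypted data, and `fingerprint` is a helper name made up for this example.

```python
import hashlib
import os


def fingerprint(blob: bytes) -> str:
    """Return the SHA-256 digest of a blob as a hex string."""
    return hashlib.sha256(blob).hexdigest()


# Stand-in for encrypted data: to a human it is just noise either way.
ciphertext = os.urandom(64)

# Saved alongside the data while it is known to be intact.
stored_digest = fingerprint(ciphertext)

# Simulate bit rot by flipping a single bit.
damaged = bytearray(ciphertext)
damaged[10] ^= 0b00000100
damaged = bytes(damaged)

still_intact = fingerprint(ciphertext) == stored_digest
rot_detected = fingerprint(damaged) != stored_digest
```

A digest like this only tells you that something changed. Repairing the data still takes ECC or a second copy.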
Watermarking
Some data corruption may also be too small for humans to detect, for instance a change in the shade of an individual pixel. This allows for covert storage of data inside other data, so-called steganography. One use of that is digital watermarking to track copyright or perform data authentication.
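Here is a minimal sketch of the simplest form of this trick, hiding a message in the least significant bit of each pixel value. It assumes the image is just a list of brightness values from 0 to 255, and the names `hide` and `reveal` are invented for the example. Real watermarking schemes are considerably more robust than this.

```python
def hide(pixels, message: bytes):
    """Hide message bits in the lowest bit of each pixel value (0-255)."""
    bits = [(byte >> (7 - i)) & 1 for byte in message for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for this image")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # overwrite only the lowest bit
    return out


def reveal(pixels, length: int) -> bytes:
    """Read back `length` bytes from the lowest bits of the pixel values."""
    data = bytearray()
    for b in range(length):
        byte = 0
        for i in range(8):
            byte = (byte << 1) | (pixels[b * 8 + i] & 1)
        data.append(byte)
    return bytes(data)


image = list(range(100, 180))   # 80 made-up grayscale pixel values
marked = hide(image, b"hi")
secret = reveal(marked, 2)
biggest_change = max(abs(a - b) for a, b in zip(image, marked))
```

No pixel changes by more than one shade out of 256, which is why the eye cannot tell the marked image from the original.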
In Fiction
There are several examples of how storage of data plays into modern fiction.
Johnny Mnemonic stores data in his brain. In my review, I reason about the redundancy he likely needs to keep digital data intact in storage as fuzzy as biological tissue.
In the Minority Report novella, Anderton suspects someone has inserted a rigged punch card to frame him (my review here). The author Dick refers directly to the inability to know how far back data was deliberately corrupted.
In the Minority Report movie, there are three instances of data storage as plot points. First, the way the precogs vote to decide on what the correct data is if they don’t all agree. Second, the long-term storage of criminal case data that Anderton retrieves, where he finds a record missing. And finally, the way Director Burgess was able to inject an almost identical piece of data and have Precrime staff accept it as a true duplicate. Check out my review here.
In the movie Hackers, data related to crime is hidden in trash data. I reviewed it here.
In My Novel
Data storage, steganography, bit rot, and error correcting code are all part of the plot in my novel Identified. The backup in the cave, the tracking dots from laser printers, the bit flip in memory to hack into the rack-mounted server, and the Russians’ use of JBIG2 images for key sync are a few examples.
I won’t go through all the research I did for those pieces but I’ll mention one: how G20S stores global identities for extreme longevity. Those sapphire discs West finds and later blows up aren’t pulled out of thin air. Here’s a quote from The Verge:
ANDRA, a French nuclear waste management agency, has decided to engineer data discs that will last nearly 10 million years using sapphire and platinum. Each disc costs over $30,000 to create, and will be made using an eight-inch round of industrial sapphire etched with platinum on one side.
An example of the sapphire discs.
Ten million years. That’s something. When I did this research for the book, I already knew of another fascinating long-term digital storage idea: dot patterns on paper. The fact that paper books have stood the test of time well means we could print information. We need a dense format to encode the data with, and researchers have developed such formats. You can see this idea in practice in the form of QR codes today.
An early draft of Identified had West find round containers with paper in a room adjacent to the one with the sapphire discs. The idea was that the Russian backup facility used two different long-term storage mechanisms which would be natural. But it made the plot too complicated and I dropped it.
Final Thoughts
Will what we produce in terms of information exist in the future? Does that matter beyond things that were really created for the long term, such as articles or books we wrote, pieces of art we created, and information showing that we once existed? The push today is to make more information ephemeral, i.e. go away shortly after it’s created.
A coworker at an earlier job originated from Eastern Europe and had grown up under a communist dictatorship. When I asked him about that experience, he told me something profound. One of the most important human abilities for coping with life and the world around us is to forget. You hear it in “Forgive and forget” and “time heals all wounds.” He was adamant that decades of exact copies of who said what will only make us sad. We grow and hopefully become better humans over the years. Society evolves. Part of that is not having to answer for every misstep in our pasts.
I baked that insight into Identified in a conversation between Kiss and West. Let me end by quoting it, and again pitch the idea of data decay:
“Don’t doom scroll your own past. I claim the right to forget.”
“You mean the right to be forgotten?”
The EU case where a Spanish man won the right to be erased from web search had just been settled as West and Melissa went on trial for their NSA hack. He remembered it as a profound victory for the little guy.
“Nope,” Kiss said. “The right to forget. Life is miserable enough as it is. I don’t want to be reminded of old shitty decisions or screwups or shitposts. People change and cope. Have you heard of data decay?”
West shook his head.
“Data decay makes all your messages and social media posts degrade over time and eventually go away, like human memory. If you periodically go read your old email, it stays fresh, but if you haven’t touched it in years, you just get a summary and eventually nothing. Man, don’t look at your blockchain. Live your life.”
Currently Reading
I’m reading Jacob Helberg’s The Wires of War which is a non-fiction book on cyber conflict. I’m also reading The Adventures of Huckleberry Finn for one of our kids.