Are full books written to .txt files the most efficient way of compressing lots of information into as small a space as possible? Or is there an even more efficient way?

x

in my constructed language, that character contains the entirety of the western canon

.txt is uncompressed. Something like EPUB, which is designed for books, would be better since it's basically a zip file, but given that EPUB requires extra data to be usable, raw text compressed in an efficient manner would be the best.

>raw text that is compressed in a efficient manner would be the best

Is that a thing that currently exists tho? Would taking 12 GB of .txt files and compressing them into a .zip file make them lighter?

English text usually compresses about 3:1 with zip. I don't know anywhere that makes available text files in that format, but you're free to compress your own stash.
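
If you want to sanity-check that ratio on your own files, here's a minimal sketch using Python's standard zlib module; the book.txt path is just a placeholder.

```python
import zlib

# Placeholder path: point this at any plain-text book you have lying around.
path = "book.txt"

with open(path, "rb") as f:
    raw = f.read()

compressed = zlib.compress(raw, level=9)  # level 9 = best compression zlib offers

print(f"original:   {len(raw):>10} bytes")
print(f"compressed: {len(compressed):>10} bytes")
print(f"ratio:      {len(raw) / len(compressed):.2f}:1")
```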

Yes, and very significantly. PAQ could easily make it 5 times smaller.

Yes they would, considering actual books contain lots of repeated phrases and words.
I had a 4 MB plain-text log file; using ultra compression in 7zip, the compressed size is ~1/10th of the original.

>I don't know anywhere that makes available text files in that format, but you're free to compress your own stash.

It wasn't really a matter of distribution so much as wondering what the theoretical way to store the most information in the least amount of space possible is. Like, physical-space-wise; that's why I automatically jumped to .txt files, since they're lighter than EPUBs, so you can fit more onto a smaller memory, and a smaller memory can have a smaller physical size.

Now I'm wondering how much space and memory it would take for every single book ever written, every movie script, and every game concept to be written down, compressed, and fitted into the smallest possible space.

I suggest you look into information theory. It deals with the separation of information content from the symbols used to encode them.

The more predictable a given stream of text is, the lower information density it tends to have. Compression deals with replacing repeated patterns with abbreviations. In theory, once the stream is reduced to a completely unpredictable form, it is pure "entropy", and cannot be reduced any further.
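
To put a rough number on that, here is a sketch that estimates the zero-order (per-byte) entropy of a file, i.e. the lower bound for a compressor that only looks at single-character statistics; real compressors exploit longer patterns and do better. The path is a placeholder.

```python
import math
from collections import Counter

# Placeholder path: any plain-text file will do.
with open("book.txt", "rb") as f:
    data = f.read()

counts = Counter(data)
total = len(data)

# Zero-order entropy in bits per byte: -sum(p * log2(p)) over byte frequencies.
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"entropy: {entropy:.2f} bits/byte")
print(f"zero-order lower bound: {total * entropy / 8 / 1024:.1f} KiB "
      f"(vs {total / 1024:.1f} KiB raw)")
```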

One byte can be expressed by two atoms. You'd only need about 8 million atoms to express a book the size of the Bible.

Technically you'd only need one atom. Don't forget about isotopes.

is that how compression works? they just set something like pointers to places with repeated words or phrases / symbols?

How do you compress a word without losing letters? It doesn’t make any sense unless the program knew some stenographer shorthand

kek

A byte holds way more possibilities (256) than there are standard characters (~60). One byte can easily be 4 characters. By compressing from ascii to pure binary you can cut out a ton of space

Thats wrong ya faggot

It’s not the words they remove dumbass.

The compression algorithm removes all the white space in the document, stuff like double spaces, indents, and even spaces between words.

look up Huffman coding for a simple method of compressing text
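
For anyone curious, a toy Huffman coder fits in a couple dozen lines. This sketch only builds the code table and counts the bits the input would take; a real implementation would also pack the bits and store the table so the output can be decoded.

```python
import heapq
import itertools
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Build a Huffman code table mapping each character to a bit string."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: one distinct character
        return {next(iter(freq)): "0"}
    tiebreak = itertools.count()            # keeps heap comparisons away from the dicts
    heap = [(n, next(tiebreak), {ch: ""}) for ch, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix '0' onto the codes in the left subtree and '1' onto the right.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

text = "how do you compress a word without losing letters"
codes = huffman_code(text)
bits = sum(len(codes[ch]) for ch in text)
print(f"{len(text) * 8} bits as ASCII -> {bits} bits Huffman-coded")
```

Deflate (what zip uses) combines this kind of coding with LZ77-style back-references to earlier occurrences of the same data.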

It would take at least 2 bytes to print 4 characters. 3 if you include capitalization

>How do you compress a word without losing letters?
There's lots of ways to do it. For example, if you have a document that contains the word 'without' more than once, you can just keep a table that says ' =without' and replace every instance of the word in the document with that symbol. But the other user is right, real implementations like deflate use Huffman coding.
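
A crude sketch of that table idea, just to make it concrete; real dictionary coders store the table compactly in a header rather than as readable text, and the \x01 token here is just an arbitrary byte chosen because it doesn't appear in the input.

```python
def dict_compress(text: str, word: str, token: str) -> tuple[str, str]:
    """Replace every occurrence of `word` with a short token and return
    (compressed text, table entry), a toy stand-in for a real dictionary coder."""
    assert token not in text, "token must not already appear in the text"
    return text.replace(word, token), f"{token}={word}"

text = ("you can compress a document without losing letters, "
        "without shorthand, and without magic")
packed, table = dict_compress(text, "without", "\x01")
print(len(text), "->", len(packed) + len(table), "chars including the table")
```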

.txt.xz
Also could possibly benefit from using a 6-bit encoding.
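
For reference, here's a toy of what a 6-bit encoding buys you on its own: a fixed 25% saving before any real compression, at the cost of only handling a small fixed alphabet and throwing away capitalization.

```python
# Toy 6-bit packing: up to 64 symbols fit in 6 bits per character instead of 8.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;:'\"!?()-\n"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
assert len(ALPHABET) <= 64  # every symbol must fit in 6 bits

def pack6(text: str) -> bytes:
    """Pack text (lowercased, limited to ALPHABET) into 6 bits per character."""
    bits = "".join(format(CODE[ch], "06b") for ch in text.lower())
    bits += "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sample = "the quick brown fox jumps over the lazy dog"
print(len(sample), "bytes as ascii ->", len(pack6(sample)), "bytes packed")
```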

Depends on the algorithm, but yes, pretty much

The 200 most common words could each be compressed to a single byte. Interesting

And that's 80% of books
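
If you want to measure how much of an actual book the 200 most common words cover rather than take that figure on faith, something like this will tell you; the path is a placeholder.

```python
import re
from collections import Counter

# Placeholder path: any plain-text book.
with open("book.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
top200 = sum(n for _, n in counts.most_common(200))

print(f"top 200 words cover {100 * top200 / len(words):.1f}% "
      f"of the {len(words)} word occurrences")
```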

There are better ways of going about it. You have to compress the information itself, not just words in text files. Storing everything in a 1 TB .txt file and compressing it with 1 TB of RAM with the best algorithm and dictionary we currently have would be a start though.

Wtf are you talking about retard

I created a couple of .txt files just to test that out real quick, and dividing the text into multiple files gave me the same final amount of space taken up as a single document with all the text in it. Though then again, I don't believe Windows can show you differences smaller than a single byte.

What compression are you using, though?

Not nearly. One of the most efficient ways known right now is phda compression. But you'll probably use zpaq or something instead.

Note that they're on the extreme side of a time-space trade-off. They're not the extremely fast utility compressors (lz4, zstd on fast settings) or the relatively fast ones (zip, lzma2, bzip2, ...) that many applications actually use.

prize.hutter1.net/index.htm
mattmahoney.net/dc/text.html#1440

BTW there is almost never a reason not to run zstd over your .txt files if you want a cheap space efficiency gain.
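
If you'd rather drive that from Python than the command line, the third-party zstandard package (pip install zstandard) is one way to do it; the path is a placeholder.

```python
import zstandard  # third-party binding for zstd: pip install zstandard

# Placeholder path: any plain-text file.
with open("book.txt", "rb") as f:
    raw = f.read()

# Level 19 is near the top of zstd's normal range; the default (3) is much faster.
compressed = zstandard.ZstdCompressor(level=19).compress(raw)
restored = zstandard.ZstdDecompressor().decompress(compressed)

assert restored == raw
print(f"{len(raw)} -> {len(compressed)} bytes")
```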

Let me dumb it down for you. Efficiency tier list for storing books on your hard drive, from least to most efficient:

>A bunch of scanned pages in a PDF
>Compressed EPUBs
>Just the raw text data of the books compressed as one single blob
>That same blob compressed with an algorithm specifically designed for English text, utilizing infinite RAM and processing time
>Something even more advanced, like an AI that writes books for you given certain parameters

Whether you split it by files doesn't mean jack shit. I meant storing all the data you want in the same raw format, as plain text (with zero metadata or formatting information, not as RTF, LaTeX, EPUB, DOCX, PDF, etc.), and compressing all of the data at once (so the compression algorithm can factor in everything you want to compress and make decisions based on that, not just the first 500 megabytes or something).
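
To see why compressing everything as one blob helps, here's a sketch comparing per-file compression against compressing the concatenation, using the standard lzma module; the file names are placeholders standing in for a few related plain-text books.

```python
import lzma
from pathlib import Path

# Placeholder paths: several related plain-text files, e.g. a set of books.
paths = ["book1.txt", "book2.txt", "book3.txt"]
blobs = [Path(p).read_bytes() for p in paths]

separate = sum(len(lzma.compress(b, preset=9)) for b in blobs)
together = len(lzma.compress(b"".join(blobs), preset=9))

# Shared vocabulary and phrasing across the files only gets exploited when
# they're compressed together, which is why solid archives tend to win.
print(f"compressed separately:  {separate} bytes")
print(f"compressed as one blob: {together} bytes")
```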

7z can compress the everloving shit out of .txt files. Try it sometime.

Not if that ".txt" is UTF-16 or UCS-2, like the cancer Microsoft loves to put out.