Thursday, October 4, 2018

Reduced Length LZ (RLLZ)

Let me share here my one way to output LZ77 codes, which may improve LZ77/LZSS or LZW.

LZ77:

LZ77 coding transmits <offset, length> codes. Many times in the compression process, the same matching string is encoded with an <offset> and the same <length> code. We can avoid transmitting the <length> code for many strings as well as not outputting a bit to identify a literal (as in LZSS) by the following: 

Instead, read the whole input or block and gather duplicated or similar strings and encode <the whole string> (e.g., <length of the string (n>=2)> plus the actual string of characters) only this one time. The match lengths are exactly the same for the repeat strings so it is transmitted only one time, in defining the first string. And then <the number of "next" occurrences of the same matching strings (i.e., how many following offset codes after first offset)> and their succeeding occurrences in the input block by transmitting only the said <offset> codes. (Or you can use an escape code to end the offset codes, perhaps BLOCK_SIZE-1.) Do this for all other duplicated strings. This is actually a "Reduced Length LZ (RLLZ)". 

The literals are outputted last *without bit flags* since they fall into the block buffers not covered or "not activated" by encoded strings. So just one array of literals can be outputted maybe at the end of the file, or, since block-based coding is more practical for shorter offset codes, at the end of the appropriately-sized block of <offsets>.

During decoding, the strings are written first in the output or write buffer using the offsets, and the literals "fill in" the unwritten positions or "holes" in the write buffer. The output buffer is then written to actual file. This makes very compact LZ. (Note that there is no "history" or "window" buffer and hence no "sliding-window" function.)

Here is simple illustration for RLLZ, to be complete:

Reduced Length LZ (RLLZ)
(A very compact LZ. Made possible by deferred literals output.)

aacbcdeaabc <- input source
012345678910 <- index

Compress:

Encode all duplicated strings first:
Very simple:
(<size of string>, <string>, <number of "next" occurrences of string>), [positions in file or block (no match lengths needed)],
   :

aa: (2, aa, 1), [0, 7]
bc: (2, bc, 1), [3, 9]

Better:
aa: (2, aa, 1-1=0), [0, 7]
bc: (2, bc, 1-1=0), [3, 9]

Encode end of string_code stream: (0, , );

Encode literals last: [cde].


Decompress:

Decode all strings first;

[aa bc  aabc]

Read/Get literals. The literals are then inserted into the holes. Whatever the size of those holes randomly positioned in the write buffer, they are filled by the literals.

The strings and literals are nicely in their correct order or sequences. Write to actual file.

***

No need to output bit flags for literals or the number of following consecutive literals in some algorithms. 

Then the completely filled write buffer is written to file. 

That is, e.g. after gathering all distinct repeated strings on a block, 

1) no transmission of <length> code for the next occurrences of the same string (but, in the simplest way, you have to output the number of succeeding strings); 

2) transmitting the literals last (in the output buffer) means no need to transmit *bit flags* for literals (and matches). This is the novel or surprising idea here: deferred literals output; 

(3) if there are no calculated or expected compression for some repeats strings, then simply treat them as just "literals" or "incompressible" strings, which don't expand at all due to deferred literals transmission. Or you can entropy encode the literals too. If they somehow expand due to their probability distributions in their expected compression, then don't perform actual entropy encoding. Just really output them as literals. Thus, RLLZ has the capability to be "recursive" while there are expected compression in repeat strings. No, maybe if you rearrange the output block.


***


I have similar ideas as "rep-codes" in 2000s. My compression ideas are from the late 90s when I became interested again in data compression. 

The ones I call here "holes" they call "gaps", even in LZW. But the idea of deferring transmission of literals for an output buffer avoids bit flags for both literals and matches which is still used in some explanations of ROLZ. Other algorithms still need to output the number of following literals, or literal_length. Deferred, meaning this is not an "online" algorithm. RLLZ is how you do basic deduplication, I presume. Deduplication of strings in a file or block. 


No comments:

Post a Comment