Files appear to always start with the sequence:
Code:
4c 65 6e 5a 75 43 6f 6d 70 72 65 73 73 6f 72 00
31 00 00 00 30 00 00 00 00 00 00 00 00 00 00 00
In ASCII:
LenZuCompressor\0, followed by 16 bytes of what looks like binary version data. (The decoder never seems to check these bytes, but they're likely a versioning scheme.)
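As a quick sketch (Python here for brevity, though my implementation is C#; `has_lenzu_magic` is a hypothetical helper name), checking for this header might look like:

```python
MAGIC = b"LenZuCompressor\x00"

def has_lenzu_magic(data: bytes) -> bool:
    # The full magic region is 32 bytes; only the first 16 spell the name.
    # The decoder itself never validates the remaining 16 bytes.
    return len(data) >= 32 and data[:16] == MAGIC
```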
The algorithm used is an LZ77 variant which appears similar in principle to DEFLATE, though it differs in several ways that make it markedly worse than DEFLATE.
From-scratch C# reimplementation of decompressor:
src/MahoyoHDRepack/LenZuCompressorFile.Managed.cs
A simple no-compression encoder for the format:
src/MahoyoHDRepack/LenZuCompressorFile.NopEncoder.cs
- Starting after the 32-byte magic number, a u32le containing the uncompressed size of the file.
- The next 2 u32le values are the high and low parts of a 64-bit checksum, respectively. Yes, each half is encoded as little endian, yet the high 32 bits come before the low 32 bits. This mix of byte orderings is a recurring trend in this format.
- The next u32le is ignored. It is populated in all files I've looked at, but the decoder reads it, then ignores it.
- Starting at offset 0x30, there are 6 bytes which encode decompressor parameters. The first 4 are collectively used to compute the size of the table the decoder uses to decode Huffman codes, the 5th is BackrefLowBitCount, and the 6th is BackrefBaseDistance. Each of these bytes has constraints on valid values, but they're some strange constraints, and I haven't looked into them very thoroughly, so I won't list them here just yet. You can look at my implementation (in lz_adjust_data) if you're curious.
- After the 6 parameter bytes is an encoding of the Huffman codes. Each of the lowest indices in the table (where the index is the final value encoded) is represented as a u32le weight. These values are used to construct the final table.
The table is constructed one entry at a time, starting with the lowest-indexed entry that is not present in the file. In each step, the two lowest-valued entries already constructed are defined as the two children of the new entry. If there are not 2 entries left, the process is complete, and the last entry added is used as the starting point when decoding. The lowest-valued child is used for a '1' bit in that position, and the second lowest is used for the '0' bit. If there are no nonzero values, then the highest index is always used. (This might be useful for a no-op encoder, which chooses to just always encode the same lengths, with literals.)
- After the Huffman table is the compressed datastream. More below.
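The table-construction procedure above can be sketched as follows (Python rather than the C# of the actual implementation; the node representation and tie-breaking by lower index are my assumptions, so check my implementation for the exact rules):

```python
def build_huffman_tree(weights):
    # Nodes are (weight, index, one_child, zero_child); leaves have None children.
    nodes = [(w, i, None, None) for i, w in enumerate(weights)]
    live = [n for n in nodes if n[0] > 0]
    if not live:
        # Per the description: with no nonzero weights, the highest index is used.
        return nodes[-1]
    # New parent entries are appended starting at the first index past the leaves.
    next_index = len(weights)
    while len(live) >= 2:
        live.sort(key=lambda n: (n[0], n[1]))
        # Lowest-weighted entry becomes the '1' child, second lowest the '0' child.
        one_child, zero_child = live[0], live[1]
        parent = (one_child[0] + zero_child[0], next_index, one_child, zero_child)
        next_index += 1
        live = live[2:] + [parent]
    return live[0]  # the last entry added: the root used to start decoding
```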
This is copied almost verbatim from a description I wrote in a doc comment in my implementation; that will almost certainly be kept more up-to-date over time, though I intend to keep this post up-to-date.
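Backing up to the fixed header layout for a moment, here is a minimal parsing sketch of the fields described above (Python; the field names are mine, not from the format):

```python
import struct

def parse_header(data: bytes):
    # Layout per the description: 32-byte magic region, then four u32le fields
    # at 0x20, then 6 parameter bytes at offset 0x30.
    uncompressed_size, cksum_hi, cksum_lo, _ignored = struct.unpack_from("<4I", data, 0x20)
    checksum = (cksum_hi << 32) | cksum_lo  # high half is stored first
    table_params = data[0x30:0x34]          # feed into the table-size computation
    backref_low_bit_count = data[0x34]
    backref_base_distance = data[0x35]
    return uncompressed_size, checksum, table_params, backref_low_bit_count, backref_base_distance
```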
The compressed data consists of a sequence of 'instructions', where each 'instruction' encodes BOTH a backreference AND a literal. All instructions are compactly encoded, using as few bits as possible in the bitstream.
The compressed datastream is a stream of
bits, not a stream of bytes. As such, the decompression code has to be very careful to always keep track of which bit in the current byte is being referred to. (Note that for some reason the bits in each byte are treated as big-endian order, so the first bit in each byte is bit 7 (the high bit), and the last is bit 0 (the low bit).)
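A bit reader with that MSB-first convention might be sketched like this (Python; a simplified stand-in for the bookkeeping the real decompressor has to do):

```python
class BitReader:
    """Reads bits MSB-first: bit 7 of each byte comes out before bit 0."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # absolute bit position in the stream

    def read_bit(self) -> int:
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, count: int) -> int:
        # Multi-bit values are also big-endian: the first bit read is the high bit.
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value
```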
The first bit in each instruction indicates whether or not this instruction is a backreference: if the bit is 1, it is, and if the bit is 0, it isn't. The following bits are a Huffman-coded sequence representing a value we'll call X. X serves two purposes: it is the number of bytes to copy from earlier in the output stream, AND the 1-based number of literal bytes in the compressed stream. Note that if a backreference is used, a literal is NOT used.
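Decoding one such Huffman-coded value is just a walk down the tree, bit by bit. A sketch (the node shape `(weight, index, one_child, zero_child)` with `None` children for leaves is my own representation, not anything mandated by the format):

```python
def decode_symbol(read_bit, root):
    # Follow the '1' child on a 1 bit and the '0' child on a 0 bit
    # until a leaf is reached; the decoded value is the leaf's index.
    node = root
    while node[2] is not None:  # not a leaf yet
        node = node[2] if read_bit() == 1 else node[3]
    return node[1]
```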
If this instruction encodes a backreference, X is incremented by the header value
LzHeaderData.BackrefBaseDistance (which is the 6th byte after the main header). The next bits are another Huffman sequence (from the same code!) which encodes the high bits of the backreference distance offset. The next
LzHeaderData.BackrefLowBitCount bits are a normally-encoded big-endian integer (no more than 16 bits!) which are the low bits of the backref distance offset. These values are concatenated, then added to
LzHeaderData.BackrefBaseDistance to compute the actual distance. Each byte is copied one-by-one, so the new data can overlap the backreferenced data (as is common in LZ77 algorithms). Otherwise, if the instruction does not encode a backreference, the next X + 1 octets (8-bit bytes, at whatever alignment in the bitstream the decoder happens to be at this point) are copied verbatim to the output as a literal. The instruction is now complete.
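Putting the distance math and the overlapping copy together, roughly (Python; `apply_backref` is a hypothetical helper, and the length bookkeeping around X and BackrefBaseDistance is elided here):

```python
def apply_backref(output: bytearray, high_bits: int, low_bits: int,
                  low_bit_count: int, base_distance: int, length: int) -> None:
    # Concatenate the Huffman-coded high bits with the fixed-width low bits,
    # then add the base distance to get the real backreference distance.
    distance = ((high_bits << low_bit_count) | low_bits) + base_distance
    src = len(output) - distance
    # Byte-at-a-time copy, so the copy may read bytes it has just written,
    # e.g. distance=1 repeats the most recent byte `length` times.
    for i in range(length):
        output.append(output[src + i])
```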