Files appear to always start with the sequence:
Code:
4c 65 6e 5a 75 43 6f 6d 70 72 65 73 73 6f 72 00
31 00 00 00 30 00 00 00 00 00 00 00 00 00 00 00
In ASCII:
LenZuCompressor\0, followed by 16 bytes of what looks like binary version data. (The decoder never seems to check these bytes, but they're likely a versioning scheme.)
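As a quick sketch (Python here for brevity, though my implementation is C#; `has_lenzu_magic` is a hypothetical helper name), checking for this header might look like:

```python
MAGIC = b"LenZuCompressor\x00"

def has_lenzu_magic(data: bytes) -> bool:
    # The full magic region is 32 bytes; only the first 16 spell the name.
    # The decoder itself never validates the remaining 16 bytes.
    return len(data) >= 32 and data[:16] == MAGIC
```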
The algorithm used is an LZ77 variant which appears similar in principle to DEFLATE, though it differs in several ways that make it markedly worse than DEFLATE.
From-scratch C# reimplementation of decompressor:
src/MahoyoHDRepack/LenZuCompressorFile.Managed.cs
A simple no-compression encoder for the format:
src/MahoyoHDRepack/LenZuCompressorFile.NopEncoder.cs
- Starting after the 32-byte magic number, a u32le containing the uncompressed size of the file.
- The next 2 u32le values are the high and low parts of a 64-bit checksum, respectively. Yes, each half is encoded as little endian, yet the high 32 bits come before the low 32 bits. This mix of byte orderings is a recurring trend in this format.
- The next u32le is ignored. It is populated in all files I've looked at, but the decoder reads it, then ignores it.
- Starting at offset 0x30, there are 6 bytes which encode decompressor parameters. The first 4 are collectively used to compute the size of the table the decoder uses to decode Huffman codes, the 5th is BackrefLowBitCount, and the 6th is BackrefBaseDistance. Each of these bytes has constraints on valid values, but they're some strange constraints, and I haven't looked into them very thoroughly, so I won't list them here just yet. You can look at my implementation (in lz_adjust_data) if you're curious.
- After the 6 parameter bytes is an encoding of the Huffman codes. Each of the lowest indices in the table (where the index is the final value encoded) is represented as a u32le weight. These values are used to construct the final table.
The table is constructed one entry at a time, starting with the lowest-indexed entry that is not present in the file. In each step, the two lowest-valued entries already constructed are defined as the two children of the new entry. If there are not 2 entries left, the process is complete, and the last entry added is used as the starting point when decoding. The lowest-valued child is used for a '1' bit in that position, and the second lowest is used for the '0' bit. If there are no nonzero values, then the highest index is always used. (This might be useful for a no-op encoder, which chooses to just always encode the same lengths, with literals.)
- After the Huffman table is the compressed datastream. More below.
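The table-construction procedure above can be sketched as follows (Python rather than the C# of the actual implementation; the node representation and tie-breaking by lower index are my assumptions, so check my implementation for the exact rules):

```python
def build_huffman_tree(weights):
    # Nodes are (weight, index, one_child, zero_child); leaves have None children.
    nodes = [(w, i, None, None) for i, w in enumerate(weights)]
    live = [n for n in nodes if n[0] > 0]
    if not live:
        # Per the description: with no nonzero weights, the highest index is used.
        return nodes[-1]
    # New parent entries are appended starting at the first index past the leaves.
    next_index = len(weights)
    while len(live) >= 2:
        live.sort(key=lambda n: (n[0], n[1]))
        # Lowest-weighted entry becomes the '1' child, second lowest the '0' child.
        one_child, zero_child = live[0], live[1]
        parent = (one_child[0] + zero_child[0], next_index, one_child, zero_child)
        next_index += 1
        live = live[2:] + [parent]
    return live[0]  # the last entry added: the root used to start decoding
```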
This is copied almost verbatim from a description I wrote in a doc comment in my implementation; that will almost certainly be kept more up-to-date over time, though I intend to keep this post up-to-date.
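Backing up to the fixed header layout for a moment, here is a minimal parsing sketch of the fields described above (Python; the field names are mine, not from the format):

```python
import struct

def parse_header(data: bytes):
    # Layout per the description: 32-byte magic region, then four u32le fields
    # at 0x20, then 6 parameter bytes at offset 0x30.
    uncompressed_size, cksum_hi, cksum_lo, _ignored = struct.unpack_from("<4I", data, 0x20)
    checksum = (cksum_hi << 32) | cksum_lo  # high half is stored first
    table_params = data[0x30:0x34]          # feed into the table-size computation
    backref_low_bit_count = data[0x34]
    backref_base_distance = data[0x35]
    return uncompressed_size, checksum, table_params, backref_low_bit_count, backref_base_distance
```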
The compressed data consists of a sequence of 'instructions', where each 'instruction' encodes BOTH a backreference AND a literal. All instructions are compactly encoded, using as few bits as possible in the bitstream.
The compressed datastream is a stream of
bits, not a stream of bytes. As such, the decompression code has to be very careful to always keep track of which bit in the current byte is being referred to. (Note that for some reason the bits in each byte are treated as big-endian order, so the first bit in each byte is bit 7 (the high bit), and the last is bit 0 (the low bit).)
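A bit reader with that MSB-first convention might be sketched like this (Python; a simplified stand-in for the bookkeeping the real decompressor has to do):

```python
class BitReader:
    """Reads bits MSB-first: bit 7 of each byte comes out before bit 0."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # absolute bit position in the stream

    def read_bit(self) -> int:
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, count: int) -> int:
        # Multi-bit values are also big-endian: the first bit read is the high bit.
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value
```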
The first bit in each instruction indicates whether or not this instruction is a backreference: if the bit is 1, it is, and if the bit is 0, it isn't. The following bits are a Huffman-coded sequence representing a value we'll call X. X serves two purposes: it is the number of bytes to copy from earlier in the output stream, AND the 1-based number of literal bytes in the compressed stream. Note that if a backreference is used, a literal is NOT used.
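Decoding one such Huffman-coded value is just a walk down the tree, bit by bit. A sketch (the node shape `(weight, index, one_child, zero_child)` with `None` children for leaves is my own representation, not anything mandated by the format):

```python
def decode_symbol(read_bit, root):
    # Follow the '1' child on a 1 bit and the '0' child on a 0 bit
    # until a leaf is reached; the decoded value is the leaf's index.
    node = root
    while node[2] is not None:  # not a leaf yet
        node = node[2] if read_bit() == 1 else node[3]
    return node[1]
```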
If this instruction encodes a backreference, X is incremented by the header value
LzHeaderData.BackrefBaseDistance (which is the 6th byte after the main header). The next bits are another Huffman sequence (from the same code!) which encodes the high bits of the backreference distance offset. The next
LzHeaderData.BackrefLowBitCount bits are a normally-encoded big-endian integer (no more than 16 bits!) which are the low bits of the backref distance offset. These values are concatenated, then added to
LzHeaderData.BackrefBaseDistance to compute the actual distance. Each byte is copied one-by-one, so the new data can overlap the backreferenced data (as is common in LZ77 algorithms). Otherwise, if the instruction does not encode a backreference, the next X + 1 octets (8-bit bytes, at whatever alignment in the bitstream the decoder happens to be at this point) are copied verbatim to the output as a literal. The instruction is now complete.
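Putting the distance math and the overlapping copy together, roughly (Python; `apply_backref` is a hypothetical helper, and the length bookkeeping around X and BackrefBaseDistance is elided here):

```python
def apply_backref(output: bytearray, high_bits: int, low_bits: int,
                  low_bit_count: int, base_distance: int, length: int) -> None:
    # Concatenate the Huffman-coded high bits with the fixed-width low bits,
    # then add the base distance to get the real backreference distance.
    distance = ((high_bits << low_bit_count) | low_bits) + base_distance
    src = len(output) - distance
    # Byte-at-a-time copy, so the copy may read bytes it has just written,
    # e.g. distance=1 repeats the most recent byte `length` times.
    for i in range(length):
        output.append(output[src + i])
```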