ITSF format

Preface

This chapter is an edited version of Matthew Russotto's page, "Microsoft's HTML Help (.chm) format", which is available at http://www.speakeasy.org/~russotto/chm/chmformat.htm. It is used with his knowledge and permission.

This is documentation on the ITSF format used by MS HH. This format has been reverse engineered in the past, but as far as is known this is the first freely available documentation on it, other than the predecessor of this chapter. One Usenet message indicates that CHM files are actually IStorage files documented in the MS Platform SDK. However, such documentation has been searched for without success. A code sample (in Delphi) that shows how to get an IStorage object representing a CHM from an ITStorage object has been located outside MS. No reference to ITStorage has been found in the MSDN or anywhere else. The DLL used to implement this format is itss.dll, and its resource information indicates that it is MS' InfoTech Storage System Library.

Note

The word "section" is badly overloaded in this document. Sorry about that.

All numbers are in decimal unless otherwise indicated in the text. Hex numbers are indicated by 0x. All values within the ITSF file are Intel byte order (little endian) unless indicated otherwise.

Overall format

The ITSF file begins with a short (0x38 byte) initial header. This is followed by the header section table, the offset to the content, and a number of bytes of information of unknown use. Collectively, this is the "header".

The header is followed by the header sections. There are two header sections. One header section is the file directory, the other contains the ITSF file length and some unknown data. Immediately following the header sections is the content. The image below shows the structure of the ITSF files.

ITSF structure

The header starts with the initial header, which has the following format:

Offset Type Comment/Value
0 char[4] ITSF
4 DWORD 2 or 3 (version number)
8 DWORD 0x58 in version 2 files, 0x60 in version 3 files (total header length, including header section table and following data)
0xC DWORD 1 (unknown)
0x10 DWORD Unknown checksum.
0x14 DWORD LCID of the OS at the time of compilation, not the one stored in the HHP file. It is unknown whether or not this is the system LCID (from GetSystemDefaultLCID), the user LCID (from GetUserDefaultLCID) or the thread LCID (from GetThreadLocale). It is likely to be GetUserDefaultLCID since that is the function that itss.dll depends on. If you have the facility to check this please let us know of the result.
0x18 GUID {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC}
0x28 GUID {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC}

It is followed by the header section table, which has 2 entries. Each entry is 0x10 bytes long and has this format:

Offset Type Comment/Value
0 QWORD Offset of header section from beginning of ITSF file
8 QWORD Length of header section

Following the header section table is 8 bytes of additional header data. In Version 2 files, this data is not there and the content section starts immediately after the directory.

Offset Type Comment/Value
0 QWORD Offset within ITSF file of content section 0
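
As an illustration of the layout above, here is a minimal Python sketch that reads the initial header, the header section table and the content offset. The function name and the dictionary keys are hypothetical, and it assumes the GUIDs are stored in the usual Windows in-memory layout (first three fields little endian), which the tables above do not state explicitly.

```python
import struct
import uuid


def parse_itsf_header(data: bytes) -> dict:
    """Parse the ITSF header (hypothetical helper, not part of any library)."""
    # Initial header: signature plus five DWORDs, all little endian.
    magic, version, header_len, unknown_c, unknown_10, lcid = struct.unpack_from(
        "<4s5I", data, 0
    )
    if magic != b"ITSF":
        raise ValueError("not an ITSF file")

    # Two GUIDs at 0x18 and 0x28 (assumed to use the Windows byte layout).
    guids = [
        uuid.UUID(bytes_le=data[0x18:0x28]),
        uuid.UUID(bytes_le=data[0x28:0x38]),
    ]

    # Header section table: 2 entries of (offset, length), each a QWORD pair.
    sections = [struct.unpack_from("<QQ", data, 0x38 + 0x10 * i) for i in range(2)]

    # The 8-byte content offset is only present in version 3 files.
    content_offset = None
    if version >= 3:
        (content_offset,) = struct.unpack_from("<Q", data, 0x58)

    return {
        "version": version,
        "header_length": header_len,
        "lcid": lcid,
        "guids": guids,
        "sections": sections,
        "content_offset": content_offset,
    }
```

For a version 2 file the sketch leaves content_offset as None, since that field is absent there.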

The Header Sections

Header Section 0

This section contains the total size of the ITSF file, and not much else.

Offset Type Comment/Value
0 DWORD 0x01fe (unknown)
4 DWORD 0 (unknown)
8 QWORD ITSF File Size
0x10 DWORD 0 (unknown)
0x14 DWORD 0 (unknown)
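
A short Python sketch of reading this section follows. The function name is hypothetical, and comparing the stored size against the real file length is only a suggested sanity check, not something the format is known to require.

```python
import struct


def parse_header_section0(data: bytes, offset: int = 0) -> dict:
    """Read header section 0 starting at `offset` (hypothetical helper)."""
    # DWORD, DWORD, QWORD, DWORD, DWORD: 0x18 bytes in total, little endian.
    unknown_0, unknown_4, file_size, unknown_10, unknown_14 = struct.unpack_from(
        "<IIQII", data, offset
    )
    return {
        "unknown_0": unknown_0,  # 0x01fe in the files described above
        "file_size": file_size,  # total size of the ITSF file
    }
```

A reader could compare file_size with the actual length of the file on disk to detect truncation.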

Header Section 1: The Directory Listing

This is the central part of the ITSF file: a directory of the files and information it contains.

Directory header

The directory starts with a header; its format is as follows:

Offset Type Comment/Value
0 char[4] ITSP
4 DWORD 1 (version number)
8 DWORD 0x54 (directory header length)
0xC DWORD 0x0a (unknown)
0x10 DWORD 0x1000 (directory chunk size)
0x14 DWORD Density of quickref section, usually 2.
0x18 DWORD Depth of the directory tree. 1 if there is no index, 2 if there is one level of PMGI chunks …
0x1C DWORD Chunk number of root index chunk. -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug.)
0x20 DWORD Chunk number of first PMGL (listing) chunk
0x24 DWORD Chunk number of last PMGL (listing) chunk
0x28 DWORD -1 (unknown)
0x2C DWORD Number of directory chunks (total)
0x30 DWORD LCID. 0x409=en-us is the only one seen. If you have a non-US version of HHW, please change your system and HHP locales to something other than en-us and something other than the locale of HHW, compile a CHM and check its LCID at offset 168. It probably comes from the program that compiled the ITSF file, and definitely not from the HHP file or the OS. It is unknown which EXE/DLL this LCID comes from, but at a guess it would be ITSS.DLL, which provides the following GUID.
0x34 GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0x44 DWORD 0x54 (this is the length again)
0x48 DWORD -1 (unknown)
0x4C DWORD -1 (unknown)
0x50 DWORD -1 (unknown)
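
The directory header can be read in one step, as in the following Python sketch. The names are hypothetical; fields documented as -1 are read as signed DWORDs so that they come back as -1, and the GUID is assumed to use the usual Windows byte layout. The chunk_offset helper assumes that chunks are stored back to back right after the directory header, numbered from 0.

```python
import struct
import uuid

# 0x54 bytes: signature, 6 unsigned DWORDs, 4 signed, 2 unsigned,
# a 16-byte GUID, 1 unsigned DWORD and 3 signed DWORDs.
DIRECTORY_HEADER_FORMAT = "<4s6I4i2I16sI3i"


def parse_directory_header(data: bytes, offset: int = 0) -> dict:
    """Parse the ITSP directory header (hypothetical helper)."""
    f = struct.unpack_from(DIRECTORY_HEADER_FORMAT, data, offset)
    if f[0] != b"ITSP":
        raise ValueError("directory header signature not found")
    return {
        "version": f[1],
        "header_length": f[2],
        "chunk_size": f[4],
        "quickref_density": f[5],
        "depth": f[6],
        "root_index_chunk": f[7],  # -1 if there is no index chunk
        "first_listing_chunk": f[8],
        "last_listing_chunk": f[9],
        "chunk_count": f[11],
        "lcid": f[12],
        "guid": uuid.UUID(bytes_le=f[13]),
    }


def chunk_offset(section_offset: int, header: dict, chunk_number: int) -> int:
    """File offset of a directory chunk, assuming chunks are contiguous."""
    return section_offset + header["header_length"] + chunk_number * header["chunk_size"]
```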

The Listing Chunks

The header is directly followed by the directory chunks. There are two types of directory chunks - index chunks, and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:

Offset Type Comment/Value
0 char[4] PMGL
4 DWORD Length of free space and/or quickref area at end of directory chunk
8 DWORD 0 (unknown)
0xC DWORD Chunk number of previous listing chunk when reading directory in sequence (-1 if this is the first listing chunk)
0x10 DWORD Chunk number of next listing chunk when reading directory in sequence (-1 if this is the last listing chunk)
0x14 Directory listing entries to quickref area. Sorted case-insensitively by filename. Consecutive entries do not necessarily have increasing offsets.

The format of a directory listing entry is:

Offset Type Comment/Value
0 BYTE length of name
1 BYTEs name (UTF-8 encoded)
+0 ENCINT content section
+0 ENCINT offset
+0 ENCINT length

The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate). The length also refers to length of the file in the section after decompression.
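
The following Python sketch walks the entries of one listing chunk. It is only an illustration: the function names are hypothetical, the name length is read as a single BYTE as in the table above, and the entries are assumed to end where the free space/quickref area (whose length is stored at offset 4) begins.

```python
import struct


def read_encint(buf: bytes, pos: int):
    """Decode one ENCINT at `pos`; return (value, position after it)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos


def parse_listing_chunk(chunk: bytes):
    """Return (previous chunk, next chunk, entries) for one PMGL chunk."""
    magic, free_len, _unknown, prev_chunk, next_chunk = struct.unpack_from(
        "<4sIIii", chunk, 0
    )
    if magic != b"PMGL":
        raise ValueError("not a listing chunk")
    end = len(chunk) - free_len  # assumed end of the entry area
    pos = 0x14
    entries = []
    while pos < end:
        name_len = chunk[pos]
        pos += 1
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        section, pos = read_encint(chunk, pos)
        offset, pos = read_encint(chunk, pos)
        length, pos = read_encint(chunk, pos)
        entries.append((name, section, offset, length))
    return prev_chunk, next_chunk, entries
```

Each tuple in the result is (name, content section, offset, length), with offset and length referring to the decompressed section as described above.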

There are two kinds of file represented in the directory: user data files and format-related files. Format-related files have names which begin with "::"; user data files have names which begin with "/".

Between the chunk entries and the quickref entries is chunk length - ( num entries / n + !!( num entries % n ) ) * 2 bytes worth of free space. This usually contains the same data from the same offsets in the previous chunk, and can be zeroed out with no effect on the decoder and a slight increase in the compressibility of the file with zip/gzip/bzip2 and probably other crunchers. The free space is usually partial/junk chunk entries, free space and/or quickref entries.

The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.

Offset Type Comment/Value
Chunklen-2 WORD Number of entries in the chunk
Chunklen-4 WORD Offset of entry n from entry 0
Chunklen-6 WORD Offset of entry 2n from entry 0
Chunklen-8 WORD Offset of entry 3n from entry 0
...
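
The arithmetic for the quickref area can be sketched in Python as follows. The function names are hypothetical; quickref_words mirrors the expression num entries / n + !!( num entries % n ) used in the free space formula above, and quickref_entry_count reads the WORD stored in the last two bytes of a chunk.

```python
import struct


def quickref_interval(density: int) -> int:
    """n = 1 + (1 << quickref density): one quickref entry per n entries."""
    return 1 + (1 << density)


def quickref_words(num_entries: int, density: int) -> int:
    """Number of WORDs given by num_entries / n + !!(num_entries % n)."""
    n = quickref_interval(density)
    return num_entries // n + (1 if num_entries % n else 0)


def quickref_entry_count(chunk: bytes) -> int:
    """Number of entries in the chunk, stored in its last two bytes."""
    (count,) = struct.unpack_from("<H", chunk, len(chunk) - 2)
    return count
```

With the usual density of 2 this gives n = 5, so a chunk holding 12 entries yields 12 / 5 + 1 = 3 words.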

The Index Chunk

An index chunk has the following format:

Offset Type Comment/Value
0 char[4] PMGI
4 DWORD Length of quickref/free area at end of directory chunk
8 Directory index entries (to quickref/free area)

The format of a directory index entry is as follows:

Offset Type Comment/Value
0 BYTE length of name
1 BYTEs name (UTF-8 encoded)
+0 ENCINT directory listing chunk which starts with name

When higher-level indexes exist (when the depth of the index tree is 3 or higher), presumably the upper-level indexes will contain the numbers of lower-level index chunks rather than listing chunks.

The quickref area in a PMGI is the same as in a PMGL.

Encoded Integers

An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, 0xEA 0x15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.
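
A minimal Python sketch of the decoding rule just described is shown below. The function name and the (value, next position) return convention are choices made for this illustration only.

```python
def decode_encint(buf: bytes, pos: int = 0):
    """Decode one ENCINT starting at `pos`.

    Returns (value, position of the first byte after the ENCINT).
    """
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, most significant group first.
        value = (value << 7) | (byte & 0x7F)
        # A clear high bit marks the final byte.
        if not byte & 0x80:
            return value, pos


value, next_pos = decode_encint(bytes([0xEA, 0x15]))
print(hex(value), next_pos)  # prints: 0x3515 2
```

Returning the next position makes it easy to read the three consecutive ENCINTs of a directory listing entry one after another.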

The Content

The content typically immediately follows the header sections, and is at the location indicated by the QWORD following the header section table. All content section 0 locations in the directory are relative to that point. The other content sections are stored within content section 0.

The Namelist file

There exists in content section 0 and in the directory a file called "::DataSpace/NameList". This file contains the names of all the content sections. The format is as follows:

Offset Type Comment/Value
0 WORD Length of file, in words
2 WORD Number of entries in file
4 Entries to the EOF

Each entry:

Offset Type Comment/Value
0 WORD Length of name in words, excluding terminating NIL
2 WORDs Double-byte characters
+0 WORD 0

Yes, the names have a length word and are also null-terminated; sort of a belt-and-suspenders approach. The coding system is likely UTF-16 (little endian).
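
The NameList layout can be read with a few lines of Python, sketched below. The function name is hypothetical, and the names are decoded as UTF-16 little endian, which is the likely but unconfirmed encoding mentioned above.

```python
import struct


def parse_namelist(data: bytes) -> list:
    """Return the content section names stored in ::DataSpace/NameList."""
    # File header: length of the file in words, then the number of entries.
    _length_in_words, count = struct.unpack_from("<HH", data, 0)
    pos = 4
    names = []
    for _ in range(count):
        # Length of the name in words, excluding the terminating zero word.
        (name_len,) = struct.unpack_from("<H", data, pos)
        pos += 2
        names.append(data[pos:pos + 2 * name_len].decode("utf-16-le"))
        pos += 2 * name_len + 2  # skip the name and its zero terminator
    return names
```

The position of a name in the returned list corresponds to its content section number, so index 0 is the name of content section 0.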

The section names seen so far are "Uncompressed" and "MSCompressed". "Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with MS' LZX algorithm.

The Section Data

For each section other than 0, there exists a file called '::DataSpace/Storage/<Section Name>/Content'. This file contains the compressed and/or encrypted data for the section. So, conceptually, getting a file from a nonzero section is a multi-step process. First you must get the content file from section 0. Then you decompress (if appropriate) the section. Then you get the desired file from your decompressed section.

Other section format-related files

There are several other files associated with the sections

Appendix: The Compression

The compressed sections are compressed using LZX, a compression method MS also uses for its cabinet files. To confirm this, check the second DWORD of compression info in the ControlData file for the section - it should be 'LZXC'. To decompress, first read the file "::DataSpace/Storage/<SectionName>/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable". This reset table has the following format:

Offset Type Comment/Value
0 DWORD 2 (unknown - possibly a version number)
4 DWORD Number of entries in reset table
8 DWORD 8 (unknown)
0xC DWORD 0x28 (length of table header - area before table entries)
0x10 QWORD Uncompressed Length
0x18 QWORD Compressed Length
0x20 QWORD 0x8000 (block size for locations below)
0x28 QWORD Offset in compressed data of nth block boundary in uncompressed data (first offset = 0)
Repeat QWORD offsets to EOF
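
As a sketch of how the reset table might be used, the Python below parses it and looks up where in the compressed data the block containing a given uncompressed position starts. The function names and dictionary keys are hypothetical, and the entries are assumed to begin at the header length stored at offset 0xC.

```python
import struct


def parse_reset_table(data: bytes) -> dict:
    """Parse a ResetTable file (hypothetical helper)."""
    _unknown_0, count, _unknown_8, header_len = struct.unpack_from("<4I", data, 0)
    uncompressed_len, compressed_len, block_size = struct.unpack_from(
        "<3Q", data, 0x10
    )
    # One QWORD per block boundary, starting right after the table header.
    offsets = list(struct.unpack_from("<%dQ" % count, data, header_len))
    return {
        "uncompressed_length": uncompressed_len,
        "compressed_length": compressed_len,
        "block_size": block_size,
        "offsets": offsets,
    }


def compressed_offset_for(table: dict, uncompressed_pos: int) -> int:
    """Compressed offset of the block that contains `uncompressed_pos`."""
    block = uncompressed_pos // table["block_size"]
    return table["offsets"][block]
```

Note that a block boundary found this way is not necessarily a point where decoding can start from scratch; that depends on the reset interval discussed below.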

Now you can finally obtain the section (from its Content file). The window size for the LZX compression is 16 (decimal) on all the files seen so far. This is specified by the DWORD at 0x10 in the ControlData file (but note that DWORD gives the window size in 0x8000-byte blocks, not the LZX code for the window size).

There is one change from LZX as defined by MS: After each Huffman reset interval (defined in the ControlData file, but in practice equal to the window size) of compressed data is processed, the decoder state is partially reset: that is, the Huffman length tables are cleared and the one-bit preprocessing header is reread. The LZ window is not cleared.

The rule that the input bit-stream is to be re-aligned to a 16-bit boundary after 0x8000 output characters have been processed IS in effect, despite this LZX not being part of a CAB file. The reset table tells you when this was done, though there seems to be no need for that during decompression; you can just keep track of the number of output characters. Furthermore, while this does not appear to be documented in the LZX format, the compressed stream is padded to an 0x8000 (decimal) byte boundary.