The Phantom of a PDF File
This text is part of our blog series, in which individuals write about miscellaneous digital preservation themes in a free-form way. The series is not intended as guidelines or instructions, but as inspiration for ideas and discussion, including opposing views.
Changelog:
- Update 11/22/2024: We are happy that our blog post has led to a good discussion over the past few weeks, and thankful that it inspired an experiment with various tools; see the response blog by Johan van der Knijff. The experiment shows that octal codes used in combination with UTF-16BE encoded text strings are well supported by various tools. We have updated the Finale and added a couple of other sentences to the blog. The wording “dual encoding” is possibly not the best one: in practice, UTF-16BE characters span multiple bytes, whereas octal coding is done byte by byte. Concerning the number of bytes in UTF-16BE and octal codes, we have also filed a request with PDF Specification Issues.
- Update 11/11/2024: We now refer to the most recent PDF specification, ISO 32000-2:2020. Dual encoding of UTF-16BE and octal codes is no longer judged to violate the specification, but we ponder the definition of octal codes a bit more. From a digital preservation point of view, dual encoding is still discouraged. This is now discussed a little more in the Finale.
This ghost story was originally written for Halloween 2024. It is based on real events; however, the PDF file in question has been replaced with a test file for this story.
Prologue
We recently received a simple PDF 1.4 file that did not pass all validation checks. JHOVE (v. 1.30.0) reported that the PDF file is “Not well-formed”, with the error message “PDF_HUL_66 Lexical Error” at offset 1888. This message means that something is wrong with the technical “language” of the PDF file’s internal structure.
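For anyone who wants to reproduce this kind of check: JHOVE can be invoked from the command line roughly as follows, where -m selects the PDF module (the file name here is a placeholder for our test file):

jhove -m PDF-hul phantom.pdf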
We promised to analyze the file to see what was wrong with it. Oh boy, were we in for a ride!
Act 1: (Grave)Digging into the PDF Structure
We started inspecting the file with a hex editor. Offset 1888 turned out to be in the middle of a zlib/deflate encoded image stream. JHOVE should not need to do any lexical checking in the middle of an image stream, which immediately raised our suspicions that something odd was going on here. The dictionary of the image stream object is:
<< /N 3 /Alternate /DeviceRGB /Length 3187 /Filter /FlateDecode >>
The actual length of the stream is 3187 bytes, so there is nothing wrong with the information in the dictionary, such as the reported length. The ID of the image stream object is “7 0”, and it is referred to from object “6 0” as follows:
[ /ICCBased 7 0 R ]
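The length claim is easy to double-check with a few lines of Python. The following sketch is our own illustration (the file name is a placeholder) and measures the stream data of object 7 0:

import re

data = open("phantom.pdf", "rb").read()
# Find the start of object 7 0's stream data and measure the bytes up
# to "endstream". The EOL before "endstream" is not part of the stream.
m = re.search(rb"7 0 obj.*?stream\r?\n", data, re.DOTALL)
start = m.end()
end = data.index(b"endstream", start)
print(len(data[start:end].rstrip(b"\r\n")))   # should print 3187, matching /Length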
We followed the object references onward and eventually ended up at the first object of the PDF file. We found nothing exceptional anywhere.
What next? Time to return to JHOVE's error message.
JHOVE reports a lexical error when there is an unexpected character in a numerical value or when a PDF dictionary object does not end correctly with “>>”. The PDF file has 10 objects, and 9 of them begin with “<<” and end with a corresponding “>>”. The remaining object is a list, correctly surrounded by square brackets “[” and “]”. Three of the PDF objects contain streams, each correctly starting with the keyword “stream” and ending with “endstream”. The Length parameter is correct in all of the stream objects. All objects in the PDF file begin with “X 0 obj” (where X is the object index) and end properly with “endobj”.
As our next step, we looked at the xref table, which contains the byte offsets of the objects. All the byte offsets refer to the beginning of the objects, and the objects are numbered correctly in relation to the xref table index.
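This, too, can be verified with a small script along the following lines (again a sketch of ours, assuming a single xref subsection starting at object 0):

import re

data = open("phantom.pdf", "rb").read()
# The "startxref" keyword near the end of the file gives the byte
# offset of the xref table itself.
sx = data.rindex(b"startxref")
xref_at = int(data[sx + len(b"startxref"):].split()[0])
# Check that every in-use ("n") entry points at the object it indexes.
entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", data[xref_at:])
for num, (off, gen, kind) in enumerate(entries):
    if kind == b"n":
        print(num, data[int(off):].startswith(b"%d %d obj" % (num, int(gen))))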
Overall, the PDF structure looked fine.
Intermission: Fixing the File with Acrobat Reader
We happen to know that some basic PDF problems can be fixed simply by opening the file in Adobe Acrobat Reader and saving it as a new file without making any changes. We decided to try this with our PDF file. After saving the file with Acrobat Reader, we ran it through JHOVE to see what it would tell us. JHOVE now reported that the new file is Well-Formed and valid.
Okay.
We opened both files in Acrobat Reader, and they looked identical. Going through the fixed file with our hex editor, we saw that its structure differed slightly from the original: saving with Acrobat Reader had caused some minor changes in the objects of the PDF structure, but at first glance there seemed to be nothing special about these changes. Just to be sure, we calculated checksums of the byte streams in the original and the new PDF file to confirm that the streams themselves were unchanged.
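Something along these lines (file names are placeholders; the extraction is naive but sufficient for a quick comparison) does the job:

import hashlib, re

def stream_digests(path):
    data = open(path, "rb").read()
    # Hash the raw bytes of every stream in document order.
    return [hashlib.sha256(m.group(1)).hexdigest()
            for m in re.finditer(rb"stream\r?\n(.*?)\r?\nendstream",
                                 data, re.DOTALL)]

print(stream_digests("phantom.pdf") == stream_digests("phantom_fixed.pdf"))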
So the problem was not at offset 1888 in the original file, nor in anything referring to it directly or indirectly, but somewhere else.
Act 2: It’s There! The Phantom of a PDF File!
Until now, we had only carefully checked the objects that have a direct or indirect reference to the object containing the offset JHOVE complains about. Since those objects were fine, the reason had to lie in some other object of the PDF file. We finally realized that in fixing the PDF file, the Producer field in the Document Information Dictionary object had been changed. The Producer in the original file looks odd:
/Producer (\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000)
In the file fixed with Acrobat Reader it is:
/Producer (PDF Phantom)
The original Producer field starts with the 8-byte sequence “\376\377” (hex codes 5C 33 37 36 5C 33 37 37). The PDF file format includes an octal character escape notation, inherited from the PostScript format, and that notation is used in this metadata. The octal values 376 and 377 correspond to the hex values FE and FF, i.e. the byte order marker (BOM, hex FEFF) used in the UTF-16BE encoding. In other words, the UTF-16BE encoded string has additionally been encoded with the octal notation of the PDF file format: a dual encoding. Likewise, each substring “\000” represents a null byte, and the actual content can be picked out from between these null bytes: \000P --> P, \000D --> D, \000F --> F, etc.
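To make the decoding concrete, here is a minimal Python sketch (our own illustration, not part of JHOVE or any other tool mentioned here) that first resolves the octal escapes byte by byte and then applies UTF-16BE; the other escape sequences PDF allows, such as \n and \(, are ignored for brevity:

import re

def decode_pdf_literal(raw: bytes) -> str:
    # Resolve each \ddd octal escape to a single byte; all other bytes
    # pass through unchanged.
    out = re.sub(rb"\\([0-7]{1,3})",
                 lambda m: bytes([int(m.group(1), 8)]), raw)
    # A leading FE FF byte order marker signals UTF-16BE text.
    if out.startswith(b"\xfe\xff"):
        return out[2:].decode("utf-16-be")
    return out.decode("ascii")

producer = rb"\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000"
print(repr(decode_pdf_literal(producer)))   # 'PDF Phantom\x00'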
The PDF specification ISO 32000-2:2020 states in ch. 7.9.2.2.1: "For text strings encoded in UTF-16BE, the first two bytes shall be 254 followed by 255. These two bytes represent the Unicode byte order marker, ZERO WIDTH NO-BREAK SPACE (U+FEFF), indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in Unicode." With the dual encoding "\376\377", the byte order marker no longer fits in two bytes. However, the octal codes do correspond to these two bytes, and they are part of the definition of literal strings, not text strings.
In ch. 7.3.4.2, octal coding is described as "character code ddd", which intuitively suggests that an octal code should be decoded to a character. But at the end of the chapter, this is added: "However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described." It remains quite unclear whether a single octal “character code” should be decoded to a character or to a byte (of a character). In a UTF-16BE encoded string there is a difference, since every character uses 2 or 4 bytes.
JHOVE cannot handle the dual encoding in the original file: it does not find the right parenthesis “)” of the Producer field, and so does not know where the field ends. Consequently, it continues reading the file until it finally hits a problem (Lexical Error) at byte offset 1888, where decoding fails. The Producer field in the Document Information Dictionary object is located at offsets 87-169 of the PDF file, which was quite a distance to track back from the reported error at offset 1888.
UTF-16BE characters use 2 or 4 bytes each, and in the original file JHOVE joins “)” with its neighbouring byte(s) to form a character. Merely adding a space character before the right parenthesis “)” makes the file valid according to JHOVE, but as a result, JHOVE reports nonsense as the Producer metadata.
Act 3: XMP as a Friendly Ghost
Acrobat Reader figures out that the Producer should be “PDF Phantom” and stores it in the Producer field using plain US-ASCII encoding. To test the fix, we changed the Producer to “Ananasakäämä” with the hex editor, so that it would include Scandinavian characters, and re-encoded it as the same combination of UTF-16BE and octal notation. We also had to update the byte offsets in the xref table and the startxref offset, because the byte offsets of the other objects changed. The Producer field in our test looked like:
/Producer (\376\377\000A\000n\000a\000n\000a\000s\000a\000k\000\344\000\344\000m\000\344\000\000)
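(For the curious: producing such a string can be sketched in a few lines of Python. This helper is our own illustration; the trailing \000\000 mirrors the null character that ends the original Producer string.)

def octal_escape(text: str) -> bytes:
    # BOM + UTF-16BE bytes; printable ASCII stays literal, while all
    # other bytes, plus the string delimiters ( ) \, become \ddd escapes.
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    out = bytearray()
    for b in raw:
        if 0x20 <= b <= 0x7e and b not in (0x28, 0x29, 0x5c):
            out.append(b)
        else:
            out += b"\\%03o" % b
    return bytes(out)

print(octal_escape("Ananasakäämä\x00").decode("ascii"))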
The octal values \000\344 correspond to the hex bytes 00 E4, which form the character “ä” in UTF-16BE. We opened and saved the test file with Acrobat Reader, and the result looked interesting. We got:
/Producer (PDF Phantom)
In other words, we just got rid of “PDF Phantom”, and now it is there again!
Spooky!
We finally figured out that, since the file also has XMP metadata in a totally different object within the PDF file, Acrobat Reader overrides the Producer in the Document Information Dictionary object with the Producer from the XMP metadata, which looks like this:
<pdf:Producer>PDF Phantom</pdf:Producer>
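Because XMP is plain XML embedded in the PDF, the overriding value is easy to inspect directly; for example (a naive sketch of ours, assuming the metadata stream is stored uncompressed, as is customary for XMP):

import re

data = open("phantom.pdf", "rb").read()
m = re.search(rb"<pdf:Producer>(.*?)</pdf:Producer>", data, re.DOTALL)
print(m.group(1))   # b'PDF Phantom'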
Act 4: Shapeshifting of the Phantom
We did another test in which we removed the XMP metadata from the original file. This time, Acrobat Reader seems to remove the octal notation but keep the UTF-16BE encoding of the “PDF Phantom” string. This is the result:
/Producer (<fe><ff><00>P<00>D<00>F<00> <00>P<00>h<00>a<00>n<00>t<00>o<00>m<00><00>)
Here, <fe>, <ff> and <00> are non-printable bytes (one byte each), corresponding to the hex codes FE, FF and 00, respectively.
Finale
Selecting a proper encoding is important. Probably the most standardized way to encode metadata strings nowadays is to use US-ASCII where possible and Unicode otherwise. By contrast, PDFDocEncoding with characters outside US-ASCII is inherited from the old PostScript world and relies on the widely deprecated Latin-1.
If the metadata consists of plain US-ASCII characters, US-ASCII can be used directly. In that case, encoding to UTF-16BE with octal codes is most likely unnecessary and unwise: it increases the technical complexity of the string for no good reason. Also, from the perspective of interpreting the PDF specification in digital preservation actions, such as understanding metadata strings, the definition of octal codes could be clearer (i.e. character codes vs. coding bytes in UTF-16BE).
Software support for the dual encoding of UTF-16BE and octal codes is also an important aspect. JHOVE got confused by it, but what about other software? Inspired by our blog text (see the footnote), Johan van der Knijff carried out an extensive experiment which concluded that software support for this case is wide; see the blog "Escape from the Phantom of the PDF". We are grateful for this experimental effort and agree that software support probably will not be an issue. Still, it is important to think about these kinds of things today, to minimize the risk of not being able to preserve features of digital content due to lacking software support. If a feature is problematic today, it will most likely remain so in the future.
Written by Juha Lehtonen & Johan Kylander
The National Digital Preservation Services in Finland
Footnote: Originally we wrote: “JHOVE probably is not the only software that will get confused by dual encodings of metadata. In future migrations, these kinds of things need to be identified and handled somehow with the software support available. Unless we handle it today.”