The Phantom of a PDF File
Changelog:
- Update 11/11/2024: We now refer to the most recent PDF specification, ISO 32000-2:2020. Dual encoding of UTF-16BE and octal codes is no longer judged to violate the specification, but we ponder the definition of octal codes a bit more. From the point of view of digital preservation, dual encoding is still discouraged. This is now discussed a little more in the Finale.
This ghost story is originally written for Halloween 2024. It is based on real events. However, the PDF file in question has been replaced with a test file for this story.
Prologue
We recently received a simple PDF 1.4 file that did not pass all validation checks. JHOVE (v. 1.30.0) reported that the PDF file was “Not well-formed”. The error message reported by JHOVE was “PDF_HUL_66 Lexical Error” at offset 1888. This error message means that something is wrong with the technical “language” of the PDF file’s internal structure.
We promised to analyze the file to see what was wrong with it. Oh boy, were we in for a ride!
Act 1: (Grave)Digging into the PDF Structure
We started inspecting the file using a hex editor. The offset 1888 was in the middle of a zlib/deflate encoded image stream. JHOVE should not need to do any lexical checking in the middle of an image stream, which immediately raised our suspicions that something was going on here. The dictionary for the image stream object is:
<< /N 3 /Alternate /DeviceRGB /Length 3187 /Filter /FlateDecode >>
The actual length of the stream is 3187 bytes, so there is nothing wrong with the information in the dictionary, such as the reported length. The ID for the image stream object is “7 0” and it is referred to from object “6 0” as follows:
[ /ICCBased 7 0 R ]
We followed the object references and eventually ended up in the first element of the PDF file. We found nothing exceptional anywhere.
What next? Time to return to JHOVE's error message.
JHOVE reports a lexical error when there is an unexpected character in a numerical value or when a PDF dictionary object does not end correctly with “>>”. The PDF file has 10 objects, and 9 of them begin with “<<” and end with a corresponding “>>”. The remaining object is a list, correctly surrounded by square brackets “[” and “]”. Three of the PDF objects contain streams, each correctly starting with the keyword “stream” and ending with “endstream”. The Length parameter is correct in all of the stream objects. All objects in the PDF file begin with “X 0 obj” (where X is the object number) and end properly with “endobj”.
Our next step was to look at the xref table, which contains the byte offsets of the objects. All the byte offsets refer to the beginnings of the objects, and the objects are numbered correctly in relation to the xref table index.
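As an illustration of this step, here is a minimal Python sketch of such a check (the file name is made up, and it assumes a single classic xref section, as in this simple PDF 1.4 file): it reads the xref table and verifies that each in-use entry really points at the matching “X 0 obj” header.

    import re

    def check_xref(path):
        """Check that every in-use xref entry points at the expected object header."""
        data = open(path, "rb").read()

        # The last startxref value gives the byte offset of the xref table.
        startxref = int(re.findall(rb"startxref\s+(\d+)", data)[-1])

        # Parse one subsection header: "xref", then "FIRST COUNT", then 20-byte entries.
        header = re.match(rb"xref\s+(\d+)\s+(\d+)\s*", data[startxref:])
        first, count = int(header.group(1)), int(header.group(2))
        entries = data[startxref + header.end():]

        for i in range(count):
            offset, gen, kind = entries[i * 20:(i + 1) * 20].split()[:3]
            if kind != b"n":                      # skip free entries
                continue
            expected = b"%d %d obj" % (first + i, int(gen))
            found = data[int(offset):int(offset) + len(expected)]
            print(first + i, "OK" if found == expected else "mismatch: %r" % found)

    check_xref("phantom.pdf")                     # hypothetical file name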
Overall, the PDF structure looks ok.
Intermission: Fixing the File with Acrobat Reader
We happen to know that some basic PDF problems can be fixed simply by opening the file in Adobe Acrobat Reader and saving it as a new file without making any changes. We decided to try this out for this PDF file. After saving the file with Acrobat Reader, we ran it through JHOVE to see what it would tell us. JHOVE now reports that the new file is Well-Formed and valid.
Okay.
We opened both files in Acrobat Reader and they looked identical. When we went through the PDF file structure with our hex editor, the fixed file’s structure was a bit different from the original. Saving the file with Acrobat Reader had caused some minor changes in the objects in the PDF structure, but at first glance there seemed to be nothing special in these changes. Just to be sure, we calculated checksums of the byte streams from the original and the new PDF file to confirm that there were no changes in the streams.
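For completeness, the stream comparison boiled down to something like the following rough Python sketch (the file names are made up, and the regular expression is a shortcut rather than a real PDF parser): pull out the payload of every stream/endstream pair and hash it.

    import hashlib
    import re

    def stream_checksums(path):
        """Return an MD5 checksum for the payload of every stream/endstream pair."""
        data = open(path, "rb").read()
        # Shortcut: assumes no stream payload happens to contain the keyword itself.
        payloads = re.findall(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", data, re.DOTALL)
        return [hashlib.md5(p).hexdigest() for p in payloads]

    # Hypothetical file names; for our files the two lists were identical.
    print(stream_checksums("phantom_original.pdf"))
    print(stream_checksums("phantom_acrobat.pdf"))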
So the problem is not at offset 1888 in the original file, nor in anything referring to it directly or indirectly, but somewhere else.
Act 2: It’s There! The Phantom of a PDF File!
Until now, we had only carefully checked the objects that have a direct or indirect reference to the object containing the offset JHOVE complains about. Since those objects are OK, the reason must be in some other objects in the PDF file. We finally realized that when the PDF file was fixed, the Producer field in the Document Information Dictionary object had been changed. The Producer in the original file looks odd:
/Producer (\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000)
In the file fixed with Acrobat Reader it is:
/Producer (PDF Phantom)
The original Producer field starts with the 8-byte string “\376\377” (hex codes 5C 33 37 36 5C 33 37 37). The PDF file format includes an octal character notation, inherited from the PostScript format, and it is used in this metadata. The octal values 376 and 377 correspond to FE and FF as hex values. This is the byte order mark (BOM, hex FEFF), which is used in the UTF-16BE encoding. In other words, the UTF-16BE encoded string has been dual encoded with the octal notation of the PDF file format. Likewise, the substring “\000” is just a null byte, and the actual content can be picked out from between these null codes: \000P --> P, \000D --> D, \000F --> F, etc.
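To make the dual encoding concrete, here is a minimal Python decoding sketch (our own illustration, not how JHOVE or any other tool works internally). It first resolves the three-digit octal escapes of the literal string into raw bytes and then decodes those bytes as UTF-16BE:

    def resolve_octal_escapes(literal):
        r"""Turn a PDF literal string into raw bytes. Simplified: handles only
        plain characters and three-digit \ddd escapes, which is all this field uses."""
        out = bytearray()
        i = 0
        while i < len(literal):
            if literal[i] == "\\":
                out.append(int(literal[i + 1:i + 4], 8))   # e.g. "376" -> 0xFE
                i += 4
            else:
                out.append(ord(literal[i]))
                i += 1
        return bytes(out)

    producer = r"\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000"
    raw = resolve_octal_escapes(producer)
    print(raw[:2].hex())                         # 'feff' -> the UTF-16BE byte order mark
    print(repr(raw[2:].decode("utf-16-be")))     # 'PDF Phantom\x00' (note the trailing NUL)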
The PDF specification ISO 32000-2:2020 defines in ch. 7.9.2.2.1: "For text strings encoded in UTF-16BE, the first two bytes shall be 254 followed by 255. These two bytes represent the Unicode byte order marker, ZERO WIDTH NO-BREAK SPACE (U+FEFF), indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in Unicode." In the dual-encoded "\376\377", the byte order marker no longer fits in two bytes. However, the octal codes decode to exactly these two bytes, and octal codes are part of the definition of literal strings, not text strings.
In ch. 7.3.4.2, octal coding is described as "character code ddd", which intuitively suggests that an octal code should be decoded to a character. But at the end of the chapter, this is added: "However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described." It remains quite unclear whether a single octal “character code” should be decoded to a character or to a byte (of a character). In a UTF-16BE encoded string there is a difference, since every character uses 2 or 4 bytes.
JHOVE can’t handle the dual encoding in the original file. JHOVE does not find the closing parenthesis “)” of the Producer field, so it does not know where the field ends. Consequently, it continues reading the file until it finally reaches a problem (Lexical Error) at byte offset 1888, where the bytes can no longer be decoded. The Producer field in the Document Information Dictionary object is located at byte offsets 87-169 of the PDF file. That is quite a distance to trace back from the reported error at offset 1888.
UTF-16BE characters use 2 or 4 bytes each, and in the original file JHOVE joins the “)” with the neighbouring byte(s) to form a character. Just adding a space character before the closing parenthesis “)” makes the file valid according to JHOVE, but as a result, JHOVE reports nonsense as the Producer metadata.
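A simplified way to see why the terminator gets lost (again our own illustration of the behaviour described above, not JHOVE’s actual code): once the octal escapes are resolved and the bytes are read as two-byte UTF-16BE code units, the unescaped “)” is absorbed into a code unit instead of being recognized as the end of the literal string.

    # The Producer value with the octal escapes resolved, followed by the closing
    # ")" of the literal string and a few of the bytes that come after it in the file.
    value = b"\xfe\xff" + "PDF Phantom\x00".encode("utf-16-be")
    data = value + b")\n>>\nendobj\n"

    # A byte-oriented scan spots the unescaped ")" terminator immediately.
    print(data.index(b")"))                                   # 26

    # A scan in two-byte UTF-16BE code units never sees a lone ")":
    # the 0x29 byte gets paired with the byte that follows it.
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    print(b")" in units)                                      # False
    print(units[13])                                          # b')\n'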
Act 3: XMP as a Friendly Ghost
Acrobat Reader figures out that the Producer should be “PDF Phantom” and stores it in the Producer field using plain US-ASCII encoding. To test the functionality of the fix, we changed the Producer to “Ananasakäämä” with the hex editor, so that it would include Scandinavian characters, and reformulated it as the same combination of UTF-16BE and octal representation. We also had to update the byte offsets in the xref table and the startxref offset separately, because the byte offsets of the other objects changed. The Producer field in our test looked like this:
/Producer (\376\377\000A\000n\000a\000n\000a\000s\000a\000k\000\344\000\344\000m\000\344\000\000)
The octal values \000\344 correspond to the hex values 00 E4, which is the character “ä” in UTF-16BE. We opened and saved the test file with Acrobat Reader, and the result looked interesting. We got:
/Producer (PDF Phantom)
In other words, we just got rid of “PDF Phantom”, and now it is there again!
Spooky!
We finally figured out that since the file also has XMP metadata, in a totally different object within the PDF file, Acrobat Reader overrides the Producer in the Document Information Dictionary object with the Producer from the XMP, which looks like this:
<pdf:Producer>PDF Phantom</pdf:Producer>
Act 4: Shapeshifting of the Phantom
We did another test where we removed the XMP metadata from the original file. This time, Acrobat Reader seems to be able to remove the octal representation, but it keeps the UTF-16BE encoding for the “PDF Phantom” string. This is the result:
/Producer (<fe><ff><00>P<00>D<00>F<00> <00>P<00>h<00>a<00>n<00>t<00>o<00>m<00><00>)
Here, <fe>, <ff> and <00> are non-printable bytes (one byte each), corresponding to the hex codes FE, FF and 00, respectively.
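Both forms carry the same bytes once the escapes are resolved. A small Python sketch with a hypothetical helper of ours (and simplified escaping) produces either variant from the same text, which makes the two forms easy to compare:

    def utf16be_literal(text, use_octal=True):
        r"""Build a UTF-16BE PDF literal string for text.

        With use_octal=True, non-printable bytes are written as three-digit \ddd
        escapes and printable ASCII bytes as themselves (the dual-encoded form in
        the original file). With use_octal=False, every byte is written as itself
        (the form Acrobat Reader produced). Simplified: any "(", ")" or backslash
        bytes inside the data would still need escaping in a real implementation.
        """
        raw = b"\xfe\xff" + text.encode("utf-16-be")
        if not use_octal:
            return b"(" + raw + b")"
        body = "".join(chr(b) if 0x20 <= b <= 0x7e else "\\%03o" % b for b in raw)
        return ("(" + body + ")").encode("ascii")

    print(utf16be_literal("PDF Phantom").decode("ascii"))
    # (\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m)
    print(utf16be_literal("PDF Phantom", use_octal=False))
    # b'(\xfe\xff\x00P\x00D\x00F\x00 \x00P\x00h\x00a\x00n\x00t\x00o\x00m)'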
A valid action is to change the Producer to use either PDFDocEncoding or UTF-16BE. UTF-16BE is the only Unicode encoding allowed for strings in a PDF 1.x file. However, using both UTF-16BE and octal representation together for the same field is questionable. At the very least it increases the risk of misinterpretation, and it even makes some tools, such as JHOVE, get totally confused. This might cause problems in the future.
Finale
We would like to recommend, from the perspective of digital preservation and at the encoding level, using plain US-ASCII (without octal representation) in the PDF metadata fields. If this is not possible, UTF-16BE encoding can be used. Some characters in US-ASCII, such as parentheses or line feeds, require the escape character “\”, which indicates that the character belongs to the string and is not part of the PDF syntax. This is probably the most standardized way to encode metadata strings nowadays. In PDFDocEncoding, the octal character codes are inherited from the old PostScript world, which relies on the widely deprecated Latin-1.
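As a tiny illustration of that escaping rule (the producer name below is made up), a plain US-ASCII literal string can be written for example like this:

    def ascii_literal(text):
        """Write text as a plain US-ASCII PDF literal string, escaping the
        characters that would otherwise be read as PDF syntax or end-of-line."""
        escaped = (text.replace("\\", "\\\\")
                       .replace("(", "\\(")
                       .replace(")", "\\)")
                       .replace("\r", "\\r")
                       .replace("\n", "\\n"))
        return "(" + escaped + ")"

    print(ascii_literal("Phantom Writer (test build)"))
    # (Phantom Writer \(test build\))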
Finally, it is questionable to use a dual encoding of UTF-16BE and octal representation. From the point of view of digital preservation, having multiple layers of encoding, such as UTF-16BE wrapped in octal notation, might just not be sensible in practice. In the long term, each unnecessary encoding layer raises the risk of causing problems. JHOVE is probably not the only software that will get confused by dual-encoded metadata. In future migrations, these kinds of things need to be identified and handled somehow with the software support available. Unless we handle them today.
Written by Juha Lehtonen & Johan Kylander
The National Digital Preservation Services in Finland