Always look on the bright side of file

dPresFI

Thu, 06/11/2026 - 11:20

This text is part of our blog series, where individuals write about mixed digital preservation themes in a free-form way. Our blog series is not intended as guiding rules nor instructions, but as an inspiration for ideas and discussion.

Recently we ran into an information package that timed out before we could finish the validation in production. The package wasn't particularly large and contained DPX images, which we've ingested for years without problems. Upon further examination it was noted that only the bright images were causing problems and validation of dark images was seemingly five hundred times faster.

In this blog post we delve into the technical details of this fascinating bug. We start by debugging, zero in on the root cause and devise a fix to the slowness.

Surely the brightness of the image has to be a red herring?

Reproducing the bug

Sure enough, we were provided a dark and a bright DPX image that reproduced this behavior. In place of the original images we've created two images: dark.dpx and bright.dpx using GIMP and ffmpeg. Dark.dpx is a 3840x2160 canvas filled with the color 0x000414 and bright.dpx is the negative of this, i.e., the same size canvas filled with the color 0xfffbeb.

We can reproduce the behavior by unzipping the images:

# Unzip
$ gunzip *.gz

# Check file format
$ file *.dpx
bright.dpx: DPX image data, little-endian, 3840x2160, left to right/top to bottom
dark.dpx:   DPX image data, little-endian, 3840x2160, left to right/top to bottom

# Verify file checksums
$ sha256sum *.dpx
8e3f67a50901ca0572fe3010189aef8a3ad8b10e315e155d5a1eb75e3bfffb20  bright.dpx
cb8582f9cf16ef2e4d79f74334c4b42739971c81262024dbe1b144ce92cf6561  dark.dpx

and simply trying the scrape the files:

$ for file in bright.dpx dark.dpx
      time scraper scrape-file $file > /dev/null
  end

________________________________________________________
Executed in  142.17 secs    fish           external
   usr time  141.10 secs  809.00 micros  141.10 secs
   sys time    0.19 secs  153.00 micros    0.19 secs


________________________________________________________
Executed in  679.24 millis    fish           external
   usr time  499.83 millis    0.00 micros  499.83 millis
   sys time  163.17 millis  416.00 micros  162.75 millis

The bright image takes two minutes to validate while the dark image takes half a second, the only difference being the exact color chosen for the canvas.

Profiling python code performance

File-scraper runs a myriad of different tools behind the scenes. Identifying the exact tool causing the slowness can be done manually by reading the code and testing the commands one by one. However, there are also libraries that automate the whole process. This is known as code profiling and can be accomplished using the Python standard library profilers.

Let's start by running the bright image through cProfile and sorting by cumulative time spent in each function:

$ python3 -m cProfile -s cumtime /usr/bin/scraper scrape-file bright.dpx | head -50
.
. <scraper output>
.
         881009 function calls (860708 primitive calls) in 140.657 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    305/1    0.000    0.000  140.657  140.657 {built-in method builtins.exec}
        1    0.000    0.000  140.657  140.657 scraper:3(<module>)
        1    0.000    0.000  140.340  140.340 core.py:1126(__call__)
        1    0.000    0.000  140.340  140.340 core.py:981(main)
        1    0.000    0.000  140.340  140.340 core.py:1625(invoke)
        2    0.000    0.000  140.339   70.170 core.py:1384(invoke)
        2    0.000    0.000  140.339   70.170 core.py:703(invoke)
        1    0.000    0.000  140.339  140.339 cmdline.py:45(scrape_file)
        1    0.000    0.000  140.338  140.338 scraper.py:285(scrape)
        1    0.000    0.000  140.335  140.335 scraper.py:140(detect_filetype)
        7    0.000    0.000  140.334   20.048 scraper.py:464(_update_filetype)
        1    0.000    0.000  140.312  140.312 detectors.py:182(detect)
        1    0.000    0.000  140.263  140.263 detectors.py:100(identify)
        1    0.000    0.000  140.263  140.263 fido.py:340(identify_file)
        1    0.003    0.003  140.262  140.262 fido.py:665(match_formats)
       40    0.000    0.000  140.107    3.503 re.py:198(search)
       56  140.097    2.502  140.097    2.502 {method 'search' of 're.Pattern' objects}
    360/3    0.001    0.000    0.323    0.108 <frozen importlib._bootstrap>:1002(_find_and_load)
    354/2    0.001    0.000    0.323    0.161 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    464/3    0.000    0.000    0.322    0.107 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
    340/3    0.001    0.000    0.322    0.107 <frozen importlib._bootstrap>:659(_load_unlocked)
    303/3    0.001    0.000    0.322    0.107 <frozen importlib._bootstrap_external>:844(exec_module)
        1    0.000    0.000    0.316    0.316 cmdline.py:1(<module>)
        1    0.000    0.000    0.270    0.270 scraper.py:1(<module>)
    94/61    0.000    0.000    0.251    0.004 {built-in method builtins.__import__}
        1    0.000    0.000    0.245    0.245 iterator.py:1(<module>)
   112/46    0.000    0.000    0.238    0.005 <frozen importlib._bootstrap>:1033(_handle_fromlist)
        1    0.000    0.000    0.195    0.195 wand_extractor.py:1(<module>)
     1668    0.005    0.000    0.190    0.000 re.py:289(_compile)
        1    0.000    0.000    0.188    0.188 image.py:1(<module>)

We're looking for a function where we see a clear dip in the spent cumulative time. This case is particularly clear as there are 56 calls to method search of re.Pattern and we're spending the whole 140 seconds matching these regular expressions. Sorting by cumulative time means that we can follow up from there to see which functions called it. We clearly see that the effective traceback is:

scraper -> detectors -> fido -> re

Analyzing FIDO behavior

FIDO works by scanning files for signatures in the PRONOM registry. These signatures are regular expressions, which define patterns found in specific file formats. Take the above DPX images as an example. We can search PRONOM for the signatures and find:

Description     Magic number in first four bytes: XPDS indicates little-endian
                byte order; 4 byte offset to image; Version 1.0

Position type   Absolute from BOF
Offset          0
Maximum Offset  0
Byte order
Value           58504453{4}56312E30

stating that all v1.0 little-endian DPX images start with the pattern 58504453{4}56312E30.

58504453: Exact four bytes (XPDS)
{4}: Arbitrary four bytes
56312E30: Exact four bytes (v1.0)

We can compare this to our test images and see that they indeed match this pattern:

$ xxd bright.dpx | head -1
00000000: 5850 4453 8006 0000 5631 2e30 0000 0000  XPDS....V1.0....

FIDO iterates these signatures and tries to find matches in the analyzed file. We can modify FIDO to keep track of how long each signature check takes and print the statistics. We can accomplish this with the following changes to upstream FIDO v1.6.1 source code:

diff --git a/fido/fido.py b/fido/fido.py
index 3b5bc60..e1a337d 100755
--- a/fido/fido.py
+++ b/fido/fido.py
@@ -693,11 +693,14 @@ class Fido:
         """
         self.current_count += 1
         result = []
+        times = []
+        import time
         for format in self.formats:
             try:
                 self.current_format = format
                 if self.as_good_as_any(format, result):
                     for sig in self.get_signatures(format):
+                        time_start = time.time()
                         self.current_sig = sig
                         success = True
                         for pat in self.get_patterns(sig):
@@ -723,6 +726,8 @@ class Fido:
                                     break
                         if success:
                             result.append((format, sig.findtext("name")))
+                        time_end = time.time()
+                        times.append((sig, time_end-time_start))
             except Exception as e:
                 sys.stderr.write(str(e) + "\n")
                 continue
@@ -731,6 +736,10 @@ class Fido:
             # sys.stdout.write('***', self.get_puid(format), regex)

         result = [match for match in result if self.as_good_as_any(match[0], result)]
+
+        for i in times:
+            print(f"{i[1]:.8f}\t{i[0].findtext('name')}")
+
         return result

     def match_extensions(self, filename):

Running the code above and sorting the output gives us the ten slowest signature checks:

$ fido bright.dpx 2>/dev/null | sort -g | tail -10
0.00032163      Drawing Interchange File Format 1.2
0.00033593      Drawing Interchange File Format 2.2
0.00037837      Drawing Interchange File Format 1.0
0.00038409      Interlis transfer file 2.2 b
0.00039887      Interlis 2.3 Transfer File 2
0.00049639      BibTeX Database File
0.00053000      HFS
0.00067973      Apple Partition Map ISO 9660 Hybrid, sig n.1
0.00070667      Apple Partition Map ISO 9660 Hybrid, sig n.2
133.29057360    MPEG-1/2 Audio Layer 3

Essentially the whole time is spent checking if the bright DPX file is an MP3 audio file. For dark DPX images this check takes 0.1 milliseconds, which is roughly a million times, or six orders of magnitude, faster.

$ fido dark.dpx 2>/dev/null | grep 'MPEG-1/2'
0.00017285      MPEG-1/2 Audio Layer 3

Bringing it all together

At this point, astute readers might already have a hunch on why the brightness of the image affects the code performance. To recap, we have identified:

FIDO running PRONOM signature checks as the bottleneck.
MPEG-1/2 Audio Layer 3 signatures as the overwhelmingly slowest checks: https://www.nationalarchives.gov.uk/pronom/fmt/134.
Brightness of the image actually affecting the program behavior as otherwise identical darker image doesn't reproduce the bug.

Let's pull out the first MPEG-1/2 Audio Layer 3 signature FIDO checks and study that against both images.

<signature>
  <name>MPEG-1 Audio Layer 3 with ID3v2 Tag</name>
  <note>Regularly-spaced frame headers should always be discoverable near EOF. An ID3v1 tag of up to 355 bytes may be present at EOF.</note>
  <pattern>
    <position>EOF</position>
    <pronom_pattern>FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{47-1795}</pronom_pattern>
    <regex>(?s)\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{47,1795}\Z</regex>
  </pattern>
  <pattern>
    <position>BOF</position>
    <pronom_pattern>494433</pronom_pattern>
    <regex>(?s)\AID3</regex>
  </pattern>
</signature>

The relevant part here is the regex element, which is the actual regex that FIDO is executing. It is looking for multiple MP3 frames at the end of the file. Single frame that this specific signature is looking for is:

fffb[10-eb].{46,1439}

fffb: Exact two bytes
[10-eb]: One arbitrary byte between 0x10 and 0xeb
.{46,1439}: 46-1439 arbitrary bytes

Comparing this to our two images:

$ xxd bright.dpx | tail -1
017bb670: ebff fbeb fffb ebff fbeb fffb ebff fbeb  ................

$ xxd dark.dpx | tail -1
017bb670: 1400 0414 0004 1400 0414 0004 1400 0414  ................

we see that the bright image partially matches this pattern every three bytes. Every pixel of the bright DPX image is a possible MP3 frame header and FIDO has to search ahead and see if if matches the rest of the signature. The dark image doesn't match the regex at all, which makes it fast to evaluate.

We can compare our chosen color fffbeb to MP3 frame header specs:

AAAAAAAA AAABBCCD EEEEFFGH
11111111 11111011 11101011
      ff       fb       eb

which maps to:

MPEG Version 1 (ISO/IEC 11172-3)
Layer III
Not protected by CRC
Bitrate 320 kbps
Sampling rate 32000 Hz
Frame is padded with one extra slot
Arbitrary private bit is set.

First 11 bits of the MP3 frame header being always set also explains the higher probability of bright colors partially matching this pattern. Interestingly, all white ffffff wouldn't match this pattern so not all bright images are slow, but rather the specific colors that match the patterns checked by FIDO. Colors fffb10-fffbeb map to:

Truly a sunny picture.

Conclusions

We studied an intriguing bug where file format identification turned out to be three orders of magnitude slower than expected for bright DPX images. The bug was caused by each pixel of the bright image partially matching the MP3 frame header format and killing regular expression evaluation performance.

Understanding all this, the actual fix to DPX validation performance is trivial. We can run the MP3 regex last and skip it altogether if we have already identified the file as a DPX. This fixes DPX validation performance while ensuring that we do not break MP3 identification, but does possibly leave behind similar edge cases with other uncompressed file formats.

Similar regular expression problems aren't unheard of. One notable example is a major Cloudflare outage in 2019. Regular expressions are extremely powerful and can make it easier to formulate complex queries, but are also hard to understand and error prone. They can't always be avoided, but it's a good practice to pursue other avenues before falling back to regular expressions.

Juho Kuisma
0x5527DB198DF3508A