Always look on the bright side of file
This text is part of our blog series, where individuals write about mixed digital preservation themes in a free-form way. Our blog series is not intended as guiding rules nor instructions, but as an inspiration for ideas and discussion.
Recently we ran into an information package that timed out before we could finish the validation in production. The package wasn't particularly large and contained DPX images, which we've ingested for years without problems. Upon further examination it was noted that only the bright images were causing problems and validation of dark images was seemingly five hundred times faster.
In this blog post we delve into the technical details of this fascinating bug. We start by debugging, zero in on the root cause and devise a fix to the slowness.
Surely the brightness of the image has to be a red herring?
Reproducing the bug
Sure enough, we were provided a dark and a bright DPX image that reproduced this behavior. In place of the original images we've created two images: dark.dpx and bright.dpx using GIMP and ffmpeg. Dark.dpx is a 3840x2160 canvas filled with the color 0x000414 and bright.dpx is the negative of this, i.e., the same size canvas filled with the color 0xfffbeb.
We can reproduce the behavior by unzipping the images:
# Unzip $ gunzip *.gz # Check file format $ file *.dpx bright.dpx: DPX image data, little-endian, 3840x2160, left to right/top to bottom dark.dpx: DPX image data, little-endian, 3840x2160, left to right/top to bottom # Verify file checksums $ sha256sum *.dpx 8e3f67a50901ca0572fe3010189aef8a3ad8b10e315e155d5a1eb75e3bfffb20 bright.dpx cb8582f9cf16ef2e4d79f74334c4b42739971c81262024dbe1b144ce92cf6561 dark.dpx
and simply trying the scrape the files:
$ for file in bright.dpx dark.dpx time scraper scrape-file $file > /dev/null end ________________________________________________________ Executed in 142.17 secs fish external usr time 141.10 secs 809.00 micros 141.10 secs sys time 0.19 secs 153.00 micros 0.19 secs ________________________________________________________ Executed in 679.24 millis fish external usr time 499.83 millis 0.00 micros 499.83 millis sys time 163.17 millis 416.00 micros 162.75 millis
The bright image takes two minutes to validate while the dark image takes half a second, the only difference being the exact color chosen for the canvas.
Profiling python code performance
File-scraper runs a myriad of different tools behind the scenes. Identifying the exact tool causing the slowness can be done manually by reading the code and testing the commands one by one. However, there are also libraries that automate the whole process. This is known as code profiling and can be accomplished using the Python standard library profilers.
Let's start by running the bright image through cProfile and sorting by cumulative time spent in each function:
$ python3 -m cProfile -s cumtime /usr/bin/scraper scrape-file bright.dpx | head -50 . . <scraper output> . 881009 function calls (860708 primitive calls) in 140.657 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 305/1 0.000 0.000 140.657 140.657 {built-in method builtins.exec} 1 0.000 0.000 140.657 140.657 scraper:3(<module>) 1 0.000 0.000 140.340 140.340 core.py:1126(__call__) 1 0.000 0.000 140.340 140.340 core.py:981(main) 1 0.000 0.000 140.340 140.340 core.py:1625(invoke) 2 0.000 0.000 140.339 70.170 core.py:1384(invoke) 2 0.000 0.000 140.339 70.170 core.py:703(invoke) 1 0.000 0.000 140.339 140.339 cmdline.py:45(scrape_file) 1 0.000 0.000 140.338 140.338 scraper.py:285(scrape) 1 0.000 0.000 140.335 140.335 scraper.py:140(detect_filetype) 7 0.000 0.000 140.334 20.048 scraper.py:464(_update_filetype) 1 0.000 0.000 140.312 140.312 detectors.py:182(detect) 1 0.000 0.000 140.263 140.263 detectors.py:100(identify) 1 0.000 0.000 140.263 140.263 fido.py:340(identify_file) 1 0.003 0.003 140.262 140.262 fido.py:665(match_formats) 40 0.000 0.000 140.107 3.503 re.py:198(search) 56 140.097 2.502 140.097 2.502 {method 'search' of 're.Pattern' objects} 360/3 0.001 0.000 0.323 0.108 <frozen importlib._bootstrap>:1002(_find_and_load) 354/2 0.001 0.000 0.323 0.161 <frozen importlib._bootstrap>:967(_find_and_load_unlocked) 464/3 0.000 0.000 0.322 0.107 <frozen importlib._bootstrap>:220(_call_with_frames_removed) 340/3 0.001 0.000 0.322 0.107 <frozen importlib._bootstrap>:659(_load_unlocked) 303/3 0.001 0.000 0.322 0.107 <frozen importlib._bootstrap_external>:844(exec_module) 1 0.000 0.000 0.316 0.316 cmdline.py:1(<module>) 1 0.000 0.000 0.270 0.270 scraper.py:1(<module>) 94/61 0.000 0.000 0.251 0.004 {built-in method builtins.__import__} 1 0.000 0.000 0.245 0.245 iterator.py:1(<module>) 112/46 0.000 0.000 0.238 0.005 <frozen importlib._bootstrap>:1033(_handle_fromlist) 1 0.000 0.000 0.195 0.195 wand_extractor.py:1(<module>) 1668 0.005 0.000 0.190 0.000 re.py:289(_compile) 1 0.000 0.000 0.188 0.188 image.py:1(<module>)
We're looking for a function where we see a clear dip in the spent cumulative time. This case is particularly clear as there are 56 calls to method search of re.Pattern and we're spending the whole 140 seconds matching these regular expressions. Sorting by cumulative time means that we can follow up from there to see which functions called it. We clearly see that the effective traceback is:
scraper -> detectors -> fido -> re
Analyzing FIDO behavior
FIDO works by scanning files for signatures in the PRONOM registry. These signatures are regular expressions, which define patterns found in specific file formats. Take the above DPX images as an example. We can search PRONOM for the signatures and find:
Description Magic number in first four bytes: XPDS indicates little-endian
byte order; 4 byte offset to image; Version 1.0
Position type Absolute from BOF
Offset 0
Maximum Offset 0
Byte order
Value 58504453{4}56312E30
stating that all v1.0 little-endian DPX images start with the pattern 58504453{4}56312E30.
- 58504453: Exact four bytes (XPDS)
- {4}: Arbitrary four bytes
- 56312E30: Exact four bytes (v1.0)
We can compare this to our test images and see that they indeed match this pattern:
$ xxd bright.dpx | head -1 00000000: 5850 4453 8006 0000 5631 2e30 0000 0000 XPDS....V1.0....
FIDO iterates these signatures and tries to find matches in the analyzed file. We can modify FIDO to keep track of how long each signature check takes and print the statistics. We can accomplish this with the following changes to upstream FIDO v1.6.1 source code:
diff --git a/fido/fido.py b/fido/fido.py index 3b5bc60..e1a337d 100755 --- a/fido/fido.py +++ b/fido/fido.py @@ -693,11 +693,14 @@ class Fido: """ self.current_count += 1 result = [] + times = [] + import time for format in self.formats: try: self.current_format = format if self.as_good_as_any(format, result): for sig in self.get_signatures(format): + time_start = time.time() self.current_sig = sig success = True for pat in self.get_patterns(sig): @@ -723,6 +726,8 @@ class Fido: break if success: result.append((format, sig.findtext("name"))) + time_end = time.time() + times.append((sig, time_end-time_start)) except Exception as e: sys.stderr.write(str(e) + "\n") continue @@ -731,6 +736,10 @@ class Fido: # sys.stdout.write('***', self.get_puid(format), regex) result = [match for match in result if self.as_good_as_any(match[0], result)] + + for i in times: + print(f"{i[1]:.8f}\t{i[0].findtext('name')}") + return result def match_extensions(self, filename):
Running the code above and sorting the output gives us the ten slowest signature checks:
$ fido bright.dpx 2>/dev/null | sort -g | tail -10 0.00032163 Drawing Interchange File Format 1.2 0.00033593 Drawing Interchange File Format 2.2 0.00037837 Drawing Interchange File Format 1.0 0.00038409 Interlis transfer file 2.2 b 0.00039887 Interlis 2.3 Transfer File 2 0.00049639 BibTeX Database File 0.00053000 HFS 0.00067973 Apple Partition Map ISO 9660 Hybrid, sig n.1 0.00070667 Apple Partition Map ISO 9660 Hybrid, sig n.2 133.29057360 MPEG-1/2 Audio Layer 3
Essentially the whole time is spent checking if the bright DPX file is an MP3 audio file. For dark DPX images this check takes 0.1 milliseconds, which is roughly a million times, or six orders of magnitude, faster.
$ fido dark.dpx 2>/dev/null | grep 'MPEG-1/2' 0.00017285 MPEG-1/2 Audio Layer 3
Bringing it all together
At this point, astute readers might already have a hunch on why the brightness of the image affects the code performance. To recap, we have identified:
- FIDO running PRONOM signature checks as the bottleneck.
- MPEG-1/2 Audio Layer 3 signatures as the overwhelmingly slowest checks: https://www.nationalarchives.gov.uk/pronom/fmt/134.
- Brightness of the image actually affecting the program behavior as otherwise identical darker image doesn't reproduce the bug.
Let's pull out the first MPEG-1/2 Audio Layer 3 signature FIDO checks and study that against both images.
<signature> <name>MPEG-1 Audio Layer 3 with ID3v2 Tag</name> <note>Regularly-spaced frame headers should always be discoverable near EOF. An ID3v1 tag of up to 355 bytes may be present at EOF.</note> <pattern> <position>EOF</position> <pronom_pattern>FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{47-1795}</pronom_pattern> <regex>(?s)\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{47,1795}\Z</regex> </pattern> <pattern> <position>BOF</position> <pronom_pattern>494433</pronom_pattern> <regex>(?s)\AID3</regex> </pattern> </signature>
The relevant part here is the regex element, which is the actual regex that FIDO is executing. It is looking for multiple MP3 frames at the end of the file. Single frame that this specific signature is looking for is:
fffb[10-eb].{46,1439}
- fffb: Exact two bytes
- [10-eb]: One arbitrary byte between 0x10 and 0xeb
- .{46,1439}: 46-1439 arbitrary bytes
Comparing this to our two images:
$ xxd bright.dpx | tail -1 017bb670: ebff fbeb fffb ebff fbeb fffb ebff fbeb ................
$ xxd dark.dpx | tail -1 017bb670: 1400 0414 0004 1400 0414 0004 1400 0414 ................
we see that the bright image partially matches this pattern every three bytes. Every pixel of the bright DPX image is a possible MP3 frame header and FIDO has to search ahead and see if if matches the rest of the signature. The dark image doesn't match the regex at all, which makes it fast to evaluate.
We can compare our chosen color fffbeb to MP3 frame header specs:
AAAAAAAA AAABBCCD EEEEFFGH
11111111 11111011 11101011
ff fb eb
which maps to:
- MPEG Version 1 (ISO/IEC 11172-3)
- Layer III
- Not protected by CRC
- Bitrate 320 kbps
- Sampling rate 32000 Hz
- Frame is padded with one extra slot
- Arbitrary private bit is set.
First 11 bits of the MP3 frame header being always set also explains the higher probability of bright colors partially matching this pattern. Interestingly, all white ffffff wouldn't match this pattern so not all bright images are slow, but rather the specific colors that match the patterns checked by FIDO. Colors fffb10-fffbeb map to:

Truly a sunny picture.
Conclusions
We studied an intriguing bug where file format identification turned out to be three orders of magnitude slower than expected for bright DPX images. The bug was caused by each pixel of the bright image partially matching the MP3 frame header format and killing regular expression evaluation performance.
Understanding all this, the actual fix to DPX validation performance is trivial. We can run the MP3 regex last and skip it altogether if we have already identified the file as a DPX. This fixes DPX validation performance while ensuring that we do not break MP3 identification, but does possibly leave behind similar edge cases with other uncompressed file formats.
Similar regular expression problems aren't unheard of. One notable example is a major Cloudflare outage in 2019. Regular expressions are extremely powerful and can make it easier to formulate complex queries, but are also hard to understand and error prone. They can't always be avoided, but it's a good practice to pursue other avenues before falling back to regular expressions.
Juho Kuisma
0x5527DB198DF3508A