Scanning ReDoS in Python Automatically

Hi, this is Seokchan Yoon.
I'm currently taking a compiler course at KyungHee University this semester. Despite the short duration, I've had the pleasure of delving into intriguing topics, ranging from Lexing and Parsing algorithms to Abstract Syntax Trees (AST), Intermediate Representations (IR), and basic compiler optimization techniques. It has been an enlightening experience.
While learning about AST, I was particularly intrigued by its potential application in static vulnerability analysis. My curiosity led me to discover brilliant tools such as CodeQL, which is based on AST, and also to learn about the significant progress made in research fields like Taint Analysis.
ReDoS Vulnerability: A Deep Dive and Reflection on CVE-2023-23969
During my studies, I revisited the concept of Regular Expression Denial of Service (ReDoS). This is a type of application-level DoS attack that capitalizes on the complexity of certain regular expressions (regex) to exponentially increase the processing time. This computational overhead can render an application unresponsive, effectively denying service to its legitimate users.
A relatively recent vulnerability sprung to mind: 'Potential Denial-of-Service via Accept-Language Headers' (CVE-2023-23969). This vulnerability arose from the use of incorrect regex when parsing Accept-Language
, which could trigger a ReDoS attack.

accept_language_re = _lazy_re_compile(
r"""
# "en", "en-au", "x-y-z", "es-419", "*"
([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)
# Optional "q=1.00", "q=0.8"
(?:\s*;\s*q=(0(?:\.[0-9]{,3})?|1(?:\.0{,3})?))?
# Multiple accepts per header.
(?:\s*,\s*|$)
""",
re.VERBOSE,
)
# ...
@functools.lru_cache(maxsize=1000)
def parse_accept_lang_header(lang_string):
result = []
pieces = accept_language_re.split(lang_string.lower())
if pieces[-1]:
return ()
for i in range(0, len(pieces) - 1, 3):
first, lang, priority = pieces[i : i + 3]
if first:
return ()
if priority:
priority = float(priority)
else:
priority = 1.0
result.append((lang, priority))
result.sort(key=lambda k: k[1], reverse=True)
return tuple(result)
The above code is part of Django's source code. To summarize, it incorrectly uses the star notation when parsing the Accept-Language
header, which can trigger a ReDoS attack. It can parse strings like the one below, but the longer the string, the more computing resources are consumed.
"xxxxxxxx" + "-xxxxxxxx" * 500
We call this 'ReDoS Attack'
Conceiving a Python-based ReDoS Detection Solution
It made me ponder - wouldn't it be possible to preemptively detect ReDoS by extracting arguments used in regex-related functions like re.compile()
using the ast
module in Python? Spurred by this idea, I decided to create a Proof-of-Concept (PoC) code.
You can check out my idea and the corresponding PoC in the repository linked below.
Here's the concept I've envisioned:
- Parse the Python code into an AST tree using the
ast
module. - Extract all regex strings used in functions like
re.compile()
,match()
,search()
, andfullmatch()
. - Fuzz the extracted regex strings using the
atheris
module. (dynamic method) - Find nested
*
,+
notations. (static method) - Derive the ReDoS vulnerabilities.
I was able to find vulnerabilities in simple codes like the one below using this approach.
import re
import time
for n in range(100):
start_time = time.time()
re.match("(bb|b.)*a", "b" * n)
print(n, time.time() - start_time)
$ python3 scan.py
INFO: Using built-in libfuzzer
WARNING: Failed to find function "__sanitizer_acquire_crash_state". Reason dlsym(RTLD_DEFAULT, __sanitizer_acquire_crash_state): symbol not found.
WARNING: Failed to find function "__sanitizer_print_stack_trace". Reason dlsym(RTLD_DEFAULT, __sanitizer_print_stack_trace): symbol not found.
WARNING: Failed to find function "__sanitizer_set_death_callback". Reason dlsym(RTLD_DEFAULT, __sanitizer_set_death_callback): symbol not found.
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2780101204
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2 INITED cov: 4 ft: 4 corp: 1/1b exec/s: 0 rss: 38Mb
Suspicious pattern: (bb|b.)*a
#59057 NEW cov: 5 ft: 5 corp: 2/50b lim: 589 exec/s: 59057 rss: 39Mb L: 49/49 MS: 5 CopyPart-ChangeBit-ChangeByte-ChangeBit-InsertRepeatedBytes-
Suspicious pattern: (bb|b.)*a
You can see that you have found a vulnerable regex string (bb|b.)*a
However, it's still challenging to identify actual vulnerabilities with such short codes. Particularly, there's an issue with the fuzzer not working properly when the regex is long. In such cases, it might be possible to divide the regex string into multiple parts and fuzz them separately, following a divide-and-conquer approach. This issue is certainly a challenge that needs to be addressed.
Related Articles


