Project

Scanning ReDoS in Python Automatically

On the way to Brighton, UK 🇬🇧

Hi, this is Seokchan Yoon.

I'm currently taking a compiler course at KyungHee University this semester. Despite the short duration, I've had the pleasure of delving into intriguing topics, ranging from Lexing and Parsing algorithms to Abstract Syntax Trees (AST), Intermediate Representations (IR), and basic compiler optimization techniques. It has been an enlightening experience.

While learning about AST, I was particularly intrigued by its potential application in static vulnerability analysis. My curiosity led me to discover brilliant tools such as CodeQL, which is based on AST, and also to learn about the significant progress made in research fields like Taint Analysis.

ReDoS Vulnerability: A Deep Dive and Reflection on `CVE-2023-23969`

During my studies, I revisited the concept of Regular Expression Denial of Service (ReDoS). This is a type of application-level DoS attack that capitalizes on the complexity of certain regular expressions (regex) to exponentially increase the processing time. This computational overhead can render an application unresponsive, effectively denying service to its legitimate users.

A relatively recent vulnerability sprung to mind: 'Potential Denial-of-Service via Accept-Language Headers' (CVE-2023-23969). This vulnerability arose from the use of incorrect regex when parsing Accept-Language, which could trigger a ReDoS attack.

accept_language_re = _lazy_re_compile(
    r"""
        # "en", "en-au", "x-y-z", "es-419", "*"
        ([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)
        # Optional "q=1.00", "q=0.8"
        (?:\s*;\s*q=(0(?:\.[0-9]{,3})?|1(?:\.0{,3})?))?
        # Multiple accepts per header.
        (?:\s*,\s*|$)
    """,
    re.VERBOSE,
)

# ...

@functools.lru_cache(maxsize=1000)
def parse_accept_lang_header(lang_string):
    
    result = []
    pieces = accept_language_re.split(lang_string.lower())
    if pieces[-1]:
        return ()
    for i in range(0, len(pieces) - 1, 3):
        first, lang, priority = pieces[i : i + 3]
        if first:
            return ()
        if priority:
            priority = float(priority)
        else:
            priority = 1.0
        result.append((lang, priority))
    result.sort(key=lambda k: k[1], reverse=True)
    return tuple(result)

Django source code

The above code is part of Django's source code. To summarize, it incorrectly uses the star notation when parsing the Accept-Language header, which can trigger a ReDoS attack. It can parse strings like the one below, but the longer the string, the more computing resources are consumed.

"xxxxxxxx" + "-xxxxxxxx" * 500

ReDoS payload

We call this 'ReDoS Attack'

Conceiving a Python-based ReDoS Detection Solution

It made me ponder - wouldn't it be possible to preemptively detect ReDoS by extracting arguments used in regex-related functions like re.compile() using the ast module in Python? Spurred by this idea, I decided to create a Proof-of-Concept (PoC) code.

You can check out my idea and the corresponding PoC in the repository linked below.

Here's the concept I've envisioned:

Parse the Python code into an AST tree using the ast module.
Extract all regex strings used in functions like re.compile(), match(), search(), and fullmatch().
Fuzz the extracted regex strings using the atheris module. (dynamic method)
Find nested *, + notations. (static method)
Derive the ReDoS vulnerabilities.

I was able to find vulnerabilities in simple codes like the one below using this approach.

import re
import time


for n in range(100):
    start_time = time.time()
    re.match("(bb|b.)*a", "b" * n)
    print(n, time.time() - start_time)

test.py

$ python3 scan.py
INFO: Using built-in libfuzzer
WARNING: Failed to find function "__sanitizer_acquire_crash_state". Reason dlsym(RTLD_DEFAULT, __sanitizer_acquire_crash_state): symbol not found.
WARNING: Failed to find function "__sanitizer_print_stack_trace". Reason dlsym(RTLD_DEFAULT, __sanitizer_print_stack_trace): symbol not found.
WARNING: Failed to find function "__sanitizer_set_death_callback". Reason dlsym(RTLD_DEFAULT, __sanitizer_set_death_callback): symbol not found.
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2780101204
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2	INITED cov: 4 ft: 4 corp: 1/1b exec/s: 0 rss: 38Mb
Suspicious pattern: (bb|b.)*a
#59057	NEW    cov: 5 ft: 5 corp: 2/50b lim: 589 exec/s: 59057 rss: 39Mb L: 49/49 MS: 5 CopyPart-ChangeBit-ChangeByte-ChangeBit-InsertRepeatedBytes-
Suspicious pattern: (bb|b.)*a

executed result

You can see that you have found a vulnerable regex string (bb|b.)*a

However, it's still challenging to identify actual vulnerabilities with such short codes. Particularly, there's an issue with the fuzzer not working properly when the regex is long. In such cases, it might be possible to divide the regex string into multiple parts and fuzz them separately, following a divide-and-conquer approach. This issue is certainly a challenge that needs to be addressed.

Scanning ReDoS in Python Automatically

ReDoS Vulnerability: A Deep Dive and Reflection on `CVE-2023-23969`

Conceiving a Python-based ReDoS Detection Solution

Read next

Kyung Hee University's SW Security Competition: From Concept to Reality

Project: Chrome Dino Game with Body Gesture

ReDoS Vulnerability: A Deep Dive and Reflection on CVE-2023-23969

Conceiving a Python-based ReDoS Detection Solution

Related Articles

Read next

ReDoS Vulnerability: A Deep Dive and Reflection on `CVE-2023-23969`