Scanning ReDoS in Python Automatically

Scanning ReDoS in Python Automatically
On the way to Brighton, UK 🇬🇧

Hi, this is Seokchan Yoon.

I'm currently taking a compiler course at KyungHee University this semester. Despite the short duration, I've had the pleasure of delving into intriguing topics, ranging from Lexing and Parsing algorithms to Abstract Syntax Trees (AST), Intermediate Representations (IR), and basic compiler optimization techniques. It has been an enlightening experience.

While learning about AST, I was particularly intrigued by its potential application in static vulnerability analysis. My curiosity led me to discover brilliant tools such as CodeQL, which is based on AST, and also to learn about the significant progress made in research fields like Taint Analysis.

ReDoS Vulnerability: A Deep Dive and Reflection on CVE-2023-23969

During my studies, I revisited the concept of Regular Expression Denial of Service (ReDoS). This is a type of application-level DoS attack that capitalizes on the complexity of certain regular expressions (regex) to exponentially increase the processing time. This computational overhead can render an application unresponsive, effectively denying service to its legitimate users.

A relatively recent vulnerability sprung to mind: 'Potential Denial-of-Service via Accept-Language Headers' (CVE-2023-23969). This vulnerability arose from the use of incorrect regex when parsing Accept-Language, which could trigger a ReDoS attack.

CVE-2023-23969: Potential denial-of-service via Accept-Language headers
Overview Hi guys, this is Seokchan Yoon. The Django security vulnerability has released on February 1, 2023. Because I have big interest in Django, so I analyze this issue just in time. To see the detail for this issue, click the link below. Django security releases issued: 4.1.6,
accept_language_re = _lazy_re_compile(
    r"""
        # "en", "en-au", "x-y-z", "es-419", "*"
        ([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)
        # Optional "q=1.00", "q=0.8"
        (?:\s*;\s*q=(0(?:\.[0-9]{,3})?|1(?:\.0{,3})?))?
        # Multiple accepts per header.
        (?:\s*,\s*|$)
    """,
    re.VERBOSE,
)

# ...

@functools.lru_cache(maxsize=1000)
def parse_accept_lang_header(lang_string):
    
    result = []
    pieces = accept_language_re.split(lang_string.lower())
    if pieces[-1]:
        return ()
    for i in range(0, len(pieces) - 1, 3):
        first, lang, priority = pieces[i : i + 3]
        if first:
            return ()
        if priority:
            priority = float(priority)
        else:
            priority = 1.0
        result.append((lang, priority))
    result.sort(key=lambda k: k[1], reverse=True)
    return tuple(result)
Django source code

The above code is part of Django's source code. To summarize, it incorrectly uses the star notation when parsing the Accept-Language header, which can trigger a ReDoS attack. It can parse strings like the one below, but the longer the string, the more computing resources are consumed.

"xxxxxxxx" + "-xxxxxxxx" * 500
ReDoS payload

We call this 'ReDoS Attack'

Conceiving a Python-based ReDoS Detection Solution

It made me ponder - wouldn't it be possible to preemptively detect ReDoS by extracting arguments used in regex-related functions like re.compile() using the ast module in Python? Spurred by this idea, I decided to create a Proof-of-Concept (PoC) code.

You can check out my idea and the corresponding PoC in the repository linked below.

GitHub - ch4n3-yoon/ReDoS-scanner: really simple scanner for ReDoS vulnerability
really simple scanner for ReDoS vulnerability. Contribute to ch4n3-yoon/ReDoS-scanner development by creating an account on GitHub.

Here's the concept I've envisioned:

  1. Parse the Python code into an AST tree using the ast module.
  2. Extract all regex strings used in functions like re.compile(), match(), search(), and fullmatch().
  3. Fuzz the extracted regex strings using the atheris module. (dynamic method)
  4. Find nested *, + notations. (static method)
  5. Derive the ReDoS vulnerabilities.

I was able to find vulnerabilities in simple codes like the one below using this approach.

import re
import time


for n in range(100):
    start_time = time.time()
    re.match("(bb|b.)*a", "b" * n)
    print(n, time.time() - start_time)
test.py
$ python3 scan.py
INFO: Using built-in libfuzzer
WARNING: Failed to find function "__sanitizer_acquire_crash_state". Reason dlsym(RTLD_DEFAULT, __sanitizer_acquire_crash_state): symbol not found.
WARNING: Failed to find function "__sanitizer_print_stack_trace". Reason dlsym(RTLD_DEFAULT, __sanitizer_print_stack_trace): symbol not found.
WARNING: Failed to find function "__sanitizer_set_death_callback". Reason dlsym(RTLD_DEFAULT, __sanitizer_set_death_callback): symbol not found.
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2780101204
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2	INITED cov: 4 ft: 4 corp: 1/1b exec/s: 0 rss: 38Mb
Suspicious pattern: (bb|b.)*a
#59057	NEW    cov: 5 ft: 5 corp: 2/50b lim: 589 exec/s: 59057 rss: 39Mb L: 49/49 MS: 5 CopyPart-ChangeBit-ChangeByte-ChangeBit-InsertRepeatedBytes-
Suspicious pattern: (bb|b.)*a
executed result

You can see that you have found a vulnerable regex string (bb|b.)*a

However, it's still challenging to identify actual vulnerabilities with such short codes. Particularly, there's an issue with the fuzzer not working properly when the regex is long. In such cases, it might be possible to divide the regex string into multiple parts and fuzz them separately, following a divide-and-conquer approach. This issue is certainly a challenge that needs to be addressed.

How to fix a ReDoS | The GitHub Blog
Code scanning detects ReDoS vulnerabilities automatically, but fixing them isn’t always easy. This blog post describes a 4-step strategy for fixing ReDoS bugs.
Detecting SQL injections in Python code using AST
Detecting SQL Injections in Python code using abstract syntax trees
Fuzzing Django Applications With the Atheris Fuzzing Engine
This week Google released Atheris, libFuzzer adapter for Python. Let’s use it to test a real-world web application.