Scanning ReDoS in Python Automatically
Hi, this is Seokchan Yoon.
I'm currently taking a compiler course at KyungHee University this semester. Despite the short duration, I've had the pleasure of delving into intriguing topics, ranging from Lexing and Parsing algorithms to Abstract Syntax Trees (AST), Intermediate Representations (IR), and basic compiler optimization techniques. It has been an enlightening experience.
While learning about AST, I was particularly intrigued by its potential application in static vulnerability analysis. My curiosity led me to discover brilliant tools such as CodeQL, which is based on AST, and also to learn about the significant progress made in research fields like Taint Analysis.
ReDoS Vulnerability: A Deep Dive and Reflection on CVE-2023-23969
During my studies, I revisited the concept of Regular Expression Denial of Service (ReDoS). This is a type of application-level DoS attack that capitalizes on the complexity of certain regular expressions (regex) to exponentially increase the processing time. This computational overhead can render an application unresponsive, effectively denying service to its legitimate users.
A relatively recent vulnerability sprung to mind: 'Potential Denial-of-Service via Accept-Language Headers' (CVE-2023-23969). This vulnerability arose from the use of incorrect regex when parsing Accept-Language
, which could trigger a ReDoS attack.
The above code is part of Django's source code. To summarize, it incorrectly uses the star notation when parsing the Accept-Language
header, which can trigger a ReDoS attack. It can parse strings like the one below, but the longer the string, the more computing resources are consumed.
We call this 'ReDoS Attack'
Conceiving a Python-based ReDoS Detection Solution
It made me ponder - wouldn't it be possible to preemptively detect ReDoS by extracting arguments used in regex-related functions like re.compile()
using the ast
module in Python? Spurred by this idea, I decided to create a Proof-of-Concept (PoC) code.
You can check out my idea and the corresponding PoC in the repository linked below.
Here's the concept I've envisioned:
- Parse the Python code into an AST tree using the
ast
module. - Extract all regex strings used in functions like
re.compile()
,match()
,search()
, andfullmatch()
. - Fuzz the extracted regex strings using the
atheris
module. (dynamic method) - Find nested
*
,+
notations. (static method) - Derive the ReDoS vulnerabilities.
I was able to find vulnerabilities in simple codes like the one below using this approach.
You can see that you have found a vulnerable regex string (bb|b.)*a
However, it's still challenging to identify actual vulnerabilities with such short codes. Particularly, there's an issue with the fuzzer not working properly when the regex is long. In such cases, it might be possible to divide the regex string into multiple parts and fuzz them separately, following a divide-and-conquer approach. This issue is certainly a challenge that needs to be addressed.