Compiler From Scratch: Phase 1 - Tokenizer Generator 004: Regex Pattern to NFA


Streamed on 2024-08-09 (https://www.twitch.tv/thediscouragerofhesitancy)

Zero Dependencies Programming!

With some stolen code in place to store all of our state, I can start the Pattern to NFA processing. This process steps through a regex pattern one character at a time. Depending on the character, we either add a "fragment" to our list of fragments or update some piece of state, like "Are we inside square brackets?" or "Are we escaping the next character?".
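The scanning step can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the stream: the `Fragment` type, the `scan_pattern` name, and the choice of which characters count as operators are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed operator set for this sketch; the real converter may differ.
OPERATORS = set("*+?|()")

@dataclass
class Fragment:
    kind: str   # "literal", "class", or "operator" (hypothetical labels)
    text: str

def scan_pattern(pattern):
    """Walk the pattern one character at a time, building a fragment list."""
    fragments = []
    escaping = False      # "Are we escaping the next character?"
    in_brackets = False   # "Are we inside square brackets?"
    bracket_chars = []
    for ch in pattern:
        if escaping:
            # The previous character was a backslash: take this one literally.
            if in_brackets:
                bracket_chars.append(ch)
            else:
                fragments.append(Fragment("literal", ch))
            escaping = False
        elif ch == "\\":
            escaping = True
        elif in_brackets:
            if ch == "]":
                # Closing bracket: emit the whole set as one fragment.
                fragments.append(Fragment("class", "".join(bracket_chars)))
                bracket_chars = []
                in_brackets = False
            else:
                bracket_chars.append(ch)
        elif ch == "[":
            in_brackets = True
        elif ch in OPERATORS:
            fragments.append(Fragment("operator", ch))
        else:
            fragments.append(Fragment("literal", ch))
    if escaping or in_brackets:
        raise ValueError("unterminated escape or bracket expression")
    return fragments
```

Calling `scan_pattern(r"a[bc]\*+")` under these assumptions yields four fragments: a literal, a bracket class, an escaped literal `*`, and a `+` operator.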

Then we run through the list multiple times, once for each operator precedence level. Whenever an operator at the current precedence level is found, we reduce the list by folding its neighboring fragments into it as its operands. If everything parses correctly, we end up with a single fragment that contains the root state of our NFA. A decent amount of code was stolen from my old sandbox, as there is a lot of detailed bookkeeping involved in these processes.
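The multi-pass reduction might look something like the sketch below. It is a simplified illustration under several assumptions: items are pre-tagged tuples such as `("lit", "a")` or `("op", "*")`, concatenation is implicit between adjacent operands, parentheses are left out, and the result is a nested tuple tree rather than real NFA states. The name `reduce_fragments` is made up for the example.

```python
POSTFIX = {"*", "+", "?"}

def is_op(item, symbols):
    return item[0] == "op" and item[1] in symbols

def reduce_fragments(items):
    work = list(items)

    # Pass 1 (highest precedence): postfix operators absorb their left neighbor.
    i = 0
    while i < len(work):
        if is_op(work[i], POSTFIX):
            if i == 0 or work[i - 1][0] == "op":
                raise ValueError("nothing to repeat")
            work[i - 1:i + 1] = [("repeat", work[i][1], work[i - 1])]
            # The merged node sits at i - 1, so index i is already the next item.
        else:
            i += 1

    # Pass 2: implicit concatenation joins adjacent operands, left to right.
    i = 0
    while i + 1 < len(work):
        if work[i][0] != "op" and work[i + 1][0] != "op":
            work[i:i + 2] = [("cat", work[i], work[i + 1])]
        else:
            i += 1

    # Pass 3 (lowest precedence): alternation takes both neighbors.
    i = 0
    while i < len(work):
        if is_op(work[i], {"|"}):
            if (i == 0 or i + 1 >= len(work)
                    or work[i - 1][0] == "op" or work[i + 1][0] == "op"):
                raise ValueError("'|' needs an operand on both sides")
            work[i - 1:i + 2] = [("alt", work[i - 1], work[i + 1])]
        else:
            i += 1

    if len(work) != 1:
        raise ValueError("pattern did not reduce to a single fragment")
    return work[0]
```

For the items of `ab*|c`, the first pass wraps `b` in a repeat node, the second joins `a` to it, and the third puts that concatenation and `c` under one alternation node, leaving a single root.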

I then added the ability to log the NFA as a formatted table, as well as a tree. I think the table is easier to read and trace.
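As a rough idea of what a table dump can look like, here is a small sketch. The `State` fields (`id`, `label`, `out_left`, `out_right`) and the column layout are assumptions for illustration, not the format shown on stream.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class State:
    id: int
    label: str                      # matched character, "split", or "match"
    out_left: Optional[int] = None  # id of the first out state, if any
    out_right: Optional[int] = None # id of the second out state, if any

def format_nfa_table(states):
    """Return one header line plus one row per state."""
    lines = [f"{'id':>3}  {'label':<6} {'left':>5} {'right':>5}"]
    for s in states:
        left = "-" if s.out_left is None else str(s.out_left)
        right = "-" if s.out_right is None else str(s.out_right)
        lines.append(f"{s.id:>3}  {s.label:<6} {left:>5} {right:>5}")
    return "\n".join(lines)

# Hand-built states for the pattern a*: a split that either loops
# through 'a' or falls through to the match state.
example = [
    State(0, "split", out_left=1, out_right=2),
    State(1, "a", out_left=0),
    State(2, "match"),
]
print(format_nfa_table(example))
```

With one row per state, tracing a match is just a matter of following ids from row to row, which is a large part of why a table tends to be easier to follow than a tree once the NFA contains loops.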

Then I started adding new features to my Regex to NFA converter, starting with non-greedy modifiers. The cardinality operators (+, *, and ?) can be made non-greedy by putting "?" after them. Surprisingly, this is a very simple change that just involves swapping the left and right out pointers for the given state. However, while making this change I found a bug in my "+" implementation that I had to fix, and because so much time had passed since I last played in the sandbox, the fix took me longer than it should have. But in the end, I got it all working.
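To see why swapping the out pointers is enough, here is a minimal sketch. It assumes a matcher that always tries the left out pointer before the right one, which is an assumption about priority and not necessarily how the matcher from the stream works. The `State` layout, `build_star`, and `first_match_length` are hypothetical names for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class State:
    kind: str                            # "char", "split", or "match"
    char: str = ""
    out_left: Optional["State"] = None
    out_right: Optional["State"] = None

def build_star(ch, greedy=True):
    """Build an NFA for `ch*` followed directly by a match state."""
    match = State("match")
    split = State("split")
    body = State("char", char=ch, out_left=split)  # loop back after matching
    # Greedy: prefer going around the loop again before leaving it.
    split.out_left, split.out_right = body, match
    if not greedy:
        # Non-greedy: the only change is swapping the two out pointers.
        split.out_left, split.out_right = split.out_right, split.out_left
    return split

def first_match_length(state, text, pos=0):
    """Backtracking walk that prefers out_left; returns the end of the
    first match found, or None if there is no match."""
    if state.kind == "match":
        return pos
    if state.kind == "char":
        if pos < len(text) and text[pos] == state.char:
            return first_match_length(state.out_left, text, pos + 1)
        return None
    # Split state: try the preferred branch first, then the other.
    result = first_match_length(state.out_left, text, pos)
    if result is None:
        result = first_match_length(state.out_right, text, pos)
    return result
```

Against the text `aaab`, the greedy build consumes all three `a` characters before leaving the loop, while the non-greedy build takes the exit branch first and stops after consuming none. The swap only has this effect in a matcher that respects branch order.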
