[Challenges] LeetCode — Solve Regular Expression Matching using FA

Zong Yuan
11 min readAug 31, 2021

Even though it will be easier if I use if else & for loop to solve the problem of regular expression matching, but I want to take a challenge to use the FA that learn in the school to tackle it.

The problem description can be found at Leetcode.

Finite Automaton (FA)

First, let’s revise what is Finite Automaton. Finite Automaton is a state machine that decide the next state according to the current state and next input. If the last state of the input or the input didn’t have the next state, then we can say that the input is not match the finite automation.

There are two type of finite automation, deterministic finite automaton and nondeterministic finite automaton.

Deterministic Finite Automaton (DFA)

Deterministic finite automaton has two restrictions

  1. Every state transition required an input
  2. Every input can only trigger an unique state transition on current state

The yellow circle (s0) is the initial state and green circle are the final state. Every round will read an input and make a state transition accordingly. For example, if current state is s0 and input is 0, the next state will be s1. If the current state is s0 and input is 1, then next state will back to s0 too. At the end of input, if current state is one of the final state, then the input is matching the DFA. If the last state is not final state, or current state not able to move to next state according to the input (such as get an input 1 at state s4), then the input is not matching the DFA.

Nondeterministic Finite Automaton (NFA)

Nondeterministic finite automaton is similar to DFA, just without the 2 restrictions. In other words, NFA can have multiple possible state transition according to 1 input, such as s1 can move to s2 or back to s1 if read an input 0. Other than this, NFA also allow empty input (ε), like s0 can have ε as input.

Solution Idea

My idea on solution is split the problem to two parts

  1. Build DFA/NFA from pattern
  2. Validate input with the DFA/NFA

I had failed to continue with DFA as solution, but I wrote down the process to validate my idea, you can skip to NFA part if you are not interested with it.

DFA

First, I wanted to try with DFA since I believe DFA will achieve better performance than NFA.

Let’s list down several possible cases on pattern with DFA.

  1. abc
A very basic DFA

Just a very basic DFA then only 1 path to next state.

2. a*bc

Added a loop path a for the a*

Since a can be zero or more, so we can use a loop back path to s0 to indicate the *.

3. a*ab

If we design as last pattern, it become NFA

Even the pattern similar to last pattern, but as you can see from the graph, the s0 has two path with same input, which mean it is NFA. Therefore, we have to enhance it to become DFA by modifying the state s0 and s1.

Move the loop path on s0 to s1, the result remains but the graph has become DFA now

By moving the a* to behind a (aa*b), we will get the same results but the graph has became a DFA now.

So now we have Rule #1

If * followed by same characters, move the * until the last of sequence same character
Example: a*ab = aa*b, a*aab = aaa*b

4. a*b*c*d

Every state with * will have a path to following states until first state without *

As showed in the graph, s0 has to put a path to s1, s2 and s3 directly to fulfill the pattern.

So here we got Rule #2

Every * state will have a path to each following state until the first non * state.

6. a*a*b

If design DFA according to a*a*b, the graph will become NFA (s0 has two paths with input a)

For two or more consecutive * state with same pattern (eg. a*a*a*, b*b*,…), the graph will become NFA. However, since a*a* is equals to a*, we can minimized the pattern first before construct the DFA.

So as mentioned above, a*a*b equals to a*b .

The graph now is DFA

Here we got Rule #3

If there are consecutive same * pattern, we should always combine the same * pattern to 1 * pattern first before construct DFA.

7. .*c

There will be a come back path to .* state

For .* (any character), since it accept all characters, it’s next pattern become important on DFA design. As you can see from the graph above, the next pattern character will exclude from .* state loop path, and next state will have a path with all character except the next pattern character.

8. .*b*c

Things become really complicated here, I have to combine Rule #2 with .* which make the whole graph’s path very complicated.

s1 return to s0 if input is not b or c

Let’s consider more complicated pattern .*b*ca , the graph should be as below.

The complexity of the DFA grow as s1 have to return to s0 with input not a,b,c; s2 have to return to s0 with input not a,c; s3 have to return to s0 with input not a;

Now imagine .*b*cabcde or .*b*a*d*abced, the DFA complexity just grow with scary speed.

When designing for NFA, I realize that .*a*bcde is same as .*bcde, however, it still don’t help on the complexity issue.

Until here, I am stuck with solve the problem by DFA. After did some researches, I found there are some articles suggest NFA as solution (eg, https://swtch.com/~rsc/regexp/regexp1.html). So, let’s try with NFA.

NFA

Since NFA don’t have the restrictions as DFA, the NFA construction according to pattern will be easier compared with DFA.

List down the NFA on each pattern.

  1. abc
Same with DFA

2. a*b

When reach state s1 and get next input (a or b), it will moving to s0 with ε path since can’t find any path except the ε path

Since NFA allowed empty string (ε) as input, for a* pattern, we will create a new state s1 and path a from s0 to s1, and then use an empty string path from s1 to s0.

3. a*ab

9Diamond represent split state, which will be used as checkpoints later.

Diamond shape element is introduced here as a split state, a split state is a state that has ambiguous path (for example, 2 a paths from s0). A split state can be viewed as a checkpoint at validate steps later, so when the input validation failed, it can go back to last checkpoint and try with another path.

4. a*a*b

As you can see above s0 + s1 and s2 + s3 are very similar, according to DFA section, we knew that the a*a*b can be reduce as a*b . Even though the graph above is correct, but it will cause the validation steps do extra loop at s2 and s3 there, so we will do the pattern reduction to enhance the performance.

5. da*b*c

As you can see from above, if the * pattern is not the first pattern, there will be an extra state (s2 or s4) that will act as split state to the looping state and next state. The path enter from last state to the extra splitting state will always be empty string.

6. .*c

Since NFA allows ambiguous paths, we can treat the . pattern as other string pattern.

7. .*a*b

.*b is same as .*a*b , so for performance wise, we will reduce the pattern to .*b before proceed NFA construction.

8. .*bcd

As you can see, the graph is simpler compared with DFA version

The NFA complexity of pattern . is much more simpler compared with DFA version, which mean I will start the problem solving with the NFA idea.

TDD Environment — Traversal

Let’s proceed the development with TDD practice. I will start with traversal part first so after traversal is ready, it can be used to validate the result from the state building part.

First, prepare the State class. The class included a boolean to check if the state is final state, and also a Map to indicate path to next state.

private final static Character DOT = '.';
private final static Character EMPTY = '\0';
static class State {
boolean isFinal;
Map<Character, List<State>> paths;
State() {
isFinal = false;
paths = new HashMap<>();
}
void addPath(Character input, State state) {
List<State> nextStates = paths.get(input);
if (nextStates == null) {
nextStates = new LinkedList<State>();
paths.put(input, nextStates);
}
nextStates.add(state);
}
}

Next, prepare the interface.

public static boolean isMatch(State startState, String input) {
return true;
}

Final, prepare the test cases.

public static void assertState(String name, State startState, String input, boolean expectedResult) {
boolean result = isMatch(startState, input);
if (result == expectedResult) {
System.out.println("[" + name + "] [" + input + "] Assertion Success");
} else {
System.out.println("[" + name + "] [" + input + "] Assertion Failed, Expected " + expectedResult + " but Actual " + result);
}
}
public static void main(String[] args) throws Exception {
/*
* Test Case 0 Pattern: abc
*/
// State Construct
State test0Start = new State();
State test0State1 = new State();
test0Start.addPath('a', test0State1);
State test0State2 = new State();
test0State1.addPath('b', test0State2);
State test0State3 = new State();
test0State2.addPath('c', test0State3);
test0State3.isFinal = true;
// State Traversal
// Input: abc, Expected: true
assertState("TestCase0", test0Start, "abc", true);
// Input: aabc, Expected: false
assertState("TestCase0", test0Start, "aabc", false);
// Input: bca, Expected: false
assertState("TestCase0", test0Start, "bca", false);
// Input: ab, Expected: false
assertState("TestCase0", test0Start, "ab", false);
// Input: abcd, Expected: false
assertState("TestCase0", test0Start, "abcd", false);
/*
* Test Case 1 Pattern: a*b
*/
...
/*
* Test Case 2 Pattern: a*ab
*/
...
/*
* Test Case 3 Pattern: .*c
*/
...
/*
* Test Case 4 Pattern: .*bcd
*/
...
/*
* Test Case 5 Pattern: abc*.*dd
*/
...
/*
* Test Case 6 Pattern: b*abd.d.*e
*/
...
/*
* Test Case 7 Pattern: bcda*
*/
...
}

Implementation — Traversal

After prepare the test cases, time to implement the traversal functions. The idea is using recursion to select a matched path moving to the next state, so if selected path is not working, the nature of recursion will bring the logic back to the last working state and continue to find next matched path.

public static boolean isMatch(State startState, String input) {
return traversal(startState, input, 0);
}
public static boolean traversal(State currentState, String input, int index) {
if (input.length() <= index) {
if (currentState.isFinal) {
return true;
} else if (currentState.paths.containsKey(EMPTY)) {
// Try to walk empty string path if exists
for (State nextState : currentState.paths.get(EMPTY)) {
if (traversal(nextState, input, index)) {
return true;
}
}
}
// Reach here mean failed to find final state
return false;
}
/**
* Priority:
* 1. Try find exact same pattern path
* 2. Try find EMPTY pattern path
* 3. Try find DOT pattern path
*/
Character nextInputChar = input.charAt(index);
if (currentState.paths.containsKey(nextInputChar)) {
for (State nextState : currentState.paths.get(nextInputChar)) {
if (traversal(nextState, input, index + 1)) {
return true;
}
}
}
if (currentState.paths.containsKey(EMPTY)) {
for (State nextState : currentState.paths.get(EMPTY)) {
// Since is empty path, the input isn't consume, so index not increment
if (traversal(nextState, input, index)) {
return true;
}
}
}
if (currentState.paths.containsKey(DOT)) {
for (State nextState : currentState.paths.get(DOT)) {
if (traversal(nextState, input, index + 1)) {
return true;
}
}
}
// Reach here mean failed to find any path, return false
return false;
}

TDD Environment — State Building

Since we know traversal part is workable, now we can copy the test cases from traversal part, and modify them a bit to let the traversal actually testing on state building part results.

First, prepare the build function.

public static State build(String pattern) {
State startState = new State();
// Waiting for implementation
return startState;
}

Then prepare the test cases.

public static void main(String[] args) throws Exception {
/*
* Test Case 0 Pattern: abc
*/
// State Construct
State test0Start = build("abc");
// State Traversal
// Input: abc, Expected: true
assertState("TestCase0", test0Start, "abc", true);
// Input: aabc, Expected: false
assertState("TestCase0", test0Start, "aabc", false);
// Input: bca, Expected: false
assertState("TestCase0", test0Start, "bca", false);
// Input: ab, Expected: false
assertState("TestCase0", test0Start, "ab", false);
// Input: abcd, Expected: false
assertState("TestCase0", test0Start, "abcd", false);
/*
* Test Case 1 Pattern: a*b
*/
...
/*
* Test Case 2 Pattern: a*ab
*/
...
/*
* Test Case 3 Pattern: .*c
*/
...
/*
* Test Case 4 Pattern: .*bcd
*/
...
/*
* Test Case 5 Pattern: abc*.*dd
*/
...
/*
* Test Case 6 Pattern: b*abd.d.*e
*/
...
/*
* Test Case 7 Pattern: bcda*
*/
...
}

Implementation — State Building

As first, we need to process the pattern before build the state.

public static String preBuild(String pattern) {
String result = "";
int index = 0;
String current = null, last = null;
while (index < pattern.length()) {
if (pattern.length() > index + 1
&& pattern.charAt(index + 1) == STAR.charValue()) {
current = pattern.substring(index, index + 2);
index++;
/**
* If last == current (a*a*, .*.*), skip current
* If last = .*, skip current
* If next = .*, skip current
*/
if (current.equals(last) || DOT_STAR.equals(last)) {
index++;
continue;
} else if (pattern.length() >= index + 3) {
String next = pattern.substring(index + 1, index + );
if (DOT_STAR.equals(next)) {
index++;
continue;
}
}
} else {
current = pattern.substring(index, index + 1);
}
result = result + current;
last = current;
index++;
}
return result;
}

After processed the pattern, build the state from the pattern. (The building logic can refer to NFA and also the traversal test cases.)

public static State build(String pattern) {
pattern = preBuild(pattern);
// Initial state
State startState = new State();
State currentState = startState;
int index = 0;
while (index < pattern.length()) {
Character c = pattern.charAt(index);
if (pattern.length() > index + 1
&& pattern.charAt(index + 1) == STAR.charValue()) {
index++;
// Create an empty state as loop states entry
State newEmptyState = new State();
State newStarState = new State();
currentState.addPath(EMPTY, newEmptyState);
newEmptyState.addPath(c, newStarState);
newStarState.addPath(EMPTY, newEmptyState);
currentState = newEmptyState;
} else {
State newState = new State();
currentState.addPath(c, newState);
currentState = newState;
}
index++;
}
// The last state will be final state
currentState.isFinal = true;
return startState;
}

Final Result

I will skip the process that copy the code to Leetcode, the final solution on Leetcode is look like

public boolean isMatch(String s, String p) {
State start = build(p);
return isMatch(start, s);
}

As expected from test cases, the result is success. However, the runtime and memory usage is quite bad compared with other submissions.

This is expected since I used a lot runtime and memory on building states, but my initial objective is solving this problem with Finite Automation not challenging best runtime and memory usage. Therefore I am quite satisfied with my result.

That’s all for this episode, next time will try with another problem. Thanks for reading.

--

--