Benchmark

REX Library

Input text

The text file for this benchmark is a version of concatenated Linux howto document. The file size is 39,422,105 bytes.

Patterns

  1. URI (protocol://server/path): ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?
  2. Email (name@server): ([^ @]+)@([^ @]+)
  3. Date (month/day/year): ([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)
  4. URI|Email: ([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)

Matching

We will compare the performance of REX Grep with the standard grep program distributed with all Linux systems. Note: All results are measured in seconds.

All matching is done using the patterns above with the sample text file.
# bzcat howto.bz2 > howto.txt
# time rexgrep -e "([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?" howto.txt
# time egrep -e "([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?" howto.txt

Results

Library/programURIEmailDateURI|Email
rexgrep2.43.00.24.6
egrep1.91.60.615.7

When using the union operator '|' the performance of egrep is much worse compared to its performance without the operarot '|'. REX grep performs consistently with and without the operator '|', we will also show that even with a union of thousands of regular expressions the performance is close to the performance with a single regular expression without using the union operator.

Large number of regular expressions

For this test we will generate a union of thousands of regular expressions using the standard antivirus test string eicar. The union is about 15,000 variations of the string and looks like this:

aaO\!P\%\@AP\[4\\PZX54\(P\^\)7CC\)7\}\$E[I-J]CAR-[S-V]TANDARD-[A-E]NTIVIRUS-TEST-FILE\!\$H\+H\*|
aaO\!P\%\@AP\[4\\PZX54\(P\^\)7CC\)7\}\$E[I-J]CAR-[S-V]TANDARD-[A-E]NTIVIRVS-TEST-FILE\!\$H\+H\*|
aaO\!P\%\@AP\[4\\PZX54\(P\^\)7CC\)7\}\$E[I-J]CAR-[S-V]TANDARD-[A-E]NTIVIRWS-TEST-FILE\!\$H\+H\*|
...
This is the entire eicar-regex.txt file and the eicar-gen.sh script that generated it.

1. Lets add the eicar string to the end of the test file:

# echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' >> howto.txt
2. Next we will compile the large regular expression file eicar-regex.txt to Deterministic Finite Automata (DFA) binary representation, so we can accurately measure only the time for searching.
# rexgrep -f eicar-regex.txt -c -b eicar-regex.bin
3. Use the binary file to search in our test file.
# time rexgrep -b eicar-regex.bin howto.txt
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

real	0m0.193s
user	0m0.160s
sys	0m0.030s

This results in about 200 MB/sec. If you experiment with different number of regular expressions you will notice that the performance doesn't change. For example, the speed stays the same even if the number of regular expressions is doubled to 30,000.