jrexx - automaton-based regular expression API for Java
D O C U M E N T A T I O N
REQUIREMENTS
jrexx is implemented in 100% pure Java and requires the
Java 2 Platform, Standard Edition
(J2SE), version 1.3.x or later, on the target computer.
API DOCUMENTATION
For the complete API documentation, please see the
jrexx API documentation pages (Javadoc).
REGULAR EXPRESSION SYNTAX
For the complete jrexx regular expressions syntax please see the
jrexx syntax definition.
QUICKSTART
This section gives a quick introduction to using the jrexx library.
- Installation
Download the jrexx library from our
download section and
add it to the Java CLASSPATH.
- Preparation
Open your Java editor, create a new Java class file and import
com.karneim.util.collection.regex.Pattern, which is the main
class of the jrexx library.
Pattern provides the usual pattern matching functionality via
its contains method. We chose the method name contains
instead of something like matches
because a regular expression describes a set of strings, and matching asks whether the input is an element of that set.
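The set view can be illustrated with the standard java.util.regex package, which is not the jrexx API. In this sketch a string counts as an element of the set only if the whole string matches; a substring search answers a different question. The class name SetMembershipDemo and its method names are made up for the illustration, and the assumption that the jrexx contains method tests the whole input is drawn from the set terminology above, not from the jrexx sources.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT the jrexx API.
class SetMembershipDemo {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // True if the whole input is an element of the set described by [0-9]+.
    static boolean isElement(String input) {
        return DIGITS.matcher(input).matches();
    }

    // True if some substring of the input is an element of that set.
    static boolean hasElementInside(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println("isElement(\"2024\") = " + isElement("2024"));
        System.out.println("isElement(\"year 2024\") = " + isElement("year 2024"));
        System.out.println("hasElementInside(\"year 2024\") = " + hasElementInside("year 2024"));
    }
}
```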
- Usage Example: matching an IP address
import com.karneim.util.collection.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // very simple pattern for IP addresses
        Pattern p = new Pattern("([0-9]{1,3}\\.){3}[0-9]{1,3}");
        String input = "192.168.0.1";
        boolean result = p.contains(input);
        if (result) System.out.println(input + " is an ip address");
        else System.out.println(input + " is NOT an ip address");
    }
}
- Usage Example: matching a special floating point number
Besides pattern matching, jrexx provides set operations on
regular expressions through an extended regular expression syntax.
Problem: write a regular expression for floating point numbers (e.g. 24.2) with a maximum length of 5 characters.
Normally you have to treat three cases separately:
case 1: [0-9]{1,1}\\.[0-9]{1,3}
case 2: [0-9]{1,2}\\.[0-9]{1,2}
case 3: [0-9]{1,3}\\.[0-9]{1,1}
...and join them with an OR: (case1|case2|case3)
For a maximum length of 10 characters, or for an arbitrary length n, writing out every case by hand is tedious and error-prone.
But as we know, the best solution is always the simplest:
with [0-9]+\\.[0-9]+ we describe the set of all floating point numbers,
and with .{1,5} we describe the set of all strings that have a maximum length of five characters.
What we want is the intersection of both, so the solution is (& stands for AND):
([0-9]+\\.[0-9]+)&(.{1,5})
The example code for a floating point number with a maximum length of five characters is therefore:
import com.karneim.util.collection.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // pattern for a floating point number with a maximum of five characters
        Pattern p = new Pattern("([0-9]+\\.[0-9]+)&(.{1,5})");
        String input = "24.2";
        boolean result = p.contains(input);
        if (result) System.out.println(input + " is ok");
        else System.out.println(input + " is NOT ok");
    }
}
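To check that the three hand-written cases and the intersection describe the same set, the following sketch compares both on a few inputs. It uses the standard java.util.regex package, which has no & operator, so the intersection is emulated by requiring that two separate patterns both match the whole input. The class name FloatLengthDemo and its method names are invented for this illustration.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT the jrexx API.
class FloatLengthDemo {
    // The three hand-written cases joined with OR.
    private static final Pattern CASES = Pattern.compile(
        "[0-9]{1,1}\\.[0-9]{1,3}|[0-9]{1,2}\\.[0-9]{1,2}|[0-9]{1,3}\\.[0-9]{1,1}");
    // The two simple sets whose intersection we want.
    private static final Pattern FLOAT = Pattern.compile("[0-9]+\\.[0-9]+");
    private static final Pattern MAX_FIVE = Pattern.compile(".{1,5}");

    static boolean byCases(String s) {
        return CASES.matcher(s).matches();
    }

    // Emulated intersection: s must be an element of both sets.
    static boolean byIntersection(String s) {
        return FLOAT.matcher(s).matches() && MAX_FIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "24.2", "1.234", "123.4", "12.345", "1234.5", "24", ".5" };
        for (String s : inputs) {
            System.out.println(s + ": cases=" + byCases(s)
                + ", intersection=" + byIntersection(s));
        }
    }
}
```

Changing the maximum length only means changing .{1,5} in the intersection variant, whereas the case list has to be rewritten.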
- Usage Example: matching a special email address
Now we introduce the complement functionality of Pattern.
In most cases the complement is used in combination with intersection, in the form A&!B, which means A\B
(in words, "A AND NOT B", that is, "A WITHOUT B").
Problem: write a regular expression for email addresses that excludes the address rm242@web.de.
The solution is: (EMAIL)&(!(RM))
where EMAIL is [A-Za-z0-9]+(\\.[A-Za-z0-9]+)*@[A-Za-z]+(\\.[A-Za-z]+)* and RM is [rR][mM]242@[wW][eE][bB]\\.[dD][eE]
The example code for an email address other than rm242@web.de is:
import com.karneim.util.collection.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // simple pattern for email addresses without the address rm242@web.de
        String email = "[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*@[A-Za-z]+(\\.[A-Za-z]+)*";
        String rm242 = "[rR][mM]242@[wW][eE][bB]\\.[dD][eE]";
        Pattern p = new Pattern("(" + email + ")&(!(" + rm242 + "))");
        String input = "michael.karneim@karneim.com";
        boolean result = p.contains(input);
        if (result) System.out.println(input + " is ok");
        else System.out.println(input + " is NOT ok");
    }
}
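The set difference A\B can also be spelled out by hand, which may help when checking what the combined expression accepts. The sketch below uses the standard java.util.regex package, not jrexx, and emulates A&!B as "matches A and does not match B". The class name EmailDifferenceDemo and the method name accepted are invented for this illustration.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT the jrexx API.
class EmailDifferenceDemo {
    // Set A: simple email addresses.
    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*@[A-Za-z]+(\\.[A-Za-z]+)*");
    // Set B: the one address to exclude, in any letter case.
    private static final Pattern RM242 = Pattern.compile(
        "[rR][mM]242@[wW][eE][bB]\\.[dD][eE]");

    // A AND NOT B: an element of A that is not an element of B.
    static boolean accepted(String s) {
        return EMAIL.matcher(s).matches() && !RM242.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "michael.karneim@karneim.com", "rm242@web.de", "RM242@Web.De" };
        for (String s : inputs) {
            System.out.println(s + (accepted(s) ? " is ok" : " is NOT ok"));
        }
    }
}
```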
Hint: the complement feature is very powerful, but in my experience its use is often not
as easy as it seems.
In many cases A&(!B) was not what I wanted, but rather A&(!(.*B.*)). Thinking in
sets of strings instead of patterns might be helpful.
As soon as possible, I will provide an element iterator for Pattern, as
is common for Java sets.
I think this would be very helpful for checking whether a specific regular
expression describes exactly the set you need.
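The difference between A&(!B) and A&(!(.*B.*)) is easiest to see on a concrete pair of sets. In the sketch below, A is the set of lower-case words and B is the single string "bad"; both choices are made up for the illustration. It uses the standard java.util.regex package, not jrexx, and emulates the complement by negating the result of a whole-string match. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT the jrexx API.
class ComplementHintDemo {
    private static final Pattern A = Pattern.compile("[a-z]+");
    private static final Pattern B = Pattern.compile("bad");
    private static final Pattern CONTAINS_B = Pattern.compile(".*bad.*");

    // A&(!B): removes only the string "bad" itself from A.
    static boolean withoutB(String s) {
        return A.matcher(s).matches() && !B.matcher(s).matches();
    }

    // A&(!(.*B.*)): removes every element of A that contains "bad".
    static boolean withoutAnythingContainingB(String s) {
        return A.matcher(s).matches() && !CONTAINS_B.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "good", "bad", "badly" };
        for (String s : inputs) {
            System.out.println(s + ": A&(!B)=" + withoutB(s)
                + ", A&(!(.*B.*))=" + withoutAnythingContainingB(s));
        }
    }
}
```

The input "badly" is where the two expressions disagree: it is not equal to "bad", so A&(!B) keeps it, while A&(!(.*B.*)) rejects it.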
EXAMPLE PATTERNS
The following list contains some ready-to-use patterns for common
tasks. All patterns can be pasted directly into Java source code: the
escape sequences for special characters (like "." or "\") are
already applied.
- IP Address (exact)
This pattern describes an exact IP address without leading zeros.
Only numbers from 0 to 255 are valid.
new Pattern("([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}");
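The octet expression uses only syntax that the standard java.util.regex package also understands, so it can be tried out without jrexx. The sketch below builds the pattern from one octet sub-expression and requires exactly three dot-prefixed repetitions after the first octet, so that only addresses with four octets are accepted. The class name ExactIpDemo and the method name isIp are invented for this illustration.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT the jrexx API.
class ExactIpDemo {
    // One octet: 0-9, 10-99, 100-199, 200-249 or 250-255, no leading zeros.
    private static final String OCTET =
        "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])";
    // First octet followed by exactly three dot-prefixed octets.
    private static final Pattern IP = Pattern.compile(OCTET + "(\\." + OCTET + "){3}");

    static boolean isIp(String s) {
        return IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "192.168.0.1", "255.255.255.255", "256.1.1.1", "01.2.3.4", "1.2" };
        for (String s : inputs) {
            System.out.println(s + (isIp(s) ? " is an ip address" : " is NOT an ip address"));
        }
    }
}
```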
- Email Address (exact)
This pattern describes a syntactically valid email address.
new Pattern("([^()\\-<> @,;:\"[\\]][^()<> @,;:\"[\\]]*|\"[^()<>@,;:\"[\\]]+\")@(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){1,3}|([A-Za-z](-*[A-Za-z0-9])*(\\.[A-Za-z](-*[A-Za-z0-9])*)*))");
HOW TO SERIALIZE AUTOMATONS
jrexx is able to save/serialize and load/deserialize automatons using the
serializable FSAData class (a kind of value object), which might become "XML-able"
in the future. The FSAData class is meant to
be the foundation for the interchange of automatons between different apps/tools.
For example, jrexx-Lab's 'load' and 'save' functionality works with FSAData, so you
can visualize and test automatons with jrexx-Lab even if your automaton has not
been built with the jrexx API.
There is more than one way to save/serialize an automaton:
- Serializing using jrexx-Lab
Create your regular expression with jrexx-Lab and save it to disk.
- Serializing using PAutomaton
PatternPro pattern = new PatternPro("[0-9]+");
FileOutputStream out = new FileOutputStream(filename);
pattern.getAutomaton().toData(out);
out.close();
- Serializing using ObjectOutputStream
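PatternPro pattern = new PatternPro("[0-9]+");
FSAData data = pattern.getAutomaton().toData();
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(filename));
// writeObject (not write) serializes the FSAData instance
out.writeObject(data);
// closing the ObjectOutputStream flushes its buffer and closes the file
out.close();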
Loading/Deserializing an automaton
- Deserializing using jrexx-Lab
Load a serialized automaton with jrexx-Lab.
- Deserializing using PAutomaton
FileInputStream in = new FileInputStream(filename);
PatternPro pattern = new PatternPro(new PAutomaton(in));
in.close();
- Deserializing using ObjectInputStream
FileInputStream in = new FileInputStream(filename);
// the ObjectInputStream must be created with new before reading
FSAData data = (FSAData) new ObjectInputStream(in).readObject();
in.close();
PatternPro pattern = new PatternPro(new PAutomaton(data));
All results of these serialization examples are compatible with each other,
but differ in the optional information they contain.
Example c) (ObjectOutputStream): only FSAData is serialized.
Example b) (PAutomaton): PAutomaton serializes the FSAData and the regular expression
string.
Example a) (jrexx-Lab): PAutomaton serializes the FSAData and the regular expression
string, and jrexx-Lab appends the graphical positions of the states to the
stream.
Since the regular expression string and the positions are optional, you can
serialize an automaton with jrexx-Lab and deserialize it using
PAutomaton or ObjectInputStream.
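The ObjectOutputStream/ObjectInputStream variant relies only on standard Java serialization, so the round trip can be tried out without jrexx. The sketch below uses a made-up AutomatonData class as a stand-in for a serializable value object such as FSAData. Its two fields are assumptions chosen for the illustration and say nothing about the real FSAData layout. An in-memory byte array replaces the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationRoundTrip {
    // Hypothetical stand-in for a value object like FSAData.
    static class AutomatonData implements Serializable {
        private static final long serialVersionUID = 1L;
        final int stateCount;
        final String regex; // optional information, may be null

        AutomatonData(int stateCount, String regex) {
            this.stateCount = stateCount;
            this.regex = regex;
        }
    }

    static byte[] save(AutomatonData data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(data);
        out.close(); // flushes the buffered object data
        return bytes.toByteArray();
    }

    static AutomatonData load(byte[] bytes) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
        try {
            return (AutomatonData) in.readObject();
        } finally {
            in.close();
        }
    }

    // Convenience wrapper that turns checked exceptions into unchecked ones.
    static AutomatonData roundTrip(AutomatonData data) {
        try {
            return load(save(data));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AutomatonData copy = roundTrip(new AutomatonData(2, "[0-9]+"));
        System.out.println(copy.stateCount + " states, regex " + copy.regex);
    }
}
```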
USING JREXXLITE
jrexxLite is a subset of the jrexx API and a very small library (currently
23KB) for pattern matching. jrexxLite contains the DFASet class, which
provides the same matching functionality as the Pattern class. Both
DFASet and Pattern use a deterministic finite state automaton (DFA), but
Pattern creates a DFA from a given regular expression, whereas DFASet needs
an already created DFA in the form of FSAData (see new feature).
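Why matching against a ready-made DFA needs so little code can be seen from a hand-built example. The sketch below hard-codes a two-state DFA for [0-9]+ as a transition table and walks it once per input character; no regular expression parser or automaton construction is involved. It is a generic illustration of DFA matching, not the DFASet implementation, and the class name TinyDfa, the table layout and the two character classes are assumptions made for this example.

```java
// Generic table-driven DFA matcher; NOT the jrexx DFASet implementation.
class TinyDfa {
    private static final int DEAD = -1;
    private static final int START = 0;

    // TRANSITIONS[state][charClass]; charClass 0 = digit, 1 = anything else.
    // State 0: start (nothing read yet), state 1: one or more digits read.
    private static final int[][] TRANSITIONS = {
        { 1, DEAD },
        { 1, DEAD },
    };
    private static final boolean[] ACCEPTING = { false, true };

    private static int charClass(char c) {
        return (c >= '0' && c <= '9') ? 0 : 1;
    }

    // True if the whole input is accepted by the DFA, i.e. matches [0-9]+.
    static boolean contains(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = TRANSITIONS[state][charClass(input.charAt(i))];
            if (state == DEAD) {
                return false;
            }
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        String[] inputs = { "123", "7", "", "12a" };
        for (String s : inputs) {
            System.out.println("\"" + s + "\" -> " + contains(s));
        }
    }
}
```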
-
Question:
When can I use DFASet instead of Pattern?
Answer:
If you use a fixed regular expression in your code and only need the
matching functionality. Fixed means that it does not change at runtime.
Explanation:
For example, you use a regular expression for email addresses to check whether a
given string is a valid email address. This expression is fixed because you
decide on the expression at compile time.
-
Question: When does it make sense to use DFASet instead of Pattern?
Answer:
- You want to use jrexxLite because of its small size
- You use a huge regular expression
- Your code runs in an environment with little memory, such as a PDA
Explanation:
- You have your own reasons.
- Using Pattern means that the regular expression has to be
compiled/transformed into a DFA
before you can use the matching functionality. Of course this is done
only once, but it can take a while. Alternatively, you can create a PatternPro
and save its automaton to a stream (e.g. to disk). Then you work with
DFASet using the
serialized DFA.
- Using Pattern means that the regular expression has to be
compiled/transformed into a DFA
before you can use the matching functionality. While converting a
regular expression into a DFA, your system could run out of memory
(you will get an OutOfMemoryError).
This does not mean that the resulting DFA is too big for your
runtime environment; it just means that your runtime environment
does not have enough memory for the conversion.
What you can do is create a PatternPro with your regular expression
on a system with enough memory, store its automaton to disk and use
DFASet with the serialized DFA in your
runtime environment.
-
Examples of using DFASet with a serialized automaton
FileInputStream in = new FileInputStream(filename);
DFASet dfa = new DFASet(in);
in.close();
dfa.contains(inputString);

FileInputStream in = new FileInputStream(filename);
FSAData data = (FSAData)new ObjectInputStream(in).readObject();
in.close();
DFASet dfa = new DFASet(data);
dfa.contains(inputString);