tango.text.Regex

License:

BSD style: see license.txt

Version:

Initial release: Jan 2008

Authors:

Jascha Wetzel

This is a regular expression compiler and interpreter based on the Tagged NFA/DFA method.

The Regex class is not thread-safe.

See Wikipedia's article on regular expressions for details on regular expressions in general.

The method used implies that the expressions are regular in the sense of formal language theory, as opposed to what "regular expression" means in most implementations (e.g. PCRE or the standard libraries of Perl, Java or Python). The advantage of this method is its performance; its disadvantage is the inability to support some features of Perl-like regular expressions (e.g. back-references). See "Regular Expression Matching Can Be Simple And Fast" for details on the differences.

The time for matching a regular expression against an input string of length N is in O(M*N), where M depends on the number of matching brackets and the complexity of the expression. That is, M is constant with respect to the input, and therefore matching is a linear-time process.

The syntax of regular expressions is as follows. X and Y stand for arbitrary regular expressions.

Operators
X|Y alternation, i.e. X or Y
(X) matching brackets - creates a sub-match
(?X) non-matching brackets - only groups X, no sub-match is created
[Z] character class specification, Z is a string of characters or character ranges, e.g. [a-zA-Z0-9_.\-]
[^Z] negated character class specification
<X lookbehind, X may be a single character or a character class
>X lookahead, X may be a single character or a character class
^ start of input or start of line
$ end of input or end of line
\b start or end of word, equals (?<\s>\S|<\S>\s)
\B opposite of \b, equals (?<\S>\S|<\s>\s)

Quantifiers
X? zero or one
X* zero or more
X+ one or more
X{n,m} at least n, at most m instances of X. If n is missing, it is set to 0; if m is missing, it is set to infinity.
X?? non-greedy version of X?
X*? non-greedy version of X*
X+? non-greedy version of X+
X{n,m}? non-greedy version of X{n,m}

Pre-defined character classes
. any printable character
\s whitespace
\S non-whitespace
\w alpha-numeric characters or underscore
\W opposite of \w
\d digits
\D non-digit

Note that "alphanumeric" only applies to Latin-1.
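
As a rough illustration of the syntax above, the following sketch combines non-matching brackets, a bounded quantifier and a pre-defined character class. The pattern and the input string are made up for this example; only the opCall, test and match members documented below are used.

```d
import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    // (?ab|cd)+  non-matching brackets: groups the alternation, no sub-match
    // (\d{2,4})  matching brackets: sub-match #1, two to four digits
    auto r = Regex(r"(?ab|cd)+(\d{2,4})");

    if (r.test("xxabcd2024"))
    {
        Stdout.formatln("whole match: {}", r.match(0));
        Stdout.formatln("sub-match 1: {}", r.match(1));
    }
}
```

Because (?X) creates no sub-match, the digits are expected to be available as bracket #1 rather than #2.
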
class RegExpT(char_t)
Regular expression compiler and interpreter.
this(const(char_t)[] pattern, const(char_t)[] attributes = null)
this(const(char_t)[] pattern, bool swapMBS, bool unanchored, bool printNFA = false)
Construct a RegExpT object.

Parameters:

pattern Regular expression.

Throws:

RegExpException if there are any compilation errors.

Example:

Declare two variables and assign a Regex object to each:
auto r = new Regex("pattern");
auto s = new Regex(r"p[1-5]\s*");
RegExpT!(char_t) opCall(const(char_t)[] pattern, const(char_t)[] attributes = null) [static]
Generate instance of Regex.

Parameters:

pattern Regular expression.

Throws:

RegExpException if there are any compilation errors.

Example:

Declare two variables and assign a Regex object to each:
auto r = Regex("pattern");
auto s = Regex(r"p[1-5]\s*");
RegExpT!(char_t) search(const(char_t)[] input) [public]
int opApply(int delegate(ref RegExpT!(char_t)) dg) [public]
Set up for start of foreach loop.

Returns:

Instance of RegExpT set up to search input.

Example:

import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    foreach(m; Regex("ab").search("qwerabcabcababqwer"))
        Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
}
// Prints:
// qwer[ab]cabcababqwer
// qwerabc[ab]cababqwer
// qwerabcabc[ab]abqwer
// qwerabcabcab[ab]qwer
bool test(const(char_t)[] input)
Search input for match.

Returns:

false for no match, true for match
bool test()
Pick up where last test(input) or test() left off, and search again.

Returns:

false for no match, true for match
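
The two test overloads can be combined to walk through all matches without foreach. The following is a minimal sketch with a made-up pattern and input; it relies only on the behaviour described above, namely that test() continues after the previous match.

```d
import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    auto r = Regex(r"\d+");

    // test(input) starts a new search; test() resumes after the last match.
    for (bool found = r.test("a1b22c333"); found; found = r.test())
        Stdout.formatln("{}", r.match(0));
}
```
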
const(char_t)[] match(uint index)
const(char_t)[] opIndex(uint index)
Return submatch with the given index.

Parameters:

index 0 returns the whole match, index > 0 returns the submatch of bracket #index

Returns:

Slice of input for the requested submatch, or null if no such submatch exists.
const(char_t)[] pre()
Return the slice of the input that precedes the matched substring. If no match was found, null is returned.
const(char_t)[] post()
Return the slice of the input that follows the matched substring. If no match was found, the whole slice of the input that was processed in the last test is returned.
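
A short sketch of how match, opIndex, pre and post fit together after a successful test. The pattern and the input are hypothetical and chosen only to give two brackets and some surrounding text.

```d
import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    auto r = Regex(r"(\w+)@(\w+)");

    if (r.test("mail to user@example now"))
    {
        Stdout.formatln("pre:   '{}'", r.pre);       // input before the match
        Stdout.formatln("match: '{}'", r[0]);        // same as r.match(0)
        Stdout.formatln("name:  '{}'", r[1]);        // bracket #1
        Stdout.formatln("host:  '{}'", r.match(2));  // bracket #2
        Stdout.formatln("post:  '{}'", r.post);      // input after the match
    }
}
```

All five values are slices of the input, so they remain valid only as long as the input itself does.
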
inout(char_t)[][] split(inout(char_t)[] input)
Splits the input at the matches of this regular expression into an array of slices.

Example:

import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    auto strs = Regex("ab").split("abcabcababqwer");
    foreach( s; strs )
        Stdout.formatln("{}", s);
}
// Prints:
// c
// c
// qwer
char_t[] replaceAll(const(char_t)[] input, const(char_t)[] replacement, char_t[] output_buffer = null)
Returns a copy of the input with all matches replaced by replacement.
char_t[] replaceLast(const(char_t)[] input, const(char_t)[] replacement, char_t[] output_buffer = null)
Returns a copy of the input with the last match replaced by replacement.
char_t[] replaceFirst(const(char_t)[] input, const(char_t)[] replacement, char_t[] output_buffer = null)
Returns a copy of the input with the first match replaced by replacement.
char_t[] replaceAll(const(char_t)[] input, char_t[] delegate(RegExpT!(char_t)) dg, char_t[] output_buffer = null)
Calls dg for each match and replaces it with dg's return value.
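
The replace functions can be compared side by side in a small sketch. The pattern, the input and the "#" replacement are made up for illustration. The delegate is written with an explicit char[] return type and a .dup because the signature above asks for a mutable array; that dg sees the regex positioned at the current match is an assumption based on its parameter type.

```d
import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    auto r = Regex(r"\d+");
    auto input = "a1b22c333";

    Stdout.formatln("{}", r.replaceFirst(input, "#"));
    Stdout.formatln("{}", r.replaceLast(input, "#"));
    Stdout.formatln("{}", r.replaceAll(input, "#"));

    // Delegate form: build the replacement from the current match
    // (assumed to be reachable through m.match(0)).
    auto wrapped = r.replaceAll(input, delegate char[](Regex m) {
        return ("[" ~ m.match(0) ~ "]").dup;
    });
    Stdout.formatln("{}", wrapped);
}
```
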
const(char)[] compileToD(const(char)[] func_name = "match", bool lexer = false)
Compiles the regular expression to D code.
NOTE: Remember to import this module (tango.text.Regex) in the module where you put the generated D code.
const(char_t)[] pattern() [public]
Get the pattern with which this regex was constructed.
uint tagCount()
Get the tag count of this regex, representing the number of sub-matches.
This value is the maximum valid index for match/opIndex.
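
Since tagCount is the highest valid index, it can serve as the inclusive upper bound when listing every submatch. A minimal sketch, with a hypothetical three-bracket pattern and input:

```d
import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    auto r = Regex(r"(\d+)-(\d+)-(\d+)");

    if (r.test("date: 2008-01-15"))
    {
        // Index 0 is the whole match; 1..tagCount are the bracket submatches.
        for (uint i = 0; i <= r.tagCount; ++i)
            Stdout.formatln("{}: {}", i, r.match(i));
    }
}
```
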