Module Java.util.Regex

Support for working with regular expressions.

This exposes java.util.regex.Pattern as Regex and java.util.regex.MatchResult as MatchResult.

This module must have been imported during compilation or certain language features can not be used (see below).

Frege language support

Regex literals

Regular expression literals have type Regex and are written:

 ´\b(foo|bar)\b´       -- string enclosed in grave accents
 '\w+'                 -- string with length > 1 enclosed in apostrophes

The notation with the apostrophes has been introduced because many have a hard time entering a grave accent mark on their terminal. However, it is not possible to write a regular expressions with length 1 this way, because then the literal gets interpreted as Char literal. (One can write something like '(?:X)' for a Regex that matches a single 'X').

The advantage of using literals rather than compiling string constants to patterns at runtime is

1) they are checked by the Frege compiler, and illegal ones cause compilation to fail

2) at runtime, literals are available as immutable values of type java.util.regex.Pattern, not as strings, as can be seen so often in human written Java code. Thus the cost of repeated regex compilation is avoided.

3) The regular expression can be written directly, like in Perl or other languages, without taking String escape rules into account. That is, to match a $ sign, one writes

 ´\$´

and not

 "\\$"

like in Java.

The Regex values created from the literals have the following flags enabled: Regex.canon_eq, Regex.unicode_case and Regex.unicode_character_class. That is, they do their best - as far as Java supports it - to get processing of Unicode strings right.

Regex literals in pattern matching

A Regex literal is a valid pattern, like any literal:

 case expr1 of
   ´\w+´  →  expr2

This is equivalent to:

 case expr1 of
   s | s ~ ´\w+´ →  expr2

where s is some fresh variable that doesn't appear in expr2.

The match succeeds if the String value resulting from evaluation of expr1 matches the regular expression given. The match diverges when expr1 diverges. In all other cases, the match fails.

Note that in this simple form, the value of expr1 is not available in expr2. One can use an "at" pattern for this:

 case expr1 of
    string@'\w+'  →  expr2

which gets desugared to

 case expr1 of
   string | string ~ '\w+'  →  expr2

leaving string bound to the value of expr1.

The following syntax not only checks if the regular expression matches but also binds the MatchResult:

 case expr1 of
   m~'\w+'  →  expr2

It is equivalent to

 case expr1 of
   s | Just m ← s =~ '\w+'  →  expr2

where s is again a fresh variable not known in expr2, but when the match succeeds m is bound to the MatchResult and can be inspected further in expr2.

And finally, a pattern like

 s@m~'\w+'

works like the previous one but in addition binds s to the string value in question.

Imports

Table of Content

Definitions

data Regex = pure native java.util.regex.Pattern

Regex values are based on Java's java.util.regex.Pattern objects, which are compiled representations of regular expressions.

We call them Regex and not Pattern in Frege, because the word "pattern" and the phrase "pattern matching" have a quite different meaning in deconstruction of algebraic data types.

There are 2 ways to obtain a compiled regular expression:

  1. Use a Regex literal, see the language reference section 2.7.3. All regular expression literals are values of type Regex. It is not possible to run a program which contains a syntactically invalid Regex literal - the compiler will flag it already at compile time. Hence, with Regex literals one is immune against PatternSyntaxExceptions.
  2. Compile a String with Regex.compile or regforce at run-time. Illegal regular expression will cause PatternSyntaxException to be thrown, which, when not handled appropriately, may crash the program.

Regex values are immutable.

Member Functions

canon_eqInt
pure native java.util.regex.Pattern.CANON_EQ

Enables canonical equivalence.

When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match. The expression ´a\u030A´, for example, will match the string "\u00E5" when this flag is specified.

There is no embedded flag character for enabling canonical equivalence.

Specifying this flag may impose a performance penalty.

Frege Regex literals will always be parsed with this flag.

case_insensitiveInt
pure native java.util.regex.Pattern.CASE_INSENSITIVE

Enables case-insensitive matching.

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the Regex.unicode_case flag in conjunction with this flag.

Case-insensitive matching can also be enabled via the embedded flag expression (?i).

Specifying this flag may impose a slight performance penalty.

commentsInt
pure native java.util.regex.Pattern.COMMENTS

Permits whitespace and comments in pattern.

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.

Comments mode can also be enabled via the embedded flag expression (?x).

compileString → (PatternSyntaxException | Regex)
         | StringInt → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile

compile a String to a Regex

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

The overloaded form has an additional Int argument which is a set of flags. Such a set can be constructed by adding up the individual flag values, for example:

  Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)
compileαString → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile  overloads compile

compile a String to a Regex

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

The overloaded form has an additional Int argument which is a set of flags. Such a set can be constructed by adding up the individual flag values, for example:

  Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)
compileβStringInt → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile  overloads compile

compile a String to a Regex

Returns either a PatternSyntaxException in Either.Left or a Regex in Either.Right.

The overloaded form has an additional Int argument which is a set of flags. Such a set can be constructed by adding up the individual flag values, for example:

  Regex.compile "foo|bar" (Regex.canon_eq + Regex.case_insensitive)
dotallInt
pure native java.util.regex.Pattern.DOTALL

Enables dotall mode.

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)

flagsRegexInt
pure native flags

return the flags this pattern was compiled with

literalInt
pure native java.util.regex.Pattern.LITERAL

Enables literal parsing of the pattern.

When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.

The flags Regex.case_insensitive and Regex.unicode_case retain their impact on matching when used in conjunction with this flag. The other flags become superfluous.

There is no embedded flag character for enabling literal parsing.

multilineInt
pure native java.util.regex.Pattern.MULTILINE

Enables multiline mode.

In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.

Multiline mode can also be enabled via the embedded flag expression (?m).

patternRegexString
pure native pattern

return the source of this Regex

splitRegexStringJArray String
pure native split

java.util.regex.Pattern.split method

As this returns a String array, one usually uses Regex.splitted

splittedRegexString → [String]

Split a string around matches of a regular expression and return the result as list of strings.

Empty strings on the start of the string are preserved, but not those on the end of the string, e.g.:

  ´,´.splitted ",,a,b,c,," == ["", "", "a", "b", "c"]
toStringRegexString
pure native toString

Returns the string representation of this pattern.

This is the regular expression from which this pattern was Regex.compiled.

unicode_caseInt
pure native java.util.regex.Pattern.UNICODE_CASE

Enables Unicode-aware case folding.

When this flag is specified then case-insensitive matching, when enabled by the Regex.case_insensitive flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

Unicode-aware case folding can also be enabled via the embedded flag expression (?u).

Specifying this flag may impose a performance penalty.

Because Frege Regex literals will always be parsed with the Regex.unicode_character_class flag, this flag is also implicitly enabled by default.

It can be disabled with the embedded flag expression (?-u).

unicode_character_classInt
pure native java.util.regex.Pattern.UNICODE_CHARACTER_CLASS

Enables the Unicode version of Predefined character classes and POSIX character classes.

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The Regex.unicode_character_class mode can also be enabled via the embedded flag expression (?U).

The flag implies Regex.unicode_case, that is, it enables Unicode-aware case folding.

Specifying this flag may impose a performance penalty.

Frege Regex literals will always be parsed with this flag. It can be disabled with the embedded flag expression (?-U).

unix_linesInt
pure native java.util.regex.Pattern.UNIX_LINES

Enables UNIX lines mode.

In this mode, only the \n line terminator is recognized in the behavior of ., ^, and $.

Unix lines mode can also be enabled via the embedded flag expression (?d).

data PatternSyntaxException = native java.util.regex.PatternSyntaxException

Will be thrown when one compiles a String to a Regex using Regex.compile or regforce

regcompString → (PatternSyntaxException | Regex)
         | StringInt → (PatternSyntaxException | Regex)
pure native java.util.regex.Pattern.compile

Alias for Regex.compile

regforceStringRegex
          | StringIntRegex
pure native java.util.regex.Pattern.compile

compile a String to a Regex

This will throw PatternSyntaxException when the regular expression is illegal.

Use this only if you're sure that pattern compilation will not throw an exception, or if you don't care.

regforceαStringRegex
pure native java.util.regex.Pattern.compile  overloads regforce

compile a String to a Regex

This will throw PatternSyntaxException when the regular expression is illegal.

Use this only if you're sure that pattern compilation will not throw an exception, or if you don't care.

regforceβStringIntRegex
pure native java.util.regex.Pattern.compile  overloads regforce

compile a String to a Regex

This will throw PatternSyntaxException when the regular expression is illegal.

Use this only if you're sure that pattern compilation will not throw an exception, or if you don't care.

class Replace s

this is only used to inject the replace functions into String

Known Instances

StringJ

Member Functions

replaceAllReplace s ⇒ s → Regex → s → s

Like Replace.replaceFirst, but replaces all substrings that match the pattern.

replaceFirstReplace s ⇒ s → Regex → s → s

Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

This method scans the input sequence from the start looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences.

Note that backslashes (\\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

For example

 replaceFirst "zzzdogzzzdogzzz" ´dog´ "cat" == "zzzcatzzzdogzzz"
substituteFirstStringRegexStringString
pure native frege.runtime.Regex9.replaceFirst

Alias for Replace_String.replaceFirst

substituteAllStringRegexStringString
pure native frege.runtime.Regex9.replaceAll

Alias for Replace_String.replaceAll

data MatchResult = pure native java.util.regex.MatchResult

This interface contains query methods used to determine the results of a match against a regular expression.

More: JavaDoc

Member Functions

endMatchResultIntInt
pure native end

Returns the offset after the last character of the subsequence captured by the given group during the previous match operation where group 0 denotes the entire pattern.

If the specified capturing group failed to match, the return value will be -1.

The following property holds:

 (m.group n == Nothing) ==> (m.end n < 0)
groupMatchResultIntMaybe String
 group m n

retrieves the input subsequence captured by the given group index during the previous match operation.

Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression (m.group 0) retrieves that portion of the input string that was matched by the pattern.

If the match was successful but the group specified failed to match any part of the input sequence, then Maybe.Nothing is returned.

Note that some groups, for example (a?), match the empty string. This function will return Just "" when such a group successfully matches the empty string in the input.

The result will also be Maybe.Nothing when the specified group does not exist, that is, when the Int argument is outside the range [0..MatchResult.groupCount m]

The following property holds for a MatchResult m with input sequence s and valid group index g:

 isJust (m.group g) ==> (m.group g) == Just (s.substr (m.start g) (m.end g))
groupCountMatchResultInt
pure native groupCount

Returns the number of capturing groups in this matcher's pattern.

Group zero denotes the entire pattern by convention. It is not included in this count.

Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher.

groupPrimMatchResultInt → (IndexOutOfBoundsException | Maybe String)
pure native group
groupsMaybe MatchResult → [Maybe String]

The list of subsequences captured during the previous match operation, or the empty list if there was no match.

To be used like:

 groups $ s =~ regex

See also =~~

matchMatchResultString
pure native group

MatchResult.match m returns the input subsequence matched by the previous match. The result is undefined if the last match was not successful.

For a MatchResult m with input sequence s, the following holds:

 isJust (m.group 0) ==> unJust (m.group 0) == m.match

Note that some patterns, for example a?, match the empty string. This method will return the empty string when the pattern successfully matches the empty string in the input.

startMatchResultIntInt
pure native start

Returns the start index of the subsequence captured by the given group during the previous match operation where group 0 denotes the entire pattern.

If the specified capturing group failed to match, the return value will be -1.

The following property holds:

 (m.group n == Nothing) ==> (m.start n < 0)
=~StringRegexMaybe MatchResult
pure native frege.runtime.Regex9.findResult
infix  15
 string =~ regex

tries to match string against regex and returns Just matcher if it succeeds, Nothing otherwise.

findAtStringRegexIntMaybe MatchResult
pure native frege.runtime.Regex9.findResult
 findAt string regex offset

Like =~, but matches only the substring of string that starts at offset against regex.

If offset is negative or not smaller than the length of string, the result is Nothing

=~~StringRegex → [Maybe String]
infix  15
 string =~~ regex

tries to match string against regex and returns a list of all captured substrings, see MatchResult.groups.

Example:

 case "today is 2014-02-18" =~~ ´(\d\d\d\d)-(0[1-9]|1[012])-(\d\d)´ of
    [Just date, Just year, Just month, Just day] -> ...
    baddate -> ....
~StringRegexBool
pure native frege.runtime.Regex9.find
infix  15
string ~ regex

true if string matches regex, false otherwise

!~StringRegexBool
infix  15
 s !~ p == !(s ~ p)
~~StringRegexMaybe String
infix  15
 ("string" ~~ ´r..´) == Just "rin"

Tries a match and returns Just x where x is the matched substring or Nothing if there was no match.

~~~StringRegexIntMaybe String
infix  15
 string ~~~ regex

Matches string with regex and returns a function that can be used to extract the matched part of the string and the captured substrings.

   let f = "frege" ~~~ ´(..).(..)´
   in [ f i | i <- 0..3 ]

yields

[Just "frege", Just "fr", Just "ge", Nothing]
~~*StringRegex → [String]
infix  15
 "cats and dogs are not concatenated." ~~* ´cat|dog´

Matches the string with the regex and returns a list of all matches. The example gives:

 ["cat", "dog", "cat"]

Instances

instance Exceptional PatternSyntaxException

Member Functions

javaClassClass PatternSyntaxException
pure native java.util.regex.PatternSyntaxException.class
instance Replace String

Member Functions

replaceAllStringRegexStringString
pure native frege.runtime.Regex9.replaceAll

Like String.replaceFirst, but replaces all substrings that match the pattern.

replaceFirstStringRegexStringString
pure native frege.runtime.Regex9.replaceFirst

Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

This method scans the input sequence from the start looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences.

Note that backslashes (\\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

For example

 "zzzdogzzzdogzzz".replaceFirst  ´dog´ "cat" == "zzzcatzzzdogzzz"

Functions and Values by Type

Maybe MatchResult → [Maybe String]

MatchResult.groups

StringRegexStringString

Replace_String.replaceFirst, Replace_String.replaceAll

StringRegexIntMaybe String

~~~

StringRegexIntMaybe MatchResult

findAt

StringRegexMaybe String

~~

StringRegexMaybe MatchResult

=~

StringRegex → [Maybe String]

=~~

StringRegex → [String]

~~*

StringRegexBool

!~, ~

StringInt → (PatternSyntaxException | Regex)

Regex.compileβ

StringIntRegex

regforceβ

String → (PatternSyntaxException | Regex)

Regex.compileα

StringRegex

regforceα

MatchResultInt → (IndexOutOfBoundsException | Maybe String)

MatchResult.groupPrim

MatchResultIntMaybe String

MatchResult.group

MatchResultIntInt

MatchResult.start, MatchResult.end

MatchResultString

MatchResult.match

MatchResultInt

MatchResult.groupCount

RegexStringJArray String

Regex.split

RegexString → [String]

Regex.splitted

RegexString

Regex.toString, Regex.pattern

RegexInt

Regex.flags

Class PatternSyntaxException

Exceptional_PatternSyntaxException.javaClass

Int

Regex.unix_lines, Regex.unicode_case, Regex.unicode_character_class, Regex.multiline, Regex.dotall, Regex.literal, Regex.comments, Regex.canon_eq, Regex.case_insensitive

Replace s ⇒ s → Regex → s → s

Replace.replaceFirst, Replace.replaceAll

ω

regforce, Regex.compile

Valid HTML 4.01 Strict