The ability to match a class, or collection, of characters at a specific point in a target string permits patterns that can match a range of text.
A class of characters to be matched can be specified through one of three methods:
the dot metacharacter;
character classes;
class shorthands.
dot -- the match-any-character class (.)
A class that matches any character except the null character '\0'. Since it matches almost any character, it is the most general of all possible character classes.
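As an illustration only, the sketch below uses Python's `re` module as a stand-in for Frontier's regex verbs; that substitution is an assumption, and the two engines differ in detail. In particular, Python's dot skips the newline character by default, and the null-character behaviour described above is not exercised here.

```python
import re

# "c.t" requires exactly one character, of almost any kind, between "c" and "t".
pattern = re.compile(r"c.t")

# fullmatch() succeeds only if the whole candidate string fits the pattern.
candidates = ["cat", "c9t", "c t", "ct", "caat"]
dot_results = {s: bool(pattern.fullmatch(s)) for s in candidates}

print(dot_results)
```

The dot stands for one character, so "ct" (nothing between the letters) and "caat" (two characters between them) are rejected.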
A character class, also known as a "list" or "bracket expression", is a list of one or more items enclosed in square brackets, "[...]".
An item in a character class can be either an ordinary character, representing itself, or a metacharacter. However, metacharacters are defined differently inside a character class than outside one.
"[abc]" matches either "a" or "b" or "c".
"Defen[sc]e" will match either "Defense" or "Defence"
If you want to include a "]" in a character list, either place it first in the list (eg "[]abc]"), or escape it using a backslash (eg "[\\]]").
Character classes have their own rules for what are and what aren't metacharacters. Something that is a metacharacter outside of a character class may not be a metacharacter inside a character class.
For example, the dot metacharacter is just a plain dot inside a character class.
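The basic rules above can be tried out with a short sketch. It assumes Python's `re` module rather than Frontier's own verbs, so the calls are illustrative. Python's raw strings (`r"..."`) play the role that doubled backslashes play in a Frontier pattern string.

```python
import re

# A class matches exactly one character drawn from its list.
abc_hits = re.findall(r"[abc]", "a big cab")

# One class in the middle of a literal accepts both spellings.
spelling = re.compile(r"Defen[sc]e")
words = ["Defense", "Defence", "Defenze"]
spelling_hits = [w for w in words if spelling.fullmatch(w)]

# Inside brackets the dot is an ordinary character; outside it is a metacharacter.
dot_in_class = re.findall(r"[.]", "a.b")
dot_outside = re.findall(r".", "a.b")

# "]" is taken literally when it comes first in the list or is escaped.
bracket_first = re.findall(r"[]a]", "[a]")
bracket_escaped = re.findall(r"[\]a]", "[a]")

# In an ordinary (non-raw) Python string the backslash must be doubled,
# which is analogous to the doubled backslash in the examples above.
same_pattern = "[\\]a]" == r"[\]a]"

print(abc_hits, spelling_hits, dot_in_class, dot_outside)
print(bracket_first, bracket_escaped, same_pattern)
```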
|-|The dash indicates a range of characters. A range is formed by placing a dash between two characters. The range covers every character between the beginning and ending elements in the ASCII sequence.|
- "[a-z]" is equivalent to "[abcdefghijklmnopqrstuvwxyz]"
- "[0-9]" is the same as "[0123456789]"
- "<H1>[a-zA-Z0-9 ]+</H1>" may match a level 1 heading in HTML code.
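The range examples can be checked with a small sketch, again assuming Python's `re` module as a stand-in engine. The sample strings are made up for illustration.

```python
import re
import string

sample = "Regex 101: a-z!"

# The range form and the spelled-out list select the same characters.
range_hits = re.findall(r"[a-z]", sample)
explicit_hits = re.findall("[" + string.ascii_lowercase + "]", sample)

# The heading pattern accepts only letters, digits and spaces between the tags.
heading = re.compile(r"<H1>[a-zA-Z0-9 ]+</H1>")
heading_ok = heading.search("<H1>Chapter 1</H1>") is not None
heading_rejected = heading.search("<H1>Q&A</H1>") is None

print(range_hits, explicit_hits, heading_ok, heading_rejected)
```

The second heading is rejected because "&" is not in the list, which is why the text above says the pattern "may" match a heading rather than that it always will.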
Cases when the dash is not a metacharacter inside a character class:
the dash is the first or last character in the list;
the dash is the last character in a range;
the dash is escaped with a backslash "\".
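A minimal sketch of the literal-dash cases, assuming Python's `re` module behaves like the engine described here for these simple patterns. Only the first, last and escaped positions are shown.

```python
import re

text = "a-z"

# Dash first or last in the list: it is an ordinary character.
dash_first = re.findall(r"[-a]", text)
dash_last = re.findall(r"[a-]", text)

# Escaped dash: "a", "-" and "z" are three separate items.
dash_escaped = re.findall(r"[a\-z]", text)

# Unescaped dash between two characters: a range, so "-" itself is not matched.
dash_range = re.findall(r"[a-z]", text)

print(dash_first, dash_last, dash_escaped, dash_range)
```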
|^|If the caret is the first element in the list, the character class matches any character that is not in the list.|
[^...] classes are known as negated character classes.
- "[^a-z]" matches any character that is not a lower case alphabetical character.
- "<!--[^>]+-->" will match HTML comments that contain no ">" - "[^>]+" means match one or more characters up until a ">" occurs.
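Negated classes can be sketched as follows, assuming Python's `re` module. The comment pattern here uses the standard HTML closing delimiter "-->", and the sample markup is invented for the example.

```python
import re

# Everything that is not a lowercase ASCII letter, one character at a time.
not_lower = re.findall(r"[^a-z]", "abc DEF 123")

# "[^>]+" consumes characters up to the first ">", then backtracks
# until the literal "-->" fits.
comment = re.compile(r"<!--[^>]+-->")
match = comment.search("<p>text</p><!-- a note --><p>more</p>")
comment_text = match.group(0) if match else None

# A comment containing ">" is not matched, because "[^>]+" cannot cross it.
missed = comment.search("<!-- a > b -->") is None

print(not_lower, comment_text, missed)
```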
|\|The escape allows character class metacharacters to be represented as themselves.|
When using an escape in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
"[,\\-\\]]" matches a comma, a dash, or a closing square bracket.
A POSIX* bracket expression** contains one of several special class shortcuts.
These character shortcuts are only valid within character classes.
regex.easyMatch ("[[:alpha:]]", "Ë")
» true - because the shortcut is used inside a character class and "Ë" is a letter.
regex.easyMatch ("[:alpha:]", "Ë")
» false - because it attempts to match the class ":", "a", "l", "p" and "h" against "Ë".
The supported POSIX character shortcuts are:
|[:alnum:]|letters (including diacritical characters) and digits.|
|[:alpha:]|letters (including diacritical characters).|
|[:blank:]|a space or tab.|
|[:cntrl:]|control characters (in the ASCII encoding, codes less than 32 and code 127).|
|[:digit:]|digits - 0123456789.|
|[:graph:]|same as "print" except omits space.|
|[:lower:]|lowercase letters - including diacritical characters.|
|[:print:]|printable characters (in the ASCII encoding, space through tilde, codes 32 through 126).|
|[:punct:]|printable characters that are neither alphanumeric nor a space.|
|[:space:]|space, carriage return, newline, tab, and form feed.|
|[:upper:]|uppercase letters - including diacritical characters.|
|[:xdigit:]|hexadecimal digits: "0"-"9", "a"-"f", "A"-"F".|
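Python's `re` module does not implement POSIX shortcuts, so the sketch below only spells out ASCII-only equivalents for a few of them as ordinary character classes, following the code ranges given in the descriptions above. The table `ASCII_POSIX` and the helper `posix_class` are hypothetical names introduced for this illustration. The shortcut names used as keys are the standard POSIX ones. Diacritical characters are deliberately left out.

```python
import re

# Hypothetical lookup table: ASCII-only class bodies for some POSIX shortcuts.
ASCII_POSIX = {
    "blank": r" \t",            # a space or tab
    "cntrl": r"\x00-\x1f\x7f",  # codes less than 32, and code 127
    "digit": r"0-9",
    "graph": r"\x21-\x7e",      # "print" without the space (code 32)
    "print": r"\x20-\x7e",      # space through tilde
    "xdigit": r"0-9A-Fa-f",
}

def posix_class(name):
    """Compile an ordinary character class equivalent to the named shortcut."""
    return re.compile("[" + ASCII_POSIX[name] + "]")

xdigit_hits = posix_class("xdigit").findall("0x1F, g")
space_is_print = bool(posix_class("print").fullmatch(" "))
space_is_graph = bool(posix_class("graph").fullmatch(" "))
del_is_cntrl = bool(posix_class("cntrl").fullmatch("\x7f"))
tab_is_blank = bool(posix_class("blank").fullmatch("\t"))

print(xdigit_hits, space_is_print, space_is_graph, del_is_cntrl, tab_is_blank)
```

The space character shows the difference between "print" and "graph": it belongs to the first but not the second.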
Class shorthands are shortcuts for a character class.
When using an escape, "\", in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
|\d|Match any digit. It is equivalent to "[0-9]".|
|\D|Match any character that is not a digit. It is equivalent to "[^0-9]".|
|\s|Match any whitespace character - horizontal tab, line feed, vertical tab, form feed, carriage return and space.|
|\S|Match any character that is not whitespace.|
|\w|Match any character that can be part of a word. It is similar to "[a-zA-Z0-9_]" except that it also includes all characters with diacritic marks.|
|\W|Match any character that cannot be part of a word. It is similar to "[^a-zA-Z0-9_]" except that it also excludes all characters with diacritic marks.|
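The shorthands can be tried with the sketch below, which assumes Python's `re` module and its conventional spellings `\d`, `\D`, `\s`, `\S`, `\w` and `\W`. Python's definitions are Unicode-aware for text strings, so they may not coincide exactly with the engine described here. The first line illustrates the doubled-backslash point using an ordinary Python string literal.

```python
import re

# In an ordinary string literal the backslash is doubled; a raw string avoids that.
same_shorthand = "\\d" == r"\d"

# Sample: E with diaeresis, underscore, digit, space, dash, tab.
sample = "\u00cb_7 -\t"

digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"\S", sample)
word_chars = re.findall(r"\w", sample)
non_word = re.findall(r"\W", sample)

print(same_shorthand, digits, whitespace, word_chars, non_word)
```

Each uppercase shorthand is the complement of its lowercase partner, so every character of the sample lands in exactly one list of each pair.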
* POSIX is short for Portable Operating System Interface, a standard for ensuring portability across operating systems.
** Actually, a POSIX bracket expression is what we call a character class, and POSIX uses the term "character class" for the metasequences inside a bracket expression. We'll stick with the standard regular expression nomenclature.