Character Classes
The ability to match a class, or collection, of characters at a specific point in a target string permits patterns that can match a range of text.
A pattern can offer a set of candidate characters at a given position in one of three ways:
the dot metacharacter;
character classes;
class shorthands.
dot -- the match-any-character class (.)
The dot matches any single character except the null character '\0'. Because it matches almost any character, it is the most general of all possible character classes.
Example
regex.easyMatch ("c.t","catheter")
» true
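For readers without Frontier at hand, the same pattern can be tried in Python's re module. This is only a sketch in a different engine: re.search is assumed to be the closest counterpart to regex.easyMatch, and Python's dot excludes the newline by default rather than the null character, so edge cases may differ.

```python
import re

# "c.t" needs a "c", then any single character, then a "t".
# re.search looks anywhere in the target, so the "cat" in "catheter" qualifies.
found = re.search(r"c.t", "catheter")
print(found.group())  # cat

# The dot must consume exactly one character, so "ct" on its own does not match.
too_short = re.search(r"c.t", "ct")
print(too_short)  # None
```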
A character class, also known as a "list" or a "bracket expression", is a list of one or more items enclosed in square brackets, "[...]". The class matches exactly one character from that list.
An item in a character class can be either an ordinary character, which represents itself, or a metacharacter. However, metacharacters are defined differently inside a character class than outside one.
Example
"[abc]" matches either "a" or "b" or "c".
"Defen[sc]e" will match either "Defense" or "Defence"
If you want to include a "]" in a character list, either include it as the first character (e.g. "[]]"), or escape it using a backslash (e.g. "[\\]]").
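The character-class rules above can be sketched in Python's re module, used here as a stand-in for Frontier's regex engine. Python raw strings need only a single backslash, so the doubled backslash required in Frontier strings does not appear. The sample words are made up for illustration.

```python
import re

# [abc] matches exactly one character: "a", "b" or "c".
singles = re.findall(r"[abc]", "cabbage")
print(singles)  # ['c', 'a', 'b', 'b', 'a']

# Defen[sc]e accepts either spelling, but nothing else in that position.
words = ["Defense", "Defence", "Defenze"]
accepted = [w for w in words if re.fullmatch(r"Defen[sc]e", w)]
print(accepted)  # ['Defense', 'Defence']

# A "]" is literal when it is the first item in the list, or when escaped.
first_item = re.findall(r"[]]", "a]b")
escaped = re.findall(r"[\]]", "a]b")
print(first_item, escaped)  # [']'] [']']
```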
character-class metacharacters
Character classes have their own rules for what are and what aren't metacharacters. Something that is a metacharacter outside of a character class may not be a metacharacter inside a character class.
For example, the dot metacharacter is just a plain dot inside a character class.
| - |
the dash |
The dash indicates a range of characters. A range is formed by placing a dash between two characters. The range covers every character from the first to the last, inclusive, in ASCII order.
Examples
- "[a-z]" is equivalent to "[abcdefghijklmnopqrstuvwxyz]"
- "[0-9]" is the same as "[0123456789]"
- "<H1>[a-zA-Z0-9 ]+</H1>" may match a level 1 heading in HTML code.
Cases when the dash is not a metacharacter inside a character class:
the dash is the first or last character in the list;
the dash is the last character in a range;
the dash is escaped with a backslash "\".
|
| |
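The dash rules can be checked with a short Python sketch. Python's re module is only a stand-in for Frontier's engine here, and the sample strings are invented for illustration.

```python
import re

# Inside a class, a dash between two characters forms a range.
lowercase = re.findall(r"[a-z]", "R2-d2")
digits = re.findall(r"[0-9]", "R2-d2")
print(lowercase, digits)  # ['d'] ['2', '2']

# Ranges can be combined in one class, as in the heading pattern.
heading = re.search(r"<H1>[a-zA-Z0-9 ]+</H1>", "<p><H1>Chapter 1</H1></p>")
print(heading.group())  # <H1>Chapter 1</H1>

# The dash is an ordinary character when it is first, last, or escaped.
dash_first = re.findall(r"[-a]", "a-b")
dash_last = re.findall(r"[a-]", "a-b")
dash_escaped = re.findall(r"[a\-b]", "a-b")
print(dash_first, dash_last, dash_escaped)
# ['a', '-'] ['a', '-'] ['a', '-', 'b']
```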
| ^ |
the caret |
If the caret is the first element in the list, the character class matches any character that is not in the list.
[^...] classes are known as negated character classes.
Examples
- "[^a-z]" matches any character that is not a lowercase letter.
- "<!--[^>]+-->" will match HTML comments that contain no ">": "[^>]+" matches one or more characters up to, but not including, the next ">".
|
| |
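A negated class can be tried out in Python's re module, which serves only as a stand-in for Frontier's engine. The sample strings are assumptions chosen for illustration.

```python
import re

# [^a-z] matches each character that is NOT a lowercase ASCII letter.
not_lower = re.findall(r"[^a-z]", "abc-123")
print(not_lower)  # ['-', '1', '2', '3']

# [^>]+ consumes everything up to, but not including, the next ">".
up_to_bracket = re.match(r"[^>]+", "img src=x>tail").group()
print(up_to_bracket)  # img src=x

# A caret that is not the first item in the list is an ordinary character.
literal_caret = re.findall(r"[a^]", "a^b")
print(literal_caret)  # ['a', '^']
```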
| \ |
the escape |
The escape allows character class metacharacters to be represented as themselves.
When using an escape in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
Example
"[,\\-\\]]" matches a comma, a dash, or a closing square bracket.
|
| |
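The two layers of escaping can be illustrated in Python, where an ordinary (non-raw) string literal consumes one level of backslashes before the regex engine sees the pattern, much as the text describes for Frontier. This is a sketch in a different engine, not Frontier's own behaviour.

```python
import re

# In an ordinary string literal each backslash must be doubled,
# so both spellings below hand the same pattern to the engine: [,\-\]]
doubled = "[,\\-\\]]"
raw = r"[,\-\]]"
print(doubled == raw)  # True

# The class matches a comma, a dash, or a closing square bracket.
separators = re.findall(doubled, "a,b-c]d")
print(separators)  # [',', '-', ']']
```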
| [:...:] |
POSIX bracket expressions |
A POSIX* bracket expression** contains one of several special class shortcuts.
These shortcuts are only valid within character classes.
Examples
regex.easyMatch ("[[:alpha:]]", "Ë")
» true
regex.easyMatch ("[:alpha:]", "Ë")
» false - because it attempts to match the class ":", "a", "l", "p" and "h" against "Ë".
The supported POSIX character shortcuts are:
| alnum |
letters (including diacritical characters) and digits. |
| alpha |
letters (including diacritical characters). |
| blank |
a space or tab. |
| cntrl |
control characters in the ASCII encoding (i.e. codes less than 32 and code 127). |
| digit |
digits - 0123456789. |
| graph |
same as "print" except omits space. |
| lower |
lowercase letters - including diacritical characters. |
| print |
printable characters (in the ASCII encoding, space through tilde: codes 32 through 126). |
| punct |
neither control nor alphanumeric characters. |
| space |
space, carriage return, newline, tab, and form feed. |
| upper |
uppercase letters - including diacritical characters. |
| xdigit |
hexadecimal digits: "0"-"9", "a"-"f", "A"-"F". |
|
| |
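Python's re module does not support POSIX bracket expressions, so they cannot be demonstrated there directly. As a rough sketch, a few of the shortcuts listed above can be spelled out as explicit ASCII-only classes following the descriptions in the table. The POSIX_ASCII table and the in_class helper are hypothetical names introduced for illustration, and the classes that include diacritical characters (alpha, alnum, lower, upper) are deliberately left out.

```python
import re

# Hypothetical lookup table: POSIX shortcut name -> explicit ASCII class.
POSIX_ASCII = {
    "digit": r"[0-9]",
    "xdigit": r"[0-9a-fA-F]",
    "blank": r"[ \t]",
    "cntrl": r"[\x00-\x1f\x7f]",  # codes below 32, plus code 127
    "print": r"[\x20-\x7e]",      # space through tilde
    "graph": r"[\x21-\x7e]",      # same as print, without the space
}

def in_class(name, char):
    """Return True if the single character `char` belongs to the named class."""
    return re.fullmatch(POSIX_ASCII[name], char) is not None

print(in_class("xdigit", "F"))  # True
print(in_class("xdigit", "g"))  # False
print(in_class("print", " "))   # True
print(in_class("graph", " "))   # False
```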
Class shorthands are shortcuts for a character class.
When using an escape, "\", in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
| \d |
Digit |
Match any digit. It is equivalent to "[0-9]" |
| \D |
Non-digit |
Match any character that is not a digit. It is equivalent to "[^0-9]" |
| \s |
Whitespace |
Match any whitespace character - horizontal tab, line feed, vertical tab, form feed, carriage return and space. |
| \S |
Non-whitespace |
Match any character that is not whitespace. |
| \w |
Word character |
Match any character that can be part of a word. It is similar to "[a-zA-Z0-9_]" except that it also includes all characters with diacritic marks. |
| \W |
Non-word character |
Match any character that cannot be part of a word. It is similar to "[^a-zA-Z0-9_]" except that it also excludes all characters with diacritic marks. |
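The shorthands can be tried in Python's re module, again only as a stand-in for Frontier's engine. One assumption to keep in mind: on Python 3 text strings "\d" and "\w" are Unicode-aware, so "\w" accepts letters with diacritics as described above, while "\d" may accept digits beyond 0-9. The sample text is invented for illustration.

```python
import re

text = "Room 42,\tcafé_1"

digits = re.findall(r"\d", text)         # each digit
digits_only = re.sub(r"\D", "", text)    # drop every non-digit
whitespace = re.findall(r"\s", text)     # the space and the tab
words = re.findall(r"\w+", text)         # runs of word characters
non_word = re.findall(r"\W", text)       # everything else

print(digits)       # ['4', '2', '1']
print(digits_only)  # 421
print(whitespace)   # [' ', '\t']
print(words)        # ['Room', '42', 'café_1']
print(non_word)     # [' ', ',', '\t']
```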
* POSIX is short for Portable Operating System Interface, a standard for ensuring portability across operating systems.
** Actually, a POSIX bracket expression is what we call a character class, and POSIX uses the term "character class" for the metasequences inside a bracket expression. We'll stick with the standard regular expression nomenclature.