SCRIPT MERIDIAN
REGEX PROJECT
 


Character Classes

The ability to match a class, or collection, of characters at a specific point in a target string permits patterns that can match a range of text.

A pattern can specify a class of characters to match in one of three ways:

    the dot metacharacter;
    character classes;
    class shorthands

dot -- the match-any-character class (.)

The dot is a class that matches any single character except the null character '\0'. Because it matches almost any character, it is the most general of all possible character classes.

Example
regex.easyMatch ("c.t","catheter")
   » true
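
For readers who want to experiment outside Frontier, the following is a minimal sketch that uses Python's re module as a stand-in for the regex engine. The helper name easy_match is hypothetical; it assumes that regex.easyMatch reports whether the pattern matches anywhere in the target string. Python's dot also differs in detail: by default it refuses to match a newline rather than the null character.

```python
import re

def easy_match(pattern, text):
    """Hypothetical stand-in for regex.easyMatch: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# The dot stands in for the "a" of "cat" at the start of "catheter".
print(easy_match("c.t", "catheter"))  # True

# The dot must consume exactly one character, so "ct" alone does not match.
print(easy_match("c.t", "ct"))        # False
```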

Character classes ([...])

A character class, also known as a "list" or "bracket expression", is a list of one or more items. The list is defined by the items placed between square brackets, "[...]".

An item in a character class can be either an ordinary character, representing itself, or a metacharacter. However, the metacharacters recognized within a character class differ from those recognized outside of one.

Example
"[abc]" matches either "a" or "b" or "c".

"Defen[sc]e" will match either "Defense" or "Defence"

If you want to include a "]" in a character list, either include it as the first character (eg "[]]"), or escape it using a backslash (eg "[\\]]").
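
The examples above can be tried with the sketch below, which uses Python's re module as a stand-in (an assumption; it is not the Frontier regex engine, although it treats these particular patterns the same way). In a Python raw string a single backslash is enough to escape the "]".

```python
import re

# "[abc]" matches any one of "a", "b" or "c".
print(re.findall("[abc]", "cabbage"))  # ['c', 'a', 'b', 'b', 'a']

# One pattern covers both spellings.
spelling = re.compile("Defen[sc]e")
print(bool(spelling.search("Defense")), bool(spelling.search("Defence")))  # True True

# A "]" placed first in the list, or escaped, is an ordinary character.
print(re.findall("[]]", "a]b"))    # [']']
print(re.findall(r"[\]]", "a]b"))  # [']']
```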

character-class metacharacters

Character classes have their own rules for what are and what aren't metacharacters. Something that is a metacharacter outside of a character class may not be a metacharacter inside a character class.
For example, the dot metacharacter is just a plain dot inside a character class.

- the dash The dash indicates a range of characters. A range is formed by placing a dash between two characters. The range covers every character from the beginning element to the ending element, inclusive, in the ASCII sequence.

Examples

  1. "[a-z]" is equivalent to "[abcdefghijklmnopqrstuvwxyz]"
  2. "[0-9]" is the same as "[0123456789]"
  3. "<H1>[a-zA-Z0-9 ]+</H1>" may match a level 1 heading in HTML code.

Cases when the dash is not a metacharacter inside a character class:
        the dash is the first or last character in the list;
        the dash is the last character in a range;
        the dash is escaped with a backslash "\".
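
A short sketch of ranges and literal dashes, written with Python's re module as a stand-in for the Frontier regex engine (an assumption; the sample strings are made up for illustration). The case of a dash ending a range is left out because Python warns about that spelling.

```python
import re

# A range covers every character between its two endpoints, inclusive.
print(re.findall("[a-z]", "aMz-"))  # ['a', 'z']

# Letters, digits and spaces between the heading tags.
heading = re.compile("<H1>[a-zA-Z0-9 ]+</H1>")
print(bool(heading.search("<H1>Chapter 1</H1>")))  # True

# A dash that is first or last in the list is an ordinary character.
print(re.findall("[-+]", "1-2+3"))  # ['-', '+']
print(re.findall("[+-]", "1-2+3"))  # ['-', '+']

# An escaped dash is ordinary too: this class holds "a", "-" and "z" only.
print(re.findall(r"[a\-z]", "a-m-z"))  # ['a', '-', '-', 'z']
```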

 
^ the caret If the caret is the first element in the list, the character class matches any character that is not in the list.

[^...] classes are known as negated character classes.

Examples

  1. "[^a-z]" matches any character that is not a lower case alphabetical character.
  2. "<!--[^>]+-->" will match simple HTML comments - "[^>]+" matches one or more characters other than ">", so the comment body cannot contain a ">".
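
Negated classes can be explored with the sketch below, which uses Python's re module as a stand-in for the Frontier regex engine (an assumption). The comment pattern is written with the standard HTML comment terminator "-->", and the sample strings are invented for illustration.

```python
import re

# Everything that is not a lowercase ASCII letter.
print(re.findall("[^a-z]", "ab1 C"))  # ['1', ' ', 'C']

# The comment body is one or more characters other than ">".
comment = re.compile("<!--[^>]+-->")
found = comment.search("<p><!-- note --></p>")
print(found.group())  # <!-- note -->

# A ">" inside the comment body prevents a match.
print(comment.search("<!-- a > b -->"))  # None
```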
 
\ the escape The escape allows character class metacharacters to be represented as themselves.

When using an escape in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.

Example
"[,\\-\\]]" matches a comma, a dash, or a closing square bracket.

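The same double escaping shows up in other host languages. The sketch below uses Python's re module as a stand-in for the Frontier regex engine (an assumption) to show that the doubled backslashes in an ordinary string literal reach the engine as single backslashes.

```python
import re

# In a raw string, one backslash per escape reaches the regex engine.
raw = re.compile(r"[,\-\]]")
print(raw.findall("a,b-c]d"))  # [',', '-', ']']

# In an ordinary string literal each backslash must be doubled,
# much like the Frontier example above.
doubled = re.compile("[,\\-\\]]")
print(doubled.pattern == raw.pattern)  # True
```
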
 
[:...:] POSIX bracket expressions A POSIX* bracket expression** contains one of several special class shortcuts. These shortcuts are valid only within character classes.

Examples
regex.easyMatch ("[[:alpha:]]", "Ë")
   » true

regex.easyMatch ("[:alpha:]", "Ë")
   » false - because the pattern is read as an ordinary character class containing ":", "a", "l", "p" and "h", none of which matches "Ë".

The supported POSIX character shortcuts are:

alnum    letters (including diacritical characters) and digits.
alpha    letters (including diacritical characters).
blank    a space or tab.
cntrl    control characters in the ASCII encoding (ie codes less than 32 and code 127).
digit    digits - 0123456789.
graph    same as "print" except omits space.
lower    lowercase letters - including diacritical characters.
print    printable characters (in the ASCII encoding, space through tilde, codes 32 through 126).
punct    neither control nor alphanumeric characters.
space    space, carriage return, newline, tab, and form feed.
upper    uppercase letters - including diacritical characters.
xdigit   hexadecimal digits: "0"-"9", "a"-"f", "A"-"F".
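
Some engines do not understand these shortcuts at all; Python's re module is one of them. The sketch below shows one way to approximate a few of them there by rewriting each "[:name:]" into explicit ranges before compiling. The POSIX_ASCII table and the expand_posix helper are hypothetical, cover only ASCII (so diacritical characters are not included, unlike the list above), and do not check that the shortcut actually sits inside a character class.

```python
import re

# Hypothetical, ASCII-only approximations of some of the shortcuts listed above.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\n\r\f",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Replace each [:name:] with its range text; unknown names are left untouched."""
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: POSIX_ASCII.get(m.group(1), m.group(0)),
        pattern,
    )

print(expand_posix("[[:alpha:][:digit:]]"))  # [a-zA-Z0-9]

# Two hexadecimal digits in a row.
hex_byte = re.compile(expand_posix("[[:xdigit:]][[:xdigit:]]"))
print(bool(hex_byte.fullmatch("7f")))  # True
print(bool(hex_byte.fullmatch("7g")))  # False
```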

 

Class shorthands

Class shorthands are shortcuts for a character class.

When using an escape, "\", in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.

\d   Digit                Match any digit. It is equivalent to "[0-9]".
\D   Non-digit            Match any character that is not a digit. It is equivalent to "[^0-9]".
\s   Whitespace           Match any whitespace character - horizontal tab, line feed, vertical tab, form feed, carriage return and space.
\S   Non-whitespace       Match any character that is not whitespace.
\w   Word character       Match any character that can be part of a word. It is similar to "[a-zA-Z0-9_]" except that it also includes all characters with diacritic marks.
\W   Non-word character   Match any character that cannot be part of a word. It is similar to "[^a-zA-Z0-9_]" except that it also excludes all characters with diacritic marks.
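
The shorthands can be tried with the sketch below, which uses Python's re module as a stand-in for the Frontier regex engine (an assumption). Python's definitions are close but not identical: on text strings its "\d" and "\w" also accept digits and letters from other scripts. The sample strings are invented for illustration.

```python
import re

# \d picks out the digits, \D everything else.
print(re.findall(r"\d", "a1b22"))  # ['1', '2', '2']
print(re.findall(r"\D", "a1b"))    # ['a', 'b']

# \s+ matches a run of whitespace, so it is handy for splitting words.
print(re.split(r"\s+", "one two\tthree\nfour"))  # ['one', 'two', 'three', 'four']

# \w accepts letters (including ones with diacritic marks), digits and "_".
print(bool(re.fullmatch(r"\w+", "café_1")))  # True

# \W matches whatever cannot be part of a word.
print(re.findall(r"\W", "a-b c"))  # ['-', ' ']
```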

* POSIX is short for Portable Operating System Interface, a standard for ensuring portability across operating systems.

** Actually, a POSIX bracket expression is what we call a character class, and POSIX uses the term "character class" for the metasequences inside a bracket expression. We'll stick with the standard regular expression nomenclature.




Send questions and comments to regex@lists.scriptmeridian.org.
Page last updated: Thu, 10 Dec 1998 22:01:22 GMT.