Check&Get Web Change Monitor Online Help

Regular Expression Syntax Overview

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use them because they can look confusing and complicated. However, with a little practice, it's pretty easy to write the expressions to make the advanced web filters in Check&Get.

We have prepared some easy examples to help you understand regular expressions and start making your own web filters.

This document describes the formal regular expression syntax used in Check&Get.

Table of Contents

  1. Regular Expression Syntax
  2. Order of Precedence
  3. Character Matching
  4. Bracket Expressions
  5. Quantifiers and Associated Meanings
  6. Anchors
  7. Alternation and Grouping

 

Regular Expression Syntax

A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern to the string being searched.

Here are some examples of regular expressions you might encounter:

Regular Expression    Matches
^\s*$                 Match a blank line.
\d{2}-\d{5}           Validate an ID number consisting of 2 digits, a hyphen, and another 5 digits.
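As a minimal sketch, the two patterns above can be tried outside Check&Get. Python's re module is used here only as a stand-in (an assumption, since this page does not show Check&Get's own engine), and the helper names is_blank and is_valid_id are made up for illustration.

```python
import re

# Patterns copied from the examples above.
BLANK_LINE = re.compile(r"^\s*$")
ID_NUMBER = re.compile(r"\d{2}-\d{5}")

def is_blank(line: str) -> bool:
    """Return True when the line is empty or holds only whitespace."""
    return BLANK_LINE.search(line) is not None

def is_valid_id(text: str) -> bool:
    """Return True when the whole text is 2 digits, a hyphen, 5 digits."""
    return ID_NUMBER.fullmatch(text) is not None

print(is_blank("   \t"))         # True
print(is_blank("  x "))          # False
print(is_valid_id("12-34567"))   # True
print(is_valid_id("123-4567"))   # False
```

fullmatch is used for the ID check so that extra characters before or after the ID are rejected; the bare pattern would also be found inside a longer string.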

 

The following table contains the complete list of metacharacters and their behavior in the context of regular expressions:

 

Character

Description

\

Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".

^

Matches the position at the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position following '\n' or '\r'.

$

Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'.

*

Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.

+

Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}.

{n}

n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food".

{n,}

n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}

m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.

?

When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all 'o's.

.

Matches any single character except "\n". To match any character including the '\n', use a pattern such as '[\s\S]'.

(pattern)

Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match parentheses characters ( ), use '\(' or '\)'.

(?:pattern)

Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.

(?=pattern)

Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.

(?!pattern)

Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.

x|y

Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".

[xyz]

A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".

[^xyz]

A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".

[a-z]

A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'.

[^a-z]

A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.

\b

Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B

Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".

\cx

Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character.

\d

Matches a digit character. Equivalent to [0-9].

\D

Matches a nondigit character. Equivalent to [^0-9].

\f

Matches a form-feed character. Equivalent to \x0c and \cL.

\n

Matches a newline character. Equivalent to \x0a and \cJ.

\r

Matches a carriage return character. Equivalent to \x0d and \cM.

\s

Matches any whitespace character including space, tab, form-feed, etc. Equivalent to [ \f\n\r\t\v].

\S

Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

\t

Matches a tab character. Equivalent to \x09 and \cI.

\v

Matches a vertical tab character. Equivalent to \x0b and \cK.

\w

Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'.

\W

Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'.

\xn

Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". Allows ASCII codes to be used in regular expressions.

\num

Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters.

\n

Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).

\nm

Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exists, \nm matches octal escape value nm when n and m are octal digits (0-7).

\nml

Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).

\un

Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
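A few of the less obvious entries in the table (non-greedy quantifiers, non-capturing groups, lookaheads, and backreferences) are easier to follow when run. The sketch below uses Python's re module as a stand-in, on the assumption that it treats these particular constructs the same way; the sample strings "heavy industries" and "bookkeeper" are made up.

```python
import re

# Non-greedy versus greedy quantifier on the string "oooo".
lazy = re.search(r"o+?", "oooo").group()     # "o"
greedy = re.search(r"o+", "oooo").group()    # "oooo"

# Non-capturing group: the alternation is grouped but nothing is stored,
# so groups() comes back empty.
plural = re.search(r"industr(?:y|ies)", "heavy industries")

# Lookaheads check the text that follows without making it part of the match.
# Note that the matched text includes the space that is part of the pattern.
pos = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
neg = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

# Backreference: \1 repeats whatever group 1 captured.
double = re.search(r"(.)\1", "bookkeeper")

print(lazy, greedy)                     # o oooo
print(plural.group(), plural.groups())  # industries ()
print(repr(pos.group()), neg)           # 'Windows ' None
print(double.group())                   # oo
```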

 

Order of Precedence

The following table lists the regular expression operators in order of precedence, from highest to lowest:

Operator(s)                  Description
\                            Escape
(), (?:), (?=), []           Parentheses and Brackets
*, +, ?, {n}, {n,}, {n,m}    Quantifiers
^, $, \anymetacharacter      Anchors and Sequences
|                            Alternation

Characters have higher precedence than the alternation operator, which allows 'm|food' to match "m" or "food". To match "mood" or "food", use parentheses to create a subexpression, which results in '(m|f)ood'.
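The effect of precedence on alternation can be checked directly. This is a small sketch using Python's re module as a stand-in engine (an assumption); fullmatch is used so that each test string has to be matched in its entirety.

```python
import re

loose = re.compile(r"m|food")       # "m" OR "food"
grouped = re.compile(r"(m|f)ood")   # "mood" OR "food"

for word in ["m", "food", "mood"]:
    print(word,
          loose.fullmatch(word) is not None,
          grouped.fullmatch(word) is not None)
# m True False
# food True True
# mood False True
```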

 

Character Matching

The period (.) matches any single printing or non-printing character in a string, except a newline character (\n). The following regular expression matches 'aac', 'abc', 'acc', 'adc', and so on, as well as 'a1c', 'a2c', 'a-c', and 'a#c':

a.c

To match a literal period (.) in the input string, precede the period in the regular expression with a backslash (\) character. To illustrate, the following regular expression matches 'filename.ext':

filename\.ext
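The difference between the period as a wildcard and an escaped, literal period can be sketched as follows. Python's re module stands in for the engine here (an assumption), and the sample strings are illustrative only.

```python
import re

any_char = re.compile(r"a.c")
literal_dot = re.compile(r"filename\.ext")

samples = ["abc", "a1c", "a-c", "a#c", "ac", "a\nc"]
matched = [s for s in samples if any_char.fullmatch(s)]
print(matched)   # ['abc', 'a1c', 'a-c', 'a#c']

# The escaped period matches only a real "."; the unescaped one matches anything.
print(literal_dot.search("filename.ext") is not None)            # True
print(literal_dot.search("filenameXext") is not None)            # False
print(re.search(r"filename.ext", "filenameXext") is not None)    # True
```

'ac' fails because the period must consume exactly one character, and 'a\nc' fails because the period does not match a newline.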

Bracket Expressions

You can create a list of matching characters by placing one or more individual characters within square brackets ([ and ]). When characters are enclosed in brackets, the list is called a bracket expression. Within brackets, as anywhere else, ordinary characters represent themselves, that is, they match an occurrence of themselves in the input text. Most special characters lose their meaning when they occur inside a bracket expression. Here are some exceptions:

  • The ']' character ends a list if it's not the first item. To match the ']' character in a list, place it first, immediately following the opening '['.
  • The '\' character continues to be the escape character. To match the '\' character, use '\\'.

Characters enclosed in a bracket expression match only a single character for the position in the regular expression where the bracket expression appears. The following JScript regular expression matches 'Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', and 'Chapter 5':

Chapter [12345]

If you want to express the matching characters using a range instead of the characters themselves, you can separate the beginning and ending characters in the range using the hyphen (-) character. The character value of the individual characters determines their relative order within a range. The following regular expression contains a range expression that is equivalent to the bracketed list shown above.

Chapter [1-5]

When a range is specified in this manner, both the starting and ending values are included in the range.

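If you want to include the hyphen character in your bracket expression, you can do one of the following:

  • Escape it with a backslash: [\-]
  • Put it at the beginning or the end of the bracket list, where it cannot form a range: [-a-z] or [a-z-]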

You can also find all the characters not in the list or range by placing the caret (^) character at the beginning of the list. If the caret character appears in any other position within the list, it matches itself, that is, it has no special meaning. The following regular expression matches chapter headings with single-digit numbers greater than 5 (strictly, it matches 'Chapter ' followed by any one character other than 1 through 5, so 'Chapter 0' matches too):

Chapter [^12345]

OR

Chapter [^1-5]

A typical use of a bracket expression is to specify matches of any upper- or lowercase alphabetic characters or any digits. The following regular expression specifies such a match:

[A-Za-z0-9]
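The bracket expressions from this section can be compared side by side. The sketch below assumes Python's re module as a stand-in engine; the list of headings and the string "web-page-status" are made-up test data.

```python
import re

headings = ["Chapter 1", "Chapter 5", "Chapter 6", "Chapter 0", "Chapter A"]

in_list = [h for h in headings if re.fullmatch(r"Chapter [12345]", h)]
in_range = [h for h in headings if re.fullmatch(r"Chapter [1-5]", h)]
negated = [h for h in headings if re.fullmatch(r"Chapter [^1-5]", h)]

print(in_list)    # ['Chapter 1', 'Chapter 5']
print(in_range)   # ['Chapter 1', 'Chapter 5']
print(negated)    # ['Chapter 6', 'Chapter 0', 'Chapter A']

# An escaped hyphen inside brackets matches a literal "-".
hyphens = re.findall(r"[\-]", "web-page-status")
print(hyphens)    # ['-', '-']
```

The negated list shows that a negative bracket expression accepts any character outside the list, not only the digits above 5.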

 

Quantifiers and Associated Meanings

Sometimes, you do not know how many characters there are to match. In order to accommodate that kind of uncertainty, regular expressions support the concept of quantifiers. These quantifiers let you specify how many times a given component of your regular expression must occur for your match to be true.

Character
Description

*

Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.

+

Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

?

Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}.

{n}

n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food".

{n,}

n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}

m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.

With a large input document, chapter numbers could easily exceed nine, so you need a way to handle two- or three-digit chapter numbers. Quantifiers give you that capability. The following regular expression matches chapter headings with any number of digits:

Chapter [1-9][0-9]*

If you know that your chapter numbers are limited to only 99 chapters, you can use the following regular expression to specify at least one, but not more than 2 digits.

Chapter [0-9]{1,2}

The disadvantage of the expression shown above is that if there is a chapter number greater than 99, it will still match only the first two digits. Another disadvantage is that somebody could create a Chapter 0 and it would match. A better expression for matching at most two digits is the following:

Chapter [1-9][0-9]?
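The three chapter-number patterns behave differently on the same input, which a short test makes visible. This is a sketch with Python's re module standing in for the engine (an assumption); matched_text is a hypothetical helper that returns the matched portion or None.

```python
import re

any_digits = re.compile(r"Chapter [1-9][0-9]*")
one_or_two = re.compile(r"Chapter [0-9]{1,2}")
up_to_99 = re.compile(r"Chapter [1-9][0-9]?")

def matched_text(pattern, text):
    """Return the text the pattern matched, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None

print(matched_text(any_digits, "Chapter 123"))  # Chapter 123
print(matched_text(one_or_two, "Chapter 123"))  # Chapter 12
print(matched_text(one_or_two, "Chapter 0"))    # Chapter 0
print(matched_text(up_to_99, "Chapter 0"))      # None
print(matched_text(up_to_99, "Chapter 42"))     # Chapter 42
```

Note that without an anchor the last pattern still matches the first two digits of 'Chapter 123'; it only rules out a leading zero.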

The '*', '+', and '?' quantifiers are all what are referred to as greedy, that is, they match as much text as possible. Sometimes that is not at all what you want to happen. Sometimes, you just want a minimal match.

Say, for example, you are searching an HTML document for an occurrence of a chapter title enclosed in an H1 tag. That text appears in your document as:

<H1>Chapter 1 – Introduction to Regular Expressions</H1>

The following expression matches everything from the opening less than symbol (<) to the greater than symbol (>) at the end of the closing H1 tag.

<.*>

If all you really wanted to match was the opening H1 tag, the following, non-greedy expression matches only <H1>.

<.*?>

By placing the '?' after a '*', '+', or '?' quantifier, the expression is transformed from a greedy to a non-greedy, or minimal, match.
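The greedy and non-greedy forms can be compared on the H1 example. The sketch assumes Python's re module as a stand-in engine, since both forms are written the same way there.

```python
import re

html = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>"

greedy = re.search(r"<.*>", html).group()
lazy = re.search(r"<.*?>", html).group()

print(greedy == html)   # True: the greedy match runs to the last ">"
print(lazy)             # <H1>
```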


Anchors

So far, the examples you've seen have been concerned only with finding chapter headings wherever they occur. Any occurrence of the string 'Chapter' followed by a space, followed by a number, could be an actual chapter heading, or it could also be a cross-reference to another chapter. Since true chapter headings always appear at the beginning of a line, you'll need to devise a way to find only the headings and not find the cross-references.


Anchors provide that capability. Anchors allow you to fix a regular expression to either the beginning or end of a line. They also allow you to create regular expressions that occur either within a word or at the beginning or end of a word. The following table contains the list of regular expression anchors and their meanings:

Character    Description
^            Matches the position at the beginning of the input string.
$            Matches the position at the end of the input string.
\b           Matches a word boundary, that is, the position between a word and a space.
\B           Matches a nonword boundary.

To match text at the beginning of a line of text, use the '^' character at the beginning of the regular expression. Don't confuse this use of the '^' with the use within a bracket expression. They're definitely not the same.

To match text at the end of a line of text, use the '$' character at the end of the regular expression.

The following regular expression uses an anchor to match a chapter heading, with up to two digits, that occurs at the beginning of a line:

^Chapter [1-9][0-9]{0,1}

Not only does a true chapter heading occur at the beginning of a line, it's also the only thing on the line, so it must end at the end of the line as well. The following expression ensures that the match you've specified only matches chapters and not cross-references. It does so by creating a regular expression that matches only at the beginning and end of a line of text.

^Chapter [1-9][0-9]{0,1}$
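To see the two anchored patterns at work on a multi-line text, the sketch below uses Python's re module as a stand-in (an assumption). Its re.MULTILINE flag plays the role of the Multiline property mentioned in the metacharacter table, so that ^ and $ match at every line break; the sample text is made up.

```python
import re

text = "\n".join([
    "Chapter 1",
    "See Chapter 2 for details.",
    "Chapter 12",
    "Chapter 3 continues the topic",
])

# ^ alone: the heading must start a line, but anything may follow it.
starts_line = re.findall(r"^Chapter [1-9][0-9]{0,1}", text, re.MULTILINE)

# ^ and $ together: the heading must be the only thing on the line.
whole_line = re.findall(r"^Chapter [1-9][0-9]{0,1}$", text, re.MULTILINE)

print(starts_line)   # ['Chapter 1', 'Chapter 12', 'Chapter 3']
print(whole_line)    # ['Chapter 1', 'Chapter 12']
```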

Matching word boundaries is a little different but adds a very important capability to regular expressions. A word boundary is the position between a word and a space (more precisely, between a word character and a non-word character, or at the start or end of the input). A non-word boundary is any other position. The following expression matches the first three characters of the word 'Chapter' because they appear following a word boundary:

\bCha

The position of the '\b' operator is critical here. If it's positioned at the beginning of a string to be matched, it looks for the match at the beginning of the word; if it's positioned at the end of the string, it looks for the match at the end of the word. For example, the following expression matches 'ter' in the word 'Chapter' because it appears before a word boundary:

ter\b

The following expression matches 'apt' as it occurs in 'Chapter', but not as it occurs in 'aptitude':

\Bapt

That's because 'apt' occurs on a non-word boundary in the word 'Chapter' but on a word boundary in the word 'aptitude'. For the non-word boundary operator, position isn't important because the match isn't relative to the beginning or end of a word.
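The word-boundary examples can be verified with a few searches. This sketch assumes Python's re module as a stand-in engine; \b and \B behave the same way there for these plain ASCII words.

```python
import re

def found(pattern: str, word: str) -> bool:
    """Return True when the pattern matches somewhere in the word."""
    return re.search(pattern, word) is not None

print(found(r"\bCha", "Chapter"))    # True: 'Cha' starts the word
print(found(r"ter\b", "Chapter"))    # True: 'ter' ends the word
print(found(r"\Bapt", "Chapter"))    # True: 'apt' is inside the word
print(found(r"\Bapt", "aptitude"))   # False: 'apt' starts the word
print(found(r"er\b", "never"), found(r"er\b", "verb"))   # True False
print(found(r"er\B", "verb"), found(r"er\B", "never"))   # True False
```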

Alternation and Grouping

Alternation uses the '|' character to allow a choice between two or more alternatives. You can extend the chapter heading regular expression to cover more than just chapter headings. However, it's not as straightforward as you might think. When alternation is used, the largest possible expression on either side of the '|' character is matched. You might think that the following expression matches either 'Chapter' or 'Section' followed by one or two digits, occurring at the beginning and end of a line:

^Chapter|Section [1-9][0-9]{0,1}$

Unfortunately, the regular expression shown above matches either the word 'Chapter' at the beginning of a line, or 'Section' and whatever numbers follow it at the end of the line. If the input string is 'Chapter 22', the expression matches only the word 'Chapter'. If the input string is 'Section 22', the expression matches 'Section 22'. That is not the intent here, so the regular expression needs to be made more specific.

You can use parentheses to limit the scope of the alternation, that is, to make sure that it applies only to the two words 'Chapter' and 'Section'. However, parentheses are tricky as well, because they are also used to create subexpressions (see the (pattern) entry in the metacharacter table). By adding parentheses in the appropriate places to the regular expression shown above, you can make it match either 'Chapter 1' or 'Section 3'.
The following regular expression uses parentheses to group 'Chapter' and 'Section' so the expression works properly:

^(Chapter|Section) [1-9][0-9]{0,1}$
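The difference between the ungrouped and grouped alternation shows up clearly when both are run on the same input. This is a minimal sketch with Python's re module standing in for the engine (an assumption); the input 'Chapter one' is a made-up non-heading.

```python
import re

ungrouped = re.compile(r"^Chapter|Section [1-9][0-9]{0,1}$")
grouped = re.compile(r"^(Chapter|Section) [1-9][0-9]{0,1}$")

print(ungrouped.search("Chapter 22").group())   # Chapter
print(ungrouped.search("Section 22").group())   # Section 22
print(grouped.search("Chapter 22").group())     # Chapter 22
print(grouped.search("Section 3").group())      # Section 3

# The ungrouped form accepts any line that merely starts with "Chapter".
print(ungrouped.search("Chapter one") is not None)   # True
print(grouped.search("Chapter one") is not None)     # False

# The parentheses also capture the word that matched.
print(grouped.search("Section 3").group(1))     # Section
```

If the captured word is not needed, the non-capturing form (?:Chapter|Section) from the metacharacter table groups the alternatives without storing the match.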


Check&Get - Web Change Monitor, Bookmark Manager and Web Capture tool. Organizes Bookmarks, Tracks / Checks web-sites for updates and changes, Alerts on new web page content, detects dead-links and duplicates. Imports, Exports and Synchronizes bookmarks among Internet Explorer, Mozilla, FireFox, Opera.