A regular expression is a pattern that describes a set of strings. Many people avoid regular expressions because they can look confusing and complicated. However, with a little practice, it is fairly easy to write the expressions needed for advanced web filters in Check&Get.
The following simple examples should help you understand how regular expressions are used and start making your own web filters.
Ignoring Date/Time on a Web Page:
| Date/Time Format | Regular Expression | Part | Explanation |
| April 2, 2006 | \w+ \d{1,2}, \d{4} | \w+ | matches any word made of letters, digits, or underscores (January, May, etc.) |
| | | \d{1,2} | matches one or two digits (01, 22) |
| | | , | matches a comma |
| | | \d{4} | matches four digits |
| January 28, 2006, 09:07:19 AM | \w+ \d{1,2}, 200\d, \d\d:\d\d:\d\d (AM|PM) | \w+ | matches any word (April, May, etc.) |
| | | \d{1,2} | matches one or two digits (1 or 33) |
| | | , | matches a comma (,) |
| | | 200\d | matches 200 followed by any digit, i.e. the years 2000 through 2009 (2006, 2000, 2008) |
| | | \d\d: | matches two digits followed by ":" (23:, 12:, 01:) |
| | | \d\d | matches the final two digits (the seconds) |
| | | (AM|PM) | matches either AM or PM |
| 17-Jan-2006 | \d\d-\w{3}-\d{4} | \d\d | matches any two digits (01, 22, etc.) |
| | | - | matches the "-" character |
| | | \w{3} | matches three word characters (Jan, Feb, ZZZ, etc.) |
| | | - | matches the "-" character |
| | | \d{4} | matches any four digits |
| 2006/06/06 12:38:51 | \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} | \d{4} | matches any four digits |
| | | / | matches the "/" character |
| | | \d{2} | matches any two digits (01, 22, etc.) |
| | | : | matches the ":" character |
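To try the date/time expressions above outside Check&Get, they can be run against sample text in any regular expression engine. The sketch below uses Python's re module, which is an assumption made for illustration: Check&Get's own engine may differ in details. The function name strip_dates and the <DATE> placeholder are hypothetical.

```python
import re

# Patterns copied from the table above. The longer date-and-time pattern is
# listed first so the shorter "Month day, year" pattern cannot consume only
# the first half of a full timestamp.
DATE_PATTERNS = [
    r"\w+ \d{1,2}, 200\d, \d\d:\d\d:\d\d (AM|PM)",
    r"\w+ \d{1,2}, \d{4}",
    r"\d\d-\w{3}-\d{4}",
    r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}",
]


def strip_dates(text, placeholder="<DATE>"):
    """Replace every fragment matched by a date pattern with a placeholder."""
    for pattern in DATE_PATTERNS:
        text = re.sub(pattern, placeholder, text)
    return text


# Two snapshots of a page that differ only in the date.
old_page = "Posted April 2, 2006 by admin"
new_page = "Posted May 14, 2006 by admin"

# Once the dates are masked, the snapshots compare as equal.
print(strip_dates(old_page) == strip_dates(new_page))
```

Masking the matched text before comparing two versions of a page is one way to picture what "ignoring" a date means; it is not a description of how Check&Get works internally.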
Ignoring Counters:
| Counter Format | Regular Expression | Part | Explanation |
| 12432 Visitors | \d+ Visitors | \d+ | matches one or more digits |
| | | Visitors | matches the word "Visitors" |
| Activity: 78.2% | Activity: \d+\.\d+% | Activity: | matches the word "Activity:" |
| | | \d+ | matches one or more digits (1 or 33) |
| | | \. | matches the dot character (.); the backslash is needed because an unescaped dot matches any character |
| | | \d+ | matches one or more digits (1 or 33) |
| | | % | matches the percent character (%) |
| (c) 1999-2006 ActiveURLs. All Rights Reserved. | \(c\) \d{4}-\d{4} | \(c\) | matches the text "(c)"; the parentheses are escaped because unescaped parentheses form a group |
| | | \d{4}- | matches four digits followed by the "-" character |
| | | \d{4} | matches four digits |
| | | | This filter ignores changes in the copyright years (1999-2005, 1998-2006, etc.) |
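The counter expressions above rely on escaping: "\." and "\(c\)" match literal characters, while an unescaped dot or parenthesis has a special meaning. The following sketch illustrates the difference using Python's re module, which is assumed here only for illustration and may not behave identically to Check&Get's engine. The names mask_counters and same_ignoring_counters are hypothetical.

```python
import re

# Patterns copied from the table above.
COUNTER_PATTERNS = [
    r"\d+ Visitors",
    r"Activity: \d+\.\d+%",
    r"\(c\) \d{4}-\d{4}",
]


def mask_counters(text, placeholder="<COUNTER>"):
    """Replace every counter-like fragment with a fixed placeholder."""
    for pattern in COUNTER_PATTERNS:
        text = re.sub(pattern, placeholder, text)
    return text


def same_ignoring_counters(old_page, new_page):
    """Return True if two pages are equal once the counters are masked."""
    return mask_counters(old_page) == mask_counters(new_page)


# An escaped dot matches only ".", an unescaped dot matches any character.
strict = re.compile(r"Activity: \d+\.\d+%")
loose = re.compile(r"Activity: \d+.\d+%")
print(bool(strict.fullmatch("Activity: 78x2%")))  # False
print(bool(loose.fullmatch("Activity: 78x2%")))   # True

# Unescaped parentheses form a group, so this pattern looks for "c 1999-2006"
# and does not find the literal text "(c) 1999-2006".
print(re.search(r"(c) \d{4}-\d{4}", "(c) 1999-2006"))  # None
```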
Ignoring Advertisements:
The following example shows how to ignore a typical text advertisement, such as this one:
SPONSORED LINKS
Get a $200,000 Loan for $770/month
Fill out 1 form, and regardless of credit receive up to 4 loan offers in minutes from our certified lenders. When Banks Compete, You Win!
Mortgage solutions that fit your needs.
Comparing rates of fixed rate loans and adjustable rate mortgages? We'll match the interest rate quoted by any competing lender for products with the same terms AND we'll beat the other lender's fees ...
Buy a Link Now »
We need to ignore all the text from the words "SPONSORED LINKS" through the words "Buy a Link Now".
The following regular expression does the job:
| Regular Expression | Part | Explanation |
| SPONSORED LINKS[\w\W]+?Buy a Link Now | SPONSORED LINKS | matches the words "SPONSORED LINKS" |
| | [\w\W]+ | matches one or more characters of any kind, including line feeds |
| | ? | makes the preceding [\w\W]+ non-greedy, so it stops at the first "Buy a Link Now" that follows (the minimal match) |
| | Buy a Link Now | matches the words "Buy a Link Now" |
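The effect of the non-greedy "?" is easiest to see on a page that contains more than one advertisement block. The sketch below compares the expression above with a greedy variant that omits the "?". It uses Python's re module and an invented sample page; both are assumptions for illustration only, and Check&Get's engine may differ.

```python
import re

# A made-up page with two advertisement blocks around real content.
page = (
    "Top story\n"
    "SPONSORED LINKS\nGet a loan\nBuy a Link Now\n"
    "Real content\n"
    "SPONSORED LINKS\nMortgage offers\nBuy a Link Now\n"
    "Footer\n"
)

# Non-greedy: each match ends at the nearest "Buy a Link Now".
lazy = re.compile(r"SPONSORED LINKS[\w\W]+?Buy a Link Now")
# Greedy variant: one match runs to the LAST "Buy a Link Now" on the page.
greedy = re.compile(r"SPONSORED LINKS[\w\W]+Buy a Link Now")

lazy_result = lazy.sub("", page)
greedy_result = greedy.sub("", page)

print("Real content" in lazy_result)    # True: only the two ads are removed
print("Real content" in greedy_result)  # False: the text between the ads is lost
```

With the greedy variant, everything between the first "SPONSORED LINKS" and the last "Buy a Link Now" is ignored, including genuine page content, which is why the minimal match is used.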
Note: Check&Get provides an easy-to-use "Select and Ignore" feature for ignoring advertisements like the one shown in this example. You do not need to create such web filters manually; Check&Get will do this for you automatically. Nevertheless, you can use the methods described above to create highly customizable web filters that handle nearly any filtering task.