Regular Expressions

Regular Expressions in Bible Analyzer



Regular expressions can be used for Bible text searches.

Basically, a regular expression (or RE) is a pattern describing a certain amount of text. The name comes from the mathematical theory on which regular expressions are based.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like A, a, or 0, are the simplest regular expressions; they simply match themselves. You can also combine ordinary characters, so last matches the string last in 'last' or 'blast'. 

Examples:
Lord will match Lord and LORD (unless Case Sensitive is selected).
or will match or, Lord, error, and any other word with the string or.
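The literal matches above can be sketched in Python. This assumes the search box follows the syntax of Python's re module, which this page does not state, and the sample text is only an illustration. The re.IGNORECASE flag stands in for leaving Case Sensitive unselected.

```python
import re

# Sample text for illustration only.
verse = "The LORD said unto my Lord, Sit thou at my right hand."

# Case-insensitive search: every capitalization of "lord" is found.
insensitive = re.findall(r"Lord", verse, flags=re.IGNORECASE)

# Case-sensitive search: only the exact capitalization is found.
sensitive = re.findall(r"Lord", verse)

# A short literal such as "or" also matches inside longer words.
# \w* on both sides widens each hit to the whole word that contains "or".
inside_words = re.findall(r"\w*or\w*", "Lord, error, or sword")

print(insensitive)
print(sensitive)
print(inside_words)
```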



Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. Some characters, like "|" or "(", are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning. There are more examples below.
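A minimal sketch of the escaping rule, assuming Python's re module as the regex engine (an assumption, not something this page confirms). It shows why 1+1=2 fails without the backslash. re.escape is a standard Python helper that adds the backslashes for you.

```python
import re

text = "1+1=2"

# Unescaped, the plus means "one or more of the preceding 1", so the
# pattern looks for text such as "11=2" and cannot match "1+1=2".
unescaped = re.search(r"1+1=2", text)

# Escaped, the plus is an ordinary character.
escaped = re.search(r"1\+1=2", text)

# re.escape builds a pattern that matches its argument literally.
auto = re.escape(text)

print(unescaped)         # no match object
print(escaped.group())   # the whole string
print(auto)
```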

'[ ]' — Character Classes or Character Sets 
With a character class, also called character set, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. This is very useful if you do not know whether the document you are searching through is written in American or British English. Character classes can also catch common misspellings, for example sep[ae]r[ae]te or li[cs]en[cs]e.

You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can combine ranges and single characters: [0-9a-zA-Z] matches a single digit or letter.


Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.
It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match the q and the space after the q in Iraq is a country. Indeed: the space will be part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp.

Examples:
gr[ae]y will match gray or grey.
lo[uv]e will match love or loue (AV1611 spelling).
sep[ae]r[ae]te will find seperate, separate, seperete, and separete.
lo[^u]e will find love but not loue.
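The character class examples can be checked with a short Python sketch. It assumes Python's re module as the engine, and the word lists are invented for the demonstration.

```python
import re

# Invented word list for illustration.
words = "gray grey love loue lone"

either_spelling = re.findall(r"gr[ae]y", words)
love_or_loue = re.findall(r"lo[uv]e", words)
not_u = re.findall(r"lo[^u]e", words)

# A negated class must still consume one character, so the q at the
# very end of "Iraq" does not match, but "q" plus a space does.
iraq_alone = re.search(r"q[^u]", "Iraq")
iraq_sentence = re.search(r"q[^u]", "Iraq is a country")

print(either_spelling, love_or_loue, not_u)
print(iraq_alone, repr(iraq_sentence.group()))
```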

' | ' — Vertical Bar or Pipe (Alternation)
You can use alternation to match a single regular expression out of several possible regular expressions. It differs from a character class in that REs with more than one character can be used. For example, cat|dog will find cat or dog.

Examples:
lord|God will find all verses with either Lord or God
mercy|grace|lo[uv]e will match mercy, grace, love or loue
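As a rough illustration of alternation, assuming Python's re module and an invented sample sentence, each alternative is tried at every position and the hits are reported in the order they appear in the text:

```python
import re

# Invented sample sentence for illustration.
text = "By grace are ye saved; the Lord is full of mercy and loue."

# Alternation can be combined with a character class.
pattern = re.compile(r"mercy|grace|lo[uv]e", re.IGNORECASE)
found = pattern.findall(text)

# lord|God finds either word, in any capitalization here.
lord_or_god = re.findall(r"lord|God", "The LORD is God", flags=re.IGNORECASE)

print(found)
print(lord_or_god)
```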

'.'  — Dot
This matches any character except a newline.  However, because of its broad capabilities, it can lead to unintended matches.

Examples:
lo.e will find love and loue, but also loqe or lo2e.
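A small sketch of the dot, assuming Python's re module and invented test strings. It shows both the intended and the unintended matches, and that the dot stops at a newline.

```python
import re

# Invented strings for illustration.
candidates = "love loue loqe lo2e"

# The dot accepts any single character between "lo" and "e".
dot_hits = re.findall(r"lo.e", candidates)

# By default the dot does not match a newline character.
across_newline = re.search(r"lo.e", "lo\ne")

print(dot_hits)
print(across_newline)
```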

'^'  '$' — Caret, Dollar Sign (Location Anchors)
These two characters anchor the location of the search. ^ matches only at the start of the string, and $ matches only at the end.

Examples:
^christ will match Christ in, 'Christ hath redeemed us...' (Gal. 3:13), but not Christ in, 'Paul, a servant of Jesus Christ, called...' (Rom. 1:1).
christ\W?$ (don't forget to allow for the punctuation, see below) will find Christ in, 'Be ye followers of me, even as I also am of Christ.' (1 Cor. 11:1), but not Christ in Gal. 3:13.
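The two anchor examples can be sketched in Python, assuming the re module as the engine and treating each verse as one string (both are assumptions about how the search works). The verse fragments are the ones quoted above.

```python
import re

gal = "Christ hath redeemed us"
cor = "Be ye followers of me, even as I also am of Christ."

starts_with = re.compile(r"^christ", re.IGNORECASE)
# \W? allows for one closing punctuation mark before the end.
ends_with = re.compile(r"christ\W?$", re.IGNORECASE)

print(bool(starts_with.search(gal)), bool(starts_with.search(cor)))
print(bool(ends_with.search(cor)), bool(ends_with.search(gal)))
```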

'?' — Question Mark (Optional Items)
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. Thus the question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches both colour and color. You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket.

Examples:
Lords? will find both Lord and Lords
Right(eousness)? will match Right and Righteousness.
To match a literal ? use \?
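A brief sketch of optional items, assuming Python's re module. Note one Python-specific detail: re.findall returns the contents of a group when the pattern contains one, so this sketch reads the whole match from finditer instead.

```python
import re

# A single optional character.
spellings = re.findall(r"colou?r", "color colour")
plurals = re.findall(r"Lords?", "Lord of Lords")

# An optional group: collect the full match text with group(0).
right_words = [m.group(0)
               for m in re.finditer(r"Right(eousness)?", "Right Righteousness")]

print(spellings, plurals, right_words)
```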

'*' '+' '{}' — Star, Plus, and Curly Braces (Repetition)
The star (or asterisk) causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. The plus causes the resulting RE to match 1 or more repetitions of the preceding RE.

There is also an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.

Examples:
ab* will match 'a', 'ab', or 'abbbbb...' (until something other than 'b' is encountered).
ab+ will match 'ab' or 'abbbbb...' It will not match just 'a'.
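(lord ){3} will find lord lord lord followed by a space (remember to add the space and any possible punctuation for whole words). The round brackets are needed because a repetition operator applies only to the token directly before it; lord {3} without them would look for lord followed by three spaces.
To match a literal * or + use a backslash before it (\* or \+).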
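The repetition operators can be tried out with the following sketch. It assumes Python's re module as the engine, and the test strings are invented. re.fullmatch is used so that each string must match from start to end.

```python
import re

samples = ("a", "ab", "abbb")

# Star allows zero b's, plus requires at least one.
star = [bool(re.fullmatch(r"ab*", s)) for s in samples]
plus = [bool(re.fullmatch(r"ab+", s)) for s in samples]

# {3} repeats exactly three times; {2,4} repeats two to four times.
# The \b word boundaries stop a match inside a longer number.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "7 42 1000 9999 12345")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(star, plus)
print(four_digits, three_to_five)
```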

*?, +?, ?? — Dealing With Greediness
The *, +, and ? quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using <.*?> instead will match only '<H1>'.
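The difference between greedy and minimal matching is easy to see side by side. A minimal sketch, assuming Python's re module:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? stops at the first ">" it can.
lazy = re.search(r"<.*?>", html).group()

print(greedy)
print(lazy)
```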

(...) — Grouping
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group.

Examples:
Right(eousness)? will match Right and Righteousness.

(?=...) — Positive Lookahead
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion.

Examples:
Jesus (?=Christ) will match Jesus (and the space after it) only if it's followed by Christ. Use Jesus\W+(?=Christ) to deal with any possible punctuation between the words, e.g. Jesus, Christ.

(?!...) — Negative Lookahead
Matches if ... doesn't match next. This is a negative lookahead assertion.

Examples:
Jesus (?!Christ) will match Jesus only if it's not followed by Christ.
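Both lookahead forms can be compared on one invented sentence. This sketch assumes Python's re module. It records where each match starts, and shows that the text inside the lookahead is checked but not included in the match.

```python
import re

# Invented sentence for illustration.
text = "Jesus Christ, and Jesus said"

# Positive lookahead: only the first Jesus is followed by Christ.
followed = [m.start() for m in re.finditer(r"Jesus (?=Christ)", text)]

# Negative lookahead: only the second Jesus is not followed by Christ.
not_followed = [m.start() for m in re.finditer(r"Jesus (?!Christ)", text)]

# The lookahead text is not part of the match itself.
matched_text = re.search(r"Jesus (?=Christ)", text).group()

print(followed, not_followed, repr(matched_text))
```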


Shortcuts
These shortcuts can be used in place of Character Classes and other characters.

\A
Matches only at the start of the string.

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

\B
Matches the empty string, but only when it is not at the beginning or end of a word.

\d
Matches any decimal digit; this is equivalent to the set [0-9].

\D
Matches any non-digit character; this is equivalent to the set [^0-9].

\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].

\w
Matches any alphanumeric character or the underscore; this is equivalent to the set [a-zA-Z0-9_].

\W
Matches any character that is not alphanumeric or the underscore; this is equivalent to the set [^a-zA-Z0-9_].

\Z
Matches only at the end of the string.

\\
Matches a literal backslash.
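A few of the shortcuts in action, as a sketch that assumes Python's re module (whose documentation uses the same definitions). The sample strings are invented.

```python
import re

# \d+ collects runs of digits.
digits = re.findall(r"\d+", "John 3:16")

# \s+ splits on any run of whitespace.
tokens = re.split(r"\s+", "In the  beginning")

# \b requires a word edge, \B forbids one: the standalone "or" starts
# at index 5, while the "or" inside "Lord" starts at index 1.
sample = "Lord or error"
whole_word = [m.start() for m in re.finditer(r"\bor\b", sample)]
inside_word = [m.start() for m in re.finditer(r"\Bor\B", sample)]

# \A and \Z anchor to the start and end of the string.
at_start = bool(re.search(r"\AIn", "In the beginning"))
at_end = bool(re.search(r"beginning\Z", "In the beginning"))

print(digits, tokens, whole_word, inside_word, at_start, at_end)
```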


  
Advanced Examples

Words Near Each Other
With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern consists of three parts: the first word, a certain number of unspecified words, and the second word.

An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).

The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b . The quantifier {1,6}? makes the regex require at least one word between "word1" and "word2", and allow at most six words.

If the words may also occur in reverse order, we need to specify the opposite pattern as well:

\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b 

If you want to find any pair of two words out of a list of words, you can use:

\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b

This regex will also find a word near itself, e.g. it will match word2 near word2.

Examples:

·    \bJesus\W+(?:\w+\W+){1,6}?Christ\b will find Jesus and Christ in order, separated by at least one word and no more than six.
·    \b(Lord|Jesus|Christ)(?:\W+\w+){1,6}?\W+(Lord|Jesus|Christ)\b will match Lord, Jesus or Christ separated by at least one word and no more than six before a second instance of Lord, Jesus or Christ.
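The proximity patterns can be built and tested with a short Python sketch. It assumes Python's re module as the engine. The helper near_pattern and its parameters are hypothetical names introduced for this illustration, not part of Bible Analyzer, and the sample sentences are invented.

```python
import re

def near_pattern(word1, word2, min_gap=1, max_gap=6):
    """Hypothetical helper: match word1 and word2 in either order,
    separated by min_gap to max_gap other words."""
    a, b = re.escape(word1), re.escape(word2)
    # Separator, then the in-between words, each followed by a separator.
    gap = r"\W+(?:\w+\W+){%d,%d}?" % (min_gap, max_gap)
    return re.compile(r"\b(?:%s%s%s|%s%s%s)\b" % (a, gap, b, b, gap, a))

# The fixed-order pattern from the examples above.
in_order = re.compile(r"\bJesus\W+(?:\w+\W+){1,6}?Christ\b")

two_between = in_order.search("Jesus is the Christ")
adjacent = in_order.search("Jesus Christ")                 # no word between
too_far = in_order.search("Jesus a b c d e f g Christ")    # seven words between

either_order = near_pattern("Jesus", "Christ")
reversed_hit = either_order.search("Christ is the name of Jesus")

print(two_between.group())
print(adjacent, too_far)
print(reversed_hit.group())
```

Lowering min_gap to 0 in the hypothetical helper would also accept the two words when they are adjacent.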


For more information about regular expressions, check this excellent website: http://www.regular-expressions.info/reference.html