Regular Expression
I started learning about regular expressions. They are a nice tool to have in a programmer's toolbox. I still use Google to search for regular expressions, but it is nice to have some understanding of them.
Here are some notes which I have written so I do not have to google every time.
Notes
Literal Character
- `/cat/` matches both `cat` and `communicated` (notice the cat in the middle of communiCATed).
- Literal matches are case sensitive; use the `i` modifier to make them case insensitive.
- Only the first occurrence is returned. To return all the occurrences, use the `g` (global) modifier.
- `/car/` will not match `c a r` because spaces are not ignored.
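To check the points above, here is a minimal sketch using Python's `re` module. Python is my assumption here, since the notes only use the `/pattern/` notation. Python has no `g` modifier; `re.findall` plays that role, and `re.IGNORECASE` corresponds to `i`. The sample text and variable names are made up for illustration.

```python
import re

text = "The cat sat. CAT people communicated."

# A literal pattern matches anywhere, even inside a longer word.
first = re.search(r"cat", text)        # first occurrence only
all_lower = re.findall(r"cat", text)   # every case-sensitive occurrence
all_any_case = re.findall(r"cat", text, flags=re.IGNORECASE)

# Spaces are significant: "car" is not found in "c a r".
spaced = re.search(r"car", "c a r")

print(first.start(), all_lower, all_any_case, spaced)
```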
Metacharacters
`\ . * + - { } [ ] ^ $ | ? ( ) : ! =`
- A metacharacter is a character with a special meaning, and its meaning may differ with the context in which it is used.
Wild Card Meta Character
- `.` can be used to match any single character except a newline.
- `/.a.a.a/` matches both `papaya` and `banana`.
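A small sketch of the wildcard, assuming Python's `re` module. The word list is made up, and `re.DOTALL` is Python's own switch for letting `.` match a newline; other engines name it differently.

```python
import re

pattern = re.compile(r".a.a.a")
words = ["papaya", "banana", "potato", "avocado"]

# Keep only the words that contain "any char, a, any char, a, any char, a".
matched = [w for w in words if pattern.search(w)]

# By default the dot does not match a newline.
no_newline = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", flags=re.DOTALL)

print(matched, no_newline, with_dotall is not None)
```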
Escaping Meta Characters
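- The `\` metacharacter tells the regex engine to escape the character next to it, so that character is treated as a literal.
- `/9\.00/` matches `9.00` but not `9500` or `9-00`.
- To match a space, leave a space in the regex.
- To match a tab use `\t` (notice that we are using the `\` to escape the `t`). In the same way, `\r` matches a carriage return and `\n` matches a newline.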
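Here is a sketch of escaping, again assuming Python's `re` module. The `loose` pattern is only there for contrast, and `re.escape` is a Python helper that adds the backslashes for you.

```python
import re

samples = ["9.00", "9500", "9-00"]

# Escaped dot: only a literal "." is accepted between 9 and 00.
price = re.compile(r"9\.00")
results = {s: bool(price.search(s)) for s in samples}

# Unescaped dot: it is a wildcard again, so all three samples match.
loose = re.compile(r"9.00")
loose_results = {s: bool(loose.search(s)) for s in samples}

# \t in the pattern matches a real tab character.
tab_match = re.search(r"a\tb", "a\tb")

# re.escape builds the escaped pattern from a plain string.
escaped = re.escape("9.00")

print(results, loose_results, tab_match is not None, escaped)
```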
Character Sets
- `[` opens a character set and `]` closes it.
- A character set only matches a single character.
- The order of characters in a character set does not matter, e.g. `[aeiou]` is the same as `[ueioa]`.
- `/gr[ea]y/` matches both `grey` and `gray`.
Character Ranges
- `-` indicates a range of characters.
- `-` is a metacharacter only inside a character set; it is a literal dash otherwise.
- `[0-9]` indicates the digits 0 to 9.
- `[a-zA-Z]` matches all letters of the alphabet.
- Remember it is a range of characters, not a range of numbers.
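A sketch of character sets and ranges, assuming Python's `re` module. The sample words are made up. The last part shows why a range is about characters and not numbers: `[50-99]` is just the character `5`, the range `0-9`, and the character `9`, so it matches any single digit.

```python
import re

# A set matches exactly one character, so "greay" is rejected.
grey = re.compile(r"gr[ea]y")
found = [w for w in ["grey", "gray", "greay", "groy"] if grey.search(w)]

# Order inside the set does not change what is matched.
same_order = re.findall(r"[aeiou]", "education") == re.findall(r"[ueioa]", "education")

# Ranges: digits and letters.
digits = re.findall(r"[0-9]", "a1b22")
letters = re.findall(r"[a-zA-Z]", "a1B2")

# Character range, not number range: each digit of "73" matches on its own.
not_numbers = re.findall(r"[50-99]", "73")

print(found, same_order, digits, letters, not_numbers)
```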
Negative Character Sets
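- `^` at the start of a character set negates it. Think of it as a `!` (not) of the character set.
- `[^aeiou]` matches any one character that is not a lowercase vowel. That covers consonants, but also digits, spaces and punctuation.
- `/see[^mn]/` matches `seek` and `sees` but does not match `seem` or `seen`.
- It will still match `see.` because the period counts as a character. It also won't match `see` because there is nothing after the last `e`.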
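The negative set examples can be checked with a short sketch, assuming Python's `re` module. The candidate list just repeats the words from the notes.

```python
import re

pattern = re.compile(r"see[^mn]")
candidates = ["seek", "sees", "seem", "seen", "see.", "see"]

# A negated set still has to consume one character, so plain "see" fails.
matched = [s for s in candidates if pattern.search(s)]

# [^aeiou] is "anything but a lowercase vowel", not only consonants.
non_vowels = re.findall(r"[^aeiou]", "a1 b")

print(matched, non_vowels)
```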
Meta Characters inside Character Set
- Most metacharacters inside a character set are already escaped.
- `/h[a.]t/` matches `hat` and `h.t` but not `hot`.
- The exceptions are `]`, `-`, `^` and `\`:
- `]` means that we are closing the character set
- `-` means a range
- `^` means a negative character set
- `\` means escaping
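A sketch of both cases, assuming Python's `re` module. Inside the set the dot is a plain dot, while the four exceptions are escaped with `\` to make them literal. The sample strings are made up.

```python
import re

# The dot inside [] is literal, so "hot" is not matched.
dot_set = re.compile(r"h[a.]t")
matched = [s for s in ["hat", "h.t", "hot"] if dot_set.search(s)]

# The four exceptions are escaped here to match them literally.
specials = re.compile(r"[\]\-\^\\]")
found = specials.findall(r"a]b-c^d\e")

print(matched, found)
```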
Shorthand Character Set
- `\d` for a digit, same as `[0-9]`
- `\w` for a word character, same as `[a-zA-Z0-9_]`. `_` is a word character but the hyphen is not.
- `\s` for whitespace, same as `[ \t\r\n]`
- `\D` for not a digit, same as `[^0-9]`
- `\W` for not a word character, same as `[^a-zA-Z0-9_]`
- `\S` for not whitespace, same as `[^ \t\r\n]`
- `[^\d\s]` for not a digit or a space
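A sketch of the shorthand sets, assuming Python's `re` module. I pass `re.ASCII` so that `\d` and `\w` behave like the bracket versions above; without it Python also accepts non-ASCII digits and letters. The sample string is made up.

```python
import re

sample = "Order_7 ships-by 3pm\t"

digits = re.findall(r"\d", sample, flags=re.ASCII)
# The underscore stays inside a word, the hyphen splits it.
words = re.findall(r"\w+", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample)

# The uppercase shorthand is the negation of the lowercase one.
non_digits_same = re.findall(r"\D", "a1b2") == re.findall(r"[^0-9]", "a1b2")

# Shorthands can be used inside a set.
neither = re.findall(r"[^\d\s]", "a1 b2")

print(digits, words, spaces, non_digits_same, neither)
```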
Repetitions Metacharacters
- `*` preceding item, zero or more times
- `+` preceding item, one or more times
- `?` preceding item, zero or one time
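To see the three side by side, here is a sketch that assumes Python's `re` module. The repeated item is the letter `a` between `c` and `t`. I use `re.fullmatch` so the whole string has to fit the pattern.

```python
import re

samples = ["ct", "cat", "caat"]

star = [s for s in samples if re.fullmatch(r"ca*t", s)]      # zero or more "a"
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]      # one or more "a"
optional = [s for s in samples if re.fullmatch(r"ca?t", s)]  # zero or one "a"

print(star, plus, optional)
```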
I find these repetition metacharacters hard to remember and I often forget the difference between them. I found a nice Stack Overflow answer which explains it as follows:
"
In RegEx, `{i,f}` means "between i to f matches". Let's take a look at the following examples:
- `{3,7}` means between 3 to 7 matches
- `{,10}` means up to 10 matches with no lower limit (i.e. the low limit is 0)
- `{3,}` means at least 3 matches with no upper limit (i.e. the high limit is infinity)
- `{,}` means no upper limit or lower limit for the number of matches (i.e. the lower limit is 0 and the upper limit is infinity)
- `{5}` means exactly 5
Most good languages contain abbreviations, so does RegEx:
- `+` is the shorthand for `{1,}`
- `*` is the shorthand for `{,}`
- `?` is the shorthand for `{,1}`
This means `+` requires at least 1 match while `*` accepts any number of matches or no matches at all and `?` accepts no more than 1 match or zero matches.
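The brace notation can be checked with a small sketch, assuming Python's `re` module. `matching_lengths` is a hypothetical helper I made up for this: it reports how many repeated `a` characters a pattern accepts. I write the lower limit as an explicit `0` because not every regex engine accepts an omitted lower limit such as `{,10}`.

```python
import re

def matching_lengths(pattern, max_len=8):
    """Return each n in 0..max_len for which 'a' * n fully matches pattern."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]

between = matching_lengths(r"a{3,7}")   # between 3 and 7
at_least = matching_lengths(r"a{3,}")   # 3 or more
up_to = matching_lengths(r"a{0,2}")     # at most 2
exactly = matching_lengths(r"a{5}")     # exactly 5

# The shorthands accept the same lengths as their brace forms.
plus_same = matching_lengths(r"a+") == matching_lengths(r"a{1,}")
star_same = matching_lengths(r"a*") == matching_lengths(r"a{0,}")
optional_same = matching_lengths(r"a?") == matching_lengths(r"a{0,1}")

print(between, at_least, up_to, exactly, plus_same, star_same, optional_same)
```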