Regular Expression

I started learning about regular expressions. They are a nice tool to have in a programmer's toolbox. I still use Google to search for regular expressions, but it is nice to have some understanding of them.

Here are some notes which I have written so I do not have to google every time.


Literal Character

  1. /cat/ matches both cat and communicated (notice the cat inside communiCATed)
    1. Literal matches are case sensitive; use the i modifier to make them case insensitive
    2. Only the first occurrence is returned. To return all occurrences, use the g (global) modifier
    3. /car/ will not match c a r because spaces are not ignored
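
A minimal sketch of these rules, assuming Python's re module (the notes do not name a language). Python has no /pattern/flags literal, so here the i modifier becomes re.IGNORECASE and the g modifier is approximated with re.findall.

```python
import re

# re.search returns the first match anywhere in the string, or None.
first = re.search(r"cat", "communicated")
print(first.group())

# Literal matches are case sensitive unless IGNORECASE is set.
case_sensitive = re.search(r"cat", "CAT")
case_insensitive = re.search(r"cat", "CAT", re.IGNORECASE)

# re.findall stands in for the g modifier: it returns every match.
all_matches = re.findall(r"cat", "cat concatenate communicated")
print(all_matches)

# Spaces are ordinary characters, so "car" is not found in "c a r".
spaced = re.search(r"car", "c a r")
```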

Metacharacters

\ . * + - { } [ ] ^ $ | ? ( ) : ! =

  1. Characters with special meaning; the meaning may differ with the context in which the character is used

Wild Card Meta Character

  1. .
    1. can be used to match anything except newline
    2. /.a.a.a/ matches both papaya and banana
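
The wild card can be checked with a short sketch, again assuming Python's re module. The word bandana is a made-up counterexample that is not in the notes.

```python
import re

# "." stands for any single character except a newline.
pattern = re.compile(r".a.a.a")

words = ["papaya", "banana", "bandana"]
matched = [w for w in words if pattern.search(w)]
print(matched)

# By default "." does not match a newline ...
no_newline = re.search(r"a.b", "a\nb")
# ... unless the DOTALL flag is set (the s modifier in some other engines).
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
```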

Escaping Meta Characters

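  1. The \ meta character tells the regex engine to escape the character next to it, so that character is matched literally.
    1. /9\.00/ matches 9.00 but not 9500 or 9-00.
  2. To match a space, leave a space in the regex
  3. To match a tab use \t (notice that here the \ gives the t a special meaning instead of taking one away). In the same way, \r matches a carriage return and \n matches a newline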
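
Here is a small sketch of the difference the backslash makes, assuming Python's re module. re.escape is a Python helper that adds the backslashes for you; other languages may not have an equivalent.

```python
import re

samples = ["9.00", "9500", "9-00"]

# Escaped dot: only a literal "." is accepted between 9 and 00.
strict = re.compile(r"9\.00")
strict_hits = [s for s in samples if strict.search(s)]

# Unescaped dot: the wild card accepts ".", "5" and "-" alike.
loose = re.compile(r"9.00")
loose_hits = [s for s in samples if loose.search(s)]

# \t in the pattern matches a real tab character in the text.
tab_hit = re.search(r"a\tb", "a\tb")

# re.escape builds the escaped pattern from plain text.
escaped = re.escape("9.00")
print(strict_hits, loose_hits, escaped)
```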

Character Sets

  1. []
    1. [ opening a character set
    2. ] closing a character set
    3. Only matches a single character
    4. The order of characters in a character set does not matter, e.g. [aeiou] is the same as [ueioa]
    5. /gr[ea]y/ matches both grey and gray
  2. Character Ranges
    1. - to indicate range of characters
    2. - is a meta character only inside a character set; it is a literal dash otherwise
    3. [0-9] to indicate digits 0 to 9
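    4. [a-zA-Z] all ASCII letters, lower and upper case
  3. Remember it is a range of characters, not a range of numbers: [50-99] is the characters 5, 0-9 and 9, which is the same as [0-9]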
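
A quick sketch of character sets and ranges, assuming Python's re module. The sample strings are invented for illustration.

```python
import re

# A set matches exactly one character from the listed options.
grey = re.findall(r"gr[ea]y", "grey gray groy")

# Order inside the set does not matter.
same = re.findall(r"[aeiou]", "education") == re.findall(r"[ueioa]", "education")

# Ranges are ranges of characters.
digits = re.findall(r"[0-9]", "room 42b")
letters = re.findall(r"[a-zA-Z]", "R2d")

# [50-99] is NOT "50 to 99": it is 5, the range 0-9, and 9,
# so it behaves like [0-9] and matches one digit at a time.
not_a_number_range = re.findall(r"[50-99]", "7 61")
print(grey, digits, letters, not_a_number_range)
```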

Negative Character Sets

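  1. ^ as the first character inside a character set negates it. Think of it as a ! (not) for the character set.
  2. [^aeiou] matches any one character that is not a lowercase vowel (consonants, but also digits, spaces and punctuation)
  3. /see[^mn]/ matches seek and sees but not seem or seen
    1. it also won’t match `see` on its own because there is nothing after the last `e`,
    but it will still match `see.` or `see ` because the period or space counts as a character.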
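
The see example is easy to get wrong, so here is a sketch that tries each case, assuming Python's re module. The strings "see." and "see you" are extra test inputs added for illustration.

```python
import re

# [^mn] must consume one character that is neither "m" nor "n".
pattern = re.compile(r"see[^mn]")

words = ["seek", "sees", "seem", "seen", "see", "see.", "see you"]
matched = [w for w in words if pattern.search(w)]
print(matched)

# A negative set matches anything outside the set, not only letters.
non_vowels = re.findall(r"[^aeiou]", "a1 b")
print(non_vowels)
```

Bare "see" fails because the negative set still has to match some character, while "see." and "see you" succeed because the period and the space are characters too.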

Meta Characters inside Character Set

  1. Meta characters inside a character set are already escaped, so they match literally
  2. /h[a.]t/ matches hat and h.t but not hot
    1. exceptions that keep their special meaning and must be escaped: ] - ^ \
      1. ] means that we are closing character set
      2. - means range
      3. ^ means negative character set
      4. \ means escaping
  3. Shorthand Character Set
    1. \d for digit for [0-9]
    2. \w for word character [a-zA-Z0-9_]
      1. _ is a word character but hyphen is not
    3. \s for white space [ \t\r\n]
    4. \D for not digit [^0-9]
    5. \W for not a word character [^a-zA-Z0-9_]
    6. \S for not white space [^ \t\r\n]
    7. [^\d\s] neither a digit nor white space
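
A sketch of both ideas from this section, assuming Python's re module. One caveat for Python 3: on text strings \d, \w and \s also match non-ASCII digits, letters and white space unless the re.ASCII flag is passed, so the bracket equivalents listed above are exact only for ASCII input like the samples here.

```python
import re

# Inside a set the dot is a literal dot, not a wild card.
hat = [w for w in ["hat", "h.t", "hot"] if re.search(r"h[a.]t", w)]

# The four exceptions need a backslash to be matched literally.
specials = re.findall(r"[\]\-\^\\]", r"a]b-c^d\e")

sample = "a1_b-2"
digits = re.findall(r"\d", sample)      # digits only
word_chars = re.findall(r"\w", sample)  # letters, digits, underscore
non_word = re.findall(r"\W", sample)    # the hyphen is not a word character

# \s covers the space as well as tab and newline.
spaces = re.findall(r"\s", "a b\tc\n")

# Shorthands can be used inside a (negative) set.
neither = re.findall(r"[^\d\s]", "a1 b2")
print(hat, specials, digits, word_chars, non_word, spaces, neither)
```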

Repetition Metacharacters

  • * Preceding Item, zero or more times
  • + Preceding Item, one or more times
  • ? Preceding Item, zero or one time
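
To see the three side by side, here is a minimal sketch assuming Python's re module. The patterns and inputs are invented; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["ac", "abc", "abbbc"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = accepted(r"ab*c")      # b zero or more times
plus = accepted(r"ab+c")      # b one or more times
question = accepted(r"ab?c")  # b zero or one time
print(star, plus, question)
```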

I find these repetition meta characters hard to remember and I often forget the difference between them. I found a nice Stack Overflow answer which explains it as follows:

" In RegEx, {i,f} means "between i to f matches". Let's take a look at the following examples:

  • {3,7} means between 3 to 7 matches
  • {,10} means up to 10 matches with no lower limit (i.e. the low limit is 0)
  • {3,} means at least 3 matches with no upper limit (i.e. the high limit is infinity)
  • {,} means no upper limit or lower limit for the number of matches (i.e. the lower limit is 0 and the upper limit is infinity)
  • {5} means exactly 5

Most good languages contain abbreviations, so does RegEx:

  • + is the shorthand for {1,}
  • * is the shorthand for {,}
  • ? is the shorthand for {,1}

This means + requires at least 1 match, while * accepts any number of matches or no matches at all, and ? accepts no more than 1 match or zero matches."
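
The counts can be verified with a small sketch, assuming Python's re module. Python accepts a missing lower bound such as {,10}, but this is not universal: some other engines, JavaScript among them, treat a quantifier with a missing lower bound as literal text, so spelling out the zero as in {0,10} is the safer form. The helper name lengths is hypothetical.

```python
import re

def lengths(pattern, limit=12):
    """Return which run lengths of "a" (0 to limit-1) the pattern matches in full."""
    return [n for n in range(limit) if re.fullmatch(pattern, "a" * n)]

between = lengths(r"a{3,7}")   # 3 to 7 repetitions
up_to = lengths(r"a{,10}")     # 0 to 10 repetitions (Python syntax)
at_least = lengths(r"a{3,}")   # 3 or more repetitions
exactly = lengths(r"a{5}")     # exactly 5 repetitions

# The shorthands behave like their explicit counterparts.
plus_same = lengths(r"a+") == lengths(r"a{1,}")
star_same = lengths(r"a*") == lengths(r"a{0,}")
question_same = lengths(r"a?") == lengths(r"a{0,1}")
print(between, exactly, plus_same, star_same, question_same)
```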