Regular Expressions: A Head Start Guide

Regular expressions are powerful demons that become very useful once we understand how they work. They are used to find and extract data that follows some pattern.

By pattern I mean something like an

email address: [start with text][number/dot/underscore][@ symbol][text][dot][text]
mobile number: [10 digit Numbers]

or anything else with a specific structure. Regular expressions are commonly known as regex or regexp, so if you see regex, regexp or regular expression, don't get confused: they all mean the same thing.

What am I going to get from this article?

This article will give you a head start in creating your own regex and in understanding what you are doing, instead of just guessing and banging your head against the wall after every unexpected output.

A quick note for non-programmers: regex is not only for developers. If you are a computer user, you will find it very useful for quickly searching for the things you are interested in, in text editors or anywhere else. It is like finding a needle in a haystack.

Now let's discuss the real power of regex.

Irrespective of your language of choice, be it Java, PHP, Python or anything else, this guide will help you a lot.

Simple Regex: Let's start our journey with the simplest regex.

I am in Notepad++ with some demo paragraph text. To start searching with regular expressions, hit Ctrl+F (or Command+F) and select the Regular expression radio button. Simply typing ab will find the first occurrence of ab. You can click Find Next for the next match; in programming languages we need to call the regex engine again, or use a function that returns all matches, to get the following occurrences.
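To make the "first match" versus "next match" idea concrete, here is a minimal sketch in Python using the standard re module. The sample sentence is made up for illustration; re.search stops at the first match, while re.finditer steps through all of them, much like pressing Find Next repeatedly.

```python
import re

# Hypothetical demo text, not taken from the article.
text = "a cab grabbed the lab tab"

# re.search returns only the first match (or None if nothing matches).
first = re.search(r"ab", text)
first_pos = first.start()

# re.finditer yields every non-overlapping match from left to right.
all_pos = [m.start() for m in re.finditer(r"ab", text)]

print(first_pos)  # position of the first "ab"
print(all_pos)    # positions of every "ab"
```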

Characters with special meaning:

Backslash => \
Dollar => $
Caret => ^
Period => .
Pipe => |
Question mark => ?
Asterisk => *
Plus => +
Opening and closing parenthesis => ( )
Opening bracket => [
Opening curly brace => {

These characters are used to enhance our search query and are also known as meta-characters. If we want to search for one of them literally, we need to escape it with a backslash '\' so that the regex engine does not treat it as special. For example, to search for the text foo*bar we need to write foo\*bar, because the asterisk alone has a special meaning. Hence we escape it with a backslash.

Use case 1: Searching for words with similar spellings.

Searching for b[ue]tter will find butter as well as better, because the brackets match exactly one of the characters inside them. It will not find buetter, beutter or any other combination.

Use case 2: Using a hyphen inside brackets to specify a range.

example: To specify a single digit we can say [0-9]. This will find any single digit from 0 to 9. If you want to search for a mobile number such as 9876543210, which is 10 digits long, you can specify the number of repetitions in curly braces: [0-9]{10} will find every run of 10 digits. So, are you getting the hang of all this regex jargon? Great, let's proceed further.
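The effect of escaping can be checked with a small Python sketch. The sample text is invented for the demonstration; re.escape is a standard-library helper that adds the backslashes for us when the search text comes from somewhere else.

```python
import re

# Hypothetical sample text containing a literal asterisk.
text = "foobar fooobar foo*bar"

# Unescaped: * is a meta-character, so the asterisk in the text is not matched.
unescaped = re.findall(r"foo*bar", text)

# Escaped: \* means a literal asterisk.
escaped = re.findall(r"foo\*bar", text)

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.findall(re.escape("foo*bar"), text)

print(unescaped)
print(escaped)
print(auto_escaped)
```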
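A quick way to try this out is the following Python sketch. The word list is made up, and re.fullmatch is used so that each word has to match the pattern from start to end.

```python
import re

# Hypothetical candidate words.
words = ["butter", "better", "buetter", "beutter", "bitter"]

# [ue] stands for exactly one character: either u or e.
matched = [w for w in words if re.fullmatch(r"b[ue]tter", w)]

print(matched)
```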

Question: How would you search for a two-digit hexadecimal number? Answer: [0-9a-fA-F]{2}
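Ranges and repetition counts can be tested with a short Python sketch. The candidate strings are invented for illustration. re.fullmatch is used here as an assumption about what we want: the whole string must be exactly 10 digits (or exactly 2 hex digits). A plain search would also find 10 digits inside a longer number.

```python
import re

# Hypothetical inputs: only the first one is exactly 10 digits.
candidates = ["9876543210", "98765", "98765432101", "98765abcde"]
ten_digit = [c for c in candidates if re.fullmatch(r"[0-9]{10}", c)]

# Hypothetical inputs for the two-digit hexadecimal pattern.
hex_candidates = ["7F", "a0", "G1", "123"]
two_hex = [c for c in hex_candidates if re.fullmatch(r"[0-9a-fA-F]{2}", c)]

print(ten_digit)
print(two_hex)
```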

Use case 3: Searching by position using anchors.

^ is used to find a match at the start of the string.
$ is used to find a match at the end of the string.
\b is used to find a match at a word boundary, i.e. at the start or end of a word.
\B is used to find a match wherever \b cannot, i.e. inside a word.

example:

Searching for ^tweak will match tweak only at the beginning of the text.
Searching for tweak$ will match tweak only at the end of the text.
Searching for \btweak will match tweak at the start of a word, as in tweaking.
Searching for \Btweak will match tweak only inside a word, as in retweak.
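The four anchors can be compared side by side in Python. In the re module, \b matches at a word boundary (the edge between a word character and a non-word character) and \B matches everywhere else. The sentence below is a made-up example, and the lists hold the starting index of each match.

```python
import re

# Hypothetical text: "tweak" at the start, inside "retweak", and at the end.
text = "tweak the retweak settings then tweak"

at_start = [m.start() for m in re.finditer(r"^tweak", text)]
at_end = [m.start() for m in re.finditer(r"tweak$", text)]

# \b: "tweak" must begin at a word boundary.
word_start = [m.start() for m in re.finditer(r"\btweak", text)]

# \B: "tweak" must NOT begin at a word boundary, so only "retweak" qualifies.
inside_word = [m.start() for m in re.finditer(r"\Btweak", text)]

print(at_start, at_end, word_start, inside_word)
```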

Some shorthand characters

\s equivalent to [ \t\r\n\f] i.e. it matches a space, a tab, a line break, or a form feed.
\d equivalent to [0-9] i.e. it matches a single digit.
\w equivalent to [a-zA-Z0-9_] i.e. it matches a single letter, digit, or underscore.

and we can mix them too. So [\da-fA-F] matches any single hexadecimal digit. These shorthands also have negated versions:

\S is equivalent to [^\s]
\D is equivalent to [^\d]
\W is equivalent to [^\w]

We can also search for some special characters like:

\r for carriage return
\n for line feed
\t to match a tab character
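Here is a small Python sketch of the shorthand classes. One assumption to flag: in Python 3, \d, \w and \s also match non-ASCII characters by default, so the re.ASCII flag is passed to keep them in line with the bracket equivalents listed above. The sample string is made up.

```python
import re

# Hypothetical text with a space, a tab and a line feed.
text = "Room 42\tis_open\n"

digits = re.findall(r"\d", text, re.ASCII)       # single digits
whitespace = re.findall(r"\s", text, re.ASCII)   # space, tab, line feed
words = re.findall(r"\w+", text, re.ASCII)       # runs of letters, digits, underscore

# The negated shorthand \W behaves like the negated class [^\w].
non_word_short = re.findall(r"\W", text, re.ASCII)
non_word_class = re.findall(r"[^\w]", text, re.ASCII)

print(digits, whitespace, words)
```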

Use case 4: Searching with the wildcard dot `.`.

The dot character acts like a wildcard that matches any single character except a line break.

example: b.tter matches butter, bitter, better, etc.
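The wildcard behaviour, including the line-break exception, can be sketched in Python as below. The word list is invented; re.DOTALL is the standard flag that lets the dot match a line break as well.

```python
import re

# Hypothetical candidates: one contains a line feed, one has no middle letter.
words = ["butter", "bitter", "better", "b\ntter", "btter"]

# By default the dot matches any single character except a line break.
matched = [w for w in words if re.fullmatch(r"b.tter", w)]

# With re.DOTALL the dot matches a line break too.
matched_dotall = [w for w in words if re.fullmatch(r"b.tter", w, re.DOTALL)]

print(matched)
print(matched_dotall)
```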

Use case 5: Searching with the 'OR' check.

We can check whether the text matches any one of a set of alternatives using the pipe symbol.

search for single words:
Using let|us|tweak will select let in 'let us tweak'. If the search is applied again it will find us, and then tweak.

search for groups of words:

To limit the alternatives to one part of the pattern we can use parentheses, like (peanut|normal) butter to find occurrences of both normal butter and peanut butter. Note that there are no spaces around the pipe, since a space inside the pattern is matched literally.

example: Bat(man|sman|tle) will find Batman, Batsman and Battle one by one.
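Alternation with and without parentheses might look like this in Python. The sentences are made up for the demo. re.finditer with m.group() is used on purpose: when a pattern contains parentheses, re.findall returns only the part inside them, which is easy to trip over.

```python
import re

# Plain alternation: each call finds whichever alternative comes next.
simple = [m.group() for m in re.finditer(r"let|us|tweak", "let us tweak")]

# Grouped alternation: the pipe only applies inside the parentheses.
butter_text = "peanut butter and normal butter"  # hypothetical text
butters = [m.group() for m in re.finditer(r"(peanut|normal) butter", butter_text)]

bat_text = "Batman met a Batsman before the Battle"  # hypothetical text
bats = [m.group() for m in re.finditer(r"Bat(man|sman|tle)", bat_text)]

print(simple)
print(butters)
print(bats)
```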

Use case 6: Searching with optional characters and repetition.

? tells that the preceding token is optional.
* tells that the preceding token can be repeated zero or more times.
+ tells that the preceding token can be repeated one or more times.

example 1: <[A-Za-z][A-Za-z0-9]*> matches simple HTML opening tags like <H1>.

You must be thinking, why not <[A-Za-z0-9]*>? Because it also matches invalid tags like <123>.

example 2: searching for flavou?r matches both flavour and flavor.

We can use curly braces to specify the amount of repetition: \b[1-9][0-9]{3}\b matches a number between 1000 and 9999, and \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
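The optional and repetition patterns from this use case can be tried together in a short Python sketch. All the input strings are invented for illustration.

```python
import re

# ? makes the preceding u optional.
flavours = re.findall(r"flavou?r", "flavor flavour flavouur")

# A letter first, then zero or more letters or digits.
sample_tags = "<H1> <123> <b> <>"  # hypothetical input
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", sample_tags)
loose_tags = re.findall(r"<[A-Za-z0-9]*>", sample_tags)

# {3} means exactly three repetitions; {2,4} means two to four.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 4321 9999 10000 0123")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(flavours, strict_tags, loose_tags, four_digit, three_to_five)
```

The loose pattern accepts both <123> and the empty <>, which is why the stricter version insists on a leading letter.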

While <.+> can be used to search for HTML tags in <input type="text" id="text_field"></input>, the dot also matches < and >, and + is greedy, so the search selects everything from the first < to the last > as a single match. It is recommended to avoid dots whenever possible, so in this case we can use <[^<>]+> to find each HTML tag separately, which also saves a few CPU cycles for searches performed on bulk data.

Miscellaneous

Suppose you have to search for any three-character string starting with pe but not ending with 't'. You can type pe[^t]; the search results will then not contain pet but will match pen and the pep in pepper.
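The difference between the greedy dot and a negated character class, written <[^<>]+>, shows up clearly in Python. This is only a sketch on the article's sample markup; for real HTML a proper parser is usually the safer choice.

```python
import re

html = '<input type="text" id="text_field"></input>'

# Greedy: .+ runs to the last > in the text, so both tags come back as one match.
greedy = re.findall(r"<.+>", html)

# Negated class: [^<>]+ cannot cross a < or >, so each tag is matched on its own.
tags = re.findall(r"<[^<>]+>", html)

print(greedy)
print(tags)
```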
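A caret placed first inside square brackets negates the class, so pe[^t] means pe followed by any one character that is not t. A minimal Python check, with a made-up word list:

```python
import re

# Hypothetical text to search.
text = "pet pen pepper"

# [^t] matches exactly one character that is not t.
found = re.findall(r"pe[^t]", text)

print(found)
```

Note that pepper contributes two separate matches (pep and per), because the search continues after each match.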

credits : regular-expressions.info