What a regex actually is
A regular expression is a pattern that describes a set of strings. When you run a regex against an input, the engine asks: 'is there a substring here that matches this pattern, and if so, where?' That single question powers searching, validation, splitting, and substitution. The pattern itself is just a small language with literals, operators, and grouping — once you have the building blocks, the rest is composition.
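As a rough sketch of those four uses, here is how they might look in JavaScript (chosen because the flag letters used later in this guide are JavaScript's). The sample strings and variable names are invented for illustration.

```javascript
// Illustrative input; the pattern matches a run of digits.
const input = "order 66, then 1138";
const digitRun = /\d+/;

// Searching: is there a match, and where does it start?
const found = digitRun.exec(input);
const firstMatch = found[0];     // "66"
const firstIndex = found.index;  // 6

// Validation: does the whole string consist of digits?
const isAllDigits = /^\d+$/.test("1138");  // true

// Splitting: break on a comma followed by optional whitespace.
const parts = "a, b,c".split(/,\s*/);  // ["a", "b", "c"]

// Substitution: replace every run of digits with "#".
const masked = input.replace(/\d+/g, "#");  // "order #, then #"

console.log(firstMatch, firstIndex, isAllDigits, parts, masked);
```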
Literals and escapes
Most characters in a regex match themselves: 'cat' matches the substring 'cat' anywhere in the input. A handful of characters have special meaning and must be escaped with a backslash to match literally: . * + ? ( ) [ ] { } | ^ $ and the backslash itself (\). So '3\.14' matches the literal string '3.14', while '3.14' would match '3' followed by any single character followed by '14' (the dot is a metacharacter), which means it also matches '3x14'.
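The difference between an escaped and an unescaped dot can be checked directly. The sketch below also includes escapeRegex, a hypothetical helper (not a built-in) that backslash-escapes the metacharacters listed above so arbitrary text can be matched literally.

```javascript
const escapedDot = /3\.14/;  // escaped dot: matches only a literal "."
const bareDot = /3.14/;      // bare dot: matches any character except a newline

const a = escapedDot.test("3.14");  // true
const b = escapedDot.test("3x14");  // false
const c = bareDot.test("3x14");     // true

// Hypothetical helper: put a backslash in front of every metacharacter.
function escapeRegex(text) {
  return text.replace(/[.*+?()[\]{}|^$\\]/g, "\\$&");
}

// Build a pattern from text that should be taken literally.
const literal = new RegExp(escapeRegex("3.14 (approx)"));
const d = literal.test("pi is 3.14 (approx)");  // true
const e = literal.test("pi is 3x14 (approx)");  // false

console.log(a, b, c, d, e);
```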
Character classes
- [abc] — any one of a, b, or c.
- [a-z] — any lowercase ASCII letter (range shortcut; accented letters such as é are not included).
- [^0-9] — any character that is NOT a digit (negation with ^ at the start).
- \d — any digit, equivalent to [0-9].
- \w — word character: letters, digits, underscore.
- \s — whitespace: spaces, tabs, newlines.
- . — any character except a newline by default (use the 's' flag to include newlines).
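To see what each class picks out, the following sketch runs them against one made-up sample string. It assumes JavaScript's behavior, where \d and \w are ASCII-only; other engines may differ.

```javascript
// Made-up sample containing letters, digits, a comma, a tab and an underscore.
const sample = "Room 42b,\tfloor_3";

const abc = sample.match(/[abc]/g);         // ["b"]
const lower = sample.match(/[a-z]+/g);      // ["oom", "b", "floor"]
const nonDigit = sample.match(/[^0-9]+/g);  // ["Room ", "b,\tfloor_"]
const digit = sample.match(/\d+/g);         // ["42", "3"]
const word = sample.match(/\w+/g);          // ["Room", "42b", "floor_3"]
const space = sample.match(/\s/g);          // [" ", "\t"]

// The dot skips newlines unless the s flag is set.
const dotDefault = /a.c/.test("a\nc");  // false
const dotAll = /a.c/s.test("a\nc");     // true

console.log(abc, lower, nonDigit, digit, word, space, dotDefault, dotAll);
```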
Quantifiers
- * — zero or more of the previous element.
- + — one or more of the previous element.
- ? — zero or one of the previous element (makes it optional).
- {3} — exactly three.
- {3,} — three or more.
- {3,5} — between three and five.
- By default, quantifiers are 'greedy' — they match as much as possible. Add ? after the quantifier to make it 'lazy' (.*? matches as little as possible).
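The quantifiers above can be tried one at a time. This is a small sketch with invented inputs; the patterns are anchored with ^ and $ so that each test applies to the whole string.

```javascript
// Counting quantifiers, tested against whole strings.
const starEmpty = /^a*$/.test("");         // true: zero occurrences are allowed
const plusEmpty = /^a+$/.test("");         // false: at least one is required
const optionalU = /^colou?r$/.test("color") && /^colou?r$/.test("colour");  // true
const exactly3 = /^\d{3}$/.test("1234");   // false: four digits, not three
const atLeast3 = /^\d{3,}$/.test("1234");  // true
const from3to5 = /^\d{3,5}$/.test("123456");  // false: six digits is too many

// Greedy versus lazy on text containing two quoted phrases.
const quoted = 'say "one" and "two"';
const greedy = quoted.match(/".*"/)[0];  // '"one" and "two"'
const lazy = quoted.match(/".*?"/)[0];   // '"one"'

console.log(starEmpty, plusEmpty, optionalU, exactly3, atLeast3, from3to5, greedy, lazy);
```

The greedy version runs to the last quote in the input, while the lazy version stops at the first closing quote it can reach.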
Anchors
Anchors do not match characters; they match positions. '^' matches the start of the string (or of each line, with the 'm' flag), and '$' matches the end of the string (or of each line, with the same flag). '\b' matches a word boundary: the position between a word character and a non-word character, or between a word character and the start or end of the input. So '^hello' matches only when 'hello' is at the start, and '\bcat\b' matches 'cat' as a whole word, not the 'cat' inside 'category' or 'concatenate'.
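A short sketch of the anchors in action, using invented test strings:

```javascript
// ^ and $ pin the match to the start and end of the input.
const startsWithHello = /^hello/.test("hello world");  // true
const helloLater = /^hello/.test("say hello");         // false
const endsWithWorld = /world$/.test("hello world");    // true

// With the m flag, ^ also matches at the start of each line.
const twoLines = "first\nsecond";
const withoutM = twoLines.match(/^\w+/g);  // ["first"]
const withM = twoLines.match(/^\w+/gm);    // ["first", "second"]

// \b requires a word boundary on each side of "cat".
const wholeWord = /\bcat\b/.test("the cat sat");    // true
const aloneInInput = /\bcat\b/.test("cat");         // true: input edges count
const insideStart = /\bcat\b/.test("category");     // false
const insideMiddle = /\bcat\b/.test("concatenate"); // false

console.log(startsWithHello, helloLater, endsWithWorld, withoutM, withM);
console.log(wholeWord, aloneInInput, insideStart, insideMiddle);
```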
Groups and alternation
Parentheses '(...)' group part of a pattern so that quantifiers and alternation apply to the whole group, and they also capture the matched text for use later. '(cat|dog)' matches either 'cat' or 'dog'. 'gr(a|e)y' matches 'gray' or 'grey'. Non-capturing groups use '(?:...)' when you want grouping without capturing — useful for performance and to keep capture indices clean.
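The sketch below shows grouping, capturing, and non-capturing groups side by side. The inputs are made up; in JavaScript, the array returned by exec holds the full match at index 0 and the captures after it.

```javascript
// Alternation inside a capturing group.
const pet = /(cat|dog)/.exec("hot dog");
const petName = pet[1];  // "dog"

// The group limits the alternation to a single letter.
const gray = /^gr(a|e)y$/.test("gray");  // true
const grey = /^gr(a|e)y$/.test("grey");  // true
const groy = /^gr(a|e)y$/.test("groy");  // false

// Each pair of parentheses captures its own piece of the match.
const dateMatch = /(\d{4})-(\d{2})-(\d{2})/.exec("2024-05-17");
const year = dateMatch[1];   // "2024"
const month = dateMatch[2];  // "05"

// A non-capturing group still groups, but leaves no capture behind,
// so the verb is capture 1 rather than capture 2.
const sentence = /(?:cat|dog)s? (\w+)/.exec("dogs bark");
const verb = sentence[1];              // "bark"
const slotCount = sentence.length;     // 2: full match plus one capture

// A quantifier after a group repeats the whole group.
const repeatedGroup = /^(?:ab)+$/.test("ababab");  // true
const repeatedChar = /^ab+$/.test("ababab");       // false: only "b" repeats

console.log(petName, gray, grey, groy, year, month, verb, slotCount, repeatedGroup, repeatedChar);
```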
Flags
- i — case-insensitive: /hello/i matches 'Hello', 'HELLO', etc.
- g — global: find all matches, not just the first.
- m — multiline: ^ and $ match start/end of each line, not just the whole input.
- s — dotall: . matches newlines too.
- u — Unicode: enables full Unicode handling (essential for non-ASCII input).
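Each flag can be observed in isolation. This is a minimal sketch assuming a JavaScript engine recent enough to support the s and u flags; the emoji is written as an escape so the example does not depend on file encoding.

```javascript
// i: case-insensitive matching.
const caseInsensitive = /hello/i.test("HELLO");  // true

// g: without it, match returns only the first match.
const firstOnly = "a1b2c3".match(/\d/)[0];  // "1"
const allDigits = "a1b2c3".match(/\d/g);    // ["1", "2", "3"]

// m: ^ and $ apply to each line.
const perLine = "x\ny".match(/^\w$/gm);  // ["x", "y"]

// s: the dot also matches a newline.
const dotAll = /a.b/s.test("a\nb");  // true

// u: the dot consumes a whole code point instead of half a surrogate pair.
const emoji = "\u{1F600}";  // one character, stored as two UTF-16 code units
const withoutU = /^.$/.test(emoji);  // false
const withU = /^.$/u.test(emoji);    // true

console.log(caseInsensitive, firstOnly, allDigits, perLine, dotAll, withoutU, withU);
```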
Common practical patterns
- Trim whitespace at start and end (replace matches with an empty string, using the g flag so both ends are stripped): ^\s+|\s+$
- Match an integer (possibly negative): -?\d+
- Match a decimal number: -?\d+(?:\.\d+)?
- Loose email check (use a proper library for serious validation): \S+@\S+\.\S+
- URL: https?://\S+
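- Hex color (add the i flag to accept uppercase digits): #(?:[0-9a-f]{3}|[0-9a-f]{6})\b
- ISO 8601 calendar date (checks the shape only, so 2024-99-99 also matches): \d{4}-\d{2}-\d{2}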
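Here is one way the patterns in this list might be used from JavaScript. The input strings are invented, and the forward slashes in the URL pattern are escaped because the pattern is written as a regex literal.

```javascript
// Trim: the g flag lets the pattern remove both the leading and trailing run.
const trimmed = "  hi there \n".replace(/^\s+|\s+$/g, "");  // "hi there"

// Integers and decimals, possibly negative.
const integers = "-12 apples, 7 pears".match(/-?\d+/g);  // ["-12", "7"]
const decimals = "3.14 and -2 and 10.5".match(/-?\d+(?:\.\d+)?/g);  // ["3.14", "-2", "10.5"]

// Loose email shape check only; this is not real validation.
const looksLikeEmail = /\S+@\S+\.\S+/.test("user@example.com");  // true

// URLs: everything up to the next whitespace character.
const urls = "see https://example.com/a and http://example.org"
  .match(/https?:\/\/\S+/g);  // ["https://example.com/a", "http://example.org"]

// Hex colors: 3 or 6 hex digits; "#12345" has 5, so it is skipped.
const colors = "#fff #12AB9F #12345"
  .match(/#(?:[0-9a-f]{3}|[0-9a-f]{6})\b/gi);  // ["#fff", "#12AB9F"]

// Date shape: four digits, two digits, two digits.
const dates = "due 2024-05-17".match(/\d{4}-\d{2}-\d{2}/g);  // ["2024-05-17"]

console.log(trimmed, integers, decimals, looksLikeEmail, urls, colors, dates);
```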
Things regex is bad at
Regex is not the right tool for parsing nested structures like HTML, XML, or programming languages — those have rules a flat pattern cannot express. Validating an email address strictly against RFC 5322 requires a parser, not a regex. For URL parsing, file path manipulation, or JSON, use your language's dedicated library. Use regex for matching patterns in text where the grammar is regular.
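For comparison, this sketch uses two dedicated parsers that ship with browsers and Node.js, the URL class and JSON.parse, in place of a hand-written pattern. The URL and JSON text are illustrative.

```javascript
// URL parsing: the built-in URL class splits the address into its parts.
const parsed = new URL("https://example.com:8080/docs/page?q=regex#top");
const host = parsed.hostname;               // "example.com"
const port = parsed.port;                   // "8080"
const path = parsed.pathname;               // "/docs/page"
const query = parsed.searchParams.get("q"); // "regex"

// JSON: nesting is handled by the parser, which a flat pattern cannot do.
const data = JSON.parse('{"a": {"b": [1, 2, {"c": 3}]}}');
const nested = data.a.b[2].c;  // 3

console.log(host, port, path, query, nested);
```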
Practice and debug iteratively
Even experienced engineers write regex incrementally. Start with the most distinctive part of the pattern, test against examples that should match and examples that should not, then add boundary conditions and edge cases. A regex tester (like ours, which runs entirely in your browser) gives instant feedback on what is matching and what is being captured — far more efficient than writing the regex and then running your whole program.
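The same habit can be carried into code as a small table of examples. The sketch below is one possible shape, not a prescribed method: unexpectedResults is a hypothetical helper, and the example lists are invented for a decimal-number pattern.

```javascript
// Examples that should and should not be accepted as decimal numbers.
const shouldMatch = ["0", "-7", "3.14"];
const shouldNotMatch = ["", "abc", "3.", ".5", "1e5"];

// Hypothetical helper: list every example the pattern gets wrong.
function unexpectedResults(pattern, accept, reject) {
  const missed = accept.filter((text) => !pattern.test(text));
  const leaked = reject.filter((text) => pattern.test(text));
  return missed.concat(leaked);
}

// First draft: the distinctive part only, with no anchors.
const draft = /-?\d+/;
const draftProblems = unexpectedResults(draft, shouldMatch, shouldNotMatch);
// ["3.", ".5", "1e5"]: a substring match is enough, so these leak through

// Second pass: add the optional fraction and anchor both ends.
const refined = /^-?\d+(?:\.\d+)?$/;
const refinedProblems = unexpectedResults(refined, shouldMatch, shouldNotMatch);
// []: every example now behaves as intended

console.log(draftProblems, refinedProblems);
```

Neither pattern uses the g flag, so calling test repeatedly is safe here; a global regex keeps a lastIndex between calls and would need resetting.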