Regex Basics

Regular expressions are one of the highest-leverage tools any developer or analyst can learn. A 30-character regex can replace dozens of lines of imperative code, validate user input, find patterns in logs, or rewrite filenames in bulk. They are also famously cryptic. This guide builds up the practical pieces — literals, classes, quantifiers, anchors, groups, and flags — so that you can read most regexes you encounter and write the ones you need without consulting reference cards for every problem.

What a regex actually is

A regular expression is a pattern that describes a set of strings. When you run a regex against an input, the engine asks: 'is there a substring here that matches this pattern, and if so, where?' That single question powers searching, validation, splitting, and substitution. The pattern itself is just a small language with literals, operators, and grouping — once you have the building blocks, the rest is composition.
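As a minimal sketch of those four uses, here is one pattern applied four ways. Python's re module is an assumption made for these examples (the guide itself is engine-agnostic), and the sample text is invented for illustration.

```python
import re

text = "order 66 shipped, order 7 pending"  # hypothetical sample input

# Searching: find the first substring that matches, and where it is.
m = re.search(r"\d+", text)
first_number = m.group()   # "66"
first_span = m.span()      # (6, 8): start and end offsets in text

# Validation: does the whole input match the pattern?
is_number = re.fullmatch(r"\d+", "2024") is not None   # True

# Splitting: cut the input wherever the pattern matches.
parts = re.split(r",\s*", text)

# Substitution: replace every match with something else.
masked = re.sub(r"\d+", "#", text)   # "order # shipped, order # pending"
```

Other languages expose the same four operations under different names; only the pattern syntax carries over unchanged.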

Literals and escapes

Most characters in a regex match themselves: 'cat' matches the substring 'cat' anywhere in the input. A handful of characters have special meaning and must be escaped with a backslash to match literally: . * + ? ( ) [ ] { } | ^ $ and the backslash itself. So '3\.14' matches only the literal string '3.14', while '3.14' matches '3', then any single character, then '14' (the dot is a metacharacter), so it also matches '3x14'.
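The difference between an escaped and an unescaped dot can be checked directly. This is a small sketch in Python's re module (an assumption; any engine behaves the same way for this pattern), with made-up test strings.

```python
import re

candidates = ["3.14", "3x14", "3-14"]  # hypothetical inputs

# An unescaped dot is a metacharacter: it matches any single character.
loose_hits = [s for s in candidates if re.fullmatch(r"3.14", s)]

# An escaped dot matches only a literal '.'.
strict_hits = [s for s in candidates if re.fullmatch(r"3\.14", s)]

# re.escape builds a literal-matching pattern from arbitrary text,
# which is handy when the text comes from user input.
escaped = re.escape("3.14")
```

Here loose_hits contains all three candidates, while strict_hits contains only '3.14'.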

Character classes

  • [abc] — any one of a, b, or c.
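  • [a-z] — any lowercase ASCII letter (range shortcut).
  • [^0-9] — any character that is NOT a digit (negation with ^ at the start).
  • \d — any digit. Equivalent to [0-9] in JavaScript; Unicode-aware engines such as Python 3 and .NET also match digits from other scripts by default.
  • \w — word character: letters, digits, underscore (ASCII-only in some engines, Unicode-aware in others).
  • \s — whitespace: spaces, tabs, newlines, and similar characters such as carriage returns.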
  • . — any character except a newline by default (use the 's' flag to include newlines).
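To see each class in action, the sketch below runs them over one short string. It assumes Python's re module; the sample string is invented for illustration.

```python
import re

sample = "id_7 = A9"  # hypothetical input

lowercase = re.findall(r"[a-z]", sample)            # ['i', 'd']
non_digits = "".join(re.findall(r"[^0-9]", sample)) # 'id_ = A'
digits = re.findall(r"\d", sample)                  # ['7', '9']
words = re.findall(r"\w+", sample)                  # ['id_7', 'A9']
spaces = re.findall(r"\s", sample)                  # [' ', ' ']

# The dot skips newlines unless dotall mode is on (re.DOTALL in Python).
dot_default = re.findall(r".", "a\nb")              # ['a', 'b']
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL) # ['a', '\n', 'b']
```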

Quantifiers

  • * — zero or more of the previous element.
  • + — one or more of the previous element.
  • ? — zero or one of the previous element (makes it optional).
  • {3} — exactly three.
  • {3,} — three or more.
  • {3,5} — between three and five, inclusive.
  • By default, quantifiers are 'greedy' — they match as much as possible. Add ? after the quantifier to make it 'lazy' (.*? matches as little as possible).
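The quantifiers above can be compared side by side. This is a sketch assuming Python's re module, with invented inputs; fullmatch is used so that each candidate must match in its entirety.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

os_ = ["gl", "glo", "gloo"]
star = matching(r"glo*", os_)   # ['gl', 'glo', 'gloo']
plus = matching(r"glo+", os_)   # ['glo', 'gloo']
opt = matching(r"glo?", os_)    # ['gl', 'glo']

nums = ["12", "123", "1234", "12345", "123456"]
exactly_3 = matching(r"\d{3}", nums)     # ['123']
at_least_3 = matching(r"\d{3,}", nums)   # ['123', '1234', '12345', '123456']
three_to_5 = matching(r"\d{3,5}", nums)  # ['123', '1234', '12345']

# Greedy vs lazy on the same input.
html = "<b>one</b> and <b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)   # one match spanning both tags
lazy = re.findall(r"<b>.*?</b>", html)    # ['<b>one</b>', '<b>two</b>']
```

The greedy pattern runs to the last '</b>' in the input, which is why it returns a single long match.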

Anchors

Anchors do not match characters; they match positions. '^' matches the start of the string (or of each line, with the 'm' flag), and '$' matches the end of the string (or of each line, with the 'm' flag). Some engines, including Python and PCRE, also let '$' match just before a final trailing newline. '\b' matches a word boundary: the position between a word character and a non-word character, or between a word character and the start or end of the input. So '^hello' matches only when 'hello' is at the start, and '\bcat\b' matches 'cat' as a whole word, not the 'cat' inside 'category' or 'concatenate'.
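The sketch below checks those anchor claims, assuming Python's re module (where the 'm' flag is spelled re.MULTILINE). The sample strings are invented.

```python
import re

# ^ pins the match to the start of the input.
at_start = re.search(r"^hello", "hello world") is not None   # True
mid_string = re.search(r"^hello", "say hello") is not None   # False

# \b finds 'cat' only where it stands as a whole word.
words = "cat category concatenate cat."
whole_word_offsets = [m.start() for m in re.finditer(r"\bcat\b", words)]

# With multiline mode, ^ and $ apply to every line.
log = "ok\nerror: disk\nok\nerror: net"
without_m = re.findall(r"^error: (\w+)$", log)
with_m = re.findall(r"^error: (\w+)$", log, flags=re.MULTILINE)
```

whole_word_offsets is [0, 25]: only the first and last 'cat' qualify. without_m is empty because neither anchor position exists on the inner lines, while with_m is ['disk', 'net'].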

Groups and alternation

Parentheses '(...)' group part of a pattern so that quantifiers and alternation apply to the whole group, and they also capture the matched text for later use; captures are numbered from 1 in the order their opening parentheses appear. '(cat|dog)' matches either 'cat' or 'dog'. 'gr(a|e)y' matches 'gray' or 'grey'. Use a non-capturing group, '(?:...)', when you want grouping without capturing: it keeps capture indices clean and spares the engine from storing text you will not use.
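A short sketch of grouping, alternation, and capturing, assuming Python's re module. The version-string format in the last example is hypothetical.

```python
import re

# Alternation inside a group, with a quantifier applied after it.
pets = re.findall(r"(?:cat|dog)s?", "cats, dog, bird")   # ['cats', 'dog']

spellings = [w for w in ["gray", "grey", "groy"]
             if re.fullmatch(r"gr(a|e)y", w)]            # ['gray', 'grey']

# Capturing groups expose the pieces of a match, numbered from 1.
m = re.fullmatch(r"(\d{4})-(\d{2})", "2026-03")
year, month = m.group(1), m.group(2)                     # '2026', '03'

# A non-capturing group keeps the numbering focused on what you need:
# the optional 'v' prefix is grouped but does not take up index 1.
v = re.fullmatch(r"(?:v)?(\d+)\.(\d+)", "v1.20")
major, minor = v.groups()                                # '1', '20'
```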

Flags

  • i — case-insensitive: /hello/i matches 'Hello', 'HELLO', etc.
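  • g — global: find all matches, not just the first. This flag belongs to engines such as JavaScript and Perl; other languages expose it through separate functions (for example, Python's re.findall).
  • m — multiline: ^ and $ match start/end of each line, not just the whole input.
  • s — dotall: . matches newlines too.
  • u — Unicode: enables full Unicode handling in JavaScript (essential for non-ASCII input). Some engines, such as Python 3, are Unicode-aware by default.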
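Flag spelling varies by engine. As a sketch, here is how the same five behaviors look in Python's re module (an assumption for these examples; the letters above follow the slash-delimited /pattern/flags notation). The sample text is invented.

```python
import re

text = "Hello\nHELLO\nhello there"

# i -> re.IGNORECASE; g -> re.findall / re.finditer return every match.
all_hello = re.findall(r"hello", text, flags=re.IGNORECASE)
# re.search is the non-global counterpart: it stops at the first match.
first_hello = re.search(r"hello", text, flags=re.IGNORECASE).group()

# m -> re.MULTILINE: ^ matches at the start of every line.
one_line = re.findall(r"^h\w+", text, flags=re.IGNORECASE)
every_line = re.findall(r"^h\w+", text, flags=re.IGNORECASE | re.MULTILINE)

# s -> re.DOTALL: the dot also matches a newline.
dot_default = re.search(r"Hello.HELLO", text) is not None
dot_all = re.search(r"Hello.HELLO", text, flags=re.DOTALL) is not None

# u: Python 3 str patterns are Unicode-aware already; re.ASCII opts out.
word = "caf\u00e9"  # 'cafe' with a precomposed e-acute as the last letter
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)
```

With re.ASCII, '\w' stops before the accented letter, so ascii_words holds only 'caf'.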

Common practical patterns

  • Trim whitespace at start and end: ^\s+|\s+$ (replace every match with the empty string)
  • Match an integer (possibly negative): -?\d+
  • Match a decimal number: -?\d+(?:\.\d+)?
  • Loose email check (use a proper library for serious validation): \S+@\S+\.\S+
  • URL (loose; can swallow trailing punctuation): https?://\S+
  • Hex color: #(?:[0-9a-f]{3}|[0-9a-f]{6})\b (add the i flag to accept uppercase digits)
  • ISO 8601 date (shape only): \d{4}-\d{2}-\d{2}
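Several of these patterns can be tried on sample text as follows. This is a sketch assuming Python's re module; all inputs, including the example.com and x.org addresses, are made up.

```python
import re

trimmed = re.sub(r"^\s+|\s+$", "", "  hello world \n")   # 'hello world'

ints = re.findall(r"-?\d+", "a -12 b 7")                 # ['-12', '7']
decimals = re.findall(r"-?\d+(?:\.\d+)?", "pi 3.14, -0.5, 42")

email_ok = re.fullmatch(r"\S+@\S+\.\S+", "ana@example.com") is not None
email_bad = re.fullmatch(r"\S+@\S+\.\S+", "ana@example") is not None

# The loose URL pattern also grabs the sentence-ending period.
urls = re.findall(r"https?://\S+",
                  "see https://example.com/a?b=1 and http://x.org.")

# re.IGNORECASE lets the lowercase-only classes accept 'A'.
colors = re.findall(r"#(?:[0-9a-f]{3}|[0-9a-f]{6})\b",
                    "#fff #1A2b3c #12345", flags=re.IGNORECASE)
```

The five-digit '#12345' is rejected because the word boundary cannot fall between two hex digits.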

Things regex is bad at

Regex is not the right tool for parsing nested structures like HTML, XML, or programming languages: their arbitrarily deep nesting follows rules a flat pattern cannot express. Validating an email address strictly against RFC 5322 requires a parser, not a regex. For URL parsing, file path manipulation, or JSON, use your language's dedicated library. Use regex for matching patterns in text where the grammar is regular.
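For comparison, here is a sketch of the dedicated-library route, assuming Python's standard library (urllib.parse, pathlib, json). The URL and JSON document are hypothetical.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# URL parsing: the library already knows where each component ends.
url = urlsplit("https://example.com:8443/docs/intro.html?lang=en#top")
host = url.hostname    # 'example.com'
port = url.port        # 8443
query = url.query      # 'lang=en'

# Path manipulation: no need to hunt for the last dot or slash by hand.
path = PurePosixPath(url.path)
stem, suffix = path.stem, path.suffix   # 'intro', '.html'

# JSON: nesting of any depth is handled by the parser.
data = json.loads('{"tags": ["a", {"nested": [1, 2]}]}')
inner = data["tags"][1]["nested"]       # [1, 2]
```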

Practice and debug iteratively

Even experienced engineers write regex incrementally. Start with the most distinctive part of the pattern, test against examples that should match and examples that should not, then add boundary conditions and edge cases. A regex tester (like ours, which runs entirely in your browser) gives instant feedback on what is matching and what is being captured — far more efficient than writing the regex and then running your whole program.
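The same loop can be scripted. The sketch below assumes Python's re module; check is a hypothetical helper written for this example, and the goal (matching a signed decimal number) is chosen only for illustration.

```python
import re

def check(pattern, should_match, should_not_match):
    """Return the examples the pattern gets wrong (empty list = all pass)."""
    rx = re.compile(pattern)
    wrong = [s for s in should_match if not rx.fullmatch(s)]
    wrong += [s for s in should_not_match if rx.fullmatch(s)]
    return wrong

good = ["42", "-7", "3.14"]
bad = ["", "abc", "3.", ".5", "1.2.3"]

# Start with the most distinctive part, then grow the pattern.
step1 = check(r"\d+", good, bad)               # misses '-7' and '3.14'
step2 = check(r"-?\d+", good, bad)             # still misses '3.14'
step3 = check(r"-?\d+(?:\.\d+)?", good, bad)   # everything passes
```

Keeping the two example lists around means each change to the pattern is re-checked against every earlier case.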

FAQ

Why does my regex match more than I expect?
Almost always because of greedy quantifiers. '.*' matches as much as possible. If you want the shortest match, use '.*?'. Anchors and word boundaries also help narrow matches.
How do I match a literal dot?
Escape it with a backslash: '\.'. Inside a character class, '[.]' also works because most metacharacters lose their special meaning inside [ ].
Do different languages have different regex syntax?
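Yes, slightly. JavaScript, Python (re), PCRE, .NET, and POSIX share most features but differ on advanced ones like lookbehind, named groups, and Unicode property classes. The basics in this guide work in most modern engines, with one caveat: POSIX basic and extended regexes, as used by traditional grep and sed, do not define shorthand classes like '\d', lazy quantifiers, or non-capturing groups.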
Can I use regex to validate any input?
Use regex for shape checks (does this look like a phone number, a date, a hex color?), but not for semantic validation. A regex can tell you that '2026-13-45' is shaped like a date; a date parser will tell you it is not a real one.
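The split between shape and meaning can be sketched as two checks, assuming Python's re and datetime modules. Both function names are hypothetical.

```python
import re
from datetime import date

def looks_like_date(s):
    """Shape check only: four digits, dash, two digits, dash, two digits."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None

def is_real_date(s):
    """Shape check first, then let a date parser judge the values."""
    if not looks_like_date(s):
        return False
    try:
        date.fromisoformat(s)
    except ValueError:
        return False
    return True

shape_only = looks_like_date("2026-13-45")   # True: the shape is fine
real = is_real_date("2026-13-45")            # False: there is no month 13
leap = is_real_date("2024-02-29")            # True: 2024 is a leap year
```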

In summary

Master the building blocks — literals, classes, quantifiers, anchors, groups, flags — and you can read and write the regexes you need most days. Reach for our Regex Tester to iterate on patterns in your browser, with flag toggles and capture-group visualization, before pasting them into your code.
