Normalize whitespace in text
Collapse multiple spaces / tabs into single spaces and strip line-leading / trailing whitespace — for diff-friendly text, CSV cleanup, and config-file canonicalization.
How to normalize whitespace in text in each shell
Bash: `awk '{$1=$1; print}' input.txt`
The `$1=$1` works by SIDE EFFECT: assigning to a field forces awk to rebuild the record using OFS (output field separator, default space), collapsing internal whitespace AND stripping leading/trailing. Brevity vs clarity: `sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//'` is more explicit.
Zsh: `awk '{$1=$1; print}' input.txt`
Fish: `awk '{$1=$1; print}' input.txt`
PowerShell: `(Get-Content input.txt) -replace "\s+", " " -replace "^\s+|\s+$", ""`
pwsh `-replace` is regex by default. `\s` matches `[ \t\n\r\f\v]` plus Unicode whitespace. Chain replaces: collapse internal first, then trim ends. Alternative: `(Get-Content input.txt).Trim() -replace "\s+", " "` trims each line with `.Trim()` before collapsing (`.Trim()` is faster than regex anchors).
cmd.exe: `powershell -NoProfile -Command "(Get-Content input.txt) -replace '\s+', ' ' -replace '^ | $', ''"`
cmd has no practical built-in per-line regex transform, so shell out to PowerShell. Single quotes inside the command avoid cmd's quote escaping; the backslash in `\s` must stay single, because a doubled `\\s` reaches the regex engine as a literal backslash followed by `s`.
Equivalents listed for Bash, Zsh, Fish, PowerShell, cmd.exe.
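As a quick check that the awk idiom and the explicit sed form agree, here is a minimal sketch in POSIX shell. The sample string and the variable names (`sample`, `awk_out`, `sed_out`) are made up for illustration; it assumes a sed that accepts `-E` (GNU and BSD both do).

```shell
# Sample line with leading spaces, a tab, and runs of internal spaces.
sample="$(printf '  foo \t  bar   baz  ')"

# awk: reassigning a field rebuilds $0, joining fields with OFS (one space).
awk_out="$(printf '%s\n' "$sample" | awk '{$1=$1; print}')"

# sed: collapse every whitespace run to one space, then trim the single
# space that may remain at either end of the line.
sed_out="$(printf '%s\n' "$sample" | sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//')"

# Brackets make any surviving leading/trailing whitespace visible.
printf '[%s]\n' "$awk_out" "$sed_out"
```

Both lines should print as `[foo bar baz]`; if they differ on your system, the sed in use probably treats `-E` or `[[:space:]]` differently.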
Gotchas & notes
- **The `awk '{$1=$1; print}'` idiom** is one of the most underappreciated awk tricks. The mechanism: when awk reads a line, it splits it into `$1, $2, …` by FS (input field separator), and `$0` still holds the line exactly as read. But an assignment to a field (conventionally `$1=$1`) marks the record dirty, and awk rebuilds `$0` by joining the fields with OFS (output field separator, default single space). With the default FS, that collapses each run of spaces and tabs into a single OFS AND drops leading/trailing whitespace. One five-character assignment replaces a sed pipeline. The trade-off: it COLLAPSES INTERNAL whitespace, not just trims: `"foo    bar"` becomes `"foo bar"` (lost the gap). If you want to preserve internal whitespace, use sed `s/^[[:space:]]*//;s/[[:space:]]*$//`.
- **Tabs vs spaces**: many "normalize whitespace" tasks really mean "convert tabs to spaces (or vice versa)". `expand` converts tabs → spaces (`expand -t 4 file` sets tab stops every 4 columns). `unexpand` does the reverse (`unexpand -a file`). Both are POSIX utilities and ship on practically every Unix. pwsh: ``(Get-Content file) -replace "`t", (" " * 4)`` (this substitutes a fixed four spaces per tab and does not align to tab stops). cmd: shell out. The Unicode `U+00A0` (non-breaking space, NBSP) is a frequent trap: pwsh `\s` matches it, but sed's `[[:space:]]` and `\s` often do NOT (glibc locales leave no-break spaces out of the `space` class, and in the C locale NBSP is just two bytes). NBSP-vs-space mismatches cause "looks identical but diff says different" bugs in text copy-pasted from PDF / Word.
- **Multi-line / paragraph-fill normalization**: collapsing whitespace LINE-BY-LINE is one thing; collapsing across newlines (so wrapped paragraphs become single lines) is different. `tr "\n" " "` joins all lines but loses paragraph boundaries. `fmt -u input.txt` reflows paragraphs to ~75 chars with uniform spacing and is the cleanest tool for prose normalization. `par` (better paragraph reflow, package `par`) handles bullets and quote marks better. pwsh: `(Get-Content -Raw input.txt) -replace "(?<!\n)\n(?!\n)", " "` replaces single newlines but leaves blank-line paragraph breaks alone (`(?<!…)` is a negative lookbehind, `(?!…)` a negative lookahead). The pattern assumes LF line endings; CRLF files need `\r` handled as well.
- **Zero-width characters** (U+200B zero-width space, U+200C zero-width non-joiner, U+200D zero-width joiner, U+FEFF byte-order mark) are NOT matched by `\s` but DO affect comparison and length. They're common in copy-pasted text from web pages, Word docs, and Slack. To strip aggressively with GNU sed: `LC_ALL=C sed -E 's/\xe2\x80(\x8b|\x8c|\x8d)|\xef\xbb\xbf//g'`. Each alternative is a whole UTF-8 byte sequence; putting the individual bytes into one bracket expression would also delete those bytes from unrelated multi-byte characters. pwsh handles Unicode codepoints natively: `$s -replace "[\u200B-\u200D\uFEFF]", ""`. For "why does my diff show identical lines as different", grep for these first, since they're invisible to the eye.
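The NBSP and zero-width cleanup described in the notes above can be sketched portably by letting `printf` build the UTF-8 byte sequences from octal escapes, which avoids relying on GNU sed's `\x` extension. The function name `strip_invisible` and the variable names are hypothetical, chosen for this example; it assumes UTF-8 input and runs sed in the C locale so each sequence is matched byte for byte.

```shell
# UTF-8 byte sequences, built with POSIX printf octal escapes.
zwsp="$(printf '\342\200\213')"   # U+200B zero-width space
zwnj="$(printf '\342\200\214')"   # U+200C zero-width non-joiner
zwj="$(printf '\342\200\215')"    # U+200D zero-width joiner
bom="$(printf '\357\273\277')"    # U+FEFF byte-order mark
nbsp="$(printf '\302\240')"       # U+00A0 non-breaking space

# Hypothetical helper: delete zero-width characters, turn NBSP into a
# plain space. Reads stdin, writes stdout.
strip_invisible() {
  LC_ALL=C sed -e "s/$zwsp//g" -e "s/$zwnj//g" -e "s/$zwj//g" \
               -e "s/$bom//g" -e "s/$nbsp/ /g"
}

# Example input: "foo", a zero-width space, "bar", an NBSP, "baz".
dirty="$(printf 'foo\342\200\213bar\302\240baz')"
clean="$(printf '%s\n' "$dirty" | strip_invisible)"
printf '[%s]\n' "$clean"
```

Piping the result through the awk or sed normalizer afterwards then collapses the ordinary spaces that the NBSP replacement leaves behind.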
Related tasks
- Strip ANSI color codes from output: Remove ESC[…m terminal color sequences from text, for clean log archives, machine-parseable output, and pasting to non-color-aware destinations.
- Trim leading and trailing whitespace: Remove only the whitespace at the start and end of each line (preserving internal spaces), for cleaning user input, config-file values, and form fields.
- Dedupe lines while preserving order: Remove duplicate lines from input but KEEP the first occurrence in its original position, for unique-but-sorted-by-recency lists, `$PATH` cleanup, and history dedup.