shellmap

Find duplicate files by content

Identify files with identical contents across a directory tree (regardless of name) — for cleanup, deduplication, or media library audits.

How to find duplicate files by content in each shell

Bash (unix)
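find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

`md5sum` prints `<hash>  <filename>` per line, and `sort` orders the lines by hash. `uniq -w32` compares only the first 32 characters (the MD5 hash), and `--all-repeated=separate` prints every member of each duplicate group with a blank line between groups; plain `-d` would print only one file per group. `-exec md5sum {} +` passes many files to each `md5sum` call instead of spawning one process per file. For a faster sweep on huge trees, prefilter by size first: `find . -type f -printf "%s %p\n" | sort -n | awk '{ if (NR>1 && $1==p) { if (!s) print l; print; s=1 } else s=0; p=$1; l=$0 }'` lists every file that shares its size with another file; hash only those. `-printf` is a GNU `find` extension, and `-w` and `--all-repeated` are GNU `uniq` extensions. macOS doesn't have `md5sum`; it has `md5 -r` instead (or install GNU coreutils via `brew install coreutils`).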
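
The size prefilter and the hash pass can be chained into one pipeline. The following is a minimal sketch rather than a definitive implementation: `find_dupes` is a hypothetical helper name, and it assumes GNU `find`, `xargs`, `md5sum`, and `uniq`, plus file names that contain no newlines.

```shell
# find_dupes: hypothetical helper, not part of any tool.
# Lists byte-identical files under a directory, hashing only files whose
# size is shared with at least one other file.
# Assumes GNU find/xargs/md5sum/uniq and no newlines in file names.
find_dupes() {
  local root="${1:-.}"
  # Step 1: emit "<size><TAB><path>" for every regular file.
  # Step 2: awk keeps only the paths whose size occurs more than once.
  # Step 3: hash the survivors and print groups sharing the same MD5.
  find "$root" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
      { count[$1]++; size[NR] = $1; path[NR] = substr($0, length($1) + 2) }
      END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }
    ' |
    tr '\n' '\0' |
    xargs -0 -r md5sum |
    sort |
    uniq -w32 --all-repeated=separate
}
```

Calling `find_dupes ~/Pictures` prints one block of `<hash>  <path>` lines per duplicate group, separated by blank lines. The awk step buffers every path in memory, which is a simplification that may not suit trees with many millions of files.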

Zsh (unix)
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

Same external commands as Bash; `--all-repeated=separate` (GNU `uniq`) lists every member of each duplicate group rather than one representative. Two robust tools worth installing for repeated dedup work: `fdupes` (cross-platform, fast, has an interactive deletion mode) and `rmlint` (also handles symlink replacement, hardlink replacement, and JSON reporting). Install with `brew install fdupes` on macOS or `apt install fdupes` on Debian. Once installed, `fdupes -r .` is the one-liner.
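
Where installing `fdupes` is not an option, a review-before-delete step can be approximated with the same externals. This is a rough sketch, not a replacement for those tools: `plan_dedup` is a hypothetical helper name, it only prints what it would remove, and it assumes GNU `md5sum` and file names without newlines or backslashes (which `md5sum` escapes in its output).

```shell
# plan_dedup: hypothetical helper; prints a deletion plan, deletes nothing.
# Within each group of identical MD5 hashes, the first path in sort order
# is kept and every later path is reported.
# Assumes file names without newlines or backslashes.
plan_dedup() {
  local root="${1:-.}"
  find "$root" -type f -exec md5sum {} + |
    sort |
    awk '{
      hash = substr($0, 1, 32)   # md5sum prints 32 hex characters first
      path = substr($0, 35)      # then two separator characters, then the path
      if (hash == prev) print "would delete: " path
      prev = hash
    }'
}
```

Review the output of `plan_dedup .` before acting on it; which copy counts as "first" is simply lexical path order, which may not match the copy you want to keep.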

Fish (unix)
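find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

Same external commands as Bash. To capture the result with fish syntax, use `-D` so that no blank separator lines end up in the list: `set dupes (find . -type f -exec md5sum {} + | sort | uniq -w32 -D)`. For unattended deletion, `fdupes -rdN .` deletes duplicates and keeps the first file in each set; `-N` skips the confirmation prompt. `fdupes` has no dry-run flag here, so run `fdupes -r .` first (it only lists) and review the output before adding `-dN`.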

PowerShell (windows)
Get-ChildItem -Recurse -File | Group-Object { (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash } | Where-Object Count -gt 1

`Group-Object` clusters files by their hash; `Where-Object Count -gt 1` keeps only groups with more than one member. `-LiteralPath` stops `Get-FileHash` from treating `[` and `]` in file names as wildcards. On huge trees the size prefilter is critical, because hashing every file means reading every file: `Get-ChildItem -Recurse -File | Group-Object Length | Where-Object Count -gt 1 | ForEach-Object { $_.Group | Group-Object { (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash } | Where-Object Count -gt 1 }`. Where deliberate collisions are a concern, use `-Algorithm SHA256` (the `Get-FileHash` default): somewhat slower, but no practical collision attack on it is known.

cmd.exe (windows)
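for /r %f in (*) do @certutil -hashfile "%f" MD5 | findstr /v "CertUtil"

cmd has no native group-by, so this only prints a hash per file; spotting the repeats is left to you. `certutil -hashfile` is the built-in hash tool (slow, because it spawns one process per file). `findstr /v "CertUtil"` drops the trailing status line and keeps the header line that names the file, so each hash stays paired with its path. In a batch file, write `%%f` instead of `%f`. For real dedup work on Windows from cmd, shell out to PowerShell: `powershell -NoProfile -Command "Get-ChildItem -Recurse -File | Group-Object { (Get-FileHash $_.FullName).Hash } | Where-Object Count -gt 1"`. Or install `fdupes` via Scoop / Chocolatey.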

Equivalents listed for Bash, Zsh, Fish, PowerShell, cmd.exe.

Gotchas & notes

  • MD5 is fine for finding accidental duplicates, where no adversary is deliberately crafting collisions. Its collision resistance has been broken since the 2004 attacks, so it is not cryptographically secure. For "is this file the same as that file" in a trusted context, MD5 is typically faster in software than SHA-256 (roughly 500 MB/s versus 200 MB/s; the figures vary by CPU, and hardware SHA extensions can narrow or reverse the gap), and the chance of a random collision is astronomically low. For dedup of untrusted input (uploaded files where someone might craft collisions), use SHA-256 or BLAKE3.
  • The size prefilter can be the difference between seconds and hours on large trees. Two files of different sizes cannot be duplicates (different size → different content), so group by size first and hash only within size groups of 2+. Most duplicate-finder tools (`fdupes`, `rmlint`, `dupeguru`) do this automatically.
  • Hardlinks vs duplicates: `find . -type f` lists each hardlinked name as a separate file (same inode, same content), so hash-based tools report them as duplicates. Deleting one only removes a name; the data persists while another hardlink references the inode, so no space is freed. Add `-links 1` to `find` to skip files with more than one hardlink; `-not -links 1` does the opposite and selects only the hardlinked files. On Windows, hardlinks are rare in practice but exist (`fsutil hardlink list <file>`).
  • For media files specifically (photos, videos, music), HASH-based dedup catches byte-identical duplicates only — re-encoded or recompressed versions of the same image won't match. For perceptual deduplication, install `dupeguru` (GUI, cross-platform) or `findimagedupes` (CLI, Linux/macOS) — they compute perceptual hashes (resilient to size/quality changes) rather than cryptographic hashes.
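
The hardlink and hash-strength notes above can be combined in one pass. The sketch below is illustrative only: `find_dupes_sha256` is a hypothetical helper name, and it assumes GNU `find` (for `-printf` with `%D` and `%i`), `xargs`, `sha256sum`, and `uniq`, plus file names without newlines.

```shell
# find_dupes_sha256: hypothetical helper, not part of any tool.
# Lists files with identical SHA-256 hashes, counting all hardlinks to the
# same inode as a single file so they are not reported as duplicates.
# Assumes GNU find/xargs/sha256sum/uniq and no newlines in file names.
find_dupes_sha256() {
  local root="${1:-.}"
  # "%D:%i" is device number plus inode number; every hardlink to one file
  # shares this pair, so awk keeps only the first path seen for each pair.
  find "$root" -type f -printf '%D:%i\t%p\n' |
    awk -F'\t' '!seen[$1]++ { print substr($0, length($1) + 2) }' |
    tr '\n' '\0' |
    xargs -0 -r sha256sum |
    sort |
    uniq -w64 --all-repeated=separate   # a SHA-256 hex digest is 64 chars
}
```

This hashes every remaining file, so on large trees it is worth combining with the size prefilter described earlier.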
