How To Remove Duplicate Lines
📖 This guide was prepared by the ToolPazar team. All of our tools are free and ad-free.
Exact vs normalized dedup
Dedup looks like the simplest text operation in the world: remove lines that appear more than once. In reality “duplicate” is a spectrum. Is leading whitespace significant? Does case matter? Should the first occurrence win, or the last? Do trailing spaces make two lines different or the same? And what about a file that’s 10 GB and won’t fit in memory? The right answer depends entirely on what you’re cleaning — email lists, log files, source code, shopping lists — and picking the wrong one can silently discard data you needed. This guide walks through every dedup decision and the patterns that handle each.
Case-insensitive dedup
Exact dedup compares bytes. Normalized dedup compares after a transformation — lowercase, trim, collapse whitespace, etc. Real-world lists almost always need some normalization, because real-world sources have inconsistent formatting.
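The difference can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `dedup` and `normalize` names are hypothetical, and the particular normalization (trim, lowercase, collapse whitespace) is just one assumed combination of the transformations mentioned above.

```python
def dedup(lines, key=lambda s: s):
    """Keep the first line seen for each distinct key, preserving input order."""
    seen = set()
    out = []
    for line in lines:
        k = key(line)
        if k not in seen:
            seen.add(k)
            out.append(line)
    return out


def normalize(line):
    # Assumed normalization: lowercase, trim, and collapse whitespace runs.
    return " ".join(line.lower().split())


lines = ["Apple", "apple ", "Apple", "banana"]

# Exact: lines are compared as-is (Python compares str values, not raw bytes).
exact = dedup(lines)

# Normalized: lines are compared by their normalized key; originals are kept.
normalized = dedup(lines, key=normalize)
```

Passing the key as a function keeps the walk over the list identical in both cases; only the definition of "same line" changes.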
Trimmed comparison
Case-insensitive comparison is common for emails, usernames, and domains. Build the comparison key by lowercasing each line, and keep the original line for output.
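A minimal Python sketch of that idea follows. The function name `dedup_case_insensitive` and the sample addresses are made up for illustration.

```python
def dedup_case_insensitive(lines):
    """Drop lines that differ only by letter case; keep the first spelling seen."""
    seen = set()
    out = []
    for line in lines:
        key = line.lower()  # comparison key only; never written to the output
        if key in seen:
            continue
        seen.add(key)
        out.append(line)    # the original spelling is what gets emitted
    return out


emails = ["Ann@Example.com", "ann@example.com", "bob@example.com", "BOB@EXAMPLE.COM"]
unique_emails = dedup_case_insensitive(emails)
```

For text beyond plain ASCII, `str.casefold()` may be a better key than `str.lower()`, since it is designed for caseless matching.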
Preserve-first vs preserve-last
Leading and trailing whitespace silently makes otherwise identical lines compare as different. Trim for the comparison, then keep whichever version you prefer (original or trimmed) for output.
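One way this might look in Python is shown below. The `keep_trimmed` flag is a hypothetical parameter added here to show both output choices.

```python
def dedup_trimmed(lines, keep_trimmed=False):
    """Treat lines as equal when they match after stripping outer whitespace."""
    seen = set()
    out = []
    for line in lines:
        key = line.strip()  # strips spaces, tabs, and newlines at both ends
        if key in seen:
            continue
        seen.add(key)
        # Emit either the cleaned-up key or the line exactly as it arrived.
        out.append(key if keep_trimmed else line)
    return out


raw = ["  alpha", "alpha  ", "beta", "\tbeta"]
as_written = dedup_trimmed(raw)
cleaned = dedup_trimmed(raw, keep_trimmed=True)
```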
Unique vs all-duplicates
For more aggressive matching, also collapse each run of internal whitespace to a single space before comparing.
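A small sketch of a whitespace-collapsing key, assuming that any run of spaces or tabs should count as one space. The helper names are illustrative only.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def whitespace_key(line):
    """Trim the ends and collapse every internal whitespace run to one space."""
    return _WHITESPACE_RUN.sub(" ", line.strip())


def dedup_collapsed(lines):
    seen = set()
    out = []
    for line in lines:
        key = whitespace_key(line)
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out


rows = ["John  Smith", "John Smith ", "John\tSmith", "JohnSmith"]
collapsed = dedup_collapsed(rows)
```

Note that collapsing does not remove whitespace entirely, so "JohnSmith" stays distinct from "John Smith".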
Unix: sort | uniq
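When two lines match, which copy do you keep? The default is preserve-first: walk the list and skip anything you’ve already seen. Preserve-last requires a second pass, because you cannot know a line is the final occurrence until you have seen the whole input.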
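Both policies can be sketched as follows. This is one possible approach, not the only one: the two-pass version records the index of each key's final occurrence first. The sample feed and the `by_id` key function are invented for the example.

```python
def preserve_first(lines, key=lambda s: s):
    """Single pass: emit a line only the first time its key appears."""
    seen = set()
    out = []
    for line in lines:
        k = key(line)
        if k not in seen:
            seen.add(k)
            out.append(line)
    return out


def preserve_last(lines, key=lambda s: s):
    """Two passes: find each key's final index, then keep only those lines."""
    last_index = {}
    for i, line in enumerate(lines):
        last_index[key(line)] = i  # later occurrences overwrite earlier ones
    return [line for i, line in enumerate(lines) if last_index[key(line)] == i]


feed = ["u1 pending", "u2 pending", "u1 shipped", "u2 cancelled", "u3 pending"]
by_id = lambda line: line.split()[0]  # hypothetical key: the leading id field

first = preserve_first(feed, key=by_id)
last = preserve_last(feed, key=by_id)
```

With `feed` above, `first` keeps the earliest record per id and `last` keeps the final state per id.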
Preserving order with awk
Preserve-first is right for logs (earliest record matters). Preserve-last is right for change feeds (last state wins).
Hash-based keys for long lines
Three possible outputs for a deduplication job:
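As an assumption modeled on the three modes of the Unix `uniq` tool (default, `-d`, `-u`), the outputs can be read as: every distinct line once, only the lines that were repeated, and only the lines that never repeated. A rough Python sketch under that reading, with a hypothetical `dedup_modes` helper:

```python
from collections import Counter


def dedup_modes(lines):
    """Return (distinct, repeated, singletons), each in first-seen order."""
    counts = Counter(lines)   # Counter keeps keys in insertion order
    distinct = list(counts)   # every line, once
    repeated = [l for l in distinct if counts[l] > 1]     # appeared 2+ times
    singletons = [l for l in distinct if counts[l] == 1]  # appeared exactly once
    return distinct, repeated, singletons


items = ["milk", "eggs", "milk", "bread", "eggs", "tea"]
distinct, repeated, singletons = dedup_modes(items)
```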
Dedup with count column
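When lines are very long, you can key the seen-set on a hash of each line instead of the line itself. Accidental SHA-1 collisions on human text are vanishingly rare. For adversarial input, use SHA-256.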
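A minimal sketch of hash-keyed dedup, assuming UTF-8 text and Python's standard `hashlib`. The generator name and the `algorithm` parameter are illustrative choices, not part of the original guide.

```python
import hashlib


def iter_unique_by_digest(lines, algorithm="sha1"):
    """Yield each distinct line once, remembering fixed-size digests only."""
    seen = set()
    for line in lines:
        # A SHA-1 digest is 20 bytes (SHA-256: 32) however long the line is.
        digest = hashlib.new(algorithm, line.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line


long_line = "x" * 10_000
sample = [long_line, "short", long_line, "short"]

unique_sha1 = list(iter_unique_by_digest(sample))
unique_sha256 = list(iter_unique_by_digest(sample, algorithm="sha256"))
```

Because this is a generator, it can be fed a file object line by line, so memory grows with the number of distinct lines rather than with their length.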
CSV dedup by key column
For tabular data, “duplicate” usually means “same value in the key column,” not a full-row match. Use a CSV-aware tool so that quoted fields containing commas are parsed correctly.
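As one example of a CSV-aware approach, Python's standard `csv` module can do the parsing. The sketch below assumes the file has a header row and keeps the first row for each key. The `dedup_csv` function and the sample data are hypothetical.

```python
import csv
import io


def dedup_csv(text, key_column):
    """Keep the first row for each distinct value in key_column."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()
    for row in reader:
        key = row[key_column]
        if key in seen:
            continue  # same key already emitted, even if other columns differ
        seen.add(key)
        writer.writerow(row)
    return out.getvalue()


data = 'email,name\na@x.com,"Smith, Ann"\nb@x.com,Bob\na@x.com,Ann S.\n'
result = dedup_csv(data, "email")
```

The quoted field `"Smith, Ann"` contains a comma, which is exactly the case where naive splitting on `,` would misread the columns.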