Locality-Sensitive Hashing
Also known as: lsh, approximate near duplicate, locality sensitive hashing, lsh index
Locality-Sensitive Hashing (LSH) is a technique that hashes similar items into the same buckets with high probability, so near-duplicate search scales without comparing every pair. It turns an O(n^2) all-pairs problem into a fast bucket lookup over huge collections.
- LSH hashes similar items into the same buckets, replacing O(n^2) all-pairs comparison with fast candidate lookup.
- Hash families are chosen per similarity metric — e.g. MinHash for Jaccard, bit-sampling for Hamming distance.
- Band/bit tuning trades recall vs. precision; a final exact check confirms candidate matches.
The problem LSH solves
Finding near-duplicates by comparing every item to every other item is O(n^2) — for a library of tens of thousands of photos that is hundreds of millions of comparisons. LSH avoids this by grouping likely-similar items together up front, so each item only has to be compared against a small candidate set.
Unlike a cryptographic hash, where one changed bit scrambles the entire output, an LSH family is designed so that similar inputs collide (land in the same bucket) far more often than dissimilar ones. The exact family depends on the similarity measure — for example, MinHash for Jaccard set similarity, or random-projection / bit-sampling hashes for Hamming distance between perceptual hashes.
How it works in practice
A common approach for binary fingerprints (like a 64-bit perceptual hash) is bit sampling: split the hash into several bands and index each band separately. Two images that share any band end up in the same bucket and become candidates, so even images with a few differing bits are still found.
Tuning the number of bands and bits per band trades recall against precision: more, smaller bands catch more true near-duplicates but generate more candidates to verify. After bucketing, a final exact check (such as Hamming distance on the full hash) confirms which candidates are real matches.
This is what lets a duplicate-photo scanner stay responsive on a large camera roll: LSH narrows millions of potential comparisons down to a handful of likely matches per image before any expensive verification runs.