Entropy Coding

Entropy coding is the final, lossless stage of compression that assigns shorter bit patterns to common symbols and longer ones to rare symbols, squeezing out statistical redundancy. Huffman and arithmetic coding are the classic methods used in JPEG, PNG, ZIP, and video codecs.

Also known as: entropy encoding, entropy coder

  • Final lossless stage of compression: short codes for common symbols, long for rare ones.
  • Huffman is used in JPEG/PNG/ZIP; arithmetic coding (such as CABAC in H.264 and HEVC) is used in modern video codecs.
  • Re-zipping already-compressed media gains little because it is near its entropy limit.

What entropy coding does

Compression usually happens in two phases. Earlier stages transform and reduce data (for example a frequency transform and quantization in JPEG, or motion compensation in video). Entropy coding is the last step: it takes the resulting stream of symbols and packs it into the fewest bits without losing any information, so it is lossless even inside an otherwise lossy format.

The core idea, rooted in Claude Shannon's information theory, is that frequent symbols should cost fewer bits than rare ones. A symbol that appears constantly might be coded in two bits while a rare one takes ten, so the average bit length drops toward the data's true entropy, its theoretical information content.
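As a rough illustration of that entropy limit, the sketch below estimates the Shannon entropy of a byte string from its symbol frequencies alone. The helper name `shannon_entropy_bits` is made up for this example, and it assumes an order-0 model (every byte treated independently), which is a simplification; real coders usually exploit context as well.

```python
import math
from collections import Counter

def shannon_entropy_bits(data: bytes) -> float:
    """Average information per symbol in bits, assuming independent symbols."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over every symbol that actually occurs.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# One very common symbol and two rare ones: well under 1 bit per symbol.
skewed = b"a" * 90 + b"b" * 5 + b"c" * 5
# Every byte value exactly once: nothing is more likely than anything else.
uniform = bytes(range(256))

print(f"skewed:  {shannon_entropy_bits(skewed):.3f} bits/symbol")
print(f"uniform: {shannon_entropy_bits(uniform):.3f} bits/symbol")
```

The skewed input comes out far below the 8 bits per byte it occupies in raw form, while the uniform input needs the full 8 bits, so no entropy coder could shrink it under this model.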

Huffman, arithmetic, and modern coders

Huffman coding builds a tree that gives each symbol a whole number of bits; it is fast and used in JPEG, PNG (via DEFLATE), ZIP, and gzip. Arithmetic coding and its relative range coding represent an entire message as a single fractional number and can use non-integer bit lengths, so they compress slightly better at higher CPU cost.
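The tree-building step for Huffman coding can be sketched in a few lines. This is a minimal teaching version, not the exact procedure any of the formats above use (DEFLATE, for instance, limits code lengths and stores codes in a canonical form); the function name `huffman_code` is a hypothetical helper.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Return a prefix code {symbol: bitstring} built from symbol counts."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit so the output is decodable.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols gain one leading bit.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

message = "abracadabra"
code = huffman_code(message)
encoded = "".join(code[s] for s in message)
print(code)
print(len(encoded), "bits")
```

For "abracadabra" the frequent "a" receives a one-bit code and the four rarer letters three bits each, for 23 bits in total, compared with 33 bits for a fixed three-bit code over the same five letters.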

Video codecs lean on advanced entropy coders: H.264 offers CABAC (Context-Adaptive Binary Arithmetic Coding), and HEVC and AV1 use context-adaptive arithmetic and range coders. Newer formats use ANS (Asymmetric Numeral Systems), as in Zstandard and JPEG XL, blending Huffman-like speed with arithmetic-like efficiency.
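To give a feel for what "adaptive" means here, the sketch below tracks a running probability estimate for a stream of binary symbols and adds up the ideal cost of each one, -log2(p), which is roughly what an arithmetic coder would spend. It is only an illustration under simple assumptions: the function `adaptive_cost_bits` and its plain counting model are invented for this example, whereas real coders such as CABAC use many context-specific models and fixed-point state tables.

```python
import math

def adaptive_cost_bits(bits: str) -> float:
    """Ideal cost in bits of coding a 0/1 string with an adaptive estimate."""
    counts = {"0": 1, "1": 1}  # start with no preference for either symbol
    total = 0.0
    for b in bits:
        p = counts[b] / (counts["0"] + counts["1"])
        total += -math.log2(p)  # likely symbols cost a fraction of a bit
        counts[b] += 1          # learn from the symbol just coded
    return total

mostly_zero = "0" * 95 + "1" * 5
balanced = "01" * 50

print(f"mostly zero: {adaptive_cost_bits(mostly_zero):.1f} bits for 100 symbols")
print(f"balanced:    {adaptive_cost_bits(balanced):.1f} bits for 100 symbols")
```

The skewed stream costs far fewer than 100 bits because each expected symbol is charged well under one bit, something a whole-bit code such as Huffman cannot do for a two-symbol alphabet. The balanced stream costs slightly more than 100 bits, the small price of learning the statistics on the fly.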

Why it matters for storage

Entropy coding is one reason a JPEG, HEIC, MP4, or ZIP file is smaller than its raw contents: after the earlier stages have done their work, the entropy coder removes the remaining statistical slack. It cannot, however, find redundancy that has already been squeezed out, so re-zipping an already-compressed photo or video gains almost nothing.
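The effect is easy to reproduce with Python's standard `zlib` module (DEFLATE, which combines LZ77 matching with Huffman coding). The word list and weights below are arbitrary test data chosen for this sketch, not measurements from real media files.

```python
import random
import zlib

# Build repetitive but non-trivial text from a small, skewed vocabulary.
rng = random.Random(0)
words = ["alpha", "beta", "gamma", "delta", "pixel", "frame", "codec", "block"]
picks = rng.choices(words, weights=[8, 7, 6, 5, 4, 3, 2, 1], k=20000)
text = " ".join(picks).encode()

once = zlib.compress(text, 9)   # first pass removes most of the redundancy
twice = zlib.compress(once, 9)  # second pass has almost nothing left to find

print("raw:             ", len(text), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

The first pass shrinks the text substantially; the second leaves the size essentially unchanged, and can even add a few bytes of container overhead, because the output of the first pass already looks statistically random.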

This is why Cleanor focuses on finding duplicate and similar files and re-encoding huge videos to efficient codecs rather than promising to shrink everything: well-compressed media is already near its entropy limit, so the real space wins come from removing redundant copies and switching to a more efficient codec, not from zipping the same files again.
