Huffman Coding
Huffman Coding is a Greedy algorithm used for lossless data compression.
It assigns variable-length binary codes to input characters based on their frequencies — characters that occur more frequently get shorter codes, and less frequent ones get longer codes.
It minimizes the total number of bits used to represent the original data without losing any information.
Key Properties:
- Uses a binary tree to assign codes.
- Ensures prefix-free codes (no code is a prefix of another).
- Based on Greedy strategy: combines the two least frequent symbols at each step.
How Huffman Coding Works:
- Count the frequency of each character.
- Create a leaf node for each character and insert it into a min-heap (priority queue).
- While the heap has more than one node:
- Remove the two nodes with the lowest frequency.
- Combine them into a new node with frequency = sum of both.
- Insert the new node back into the heap.
- The remaining node is the root of the Huffman tree.
- Traverse the tree to assign binary codes:
- Left → 0
- Right → 1
Example:
Input characters and frequencies:
|
Character |
Frequency |
|
A |
5 |
|
B |
9 |
|
C |
12 |
|
D |
13 |
|
E |
16 |
|
F |
45 |
Possible Huffman Codes:
|
Character |
Code |
|
F |
0 |
|
C |
100 |
|
D |
101 |
|
A |
1100 |
|
B |
1101 |
|
E |
111 |
Uses fewer bits than fixed-length coding.
Huffman Coding Python code:
import heapq
def huffman(freq):
heap = [[wt, [sym, “”]] for sym, wt in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
lo = heapq.heappop(heap)
hi = heapq.heappop(heap)
for pair in lo[1:]: pair[1] = ‘0’ + pair[1]
for pair in hi[1:]: pair[1] = ‘1’ + pair[1]
heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
return sorted(heap[0][1:], key=lambda p: (len(p[-1]), p))
Time Complexity:
- O (n log n) where n is the number of unique characters
- Due to heap operations and tree building
Advantages:
- Produces optimal prefix codes.
- Efficient for text compression (used in ZIP, PNG, MP3 formats).