1. Concept
Huffman coding is a type of lossless data compression. It is based on the frequency of the characters that appear in the data: it builds a binary code for each character so that the total number of bits used is minimal.
The algorithm was proposed by David A. Huffman, and published in 1952.
To encode characters, we normally map each character to a fixed-length binary code, as in the 256-character ASCII encoding. With this encoding, we have to use 8 bits to represent every character, which wastes memory: the data may not use all 256 characters, and some characters appear many times while others appear rarely.
For example, suppose we analyze a sentence and count the following character frequencies: s (8), e (7), space (5), h (4), l (4), a (2), b (1), y (1), O (1), r (1), dot (1).
We notice that 's' appears up to 8 times, while 'r' appears only once. Using a fixed 8 bits for every character is wasteful. Instead, we can use a variable number of bits: characters that appear more often get shorter codes, and vice versa.
However, a problem arises: if we encode with variable-length codes, how do we know which bit sequence represents which character? We could insert commas or other separator symbols between the codes, but the separators themselves would take up memory. This is where the concept of a prefix code comes in.
2. Prefix-free binary code
A prefix-free binary code is a set of code words in which no code word is a prefix of any other code word in the set.
Example: Suppose we encode the word "ARRAY".
This word contains the characters 'A', 'R', and 'Y'.
- If we encode with fixed-length codes, we need at least 2 bits per character ('A' => 00, 'R' => 01, 'Y' => 11). The bit sequence is then 0001010011. To decode, we simply read 2 bits at a time and look them up in the table.
- If we encode 'A' => 0, 'R' => 01, 'Y' => 11, this code is not prefix-free (the code 0 for 'A' is a prefix of the code 01 for 'R', so 01 could also be read as 'A' followed by the bit 1). To decode unambiguously, we would have to put separators between the codes (0, 01, 01, 0, 11).
- If we encode 'A' => 0, 'R' => 10, 'Y' => 11, this code is prefix-free. The bit sequence is then 01010011.
We can describe a prefix code with a binary tree:
For each character, its code is the sequence of bits collected while walking from the root to the leaf containing that character: every time we go to the left child we append a 0, and every time we go to the right child we append a 1 ('A' => 0, 'R' => 10, 'Y' => 11).
With the normal 8-bit representation, "ARRAY" takes about 40 bits (5 bytes). The fixed-length 2-bit code needs only 10 bits, a 75% saving, and the prefix code from the tree needs just 8 bits (01010011), an 80% saving. Amazing!
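To see why a prefix-free code can be decoded without separators, here is a small illustrative snippet (not part of the classes built later in this article) that decodes 01010011 with the table 'A' => 0, 'R' => 10, 'Y' => 11 by reading one bit at a time:

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
    // Prefix-free table for the "ARRAY" example.
    map<string, char> table = { { "0", 'A' }, { "10", 'R' }, { "11", 'Y' } };
    string bits = "01010011", buffer = "", decoded = "";

    for (char b : bits)
    {
        buffer += b;                 // grow the current code word bit by bit
        if (table.count(buffer))     // a match is always unambiguous, because
        {                            // no code word is a prefix of another
            decoded += table[buffer];
            buffer.clear();
        }
    }
    cout << decoded << endl;         // prints ARRAY
}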
3. Static Huffman Algorithm
Example: We have the string s = "AAAAAABCCCCCCDDEEEEE".
Frequencies: A (6), B (1), C (6), D (2), E (5).
With the normal 8-bit representation, sizeof(s) = (6 + 1 + 6 + 2 + 5) * 8 = 160 bits. However, if we apply the prefix code idea above, we obtain 'A' => 00, 'C' => 01, 'E' => 10, 'B' => 110, 'D' => 111, so sizeof(s) = 6*2 + 1*3 + 6*2 + 2*3 + 5*2 = 43 bits.
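As a quick sanity check of that arithmetic, a few illustrative lines of code can total the bits from the frequency and code-length table:

#include <iostream>
using namespace std;

int main()
{
    // Frequencies and code lengths for "AAAAAABCCCCCCDDEEEEE".
    int freq[]   = { 6, 1, 6, 2, 5 };   // A, B, C, D, E
    int length[] = { 2, 3, 2, 3, 2 };   // 00, 110, 01, 111, 10

    int total = 0;
    for (int i = 0; i < 5; i++)
        total += freq[i] * length[i];

    cout << total << " bits" << endl;   // prints 43 bits (vs. 20 * 8 = 160 bits in ASCII)
}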
4. Coding the Huffman Algorithm
Algorithm to create a Huffman tree:
[B1] Sort the table of characters by frequency (ascending).
[B2] Take the first two elements as nodes x and y, weighted by their frequencies. Create a node z as the parent of x and y, with weight equal to the sum of its two children's weights.
[B3] Remove the two nodes x and y from the table, then insert z so that the table stays sorted.
[B4] Repeat B1-B3 until only one element is left in the table (a compact sketch of these steps follows the struct definitions below).
We need two structs:
struct Node
{
    char c;
    int freq;
    Node* left;
    Node* right;

    Node()
    {
        c = '\0';
        freq = -1;
        left = NULL;
        right = NULL;
    }
};
struct Bit
{
    char c;
    string bit;
};
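Before looking at the article's array-based implementation of steps B1-B4, here is a compact, purely illustrative sketch of the same idea using std::priority_queue as a min-heap; the actual convertTree() below keeps a manually sorted vector instead:

// Sketch only (not the article's implementation). Requires <queue> and <vector>
// and uses the Node struct defined above.
struct NodeCompare
{
    bool operator()(const Node* a, const Node* b) const
    {
        return a->freq > b->freq;           // smallest frequency on top of the heap
    }
};

Node* buildHuffmanTree(vector<Node*>& leaves)
{
    priority_queue<Node*, vector<Node*>, NodeCompare> pq(leaves.begin(), leaves.end());
    while (pq.size() > 1)
    {
        Node* x = pq.top(); pq.pop();       // [B1]-[B2] take the two lowest-frequency nodes
        Node* y = pq.top(); pq.pop();

        Node* z = new Node();               // [B2] parent weighted by the sum of its children
        z->left = x;
        z->right = y;
        z->freq = x->freq + y->freq;

        pq.push(z);                         // [B3] put the parent back into the "table"
    }                                       // [B4] repeat until a single node (the root) remains
    return pq.empty() ? NULL : pq.top();
}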
class HuffmanCompression
{
private:
    string data;
    Node* root;
    vector<Bit> bit;
    string bitTree;
    void convertTree();
    void visit(Node* curr, string bit);
    void generateTree(Node* curr);
public:
    HuffmanCompression() { bitTree = ""; }
    HuffmanCompression(string filePath);
    void compression(string outputPath);
};
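The file-path constructor HuffmanCompression(string filePath) is declared but not shown in this section. A plausible sketch, mirroring the byte-by-byte reading loop used later by the extractor's constructor, simply loads the whole file into data:

// Sketch only: read the input file byte by byte into the data member.
HuffmanCompression::HuffmanCompression(string filePath)
{
    root = NULL;
    bitTree = "";
    data = "";

    ifstream fi(filePath, ios::binary);
    char c;
    while (fi >> noskipws >> c)
        data += c;
    fi.close();
}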
void HuffmanCompression::convertTree()
{
    vector<Node*> tree;

    // Count the frequency of every character in the data.
    for (int i = 0; i < data.length(); i++)
    {
        bool existed = false;
        for (int j = 0; j < tree.size(); j++)
        {
            if (tree[j]->c == data[i])
            {
                tree[j]->freq++;
                existed = true;
                break;
            }
        }
        if (!existed)
        {
            Node* node = new Node();
            node->c = data[i];
            node->freq = 1;
            tree.push_back(node);
        }
    }

    // [B1] Sort the nodes by frequency, ascending.
    for (int i = 0; i < tree.size() - 1; i++)
    {
        for (int j = i + 1; j < tree.size(); j++)
        {
            if (tree[i]->freq > tree[j]->freq)
            {
                Node* temp = tree[i];
                tree[i] = tree[j];
                tree[j] = temp;
            }
        }
    }

    // [B2]-[B4] Repeatedly merge the two smallest nodes until one node remains.
    while (true)
    {
        Node* tmp = new Node();
        tmp->left = tree[0];
        tmp->right = tree[1];
        tmp->freq = tmp->left->freq + tmp->right->freq;

        // Remove the two smallest nodes and make room for their parent.
        tree.erase(tree.begin(), tree.begin() + 2);
        tree.resize(tree.size() + 1);

        if (tree.size() == 1)
        {
            tree[0] = tmp;
            break;
        }
        else
        {
            // Insert tmp so that the vector stays sorted by frequency.
            for (int j = 0; j < tree.size() - 1; j++)
            {
                bool isMax = true;
                if (tree[j]->freq > tmp->freq)
                {
                    for (int k = tree.size() - 1; k > j; k--)
                    {
                        tree[k] = tree[k - 1];
                    }
                    tree[j] = tmp;
                    isMax = false;
                    break;
                }
                if (isMax) tree[tree.size() - 1] = tmp;
            }
        }
    }
    root = tree[0];
}
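The class also declares visit() and generateTree(), whose implementations are not shown in this section. Based on the code convention above (left edge = 0, right edge = 1) and the tree encoding the extractor expects later (one '0' per internal node, '1' followed by the character's 8 bits per leaf), a plausible sketch looks like this; compression() itself (also not shown) would then pack the header, bitTree, and the encoded data into bytes:

// Sketch only: collect the code word of every leaf into the bit table.
void HuffmanCompression::visit(Node* curr, string code)
{
    if (curr->left == NULL && curr->right == NULL)   // leaf: store its code word
    {
        Bit b;
        b.c = curr->c;
        b.bit = code;
        bit.push_back(b);
        return;
    }
    visit(curr->left, code + "0");                   // left edge appends 0
    visit(curr->right, code + "1");                  // right edge appends 1
}

// Sketch only: serialize the tree in pre-order, matching what the extractor reads:
// '0' for an internal node, '1' plus the 8 bits of the character for a leaf.
// Assumes <bitset> is included.
void HuffmanCompression::generateTree(Node* curr)
{
    if (curr->left == NULL && curr->right == NULL)
    {
        bitTree += "1";
        bitTree += bitset<8>((unsigned char)curr->c).to_string();
        return;
    }
    bitTree += "0";
    generateTree(curr->left);
    generateTree(curr->right);
}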
5. Extraction
We need to build a class for extraction:
class HuffmanExtraction
{
private:
    string bitTree;
    string data;
    Node* root;
    void generateTree(Node* curr);
    char visit(Node* curr);
public:
    HuffmanExtraction() { root = new Node(); }
    HuffmanExtraction(string filePath);
    void extraction(string outputPath);
};
HuffmanExtraction::HuffmanExtraction(string filePath)
{
    root = new Node();
    string bitSequence = "";

    // Read the whole compressed file as a string of bits.
    ifstream fi(filePath, ios::binary);
    char c;
    while (fi >> noskipws >> c)
    {
        bitset<8> bit(c);
        bitSequence += bit.to_string();
    }

    // Header: the number of distinct characters, then the number of extra (padding) bits to skip.
    char numChar = convertBitToChar(bitSequence.substr(0, 8));
    bitSequence.erase(0, 8);
    char addBit = convertBitToChar(bitSequence.substr(0, 8));
    bitSequence.erase(0, 8);
    bitSequence.erase(0, addBit);

    // The encoded tree takes 10 * numChar - 1 bits; the rest is the encoded data.
    int sizeBit = 10 * numChar - 1;
    bitTree = bitSequence.substr(0, sizeBit);
    bitSequence.erase(0, sizeBit);
    data = bitSequence;
    fi.close();
}
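The constructor above uses a helper convertBitToChar, which is not shown in this section. It turns an 8-character bit string back into a char; one possible implementation with std::bitset is:

// Sketch of the helper used above: converts an 8-character bit string into one char.
char convertBitToChar(string bits)
{
    return (char)bitset<8>(bits).to_ulong();
}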
Next, we need to rebuild the Huffman tree from the bitTree sequence:
- If the current bit is '0', the node is an internal node: we create a child node (first the left child, then the right) and recurse into it to build its own children.
- If the current bit is '1', the node is a leaf: we read the next 8 bits and store the decoded character in that leaf.
void HuffmanExtraction::generateTree(Node* curr)
{
    while (bitTree.length() > 0)
    {
        // Both children of the current node are already built.
        if (curr->left != NULL && curr->right != NULL) return;

        Node* node = new Node();
        if (bitTree[0] == '0')
        {
            // Internal node: attach it as the next free child and recurse into it.
            bitTree.erase(0, 1);
            if (curr->left == NULL)
            {
                curr->left = node;
                generateTree(curr->left);
            }
            else
            {
                curr->right = node;
                generateTree(curr->right);
            }
        }
        else
        {
            // Leaf: the next 8 bits hold the character.
            string temp = bitTree.substr(1, 8);
            bitTree.erase(0, 9);
            node->c = convertBitToChar(temp);
            if (curr->left == NULL) curr->left = node;
            else curr->right = node;
        }
    }
}
char HuffmanExtraction::visit(Node* curr)
{
    // A leaf holds the decoded character.
    if (curr->left == NULL && curr->right == NULL)
    {
        return curr->c;
    }
    // Otherwise consume one bit and go left on '0', right on '1'.
    if (data[0] == '0')
    {
        data.erase(0, 1);
        return visit(curr->left);
    }
    else
    {
        data.erase(0, 1);
        return visit(curr->right);
    }
}
void HuffmanExtraction::extraction(string outputPath)
{
    this->generateTree(root);
    // generateTree attaches the real tree as the left child of the dummy root.
    root = root->left;

    string result = "";
    while (data.length() > 0) result += this->visit(root);

    ofstream fo(outputPath);
    fo << result;
    fo.close();
}
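Putting the two classes together, a typical call sequence might look like the following (the file names here are just placeholders):

int main()
{
    // Compress input.txt into input.huf ...
    HuffmanCompression compressor("input.txt");
    compressor.compression("input.huf");

    // ... then restore it into output.txt.
    HuffmanExtraction extractor("input.huf");
    extractor.extraction("output.txt");

    return 0;
}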
So we have successfully built the static Huffman compression algorithm. However, it has one disadvantage: it must scan the file twice to compress it (once to count frequencies, once to encode), which costs more time, and it also has to store the Huffman tree inside the compressed data, which increases the size of the archive. In the next section, I will walk through Adaptive Huffman compression, which overcomes these disadvantages.
That concludes this article on static Huffman compression. Thank you for reading, and see you in the next section!