algorithm-logo

1. Concept 

Data compression is types of compress not lossless data . It based frequency of characters that appear to building the binary code for those characters so that the number of bits is minimal.

The algorithm was proposed by David A. Huffman, and published in 1952.

David A. Huffman

To encode these characters , we subsitute characters in binary code such as 256-characters ASCII encoded . With this encoding , we have to use 8 bits to performance a characters ., which is very wasting memory  . In addition , when encode data may not need 256 characters and in string can have characters that appear again 2 times ,…

Cùng nhau phân tích thử chuỗi trên nhé!

In the above sequence , we analyze the following : s (8), e (7), space (5), h (4), l (4), a (2), b (1), y (1) , O (1), r (1), dot (1).

We realize that the ‘s’ appear up to 8 times , but ‘r’ just only appear once . If If we use 8 fixed bits to perform , we are tired … . Therefore , we can express by the number of “more flexible” bits , which characters appear more , use less bits , and vice versa .

However , one problem that arises is that , if we encode with ‘ flex bits ‘ how can we distinguish which bit sequence represents a character ? We can use commas to separate bits , or to use symbols . So , the comms will occupy some memory in the array . We have the concept of prefix code

2. Prefix-free binary code

Prefix -free binary code is the assemble symbols so that each symbol is not a prefix of the other symbols in the set .

EX : Suppose we encode the word ” ARRAY ”

We see in this array have ‘A’ , ‘R’ , ‘Y’ .

  1. If encoded by a fixed-length bit , we must use least 2 bits for 1 character (‘A’ => 00 , ‘R’ => 01 , ‘Y’ => 11) . Then the bit would be 0001010010 . For encoding , we only need 2 bits/times and compare it with the encoding .
  2. If encoding ‘A’ => 0 , ‘R’ => 01 , ‘Y’ => 11 , this code is not a prefix ( because ‘R’ => 01 can be interpreted as ‘A’ and bit 1 ) . To encode , we need to put the commas tin the middle (0, 01, 01, 0, 11).
  3. If encoding ‘A’ => 0, ‘R’ => 10, ‘Y’ => 11, this code is the prefix code. At that time, the bit sequence would be 01010011.

We descibe tree prefix code :

Biểu diễn bằng cây nhị phân.

For each character , we can represent a sequence of bits : When going from root to the leaf containing that character . if pass on the left side , we add  a zero on this sequence of bits , if pass the right side add the number 1 . ( ‘A’ => 0 ; ‘R’ => 10 , ‘Y’ => 11 )

We can see that , with nomal expression , we take about 40 bits ( 5 bytes ) to represent the “ARRAY ” . However , using compress the file algorithm , we need 10 bits ( 0001010010 ) , saving 75% .That amazing !!

3. Static Huffman Algorithm

Example : We have the array : s = “AAAAAABCCCCCCDDEEEEE“

Frequency :A(6), B(1), C(6), D(2), E(5) .

Nomal expression : sizeof (s) = (6 + 1 + 6 + 2 + 5)*8 = 160 bits . However , if we apply prefix code above , we will obtain ‘A‘ => 00, ‘C‘ => 01, ‘E‘ => 10, ‘B‘ => 110, ‘D‘ => 111 . So , this way we have sizeof (s) = 6 * 2 + 1 * 3 + 6 * 2 + 2 * 3 + 5 * 2 = 43 bits.

4. Code Huffman Algorithm

Algorithm to create a Huffman tree:

[B1] Sort the frequency table of characters.

[B2] Select the first two elements to create the node x, y weighted by frequency. Creating node z is the parent of node x, y weighting the sum of the weights of the two children.

[B3] Remove the two nodes x, y from the table, then add the node z so that the frequency table stays up / down.

[B4] Repeat B1-3 so that only one element in the table is left.

We need 2 struct :

structNode
{
    charc;
    intfreq;
    Node* left;
    Node* right;
    Node()
    {
        c = '\0';
        freq = -1;
        left = NULL;
        right = NULL;
    }
};
structBit
{
    charc;
    string bit;
};
Building class for compress
class HuffmanCompression
{
private:
    string data; 
    Node* root;
    vector bit; 
    string bitTree; 
    void convertTree();
    void visit(Node* curr, string bit);
    void generateTree(Node* curr);
public:
    HuffmanCompression() { bitTree = ""; }
    HuffmanCompression(string filePath);
    void compression(string outputPath);
};
Next we need to set the frequency table to generate the Huffman tree:
void HuffmanCompression::convertTree()
{
    vector<Node*> tree;
    for (int i = 0; i < data.length(); i++)
    {
        bool existed = false;
        for (int j = 0; j < tree.size(); j++)
                {
            if (tree[j]->c == data[i])
            {
                tree[j]->freq++;
                existed = true;
                break;
            }
        }
        if (!existed)
        {
            Node* node = new Node();
            node->c = data[i];
            node->freq = 1;
            tree.push_back(node);
        }
    }
    for (int i = 0; i < tree.size() - 1; i++)
    {
        for (int j = i + 1; j < tree.size(); j++)
                {
                        if (tree[i]->freq > tree[j]->freq)
            {
                Node* temp = tree[i];
                tree[i] = tree[j];
                tree[j] = temp;
            }
        }
    }
    
    while (true)
    {
        Node* tmp = new Node();
        tmp->left = tree[0];
        tmp->right = tree[1];
        tmp->freq = tmp->left->freq + tmp->right->freq;
        tree.erase(tree.begin(), tree.begin() + 2);
        tree.resize(tree.size() + 1);
        if (tree.size() == 1)
        {
            tree[0] = tmp;
            break;
        }
        else
        {
            for (int j = 0; j < tree.size() - 1; j++)
                        {
                                bool isMax = true;
                                if (tree[j]->freq > tmp->freq)
                {
                    for (int k = tree.size() - 1; k > j; k--)
                    {
                        tree[k] = tree[k - 1];
                    }
                    tree[j] = tmp;
                    isMax = false;
                    break;
                }
                if (isMax) tree[tree.size() - 1] = tmp;
            }
        }
    }
    root = tree[0];
}

4. Extract

huffman-tree-example

We need to build the class to extract:

class HuffmanExtraction
{
private:
    string bitTree;
    string data;
    Node* root;
    void generateTree(Node* curr);
    char visit(Node* curr);
public:
    HuffmanExtraction() { root = new Node(); }
    HuffmanExtraction(string filePath);
    void extraction(string outputPath);
};
To read an encrypted file:
HuffmanExtraction::HuffmanExtraction(string filePath)
{
    root = newNode();
    string bitSequence = "";
    ifstream fi(filePath, ios::binary);
    charc;
    while(fi >> noskipws >> c)
    {
        bitset bit(c);
        bitSequence += bit.to_string();
    }
    
    charnumChar = convertBitToChar(bitSequence.substr(0, 8));
    bitSequence.erase(0, 8);
    charaddBit = convertBitToChar(bitSequence.substr(0, 8));
    bitSequence.erase(0, 8);
    bitSequence.erase(0, addBit);
    intsizeBit = 10 * numChar - 1;
    bitTree = bitSequence.substr(0, sizeBit);
    bitSequence.erase(0, sizeBit);
    data = bitSequence;
    fi.close();
}

Next, we need to rebuild the Huffman tree based on the existing bitTree sequence:

If it is a ‘0’ bit, we construct two left and right child nodes, and call recursively to that child node.
If the ‘1’ bit, we read the next 8 bits and save it to the leaf node.

voidHuffmanExtraction::generateTree(Node* curr)
{
    while(bitTree.length() > 0)
    {
        if(curr->left != NULL && curr->right != NULL) return;
        Node* node = newNode();
        if(bitTree[0] == '0')
        {
            bitTree.erase(0, 1);
            if(curr->left == NULL)
            {
                curr->left = node;
                generateTree(curr->left);
            }
            else
            {
                curr->right = node;
                generateTree(curr->right);
            }
        }
        
        else
        {
            string temp = bitTree.substr(1, 8);
            bitTree.erase(0, 9);
            
            node->c = convertBitToChar(temp);
            if(curr->left == NULL) curr->left = node;
            elsecurr->right = node;
        }
    }
}
Next, we retrieve the sequence based on the Huffman algorithm according to the algorithm we wrote above
charHuffmanExtraction::visit(Node* curr)
{
    if(curr->left == NULL && curr->right == NULL)
    {
        returncurr->c;
    }
    if(data[0] == '0')
    {
        data.erase(0, 1);
        visit(curr->left);
    }
    else
    {
        data.erase(0, 1);
        visit(curr->right);
    }
}
Finally, write to the file
voidHuffmanExtraction::extraction(string outputPath)
{
    this->generateTree(root);
    root = root->left;
    string result = "";
    while(data.length() > 0) result += this->visit(root);
    ofstream fo(outputPath);
    fo << result;
    fo.close();
}

So we have successfully built the static Huffman compression algorithm. However, we recognize that this algorithm has one disadvantage is to browse twice the file to compress, so it will cost more. We also have to store the Huffman tree in compressed data, thus increasing the size of the archive. In the next section, I will guide on Adaptive Huffman Compression, which will overcome the disadvantages.

The above is an article on Huffman static compression, thank you for watching attention! See you in the next section!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s