C++ 哈希表使用与底层实现原理

一、unordered 系列关联式容器

在 C++98 中，STL 提供了底层为红黑树结构的一系列关联式容器，查询效率为 O(logN)。在 C++11 中，STL 提供了 4 个 unordered 系列的关联式容器，底层结构不同，本文主要介绍 unordered_map 和 unordered_set。

1.1 unordered_map

1.1.1 unordered_map 的文档介绍

unordered_map 是存储键值对的关联式容器，允许通过 keys 快速索引到对应的 value。
键值通常用于惟一地标识元素，映射值是一个对象。键和映射值的类型可能不同。
内部无序，相同哈希值的键值对放在相同的桶中，以便常数范围内找到 key 对应的 value。
通过 key 访问单个元素比 map 快，但遍历元素子集的范围迭代方面效率较低。
实现了直接访问操作符 operator[]，允许使用 key 作为参数直接访问 value。
迭代器至少是前向迭代器。

1.1.2 unordered_map 的接口说明

构造
容量
迭代器
元素访问 注意：该函数实际调用哈希桶的插入操作，用参数 key 与 V() 构造一个默认值往底层哈希桶中插入。如果 key 不在哈希桶中，插入成功，返回 V()；插入失败，说明 key 已经在哈希桶中，将 key 对应的 value 返回。
查询注意：unordered_map 中 key 不能重复，因此 count 函数的返回值最大为 1。
修改操作
桶操作

1.2 标准库中的 unordered_map

1.2.1 unordered_map 的介绍

无序性：元素没有特定顺序，排列由哈希函数决定。
快速访问：提供非常快速的查找、插入和删除操作。
唯一性：键是唯一的，尝试插入已存在的键会覆盖旧值。
哈希函数：默认提供哈希函数，也可自定义。
迭代器：提供迭代器遍历，但顺序与插入顺序无关。

unordered_map 和 unordered_set 的接口说明差不多，在此就不做过多的介绍了。

二、底层结构

unordered 系列的关联式容器之所以效率高，是因为其底层使用了哈希结构。

2.1 哈希概念

顺序结构及平衡树中，元素关键码与其存储位置无对应关系，查找需多次比较。顺序查找时间复杂度为 O(N)，平衡树中为 O(log₂N)。

理想的搜索方法：可以不经过任何比较，一次直接从表中得到要搜索的元素。如果构造一种存储结构，通过某种函数 (hashFunc) 使元素的存储位置与它的关键码之间能够建立一一映射的关系，那么在查找时通过该函数可以很快找到该元素。

例如：数据集合 {1, 7, 6, 4, 5, 9}；哈希函数设置为：hash(key) = key % capacity; capacity 为存储元素底层空间总的大小。文章配图

2.2 哈希冲突

对于两个数据元素的关键字 ki 和 kj(i != j)，有 ki != kj，但有 Hash(ki) == Hash(kj)，即不同关键字通过相同哈希函数计算出相同的哈希地址，该种现象称为哈希冲突或哈希碰撞。把具有不同关键码而具有相同哈希地址的数据元素称为'同义词'。

2.3 哈希函数

引起哈希冲突的一个原因可能是哈希函数设计不够合理。

哈希函数设计原则：

定义域必须包括需要存储的全部关键码，值域必须在 0 到 m-1 之间。
计算出来的地址能均匀分布在整个空间中。
应该比较简单。

常见哈希函数：

取关键字的某个线性函数：Hash(Key) = A*Key + B。优点简单、均匀；缺点需事先知道关键字分布。
除留余数法（常用）：Hash(key) = key % p(p 为质数)。
平方取中法：抽取中间位作为哈希地址。
折叠法：分割关键字叠加求和，取后几位。
随机数法：H(key) = random(key)。
数学分析法：选择分布均匀的若干位作为散列地址。

注意：哈希函数设计的越精妙，产生哈希冲突的可能性就越低，但是无法避免哈希冲突。

2.4 哈希冲突解决

解决哈希冲突两种常见的方法是：闭散列和开散列。

2.4.1 闭散列

闭散列：也叫开放定址法，当发生哈希冲突时，如果哈希表未被装满，可以把 key 存放到冲突位置中的'下一个'空位置中去。

线性探测 从发生冲突的位置开始，依次向后探测，直到寻找到下一个空位置为止。
- 插入：通过哈希函数获取位置，若冲突则线性探测找空位。
- 删除：采用标记的伪删除法来删除一个元素，不能物理删除，否则影响其他元素搜索。

template<class K> struct DefaultHashFunc { 
    size_t operator()(const K& key) { return (size_t)key; } 
}; 

template<> struct DefaultHashFunc<string> { 
    size_t operator()(const string& str) { 
        size_t hash = 0; 
        for (auto ch : str) { hash *= 131; hash += ch; } 
        return hash; 
    } 
}; 

namespace open_address { 
    enum STATE { EXIST, EMPTY, DELETE }; 
    template<class K, class V> struct HashData { 
        pair<K, V> _kv; 
        STATE _state = EMPTY; 
    }; 
    template<class K, class V, class HashFunc = DefaultHashFunc<K>> class HashTable { 
    public: 
        HashTable() { _table.resize(10); } 
        bool Insert( pair<K, V>& kv) { 
             ((kv.first)) {  ; } 
             (_n *  / _table.() >= ) { 
                 newSize = _table.() * ; 
                HashTable<K, V, HashFunc> newHT; 
                newHT._table.(newSize); 
                 ( i = ; i < _table.(); i++) { 
                     (_table[i]._state == EXIST) { newHT.(_table[i]._kv); } 
                } 
                _table.(newHT._table); 
            } 
            HashFunc hf; 
             hashi = (kv.first) % _table.(); 
             (_table[hashi]._state == EXIST) { ++hashi; hashi %= _table.(); } 
            _table[hashi]._kv = kv; 
            _table[hashi]._state = EXIST; 
            ++_n; 
             ; 
        } 
        { 
            HashFunc hf; 
             hashi = (key) % _table.(); 
             (_table[hashi]._state != EMPTY) { 
                 (_table[hashi]._state == EXIST && _table[hashi]._kv.first == key) { 
                     (HashData< K, V>*) & _table[hashi]; 
                } 
                ++hashi; hashi %= _table.(); 
            } 
             ; 
        } 
        { 
            HashData< K, V>* ret = (key); 
             (ret) { ret->_state = DELETE; --_n;  ; } 
             ; 
        } 
    : 
        vector<HashData<K, V>> _table; 
         _n = ; 
    }; 
}

线性探测优点：实现非常简单。线性探测缺点：一旦发生哈希冲突，所有的冲突连在一起，容易产生数据'堆积'，导致搜索效率降低。

二次探测 为了避免数据堆积，找下一个空位置的方法为：Hi = (H0 + i²) % m 或 Hi = (H0 - i²) % m。研究表明：当表的长度为质数且表装载因子 a 不超过 0.5 时，新的表项一定能够插入。

2.4.2 开散列

开散列概念 开散列法又叫链地址法，首先对关键码集合用散列函数计算散列地址，具有相同地址的关键码归于同一子集合（桶），各个桶中的元素通过一个单链表链接起来。
开散列实现

namespace hash_bucket { 
    template<class T> struct HashNode { 
        T _data; 
        HashNode<T>* _next; 
        HashNode(const T& data) :_data(data), _next(nullptr) {} 
    }; 
    template<class K, class T, class KeyOfT, class HashFunc> class HashTable; 
    template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc> struct HTIterator { 
        typedef HashNode<T> Node; 
        typedef HTIterator<K, T, Ptr, Ref, KeyOfT, HashFunc> Self; 
        typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> Iterator; 
        Node* _node; 
        const HashTable<K, T, KeyOfT, HashFunc>* _pht; 
        HTIterator(Node* node, const HashTable<K, T, KeyOfT, HashFunc>* pht) :_node(node), _pht(pht) {} 
        Ref operator*() { return _node->_data; } 
        Ptr operator->() { return &_node->_data; } 
        Self& operator++() { 
            if (_node->_next) { _node = _node->_next; } 
            else { 
                KeyOfT kot; HashFunc hf; 
                 hashi = ((_node->_data)) % _pht->_table.(); 
                ++hashi; 
                 (hashi < _pht->_table.()) { 
                     (_pht->_table[hashi]) { _node = _pht->_table[hashi];  *; } 
                     { ++hashi; } 
                } 
                _node = ; 
            } 
             *; 
        } 
         !=( Self& s) {  _node != s._node; } 
         ==( Self& s) {  _node == s._node; } 
    }; 
}

开散列的思考：

非整形 key 处理：需提供将 key 转化为整形的方法。
素数扩容：最好模一个素数，可预置素数列表快速取类似两倍关系的素数。

size_t GetNextPrime(size_t prime) { 
    static const int __stl_num_primes = 28; 
    static const unsigned long __stl_prime_list[__stl_num_primes] = { 
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 
        196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 
        50331653, 100663319, 201326611, 402653189, 805306457, 1610612741, 
        3221225473, 4294967291 
    }; 
    size_t i = 0; 
    for (; i < PRIMECOUNT; ++i) { 
        if (__stl_prime_list[i] > prime) return __stl_prime_list[i]; 
    } 
    return __stl_prime_list[i]; 
}

2.5 开散列与闭散列比较

应用链地址法处理溢出，需要增设链接指针，似乎增加了存储开销。事实上：由于开地址法必须保持大量的空闲空间以确保搜索效率，所以使用链地址法反而比开地址法节省存储空间。

三、模拟实现

3.1 哈希表的改造

3.1. 模板参数列表的改造

template<class K, class T, class KeyOfT, class HashFunc = DefaultHashFunc<K>> 
class HashTable { 
    typedef HashNode<T> Node; 
    template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc> 
    friend struct HTIterator; 
};

3.2. 增加迭代器操作

template<class K, class T, class KeyOfT, class HashFunc> class HashTable; 
template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc> 
struct HTIterator { 
    // ... 同上 HTIterator 实现 
};

3.3 增加通过 key 获取 value 操作

template<class K, class T, class KeyOfT, class HashFunc = DefaultHashFunc<K>> 
class HashTable { 
    typedef HashNode<T> Node; 
    template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc> 
    friend struct HTIterator; 
public: 
    typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> iterator; 
    typedef HTIterator<K, T, const T*, const T&, KeyOfT, HashFunc> const_iterator; 
    iterator begin() { 
        for (size_t i = 0; i < _table.size(); i++) { 
            Node* cur = _table[i]; 
            if (cur) { return iterator(cur, this); } 
        } 
        return iterator(nullptr, this); 
    } 
    iterator end() { return iterator(nullptr, ); } 
    { 
         ( i = ; i < _table.(); i++) { 
            Node* cur = _table[i]; 
             (cur) {  (cur, ); } 
        } 
         (, ); 
    } 
    {  (, ); } 
    {  } 
    () { _table.((), ); } 
    ~() { 
         ( i = ; i < _table.(); i++) { 
            Node* cur = _table[i]; 
             (cur) { Node* next = cur->_next;  cur; cur = next; } 
            _table[i] = ; 
        } 
    } 
    { 
        KeyOfT kot; iterator it = ((data)); 
         (it != ()) {  (it, ); } 
        HashFunc hf; 
         (_n == _table.()) { 
             newSize = (_table.()); 
            vector<Node*> newTable; 
            newTable.(newSize, ); 
             ( i = ; i < _table.(); i++) { 
                Node* cur = _table[i]; 
                 (cur) { 
                    Node* next = cur->_next; 
                     hashi = ((cur->_data)) % newSize; 
                    cur->_next = newTable[hashi]; 
                    newTable[hashi] = cur; 
                    cur = next; 
                } 
                _table[i] = ; 
            } 
            _table.(newTable); 
        } 
         hashi = ((data)) % _table.(); 
        Node* newnode =  (data); 
        newnode->_next = _table[hashi]; 
        _table[hashi] = newnode; 
        ++_n; 
         ((newnode, ), ); 
    } 
    { 
        HashFunc hf; KeyOfT kot; 
         hashi = (key) % _table.(); 
        Node* cur = _table[hashi]; 
         (cur) { 
             ((cur->_data) == key) {  (cur, ); } 
            cur = cur->_next; 
        } 
         (); 
    } 
    { 
        HashFunc hf; KeyOfT kot; 
         hashi = (key) % _table.(); 
        Node* prev = ; Node* cur = _table[hashi]; 
         (cur) { 
             ((cur->_data) == key) { 
                 (prev == ) { _table[hashi] = cur->_next; } 
                 { prev->_next = cur->_next; } 
                --_n;  cur;  ; 
            } 
            prev = cur; cur = cur->_next; 
        } 
         ; 
    } 
: 
    vector<Node*> _table; 
     _n = ; 
};

3.2 unordered_map

#include"HashsTable.h" 
namespace lxp { 
    template<class K, class V> class unordered_map { 
        struct MapKeyOfT { 
            const K& operator()(const pair<const K, V>& kv) { return kv.first; } 
        }; 
    public: 
        typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::iterator iterator; 
        typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::const_iterator const_iterator; 
        iterator begin() { return _ht.begin(); } 
        iterator end() { return _ht.end(); } 
        const_iterator begin() const { return _ht.begin(); } 
        const_iterator end() const { return _ht.end(); } 
        pair<iterator, bool> insert( pair<K, V>& kv) {  _ht.(kv); } 
        V& []( K& key) { 
            pair<iterator, > ret = _ht.((key, ())); 
             ret.first->second; 
        } 
    : 
        hash_bucket::HashTable<K, pair< K, V>, MapKeyOfT> _ht; 
    }; 
}

3.3 unordered_set

#include"HashsTable.h" 
namespace lxp { 
    template<class K> class unordered_set { 
        struct SetKeyOfT { 
            const K& operator()(const K& key) { return key; } 
        }; 
    public: 
        typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator iterator; 
        typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator const_iterator; 
        const_iterator begin() const { return _ht.begin(); } 
        const_iterator end() const { return _ht.end(); } 
        pair<const_iterator, bool> insert(const K& key) { 
            pair<typename hash_bucket::HashTable<K, K, SetKeyOfT>::iterator, bool> ret = _ht.Insert(key); 
            return pair<const_iterator, bool>(ret.first, ret.second); 
        } 
    private: 
        hash_bucket::HashTable<K, K, SetKeyOfT> _ht; 
    }; 
}

四、哈希的应用

4.1 位图

4.1.1 位图概念

面试题：给 40 亿个不重复的无符号整数，没排过序。给一个无符号整数，如何快速判断一个数是否在这 40 亿个数中。

遍历，时间复杂度 O(N)
排序 (O(NlogN))，利用二分查找
位图解决

所谓位图，就是用每一位来存放某种状态，适用于海量数据，数据无重复的场景。通常是用来判断某个数据存不存在的。文章配图

4.1.2 位图的实现

#include<vector> #include<iostream> using namespace std; 
namespace lxp { 
    template<size_t N> class bitset { 
    public: 
        bitset() { _a.resize(N / 32 + 1); } 
        void set(size_t x) { 
            size_t i = x / 32; size_t j = x % 32; 
            _a[i] |= (1 << j); 
        } 
        void reset(size_t x) { 
            size_t i = x / 32; size_t j = x % 32; 
            _a[i] &= (~(1 << j)); 
        } 
        bool test(size_t x) { 
            size_t i = x / 32; size_t j = x % 32; 
            return _a[i] & (1 << j); 
        } 
    private: 
        vector<int> _a; 
    }; 
    template<size_t N> class twobitset { 
    public: 
        void  { 
             (!_bs(x) && !_bs(x)) { _bs(x); } 
              (!_bs(x) && _bs(x)) { _bs(x); _bs(x); } 
        } 
        {  !_bs(x) && _bs(x); } 
    : 
        bitset<N> _bs1; bitset<N> _bs2; 
    }; 
}

4.1.3 位图的应用

快速查找某个数据是否在一个集合中
排序 + 去重
求两个集合的交集、并集等
操作系统中磁盘块标记

4.2 布隆过滤器

4.2.1 布隆过滤器提出

新闻客户端推荐系统如何实现推送去重？用服务器记录了用户看过的所有历史记录。如何快速查找呢？

用哈希表存储用户记录，缺点：浪费空间
用位图存储用户记录，缺点：位图一般只能处理整形
将哈希与位图结合，即布隆过滤器

4.2.2 布隆过滤器概念

布隆过滤器是由 Burton Howard Bloom 在 1970 年提出的一种紧凑型的、比较巧妙的概率型数据结构，特点是高效地插入和查询，可以用来告诉你'某样东西一定不存在或者可能存在'，它是用多个哈希函数，将一个数据映射到位图结构中。文章配图

4.2.3 布隆过滤器的插入

文章配图

struct BKDRHash { 
    size_t operator()(const string& s) { 
        size_t value = 0; 
        for (auto ch : s) { value *= 31; value += ch; } 
        return value; 
    } 
}; 
struct APHash { 
    size_t operator()(const string& s) { 
        size_t hash = 0; 
        for (long i = 0; i < s.size(); i++) { 
            if ((i & 1) == 0) { hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3)); } 
            else { hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5))); } 
        } 
        return hash; 
    } 
}; 
struct DJBHash { 
    size_t operator()(const string& s) { 
        size_t hash = 5381; 
        for (auto ch : s) { hash += (hash << 5) + ch; } 
        return hash; 
    } 
}; 
template<size_t N, size_t X = ,  K = string,  HashFunc1 = BKDRHash,  HashFunc2 = APHash,  HashFunc3 = DJBHash> 
 BloomFilter { 
: 
     ( K& key) { 
         len = X * N; 
         index1 = ()(key) % len; 
         index2 = ()(key) % len; 
         index3 = ()(key) % len; 
        _bs.(index1); _bs.(index2); _bs.(index3); 
    } 
    { 
         len = X * N; 
         index1 = ()(key) % len; 
         (_bs.(index1) == )  ; 
         index2 = ()(key) % len; 
         (_bs.(index2) == )  ; 
         index3 = ()(key) % len; 
         (_bs.(index3) == )  ; 
         ; 
    } 
    ; 
: 
    bitset<X* N> _bs; 
};

4.2.4 布隆过滤器的查找

布隆过滤器的思想是将一个元素用多个哈希函数映射到一个位图中，因此被映射到的位置的比特位一定为 1。所以可以按照以下方式进行查找：分别计算每个哈希值对应的比特位置存储的是否为零，只要有一个为零，代表该元素一定不在哈希表中，否则可能在哈希表中。注意：布隆过滤器如果说某个元素不存在时，该元素一定不存在，如果该元素存在时，该元素可能存在，因为有些哈希函数存在一定的误判。

4.2.5 布隆过滤器删除

布隆过滤器不能直接支持删除工作，因为在删除一个元素时，可能会影响其他元素。一种支持删除的方法：将布隆过滤器中的每个比特位扩展成一个小的计数器，插入元素时给 k 个计数器加一，删除元素时，给 k 个计数器减一，通过多占用几倍存储空间的代价来增加删除操作。缺陷：无法确认元素是否真正在布隆过滤器中；存在计数回绕。

4.2.6 布隆过滤器优点

增加和查询元素的时间复杂度为 O(K)，与数据量大小无关
哈希函数相互之间没有关系，方便硬件并行运算
不需要存储元素本身，在某些对保密要求比较严格的场合有很大优势
在能够承受一定的误判时，比其他数据结构有很大的空间优势
数据量很大时，可以表示全集
使用同一组散列函数的布隆过滤器可以进行交、并、差运算

4.2.7 布隆过滤器缺陷

有误判率，即存在假阳性 (False Position)
不能获取元素本身
一般情况下不能从布隆过滤器中删除元素
如果采用计数方式删除，可能会存在计数回绕问题

C++ 哈希表使用与底层实现原理

一、unordered 系列关联式容器

1.1 unordered_map

1.1.1 unordered_map 的文档介绍

1.1.2 unordered_map 的接口说明

1.2 标准库中的 unordered_map

1.2.1 unordered_map 的介绍

二、底层结构

2.1 哈希概念

2.2 哈希冲突

2.3 哈希函数

2.4 哈希冲突解决

2.4.1 闭散列

2.4.2 开散列

2.5 开散列与闭散列比较

三、模拟实现

3.1 哈希表的改造

3.1. 模板参数列表的改造

3.2. 增加迭代器操作

3.3 增加通过 key 获取 value 操作

3.2 unordered_map

3.3 unordered_set

四、哈希的应用

4.1 位图

4.1.1 位图概念

4.1.2 位图的实现

4.1.3 位图的应用

4.2 布隆过滤器

4.2.1 布隆过滤器提出

4.2.2 布隆过滤器概念

4.2.3 布隆过滤器的插入

4.2.4 布隆过滤器的查找

4.2.5 布隆过滤器删除

4.2.6 布隆过滤器优点

4.2.7 布隆过滤器缺陷

更多推荐文章

相关免费在线工具

C++ 哈希表使用与底层实现原理

一、unordered 系列关联式容器

1.1 unordered_map

1.1.1 unordered_map 的文档介绍

1.1.2 unordered_map 的接口说明

1.2 标准库中的 unordered_map

1.2.1 unordered_map 的介绍

二、底层结构

2.1 哈希概念

2.2 哈希冲突

2.3 哈希函数

2.4 哈希冲突解决

2.4.1 闭散列

2.4.2 开散列

2.5 开散列与闭散列比较

三、模拟实现

3.1 哈希表的改造

3.1. 模板参数列表的改造

3.2. 增加迭代器操作

3.3 增加通过 key 获取 value 操作

3.2 unordered_map

3.3 unordered_set

四、哈希的应用

4.1 位图

4.1.1 位图概念

4.1.2 位图的实现

4.1.3 位图的应用

4.2 布隆过滤器

4.2.1 布隆过滤器提出

4.2.2 布隆过滤器概念

4.2.3 布隆过滤器的插入

4.2.4 布隆过滤器的查找

4.2.5 布隆过滤器删除

4.2.6 布隆过滤器优点

4.2.7 布隆过滤器缺陷

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具