Map Implementation

| 2021-10-23

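The Map Data Structure

"A map is an abstract data structure that contains a collection of (key, value) ordered pairs." That is how Wikipedia explains it.

Concrete implementations are usually built on a hash table or a search tree. Many programming languages and storage systems ship map as a built-in basic data type.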

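How Hash Tables Work

A hash table offers O(1) reads and writes on average, plus a mapping from keys to values. Implementing one still requires solving two key problems: the hash function and collision resolution.

Hash Function

The key to implementing a hash table is the choice of hash function, which largely determines read and write performance. Ideally, with a uniformly distributed hash function, any key can be located in O(1) time. Go uses an AES-based hash when the CPU supports AES instructions and falls back to memhash otherwise. Redis has used a DJB-style hash function (newer versions use SipHash). Every Java object has a built-in hashCode method.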
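To make the idea concrete, here is a minimal sketch of the classic DJB string hash in Go. The names djb2 and indexFor are illustrative only, and this is not a copy of the hash function used by Redis or by the Go runtime.

```go
package main

// djb2 is a sketch of the classic DJB string hash: start from 5381 and
// fold in each byte as h = h*33 + c. Overflow wraps around modulo 2^32.
func djb2(key string) uint32 {
	h := uint32(5381)
	for i := 0; i < len(key); i++ {
		h = h*33 + uint32(key[i])
	}
	return h
}

// indexFor maps a hash value onto one of m buckets; m must be greater than 0.
func indexFor(h uint32, m uint32) uint32 {
	return h % m
}
```

Calling indexFor(djb2(key), m) gives the bucket for a key. A good hash function spreads keys evenly over the m buckets, which is what keeps lookups close to O(1).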

Collision Resolution

Open Addressing

$hash_{i} = (hash(key) + d_{i}) \bmod m, \quad i = 1, 2, \ldots, k \ (k \leq m - 1)$

Here hash(key) is the hash function, m is the length of the hash table, $d_{i}$ is the increment sequence, and i is the number of collisions so far. The increment sequence can be chosen in the following ways:

$d_{i} = 1, 2, 3, \ldots, (m - 1)$ is called linear probing; that is, $d_{i} = i$, or some other linear function. This amounts to probing cells one by one until an empty cell is found, and storing the entry there.

$d_{i} = \pm 1^{2}, \pm 2^{2}, \pm 3^{2}, \ldots, \pm k^{2} \ (k \leq m / 2)$ is called quadratic probing.

$d_{i} = i \cdot hash_{2}(key)$, in which case $hash_{i} = (hash_{1}(key) + i \cdot hash_{2}(key)) \bmod m$; this is called double hashing.
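The linear probing case ($d_{i} = i$) can be sketched as follows. This is a minimal illustration under stated assumptions: the table has a fixed size, deletion is left out (it would need tombstones), and the type and function names (probeTable, put, get) and the use of FNV-1a from the standard library are choices made for this example only.

```go
package main

import "hash/fnv"

// slot is one cell of the open-addressing table.
type slot struct {
	key   string
	value int
	used  bool
}

// probeTable is a fixed-size table that resolves collisions by linear probing.
type probeTable struct {
	slots []slot
}

func newProbeTable(m int) *probeTable {
	return &probeTable{slots: make([]slot, m)}
}

// home returns hash(key) mod m, the first slot to try.
func (p *probeTable) home(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(p.slots)))
}

// put probes home, home+1, home+2, ... (mod m) until it finds an empty
// slot or the same key. It returns false when the table is full.
func (p *probeTable) put(key string, value int) bool {
	m := len(p.slots)
	start := p.home(key)
	for i := 0; i < m; i++ {
		s := &p.slots[(start+i)%m]
		if !s.used || s.key == key {
			*s = slot{key: key, value: value, used: true}
			return true
		}
	}
	return false
}

// get follows the same probe sequence; an empty slot ends the search.
func (p *probeTable) get(key string) (int, bool) {
	m := len(p.slots)
	start := p.home(key)
	for i := 0; i < m; i++ {
		s := p.slots[(start+i)%m]
		if !s.used {
			return 0, false
		}
		if s.key == key {
			return s.value, true
		}
	}
	return 0, false
}
```

Because every entry lives in the slot array itself, the table can never hold more than m entries, which is why put has to report failure once all slots are used.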

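Separate Chaining

Separate chaining is commonly implemented as an array of linked lists (doubly linked lists in some implementations). Keys that map to the same index are stored in the list for that index.

Reads and writes generally go through three steps: computing the hash h = hash(key), locating the bucket index = h % m, and walking the linked list.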
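The three steps (hash, locate bucket, walk the list) can be sketched like this. For brevity the sketch uses singly linked lists rather than doubly linked ones, and the names chainedMap, put and get as well as the FNV-1a hash are assumptions made for illustration.

```go
package main

import "hash/fnv"

// entry is one node in a bucket's linked list.
type entry struct {
	key   string
	value int
	next  *entry
}

// chainedMap is an array of list heads, one per bucket.
type chainedMap struct {
	buckets []*entry
	count   int
}

func newChainedMap(m int) *chainedMap {
	return &chainedMap{buckets: make([]*entry, m)}
}

// index computes h = hash(key) and then the bucket index h % m.
func (c *chainedMap) index(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(c.buckets)))
}

// put walks the list for the key's bucket, updating in place if the key
// exists and prepending a new node otherwise.
func (c *chainedMap) put(key string, value int) {
	i := c.index(key)
	for e := c.buckets[i]; e != nil; e = e.next {
		if e.key == key {
			e.value = value
			return
		}
	}
	c.buckets[i] = &entry{key: key, value: value, next: c.buckets[i]}
	c.count++
}

// get walks the same list looking for the key.
func (c *chainedMap) get(key string) (int, bool) {
	for e := c.buckets[c.index(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}
```

Unlike open addressing, a chained table keeps accepting entries when it has more keys than buckets; the lists simply get longer and lookups slower.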

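Load Factor

load factor = $\frac{n}{k}$, where n is the number of elements actually stored in the hash table and k is the number of buckets.

With both open addressing and separate chaining, the larger the load factor, the worse the read and write performance. With open addressing the load factor can never exceed 1; with separate chaining it can, but implementations generally keep it from going above 1.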

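Dynamic Resizing

When the load factor exceeds the threshold configured for the hash table, a dynamic resize is usually triggered; this process is commonly called rehashing.

In a real implementation several factors can trigger a rehash. Rehashing generally consists of two steps: increasing the size of the hash table, and remapping the existing elements into the new buckets.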
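The two steps of a full (all-at-once) rehash can be sketched as below. The 0.75 threshold, the doubling policy, the use of uint64 keys as their own hash, and the names growTable, insert and rehash are all assumptions for this example, not taken from any particular implementation.

```go
package main

// maxLoad is an assumed load-factor threshold for this sketch.
const maxLoad = 0.75

// growTable is a chained set of uint64 keys; each key is its own hash.
type growTable struct {
	buckets [][]uint64
	count   int
}

func newGrowTable(k int) *growTable {
	return &growTable{buckets: make([][]uint64, k)}
}

// loadFactor returns n/k: stored elements divided by bucket count.
func (g *growTable) loadFactor() float64 {
	return float64(g.count) / float64(len(g.buckets))
}

func (g *growTable) contains(key uint64) bool {
	for _, k := range g.buckets[key%uint64(len(g.buckets))] {
		if k == key {
			return true
		}
	}
	return false
}

// rehash allocates a table of newSize buckets and remaps every element.
func (g *growTable) rehash(newSize int) {
	old := g.buckets
	g.buckets = make([][]uint64, newSize)
	for _, b := range old {
		for _, k := range b {
			i := k % uint64(newSize)
			g.buckets[i] = append(g.buckets[i], k)
		}
	}
}

// insert doubles the table first if adding one more element would push
// the load factor above maxLoad.
func (g *growTable) insert(key uint64) {
	if g.contains(key) {
		return
	}
	if float64(g.count+1)/float64(len(g.buckets)) > maxLoad {
		g.rehash(2 * len(g.buckets))
	}
	i := key % uint64(len(g.buckets))
	g.buckets[i] = append(g.buckets[i], key)
	g.count++
}
```

The cost of rehash is proportional to the number of stored elements, so a single insert can stall for a long time on a large table. That stall is the motivation for the non-full migration strategies.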

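To avoid wasting memory, a rehash in practice does not necessarily allocate a larger table. On the other hand, when most elements are deleted, most implementations do not shrink the number of buckets automatically; the memory held by the buckets is usually far smaller than the memory held by the keys and values.

Once the resize decision has been made and the new table allocated, the elements have to be migrated. Migration can be done in full, all at once and stop-the-world (STW), or it can be non-full. Some hash tables, such as those in real-time systems and disk-based hash tables, generally choose non-full migration. Common non-full approaches include incremental rehashing, monotonic keys, linear hashing, and distributed hash tables. The monotonic-key approach shards keys by range; ConcurrentHashMap follows a similar idea. Here we focus on incremental rehashing.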

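Incremental rehash

Allocate a new hash table and leave the old table unchanged.
On every lookup or delete, check both the old and the new table.
Inserts go into the new table.
Every insert also migrates r elements from the old table to the new table.
When all elements of the old table have been migrated, free the old table.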
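The steps above can be sketched as follows. This is an illustration under several assumptions: uint64 keys act as their own hash, growth starts when the load factor reaches 1, and each insert migrates one old bucket rather than r individual elements (a variant that keeps the code short). The names incTable, migrateStep, put, get and remove are hypothetical.

```go
package main

type pair struct{ key, value uint64 }

// incTable keeps two bucket arrays while a rehash is in progress.
type incTable struct {
	cur     [][]pair // new table: receives all inserts
	old     [][]pair // old table: nil when no rehash is in progress
	nextOld int      // index of the next old bucket to migrate
	count   int
}

// stepBuckets is how many old buckets each insert migrates (assumed value).
const stepBuckets = 1

// newIncTable creates a table with n buckets; n must be greater than 0.
func newIncTable(n int) *incTable { return &incTable{cur: make([][]pair, n)} }

// lookup returns the bucket index for key and its position, or -1 if absent.
func lookup(tbl [][]pair, key uint64) (bucket, pos int) {
	if len(tbl) == 0 {
		return 0, -1
	}
	bucket = int(key % uint64(len(tbl)))
	for i, p := range tbl[bucket] {
		if p.key == key {
			return bucket, i
		}
	}
	return bucket, -1
}

func removeAt(tbl [][]pair, bucket, pos int) {
	b := tbl[bucket]
	b[pos] = b[len(b)-1]
	tbl[bucket] = b[:len(b)-1]
}

// migrateStep moves up to stepBuckets old buckets into the new table and
// releases the old table once every bucket has been moved.
func (t *incTable) migrateStep() {
	for n := 0; n < stepBuckets && t.old != nil; n++ {
		for _, p := range t.old[t.nextOld] {
			i := int(p.key % uint64(len(t.cur)))
			t.cur[i] = append(t.cur[i], p)
		}
		t.old[t.nextOld] = nil
		t.nextOld++
		if t.nextOld == len(t.old) {
			t.old = nil
		}
	}
}

// get checks the new table first, then the old one.
func (t *incTable) get(key uint64) (uint64, bool) {
	if b, i := lookup(t.cur, key); i >= 0 {
		return t.cur[b][i].value, true
	}
	if b, i := lookup(t.old, key); i >= 0 {
		return t.old[b][i].value, true
	}
	return 0, false
}

// remove checks both tables.
func (t *incTable) remove(key uint64) bool {
	for _, tbl := range [][][]pair{t.cur, t.old} {
		if b, i := lookup(tbl, key); i >= 0 {
			removeAt(tbl, b, i)
			t.count--
			return true
		}
	}
	return false
}

// put starts a rehash when needed, migrates a little, then writes to the
// new table. A stale copy in the old table is dropped first.
func (t *incTable) put(key, value uint64) {
	if t.old == nil && t.count >= len(t.cur) {
		t.old, t.cur, t.nextOld = t.cur, make([][]pair, 2*len(t.cur)), 0
	}
	t.migrateStep()
	if b, i := lookup(t.old, key); i >= 0 {
		removeAt(t.old, b, i)
		t.count--
	}
	if b, i := lookup(t.cur, key); i >= 0 {
		t.cur[b][i].value = value
		return
	}
	b := int(key % uint64(len(t.cur)))
	t.cur[b] = append(t.cur[b], pair{key, value})
	t.count++
}
```

Since the new table has twice as many buckets as the old one and every insert moves one old bucket, the migration always finishes before the new table itself needs to grow.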

Go map Implementation

Data Structures

// A header for a Go map.
type hmap struct {
	// Note: the format of the hmap is also encoded in cmd/compile/internal/reflectdata/reflect.go.
	// Make sure this stays in sync with the compiler's definition.
	count     int // # live cells == size of map.  Must be first (used by len() builtin)
	flags     uint8
	B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items)
	noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
	hash0     uint32 // hash seed


	buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
	oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing
	nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated)


	extra *mapextra // optional fields
}


// mapextra holds fields that are not present on all maps.
type mapextra struct {
	// If both key and elem do not contain pointers and are inline, then we mark bucket
	// type as containing no pointers. This avoids scanning such maps.
	// However, bmap.overflow is a pointer. In order to keep overflow buckets
	// alive, we store pointers to all overflow buckets in hmap.extra.overflow and hmap.extra.oldoverflow.
	// overflow and oldoverflow are only used if key and elem do not contain pointers.
	// overflow contains overflow buckets for hmap.buckets.
	// oldoverflow contains overflow buckets for hmap.oldbuckets.
	// The indirection allows to store a pointer to the slice in hiter.
	overflow    *[]*bmap
	oldoverflow *[]*bmap


	// nextOverflow holds a pointer to a free overflow bucket.
	nextOverflow *bmap
}

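count: the number of elements currently in the hash table;
B: the base-2 logarithm of the number of buckets, i.e. $\log_{2}(len(buckets))$;
buckets: points to the array of buckets; each bucket is a bmap struct that holds bucketCnt (8) elements;
oldbuckets: holds the previous buckets array while the map is growing; it is half the size of the current buckets (or the same size during a same-size grow).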

// A bucket for a Go map.
type bmap struct {
	// tophash generally contains the top byte of the hash value
	// for each key in this bucket. If tophash[0] < minTopHash,
	// tophash[0] is a bucket evacuation state instead.
	tophash [bucketCnt]uint8
	// Followed by bucketCnt keys and then bucketCnt elems.
	// NOTE: packing all the keys together and then all the elems together makes the
	// code a bit more complicated than alternating key/elem/key/elem/... but it allows
	// us to eliminate padding which would be needed for, e.g., map[int64]int8.
	// Followed by an overflow pointer.
}

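As of Go 1.17.1, the remaining fields of bmap can be reconstructed from MapBucketType in cmd/compile/internal/reflectdata/reflect.go, which builds the bucket type at compile time:

type bmap struct {
	topbits  [8]uint8
	keys     [8]keytype
	elems    [8]elemtype
	overflow uintptr
}

When the elements of a bucket (bmap) overflow, a new bmap is created and linked through the overflow pointer, forming a linked list.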
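The bucket-plus-overflow layout can be modeled with ordinary Go types. The following is a simplified sketch, not the runtime's code: keys are fixed to uint64 and elems to string, a tophash of 0 marks an empty slot, and callers must pass a nonzero top value.

```go
package main

const bucketCnt = 8

// bucket mirrors the layout described above with concrete types;
// uint64 keys and string elems are assumptions for this sketch.
type bucket struct {
	tophash  [bucketCnt]uint8 // 0 marks an empty slot in this sketch
	keys     [bucketCnt]uint64
	elems    [bucketCnt]string
	overflow *bucket
}

// insert stores the entry in the first free slot of the chain, updating
// in place if the key is already there. A new overflow bucket is linked
// in when every bucket in the chain is full.
func (b *bucket) insert(top uint8, key uint64, elem string) {
	for {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] == 0 || (b.tophash[i] == top && b.keys[i] == key) {
				b.tophash[i], b.keys[i], b.elems[i] = top, key, elem
				return
			}
		}
		if b.overflow == nil {
			b.overflow = &bucket{}
		}
		b = b.overflow
	}
}

// lookup compares the cheap tophash byte first and the full key second,
// following the overflow chain until it ends.
func (b *bucket) lookup(top uint8, key uint64) (string, bool) {
	for ; b != nil; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] == top && b.keys[i] == key {
				return b.elems[i], true
			}
		}
	}
	return "", false
}

// chainLen counts the buckets in the chain, including the first one.
func (b *bucket) chainLen() int {
	n := 0
	for ; b != nil; b = b.overflow {
		n++
	}
	return n
}
```

Keeping the tophash bytes together at the front means a lookup can reject most slots by comparing a single byte before it ever touches a key.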

Runtime Type Representation

type maptype struct {
	typ    _type
	key    *_type
	elem   *_type
	bucket *_type // internal type representing a hash bucket
	// function for hashing keys (ptr to key, seed) -> hash
	hasher     func(unsafe.Pointer, uintptr) uintptr
	keysize    uint8  // size of key slot
	elemsize   uint8  // size of elem slot
	bucketsize uint16 // size of bucket
	flags      uint32
}
// flags is filled in with bitwise ORs by the writeType function in ../cmd/compile/internal/reflectdata/reflect.go.

Access

The low B bits of the hash are the index that selects a bucket in hmap.buckets; the high 8 bits are used to locate the entry within the bmap bucket.
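How a hash value splits into these two parts can be sketched as below, assuming a 64-bit hash. The helper names lowBitsIndex and topHash are made up for this example; the value 5 for minTopHash follows the runtime's convention of reserving small tophash values as slot and evacuation state markers.

```go
package main

// minTopHash: tophash values below this are reserved as state markers.
const minTopHash = 5

// lowBitsIndex returns the low B bits of hash, i.e. hash & (1<<B - 1).
func lowBitsIndex(hash uint64, B uint8) uint64 {
	return hash & (uint64(1)<<B - 1)
}

// topHash returns the top 8 bits of a 64-bit hash, shifted past the
// reserved range so it never collides with a state marker.
func topHash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}
```

With B = 4 there are 16 buckets, so a hash ending in 0x13 lands in bucket 3, and its top byte is what gets stored in the bucket's tophash array.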

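In Go, an operation such as hash[key] is converted into an OINDEXMAP operation during the compiler's type-checking phase. During intermediate code generation, OINDEXMAP is then lowered to code like the following:

v := hash[key] // => v := *mapaccess1(maptype, hash, &key)
v, ok := hash[key] // => v, ok := *mapaccess2(maptype, hash, &key)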

func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
	if raceenabled && h != nil {
		callerpc := getcallerpc()
		pc := funcPC(mapaccess1)
		racereadpc(unsafe.Pointer(h), callerpc, pc)
		raceReadObjectPC(t.key, key, callerpc, pc)
	}
	if msanenabled && h != nil {
		msanread(key, t.key.size)
	}
	if h == nil || h.count == 0 {
		if t.hashMightPanic() {
			t.hasher(key, 0) // see issue 23734
		}
		return unsafe.Pointer(&zeroVal[0])
	}
	// Go maps do not support concurrent reads and writes; if you need that, use sync.Map or look at Java's ConcurrentHashMap
	if h.flags&hashWriting != 0 { 
		throw("concurrent map read and map write")
	}
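	hash := t.hasher(key, uintptr(h.hash0)) // compute the hash of key
	m := bucketMask(h.B) // m = 1<<B - 1
	// use hash&m to pick the bucket and load its bmap
	b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))
	// if a grow is still in progress and the old bucket for this key has not been evacuated, search the old bucket instead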
	if c := h.oldbuckets; c != nil { 
		if !h.sameSizeGrow() {
			// There used to be half as many buckets; mask down one more power of two.
			m >>= 1
		}
		oldb := (*bmap)(add(c, (hash&m)*uintptr(t.bucketsize)))
		if !evacuated(oldb) {
			b = oldb
		}
	}
	top := tophash(hash) 
	// the outer loop walks b and its overflow chain; the inner loop scans b.tophash, matching the high 8 bits of the hash and then the key
bucketloop:
	for ; b != nil; b = b.overflow(t) {
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			if t.indirectkey() {
				k = *((*unsafe.Pointer)(k))
			}
			if t.key.equal(key, k) {
				e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
				if t.indirectelem() {
					e = *((*unsafe.Pointer)(e))
				}
				return e
			}
		}
	}
	return unsafe.Pointer(&zeroVal[0])
}
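From the user's side, the early return on a nil or empty map and the final return of the zero value are what make reads of missing keys safe. A small sketch, where find is just an illustrative wrapper around the comma-ok form:

```go
package main

// find wraps the comma-ok lookup. Reading from a nil map or asking for a
// missing key yields the zero value of the element type and ok == false.
func find(m map[string]int, key string) (int, bool) {
	v, ok := m[key]
	return v, ok
}
```

The comma-ok form is the only way to tell a stored zero value apart from a missing key, since both read as 0 here.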

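Insert or Update

The mapassign function only returns the memory address to assign to; it does not write the value itself. The store is performed by the code the compiler emits at the call site.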

// Like mapaccess, but allocates a slot for the key if it is not present in the map.
func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
	if h == nil {
		panic(plainError("assignment to entry in nil map"))
	}
	if raceenabled {
		callerpc := getcallerpc()
		pc := funcPC(mapassign)
		racewritepc(unsafe.Pointer(h), callerpc, pc)
		raceReadObjectPC(t.key, key, callerpc, pc)
	}
	if msanenabled {
		msanread(key, t.key.size)
	}
	if h.flags&hashWriting != 0 {
		throw("concurrent map writes")
	}
	hash := t.hasher(key, uintptr(h.hash0))


	// Set hashWriting after calling t.hasher, since t.hasher may panic,
	// in which case we have not actually done a write.
	h.flags ^= hashWriting


	if h.buckets == nil {
		h.buckets = newobject(t.bucket) // newarray(t.bucket, 1)
	}


again:
	bucket := hash & bucketMask(h.B)
	if h.growing() {
		growWork(t, h, bucket)
	}
	b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))
	top := tophash(hash)


	var inserti *uint8 // inserti: pointer to the tophash slot of the target entry
	var insertk unsafe.Pointer // insertk: address of the target key within the bucket
	var elem unsafe.Pointer // elem: address of the target value within the bucket
bucketloop:
	for { // two nested loops search for key
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if isEmpty(b.tophash[i]) && inserti == nil {
					inserti = &b.tophash[i] 
					insertk = add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize)) 
					elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
				}
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			if t.indirectkey() {
				k = *((*unsafe.Pointer)(k))
			}
			if !t.key.equal(key, k) {
				continue
			}
			// already have a mapping for key. Update it.
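			// Check whether this key type needs the stored key overwritten on update; under the hood this tests maptype.flags&8 != 0.
			// flags is filled in with bitwise ORs by the writeType function mentioned above.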
			if t.needkeyupdate() { 
				typedmemmove(t.key, k, key)
			}
			elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
			goto done
		}
		ovf := b.overflow(t)
		if ovf == nil {
			break
		}
		b = ovf
	}


	// Did not find mapping for key. Allocate new cell & add entry.


	// If we hit the max load factor or we have too many overflow buckets,
	// and we're not already in the middle of growing, start growing.
	if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {
		hashGrow(t, h)
		goto again // Growing the table invalidates everything, so try again
	}


	if inserti == nil {
		// The current bucket and all the overflow buckets connected to it are full, allocate a new one.
		newb := h.newoverflow(t, b)
		inserti = &newb.tophash[0]
		insertk = add(unsafe.Pointer(newb), dataOffset)
		elem = add(insertk, bucketCnt*uintptr(t.keysize))
	}


	// store new key/elem at insert position
	if t.indirectkey() {
		kmem := newobject(t.key)
		*(*unsafe.Pointer)(insertk) = kmem
		insertk = kmem
	}
	if t.indirectelem() {
		vmem := newobject(t.elem)
		*(*unsafe.Pointer)(elem) = vmem
	}
	typedmemmove(t.key, insertk, key)
	*inserti = top
	h.count++


done:
	if h.flags&hashWriting == 0 {
		throw("concurrent map writes")
	}
	h.flags &^= hashWriting
	if t.indirectelem() {
		elem = *((*unsafe.Pointer)(elem))
	}
	return elem
}
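The first check in mapassign is visible from ordinary Go code: writing to a nil map panics, while reading from one does not. A minimal sketch, where assignPanics is a hypothetical helper that recovers the panic so the behavior can be observed:

```go
package main

// assignPanics reports whether writing to m panics. A nil map triggers
// the "assignment to entry in nil map" panic raised at the top of mapassign.
func assignPanics(m map[string]int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	m["k"] = 1
	return false
}
```

Note that only this nil-map case is an ordinary, recoverable panic. The concurrent map writes failure goes through throw, which terminates the program and cannot be caught with recover.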
