Author | XY

Introduction

Funnel optimization is a constant topic in search advertising systems. Over the past year, advertising funnel optimization has shifted away from the old habit of "adding layers" toward simplifying the funnel to improve the consistency of the whole system. For an advertising library of Baidu's scale, with its traffic volume and complex business rules, reaching a minimal number of funnel stages demands the most efficient strategy design and the most extreme engineering implementation. This article focuses on how Baidu engineers "dug into the details" of the inverted-index data structure to achieve inverted-list recall without truncation, and we hope it offers some inspiration for building high-performance systems.

The full text is 6162 words, and the expected reading time is 16 minutes.

As everyone knows, the advertising funnel consists of three stages: recall, coarse ranking, and fine ranking. The ideal funnel is regular, wide at the top and narrow at the bottom, but in reality, for various reasons, the funnel has drifted from this shape, and many business lines keep adding complexity. What we want to achieve is: consistent models, a streamlined funnel, and a system that is "Limitless" end to end.

[figure]

What "Limitless" means to us: the truth is in the details; challenge the performance limits of software engineering so that the funnel can approach zero truncation.

Today I want to talk about how we dug out the details in the "BS Limitless" project. The whole project is very challenging, spanning networking, computing, and storage, and is hard to cover in one short article, so I decided to pick a single data structure, the inverted list, to give you a feel for what "extreme" optimization looks like.

Let me first introduce the inverted list before optimization. Its composition is classic: a HashMap from keysign to a chain. A retrieval PV scans the zipper (a SkipList) for each of the N triggered terms (keysigns). The delivery characteristics of the advertising business naturally produce long chains and super-long chains, so the linked lists have to be kept ordered. Anyone who has worked on funnels knows that at the recall stage there is very little information available for ranking, which is exactly why scanning without truncation ("Limitless") is so valuable to the business.
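As a rough illustration (all names and types here are hypothetical, not the actual BS code), the pre-optimization structure can be thought of as a hash map from keysign to an ordered, node-based chain:

```cpp
#include <cstdint>
#include <set>            // stands in for the ordered SkipList zipper
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the pre-optimization inverted list: a hash map from the
// term signature (keysign) to an ordered chain of postings. The real system uses a
// SkipList; std::set is used here only to convey "ordered, node-based storage".
using KeySign = uint64_t;

struct Posting {
    uint64_t ad_id;
    float    weight;   // the little ranking information available at recall time
    bool operator<(const Posting& o) const { return weight > o.weight; }
};

using Chain        = std::set<Posting>;                    // per-node allocation, cache-unfriendly
using InvertedList = std::unordered_map<KeySign, Chain>;

// A retrieval PV triggers N keysigns and scans each chain in order.
inline void Scan(const InvertedList& index, const std::vector<KeySign>& keys) {
    for (KeySign k : keys) {
        auto it = index.find(k);
        if (it == index.end()) continue;
        for (const Posting& p : it->second) {
            // ... recall logic; every hop to the next node risks a cache miss
            (void)p;
        }
    }
}
```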

[figure]

This small data structure carries requirements from all sides: 1. read performance determines how many candidates can be released within the limited time budget; 2. real-time insertion determines whether a customer's delivery takes effect immediately; 3. the single database is huge, so memory consumption must stay low. These are all reasonable engineering requirements, but together they are indeed a case of "both... and... and...". In the Limitless context, we had to dramatically improve (1) to make scanning limitless without hurting (2) and (3).

Looking back, why is Limitless bottlenecked by such a simple data structure? Recall computer architecture: the CPU does not access memory directly but goes through layers of caches, and the lower the storage level, the larger the capacity but the higher the latency; there are order-of-magnitude gaps between main memory, L2, and L1[1]. A linked-list data structure obviously lacks spatial locality.

[figure]

How much worse is scanning a cache-unfriendly List than a cache-friendly Array? We ran the following evaluation, where the horizontal axis is the chain (or array) length and the vertical axis is the average scan time per element: the gap is more than 10x. However, simply replacing the list with an array is not acceptable either: customers require updates to take effect in real time, so an array either pays inefficient O(N) copies on insertion or grows at an O(2N) rate, which cannot meet the memory-cost requirement.
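For reference, a minimal microbenchmark in the same spirit (not our original evaluation code) reproduces the effect: a node-based std::list scatters elements across the heap, while std::vector stores them contiguously and benefits from hardware prefetching.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <list>
#include <vector>

// Average per-element scan time of a container, in nanoseconds.
template <typename Container>
double AvgScanNs(const Container& c, int repeat = 100) {
    using Clock = std::chrono::steady_clock;
    uint64_t sink = 0;
    auto t0 = Clock::now();
    for (int r = 0; r < repeat; ++r)
        for (auto v : c) sink += v;
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - t0).count();
    if (sink == 42) std::puts("");  // keep the loops from being optimized away
    return double(ns) / (repeat * c.size());
}

int main() {
    const size_t n = 1 << 20;
    std::list<uint64_t> lst;
    std::vector<uint64_t> vec;
    for (size_t i = 0; i < n; ++i) { lst.push_back(i); vec.push_back(i); }
    std::printf("list: %.2f ns/elem, vector: %.2f ns/elem\n",
                AvgScanNs(lst), AvgScanNs(vec));
}
```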

[figure]

Based on the characteristics of the business and the requirements of Limitless, we designed and optimized to the extreme and launched a new-generation in-memory data structure, HybridIndexTable (HIT for short), an upgraded inverted list:

1. A HashMap still indexes the keysign;

2. Short chains use contiguous storage, while long chains use a prefix tree whose leaves are stored contiguously; the prefix tree draws on the industry's Adaptive Radix Tree (ART for short);

3. The contiguous storage for short chains and leaves uses a self-developed RowContainer (RC for short).

Beyond the data structure itself, there are two key details worth sharing:

1. Ordering by key is kept only as a safety net for ultra-long chains;

2. There is no ordering among RCs, and scanning proceeds with an RC as the unit.
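To make the overall shape concrete, here is a rough, hypothetical sketch of HIT's composition (type names and layout are illustrative, not the real HIT code): a short chain is a single contiguous RowContainer, a long chain is an ART whose leaves are RowContainers, and a scan visits one RC at a time.

```cpp
#include <cstdint>
#include <unordered_map>
#include <variant>
#include <vector>

using KeySign = uint64_t;

struct RowContainer {                 // append-only / mark-delete contiguous storage
    std::vector<uint64_t> rows;       // posting payloads
    std::vector<bool>     valid;      // mark-delete bitmap
    void Append(uint64_t row) { rows.push_back(row); valid.push_back(true); }
};

struct ArtLeaf { RowContainer rc; };
struct ArtTree {                      // stand-in for the adaptive radix tree over long chains
    // ... adaptive inner nodes (Node4/16/48/256) omitted; leaves are ArtLeaf
    std::vector<ArtLeaf> leaves;
};

// Each posting list is either a short chain (one RC) or a long chain (an ART of RCs).
using PostingList      = std::variant<RowContainer, ArtTree>;
using HybridIndexTable = std::unordered_map<KeySign, PostingList>;

// Scanning is done RC by RC; there is no ordering requirement across RCs.
template <typename Fn>
void ScanRcByRc(const PostingList& pl, Fn&& visit) {
    if (auto* rc = std::get_if<RowContainer>(&pl)) {
        visit(*rc);                                  // short chain: one contiguous scan
    } else if (auto* art = std::get_if<ArtTree>(&pl)) {
        for (const ArtLeaf& leaf : art->leaves)      // long chain: scan leaf RCs in turn
            visit(leaf.rc);
    }
}
```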

A data structure like HIT has the following three advantages, which match exactly the three requirements above.

  • High read performance: contiguous storage plus a prefix tree reduces cache misses, the thread-safety approach is reader-friendly, and the structure is optimized for the actual load;

  • High update timeliness: storage is contiguous yet supports append-only / mark-delete updates;

  • Low memory overhead: adaptive algorithms are applied extensively.

After HIT went online, it delivered considerable business benefits. On the road to Limitless, technology is productivity!

[figure]

The in-memory indexing field has long been cultivated around modern CPUs: since modern CPUs run so fast, cache coherence and memory access have become the bottleneck of high performance. In-memory indexing since the 2000s has produced several landmark results. FAST[4] is a typical example of architecture-aware data structure design; both its ideas and its results are well suited to static data. The CSB+-tree[2][3] proposed separating keys from values/pointers in the node layout so that one memory access can compare multiple keys, and it also proposed a lock-free approach to SMO (structure modification operations). And then there is the ART prefix tree[5]. Most ordered indexes are B-tree variants, so why did we take notice of prefix trees?

For a linked list, a cache miss occurs on every access to the next node, so the cache-miss complexity is O(n). For a sorted tree, a cache miss occurs when descending to the next level, so the cache-miss complexity is O(log n). For a prefix tree, cache misses depend only on the key length k and the fan-out span s, so the cache-miss complexity is O(k/s). As shown below[5], assuming k is the key length in bits: as the amount of stored data grows to 2^(k/8), 2^(k/4), ..., the height of a fully balanced tree keeps growing to k/8, k/4, ..., whereas the height of a prefix tree is constant for a given span, e.g. k/8 and k/4 for span = 8 and 4 respectively. Prefix trees are more cache-friendly.

[figure]

Here is a brief introduction to the prefix tree, also known as a Trie, Prefix tree, or Digital tree. The left figure is the example from Wikipedia: a typical prefix tree without any optimization. On the one hand, every character consumes one tree level, so even a short key such as "inn" takes several levels to reach; on the other hand, each branch node has to allocate 2^s pointers, so the memory overhead is very large when the actual fan-out is small.

The RadixTree is a prefix tree that saves memory by merging prefixes (path compression). Merging prefixes means compressing branch nodes that have only one child, so that every remaining node is either a leaf or a branch node with an actual fork. The radix in the name equals 2^span; when radix = 2 it is also called a Patricia tree[6]. Linux's page cache uses a RadixTree, and the core data structure of Ethereum (ETH) is the Merkle Patricia tree. The middle figure shows the example reorganized as a RadixTree.

Even with prefix merging, space is wasted because most fan-outs are not filled. For example, with RadixTree(radix = 256, span = 8) in the above example, even though the first level contains only 't', 'A', and 'i', it still allocates the other 253 pointers (253 × 8 bytes ≈ 2KB) for nothing. Reducing the radix (span = lg(radix)) is one way to save memory, but it increases the tree height and therefore cache misses[5]. In 2013, the Adaptive Radix Tree (ART)[5] solved this with adaptive node types: instead of a fixed Node256, it uses the most compact node representation that fits the data distribution. The paper divides inner nodes into four types, Node4, Node16, Node48, and Node256, chosen according to the number of children actually needed, and proves that the memory complexity of the tree is O(52N) bytes, where N is the amount of stored data. ART thus provides a way to increase fan-out and reduce tree height while also greatly reducing memory overhead. The figure shows the above RadixTree reorganized as an ART.
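The sketch below illustrates ART's adaptive inner-node types, simplified from the paper[5] (this is not the production HIT code). A node is upgraded to the next type only when it fills up, so memory usage tracks the number of children actually present, and only Node256 pays the full 256-pointer cost.

```cpp
#include <cstdint>

struct ArtNode { uint8_t type; uint8_t num_children; };

struct Node4 : ArtNode {              // up to 4 children: parallel key/child arrays
    uint8_t  keys[4];
    ArtNode* children[4];
};
struct Node16 : ArtNode {             // up to 16 children: keys comparable with SIMD in the paper
    uint8_t  keys[16];
    ArtNode* children[16];
};
struct Node48 : ArtNode {             // up to 48 children: 256-entry index into a child array
    uint8_t  child_index[256];        // 0xFF means "no child for this byte"
    ArtNode* children[48];
};
struct Node256 : ArtNode {            // up to 256 children: direct array indexed by key byte
    ArtNode* children[256];
};

// Child lookup dispatches on the node type.
inline ArtNode* FindChild(ArtNode* n, uint8_t byte) {
    switch (n->type) {
        case 4: {
            auto* p = static_cast<Node4*>(n);
            for (int i = 0; i < p->num_children; ++i)
                if (p->keys[i] == byte) return p->children[i];
            return nullptr;
        }
        case 16: {
            auto* p = static_cast<Node16*>(n);
            for (int i = 0; i < p->num_children; ++i)   // the paper uses an SSE compare here
                if (p->keys[i] == byte) return p->children[i];
            return nullptr;
        }
        case 48: {
            auto* p = static_cast<Node48*>(n);
            uint8_t idx = p->child_index[byte];
            return idx == 0xFF ? nullptr : p->children[idx];
        }
        case 256:
            return static_cast<Node256*>(n)->children[byte];
        default:
            return nullptr;
    }
}
```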

[figure]

We define RC_x as a contiguous storage area holding at most x elements, supporting Append-Only and Mark-Delete operations, with a per-element overhead of no more than 8 bytes. Let's look at the design points of RC.

Since RC only needs to support Append-Only and Mark-Delete modifications, its metadata consists of a cursor and a valids bitmap. At the same time, RCs of different capacities need to be designed and implemented separately in order to keep the theoretical per-element overhead within 8 bytes. We did not want to pay the memory cost of the virtual-table pointer that inheritance would introduce, so we adopted a dispatch-by-type implementation.

[figure]

RC1's metadata is 8 bytes, which easily holds the cursor and valids. So what should x be for the next level, RC_x? By amortized analysis, RC_x holds at least 2 elements, i.e. a 16-byte budget. Before spending it, we still need to consider the layout: RC_x has to hold more elements, so 8 of those bytes must go to a data pointer, leaving 8 bytes for metadata. In that case valids cannot use std::bitset, because a bitset occupies at least one word, i.e. 8 bytes. Packing cursor and valids with bit fields seems sufficient, and storing up to 58 elements would be no problem. In practice, however, considering the characteristics of the system allocator, we did not do exactly this.

Most mainstream system allocators are slab-based, so in practice, besides the theoretical per-element overhead, we also have to account for the alignment loss caused by the memory pool. On the one hand, chains consisting of a single RC1 account for about 80% of the whole, so this mass of small objects is well suited to memory-pool management; to avoid one extra address translation through the memory pool during retrieval, we store the pool's virtual address (vaddr) in the metadata and use it only at release time. On the other hand, allocating N*sizeof(data) blocks of many different sizes leaves each slab class poorly utilized internally, so RC16+ adopts an "array of data_array" organization.
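Below is a hedged sketch of the RC levels described above; the layouts are illustrative (the article notes the real implementation diverges from the pure bit-field scheme because of allocator behavior), but they show how the 8-byte-per-element budget is spent.

```cpp
#include <cstdint>
#include <vector>

// RC1: a single element plus 8 bytes of metadata (cursor, valid flag, pool vaddr bits).
struct RC1 {
    uint64_t meta;      // packed metadata
    uint64_t row;       // the single posting
};

// RC_x (x >= 2): 8 bytes of bit-field metadata plus an out-of-line data block,
// keeping the per-element overhead within 8 bytes.
struct RCx {
    uint64_t cursor : 6;    // next append position (enough to count past 58)
    uint64_t valids : 58;   // mark-delete bitmap, one bit per element
    uint64_t* data;         // contiguous block of up to 58 postings

    void Append(uint64_t row) {
        data[cursor] = row;
        valids |= (1ULL << cursor);
        ++cursor;
    }
    void MarkDelete(uint32_t i) { valids &= ~(1ULL << i); }
};

// RC16+: "array of data_array", so each slab size class stays well utilized;
// fixed-size blocks are appended one at a time instead of reallocating one big array.
struct RC16Plus {
    uint64_t cursor = 0;
    std::vector<uint64_t*> blocks;   // each block is a fixed-size data_array from the pool
};
```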

The RC design has many more implementation details, such as when to request extra space in a single allocation so as to balance memory cost against operating efficiency; for reasons of space I will not go into them here.

[figure]

Prefix trees beat B-trees in terms of cache misses, and ART further adopts adaptive node types, which increases fan-out to improve cache behavior while keeping memory overhead under control. In practice, however, ART still suffers from sparsely distributed keys: some subtrees have very small fan-out and become deep (relative to Node256), so the tree height cannot be fully reduced, which hurts the cache behavior of scans. HOT[7] proposes an adaptive, dynamic span to raise the average fan-out and thereby lower the tree height. Specifically, given a maximum node fan-out k, it looks for a partition of the tree such that each partition contains at most k-1 branch nodes while the maximum number of partitions along any leaf-to-root path is minimized. The practical effect is that sparse key segments get a large enough span and dense segments get a small enough span, so the overall fan-out approaches the maximum k, as shown in the middle figure.

For ART plus our scenario of scanning contiguous targets, raising fan-out and lowering tree height is still not enough. We found cases where the fan-out is high but the leaf data beneath it is very sparse and the total data volume is small, which clearly hurts scan performance. We propose LeafCompaction, which consists of a decider and an operation algorithm. Given a branch node, if the decider says "merge", the whole subtree is replaced by a single leaf node whose prefix equals the subtree's prefix and whose content is the subtree's data; subsequent inserts and deletes then follow the leaf's operation mode. If the decider says "keep", the node is left unchanged. Given a leaf node, if the decider says "split", it is replaced by a subtree with the same prefix whose content is generated by BulkLoad; otherwise it is kept. The default decider is based on the subtree's average and total data counts, and the node compression rate reaches 90%, as shown in the figure.
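Here is a hedged sketch of the LeafCompaction decision logic. The thresholds and the exact statistics are illustrative assumptions; the article only states that the default decider looks at the subtree's average and total data counts.

```cpp
#include <cstdint>

struct SubtreeStats {
    uint64_t total_rows;      // total postings stored under this branch node
    uint64_t leaf_count;      // number of leaves under it
};

enum class LcAction { kKeep, kMergeToLeaf, kSplitToTree };

// Branch node: if the subtree is small and its leaves are sparse on average,
// compact the whole subtree into one contiguous leaf (RC) so a scan touches one block.
inline LcAction DecideBranch(const SubtreeStats& s,
                             uint64_t max_total = 256,        // hypothetical threshold
                             uint64_t min_avg_per_leaf = 8) { // hypothetical threshold
    uint64_t avg = s.leaf_count ? s.total_rows / s.leaf_count : 0;
    if (s.total_rows <= max_total && avg < min_avg_per_leaf)
        return LcAction::kMergeToLeaf;
    return LcAction::kKeep;
}

// Compacted leaf: once it grows past the threshold, split it back into a subtree
// built by BulkLoad so point updates stay cheap.
inline LcAction DecideLeaf(uint64_t rows_in_leaf, uint64_t max_total = 256) {
    return rows_in_leaf > max_total ? LcAction::kSplitToTree : LcAction::kKeep;
}
```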

[figure]

Actual evaluation shows that, in sparse scenarios, the average per-element scan performance of the LC version is twice that of the plain version.

[figure]

Deeply optimized fine-grained locks are generally used to protect concurrent operations on an inverted data structure, but the performance jitter caused by lock contention would completely wipe out our memory-access optimizations, so we had to explore a lock-free safety model. When people hear "lock-free" they usually think of CAS, but CAS is not a silver bullet. The common way to write CAS is a while loop that retries until it succeeds; if 10 threads are modifying the tail of a linked list at high speed, CAS merely saves them from falling into the kernel, but they still keep redoing work, so parallelism is not fully unleashed; it is better to break the contention apart at the business level. On top of that, CAS has its own cost: under multi-core, the bus arbitration of the cache-coherence protocol stalls the pipeline.
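The generic CAS retry pattern discussed above looks like this (a textbook example, not BS code): every failed compare_exchange forces the loop to redo its work, and under heavy contention the contended cache line ping-pongs between cores.

```cpp
#include <atomic>
#include <cstdint>

struct Node { uint64_t ad_id; Node* next; };

// Lock-free push onto a shared list head via a CAS retry loop.
void PushFront(std::atomic<Node*>& head, Node* n) {
    Node* old_head = head.load(std::memory_order_relaxed);
    do {
        n->next = old_head;                       // work redone on every retry
    } while (!head.compare_exchange_weak(old_head, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}
```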

Read-Copy-Update (RCU) is a synchronization mechanism in the Linux kernel used to reduce reader-side lock overhead while still providing a safe write path. Specifically, multiple readers access the critical resource in parallel; when a writer wants to update it, the writer copies a replica, modifies the replica, and atomically swaps it in; only after all readers have released the old version (the grace period) is the old resource reclaimed.

A general-purpose implementation like Linux's also needs more complex machinery:

1. Detecting the quiescent state: when every CPU has passed through at least one quiescent state, the grace period is considered over;

2. Synchronizing among multiple writers to avoid losing updates.

The retrieval service, however, has the following three characteristics, which greatly reduce our design complexity (a minimal sketch follows this list):

1. Single writer: other preparation threads may do heavier preparation work in parallel, but only the single Reload thread is responsible for applying material updates;

2. Non-transactional: there is no strict consistency constraint among the multiple data items recalled by a retrieval thread;

3. Bounded reader hold time: retrieval threads have strict timeout requirements.
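Under these assumptions, a much simpler RCU-style scheme suffices. The code below is a minimal, epoch-based, single-writer illustration of the idea (not the production implementation): readers publish the epoch they entered, and the Reload thread retires an old version only after every reader has moved past it.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

constexpr int kMaxReaders = 64;

struct Version { /* immutable snapshot of the updated structure */ };

std::atomic<Version*> g_current{new Version{}};
std::atomic<uint64_t> g_global_epoch{1};
std::array<std::atomic<uint64_t>, kMaxReaders> g_reader_epoch{};  // 0 = not reading

// Reader side: pin the current epoch, read, unpin. No locks, no CAS retry loops.
void ReaderScan(int reader_id) {
    g_reader_epoch[reader_id].store(g_global_epoch.load(std::memory_order_acquire),
                                    std::memory_order_release);
    Version* v = g_current.load(std::memory_order_acquire);
    // ... scan postings through v (bounded by the retrieval timeout) ...
    (void)v;
    g_reader_epoch[reader_id].store(0, std::memory_order_release);
}

// Single writer (the Reload thread): publish the new version, then wait out the
// grace period before freeing the old one.
void WriterUpdate(Version* next) {
    Version* old = g_current.exchange(next, std::memory_order_acq_rel);
    uint64_t retire_epoch = g_global_epoch.fetch_add(1, std::memory_order_acq_rel);
    for (auto& e : g_reader_epoch) {       // grace period: wait only for readers that
        while (true) {                     // may still be inside the old epoch
            uint64_t ep = e.load(std::memory_order_acquire);
            if (ep == 0 || ep > retire_epoch) break;
        }
    }
    delete old;                            // safe: no reader can still see it
}
```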

[figure]

Finally, let me introduce the work in progress, L2I. The single-chain cache-miss optimizations above have achieved good results, but from a more macro, whole-system perspective there is still room to explore: on the one hand, the advertising business naturally produces many ultra-short chains; on the other hand, the scanning process needs cross-table queries for various filtering logic, and so on. These are not merely data-distribution problems but a matter of matching online traffic with customer delivery, which calls for more intelligent means.

In the industry, Jeff Dean and Tim Kraska jointly published the Learned Index[8], introducing the RMI and CDF models, after which Tim Kraska's team pushed in directions such as dynamic and multidimensional indexes and SageDB; the essence is to build a semi-automatic, load-aware optimization system. We also need to introduce load characteristics, but because the scanning process is very lightweight we cannot afford heavy predictions, and as a commercial system with commitments to customers we cannot afford mistakes. Drawing on the L2I idea, we did two things. On the one hand, we compute co-occurrence relations offline from trigger output and use them to merge ultra-short and short chains, exploiting HIT's high-performance contiguous scanning. On the other hand, we take the combination with the largest entropy and extract one bit on the posting list: the bit is 1 when the outcome is already determined, and 0 otherwise; for the latter we fall back to a point-check computation.

——END——

References:

[1] Software Engineering Advice from Building Large-Scale Distributed Systems, 2002

[2] Cache conscious indexing for decision-support in main memory, 1999

[3] Making B+-trees cache conscious in main memory, 2000

[4] FAST: fast architecture sensitive tree search on modern CPUs and GPUs, 2010

[5] The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases, 2013

[6] PATRICIA –Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968

[7] HOT: A height optimized trie index for main-memory database systems, 2018

[8] The Case for Learned Index Structures, 2018
