Guide: Apache Doris is one of the most widely used OLAP engines in Xiaomi Group. This paper mainly analyzes the performance status of A/B experimental scene query from the perspective of data, and discusses the performance optimization solution based on Apache Doris. After a series of performance optimization and testing based on Doris, the query performance improvement in A/B experiment scenarios has exceeded our expectations. I hope this sharing can provide some reference for friends in need.
Author｜Le Tao, Big Data Engineer of Xiaomi Group
A/B experiment is an important means to compare the advantages and disadvantages of strategies in the Internet scene. In order to verify the effect of a new strategy, two schemes, the original strategy A and the new strategy B, need to be prepared. Then take a small part of the overall users, and divide these users into two groups completely randomly, so that there is no statistical difference between the two groups of users. Show the original strategy A and the new strategy B to different user groups. After a period of time, analyze the data with statistical methods to obtain the results of the changes in the indicators after the two strategies take effect, and judge whether the new strategy B meets expectations.
The Xiaomi A/B Experimental Platform is an operation tool product that uses A/B experiments to assist scientific business decisions and ultimately achieve business growth with the help of experimental grouping, traffic splitting, and scientific evaluation. It is widely used in all aspects of the product development life cycle:
This article mainly analyzes the performance status of A/B experimental scene query from the perspective of data, and discusses the performance optimization solution based on Apache Doris. The architecture of the A/B experiment platform is shown in the figure below:
- The data used by the platform mainly includes the experimental configuration data and metadata used by the platform itself, as well as the log data reported by the business party.
- Since the business party introduces the SDK and interacts with the distribution service, the log data contains the ID information of the experimental group it participates in.
- Users configure, analyze, and query on the experimental platform to obtain report conclusions to meet business demands.
In view of the fact that the links reported by the various business parties in the AB experiment report are generally similar, we will takeHead business party advertising businessFor example, the data flow is shown in the figure below:
As can be seen from the above figure, the entire data link is not complicated. After the log data is imported, it enters Talos (Xiaomi self-developed message queue) after necessary data processing and cleaning work, and writes it to Doris in real time in the form of detailed data through Flink tasks. At the same time, Talos data will also be synchronized to the Hive table for backup, so as to troubleshoot and repair data.
In consideration of efficient writing and field increase and decrease requirements,The Doris schedule is modeled with the Duplicate model:
CREATE TABLE `dwd_xxxxxx` (
`olap_date` int(11) NULL COMMENT "分区日期",
`user_id` varchar(256) NULL COMMENT "用户id",
`exp_id` varchar(512) NULL COMMENT "实验组ID",
`dimension1` varchar(256) NULL COMMENT "",
`dimension2` varchar(256) NULL COMMENT "",
`dimensionN` bigint(20) NULL COMMENT "",
`index1` decimal(20, 3) NULL COMMENT "",
`indexN` int(11) NULL COMMENT "",
DUPLICATE KEY(`olap_date`, `user_id`)
PARTITION BY RANGE(`olap_date`)
PARTITION p20221101 VALUES [("20221101"), ("20221102")),
PARTITION p20221102 VALUES [("20221102"), ("20221103")),
PARTITION p20221103 VALUES [("20221103"), ("20221104"))
DISTRIBUTED BY HASH(`user_id`) BUCKETS 300
Before speeding up, the P95 time for the Xiaomi A/B experimental platform to complete the query of the experimental report ishour classThere are many performance problems in the way of using data in the experimental report, which directly affects the efficiency of business operations and decision-making.
Report Query Based on Detail
The data source of the current report query is the detailed table, and the data volume of the detailed table is huge:
Moreover, the time range in the query conditions of the experiment report often spans multiple days. Based on the statistics of historical query reports, 69.1% of the reports in the query conditions have a time range greater than one day. The specific time span distribution is as follows:
The huge amount of detailed data scanned puts a lot of pressure on the cluster, and due to the concurrency of report queries and the splitting of SQL, if an SQL request cannot quickly return the result to release resources, it will also affect the queuing status of the request. Therefore, during the working hours, the CPU load of the BE node of the Doris cluster is basically continuously fully loaded, and the disk IO is also continuously under high load, as shown in the following figure:
BE node CPU usage
BE node disk IO
- All queries in the current report are based on detailed data, and the average query time span is 4 days, and the query scan data volume is tens of billions. Due to the large amount of scanned data and high computing costs, it puts a lot of pressure on the cluster, resulting in low data query efficiency.
- If the data is pre-aggregated, Scan Rows and Scan Bytes are controlled, and the pressure on the cluster is reduced, the query performance will be greatly improved.
Hierarchical distribution of field query popularity
Due to the relatively loose process control mechanism before, all buried fields added by users will be entered into the detailed table, resulting in more redundant fields. The statistical historical query report found that the commonly used dimensions and indicators in the detailed table are only concentrated in some fields, and the query heat is distributed in layers:
The indicators involved in the calculation are also concentrated in some fields, and most of them are aggregated calculations (sum) or can be converted into aggregated calculations (avg):
- Dimensions used in detailed tables accounted for only 54.3%, and frequently used dimensions accounted for only 15.2%, and the frequency of dimension queries was distributed in layers.
- Data aggregation requires selection of dimension fields in the detailed table, and selecting some dimensions for rollup to achieve the purpose of merging, but discarding some fields will inevitably affect the coverage of aggregated data for query requests. The scene of hierarchical distribution of dimension query frequency is very suitable for data aggregation at different levels according to the popularity of dimension fields, while taking into account the degree of aggregation and coverage of the aggregation table.
Experimental group ID matching efficiency is low
The format of the current detailed data is:
The experiment group IDs in the detailed data are gathered in one field as a comma-separated string, and each query statement in the experiment report will use the
exp_idWhen filtering and querying data, use the LIKE method to match, and the query efficiency is low.
- The experimental group ID is modeled as a separate dimension, the exact match can be used instead of the LIKE query, and the Doris index can be used to improve the efficiency of data query.
- Flattening the comma-separated experimental group IDs will cause a rapid expansion of the data volume, so it is necessary to design a reasonable plan while taking into account the data volume and query efficiency.
The calculation of the number of people entering the group needs to be improved
The query of the number of people in the group is a must-check indicator of the experiment report, so its query speed greatly affects the overall query efficiency of the experiment report. The current main problems are as follows:
- When the number of people in the group is calculated as an independent indicator, use the approximate calculation function
APPROX_COUNT_DISTINCTProcessing is to improve query efficiency by sacrificing accuracy.
- When the number of people entering the group is calculated as the denominator of the composite index, use
COUNT DISTINCTProcessing, this method is less efficient in large data computing scenarios.
- The data conclusion of the AB experiment report will affect the user’s decision-making. It is not advisable to improve the query efficiency by sacrificing accuracy, especially in business occasions such as advertising that involve money and performance, and users cannot accept similar results.
- The number of people in the group is used
COUNT DISTINCTCalculations need to rely on detailed information, which is also an important factor for previous queries based on detailed data. A new scheme must be designed for such scenarios to improve the efficiency of the calculation of the number of people entering the group while ensuring the accuracy of the data.
Based on the above data status, the core point of our optimization is to pre-aggregate the detailed data, and control the Scan Rows and Scan Bytes of the Doris query by compressing the data. At the same time, make the aggregated data cover as much of the reporting query as possible. So as to achieve the purpose of reducing the pressure on the cluster and improving query efficiency.
The new data flow is shown in the figure below:
The entire process adds aggregated links on the basis of detailed links. On the one hand, Talos data is written into the Doris detailed table, and on the other hand, it is incrementally placed into the Iceberg table. The Iceberg table is used for both backtracking detailed data and generating aggregated data. We Through the real-time integration and offline integration of Workshop Alpha (Xiaomi’s self-developed data development platform), the stable operation of tasks and the consistency of data are guaranteed.
Choose to use dimension aggregation frequently
In the process of generating data aggregation, the degree of aggregation is negatively correlated with the request coverage. The fewer dimensions you use, the fewer requests you can cover, but the higher the degree of data aggregation; the more dimensions you use, the more requests you cover, but the finer the data granularity, the lower the degree of aggregation. Therefore, a balance needs to be struck in the process of modeling aggregate tables.
What we do is: Pull historical (nearly half a year) query logs for analysis, and sort according to the frequency of use of dimension fields to confirm the priority of entering the aggregation table. On this basis, the curve of the coverage rate and data volume of the aggregation table changing with the increase of the modeling field is obtained, as shown in the following figure:
The coverage rate is calculated by substituting historical request logs into the aggregation table.
Our principle is: for OLAP queries, the amount of data in the aggregation table should be controlled within a single day as much as possible 100 millionWithin , the request coverage is as far as possible More than 80%.
Therefore, it is not difficult to draw a conclusion: it is ideal to choose 14 dimension fields to model the aggregation table, and the amount of data can be controlled to a single day 80 millionbars, and the request coverage is about 83% .
use materialized view
When analyzing the report historical query logs, we found that the query frequency of different dimension fields has obvious stratification:
The Top7 dimension field appears in almost all query conditions of the report. For such high-frequency queries, it is worthwhile to make further investment to improve the query efficiency to the best possible level.Doris’ materialized views can serve such scenarios well.
What is a materialized view?
A materialized view is a specialphysical tablewhich saves some fields based on the base table (base table) furtherrollup aggregationthe result of.
Although physically separate storage, it is theuser transparentof. After configuring a materialized view for a base table, no additional work is required for writing and querying it:
- When writing and updating data to the base table, the cluster will automatically synchronize to the materialized view and ensure data consistency through transactions.
- When querying the base table, the cluster will automatically determine whether to route to the materialized view to obtain the result. When the query field can be completely covered by the materialized view, the materialized view will be used first.
So our query routing looks like this:
The user’s query request will be routed to the materialized view of the aggregate table as much as possible, then the base table of the aggregate table, and finally the detail table.
so usedmulti-gradientThe fit of the aggregation model to deal withheat stratificationquery requests, so that the efficiency of the aggregated data can be maximized.
Exact match replaces LIKE query
Since the materialized view is so easy to use, why don’t we configure the materialized view based on the Doris schedule, but develop the aggregation table separately? It is because the storage and query method of the experimental group ID field in the detailed data is unreasonable, and the aggregated data is not suitable to be obtained by directly rolling up the detailed data.
As mentioned above,
exp_id(Experimental group ID) is stored in the detailed table as a comma-separated string, and the LIKE method is used to match when querying data.Query as an AB lab reportMust check conditionsthis query method is undoubtedlyinefficientof.
The aggregation method we want is shown in the following figure:
we need to
exp_idThe field is disassembled, the data is flattened, and the exact match is used to replace the LIKE query to improve the efficiency of the query.
Control aggregate table data volume
If only the processing of splitting and flattening will inevitably lead to a surge in the amount of data, it may not be able to achieve the effect of positive optimization, so we still need to find a way to compress
exp_idThe amount of data after the tie:
- When selecting dimension fields for aggregation table modeling, in addition to the frequency of use of fields as the basis mentioned above, attention should also be paid to the fieldvalue base, make a comprehensive trade-off. If the dimension field with too high value cardinality enters the aggregation table, it will inevitably hinder the control of the data volume of the aggregation table.Therefore, under the premise of ensuring the coverage of the aggregation table request, we discard some of them as appropriatehigh base(there are more than 100,000 values).
- From the perspective of business, filter invalid data as much as possible (for example, if the traffic of an experimental group is 0% or 100%, there is no comparison meaning in business, and users will not check it, so such data does not need to be entered into the aggregation table).
After this series of steps, the data volume of the final aggregation table is controlled at about 80 million in a single day, and there is no reason for
exp_idFlatten and swell.
It is worth mentioning that,
exp_idAfter the fields are split, in addition to the query changing from LIKE matching to exact matching, two additional benefits are brought:
- field from
IntType, the comparison efficiency becomes higher when used as a query condition.
- Can use Dorisprefix indexandbloom filterand other capabilities to further improve query efficiency.
Use BITMAP to deduplicate instead COUNT DISTINCT
To speed up the query of experimental reports, the optimization of the number of people entering the group (the number of deduplicated users) is a very important part. As an indicator that is strongly dependent on detailed data, how can we achieve efficient pre-aggregation calculations like Sum, Min, Max and other indicators without losing detailed information?BITMAP deduplication calculation can well meet our needs.
What is BITMAP deduplication?
To put it simply, BITMAP deduplication is to establish a data structure, which is expressed as continuous binary bits (bits) in memory. Each element (must be an integer) participating in deduplication calculation can be mapped into a bit of this data structure. The subscript of the bit, as shown in the figure below:
When calculating the number of deduplicated users, the data is
bit_ormerged in a way to
bit_countway to get the result. More importantly, this can achieve pre-aggregation of the number of deduplicated users. The performance advantages of BITMAP are mainly reflected in two aspects:
- Compact space: Whether a bit is set or not indicates whether a number exists, which can save a lot of space. Taking Int32 as an example, the traditional storage space is 4 bytes, but it only needs to allocate 1/8 byte (1 bit) space for BITMAP calculation.
- Efficient calculation: BITMAP deduplication calculation includes setting the bit of a given subscript and counting the number of BITMAP settings, which are O(1) and O(n) operations, and the latter can use CLZ, CTZ and other instructions to efficiently calculate. In addition, BITMAP deduplication can also be accelerated in parallel in MPP execution engines such as Doris. Each node calculates its own local sub-BITMAP and then merges them.
Of course, the above is just a simplified introduction. This technology has been developed so far and has been optimized a lot, such as RoaringBitmap. Interested students can take a look at: https://github.com/RoaringBitmap/RoaringBitmap
To implement BITMAP deduplication calculation, it is necessary to ensure that the elements participating in the calculation are UInt32 / UInt64, and our
Stringtype, so we also need to design and maintain a global dictionary that will
user_idIt is mapped to a number to realize BITMAP deduplication calculation.
Since aggregated data currently only serves offline queries, we choose to implement a global dictionary based on the Hive table, and the process is as follows:
When generating the Doris aggregation table, the
user_idAs a query index, it is stored in the BITMAP type, and other general query indexes are aggregated by COUNT/SUM/MAX/MIN, etc.:
The corresponding relationship between the index calculation of the detailed table and the aggregated table is as follows:
After the query request is converted into SQL, the performance comparison between the detailed table and the aggregated table is as follows:
- Needless to say, the performance improvement of regular aggregation index query (speed improvement 50~60 times)
- The performance improvement of the group number query is also very impressive (the speed increase 10 about times)
The fast in and fast out of SQL queries can quickly release the resources occupied by queries, which also has a positive effect on alleviating the pressure on the cluster.
The changes in the CPU usage and disk IO status of the BE node of the Doris cluster have a significant effect:
It should be noted that the improvement of the cluster status (including the P95 improvement of the experimental report query) is not all due to the data pre-aggregation optimization work, but the joint efforts of all parties (such as product business form adjustment, back-end query engine queuing optimization, cache optimization , Doris cluster tuning, etc.).
Due to the variety of business query requirements, when querying the detailed table, a field will appearBoth as a dimension and as an indicatorto use the situation.
as in the advertising business table
targetConvNum(Number of target conversions) field, the value of this field is 0 and 1, the query scenario is as follows:
select targetConvNum,count(distinct user_id)
where olap_date = 20221105
and exp_id like '%154556%'
group by targetConvNum;
where olap_date = 20221105
and exp_id like '%154556%';
If this field is selected into the aggregation table, how should it be handled?
Our approach is:
- Model such fields in aggregate tables asdimension
- A count indicator cnt is needed in the aggregation table, indicating that a piece of data in the aggregation table is aggregated from how many pieces of data in the detail table
- When this type of field is queried as an index, it can be calculated in conjunction with the cnt index to get the correct result
where olap_date = 20221105
and exp_id like '%154556%';
select sum(targetConvNum * cnt)
where olap_date = 20221105
and exp_id = 154556;
After this series of Doris-based performance optimization and testing, the query performance improvement in A/B experimental scenarios has exceeded our expectations. It is worth mentioning that,Doris’ high stability and complete monitoring and analysis tools have also greatly improved our optimization work. I hope this sharing can provide some reference for friends in need.
at last,Thanks to the SelectDB company and the Apache Doris community for their support. Apache Doris is one of the most widely used OLAP engines within the Xiaomi Group. Currently, the group is promoting the latest vectorized version upgrade. In the future, we will adapt the business optimization work to the latest vectorized version of Doris to further facilitate the positive development of the business.
— end —
Finally, more open source technology enthusiasts are welcome to join the Apache Doris community to grow together and build a community ecology.The Apache Doris community currently accommodates tens of thousands of developers and users, hosting 30+ communication communitiesif you are also a fan of Apache Doris, scan the code to join the Apache Doris community user exchange group, where you can get:
- Professional full-time team technical support
- Communicate directly with community experts and get free and professional responses
- Meet developers from different industries, gain knowledge and cooperation opportunities
- Priority access to the latest version of Apache Doris
- Obtain first-hand dry goods and information and priority participation in activities
#Xiaomi #Experimental #Scenario #Based #Apache #Doris #Query #Speed #Optimization #Practice #SelectDBs #Personal #Space #News Fast Delivery