Apache Druid is a distributed data processing system that supports real-time, multi-dimensional OLAP analysis. It combines high-speed real-time data ingestion with fast, flexible multi-dimensional analytical queries, so its most common use is interactive OLAP analysis over large data sets.
Druid also has a key feature: it supports pre-aggregation at ingestion time and aggregation analysis based on timestamps, so it is often used for processing and analyzing time-series data.
Apache Druid 24.0.0 has now been released. This version contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional tests from 67 contributors. Here are some of the new features:
Multi-stage query task engine
SQL-based ingestion for Apache Druid uses a distributed multi-stage query architecture, which includes a query engine called the multi-stage query task engine (MSQ task engine). The MSQ task engine extends Druid's query capabilities, so you can write queries that reference external data and ingest that data using SQL INSERT and REPLACE statements.
As of Druid 24.0.0, SQL-based ingestion using the multi-stage query task engine is the recommended solution. Alternative ingestion methods, such as native batch and Hadoop-based ingestion, are still supported.
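As a rough illustration of what SQL-based ingestion can look like, the Python sketch below builds an INSERT statement that reads external data through EXTERN and wraps it in a JSON request body. The datasource name, input URI, column list, and the helper name build_ingest_payload are all hypothetical, and nothing is sent to a cluster here; the body is only what one might submit to Druid's SQL task API.

```python
import json

# Hypothetical example values; replace with your own source and schema.
INPUT_SOURCE = {"type": "http", "uris": ["https://example.com/events.json.gz"]}
INPUT_FORMAT = {"type": "json"}
ROW_SIGNATURE = [
    {"name": "timestamp", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "added", "type": "long"},
]


def build_ingest_payload(datasource: str) -> dict:
    """Return a request body holding an INSERT statement that reads via EXTERN."""
    sql = f"""
INSERT INTO "{datasource}"
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  added
FROM TABLE(
  EXTERN(
    '{json.dumps(INPUT_SOURCE)}',
    '{json.dumps(INPUT_FORMAT)}',
    '{json.dumps(ROW_SIGNATURE)}'
  )
)
PARTITIONED BY DAY
""".strip()
    return {"query": sql}


payload = build_ingest_payload("events_sketch")
print(payload["query"])
```

EXTERN takes three JSON strings (input source, input format, and row signature), and PARTITIONED BY sets the time granularity of the segments that the INSERT produces.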
Nested columns
Druid now supports storing nested data structures directly in the newly added COMPLEX<json> column type.
Updated Java support
Java 11 is fully supported, with improved Java 17 support.
Query engine updates
Updated query handling for column indexes and filters
Column indexes have been reworked to be far more flexible, which allows a wide range of index types to be modeled. New machinery builds filters that use the updated indexes. Other column implementations can also implement the built-in index types to provide adapters, so that they can use indexing with the current set of filters that Druid provides.
Time filter operator
You can now use the Druid SQL operator TIME_IN_INTERVAL to filter query results based on time. Use TIME_IN_INTERVAL in preference to the SQL BETWEEN operator to filter by time. For more information, see Date and Time Functions.
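A minimal sketch of a TIME_IN_INTERVAL filter follows. It builds the query text in Python, formatting the bounds as an ISO 8601 interval string. The table name and the helper names iso_interval and count_in_interval_sql are hypothetical.

```python
from datetime import datetime, timezone


def iso_interval(start: datetime, end: datetime) -> str:
    """Format two datetimes as an ISO 8601 start/end interval in UTC."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return f"{start_utc}/{end_utc}"


def count_in_interval_sql(table: str, start: datetime, end: datetime) -> str:
    """Build a query that filters on __time with TIME_IN_INTERVAL."""
    interval = iso_interval(start, end)
    return (
        f'SELECT COUNT(*) AS cnt FROM "{table}" '
        f"WHERE TIME_IN_INTERVAL(__time, '{interval}')"
    )


sql = count_in_interval_sql(
    "events_sketch",
    datetime(2022, 9, 1, tzinfo=timezone.utc),
    datetime(2022, 10, 1, tzinfo=timezone.utc),
)
print(sql)
```

The interval is passed as a single string literal, which avoids the two separate bounds that BETWEEN requires.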
Null values and the “in” filter
If the values array contains null, the “in” filter matches null values. This differs from the SQL IN filter, which does not match null values.
For more information, see Query Filters and SQL Data Types.
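The difference can be sketched as follows. The first part is a native “in” filter whose values array contains null; the dimension name and values are hypothetical. The second part uses SQLite purely as a stand-in for standard SQL semantics (it is not Druid) to show that a SQL IN list containing NULL does not match NULL rows.

```python
import json
import sqlite3

# Native "in" filter whose values array contains null (None in Python).
# The dimension name and values are hypothetical.
native_in_filter = {
    "type": "in",
    "dimension": "country",
    "values": ["US", None],
}
print(json.dumps(native_in_filter))

# Standard SQL semantics, shown with SQLite as a stand-in (not Druid):
# comparing NULL with anything yields NULL, so the NULL row is filtered out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (country TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("US",), ("FR",), (None,)])
rows = conn.execute(
    "SELECT country FROM t WHERE country IN ('US', NULL)"
).fetchall()
print(rows)
conn.close()
```

In the SQL case only the 'US' row comes back, even though NULL appears in the IN list.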
Virtual columns in search queries
Previously, search queries could only search dimensions that exist in the data source. Search queries now also accept virtual columns as a query parameter.
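A native search query that searches a virtual column might look roughly like the sketch below, written as a Python dict that serializes to the JSON query body. The datasource, the page column, the interval, and the virtual column name page_lower are hypothetical, and the exact field layout should be checked against the search query documentation for your Druid version.

```python
import json

# Hypothetical search query over an expression-based virtual column.
search_query = {
    "queryType": "search",
    "dataSource": "events_sketch",
    "granularity": "all",
    "intervals": ["2022-09-01/2022-10-01"],
    # Virtual column computed at query time from the (hypothetical) page column.
    "virtualColumns": [
        {
            "type": "expression",
            "name": "page_lower",
            "expression": "lower(page)",
            "outputType": "STRING",
        }
    ],
    # The virtual column is listed like any other search dimension.
    "searchDimensions": ["page_lower"],
    "query": {"type": "insensitive_contains", "value": "druid"},
}

body = json.dumps(search_query, indent=2)
print(body)
```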
Optimizing simple MIN/MAX SQL queries on __time
Simple queries such as
select max(__time) from ds
now run as timeBoundary queries, which take advantage of the time ordering within each segment. A feature flag must be set to enable this behavior.
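To make the rewrite concrete, the sketch below shows a SQL request body next to a native timeBoundary query that returns the same maximum timestamp. The datasource name ds is a placeholder. The context key enableTimeBoundaryPlanning is assumed here to be the feature flag in question, based on Druid's SQL query context documentation, and should be verified for your version.

```python
import json

# SQL request body. The context key is an assumption about the feature flag;
# verify it against the SQL query context docs for your Druid version.
sql_request = {
    "query": "SELECT MAX(__time) FROM ds",
    "context": {"enableTimeBoundaryPlanning": True},
}

# Native query of the kind the planner can produce instead of a full
# aggregation: it only reads the latest timestamp in the datasource.
time_boundary_query = {
    "queryType": "timeBoundary",
    "dataSource": "ds",
    "bound": "maxTime",
}

print(json.dumps(sql_request))
print(json.dumps(time_boundary_query))
```

For a MIN(__time) query, the corresponding bound would be minTime.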
String aggregator results
The first/last string aggregators now compare based on values only. Previously, their values were compared first on the __time column and then on the value.
If you have existing queries and want to keep comparing on both the __time column and the value, update the queries to use ORDER BY MAX(timeCol).
Introduced and implemented new helper functions
Backward compatibility for map-based rows in GroupByQueryToolChest is now disabled by default, which eliminates the need to copy the heavyweight ObjectMapper. A new configuration option allows administrators to explicitly enable backward compatibility.
Updated IPAddress Java library
A new IPAddress Java library dependency has been added to handle IP addresses. The library includes IPv6 support, and the existing IPv4 functions have been migrated to use it.
The release also includes many performance improvements. This is a big release, so check the release announcement for more details.