Give us a ⭐️ Star to light the way for open source

https://github.com/apache/incubator-seatunnel

Version Release

Today, Apache SeaTunnel (incubating) released version 2.3.0 together with Zeta, its own core synchronization engine. SeaTunnel 2.3.0 also brings many long-awaited features, including CDC support and nearly a hundred Connectors.

Documentation
https://seatunnel.apache.org/docs/2.3.0/about

Download

https://seatunnel.apache.org/download/

01 Major Updates

SeaTunnel’s Own Synchronization Engine—Zeta Officially Released

Zeta Engine is designed and built specifically for data synchronization scenarios. It is faster, more stable, more resource-efficient, and easier to use. In comparisons with open source synchronization engines from around the world, Zeta's performance is far ahead. The engine went through several R&D versions, and a beta version was released in October 2022. After community discussion it was named Zeta (after the fastest star in the universe, which community members felt fully reflected the character of the engine). Thanks to the efforts of community contributors, the production-ready version of Zeta Engine is officially released today. Its features include:

  1. Easy to use. The new engine minimizes dependence on third-party services and provides cluster management, snapshot storage, and cluster HA without relying on big data components such as Zookeeper and HDFS. This is very useful for users who do not have a big data platform or do not want to depend on one for data synchronization.

  2. Resource-saving. At the CPU level, Zeta Engine uses Dynamic Thread Sharing internally. In real-time synchronization scenarios where the number of tables is large but each table holds little data, Zeta Engine runs these synchronization tasks in shared threads, which avoids unnecessary thread creation and saves system resources. On the read and write side, the design goal of Zeta Engine is to minimize the number of JDBC connections. In CDC scenarios, Zeta Engine reuses log reading and parsing resources as much as possible.

  3. More stable. In this version, Zeta Engine uses the Pipeline as the minimum granularity for Checkpoint and fault tolerance in data synchronization jobs. A failed task affects only the tasks upstream and downstream of it, which avoids, as far as possible, a single task failure causing the whole job to fail or roll back. In addition, for sources whose data has a limited retention time, Zeta Engine can enable a data cache that automatically caches the data read from the source; downstream tasks then read the cached data and write it to the target. In this case, even if the target fails and data cannot be written, reading from the source continues normally, so source data is not deleted by expiration before it has been read.

  4. Faster. Zeta Engine's execution plan optimizer aims to reduce network transmission of data, which cuts the synchronization overhead caused by serialization and deserialization and lets data synchronization complete faster. It also supports speed limiting, so that sync jobs run at a reasonable speed.

  5. Data synchronization for all scenarios. SeaTunnel aims to support full and incremental synchronization in offline batch mode, as well as real-time synchronization and CDC.
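To make the Pipeline-level fault tolerance in point 3 more concrete, here is a minimal sketch. It assumes a pipeline can be modeled as a set of tasks linked by upstream/downstream edges; the function names and task names are hypothetical and are not Zeta Engine's actual API or scheduling logic.

```python
from collections import defaultdict


def split_into_pipelines(tasks, edges):
    """Group tasks into pipelines: sets of tasks linked by upstream/downstream edges."""
    neighbors = defaultdict(set)
    for upstream, downstream in edges:
        neighbors[upstream].add(downstream)
        neighbors[downstream].add(upstream)

    pipelines, seen = [], set()
    for task in tasks:
        if task in seen:
            continue
        # Collect every task reachable from this one, ignoring edge direction.
        stack, members = [task], set()
        while stack:
            current = stack.pop()
            if current in members:
                continue
            members.add(current)
            stack.extend(neighbors[current] - members)
        seen |= members
        pipelines.append(members)
    return pipelines


def tasks_to_restore(failed_task, pipelines):
    """Return only the pipeline that contains the failed task."""
    for members in pipelines:
        if failed_task in members:
            return members
    raise KeyError(failed_task)


# Hypothetical job with two independent source-to-sink pipelines.
tasks = ["mysql_source", "clickhouse_sink", "kafka_source", "s3_sink"]
edges = [("mysql_source", "clickhouse_sink"), ("kafka_source", "s3_sink")]
pipelines = split_into_pipelines(tasks, edges)
affected = tasks_to_restore("clickhouse_sink", pipelines)
```

In this sketch, a failure in `clickhouse_sink` touches only the pipeline it shares with `mysql_source`; the Kafka-to-S3 pipeline keeps running, which is the behavior the text describes.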

Support for nearly 100 Connectors

SeaTunnel 2.3.0 supports 97 Connectors, including ClickHouse, S3, Redshift, HDFS, Kafka, MySQL, Oracle, SQL Server, Teradata, PostgreSQL, Amazon DynamoDB, Greenplum, Hudi, MaxCompute, and OSS File (see: https://seatunnel.apache.org/docs/2.3.0/Connector-v2-release-state).

In this version, thanks to feedback from a large number of users and testing by community contributors, many Connectors have reached production-ready standards. For Connectors still in the Alpha and Beta stages, everyone is welcome to join the testing.

CDC Connector support

Change data capture (CDC) is the process of identifying and capturing changes to data in a database, and then delivering those changes to downstream processes or systems in real time. It is a very important function in data integration, and a long-awaited one. Version 2.3.0 supports CDC Connectors for the first time, mainly JDBC-based Connectors (including MySQL, SQL Server, etc.).
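As a rough illustration of what "delivering changes downstream" means, the sketch below replays a stream of change events onto an in-memory replica. The event layout (`op`, `key`, `row`) is an assumption made for this example and is not SeaTunnel's internal row or changelog format.

```python
def apply_change_events(target, events):
    """Replay change events, in order, onto a dict keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Inserts and updates both leave the latest row image in the target.
            target[key] = event["row"]
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical change stream captured from a source table.
replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "alice"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alice-renamed"}},
    {"op": "delete", "key": 2},
]
apply_change_events(replica, events)
```

After the replay, the replica holds only row 1 in its updated form, mirroring the state of the source table without re-reading it in full.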

SeaTunnel CDC is a solution that draws on the strengths and weaknesses of existing CDC components on the market, as well as the pain points gathered from a large number of user interviews. It has the following characteristics:

The following functions are still under development and should be available soon:

Zeta Engine Metrics support

SeaTunnel 2.3.0 also supports Zeta Metrics. After a job finishes, users can obtain indicators including job execution time, job execution status, and the amount of data processed by the job. In the future, we will provide more comprehensive indicators to help users better monitor the running status of jobs.

Zeta engine supports persistent storage

SeaTunnel 2.3.0 provides persistent storage: users can store job metadata in persistent storage, so that it is not lost when SeaTunnel restarts.

Zeta Engine CheckPoint supports S3 storage plugin

Amazon S3 provides cloud object storage for a variety of use cases, and it is one of the Checkpoint storage plugins that has recently drawn the most attention from the community. We therefore added an S3 Checkpoint storage plugin, compatible with the S3N and S3A protocols.

02 Change Log

New Features

Core

  • [Core] [Log] Integrate slf4j and log4j2 for unified log management #3025

  • [Core] [Connector-V2] [Exception] Unified Connector exception format #3045

  • [Core] [Shade] [Hadoop] Add hadoop-shade package #3755

Connector-V2

  • [Connector-V2] [Elasticsearch] Add Source Connector #2821

  • [Connector-V2] [AmazonDynamoDB] Add Amazon DynamoDB Source & Sink Connector #3166

  • [Connector-V2] [StarRocks] Add StarRocks Sink Connector #3164

  • [Connector-V2] [DB2] Added DB2 source & sink connector #2410

  • [Connector-V2] [Transform] Added transform-v2 API #3145

  • [Connector-V2] [InfluxDB] Add influxDB Sink Connector #3174

  • [Connector-V2] [Cassandra] Added Cassandra Source & Sink Connector #3229

  • [Connector-V2] [MyHours] Added MyHours Source Connector #3228

  • [Connector-V2] [Lemlist] Added Lemlist Source Connector #3346

  • [Connector-V2] [CDC] [MySql] Add MySql CDC Source Connector #3455

  • [Connector-V2] [CDC] [SqlServer] Added SqlServer CDC Source Connector #3686

  • [Connector-V2] [Klaviyo] Added Klaviyo Source Connector #3443

  • [Connector-V2] [OneSignal] Added OneSignal Source Connector #3454

  • [Connector-V2] [Slack] Added Slack Sink Connector #3226

  • [Connector-V2] [Jira] Added Jira Source Connector #3473

  • [Connector-V2] [Sqlite] Added Sqlite Source & Sink Connector #3089

  • [Connector-V2] [OpenMldb] Add OpenMldb Source Connector #3313

  • [Connector-V2] [Teradata] Added Teradata Source & Sink Connector #3362

  • [Connector-V2] [Doris] Added Doris Source & Sink Connector #3586

  • [Connector-V2] [MaxCompute] Add MaxCompute Source & Sink Connector #3640

  • [Connector-V2] [Doris] [Streamload] Added Doris streamload Sink Connector #3631

  • [Connector-V2] [Redshift] Added Redshift Source & Sink Connector #3615

  • [Connector-V2] [Notion] Added Notion Source Connector #3470

  • [Connector-V2] [File] [Oss-Jindo] Add OSS Jindo Source & Sink Connector #3456

Zeta engine

  • Support print job metrics when job completes #3691

  • Add Metrics information statistics #3621

  • Support IMap file storage (including local files, HDFS, S3) #3418 #3675

  • Support saving job restart status information #3637

Bug Fixes

Connector-V2

  • [Connector-V2] [Jdbc] Fix Jdbc Source cannot be stopped in batch mode #3220

  • [Connector-V2] [Jdbc] Fix Jdbc connection reset error #3670

  • [Connector-V2] [Jdbc] Fix NPE in Jdbc connector exactly-once #3730

  • [Connector-V2] [Hive] Fix NPE during Hive data writing #3258

  • [Connector-V2] [File] Fix the NPE that occurs when File Connector gets FileSystem #3506

  • [Connector-V2] [File] Fix NPE thrown when File Connector user doesn’t configure fileNameExpression #3706

  • [Connector-V2] [Hudi] Fix the bug that the split owner of Hudi Connector may be negative #3184

  • [Connector-V2] [Jdbc] Fix the error that the resource is not closed after the execution of Jdbc Connector #3358

Zeta engine

  • [ST-Engine] Fix the problem that the data file name is repeated when using the Zeta engine #3717

  • [ST-Engine] Fix the problem that data cannot be read normally from Imap persistence when the node fails #3722

  • [ST-Engine] Fix Zeta Engine Checkpoint #3213

  • [ST-Engine] Fix the bug that Zeta engine Checkpoint failed #3769

Improvements

Core

  • [Core] [Starter] [Flink] Modify Starter API to be compatible with Flink version #2982

  • [Core] [Pom] [Package] Optimize the packaging process #3751

  • [Core] [Starter] Optimize Logo printing logic to adapt to higher version JDK #3160

  • [Core] [Shell] Optimize binary plugin download script #3462

Connector-V2

  • [Connector-V2] Add Connector Split base module to reuse logic #3335

  • [Connector-V2] [Redis] Support cluster mode & user authentication #3188

  • [Connector-V2] [Clickhouse] Support nest and array data types #3047

  • [Connector-V2] [Clickhouse] Support geo type data #3141

  • [Connector-V2] [Clickhouse] Improve double data type conversion #3441

  • [Connector-V2] [Clickhouse] Improve Float, Long type data conversion #3471

  • [Connector-V2] [Kafka] Support setting the starting offset or message time for reading and obtaining #3157

  • [Connector-V2] [Kafka] Support specifying multiple partition keys #3230

  • [Connector-V2] [Kafka] Support dynamic discovery of partitions and topics #3125

  • [Connector-V2] [Kafka] Support Text format #3711

  • [Connector-V2] [IotDB] Add parameter validation #3412

  • [Connector-V2] [Jdbc] Support setting data acquisition size #3478

  • [Connector-V2] [Jdbc] Support Upsert configuration #3708

  • [Connector-V2] [Jdbc] Optimize the submission process of Jdbc Connector #3451

  • [Connector-V2] [Oracle] Improve datatype mapping for Oracle connector #3486

  • [Connector-V2] [Http] Support extracting complex Json strings in Http connector #3510

  • [Connector-V2] [File] [S3] Support S3A protocol #3632

  • [Connector-V2] [File] [HDFS] Support using hdfs-site.xml #3778

  • [Connector-V2] [File] Support file splitting #3625

  • [Connector-V2] [CDC] Support writing CDC changelog events in Jdbc Elasticsearch #3673

  • [Connector-V2] [CDC] Support writing CDC changelog events in Jdbc ClickHouse #3653

  • [Connector-V2] [CDC] Support writing CDC changelog events in Jdbc Connector #3444

E2E

  • [E2E] [Flink] Support to execute command line on task manager #3224

  • [E2E] [Jdbc] Optimize JDBC e2e to improve the stability of test code #3234

  • [E2E] [Spark] Corrected Spark version in e2e container to 2.4.6 #3225

See the full change log: https://github.com/apache/incubator-seatunnel/releases/tag/2.3.0

03 Thank You

Behind every release are the efforts of countless people in the community. Late at night, during holidays, after work, and in countless scattered moments, they have contributed to the development of the project. Special thanks to @Jun Gao, @ChaoTian, and other community members who ran multiple rounds of performance and stability testing on the release candidates. We sincerely thank everyone for their contributions. The following is the list of contributors (GitHub ID) for this version, in no particular order:

EricJoy2048
TaoZex
Hisoka-X
TyrantLucifer
ic4y
liugddx
CalvinKirs
ashulin
hailin0
Carl-Zhou-CN
FWLamb
wuchunfu
john8628
lightzhao

15531651225
zhaoliang01
harveyyue
MonsterChenzhuo
hx23840
Solomon-aka-beatsAll
matesoul
lianghuan-xatu
skyoct
25Mr-LiuXu
iture123
FlechazoW
mans2singh

Special thanks to the release manager for this version, @TyrantLucifer. Although this was his first time as Release Manager, he actively communicated with the community on version planning, spared no effort in tracking issues before the release, dealt with blocking issues, and managed version quality, completing the release task excellently. We thank him for his contribution to the community, and welcome other Committers and PPMC members to volunteer as Release Manager to help the community ship releases faster and with high quality.
