Over more than a decade of rapid development, open source big data technology has seen a wide range of technologies rise and evolve. How can data processing and visualization be used to draw deep insights into the past, present, and future of open source big data technology from massive amounts of data? How can developers be given a useful reference for learning, selecting, and developing technologies in this field? With these questions in mind, the Open Atom Open Source Foundation, X-Lab Open Lab, and Alibaba Open Source Committee jointly initiated the “2022 Open Source Big Data Heat Report” project.
Project Description
The “2022 Open Source Big Data Heat Report” project collects relevant public data for correlation analysis, draws a heat map of the big data technology stack based on core indicators such as Star, Issue, and open PR, and studies the technical trends of open source big data as it enters a new stage, as well as the boosting effect that the operating model of open source communities has on those trends. The research proceeds in 7 stages: preliminary screening of public data -> project technical classification -> expert review -> shortlist announcement & call for corrections -> heat value calculation and correlation analysis -> data insight and project research -> report review.
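The announcement does not give the formula behind the heat value, so the following is only a minimal sketch of how indicators such as Star, Issue, and open PR could be combined into one score per project and summed per technical category. The weights, the function names, and the input shape are all assumptions made for illustration, not the report's actual method.

```python
from collections import defaultdict

# Hypothetical weights: the announcement does not publish the real formula.
ASSUMED_WEIGHTS = {"star": 1.0, "issue": 2.0, "open_pr": 3.0}


def heat_value(counts, weights=ASSUMED_WEIGHTS):
    """Weighted sum of indicator counts for one project.

    counts maps an indicator name to its count; missing indicators count as 0.
    """
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


def heat_by_category(projects):
    """Sum project heat values per technical category.

    projects is an iterable of (category, counts) pairs (assumed input shape).
    """
    totals = defaultdict(float)
    for category, counts in projects:
        totals[category] += heat_value(counts)
    return dict(totals)
```

Per-category totals of this kind are what a heat map over the technology stack would plot; a real analysis would also need to normalize for project age and size, which this sketch leaves out.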
Data Sources
Public data from GitHub and Jira covering January 2015 to September 2022, including project id, Star, Issue, open PR, review comment, merge PR, etc.
Data screening
The preliminary screening selected open source big data projects on GitHub that carry at least one of the following Topic Tags:
Topic Tag: big-data, etl, data-ingestion, data-collection, data-pipeline, data-analysis, data-analytics, analytics, data-visualization, business-intelligence, data-science, data-engineering
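As a rough illustration of this screening step, the sketch below keeps only repositories that carry at least one of the listed tags. Topic names are normalized to lowercase hyphenated form before comparison; the function names and the input mapping are hypothetical, and fetching the topics from GitHub is out of scope here.

```python
# Tags from the screening list, written in lowercase hyphenated form.
SCREENING_TOPICS = frozenset({
    "big-data", "etl", "data-ingestion", "data-collection", "data-pipeline",
    "data-analysis", "data-analytics", "analytics", "data-visualization",
    "business-intelligence", "data-science", "data-engineering",
})


def matches_screening_topics(repo_topics):
    """Return True if the repository carries at least one screening tag."""
    normalized = {t.strip().lower().replace(" ", "-") for t in repo_topics}
    return not SCREENING_TOPICS.isdisjoint(normalized)


def screen(repos):
    """Return the sorted names of repositories that pass the tag screening.

    repos maps "owner/name" to an iterable of topic strings (assumed shape).
    """
    return sorted(
        name for name, topics in repos.items()
        if matches_screening_topics(topics)
    )
```

For example, `screen({"apache/flink": ["big-data", "java"], "someone/game": ["unity"]})` would keep only `apache/flink`; the second repository name is made up for the example.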
Technical classification
The preliminarily screened projects are classified by technology according to the framework of the modern big data technology stack. The technical categories are:
Data integration, stream processing, data storage, data query and analysis, data development, data scheduling and orchestration, data management/security/middleware, data visualization.
Notes:
- The data query and analysis category focuses on big data analysis projects and excludes OLTP databases, HTAP databases, and NoSQL databases with OLTP capabilities
- The data visualization category requires data source connection and processing capabilities, so pure visualization framework and tool projects are excluded
- The data management, security, and middleware projects are few and overlap in function, so they are grouped into one category
- This report focuses on the field of big data and excludes projects that integrate big data with AI
Project announcement
The shortlisted projects (92 in total) are announced below. The public review period runs from October 10 to October 16, 2022.
Technical classification | Project name |
Data integration | airbytehq/airbyte alibaba/DataX apache/camel apache/flume apache/incubator-seatunnel apache/inlong apache/sqoop dbt-labs/dbt-core debezium/debezium ververica/flink-cdc-connectors |
Stream processing | apache/beam apache/flink apache/incubator-heron apache/incubator-streampark apache/kafka apache/pulsar apache/samza apache/storm |
Data query and analysis | apache/arrow-datafusion apache/calcite apache/cassandra apache/doris apache/drill apache/druid apache/hawq apache/hbase apache/hive apache/impala apache/incubator-kyuubi apache/kylin apache/lucene apache/phoenix apache/pig apache/pinot apache/solr apache/spark apache/tez ClickHouse/ClickHouse duckdb/duckdb elastic/elasticsearch eventql/eventql greenplum-db/gpdb opensearch-project/OpenSearch prestodb/presto StarRocks/starrocks trinodb/trino uber/aresdb |
Data storage | apache/avro apache/bookkeeper apache/carbondata apache/hadoop-hdfs apache/hudi apache/iceberg apache/incubator-pegasus apache/kudu apache/ozone apache/parquet-format delta-io/delta hazelcast/hazelcast juicedata/juicefs |
Data management/security/middleware | apache/ambari apache/arrow apache/atlas apache/bigtop apache/hadoop apache/knox apache/ranger cube-js/cube.js datahub-project/datahub |
Data development | apache/incubator-devlake apache/zeppelin jupyter/notebook pachyderm/pachyderm |
Data visualization | apache/superset dataease/dataease edp963/davinci elastic/kibana getredash/redash grafana/grafana keplergl/kepler.gl metabase/metabase shzlw/poli |
Data scheduling and orchestration | Alluxio/alluxio apache/airflow apache/dolphinscheduler apache/incubator-linkis apache/nifi apache/oozie apache/zookeeper dagster-io/dagster kestra-io/kestra PrefectHQ/prefect |
Supplementary Call for Other Projects
If you follow open source and know of a well-known project that is not on the list above but meets the following criteria, you can scan the QR code to submit it during the public review period.
Participation Criteria:
1. An open source big data project with a clear open source license and complete documentation, and with a new version released within the past six months
2. At least one of the following Topic Tags on GitHub: big-data, etl, data-ingestion, data-collection, data-pipeline, data-analysis, data-analytics, analytics, data-visualization, business-intelligence, data-science, data-engineering
How to participate:
Scan the QR code to submit a project
Deadline: 24:00 on October 16, 2022
Release notice
The “2022 Open Source Big Data Heat Report” will be officially released at the Yunqi Conference in November 2022.
Special thanks
- Co-sponsors: Open Atom Open Source Foundation, X-Lab Open Lab, Alibaba Open Source Committee
- Strategic cooperation: Open Source China, InfoQ, Alibaba Cloud Developer Community
- Cooperative media: CSDN, Datafun, SegmentFault