Compass is an open-source project based on OPPO's internal big data diagnostic platform. It can be used to diagnose big data tasks running on scheduling platforms such as DolphinScheduler and Airflow.

Compass Core Functions

Compass currently supports the following functions and features:

  • Non-invasive, instant diagnosis: you can try out the diagnosis without modifying the existing scheduling platform.

  • Supports a variety of mainstream scheduling platforms, such as DolphinScheduler, Airflow, or self-developed ones.

  • Supports log diagnosis and analysis for tasks on multiple Spark versions and on Hadoop 2.x and 3.x.

  • Supports anomaly diagnosis at the workflow layer, identifying various failures and baseline time-consumption anomalies.

  • Supports anomaly diagnosis at the engine layer, covering 14 exception types such as data skew, large table scans, and memory waste.

  • Supports writing custom log-matching rules and adjusting anomaly thresholds, which can be tuned for actual scenarios (see the sketch after this list).
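
As a rough illustration of such a rule, here is a minimal sketch in Java of a configurable log-matching rule that maps regex patterns to diagnosis types. The class, record, and pattern names are illustrative assumptions, not Compass's actual API; real rules would be loaded from configuration rather than hard-coded.

import java.util.List;
import java.util.regex.Pattern;

// Sketch of a configurable log-matching rule (illustrative, not Compass's API).
public class LogRuleSketch {

    // A rule pairs a regex with the diagnosis type it indicates.
    record LogRule(String diagnosis, Pattern pattern) {}

    public static void main(String[] args) {
        // In practice these would come from a rule configuration file,
        // so thresholds and patterns can be tuned per scenario.
        List<LogRule> rules = List.of(
            new LogRule("out of memory",
                Pattern.compile("java\\.lang\\.OutOfMemoryError|Container killed .* memory")),
            new LogRule("shuffle fail",
                Pattern.compile("FetchFailedException|Failed to connect to .* shuffle"))
        );

        String[] logLines = {
            "23/05/11 10:02:11 ERROR Executor: java.lang.OutOfMemoryError: Java heap space"
        };

        for (String line : logLines) {
            for (LogRule rule : rules) {
                if (rule.pattern().matcher(line).find()) {
                    System.out.println("matched diagnosis: " + rule.diagnosis());
                }
            }
        }
    }
}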

Overview of diagnostic types supported by Compass:

| Diagnostic Dimension | Diagnosis Type | Description |
| --- | --- | --- |
| Failure analysis | Run failure | Tasks whose final run failed |
| Failure analysis | First-time failure | Tasks that succeeded after more than 1 retry |
| Failure analysis | Long-term failure | Tasks that have failed to run for the last 10 days |
| Time-consumption analysis | Baseline end-time anomaly | Tasks that end earlier or later than their historical normal end time |
| Time-consumption analysis | Baseline run-time anomaly | Tasks that run too long or too short relative to their historical run time |
| Time-consumption analysis | Long run time | Tasks whose run time exceeds 2 hours |
| Error analysis | SQL failure | Tasks that failed due to SQL execution problems |
| Error analysis | Shuffle failure | Tasks that failed due to shuffle execution problems |
| Error analysis | Out of memory | Tasks that failed due to out-of-memory issues |
| Cost analysis | Memory waste | Tasks with a low ratio of peak memory usage to total allocated memory |
| Cost analysis | CPU waste | Tasks with a low ratio of driver/executor compute time to total CPU compute time |
| Efficiency analysis | Large table scan | Tasks that scan too many rows because partitions are not limited |
| Efficiency analysis | OOM warning | Tasks where the cumulative memory of broadcast tables is too high a proportion of either driver or executor memory |
| Efficiency analysis | Data skew | Tasks where the maximum amount of data processed by a task in a stage is much larger than the median |
| Efficiency analysis | Job time anomaly | Tasks where the ratio of job idle time to job run time is too high |
| Efficiency analysis | Stage time anomaly | Tasks where the ratio of stage idle time to stage run time is too high |
| Efficiency analysis | Task long tail | Tasks where the maximum task run time in a stage is much greater than the median |
| Efficiency analysis | HDFS stuck | Tasks where tasks in a stage process HDFS data too slowly |
| Efficiency analysis | Excessive speculative execution | Tasks where speculative-execution tasks frequently appear in stages |
| Efficiency analysis | Global sort anomaly | Tasks that run too long because of global sorting |
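
Several of the efficiency heuristics above (data skew, task long tail) compare a stage's largest task against the median across its tasks. A minimal sketch of that max-versus-median check, with an assumed ratio threshold:

import java.util.Arrays;

// Sketch of a max-vs-median skew check over per-task metrics in a stage.
// The ratio threshold is an assumption; Compass lets such thresholds be
// tuned per scenario.
public class SkewCheckSketch {

    static double median(long[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Flags the stage when one task processes disproportionately more data
    // (or runs disproportionately longer) than its peers.
    static boolean isSkewed(long[] perTaskMetric, double ratioThreshold) {
        long[] sorted = perTaskMetric.clone();
        Arrays.sort(sorted);
        double med = median(sorted);
        long max = sorted[sorted.length - 1];
        return med > 0 && max / med >= ratioThreshold;
    }

    public static void main(String[] args) {
        long[] bytesPerTask = {100, 120, 110, 105, 2_000}; // one outlier task
        System.out.println(isSkewed(bytesPerTask, 10.0));  // prints: true
    }
}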

Compass Technical Architecture

Compass is mainly composed of the following modules: a task metadata module that syncs the workflow layer, a module that syncs Yarn/Spark App metadata, a module that associates workflow-layer and engine-layer App metadata, a workflow task anomaly detection module, an engine-layer anomaly detection module, and a Portal display module.

Overall Architecture Diagram

The overall architecture is divided into 3 layers:

  • The first layer interfaces with external systems, including the scheduler, Yarn, HistoryServer, HDFS, and others, to synchronize metadata, cluster status, runtime environment status, logs, etc. for the diagnostic system to analyze;

  • The second layer is the architecture layer, including the data collection, metadata association & model standardization, anomaly detection, and diagnostic Portal modules;

  • The third layer is the basic component layer, including MySQL, Elasticsearch, Kafka, Redis and other components.

The specific modules work through the following stages:

(1) Data collection stage: synchronize workflow metadata such as users, DAGs, jobs, and execution records from the scheduling system to the diagnostic system; periodically synchronize App metadata from the Yarn ResourceManager and Spark HistoryServer to the diagnostic system and record the storage paths of job running metrics, laying the groundwork for the subsequent data processing stages;
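
As a concrete illustration of the Yarn side of this stage, the sketch below polls the ResourceManager REST API (ws/v1/cluster/apps), which returns App metadata as JSON. The host name and the FINISHED state filter are assumptions for the example; Compass's actual collectors are more elaborate (pagination, checkpointing, the Spark HistoryServer, and so on).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of syncing App metadata from the Yarn ResourceManager REST API.
public class YarnAppSyncSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical host; in practice the RM address comes from configuration.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://resourcemanager:8088/ws/v1/cluster/apps?states=FINISHED"))
            .GET()
            .build();
        // The response body is JSON: {"apps":{"app":[{"id":"application_...", ...}]}}
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
        // A real collector would parse each app's id, user, queue, and tracking
        // URLs, then persist them for the association stage.
    }
}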

(2) Data association & model standardization stage: associate the workflow execution records collected in the previous stage with Spark App, Yarn App, cluster runtime environment configuration, and other data, using the ApplicationId as the join key. At this point, the metadata of the workflow layer and the engine layer have been associated, yielding the standard data model (user, dag, task, application, clusterConfig, time);
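
A minimal sketch of that association step, joining simplified workflow-layer task runs to engine-layer apps on their shared ApplicationId (the record shapes below are illustrative stand-ins for the standard model):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of joining workflow-layer and engine-layer metadata on ApplicationId.
public class AssociationSketch {

    record TaskRun(String dag, String task, String applicationId) {}
    record YarnApp(String applicationId, String user, long elapsedMs) {}
    record StandardModel(String dag, String task, String user, long elapsedMs) {}

    public static void main(String[] args) {
        List<TaskRun> taskRuns = List.of(
            new TaskRun("daily_etl", "load_orders", "application_1700000000000_0042"));
        List<YarnApp> yarnApps = List.of(
            new YarnApp("application_1700000000000_0042", "etl_user", 7_200_000L));

        // Index engine-layer apps by applicationId for the join.
        Map<String, YarnApp> appsById = yarnApps.stream()
            .collect(Collectors.toMap(YarnApp::applicationId, a -> a));

        // Each workflow task run that has a matching app becomes one
        // standard-model record combining both layers.
        List<StandardModel> model = taskRuns.stream()
            .filter(t -> appsById.containsKey(t.applicationId()))
            .map(t -> {
                YarnApp app = appsById.get(t.applicationId());
                return new StandardModel(t.dag(), t.task(), app.user(), app.elapsedMs());
            })
            .collect(Collectors.toList());

        model.forEach(System.out::println);
    }
}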

(3) Workflow-layer & engine-layer anomaly detection stage: with the standard data model in hand, the workflow anomaly detection process runs against it. The platform also maintains a data governance knowledge base accumulated over many years, which is loaded against the standard model. Heuristic rules mine the standard model's metric data and logs for anomalies, which are combined with the cluster status and runtime environment state and analyzed to produce anomaly results at the workflow layer and the engine layer;
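
For a flavor of what one such heuristic rule looks like, here is a sketch of a memory-waste check over the standard model's metrics; the 0.3 ratio threshold is an assumed default and, like Compass's other thresholds, would be tuned per scenario:

// Sketch of a heuristic rule over metric data: flag memory waste when peak
// usage is a small fraction of what was allocated.
public class MemoryWasteRuleSketch {

    static boolean isMemoryWasted(long peakUsedBytes, long allocatedBytes, double minRatio) {
        return allocatedBytes > 0 && (double) peakUsedBytes / allocatedBytes < minRatio;
    }

    public static void main(String[] args) {
        long peakUsed  = 2L * 1024 * 1024 * 1024;   // 2 GiB actually used at peak
        long allocated = 16L * 1024 * 1024 * 1024;  // 16 GiB allocated
        System.out.println(isMemoryWasted(peakUsed, allocated, 0.3)); // true: 12.5% < 30%
    }
}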

(4) Business view: store and analyze the data, providing a user task overview, workflow-layer task diagnosis, and engine-layer Application diagnosis. The workflow layer shows anomalies caused by the scheduler's task execution, such as task failures, loopback tasks, and baseline-deviation tasks; the compute engine layer shows the time consumption, resource usage, and runtime problems arising from Spark job execution.
