[Spark] Types of Spark Cluster Managers

세미제로 2024. 1. 17. 16:23

[Contents]

0) Spark architecture and terminology
1) Spark Local mode vs. Spark Deploy mode
2) Local Mode
3) Deploy Mode: Client Mode
4) Deploy Mode: Cluster Mode
5) Cluster Manager
- Standalone
- YARN
- Mesos
- Kubernetes

https://semizero.tistory.com/56

[Spark] Spark Local Mode and Deploy Mode (the difference between local and standalone!!)

5. Cluster Manager

https://spark.apache.org/docs/latest/cluster-overview.html

Cluster Mode Overview - Spark 3.5.0 Documentation

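When Spark runs on a cluster, it can run on top of several kinds of cluster managers, each of which gives it access to the machines in the cluster.
If you only run Spark itself on a handful of machines, Standalone mode is the easiest option to set up.
If the cluster is shared with other distributed applications (for example, running Spark jobs alongside Hadoop MapReduce jobs), Spark can also run on three popular cluster managers: YARN, Mesos, and Kubernetes.
In particular, as container platforms based on Kubernetes become more common, more teams are looking to run Spark on Kubernetes.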
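
In practice, the cluster manager is chosen through the master URL handed to Spark. The sketch below maps each manager mentioned above to the URL scheme used in the Spark documentation. The helper name `master_url` and the host names are hypothetical, and the fallback ports (7077, 5050, 443) are assumptions based on common defaults, so adjust them to your own environment.

```python
from typing import Optional


def master_url(manager: str, host: Optional[str] = None, port: Optional[int] = None) -> str:
    """Return a Spark master URL for the given cluster manager (illustrative helper)."""
    manager = manager.lower()
    if manager == "local":
        return "local[*]"  # no cluster: run in-process on all local cores
    if manager == "yarn":
        return "yarn"  # the cluster location is read from the Hadoop configuration
    if host is None:
        raise ValueError(f"{manager} needs a host")
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "kubernetes":
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown cluster manager: {manager}")


if __name__ == "__main__":
    # Host names here are placeholders, not real endpoints.
    print(master_url("standalone", "master-host"))
    print(master_url("yarn"))
    print(master_url("kubernetes", "k8s-apiserver", 6443))
```

The returned string is what you would pass to `spark-submit --master` or to `SparkSession.builder.master(...)`.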

1. Standalone

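Standalone mode uses the dedicated cluster management system bundled with Spark, so Spark can run in distributed mode without a separate cluster manager.
Each machine in the cluster runs as either a master node or a worker node.
Its major drawback is that the cluster can only run Spark applications, but it is convenient when you need to bring up a cluster quickly to run Spark, or when you have no experience with YARN or another cluster manager.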

https://spark.apache.org/docs/latest/spark-standalone.html

Spark Standalone Mode - Spark 3.5.0 Documentation

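In Standalone mode, the Master acts as the Cluster Manager: it receives client requests, allocates the required resources, and manages job execution on the Workers.
A Worker runs Executors and Tasks, which do the actual processing and storage of data.
- By default, Standalone mode runs one Executor per Worker node for each application; setting spark.executor.cores allows several Executors from the same application to run on one Worker.
Standalone also supports two deploy modes, which differ in where the driver that runs the application is launched.
- Client mode
  - The Driver runs on the machine that runs spark-submit, as part of spark-submit.
  - This means you can see the driver program's output directly and also give it input (as in an interactive shell).
  - However, it requires a machine that submits the application, has a fast connection to the worker nodes, and stays up for as long as the application is running.
- Cluster mode
  - The Driver runs as a separate process on one of the Workers in the standalone cluster.
  - In this mode spark-submit runs in a "fire-and-forget" fashion, so you can shut down the PC you were using even while the application is still running.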
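
The two deploy modes above differ only in the `--deploy-mode` value passed to spark-submit. Here is a minimal sketch that builds the command line for each mode. The function name, the host `master-host`, the class `com.example.MyApp`, and the jar `my-app.jar` are all hypothetical placeholders. A jar is used because, as far as I know, standalone cluster mode does not support Python applications.

```python
import shlex
from typing import List


def standalone_submit_args(master_host: str, main_class: str, app_jar: str,
                           deploy_mode: str = "client", port: int = 7077) -> List[str]:
    """Build a spark-submit argument list for a standalone cluster (illustrative helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", f"spark://{master_host}:{port}",
        "--deploy-mode", deploy_mode,  # client: driver runs here; cluster: driver runs on a Worker
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    for mode in ("client", "cluster"):
        # Print the command instead of running it; all names are placeholders.
        print(shlex.join(standalone_submit_args("master-host", "com.example.MyApp", "my-app.jar", mode)))
```

To actually launch the job you could pass the list to `subprocess.run`, which is left out here so the sketch runs without a Spark installation.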

2. YARN

YARN is Hadoop's cluster resource management system; when used together with HDFS, it can take advantage of data locality for efficient I/O.

https://spark.apache.org/docs/latest/running-on-yarn.html

Running Spark on YARN - Spark 3.5.0 Documentation

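1) YARN Client mode

The Driver Program for the application runs on the client machine that submitted it (for example, the laptop you are working on),
and the Application Master is used only to request the required resources from YARN (the Resource Manager).
This mode is mainly useful for interactive debugging during development.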

2) YARN Cluster mode

The Driver Program runs on a NodeManager in the cluster, inside the Application Master.
Sequence
1. The Client submits the Spark application to the Resource Manager.
2. The Resource Manager picks one of the NodeManagers and instructs it to allocate a container in which to run the Application Master (Driver).
3. The NodeManager starts the container for the Application Master (Driver).
4. The Application Master (Driver) requests additional containers from the Resource Manager for the Spark executors.
5. Once the Resource Manager grants the allocation, the Application Master (Driver) asks the NodeManagers to start the containers.
6. The NodeManagers start the containers for the Spark executors.
7. The driver and the executors then communicate directly to run the Spark application.
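
The sequence above can be sketched as a toy model that lists who talks to whom. This is not the YARN API: the function name, the actor labels, and the simplification of one executor per NodeManager are assumptions made purely for illustration.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]  # (sender, receiver, message)


def yarn_cluster_mode_steps(num_executors: int) -> List[Step]:
    """Return the message sequence of YARN cluster mode as plain tuples (toy model)."""
    am = "ApplicationMaster(Driver)"
    steps: List[Step] = [
        ("Client", "ResourceManager", "submit Spark application"),               # step 1
        ("ResourceManager", "NodeManager-0", f"allocate container for {am}"),    # step 2
        ("NodeManager-0", am, "start container"),                                # step 3
        (am, "ResourceManager", f"request {num_executors} executor containers"), # step 4
        ("ResourceManager", am, "grant allocation"),                             # step 5 (first half)
    ]
    for i in range(1, num_executors + 1):
        # Step 5 (second half) and step 6, assuming one executor per NodeManager.
        steps.append((am, f"NodeManager-{i}", "start executor container"))
        steps.append((f"NodeManager-{i}", f"Executor-{i}", "start container"))
    for i in range(1, num_executors + 1):
        # Step 7: the driver and executors talk directly, without the ResourceManager.
        steps.append((am, f"Executor-{i}", "exchange tasks and results directly"))
    return steps


if __name__ == "__main__":
    for n, (sender, receiver, message) in enumerate(yarn_cluster_mode_steps(2), start=1):
        print(f"{n:2d}. {sender} -> {receiver}: {message}")
```

In Client mode the same model would apply, except that the Driver stays on the client machine and the Application Master only handles the resource requests.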

3. Mesos

Mesos is a general-purpose cluster management system that allows fine-grained control, such as dynamically changing CPU core allocation.

https://spark.apache.org/docs/latest/running-on-mesos.html

Running Spark on Mesos - Spark 3.5.0 Documentation

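Apache Mesos abstracts CPU, memory, storage, and other compute resources away from the machines.
Mesos is the most heavyweight of the cluster managers Spark supports, so it is best used only where a large-scale Mesos deployment already exists.
Note that the Spark documentation marks Mesos support as deprecated as of Spark 3.2.0.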

4. Kubernetes

https://spark.apache.org/docs/latest/running-on-kubernetes.html

Running Spark on Kubernetes - Spark 3.5.0 Documentation

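Spark on Kubernetes has been drawing a lot of attention recently.
Early 3.x releases of Spark on Kubernetes had several problems. For example, they used PVCs rather than HDFS, which made them much slower.
More recent versions, however, have improved considerably on these points.
Submission mechanism
- Spark creates a Spark driver that runs inside a Kubernetes pod.
- The driver creates executors, which also run inside Kubernetes pods, connects to them, and executes the application code.
- When the application completes, the executor pods terminate, but the driver pod keeps its logs.
- The driver pod stays in the "completed" state until it is garbage collected or manually cleaned up, and it uses no memory or CPU resources in that state.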
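
To make the submission mechanism a little more concrete, here is a minimal sketch of the spark-submit arguments for a Kubernetes cluster. The flags and the two `spark.*` configuration keys come from the Spark documentation, while the function name, API server address, image name, and application path are hypothetical placeholders.

```python
import shlex
from typing import List


def k8s_submit_args(api_server: str, image: str, app: str,
                    executors: int = 2, name: str = "spark-app") -> List[str]:
    """Build a spark-submit argument list for Spark on Kubernetes (illustrative helper)."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # api_server includes the scheme and port
        "--deploy-mode", "cluster",         # the driver itself runs in a pod
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",      # number of executor pods
        "--conf", f"spark.kubernetes.container.image={image}",  # image for driver and executors
        app,
    ]


if __name__ == "__main__":
    # All values below are placeholders; local:// refers to a path inside the container image.
    args = k8s_submit_args(
        "https://k8s-apiserver.example.com:6443",
        "example/spark:latest",
        "local:///opt/spark/app/main.py",
    )
    print(shlex.join(args))
```

After a run like this finishes, you would expect the executor pods to be gone and the driver pod to remain in the completed state, as described in the list above.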

https://techblog.woowahan.com/10291/

Migrating to Spark on Kubernetes | 우아한형제들 기술블로그 (Woowa Brothers Tech Blog)

https://blog.banksalad.com/tech/spark-on-kubernetes/

Let's Move to Spark on Kubernetes! | 뱅크샐러드 (Banksalad)
