dhkdn9192/data_engineer_career: Everything you need for a data engineering (DE) role

This repository collects everything a data engineer should know. For sources, see the bottom of each document.

The documents in this repo are also available on the tech blog (https://dhkdn9192.github.io).

1. Data Engineering
- 1-1. Hadoop Ecosystem
- 1-2. ELK Stack
- 1-3. Kubernetes and Docker
- 1-4. AWS
2. Computer Science
- 2-1. Operating System
- 2-2. Database
- 2-3. Network
- 2-4. Programming Language
- 2-5. Data Structure and Algorithm
- 2-6. common sense
3. Designing Data-Intensive Applications
4. Fields of Study

1. Data Engineering

Technical questions a data engineer should know

1-1. Hadoop Ecosystem

Apache Hadoop
- If the HDFS replication factor is changed from 3 to 5, how many failures can it tolerate at most?
- Hadoop 3.x Erasure Coding
- Why YARN was introduced
- HA consensus of HDFS
- The process for detecting and handling corrupted blocks
- Parquet and columnar storage
- Parquet compression algorithms
- Standby Namenode vs Secondary Namenode
- YARN scheduler
- HDFS read/write/replication procedures
- Differences between RDBMS SQL and Hadoop MapReduce
- MapReduce spilling
- The vm.swappiness setting on Hadoop servers
- Which XML configuration file should be modified to set HDFS write options on the client?
- How can a clustered service be updated without downtime? (Rolling Restart)
Apache Spark
- RDD, DataFrame, Dataset
- SparkContext and SparkSession
- Memory structure of the Spark Executor
- Performance comparison of Scala UDFs and Python UDFs in PySpark
- Performance differences between Spark APIs by language
- RDD custom partitioning
- RDD Aggregation: groupByKey vs reduceByKey
- Differences between repartition and coalesce
- Spark access first n rows: take() vs limit()
- Efficient DataFrame join strategies
- Spark's memoryOverhead setting and OutOfMemoryError
- "Exceeding memory limits" problems that can be solved just by raising memoryOverhead (Parquet)
- How do the spark.executor.memoryOverhead and spark.memory.offHeap.size settings differ?
- What are the main Spark performance improvements from Project Tungsten?
- Java serialization vs Kryo serialization
- Data source formats available in Spark, such as ORC and Parquet, and their compression algorithms
- When running a Spark job on k8s, how should logs be checked after it finishes? (Spark History Server? AWS S3 logging?)
- What happens if a Spark job is allocated excessive memory/CPU?
- What is Spark bucketing?
Apache Flink
- Batch processing and stream processing
Apache Druid
- Key features of Druid
- Druid architecture
Apache HBase
- Major Compaction vs Minor Compaction
- Region Server architecture
- Time series Row key design: Salting, Empty region
- Region's locality
Apache Hive
- Partition, Bucket, Index
- Why isn't the metastore in hdfs?
- Which is faster, SORT BY or ORDER BY in HiveQL?
- What is HCatalog?
- What is a Hive UDF?
Apache Kafka
- Is it always better to have more Kafka partitions?
- Kafka Streams Topology
- The role of ZooKeeper in Kafka
- Kafka + Spark Streaming: comparing the two integration approaches
- Kafka + Spark Streaming: choosing the number of partitions and consumers
- Kafka's exactly-once delivery
- Monitoring Kafka lag with Burrow and Telegraf
- ISR (In Sync Replica)
- What is Kafka's Controller Broker (KafkaController)?
Apache Oozie
- Inconveniences encountered while using Oozie
Apache Airflow
- Executor Types: Local vs Remote (link)
- Celery concepts and the Celery Executor
CDH setup
- Set up Virtual Box
- Install Cloudera Manager
Common Questions
- Top 50 Hadoop Interview Questions You Must Prepare In 2020
- Top Hadoop Interview Questions To Prepare In 2020 – HDFS
- Top 20 Apache Spark Interview Questions 2019
- Top 62 Data Engineer Interview Questions & Answers
- Hadoop MapReduce Interview Questions In 2020
- Top Hadoop Interview Questions To Prepare In 2020 – Apache Hive
- Lambda architecture

1-2. ELK Stack

Elasticsearch
- Performance tuning: the number and size of shards and replicas, etc.
Logstash
- Preprocessing with the Elasticsearch ingest pipeline

1-3. Kubernetes and Docker

Docker
- Container vs VM
- Difference between Docker and process
Kubernetes Cluster
- Pod
- Replica Set
- Deployment
- Service
- Namespace

1-4. AWS

Amazon EC2
Amazon S3
- S3 vs EFS vs EBS
- Differences between s3, s3n, and s3a
Amazon Redshift
- Things Amazon Redshift does not support
Amazon EMR
- Node Types: Master, Core, Task Nodes

2. Computer Science

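2-1. Operating System

Multithreading and multiprocessing
Conditions for deadlock
Dijkstra's Banker's algorithm
Semaphores and mutexes
Process scheduler
CPU scheduling algorithms
Page replacement algorithms
Paging, segmentation, and fragmentation
Big-endian, Little-endian
Cache memory and buffer memory
Page cache and buffer cache
Polling and interrupts
Sync and async, blocking and non-blocking
Steps of a context switch
Hyper-threading and core count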

2-2. Database

Data integrity
Database indexes
Database normalization
Differences between partitioning and sharding
Transactions and ACID
DDL / DML / DCL / TCL
DELETE / TRUNCATE / DROP
Top 50 SQL Interview Questions
PostgreSQL vs MariaDB
MongoDB high-availability architecture

2-3. Network

TCP and UDP
TCP's 3-way handshake, 4-way handshake
HTTP request methods: differences between GET and POST
The process by which a web browser displays an image on a web page

2-4. Programming Language

Java
- Differences between interfaces and abstract classes, and polymorphism
- JVM, JIT Compiler, GC
- GC summary
- Java memory leaks
- On-heap and off-heap
- Why use StringBuffer or StringBuilder instead of String
- static declarations and GC
- Primitive type, Reference type, Wrapper class
Scala
- Functional programming properties of Scala
- Scala's pass-by-name
- Companion object
- Case class
Python
- GIL(Global Interpreter Lock)

2-5. Data Structure and Algorithm

Array vs Linked List
Stack and Queue
- Implementing a queue with stacks
Tree
- Binary Search Tree (BST)
- AVL Tree
- Heap
Hash Table
Graph
- Dijkstra algorithm
Sorting
Recursion
Dynamic Programming

2-6. common sense

MVC Pattern
Object-oriented terms: DTO, DAO, and VO
Idempotence
Testing tools and procedures
Measuring traffic and transaction volume
Why use the Singleton pattern

3. Designing Data-Intensive Applications

Designing data-intensive applications

OLTP and OLAP

4. Fields of Study

A summary of research areas of interest, such as machine learning and data analysis, and of completed projects

Anomaly Detection
Churn Prediction
NLP
Recommender System
ideas
- A tool for managing Python packages in bulk across the nodes of a PySpark cluster
- A streaming version of Apache Nutch: a Spark-based web crawler

Reference

https://www.edureka.co/blog/interview-questions/hadoop-interview-questions-hdfs-2/
https://acadgild.com/blog/top-20-apache-spark-interview-questions-2019
https://github.com/JaeYeopHan/Interview_Question_for_Beginner
https://wikidocs.net/23683
