Original article: Using PySpark to perform Transformations and Actions on RDD
Introduction
In my previous article, I introduced the basics of Apache Spark, its different data representations (RDD / DataFrame / Dataset) and the basics of operations (Transformation and Action). We even solved a machine learning problem from one of our past hackathons. In this article, I will pick up where I left off and focus on manipulating RDDs in PySpark by applying operations (Transformations and Actions).
As you may remember, an RDD (Resilient Distributed Dataset) is a collection of elements that can be divided across multiple nodes in a cluster and processed in parallel. It is also fault tolerant, which means it can automatically recover from failures. An RDD is immutable, i.e. once created, it cannot be changed. So how do we apply operations on an RDD? We apply an operation and store the result in another RDD.
To follow this article, you should have some understanding of Apache Spark and hands-on experience with Python programming.
Table of Contents
- Recap
- What is Transformation and Action?
  - Transformation and Action
  - Major Categories
- Applying Transformation and Action
  - General
  - Mathematical and Statistical
  - Set Theory and Relational
  - Data-structure and IO