PySpark Code Snippets

Sparks by https://unsplash.com/@stephaniemccabe

 

I have been using PySpark for some time now, and I thought I'd share how I began learning Spark, my experiences, the problems I encountered, and how I solved them! You are more than welcome to suggest and/or request code snippets in the comments section below or on Twitter at @siaterliskonsta.

Contents

DataFrames: Reading Files, Selecting Columns, Filtering, GroupBy
RDDs: Reading Files

DataFrames

From Spark’s website, a DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Reading Files

Here are a few ways to read files into a DataFrame with PySpark.
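Below is a minimal sketch of the most common readers. The file paths (data.csv, data.json, data.parquet) are placeholder names for illustration, not files that ship with this post.

from pyspark.sql import SparkSession

# Entry point for the DataFrame API
spark = SparkSession.builder.appName("snippets").getOrCreate()

# CSV: treat the first row as a header and let Spark infer column types
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)

# JSON: expects one JSON object per line by default
df_json = spark.read.json("data.json")

# Parquet: a columnar format that carries its own schema
df_parquet = spark.read.parquet("data.parquet")

df_csv.printSchema()
df_csv.show(5)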

Selecting Columns

Now that you have some data in your DataFrame, you may need to select specific columns instead of the whole thing. This is how you do it:
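A short sketch follows. The DataFrame and its columns (name, dept, age) are toy data made up for illustration, and the same df is reused in the filtering and groupBy examples further down.

from pyspark.sql.functions import col

# Toy data; the column names are hypothetical
df = spark.createDataFrame(
    [("Alice", "HR", 34), ("Bob", "IT", 45), ("Carol", "IT", 29)],
    ["name", "dept", "age"],
)

# Select by column name
df.select("name", "age").show()

# select() also accepts column expressions, useful for derived columns
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()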

Filtering

You may also want to filter the rows based on one or more conditions. This is how it works:
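A sketch using the toy df from the selection example above; note that filter() and where() are interchangeable.

from pyspark.sql.functions import col

# Column-expression condition
df.filter(col("age") > 30).show()

# Equivalent SQL-style string condition; where() is an alias of filter()
df.where("dept = 'IT' AND age < 40").show()

# Combine conditions with & (and) and | (or); wrap each side in parentheses
df.filter((col("dept") == "IT") & (col("age") > 30)).show()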

 

GroupBy

Now that you have learned basic column and row manipulation, let’s move on to some aggregations with groupBy.
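A sketch of groupBy with a couple of aggregate functions, again on the toy df defined in the selection example.

from pyspark.sql.functions import avg, count

# Average age and head count per department
df.groupBy("dept").agg(
    avg("age").alias("avg_age"),
    count("*").alias("n_people"),
).show()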

RDDs

Reading Files

Let’s see how to read files into Spark RDDs (Resilient Distributed Datasets), the lower-level API that DataFrames are built on.
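A minimal sketch follows; the RDD API is reached through the SparkContext of an existing SparkSession, and data.txt and some_dir/ are placeholder paths.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snippets").getOrCreate()
sc = spark.sparkContext  # the RDD API lives on the SparkContext

# textFile() gives an RDD with one element per line
lines = sc.textFile("data.txt")

# wholeTextFiles() gives (filename, contents) pairs, one per file
pairs = sc.wholeTextFiles("some_dir/")

# parallelize() turns a local Python collection into an RDD
nums = sc.parallelize([1, 2, 3, 4, 5])

print(nums.sum())     # 15
print(lines.count())  # number of lines in data.txt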

 
 
