SparkSQL: Quickstart Jupyter Template

I was interested in this experiment that involved querying 9 million unique records distributed across three HDFS files (1.4GB in total) using Spark RDDs, Spark DataFrames and SparkSQL to determine the performance differences between the three approaches.

The results for the two different types of queries from the experiment are shown below:

[Figure: spark_benchmarking.png, benchmark results for the two query types across RDDs, DataFrames and SparkSQL]

Based on my own experience, I’ve found that while RDDs are the most performant, they are tedious when it comes to data manipulation. SparkSQL is definitely the most versatile and maintainable, with a slight performance edge over DataFrames when grouping and ordering large datasets.
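
To make the versatility point concrete, here is a minimal sketch of the same aggregation written twice, once against the DataFrame API and once as SparkSQL. The company_df DataFrame and its Company/Rating columns are hypothetical stand-ins, not data from the experiment above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api comparison").getOrCreate()

# Hypothetical toy data standing in for the real dataset
company_df = spark.createDataFrame(
    [("Acme", 4), ("Acme", 5), ("Globex", 3)],
    ["Company", "Rating"])

# DataFrame API version of the aggregation
df_result = (company_df
             .groupBy("Company", "Rating")
             .agg(F.count("*").alias("Number_of_Ratings"))
             .orderBy("Company"))

# SparkSQL version of the same aggregation
company_df.createOrReplaceTempView("company_ratings")
sql_result = spark.sql("""
    SELECT Company, Rating, count(*) AS Number_of_Ratings
    FROM company_ratings
    GROUP BY Company, Rating
    ORDER BY Company
""")

Both versions compile through the same Catalyst optimizer, so for a simple query like this the choice is largely about readability and maintainability rather than speed.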

Below is an attempt to create a reusable template that can be easily modified to repurpose SparkSQL queries:

from pyspark.sql import SparkSession

# Get an existing Spark session or create a new one
sparkSession = SparkSession.builder.appName("reading csv").getOrCreate()

# Read a CSV file and populate a Spark DataFrame
df = sparkSession.read.csv("YOURFILE.csv", header=True, sep=",").cache()

# List the column names and data types of the Spark DataFrame
df.dtypes

# Inspect the full schema (StructType) of the Spark DataFrame
df.schema

# Register the Spark DataFrame as a temporary view before running SparkSQL
df.createOrReplaceTempView("SOMETABLE")

# Run SparkSQL
sqlDF = sparkSession.sql('''
        SELECT Company, Rating, count(*) as Number_of_Ratings
        FROM SOMETABLE
        GROUP BY Company, Rating
        ORDER BY Company
        ''')

# Display up to 200 rows of results without truncating column values
sqlDF.show(200, False)
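
Since the template is aimed at Jupyter, it is often handy to pull the aggregated (and therefore small) result set into pandas for plotting or further wrangling. A minimal sketch of that hand-off:

# Collect the aggregated result into a pandas DataFrame
# (only safe for small result sets, since this pulls everything to the driver)
pandas_df = sqlDF.toPandas()

# Inspect the first few rows in the notebook
pandas_df.head()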

Parquet File Viewer for Jupyter Notebook

The following code reads a binary parquet file and displays its contents as a table in a Jupyter notebook:

import pyarrow.parquet as pq
import pandas as pd

table = pq.read_table('SOME_PARQUET_TEST_FILE.parquet')

# to_pydict() returns a mapping of column name -> list of column values
table_dict = table.to_pydict()

# Split the mapping into column names and parallel value lists
keys = list(table_dict.keys())
values = list(table_dict.values())

# Pivot the column-oriented lists into row-oriented tuples
pivoted_values = zip(*values)

# Rebuild each row as a dict of column name -> cell value
table_dictionary_array = [dict(zip(keys, record)) for record in pivoted_values]

df = pd.DataFrame.from_dict(table_dictionary_array)
df
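
For reference, pyarrow can do the same column-to-row conversion in one step via Table.to_pandas(); the manual pivot above mostly shows what that call is doing internally:

import pyarrow.parquet as pq

# One-step equivalent of the manual pivot above
df = pq.read_table('SOME_PARQUET_TEST_FILE.parquet').to_pandas()
df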

AVRO File Viewer for Jupyter Notebook

The following code reads a binary AVRO file and displays its contents as a table in a Jupyter notebook:


import pandas as pd
from avro.datafile import DataFileReader
from avro.io import DatumReader
from tabulate import tabulate

# Open the AVRO file and deserialise its records
reader = DataFileReader(open("SOME_AVRO_FILE.avro", "rb"), DatumReader())

# Each record is a dict, so the list of records maps directly onto a DataFrame
df = pd.DataFrame(list(reader))
reader.close()

# Render the DataFrame as a grid, using the column names as headers
print(tabulate(df, headers="keys", tablefmt='grid'))
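
To try the viewer end to end without an existing file, a small test file can be generated with the same avro package. A minimal sketch, assuming a hypothetical two-field schema (note that older avro-python3 releases spell the parser avro.schema.Parse rather than parse):

import json
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Hypothetical two-field record schema
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "CompanyRating",
    "fields": [
        {"name": "Company", "type": "string"},
        {"name": "Rating", "type": "int"}
    ]
}))

# Write a couple of sample records to SOME_AVRO_FILE.avro
writer = DataFileWriter(open("SOME_AVRO_FILE.avro", "wb"), DatumWriter(), schema)
writer.append({"Company": "Acme", "Rating": 4})
writer.append({"Company": "Globex", "Rating": 5})
writer.close()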