I was interested in an experiment that queried 9 million unique records distributed across three HDFS files (1.4GB in total) using Spark RDDs, Spark DataFrames, and SparkSQL to determine the performance differences between the three APIs.
The results for the two different query types used in the experiment are shown below:
Based on my own experience, I've found that while RDDs are the most performant, they are tedious when it comes to data manipulation. SparkSQL is definitely the most versatile and maintainable, with a slight performance edge over DataFrames when grouping and ordering large datasets.
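To make that trade-off concrete, here is a minimal sketch of the same group-and-count query expressed through all three APIs. The company/rating sample data is invented for illustration and is not taken from the experiment:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api comparison").getOrCreate()

# Hypothetical sample data standing in for the experiment's records
rows = [("Acme", 4), ("Acme", 4), ("Globex", 5)]
df = spark.createDataFrame(rows, ["Company", "Rating"])

# RDD API: manual key construction, aggregation, and ordering
rdd_counts = (df.rdd
              .map(lambda r: ((r["Company"], r["Rating"]), 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda kv: kv[0][0]))

# DataFrame API: the same aggregation, expressed declaratively
df_counts = (df.groupBy("Company", "Rating")
             .count()
             .orderBy("Company"))

# SparkSQL: identical logic expressed as SQL against a temp view
df.createOrReplaceTempView("ratings")
sql_counts = spark.sql("""
    SELECT Company, Rating, count(*) AS Number_of_Ratings
    FROM ratings
    GROUP BY Company, Rating
    ORDER BY Company
""")

The RDD version forces you to hand-build the composite key and the sort, which is exactly the tedium mentioned above; the DataFrame and SQL versions state the intent and leave the execution plan to Spark.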
Below is an attempt at a reusable template that can easily be modified to repurpose SparkSQL queries:
from pyspark.sql import SparkSession

# Get an existing Spark session or create a new one
sparkSession = SparkSession.builder.appName("reading csv").getOrCreate()

# Read a CSV file into a Spark DataFrame, caching it for reuse
df = sparkSession.read.csv("YOURFILE.csv", header=True, sep=",").cache()

# Print the column names and types of the Spark DataFrame
print(df.dtypes)

# Print the schema of the Spark DataFrame in tree form
df.printSchema()

# Register the Spark DataFrame as a temp view before running SparkSQL
df.createOrReplaceTempView("SOMETABLE")

# Run SparkSQL
sqlDF = sparkSession.sql('''
    SELECT Company, Rating, count(*) as Number_of_Ratings
    FROM SOMETABLE
    GROUP BY Company, Rating
    ORDER BY Company
''')

# Display the results (up to 200 rows, without truncating columns)
sqlDF.show(200, False)
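Only three things vary between uses of the template: the CSV path, the view name, and the SQL string. One way to repurpose it is to wrap those in a small helper; the function below is a hypothetical sketch built from the template above, not part of the original experiment:

from pyspark.sql import SparkSession

def run_sql_on_csv(path, view_name, query):
    # Hypothetical helper: load a CSV, register it as a temp view, run a query
    spark = SparkSession.builder.appName("reading csv").getOrCreate()
    df = spark.read.csv(path, header=True, sep=",").cache()
    df.createOrReplaceTempView(view_name)
    return spark.sql(query)

# Example usage with the same query as the template above
result = run_sql_on_csv(
    "YOURFILE.csv",
    "SOMETABLE",
    "SELECT Company, Rating, count(*) AS Number_of_Ratings "
    "FROM SOMETABLE GROUP BY Company, Rating ORDER BY Company",
)
result.show(200, False)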