What Is the Difference Between RDD, DataFrame, and Dataset?
RDD :
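An RDD (Resilient Distributed Dataset) is a fault-tolerant, immutable collection of elements, partitioned across the nodes of a cluster, that can be operated on in parallel. It is Spark's lowest-level API: you work with plain objects and functions, and Spark does not know anything about the structure of the data inside.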
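To make this concrete, here is a minimal sketch of an RDD computation, assuming a local Spark session. The object name `RddSketch` and the helper `sumOfSquares` are made up for illustration; only `parallelize`, `map`, and `reduce` are Spark calls.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Hypothetical helper: sum of squares of 1..n, computed in parallel.
  def sumOfSquares(spark: SparkSession, n: Int): Int = {
    // Distribute the numbers over 4 partitions (an arbitrary choice here).
    val numbers = spark.sparkContext.parallelize(1 to n, numSlices = 4)
    // map and reduce run on each partition in parallel.
    numbers.map(x => x * x).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, which is enough for a demo.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(sumOfSquares(spark, 10)) // prints 385
    finally spark.stop()
  }
}
```

Note that the lambdas are opaque to Spark, so it simply runs them as written without optimizing the computation.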
DataFrame :
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python (pandas), but with richer optimizations under the hood from the Spark SQL engine.
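A small sketch of the same idea with a DataFrame, again assuming a local session. The sample rows and the names `DataFrameSketch` and `adultNames` are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataFrameSketch {
  // Hypothetical helper: names of people older than 21.
  def adultNames(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables .toDF on local collections
    // Made-up sample data with two named columns.
    val people = Seq(("Alice", 34), ("Bob", 19), ("Carol", 45)).toDF("name", "age")
    // Columns are referenced by name, so Spark can analyze and optimize the query.
    people.filter(col("age") > 21).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    try adultNames(spark).show()
    finally spark.stop()
  }
}
```

Because columns are looked up by name at runtime, a misspelled column such as `col("agee")` compiles fine and only fails when the query is analyzed.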
Dataset :
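A Dataset is a strongly typed, distributed collection of data. It combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. The typed Dataset API is available in Scala and Java; in Scala, a DataFrame is simply a Dataset[Row].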
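The typed flavor can be sketched as follows, assuming a local session. The case class `Person`, the sample rows, and the names `DatasetSketch` and `adultNames` are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; it is defined at the top level so that
// Spark can derive an encoder for it.
case class Person(name: String, age: Long)

object DatasetSketch {
  // Hypothetical helper: names of people older than 21, as a typed Dataset.
  def adultNames(spark: SparkSession): Dataset[String] = {
    import spark.implicits._ // brings encoders and .toDS into scope
    val people: Dataset[Person] =
      Seq(Person("Alice", 34), Person("Bob", 19), Person("Carol", 45)).toDS()
    // Lambdas work on Person objects, so field access is checked by the compiler.
    people.filter(p => p.age > 21).map(p => p.name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    try adultNames(spark).show()
    finally spark.stop()
  }
}
```

Here a typo like `p.agee` is a compile-time error rather than a runtime failure, which is the main practical difference from the DataFrame version.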