AWS Glue 3.0 version

AWS has released Glue 3.0, which brings a new set of optimizations.

Glue 3.0 is based on Apache Spark 3.1.1 and combines optimizations from open-source Spark with ones developed by the AWS Glue and EMR services, such as adaptive query execution, vectorized readers, optimized shuffles, and partition coalescing.
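As a rough illustration of what adaptive query execution and partition coalescing look like at the configuration level, the sketch below sets the two standard Spark 3.x SQL options through a session's runtime config. The helper name `apply_settings` is hypothetical, and this post does not say whether Glue 3.0 already enables these options by default, so treat it as an assumption-laden sketch rather than a required step.

```python
# Standard Spark 3.x options for adaptive query execution (AQE) and
# partition coalescing. Whether Glue 3.0 already turns them on is an
# open assumption; setting them explicitly is harmless either way.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def apply_settings(spark, settings):
    """Apply each key/value pair through the session's runtime config."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark


# In a Glue job script (not executed here), `spark` would typically be:
#   from awsglue.context import GlueContext
#   from pyspark.context import SparkContext
#   spark = GlueContext(SparkContext.getOrCreate()).spark_session
#   apply_settings(spark, AQE_SETTINGS)
```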

Glue 2.0 runs the older Spark 2.4, whereas Glue 3.0 gives access to a customized build of Spark 3.1.1.

Amazon S3 optimized output committers are enabled by default, which was not the case in earlier versions.

Startup latency is reduced, which improves overall job completion times and interactivity.

Billing is the same as for Glue 2.0: jobs are charged per second.
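To make per-second billing concrete, here is a small arithmetic sketch. The function name, the price of 0.44 per DPU-hour, and the 60-second minimum are all assumptions used for illustration; none of them comes from this post, so check the current pricing for your region before relying on the numbers.

```python
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour, min_billed_seconds=60):
    """Estimate the cost of one job run billed in one-second increments.

    min_billed_seconds is an assumed minimum billing duration; pass 0 to
    model pure per-second billing with no minimum.
    """
    billed_seconds = max(runtime_seconds, min_billed_seconds)
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour


# Hypothetical run: 10 DPUs for 5 minutes at an assumed 0.44 per DPU-hour.
cost = glue_job_cost(dpus=10, runtime_seconds=300, price_per_dpu_hour=0.44)
print(f"Estimated cost: {cost:.4f}")
```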

Glue 3.0 jobs run on Spark 3.1 with either Python 3 or Scala 2.



AWS Glue 3.0 does not yet support machine learning transforms.

AWS Glue 3.0 does not yet support development endpoints.

Python 2.7 is not supported with Spark 3.1.1.

AWS Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.

AWS Glue 3.0 does not have a Hadoop Distributed File System (HDFS).

Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
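Because of this incompatibility, any Scala library attached to a job has to be built for the Scala version of the target Glue release. The sketch below relies on the common convention of suffixing Scala artifacts with the binary version (for example `_2.12`); the helper names and the `com.example:my-lib` coordinates are hypothetical, and the version mapping is taken from the dependency table in this post.

```python
# Scala binary version shipped with each Glue version (from this post).
SCALA_BINARY_VERSION = {"0.9": "2.11", "1.0": "2.11", "2.0": "2.11", "3.0": "2.12"}


def artifact_for(glue_version, group, name, version):
    """Build Maven-style coordinates with the matching Scala suffix."""
    scala = SCALA_BINARY_VERSION[glue_version]
    return f"{group}:{name}_{scala}:{version}"


def jar_matches_glue(jar_name, glue_version):
    """Check a jar file name such as my-lib_2.12-1.0.0.jar for the right suffix."""
    return f"_{SCALA_BINARY_VERSION[glue_version]}-" in jar_name


print(artifact_for("3.0", "com.example", "my-lib", "1.0.0"))
print(jar_matches_glue("my-lib_2.11-1.0.0.jar", "3.0"))
```

A name check like this only catches jars that follow the suffix convention; it does not inspect the bytecode inside the jar.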

Because Glue 3.0 upgrades many dependencies, set --user-jars-first to 'true' if you want your own JARs to take precedence over the default ones in Glue.
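A minimal sketch of a Glue 3.0 job definition that sets this flag is shown below. It only builds the dictionary of arguments; the job name, role ARN, S3 paths, worker type, and worker count are placeholder assumptions, and the actual `create_job` call is left commented out because it needs boto3 and AWS credentials.

```python
def glue3_job_definition(name, role_arn, script_location, extra_jars=None):
    """Build keyword arguments for a Glue 3.0 Spark job using Python 3."""
    default_arguments = {}
    if extra_jars:
        # Attach the user's jars and ask Glue to load them before its own.
        default_arguments["--extra-jars"] = ",".join(extra_jars)
        default_arguments["--user-jars-first"] = "true"
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "3.0",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "DefaultArguments": default_arguments,
        "WorkerType": "G.1X",      # assumed worker type for illustration
        "NumberOfWorkers": 2,      # assumed size for illustration
    }


# All names and paths below are placeholders.
job = glue3_job_definition(
    name="demo-glue3-job",
    role_arn="arn:aws:iam::123456789012:role/DemoGlueRole",
    script_location="s3://example-bucket/scripts/job.py",
    extra_jars=["s3://example-bucket/jars/my-lib_2.12-1.0.0.jar"],
)

# To create the job for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**job)
```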


Comparison of older and newer versions:

| Driver | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0 |
|---|---|---|
| MySQL | 5.1 | 8.0.23 |
| Microsoft SQL Server | 6.1.0 | 7.0.0 |
| Oracle Database | 11.2 | 21.1 |
| PostgreSQL | 42.1.0 | 42.2.18 |
| MongoDB | 2.0.0 | 4.0.0 |



| Dependency | Version in AWS Glue 0.9 | Version in AWS Glue 1.0 | Version in AWS Glue 2.0 | Version in AWS Glue 3.0 |
|---|---|---|---|---|
| Spark | 2.2.1 | 2.4.3 | 2.4.3 | 3.1.1-amzn-0 |
| Hadoop | 2.7.3-amzn-6 | 2.8.5-amzn-1 | 2.8.5-amzn-5 | 3.2.1-amzn-3 |
| Scala | 2.11 | 2.11 | 2.11 | 2.12 |
| Jackson | 2.7.x | 2.7.x | 2.7.x | 2.10.x |
| Hive | 1.2 | 1.2 | 1.2 | 2.3.7-amzn-4 |
| EMRFS | 2.20.0 | 2.30.0 | 2.38.0 | 2.46.0 |
| Json4s | 3.2.x | 3.5.x | 3.5.x | 3.6.6 |
| Arrow | N/A | 0.10.0 | 0.10.0 | 2.0.0 |
| AWS Glue Catalog client | N/A | N/A | 1.10.0 | 3.0.0 |


