What is AWS Athena and how can we use it?

AWS Athena:

Athena is a serverless query service, fully managed by AWS, that provides a highly available environment for running analytical queries.

Under the hood, Athena runs each query in parallel across a large pool of compute resources that read the data directly from S3, so it can query anything from megabytes to petabytes of data.

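Athena charges based on the amount of data each query scans. Creating tables, or keeping tables you have already created, costs nothing in Athena itself; you pay only when a query runs (S3 storage is billed separately).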
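To make the pay-per-scan model concrete, here is a minimal sketch that estimates the cost of a single query. The price of 5 USD per TB scanned, the rounding up to whole megabytes, and the 10 MB per-query minimum are assumptions for illustration, and the function name is hypothetical; check the Athena pricing page for your region before relying on the numbers.

```python
import math

# Assumed figures for illustration only; confirm them on the Athena pricing page.
PRICE_PER_TB_USD = 5.0      # assumed price per terabyte scanned
MIN_BILLED_MB = 10          # assumed minimum amount billed per query
BYTES_PER_MB = 1024 ** 2
MB_PER_TB = 1024 ** 2


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the estimated cost in USD of one query scanning bytes_scanned bytes."""
    # Round up to whole megabytes, then apply the assumed per-query minimum.
    billed_mb = max(math.ceil(bytes_scanned / BYTES_PER_MB), MIN_BILLED_MB)
    return billed_mb / MB_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    # Hypothetical scan sizes: a full scan of 200 GB versus a 20 GB scan.
    print(f"200 GB scan: ${estimate_query_cost(200 * 1024 ** 3):.4f}")
    print(f" 20 GB scan: ${estimate_query_cost(20 * 1024 ** 3):.4f}")
```

Because the bill follows the bytes scanned, anything that reduces the scan (columnar formats such as Parquet or ORC, or selecting fewer columns) reduces the cost as well.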

In Hive we have internal (managed) tables and external tables, but in Athena we have only external tables. They point to the underlying data in S3, and that data is scanned when a query is executed.

If you know Hive tables, this will look familiar: Athena supports external tables on top of Parquet, CSV, Avro, and ORC data.

Athena has a concept of workgroups, which help restrict access by assigning a specific workgroup to each user. We can temporarily disable a workgroup so that users cannot run queries in it, or we can delete the workgroup entirely.

We can specify an Athena results bucket, an S3 location that stores the output produced by each query that is executed.
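As a rough sketch of how a workgroup and a results bucket come together when a query is submitted from code, the helper below builds the keyword arguments for the start_query_execution call of the boto3 Athena client. The helper name, the database "default", the workgroup "analysts", and the results prefix are all hypothetical placeholders, not values from this post.

```python
from typing import Any, Dict


def build_query_request(sql: str, database: str, workgroup: str,
                        output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # The workgroup the query runs in; a disabled workgroup rejects queries.
        "WorkGroup": workgroup,
        # The S3 location where Athena writes the query output.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = build_query_request(
        sql="SELECT name, location FROM my_table LIMIT 10",
        database="default",                                   # hypothetical database
        workgroup="analysts",                                 # hypothetical workgroup
        output_location="s3://bucket_name/athena-results/",   # hypothetical prefix
    )
    print(request)
    # With boto3 installed and AWS credentials configured, it could be sent as:
    #   import boto3
    #   response = boto3.client("athena").start_query_execution(**request)
    #   print(response["QueryExecutionId"])
```

The request is built separately from the AWS call so it can be inspected without credentials. Note that a workgroup can be configured to enforce its own results location, in which case the location passed by the client is overridden.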

Check the video for a hands-on walkthrough.

Table creation DDL:

Parquet table:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    name STRING,
    location STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/folder/';

Avro table:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    name STRING,
    location STRING
)
STORED AS AVRO
LOCATION 's3://bucket_name/folder/';

CSV table:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    name STRING,
    location STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket_name/folder/'
TBLPROPERTIES (
    'skip.header.line.count'='1'
);



