
Please, can anyone from this group help me with this task?

  1. Create an ETL job in AWS Glue: read a CSV file, apply transformations, and write two output files (CSV format and Parquet format) to an S3 bucket.
  2. Schedule the AWS Glue job to run every day at 6 pm.
  3. The Parquet output file should be created in the S3 output bucket under a dynamic folder named with the current date (mm-dd-yyyy).
  4. The CSV file should go under a static folder, with a unique file name (current date-time as a suffix).
  5. Create a crawler to catalog these files under a Glue database.
  6. Query the data using Athena.


Answers (1)

Lokesh Varman

Note: AI-generated content. Hope it gives you an idea.

 

Creating an ETL job in AWS Glue that reads a CSV file, applies transformations, writes both CSV and Parquet outputs to S3, and runs on a daily 6 pm schedule involves several steps. Here's a high-level overview of the process:

1. **Set Up AWS Glue:**

   - Ensure you have an AWS account and access to the AWS Glue service.

2. **Create an AWS Glue ETL Job:**

   - Go to the AWS Glue Console.
   - Create a new ETL job.
   - Specify your source, target, and transformation details (a minimal script skeleton follows).
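
If you author the script yourself rather than using the console-generated one, a typical Glue job starts with roughly the following boilerplate; the variable names match the snippets below, and everything else is standard Glue setup.

```python
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Resolve the job name passed in by Glue and initialize Spark/Glue contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... ETL steps go here (see the following sections) ...

job.commit()
```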

3. **Read CSV File:**

   - In the ETL script, use a Glue DynamicFrame to read the CSV file.
   - Define your data source and format options (CSV); you can read either a table already in the Data Catalog or the file directly from S3, as shown below.

```python
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your-database", table_name = "your-table")
```
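
If the CSV has not been cataloged yet, you can instead read it directly from S3 with `from_options`; the bucket path and header option below are placeholders, not values from the original task.

```python
# Read the raw CSV straight from S3 (path is a placeholder)
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-input-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)
```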

4. **Apply Transformations:**

   - Perform any necessary transformations on the data using the Glue ETL script. You can use PySpark transformations for this.

```python
# Example transformation
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [...])
```
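
For illustration, a filled-in mapping might look like the sketch below; the column names and types are hypothetical and must match your actual CSV schema (each tuple is source column, source type, target column, target type).

```python
# Map source columns to target columns/types (column names are hypothetical)
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("amount", "string", "amount", "double"),
    ],
)
```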

5. **Create Output Files:**

   - Create two separate outputs for CSV and Parquet formats.
   - For CSV, use `glueContext.write_dynamic_frame.from_options` to write to the static folder; because Spark writes part files, a short rename step (sketched after the CSV snippet) gives the file a unique date-time suffix.
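
Both write calls below assume two date strings already exist; assuming the folder and suffix formats from the requirements, they can be built with Python's `datetime` module.

```python
from datetime import datetime

now = datetime.now()
date_today = now.strftime("%m-%d-%Y")               # dynamic folder name, e.g. 06-15-2024
current_datetime = now.strftime("%m-%d-%Y-%H%M%S")  # unique suffix for the CSV file name
```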

```python
glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://your-output-bucket/static-folder/"},
    format="csv",
)
```
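
Because Glue writes the CSV as Spark part files (e.g. `part-00000-....csv`) rather than a single named file, one common approach is to reduce the output to a single partition before writing (e.g. `applymapping1.repartition(1)`) and then rename the part file with boto3 so its name carries the date-time suffix. This is only a sketch; the bucket, prefix, and `report_` file name are assumptions.

```python
import boto3

s3 = boto3.client("s3")
bucket = "your-output-bucket"
prefix = "static-folder/"

# Find the part file Glue just wrote, copy it to a uniquely named CSV,
# then delete the original part file (bucket/prefix/file name are placeholders).
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if o["Key"].endswith(".csv"))
s3.copy_object(
    Bucket=bucket,
    CopySource={"Bucket": bucket, "Key": part_key},
    Key=prefix + "report_" + current_datetime + ".csv",
)
s3.delete_object(Bucket=bucket, Key=part_key)
```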

   - For Parquet, use the same call, but write to a dynamic folder named with the current date (mm-dd-yyyy).

```python
glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://your-output-bucket/" + date_today + "/"},
    format="parquet",
)
```

6. **Schedule the AWS Glue Job:**

   - In the AWS Glue Console, add a scheduled trigger so your ETL job runs every day at 6 pm; Glue cron expressions are in UTC, so adjust the hour for your time zone (a programmatic version follows).
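
The same schedule can be created programmatically with boto3; the trigger and job names below are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run the job every day at 18:00 UTC (names are placeholders)
glue.create_trigger(
    Name="daily-6pm-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 18 * * ? *)",
    Actions=[{"JobName": "your-etl-job"}],
    StartOnCreation=True,
)
```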

7. **Create a Crawler:**

   - Create an AWS Glue Crawler that points at the S3 output paths and catalogs them as tables in a Glue database (a boto3 sketch follows).
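
For reference, the crawler can also be created with boto3; the crawler name, IAM role, database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Crawler that catalogs the S3 output into a Glue database (names are placeholders)
glue.create_crawler(
    Name="output-crawler",
    Role="your-glue-service-role",
    DatabaseName="your-database",
    Targets={"S3Targets": [{"Path": "s3://your-output-bucket/"}]},
)
glue.start_crawler(Name="output-crawler")
```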

8. **Query Data Using Athena:**

   - After the crawler has cataloged the data, you can query it with Amazon Athena.
   - Athena reads table definitions from the Glue Data Catalog, so the crawled database and tables appear in the Athena console automatically; just select the database and table.

9. **Create SQL Queries:**

   - Write SQL queries in Athena to retrieve and analyze your data.
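
Queries can be run from the Athena console or programmatically. Below is a boto3 sketch; the table, database, and results-bucket names are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Run a simple query against the crawled table; Athena writes results to S3
response = athena.start_query_execution(
    QueryString="SELECT * FROM your_table LIMIT 10",
    QueryExecutionContext={"Database": "your-database"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print(response["QueryExecutionId"])
```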

10. **Automation (Optional):**

    - To automate the whole process, you can use AWS Step Functions or other orchestration tools to trigger your Glue job and other processes in sequence.

Remember to replace placeholders such as "your-database," "your-table," and "your-output-bucket," and to supply your own configurations and transformations in the Glue job script. Also make sure the Glue job's IAM role has the permissions it needs to access S3 and any other AWS services involved.