ETL Jobs

ETL Jobs enable the user to run Python or PySpark code on their datasets. Using this feature, the user can perform any operation on the datasets.

Create an ETL Job

Click on ‘ETL’ and choose ‘JOBS’ from the dropdown list to access and create jobs.

Jobs can be created using the ‘+’ icon at the top right of the page.

The job creation page will be displayed with the following parameters. Parameters marked [REQUIRED] are mandatory; a job cannot be created without them:

Name [REQUIRED]: The name the user wants to give the job. The name must be unique across the Amorphic platform.

Description: A brief description of the job.

Job Type [REQUIRED]: The type of job being registered, either a Spark job or a Python job.

Max Concurrent Runs: The maximum number of concurrent runs allowed for the job.

Max Retries: The maximum number of times the job will be retried if it fails.

Allocated Capacity: The number of AWS Glue data processing units (DPUs) allocated to the job. (This parameter is available for both Spark and Python jobs.)

OR

Worker Type: The type of predefined worker that is allocated when the job runs. The user can select one of Standard, G.1X, or G.2X. For more information on worker types, refer to the AWS Glue documentation. (This parameter is available only for Spark jobs.)

AND

Number Of Workers: The number of workers of the defined worker type that are allocated when the job runs. The maximum number of workers a user can define is 299 for G.1X and 149 for G.2X. (This parameter is available only for Spark jobs.)

Timeout: The maximum time that a job can run and consume resources before it is terminated. The default timeout is 48 hours.

Notify Delay After: The number of minutes to wait after a job run starts before sending a job run delay notification.

Notification Settings [REQUIRED]: The type of notification setting for job alerts. There are two types:
  • All: Email is sent to all users (both authorized users and authorized groups) for both successful and failed executions.
  • Error-Only: Email is sent to all users (both authorized users and authorized groups) only for failed executions.

Datasets Write Access: Datasets to which the job requires write access.

Datasets Read Access: Datasets to which the job requires read access.

Parameters Access: Parameters from the parameter store that the user can access are displayed. These parameters can be used in the ETL script.

Shared Libraries: Shared libraries that the user can access are displayed. These libraries can be used in the ETL script for dependency management.

Job Parameter: Arguments that the job execution script consumes, as well as arguments that AWS Glue itself consumes (see the sketch below).
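
In an AWS Glue job script, parameters supplied here are typically read with getResolvedOptions. Below is a minimal sketch; the parameter name source_path is a hypothetical example, not part of Amorphic:

import sys
from awsglue.utils import getResolvedOptions

# Resolve the standard JOB_NAME argument plus a hypothetical
# user-defined job parameter passed as --source_path
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path"])
print("Running {} against {}".format(args["JOB_NAME"], args["source_path"]))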

Once the job metadata is created, the page navigates to the ‘Edit Script’ page, where the user provides the script for the ETL job and publishes it.

Create an ETL Job

Note: To write a file to a dataset (LZ bucket) through an ETL script, the file name must follow the convention below:

<Domain>/<DatasetName>/upload_date=<epoch>/<UserId>/<FileType>/<FileName>

ex: TEST/ETL_Dataset/upload_date=123123123/apollo/csv/test_file.csv
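
For illustration, a script might build the key and upload a file with boto3. This is a minimal sketch; the bucket name and local file path are hypothetical and depend on your deployment:

import time
import boto3

# Hypothetical values for illustration; substitute your own
domain = "TEST"
dataset_name = "ETL_Dataset"
user_id = "apollo"
file_type = "csv"
file_name = "test_file.csv"

# Build the key following the required convention:
# <Domain>/<DatasetName>/upload_date=<epoch>/<UserId>/<FileType>/<FileName>
key = "{}/{}/upload_date={}/{}/{}/{}".format(
    domain, dataset_name, int(time.time()), user_id, file_type, file_name
)

# "amorphic-lz-bucket" is a placeholder; use your environment's LZ bucket name
boto3.client("s3").upload_file("/tmp/test_file.csv", "amorphic-lz-bucket", key)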

View Job

The job details page is displayed with all the specified details, along with default values for any unspecified fields (if applicable).

Job details page

Run Job

To execute the ETL job, click the Run Job (play icon) button at the top right of the page. Once a job run starts, refresh the execution status tab using the Refresh button to check the status.

Run an ETL Job

Refresh the status

Once the job execution is completed, an email notification is sent based on the notification setting and the job execution status. The different scenarios are listed below:

Notification Setting    Success    Failure/Error
All                     Yes        Yes
Error-Only              No         Yes

Download logs

Once a job status is updated, the user can download the output logs (if any) and error logs (if any) through the more (3 dots) option. There are two types of logs:
  • Output: Output logs for the job execution. If no logs are available, the message ‘No output logs available for the execution’ is displayed.
  • Error: Error logs for the job execution. If no logs are available, the message ‘No error logs available for the execution’ is displayed.
Download execution logs

Job Actions

The user can choose the actions to be performed on a job from the action icons available at the top right of the page. These job actions are briefly described below:

Job Actions

Job Email Alerts

Users now have the option to send emails at the time of a job run. For previously created jobs, the job must be edited once to utilize this feature.

Below is sample Python code for sending an email:

import json
import boto3

# Look up the notifications queue name from the SSM Parameter Store
queue_name = boto3.client("ssm").get_parameter(Name="SYSTEM.NOTIFICATIONS.QUEUE")["Parameter"]["Value"]
sqs_resource = boto3.resource("sqs")
notification_queue = sqs_resource.get_queue_by_name(QueueName=queue_name)

email_subject = "Subject of the email"
message = """Successfully performed required operation.

Write the message that needs to be sent.
"""

# Send the notification to the queue; MessageGroupId is required
# because the notifications queue is an SQS FIFO queue
response = notification_queue.send_message(
    MessageBody=json.dumps({
        "priority": "low",
        "notifyTo": {
            "email": {
                "emailFrom": "oliver@cloudwick.com",
                "emailSubject": email_subject,
                "userGroup": "Users",
                "notifyUsers": ["apollo@cloudwick.com"],
                "messageBody": {
                    "messageType": "text",
                    "notificationMessage": message,
                },
            }
        }
    }),
    MessageGroupId="alertUser",
)

To utilize this functionality, the user needs to add SYSTEM.NOTIFICATIONS.QUEUE under Parameters Access at the time of job creation or job update. If the default mail server is used, the email addresses used as the from and to addresses must be subscribed to the Amorphic application to leverage this feature.