How to set up a bulk data load in Amorphic¶
A bulk data load task in Amorphic can be set up using the “Create New Task” option on the connection details page.
The following picture shows the connection tasks page in Amorphic.
How to create a Task¶
Below are the steps required to create a bulk data load task in Amorphic.
Task Name: The name of the task, used as its identifier.
Migration Type: One of Full Load, Full Load and CDC, or CDC only.
- Full Load : Migrates the existing data from your source database to your target database.
- Full Load and CDC (Migrate existing data and replicate ongoing changes) : Performs a full data load while capturing changes on the source. After the full load is complete, the captured changes are applied to the target, and ongoing changes continue to be replicated.
- CDC only (Replicate data changes only) : Only ongoing data changes are captured.
Target Location: Select the target to which the data should be migrated. (For the S3Athena target type, only Full Load is currently supported.)
Sync To S3: Applicable only when the target location is DWH. This option lets the user choose whether data should also be copied to S3, for both full-load and CDC tasks. For CDC tasks, to sync data to S3 a schedule must be created on the schedules page after selecting Yes for this option.
CDC Start Time (Applicable only for CDC): Custom start time used as the starting point for capturing data from the source.
Target Extra Connection Attributes (Optional): Extra connection attributes applied to the target for data migration jobs. Refer to the documentation below for the available extra connection attributes.
For S3 target datasets, Amorphic uses addColumnName=true;timestampColumnName=RecordModifiedTimeStamp;includeOpForFullLoad=true as the extra connection attributes. If the user provides extra connection attributes in the option above, these predefined settings are overwritten, so the user must include these flags when creating the task to stay in sync with other data loads.
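Extra connection attributes are a semicolon-separated list of key=value pairs. As an illustration (this helper is not part of Amorphic's API), parsing the default S3 attributes above and merging them with custom ones could look like:

```python
# Parse a DMS-style extra-connection-attributes string
# (semicolon-separated key=value pairs) into a dict.
def parse_eca(attrs: str) -> dict:
    pairs = (p for p in attrs.split(";") if p.strip())
    return dict(p.split("=", 1) for p in pairs)

# Amorphic's default attributes for S3 target datasets:
defaults = parse_eca(
    "addColumnName=true;timestampColumnName=RecordModifiedTimeStamp;"
    "includeOpForFullLoad=true"
)

# If you supply your own attributes, merge the defaults back in so the
# predefined flags are not lost (maxFileSize is just an example key):
custom = {**defaults, **parse_eca("maxFileSize=32000")}
```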
Replication Instance Class: The type of DMS instance class used for the data migration (when the user selects an instance here, the backend instance selection is overridden). Choose an appropriate instance class based on the data volume of all the tables combined.
Allocated Storage: Amount of storage space you want for your replication instance. AWS DMS uses this storage for log files and cached transactions while replication tasks are in progress.
Both of the above parameters are required to use a user-provided instance setting for the DMS task. If both are defined, the Approx Table Size selected on the table metadata page has no effect and Amorphic uses the instance setting provided by the user; otherwise, the instance configuration is decided based on the aggregate sum of all table sizes in the task.
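The selection logic can be pictured as follows. This is a hypothetical sketch: the thresholds and instance classes shown are illustrative, not Amorphic's actual internal values.

```python
# Hypothetical sketch of replication-instance selection: if the user
# pins an instance class and storage, use them as-is; otherwise derive
# a class from the aggregate of the per-table approximate sizes.
# Thresholds and class names below are assumptions for illustration.
def pick_instance(user_class, user_storage_gb, table_sizes_gb):
    if user_class and user_storage_gb:
        return user_class  # Approx Table Size is ignored in this case
    total = sum(table_sizes_gb)
    if total < 50:
        return "dms.t3.medium"
    if total < 500:
        return "dms.r5.large"
    return "dms.r5.xlarge"
```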
- Select all the tables that are needed to be loaded into Amorphic.
- Schemas and tables can be filtered if necessary with the filter at the top.
- After selecting the tables, click Next at the bottom of the page to proceed with the metadata edit/update.
- Bulk Edit:
- Bulk Edit is used to update the metadata of multiple datasets at once instead of editing the prefilled metadata of each dataset individually.
The information shown/edited on this page will be used as metadata to register the datasets in Amorphic.
The following options are available on the Bulk Edit page:
Amorphic Dataset Name: This option is used to edit the dataset names in bulk by adding a prefix/suffix to the generated names.
Description: Edit the generated description by using this option.
Approx Table Size: This parameter is used to determine the type of instance used while running the data migration task and has no effect on the dataset's metadata. Select the approximate size of the source table so that the instance can be chosen accordingly.
Domain: Edit domain for the datasets.
Keywords: Edit Keywords for the datasets.
Filter By Column: This option applies a filter condition on the source dataset to filter data during ingestion.
Note that the metadata of each dataset can also be edited individually by selecting the table from the left pane.
- Filter & Transformation rules
- Filter rules are used to filter the data while ingesting it into Amorphic. The following rules are supported during ingestion.
- Equal to
- Greater than or Equal to
- Less than or Equal to
- Transformation rules are used to modify the column names or data types of the table while ingesting the data into the target. The following transformation rules are available.
- Rename To
- Remove Column
- Make UpperCase
- Make LowerCase
- Add Prefix
- Remove Prefix
- Replace Prefix
- Add Suffix
- Remove Suffix
- Replace Suffix
- Change Data Type (Only Integer, Double & String data types are available for conversion now)
Below is a sample table illustrating the column names that will be created in Amorphic when transformation rules are applied on the columns. An example column name of column_name is used for all the rules.
| Rule name | Old value | New value | Result |
| --- | --- | --- | --- |
| Rename To | N/A | new_column_name | new_column_name |
| Remove Column | N/A | N/A | N/A |
| Make UpperCase | N/A | N/A | COLUMN_NAME |
| Make LowerCase | N/A | N/A | column_name |
| Add Prefix | N/A | prefix_ | prefix_column_name |
| Remove Prefix | column_/COLUMN_ | N/A | name |
| Replace Prefix | column_ | col_ | col_name |
| Add Suffix | N/A | _suffix | column_name_suffix |
| Remove Suffix | _name/_NAME | N/A | column |
| Replace Suffix | _name | _nm | column_nm |
| Change Data Type | N/A | N/A | Datatype changes |
When multiple transformation rules are applied to a single column, AWS considers only the first rule and ignores the rest. If multiple transformations are needed on a column, consider using the Rename To transformation rule instead.
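Under the hood, these filter and transformation rules correspond to AWS DMS table-mapping rules. As a sketch, a "greater than or equal to" filter plus a "Rename To" transformation might look like the following in raw DMS JSON (the schema, table, and column names here are made up; Amorphic generates this mapping for you):

```python
import json

# Illustrative AWS DMS table mapping: a selection rule carrying a
# gte filter, plus a transformation rule that renames a column.
# Schema/table/column names are hypothetical examples.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-orders",
            "object-locator": {"schema-name": "sales", "table-name": "orders"},
            "rule-action": "include",
            "filters": [{
                "filter-type": "source",
                "column-name": "order_date",
                "filter-conditions": [
                    {"filter-operator": "gte", "value": "2023-01-01"}
                ],
            }],
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "rename-col",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "sales",
                "table-name": "orders",
                "column-name": "column_name",
            },
            "rule-action": "rename",
            "value": "new_column_name",
        },
    ]
}
print(json.dumps(table_mappings, indent=2))
```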
Preview and Submit
- Here all the details that were entered can be reviewed along with the Transformation and Filter rules added for each dataset.
- If the task name is already taken by another task, it can be edited on this page; the edit option appears after Submit Task is clicked, which validates the task name in the backend.
After selecting all the options, click Submit Task, which performs the schema conversion and registers the datasets in Amorphic.
Please follow the below animation as a reference to create a task.
After successful datasets registration, the task can be started with the Start Task option.
View Task Details¶
Once the task is started, its status changes to running, and the latest status can be fetched by refreshing the page.
Data migration statistics can be viewed from the View option on the task details.
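Amorphic runs the migration on AWS DMS, so the task status shown here mirrors the underlying DMS replication task. As a sketch (assuming direct AWS access, which Amorphic itself does not require of you), the raw status could be pulled out of a boto3 DescribeReplicationTasks response like this:

```python
# Sketch: extract status and progress counters from a DMS
# DescribeReplicationTasks response (field names per the AWS DMS API).
# In practice Amorphic surfaces this on the task details page for you.
def task_summary(response: dict) -> dict:
    task = response["ReplicationTasks"][0]
    stats = task.get("ReplicationTaskStats", {})
    return {
        "status": task["Status"],
        "full_load_percent": stats.get("FullLoadProgressPercent", 0),
        "tables_loaded": stats.get("TablesLoaded", 0),
    }

# Obtaining the response directly would look like (needs AWS credentials):
#   import boto3
#   resp = boto3.client("dms").describe_replication_tasks(
#       Filters=[{"Name": "replication-task-id", "Values": ["my-task"]}]
#   )
```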
For full-load tasks, an additional tab called Runs is shown, which gives the run history of the task and the corresponding metrics.
The Schedules tab shown in the image above applies to any task type and is visible only when a schedule has been created for the task.
Additional information related to the task, such as the datasets registered and view/download logs, can be viewed under the View option of the Stats, Schedules, Datasets & Logs page.
Please follow the animation below for the details.