How to set up a bulk data load in Amorphic
A bulk data load task in Amorphic can be set up using the “Create New Task” option on the connection details page.
The following picture depicts the connection tasks page in Amorphic.
How to create a Task
Below are the steps required to create a bulk data load task in Amorphic.
Select all the tables that need to be loaded into Amorphic.
Schemas and tables can be filtered, if necessary, with the filter at the top.
After selecting the tables, click Next at the bottom of the page to proceed with the metadata update.
Metadata of the tables can be modified on the Bulk edit page.
Information that is provided in the bulk edit page will be used as metadata to register the datasets in Amorphic.
The following options are available on the Bulk edit page, along with their use.
Amorphic Dataset Name: This option is used to edit the dataset names in bulk by adding a prefix/suffix to the generated names.
Description: Edit the generated description by using this option.
Approx Table Size: This parameter determines the type of instance used to run the data migration task and has nothing to do with the metadata of the dataset. Please select the approximate size of the source table so that the instance can be chosen accordingly.
Domain: Edit domain for the datasets.
Keywords: Edit Keywords for the datasets.
Please note that the metadata of each dataset can also be edited individually by selecting the table from the left pane.
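The bulk rename behaviour of the Amorphic Dataset Name option can be pictured with a short Python sketch. It only illustrates the idea of adding a prefix and suffix to generated names; the function name `bulk_rename` and the sample table names are hypothetical and are not part of Amorphic.

```python
def bulk_rename(names, prefix="", suffix=""):
    """Map each generated dataset name to its edited name.

    Hypothetical helper: it mimics adding a prefix/suffix in bulk,
    it is not an Amorphic API.
    """
    return {name: f"{prefix}{name}{suffix}" for name in names}


# Sample generated names (made up for illustration).
generated = ["customers", "orders"]
renamed = bulk_rename(generated, prefix="src_", suffix="_raw")
print(renamed)
```

Returning a mapping from old to new names makes it easy to review the result before applying it.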
Click Next (Review & Submit Task) to select the type of data load.
The following options are available.
Task Name: Name used to identify the task.
Migration Type: Full Load, Full Load and CDC, or Change Data Capture (CDC) only
- Full Load : This option simply migrates the data from your source database to your target database.
- Full load and CDC (Migrate existing data and replicate ongoing changes) : This option performs a full data load while capturing changes on the source. After the full load is complete, the captured changes are applied to the target, and subsequent changes continue to be replicated.
- CDC only (Replicate data changes only) : In this option, only ongoing data changes are captured.
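Since the task settings on this page refer to AWS DMS, the three migration types can be related to the `MigrationType` values that the AWS DMS API accepts (`full-load`, `full-load-and-cdc`, `cdc`). The sketch below assumes this correspondence; how Amorphic translates the form labels internally is not documented here, and the helper name `to_dms_migration_type` is hypothetical.

```python
# Form labels from this page -> AWS DMS MigrationType values.
# The mapping itself is an assumption about how the labels correspond.
DMS_MIGRATION_TYPE = {
    "Full Load": "full-load",
    "Full load and CDC": "full-load-and-cdc",
    "CDC only": "cdc",
}

# Lookup table that ignores case and repeated whitespace.
_BY_NORMALIZED_LABEL = {
    " ".join(label.split()).lower(): value
    for label, value in DMS_MIGRATION_TYPE.items()
}


def to_dms_migration_type(label: str) -> str:
    """Return the DMS migration type for a form label (hypothetical helper)."""
    key = " ".join(label.split()).lower()
    try:
        return _BY_NORMALIZED_LABEL[key]
    except KeyError:
        raise ValueError(f"Unknown migration type: {label!r}") from None


print(to_dms_migration_type("Full load and CDC"))
```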
Target Location: Select the target where the data has to be migrated.
Sync To S3: Applicable only when the target location is DWH. This option lets the user choose whether the data should also be copied to S3, for both full-load and CDC tasks. For CDC tasks, syncing the data to S3 additionally requires creating a schedule on the schedules page after setting this option to Yes.
CDC Start Time (applicable only for CDC): Custom start time used as the starting point for capturing data from the source.
Extra Connection Attributes (optional): Extra connection attributes to be applied to the target for data migration jobs. Please refer to the documentation below for the available extra connection attributes.
For datasets with an S3 target, Amorphic uses addColumnName=true;timestampColumnName=RecordModifiedTimeStamp;includeOpForFullLoad=true as ExtraConnectionAttributes. When the user provides extra connection attributes in the option above, these predefined settings are overwritten, so the user must include these flags when creating the task to stay in sync with other data loads.
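Because user-supplied attributes replace the predefined ones, it can help to build the final string by merging the two. The following Python sketch parses the semicolon-separated `key=value` format shown above and keeps the predefined S3 flags unless the user overrides them explicitly. The helper names `parse_eca` and `merge_eca` are hypothetical, and `compressionType=GZIP` is only a sample user-supplied attribute.

```python
# Predefined attributes for S3 targets, copied from the note above.
DEFAULT_S3_ECA = (
    "addColumnName=true;"
    "timestampColumnName=RecordModifiedTimeStamp;"
    "includeOpForFullLoad=true"
)


def parse_eca(text):
    """Parse 'k1=v1;k2=v2' into a dict, preserving order."""
    attrs = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate trailing or doubled semicolons
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Malformed attribute: {part!r}")
        attrs[key.strip()] = value.strip()
    return attrs


def merge_eca(user_eca, defaults=DEFAULT_S3_ECA):
    """Start from the defaults, then let user-supplied values win."""
    merged = parse_eca(defaults)
    merged.update(parse_eca(user_eca))
    return ";".join(f"{key}={value}" for key, value in merged.items())


merged_eca = merge_eca("compressionType=GZIP")
print(merged_eca)
```

The merged string can then be pasted into the Extra Connection Attributes field so that the predefined flags are not lost.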
Replication Instance Class: Type of DMS instance class to be used for the data migration (when the user selects an instance here, the backend instance selection is overridden). Please choose an appropriate instance based on the data volume of all the tables.
Allocated Storage: Amount of storage space you want for your replication instance. AWS DMS uses this storage for log files and cached transactions while replication tasks are in progress.
Both of the above parameters are required for the DMS task to use the instance settings provided by the user. If both are defined, the Approx Table Size parameter selected on the table metadata page has no effect and Amorphic uses the instance settings provided by the user; otherwise, the instance configuration is decided based on the aggregate sum of all table sizes in the task.
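The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not Amorphic's actual backend logic: the size tiers in `SIZE_TIERS`, the use of gigabytes for table sizes, and the function name `choose_instance` are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical tiers: (max total size in GB, instance class, storage in GB).
# These thresholds are invented for illustration only.
SIZE_TIERS = [
    (50, "dms.t3.medium", 50),
    (500, "dms.c5.large", 100),
    (float("inf"), "dms.c5.xlarge", 200),
]


def choose_instance(
    table_sizes_gb: Sequence[float],
    instance_class: Optional[str] = None,
    allocated_storage_gb: Optional[int] = None,
) -> Tuple[str, int]:
    """Return (instance class, allocated storage) for a task."""
    # User settings apply only when BOTH parameters are provided.
    if instance_class is not None and allocated_storage_gb is not None:
        return instance_class, allocated_storage_gb
    # Otherwise decide from the aggregate size of all selected tables.
    total = sum(table_sizes_gb)
    for limit, tier_class, tier_storage in SIZE_TIERS:
        if total <= limit:
            return tier_class, tier_storage
    raise AssertionError("unreachable: the last tier has no upper bound")


print(choose_instance([10, 20]))
print(choose_instance([10, 20], "dms.r5.large", 300))
```

Note that supplying only one of the two parameters falls back to the size-based choice, which matches the requirement that both be defined.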
After selecting all the options, click Submit Task, which performs the schema conversion and registers the datasets in Amorphic.
Please follow the animation below as a reference for creating a task.
After the datasets are registered successfully, the task can be started with the Start Task option.
View Task Details
Once the task is started, its status changes to running, and the latest status can be fetched by refreshing the page.
Data migration statistics can be viewed from the View option on the task details.
For Full-load tasks, an additional tab called Runs is shown, which gives the run history of the task and the corresponding metrics.
The Schedules tab shown in the above image applies to any task type and is visible only when a schedule has been created for the task.
Additional information about the task, such as the registered datasets and the option to view or download logs, can be found under the View of Stats, Schedules, Datasets & Logs page.
Please follow the animation below for details.
When a large number of tables is selected for a single task, the user might run into API Gateway request body size restrictions. In this case, please reduce the number of tables selected during task creation. As this is an AWS constraint, there is no ETA for a resolution yet.
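One way to plan around this restriction is to split a large selection into several smaller tasks. The Python sketch below groups tables into batches whose serialized size stays under a chosen byte limit. The payload shape (`{"tables": [...]}`), the limit value, and the function names are assumptions for illustration; the real request body sent by Amorphic is not documented here.

```python
import json


def body_size(tables):
    """Size in bytes of a hypothetical JSON request body for these tables."""
    return len(json.dumps({"tables": tables}).encode("utf-8"))


def batch_tables(tables, max_bytes):
    """Split tables into consecutive batches that each fit within max_bytes."""
    batches, current = [], []
    for table in tables:
        if body_size([table]) > max_bytes:
            raise ValueError(f"A single table exceeds the limit: {table!r}")
        # Close the current batch if adding this table would overflow it.
        if current and body_size(current + [table]) > max_bytes:
            batches.append(current)
            current = []
        current.append(table)
    if current:
        batches.append(current)
    return batches


# Sample selection (made-up schema and table names) and a small demo limit.
tables = [{"schema": "sales", "table": f"t{i}"} for i in range(10)]
batches = batch_tables(tables, max_bytes=150)
for number, batch in enumerate(batches, start=1):
    print(number, [t["table"] for t in batch], body_size(batch))
```

Each batch would then be submitted as its own task, with a realistic limit substituted for the small demo value.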