Datasets

The Amorphic Dataset portal helps you create unstructured, semi-structured, and structured Datasets. These Datasets can serve as a single source of truth across the departments of an organization, and they give complete visibility into the data held in the data lake.

Amorphic Datasets provide a Google-like search over the Dataset metadata.

The Amorphic Dataset page lets you list existing Datasets or create a new one. On this page you can filter Datasets by Domain, create new Datasets, or view Dataset details.

Create New Datasets

You can create new Datasets by using the “Create Dataset” functionality of the Amorphic application.

Create_Dataset

To create a new Dataset, you need information such as the Domain, Connection Type, and File Type. The main fields are described in the list below.

Amorphic data is organized hierarchically: Files belong to Datasets, and Datasets belong to a Domain. To create a Dataset, you therefore first need to create a Domain using Amorphic Administration. You can then create the Dataset and upload structured, semi-structured, or unstructured files to it.
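The hierarchy described above can be pictured with a small Python sketch. The class and field names are hypothetical and only illustrate the Domain, Dataset, and File relationship; they are not part of any Amorphic API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    name: str


@dataclass
class Dataset:
    name: str
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Domain:
    name: str
    datasets: Dict[str, Dataset] = field(default_factory=dict)

    def create_dataset(self, name: str) -> Dataset:
        # A Dataset can only be created through an existing Domain.
        dataset = Dataset(name)
        self.datasets[name] = dataset
        return dataset


# Order of operations mirrors the text: Domain first, then Dataset, then files.
domain = Domain("sales")
dataset = domain.create_dataset("orders")
dataset.files.append(DataFile("orders_2020.csv"))
```

The point of the sketch is the ordering constraint: a file has nowhere to go until its Dataset exists, and a Dataset has nowhere to go until its Domain exists.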

  • Connection Type:

    Select JDBC for a JDBC connection, S3 for an S3 connection, and API for all other types of connection. A JDBC connection type requires you to select a JDBC connection from the list of Amorphic Connections (see the Connections section) and to specify the table name from which the scheduler will run the data ingestion job. An S3 connection type requires you to specify an S3 connection and the path of the directory that a schedule will poll for new datasets, either on demand or on a time basis.

  • File Type:

    The file type should match a file format supported by the ML Model you intend to apply. In addition to the supported formats, you can extract metadata from unstructured datasets using the auto ML functionality, which is integrated with the AWS Transcribe and Comprehend services in the back end.

  • Target Location:

    The target location can be S3 for most connection types, in which case Amorphic ingests the data into S3. You can access the connection details once the dataset has been created. S3 Datasets do not require a schema file upload. The other target locations are:

    • Target Location MySQL Dataset: The target location for a JDBC connection type can be either S3 or MySQL. Below is an example in which the target location has been explicitly specified as MySQL for a JDBC connection type.

    As the figure shows, such Datasets have “DWH Connection Details”. Details such as the JDBC and ODBC connection strings and the Host can be viewed in the DWH Connect section of the dataset details. These connection details, together with the “DW Credential” information, can be used to connect external business intelligence tools.

    • Target Location Athena: Structured data (i.e. csv/xlsx) can be stored in Athena with all connection types available in Amorphic. Please refer to Athena Datasets for more detail.

    When the target location is a Datawarehouse (auroramysql/redshift/S3-Athena), the user is asked to upload a file for schema inference and then publish the schema.
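The Connection Type rules above can be summarized as a small lookup of required inputs. This is a minimal sketch: the setting names (`connection`, `table_name`, `directory_path`) are illustrative assumptions, not the field names used by the Amorphic application.

```python
# Hypothetical setting names; the rules come from the Connection Type description.
REQUIRED_FIELDS = {
    "JDBC": ("connection", "table_name"),
    "S3": ("connection", "directory_path"),
    "API": (),
}


def missing_fields(connection_type, settings):
    """Return the required settings that are absent or empty."""
    try:
        required = REQUIRED_FIELDS[connection_type]
    except KeyError:
        raise ValueError(f"unknown connection type: {connection_type!r}") from None
    return [name for name in required if not settings.get(name)]


# A JDBC dataset that names a connection but no source table is incomplete.
incomplete = missing_fields("JDBC", {"connection": "warehouse"})
```

Here `incomplete` lists `table_name`, which matches the requirement that a JDBC connection type needs both a connection and a table name.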

Users can also create a Dataset by using the “Navigator”, which directs the user to the Data Creation page from anywhere in the application. To display the Navigator, double-tap the “Ctrl” key on the keyboard.

Below is a simple graphic to demonstrate Navigator.

Navigator

Schema

The Amorphic Dataset registration feature helps users extract and register the schema of an uploaded file when they choose a Datawarehouse (auroramysql or redshift) as the target location.

Create_Dataset_DWH

Schema Extraction

When a user registers a dataset with a Datawarehouse (auroramysql or redshift) as the target location and sets “My data files have headers” to “Yes”, the user is directed to “Publish Schema”, where the inferred schema is displayed.

Dataset_Schema

The user can edit the “Column Name” and “Column Type” on the Schema Extraction page.
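To illustrate what inferring a schema from a file with headers can look like, here is a minimal sketch using only the Python standard library. It is not Amorphic's inference logic; the function names and the type labels (`integer`, `double`, `varchar`) are assumptions chosen for the example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty value."""
    present = [value for value in values if value != ""]
    if not present:
        return "varchar"
    for caster, type_name in ((int, "integer"), (float, "double")):
        try:
            for value in present:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "varchar"


def infer_schema(csv_text):
    """Return (column name, column type) pairs from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    return [(name, infer_type(column)) for name, column in zip(header, columns)]


sample = "order_id,amount,customer\n1,9.99,alice\n2,15,bob\n"
schema = infer_schema(sample)
```

An inferred schema of this kind is only a starting point, which is why an editing step for column names and types is useful before publishing.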

Schema View

The Schema tab appears in the dataset view only when the user registers the dataset with a Datawarehouse (auroramysql or redshift) as the target location.

Data_Schema_View

View Dataset

Dataset_Details

Upon clicking View Details under a dataset, the user can see the following details of the dataset:

  1. Details
  2. DWH Connect
  3. Schema
  4. Files
  5. Schedules
  6. Authorized Users
  7. Authorized Groups

Details

The Details tab shows dataset information such as Dataset Name, Dataset S3 Location, Target Location, Domain, Connection Type, and File Type.

Along with the user-supplied details such as Dataset Name, Domain, Connection Type, and File Type, the Details tab shows the following:

Dataset_Metadata
  • Dataset S3 Location: The AWS S3 folder location of the Dataset
  • Dataset AI/ML Results: Advanced Analytics Summary of the Dataset
  • Created By: User name of the user who created the dataset, and the creation date-time of the dataset
  • Last Modified By: User name of the user who last modified the metadata of the dataset, and the date-time when the metadata was last modified

DWH Connect

The DWH Connect tab shows the JDBC connection, Host, and ODBC connection information. This is useful for establishing a connection between external data sources and the Amorphic platform.

Note: In the case of a Redshift datawarehouse, when a user connects over the JDBC connection from a data source or a BI tool, all the tables in the specific database are displayed (schema only, not the actual data) along with the user-owned tables (datasets). This is expected behavior in Amorphic.
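When wiring a BI tool to the warehouse, it can help to see how a JDBC connection string relates to the Host shown in the same tab. The sketch below splits a string of the common `jdbc:<driver>://host:port/database` form using the Python standard library. The URL is a placeholder, not a real Amorphic endpoint, and the helper name is hypothetical.

```python
from urllib.parse import urlparse


def parse_jdbc_url(jdbc_url):
    """Split a jdbc:<driver>://host:port/database string into its parts."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC connection string")
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "driver": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
    }


# Placeholder values for illustration only.
details = parse_jdbc_url("jdbc:redshift://example-host.example.com:5439/exampledb")
```

The credentials are deliberately absent from the string: they are supplied separately from the “DW Credential” information.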

Files

Dataset_Files

In the Files tab, the user can upload files to the dataset, delete files, and perform operations such as download, apply ML, and view AI/ML results.

Invocations & Results

After applying an ML model to a file, the user can see the ML results by clicking the “Invocations and Results” button next to the file.

These invocation results are the outputs of applying an ML model to the dataset, whether structured or unstructured.

The picture below shows how the invocation results are displayed. The history of invocations on a file is displayed in this view.

Invocations_Results

The user can download the logs for each invocation by clicking the download button in the “Logs” column.

If no ML model has been applied to the file, clicking the “Invocations and Results” button displays a “No invocations” message.

Download File

The user can download a file by clicking the “Download File” option, which is displayed by clicking the File operations icon (the vertical “more” icon) at the right of the file.

Apply ML

The user can apply Machine Learning models to files in the dataset by selecting the “Apply ML” option in the File operations dropdown next to the file.

Once the “Apply ML” option is selected, the application asks for a “Type” (ML Model or Entity Recognizer). The inputs the user needs to provide for the selected type are listed below.

ML Model: The user needs to provide:

  • ML Model: This dropdown shows the names of the ML models that have been created in the platform and to which the user has access.
  • Instance Type: This dropdown shows the types of machines on which the model can be applied to the file.
  • Target Dataset: This dropdown shows the list of Amorphic datasets to which the output results can be written.

The picture below points out the input fields required to apply a Machine Learning model.

Dataset_Apply_ML

Entity Recognizer: The user needs to provide:

  • Entity Recognizer: This dropdown shows the names of the Entity Recognizers that have been created in the platform and to which the user has access.

View AI/ML Results

By clicking View AI/ML Results, the user is shown the AI/ML results produced by the applied ML models and Entity Recognizers.

The user can view the results either by clicking the icon in the “Results” field of the invocation results or by clicking the “View AI/ML Results” option in File operations.

Below is a sample view of how AI results are displayed when an entity recognizer is applied to a file.

Datasets_AI_ML_Results

Schedules

This tab lists the Schedules defined on this dataset. The user can schedule a Data Ingest job, which ingests files into the dataset, by using the Schedules feature in the left menu bar.

Schedules can be Time-Based or On-Demand.

Authorized Users

This tab shows the list of users authorized to perform operations on the dataset. The owner (the user who created the dataset or who has owner access to it) can grant dataset access to any other user in the system.

There are two access types:

  • Owner: This user has permission to edit the dataset and to grant other users access to it.
  • Read-only: This user has limited permissions on the dataset, such as viewing the details and downloading files.
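The two access types can be thought of as two sets of allowed actions. The following sketch is illustrative only: the action names are hypothetical, and it assumes that an Owner can also do everything a Read-only user can.

```python
from enum import Enum


class AccessType(Enum):
    OWNER = "owner"
    READ_ONLY = "read-only"


# Action names are illustrative; the split follows the two descriptions above.
# Assumption: Owner permissions are a superset of Read-only permissions.
PERMISSIONS = {
    AccessType.OWNER: {"view_details", "download_files", "edit_dataset", "grant_access"},
    AccessType.READ_ONLY: {"view_details", "download_files"},
}


def is_allowed(access_type, action):
    """Return True if the access type permits the named action."""
    return action in PERMISSIONS[access_type]


can_share = is_allowed(AccessType.READ_ONLY, "grant_access")
```

In this sketch `can_share` is false, reflecting that only an Owner can provide dataset access to other users.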

Authorized Groups

This tab shows the list of groups authorized to perform operations on resources such as Datasets, Dashboards, Models, and Schedules. A group is a list of users given access to a resource, in this case a dataset. Groups are created under User Profile -> Profile & Settings -> Groups.

There are two access types:

  • Owner: This group of users has permission to edit the resources and to grant other users or groups access to them.
  • Read-only: This group has limited permissions on the resources, such as viewing the details.

List Datasets

In this view, users can see the list of datasets they have access to. They can limit the results shown per page using the Results Per Page option and sort them by a chosen field and order.

List_of_Datasets
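The combination of sorting and a Results Per Page limit can be sketched as follows. The function, its parameters, and the record fields are hypothetical and are not taken from the Amorphic application.

```python
def list_page(datasets, sort_field, descending=False, page=1, results_per_page=10):
    """Sort dataset records by one field, then return a single page of them."""
    ordered = sorted(datasets, key=lambda record: record[sort_field], reverse=descending)
    start = (page - 1) * results_per_page
    return ordered[start:start + results_per_page]


datasets = [
    {"name": "orders", "domain": "sales"},
    {"name": "claims", "domain": "insurance"},
    {"name": "visits", "domain": "web"},
]
first_page = list_page(datasets, "name", results_per_page=2)
second_page = list_page(datasets, "name", page=2, results_per_page=2)
```

Sorting happens before slicing, so the chosen order applies across all pages rather than within each page separately.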

Clone Datasets

The user can clone a Dataset by clicking the clone button in the top right corner of the Dataset Details page.

The Clone Dataset page is auto-populated with the metadata of the dataset being cloned, which reduces the effort of filling in every field required to register the dataset.

The only fields the user must enter or change are “Dataset Name” and “Notification Settings”, because a dataset cannot be created with an existing Dataset Name. The user can edit any other field before clicking the “Register” button in the bottom right corner of the form.

The picture below shows the populated fields in the clone dataset form.

Clone Dataset

Once the user clicks the “Register” button, a new dataset will be created. The created dataset will show up in the Datasets page.
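Cloning amounts to copying the source metadata and changing the name. The sketch below models that idea with a hypothetical `DatasetMetadata` record; the fields and the `clone_metadata` helper are assumptions for illustration, not an Amorphic interface.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    domain: str
    connection_type: str
    file_type: str


def clone_metadata(source, new_name, existing_names, **overrides):
    """Copy every field from source, changing the name and any overridden fields."""
    if new_name in existing_names:
        raise ValueError(f"dataset name {new_name!r} is already in use")
    return replace(source, name=new_name, **overrides)


original = DatasetMetadata("orders", "sales", "S3", "csv")
cloned = clone_metadata(original, "orders_copy", existing_names={"orders"})
```

Every field other than the name carries over unchanged unless it is explicitly overridden, which mirrors the pre-populated clone form.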

Delete Dataset

A dataset can be deleted using the “Delete” (trash) icon in the right corner of the page. Once dataset deletion is triggered, all the related metadata is deleted immediately.

Delete Dataset

Note: For bulk deletion of datasets, please check the documentation on How to bulk delete the datasets in Amorphic.