Introduction
The ExampleGen component loads data into the TFX pipeline. It is used when creating training and inference pipelines, and it also partitions the data into training and validation sets. The Robotika ExampleGen container inherits from the TFX ExampleGen component [1] and extends it with an improved UX.
Data Sources
Robotika currently supports loading data in CSV and Parquet formats. If the cloud pipeline is chosen, Robotika reads from S3 buckets; for on-premise deployments, it reads from shared folders. To make your bucket visible to Robotika, either make the bucket public or update its permission policy. We strongly recommend updating the policy, since making a bucket public exposes your data to potential leaks. An example policy is shown below; insert your bucket name in the Resource entries for the policy to work.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::848221505146:user/orchestrator"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/*",
        "arn:aws:s3:::<BUCKET_NAME>"
      ]
    }
  ]
}
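If you manage several buckets, the policy above can also be generated programmatically before attaching it with your usual AWS tooling. A minimal sketch (the bucket name my-robotika-data and the helper make_bucket_policy are illustrative, not part of Robotika):

```python
import json

# The orchestrator principal from the example policy above.
ORCHESTRATOR_ARN = "arn:aws:iam::848221505146:user/orchestrator"

def make_bucket_policy(bucket_name: str) -> str:
    """Build the read-only bucket policy from the section above."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": ORCHESTRATOR_ARN},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # Both the bucket itself (ListBucket) and its objects
                # (GetObject) must appear in Resource.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)

print(make_bucket_policy("my-robotika-data"))
```

The resulting JSON can then be attached with, for example, the AWS CLI: aws s3api put-bucket-policy --bucket my-robotika-data --policy file://policy.json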
How to use
There are two ways to load data into Robotika. The first is to load the full dataset and let the component split it into training and evaluation sets based on the train/eval split ratio you set in the UX. The second is to load two pre-split datasets directly; in this scenario the component does not split the data.
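The effect of the ratio in the first scenario can be illustrated with a small sketch. This is not Robotika's actual implementation (which builds on TFX splitting), just a demonstration of what a 0.8 train/eval ratio means:

```python
import random

def split_dataset(rows, train_ratio=0.8, seed=42):
    """Shuffle rows and split them into train/eval sets by the given ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for the demo
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

train, evaluation = split_dataset(range(100), train_ratio=0.8)
print(len(train), len(evaluation))  # 80 20
```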
Please note that in both scenarios you should specify the folder together with the bucket name; the Robotika component reads all *.csv (or *.parquet) files from the indicated directory. There is no need to specify a file name.
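The directory-level behavior can be mimicked with a short sketch: instead of opening a single file, collect every *.csv file under the folder. The helper read_all_csvs is illustrative only; Robotika does this for you:

```python
import csv
import tempfile
from pathlib import Path

def read_all_csvs(directory):
    """Read every *.csv file in a directory into one list of rows,
    mirroring how the component consumes a folder rather than a file."""
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows

# Demo: two CSV shards in a temporary "bucket folder".
with tempfile.TemporaryDirectory() as d:
    Path(d, "part-0.csv").write_text("a,1\nb,2\n")
    Path(d, "part-1.csv").write_text("c,3\n")
    print(read_all_csvs(d))  # [['a', '1'], ['b', '2'], ['c', '3']]
```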
References
[1] ExampleGen TFX component, https://www.tensorflow.org/tfx/guide/examplegen