
Parquet

1. Overview

Apache Parquet is a file format designed to support fast data processing for complex data, with several notable characteristics:

1. Columnar: Unlike row-based formats such as CSV or Avro, Apache Parquet is column-oriented – meaning the values of each table column are stored next to each other, rather than those of each record.

2. Open-source: Parquet is free to use and open source under the Apache License 2.0, and is compatible with most Hadoop data processing frameworks. To quote the project website, “Apache Parquet is… available to any project… regardless of the choice of data processing framework, data model, or programming language.”

3. Self-describing: In addition to data, a Parquet file contains metadata including schema and structure. Each file stores both the data and the standards used for accessing each record – making it easier to decouple services that write, store, and read Parquet files.
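To make the columnar idea concrete, here is a minimal sketch in plain Python that regroups row-oriented records into per-column lists. The function name `to_columns` and the sample employee records are made up for illustration. Real Parquet files add row groups, encodings, and compression on top of this basic layout.

```python
def to_columns(rows):
    """Regroup a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Values of the same column end up next to each other.
            columns.setdefault(key, []).append(value)
    return columns


# Hypothetical sample data: three employee records stored row by row.
employees = [
    {"name": "Ada", "department": "Engineering"},
    {"name": "Grace", "department": "Research"},
    {"name": "Alan", "department": "Engineering"},
]

columnar = to_columns(employees)
print(columnar["department"])
```

Reading a single column, such as `department`, now touches one contiguous list instead of every record. This is the main reason columnar formats suit analytical workloads.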

Example Use Case: You have a Parquet file that contains your Employee information. You want to use a batch sync to pull this info into a Cinchy table and liberate your data.

The Parquet source supports batch syncs.
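Before configuring a sync, it can help to confirm that the file you plan to upload really is Parquet. The sketch below is not part of Cinchy; it relies only on the Parquet file layout, in which a file starts and ends with the 4-byte magic `PAR1` and stores the metadata footer length as a 4-byte little-endian integer just before the trailing magic. The function name `read_footer_length` is an assumption chosen for this example, and files with encrypted footers (which use a different magic) are not handled.

```python
import struct

MAGIC = b"PAR1"


def read_footer_length(data: bytes) -> int:
    """Return the metadata footer length in bytes, or raise ValueError."""
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic bytes)")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

To use it, read the file in binary mode, for example `read_footer_length(open(path, "rb").read())`. For large files, reading only the first 4 and last 8 bytes is enough.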

2. Info Tab

The table below lists the parameters on the Info tab (Image 1).

Values

| Parameter | Description | Example |
| --- | --- | --- |
| Title | Mandatory. Input a name for your data sync. | Employee Sync |
| Version | Mandatory. This is a pre-populated field containing a version number for your data sync. You can override it if you wish. | 1.0.0 |
| Parameters | Optional. Because this example uses a local upload, we add a parameter that is referenced as "@Filepath" on the Source tab. | |

3. Source Tab

The following table outlines the mandatory and optional parameters on the Source tab (Image 2). These parameters define your data sync source and how it functions.

| Parameter | Description | Example |
| --- | --- | --- |
| (Sync) Source | Mandatory. Select your source from the drop-down menu. | Parquet |
| Source | The location of the source file: a Local upload, Amazon S3, or Azure Blob Storage. The following authentication methods are supported per source: Amazon S3 uses an Access Key ID/Secret Access Key; Azure Blob Storage uses a Connection String. | Local |
| Row Group Size | Mandatory. The size of your Parquet row groups. The recommended disk block/row group/file size is 512 to 1024 MB on HDFS. | |
| Path | Mandatory. The path to the source file to load. To upload a local file, you must first insert a Parameter in the Info tab of the connection (ex: filepath). You then reference that same value in this field (ex: @Filepath). This triggers a File Upload option to import your file. | @Filepath |
| Auth Type | This field defines the authentication type for your data sync. Cinchy supports "Access Key" and "IAM role". When selecting "Access Key", you must provide the key and key secret. When selecting "IAM role", a new field appears for you to paste in the role's Amazon Resource Name (ARN). You must also ensure that (1) the role has at least read access to the source and (2) the Connections pods' role is specified in the data sync config. | |

The Schema section is where you define which source columns you want to sync in your connection. You can repeat the values for multiple columns.

| Parameter | Description | Example |
| --- | --- | --- |
| Name | Mandatory. The name of your column as it appears in the source. | Name |
| Alias | Optional. You may choose to use an alias on your column so that it has a different name in the data sync. | |
| Data Type | Mandatory. The data type of the column values. | Text |
| Description | Optional. You may choose to add a description to your column. | |
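The relationship between a column's Name and its Alias can be pictured with a small data structure. This is a sketch only: `SchemaColumn` and `project_row` are hypothetical names invented for this example and do not reflect how Cinchy stores or applies a schema internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchemaColumn:
    name: str                          # column name as it appears in the source
    data_type: str                     # for example "Text"
    alias: Optional[str] = None        # optional name used within the data sync
    description: Optional[str] = None  # optional free-text description

    @property
    def sync_name(self) -> str:
        # The alias, when present, replaces the source name in the sync.
        return self.alias or self.name


def project_row(row: dict, columns: list) -> dict:
    """Keep only the columns listed in the schema, renamed by alias."""
    return {col.sync_name: row[col.name] for col in columns}


schema = [
    SchemaColumn(name="Name", data_type="Text", alias="Employee Name"),
    SchemaColumn(name="Department", data_type="Text"),
]
source_row = {"Name": "Ada", "Department": "Engineering", "Badge": 42}
print(project_row(source_row, schema))
```

Source columns that are not listed in the schema (here, `Badge`) are left out of the result, which mirrors the idea that the Schema section selects which source columns take part in the sync.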

There are other options available for the Schema section if you click on Show Advanced.

| Parameter | Description | Example |
| --- | --- | --- |
| Mandatory | If both Mandatory and Validated are checked on a column, then rows where the column is empty are rejected. If just Mandatory is checked on a column, then all rows are synced with the execution log status of failed, and the source error of "Mandatory Rule Violation". If just Validated is checked on a column, then all rows are synced. | |
| Validate Data | If both Mandatory and Validated are checked on a column, then rows where the column is empty are rejected. If just Validated is checked on a column, then all rows are synced. | |
| Trim Whitespace | Optional; applies only when the data type is Text. You can choose whether to trim the whitespace (that is, spaces and other non-printing characters). | |
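The interaction between Mandatory and Validate Data can be summarized as a small decision function. This is an illustrative reading of the rules above, not Cinchy's implementation: the function name `row_outcome` and its return shape are hypothetical, and the sketch assumes that the "Mandatory Rule Violation" error is only relevant to rows where the column is empty.

```python
def row_outcome(is_empty: bool, mandatory: bool, validated: bool):
    """Return (action, source_error) for one column value in one row."""
    if mandatory and validated:
        # Both checked: rows where the column is empty are rejected.
        if is_empty:
            return ("rejected", None)
        return ("synced", None)
    if mandatory:
        # Only Mandatory checked: rows are still synced, but the violation
        # is reported (assumed here to apply to empty values only).
        if is_empty:
            return ("synced", "Mandatory Rule Violation")
        return ("synced", None)
    # Only Validated checked, or neither (the latter is an assumption):
    # all rows are synced.
    return ("synced", None)


print(row_outcome(is_empty=True, mandatory=True, validated=True))
print(row_outcome(is_empty=True, mandatory=True, validated=False))
```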

You can choose to add in a Transformation > String Replacement by inputting the following:

| Parameter | Description | Example |
| --- | --- | --- |
| Pattern | Mandatory if using a Transformation. The pattern for your string replacement, that is, the string that will be searched for and replaced. | |
| Replacement | What you want to replace your pattern with. | |

Note that you can have more than one String Replacement.
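As a rough picture of how Trim Whitespace and several String Replacements might combine on a text value, consider the sketch below. It makes three assumptions that the source does not confirm: patterns are treated as literal strings rather than regular expressions, replacements are applied in the order listed, and trimming happens first. The function name `apply_replacements` is hypothetical.

```python
def apply_replacements(value: str, replacements, trim: bool = False) -> str:
    """Optionally trim a text value, then apply (pattern, replacement) pairs."""
    if trim:
        # str.strip() removes leading and trailing whitespace characters.
        value = value.strip()
    for pattern, replacement in replacements:
        # Literal search-and-replace; each step sees the previous result.
        value = value.replace(pattern, replacement)
    return value


cleaned = apply_replacements(
    "  Acme Inc.  ",
    [("Inc.", "Incorporated"), ("Acme", "ACME")],
    trim=True,
)
print(cleaned)  # prints "ACME Incorporated"
```

Because each replacement operates on the output of the previous one, the order of the list can change the result when patterns overlap.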

You can optionally add a source filter to your data sync. Review the source filter documentation for more information.

4. Next Steps

Configure your Destination.
Define your Sync Behaviour.
Add in your Post Sync Scripts, if required.
Define your Permissions.
Click Jobs > Start a Job to begin your sync.
