Parquet
Overview
Apache Parquet is a file format designed to support fast data processing for complex data, with several notable characteristics:
1. Columnar: Unlike row-based formats such as CSV or Avro, Apache Parquet is column-oriented, meaning the values of each table column are stored next to each other rather than those of each record.
2. Open-source: Parquet is free to use and open source under the Apache license, and is compatible with most Hadoop data processing frameworks. To quote the project website, “Apache Parquet is… available to any project… regardless of the choice of data processing framework, data model, or programming language.”
3. Self-describing: In addition to data, a Parquet file contains metadata including schema and structure. Each file stores both the data and the standards used for accessing each record, making it easier to decouple services that write, store, and read Parquet files (illustrated in the sketch below).
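To make these characteristics concrete, here is a minimal sketch using Python and the pyarrow library (an assumption; any Parquet-capable library works). It writes a small table to Parquet, reads the schema and row-group metadata back from the file footer, and then loads a single column without touching the others.

```python
# Minimal sketch (assumes pyarrow is installed): write a small table to
# Parquet, then read its self-describing metadata and a single column.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "department": ["Engineering", "Engineering", "Research"],
})
pq.write_table(table, "example.parquet")

# Self-describing: schema, row count, and row-group layout live in the file
# footer and can be read without scanning the data itself.
metadata = pq.read_metadata("example.parquet")
print(metadata.schema)
print(metadata.num_rows, "rows in", metadata.num_row_groups, "row group(s)")

# Columnar: a single column can be read without loading the other columns.
names = pq.read_table("example.parquet", columns=["name"])
print(names)
```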
Example use case
You have a Parquet file that contains your employee information. You want to use a batch sync to pull this information into a Cinchy table and liberate your data.
The Parquet source supports batch syncs.
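Before configuring the sync, it can help to inspect the columns and types in the source file so they can be mapped to the target Cinchy table. The following is a sketch assuming pyarrow; the file name employees.parquet and its contents are hypothetical.

```python
# Hypothetical employees.parquet: list its columns and types so they can be
# mapped to columns in the target Cinchy table.
import pyarrow.parquet as pq

schema = pq.read_schema("employees.parquet")
for field in schema:
    print(f"{field.name}: {field.type}")
```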
Info tab
You can find the parameters in the Info tab below (Image 1).
Values

| Parameter | Description | Example |
| --- | --- | --- |
| Title | Mandatory. Input a name for your data sync. | Employee Sync |
| Variables | Since we're doing a local upload, we use "@Filepath". | |
| Permissions | Data syncs are role-based access systems where you can give specific groups read, write, execute, or admin (all of the above) access. Inputting at least an Admin Group is mandatory. | |
Source tab
The following table outlines the mandatory and optional parameters you will find on the Source tab.
These parameters define your data sync source and how it functions.
| Parameter | Description | Example |
| --- | --- | --- |
| (Sync) Source | Mandatory. Select your source from the drop-down menu. | Parquet |
| Source | The location of the source file: either a Local upload, Amazon S3, or Azure Blob Storage. The following authentication methods are supported per source: Amazon S3: Access Key ID/Secret Access Key; Azure Blob Storage: Connection String. | Local |
| Row Group Size | The recommended disk block/row group/file size is 512 to 1024 MB on HDFS (see the sketch after this table). | |
| Path | Mandatory. The path to the source file to load. To upload a local file, you must first insert a Variable in the Info tab of the connection (Ex: filepath). Then, reference that same value in this field (Ex: @Filepath) to trigger a File Upload option for importing your file. | @Filepath |
| Auth Type | This field defines the authentication type for your data sync. Cinchy supports "Access Key" and "IAM" role. When selecting **Access Key**, you must provide the key and key secret. When selecting **IAM role**, a new field will appear for you to paste in the role's Amazon Resource Name (ARN). You also must ensure that: | |
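Row groups are fixed at write time, so if the source file's row groups are far from the recommended size, the file can be rewritten before it is synced. The following is a minimal sketch, assuming pyarrow and the hypothetical employees.parquet from earlier; note that pyarrow's row_group_size is expressed in rows, so the value should be chosen so each group's encoded size lands near the recommended range.

```python
# Minimal sketch (pyarrow assumed): rewrite a Parquet file with a target
# row-group size before using it as a sync source.
import pyarrow.parquet as pq

table = pq.read_table("employees.parquet")

# row_group_size is a row count; pick it so each group's encoded size lands
# near the recommended 512-1024 MB.
pq.write_table(table, "employees_repacked.parquet", row_group_size=1_000_000)

print(pq.read_metadata("employees_repacked.parquet").num_row_groups)
```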
Test Connection
You can use the "Test Connection" button to ensure that your credentials are properly configured to access your source.
If configured correctly, a "Connection Successful" pop-up will appear.
If configured incorrectly, a "Connection Failed" pop-up will appear along with a link to the applicable error logs to help you troubleshoot.
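For an Amazon S3 source with Access Key authentication, you can also sanity-check the credentials outside of Connections. The following is a sketch assuming boto3; the bucket name, object key, and credentials are placeholders.

```python
# Independent check (boto3 assumed): confirm the Access Key credentials can
# read the source object before running the sync. Names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",       # Access Key ID
    aws_secret_access_key="<secret>",  # Secret Access Key
)

# head_object raises a ClientError if the credentials lack read access.
response = s3.head_object(Bucket="my-source-bucket", Key="employees.parquet")
print(response["ContentLength"], "bytes")
```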
Next steps
1. Configure your Destination.
2. Define your Sync Actions.
3. Add in your Post Sync Scripts, if required.
4. Click Jobs > Start a Job to begin your sync.