Description
An extension to FsDataWriter that writes in Parquet format in the form of either Avro, Protobuf or ParquetGroup. This implementation allows users to specify the CodecFactory to use through the configuration property writer.codec.type. By default, the snappy codec is used. See Developer Notes to make sure you are using the right Gobblin jar.
Usage
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
Example Pipeline Configuration
example-parquet.pull contains an example of generating test data and writing to Parquet files.
Configuration
Key | Description | Default Value | Required |
---|---|---|---|
writer.parquet.page.size | The page size threshold, in bytes. | 1048576 | No |
writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages, in bytes. | 134217728 | No |
writer.parquet.dictionary | Whether to enable dictionary encoding. Parquet offers dictionary encoding for data with a small number of unique values (< 10^5), which significantly improves compression and processing speed. | true | No |
writer.parquet.validate | Whether to turn on validation using the schema. This validation is done by ParquetWriter, not by Gobblin. | false | No |
writer.parquet.version | Version of the Parquet writer to use. Available versions are v1 and v2. | v1 | No |
writer.parquet.format | In-memory format of the record being written to Parquet. Options are AVRO, PROTOBUF and GROUP. | GROUP | No |
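As a rough illustration, the keys above could be combined with the Usage properties in a job file along the following lines. This is a sketch rather than a recommended configuration: every key comes from this page, the non-default values are arbitrary choices for illustration, and writing the codec value as snappy is an assumption about the accepted spelling.

```properties
# Required writer settings (from Usage)
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET

# Compression codec; snappy is already the default (value spelling assumed)
writer.codec.type=snappy

# Records arrive as Avro instead of the default GROUP
writer.parquet.format=AVRO

# Use the v2 writer instead of the default v1
writer.parquet.version=v2

# Let ParquetWriter validate records against the schema (off by default)
writer.parquet.validate=true

# Defaults repeated here only for visibility
writer.parquet.page.size=1048576
writer.parquet.dictionary.page.size=134217728
writer.parquet.dictionary=true
```

For reference, 1048576 is 1024 * 1024 and 134217728 is 128 * 1024 * 1024.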
Developer Notes
Gobblin provides integration with two different versions of Parquet through its modules. Use the appropriate jar based on the Parquet library you use in your code.
Jar | Dependency | Gobblin Release |
---|---|---|
gobblin-parquet | com.twitter:parquet-hadoop-bundle | >= 0.12.0 |
gobblin-parquet-apache | org.apache.parquet:parquet-hadoop | >= 0.15.0 |
If you want to look at the code, check out:
Module | File |
---|---|
gobblin-parquet | ParquetHdfsDataWriter |
gobblin-parquet | ParquetDataWriterBuilder |
gobblin-parquet-apache | ParquetHdfsDataWriter |
gobblin-parquet-apache | ParquetDataWriterBuilder |