Description

An extension of FsDataWriter that writes records out in Parquet format, taking its input as either Avro, Protobuf, or ParquetGroup records. The implementation lets users choose the compression codec (CodecFactory) through the configuration property writer.codec.type; by default, the snappy codec is used. See the Developer Notes below to make sure you are using the right Gobblin jar for the Parquet library in your project.
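
For example, to switch from the default snappy codec to gzip (shown purely as an illustration; use any codec name supported by the Parquet library in your build), add:

writer.codec.type=gzip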

Usage

writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET

Example Pipeline Configuration
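
The snippet below is a minimal sketch of a job (.pull) file that uses this writer. The job name, source class, and converter class are illustrative placeholders only; the writer.* properties come from the Usage and Configuration sections of this page, and the publisher shown is Gobblin's BaseDataPublisher.

job.name=ExampleParquetJob
job.group=examples

# Placeholder source and converter; substitute your own classes.
source.class=com.example.MySource
converter.classes=com.example.MyToParquetGroupConverter

writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
writer.parquet.format=GROUP
writer.parquet.version=v1
writer.codec.type=snappy

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher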

Configuration

| Key | Description | Default Value | Required |
|-----|-------------|---------------|----------|
| writer.parquet.page.size | The page size threshold. | 1048576 | No |
| writer.parquet.dictionary.page.size | The page size threshold for the dictionary pages. | 134217728 | No |
| writer.parquet.dictionary | Turns dictionary encoding on. Parquet uses dictionary encoding for data with a small number of unique values (< 10^5), which yields significant compression and boosts processing speed. | true | No |
| writer.parquet.validate | Turns on validation against the schema. This validation is done by the ParquetWriter, not by Gobblin. | false | No |
| writer.parquet.version | Version of the Parquet writer to use. Available versions are v1 and v2. | v1 | No |
| writer.parquet.format | In-memory format of the record being written to Parquet. Options are AVRO, PROTOBUF, and GROUP. | GROUP | No |
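
For example, a job that wants larger pages, the v2 writer, and Avro records in memory could override the defaults like this (the values are illustrative, not tuning recommendations):

writer.parquet.page.size=2097152
writer.parquet.dictionary.page.size=134217728
writer.parquet.dictionary=true
writer.parquet.validate=false
writer.parquet.version=v2
writer.parquet.format=AVRO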

Developer Notes

Gobblin integrates with two different Parquet libraries through separate modules. Use the jar that matches the Parquet library you use in your code.

| Jar | Dependency | Gobblin Release |
|-----|------------|-----------------|
| gobblin-parquet | com.twitter:parquet-hadoop-bundle | >= 0.12.0 |
| gobblin-parquet-apache | org.apache.parquet:parquet-hadoop | >= 0.15.0 |
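
For example, a build that uses the Apache Parquet library would depend on the gobblin-parquet-apache artifact; assuming the org.apache.gobblin group id used by Apache Gobblin releases, the coordinates would look like org.apache.gobblin:gobblin-parquet-apache:&lt;gobblin-version&gt;, with the version placeholder replaced by the release you run, per the table above.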

If you want to look at the code, check out:

| Module | File |
|--------|------|
| gobblin-parquet | ParquetHdfsDataWriter |
| gobblin-parquet | ParquetDataWriterBuilder |
| gobblin-parquet-apache | ParquetHdfsDataWriter |
| gobblin-parquet-apache | ParquetDataWriterBuilder |