Discussion about this post

Mahantesh Angadi

Hi Shiyan Xu,

It's a nice article. Thanks for posting this.

Do you have a code sample that implements clustering (stitching small files into bigger files) and runs asynchronously?

We have implemented an AWS Glue job that reads raw data from S3 and, after some transformations, writes it to a Hudi table (the processed table). For performance, we have set "hoodie.upsert.shuffle.parallelism": '100'.

This has created a lot of small files. We would like to stitch these small files into bigger files.
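
For reference, a minimal sketch of the kind of clustering configuration being asked about. The helper name `clustering_options` and its default sizes are hypothetical. The option keys are Hudi clustering settings as I recall them from the Hudi configuration reference, so they should be checked against the Hudi version in use. Note also that in a plain batch write (such as a Glue job) asynchronous clustering may need a separate clustering job or a streaming writer to actually execute, so the sketch defaults to inline clustering.

```python
MB = 1024 * 1024


def clustering_options(target_file_mb=1024, small_file_mb=300,
                       sort_columns=(), run_async=False, max_commits=4):
    """Build Hudi clustering options as a plain dict of strings.

    Hypothetical helper: only assembles option keys, it does not talk to Spark.
    """
    if run_async:
        # Schedule clustering asynchronously every `max_commits` commits.
        mode_opts = {
            "hoodie.clustering.async.enabled": "true",
            "hoodie.clustering.async.max.commits": str(max_commits),
        }
    else:
        # Run clustering as part of the write, every `max_commits` commits.
        mode_opts = {
            "hoodie.clustering.inline": "true",
            "hoodie.clustering.inline.max.commits": str(max_commits),
        }

    opts = dict(mode_opts)
    # Files smaller than this limit are candidates for clustering.
    opts["hoodie.clustering.plan.strategy.small.file.limit"] = str(small_file_mb * MB)
    # Upper bound on the size of the files clustering produces.
    opts["hoodie.clustering.plan.strategy.target.file.max.bytes"] = str(target_file_mb * MB)
    if sort_columns:
        # Optional: sort rows by these columns while rewriting files.
        opts["hoodie.clustering.plan.strategy.sort.columns"] = ",".join(sort_columns)
    return opts


# Assumed usage with an existing DataFrame `df`, a dict of base Hudi
# write options `hudi_options`, and a table path `base_path`:
#   df.write.format("hudi") \
#       .options(**hudi_options) \
#       .options(**clustering_options(sort_columns=["event_date"])) \
#       .mode("append").save(base_path)
```

Keeping the options in one small function makes it easy to switch between inline and asynchronous clustering without touching the rest of the job.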

Albert Wong

How does the choice of clustering layout (linear, z-ordering) impact SQL queries? Does z-ordering benefit specific types of SQL queries?
