3 Comments
Mahantesh Angadi

Hi Shiyan Xu,

It's a nice article. Thanks for posting this.

Do you have a code sample for implementing clustering (stitching small files into big files) that runs asynchronously?

We have implemented a Glue job that reads raw data from S3 and, after some transformations, writes the data to a Hudi table (the processed table). For performance, we have set "hoodie.upsert.shuffle.parallelism": '100'.

This has created a lot of small files. We would like to stitch these small files into big files.
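
For readers with the same question, here is a minimal sketch of the writer options involved. It is not the article author's code. The helper name `clustering_options` and its default sizes are hypothetical; the `hoodie.clustering.*` keys are Hudi's clustering configs. One assumption to check against your Hudi version: the async flag is typically picked up by streaming writers (Hudi Streamer, Spark Structured Streaming), so a batch Glue job may need inline clustering (`async_mode=False`) or a separately scheduled clustering job instead.

```python
def clustering_options(sort_columns=None, target_file_mb=1024,
                       small_file_mb=300, async_mode=True, max_commits=4):
    """Build Hudi writer options that schedule clustering of small files.

    Hypothetical helper: the sizes and commit count are illustrative defaults,
    not tuned recommendations.
    """
    mb = 1024 * 1024
    opts = {
        # Files smaller than this are candidates for clustering.
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_mb * mb),
        # Clustering rewrites candidates into files of up to this size.
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_file_mb * mb),
    }
    if async_mode:
        # Assumption: the writer in use supports async table services.
        opts["hoodie.clustering.async.enabled"] = "true"
        opts["hoodie.clustering.async.max.commits"] = str(max_commits)
    else:
        # Inline clustering runs as part of the write itself.
        opts["hoodie.clustering.inline"] = "true"
        opts["hoodie.clustering.inline.max.commits"] = str(max_commits)
    if sort_columns:
        opts["hoodie.clustering.plan.strategy.sort.columns"] = ",".join(sort_columns)
    return opts


# Usage in a PySpark job (df, hudi_options, and base_path are placeholders):
# df.write.format("hudi") \
#     .options(**hudi_options, **clustering_options()) \
#     .mode("append").save(base_path)
```

The returned dictionary is meant to be merged with the existing write options, such as the `hoodie.upsert.shuffle.parallelism` setting mentioned above.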

Albert Wong

How does clustering layout choice (linear, z-ordering) impact SQL? Does z-ordering benefit specific types of SQL queries?

Shiyan Xu

Layout choices improve query performance in general and are not specific to SQL. SQL is just one way to compose queries; they execute no differently from queries written with the Spark DataFrame API, for example.

If you query a housing dataset with a filter on a specific range of latitudes and longitudes, clustering the data by lat/lon using z-ordering or a Hilbert curve can improve query performance. The benefit depends on the dataset and the predicates, not on the type of SQL query (after all, you just run SELECT statements with different clauses).
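
To illustrate why a z-order layout helps a filter on both columns, here is a small sketch of bit interleaving (a Morton code). It is a simplified illustration, not Hudi's implementation; the function names and the 16-bit quantization of lat/lon are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-value.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_value(lat: float, lon: float, bits: int = 16) -> int:
    """Quantize lat/lon onto an integer grid, then interleave (illustrative)."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave_bits(x, y, bits)


# Compare how a 4x4 grid of (x, y) cells is ordered under each layout.
grid = [(x, y) for x in range(4) for y in range(4)]
linear = sorted(grid)  # sort by x, then y
z_order = sorted(grid, key=lambda c: interleave_bits(c[0], c[1], 2))

print(linear[:4])   # first four cells form a strip along one x value
print(z_order[:4])  # first four cells form a compact 2x2 block
```

With a linear sort, rows that are close in the first column but far apart in the second end up together, so a range filter on both columns touches many files. With z-ordering, rows that are close in both columns tend to land in the same files, which lets file-level min/max statistics skip more data.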