3 Comments
Mahantesh Angadi

Hi Shiyan Xu,

It's a nice article. Thanks for posting this.

Do you have a code sample for implementing clustering (stitching small files into bigger files) that runs asynchronously?

We have implemented a Glue job that reads raw data from S3 and, after some transformations, writes it to a Hudi table (the processed table). For performance, we have set "hoodie.upsert.shuffle.parallelism": '100'.

This has created a lot of small files. We would like to stitch these small files into bigger files.
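
For reference, a minimal sketch of the kind of writer options this setup would involve. The option keys are taken from Hudi's clustering configuration as best I recall and should be checked against the Hudi version in use. The helper `clustering_options` and all sizes and commit counts are hypothetical, chosen only for illustration. Note also the assumption that async clustering is meant for long-running writers (for example Structured Streaming), so a batch Glue job would more likely use the inline mode shown as the default here.

```python
MB = 1024 * 1024

def clustering_options(sort_columns, target_file_mb=512, small_file_mb=128,
                       commits_between_runs=4, run_async=False):
    """Build Hudi clustering options as a plain dict of strings (hypothetical helper)."""
    mode = "async" if run_async else "inline"
    opts = {
        # Exactly one of inline / async is switched on.
        "hoodie.clustering.inline": str(not run_async).lower(),
        "hoodie.clustering.async.enabled": str(run_async).lower(),
        # Schedule clustering after this many commits (illustrative value).
        f"hoodie.clustering.{mode}.max.commits": str(commits_between_runs),
        # Files smaller than this limit are candidates for stitching.
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_mb * MB),
        # Upper bound on the size of the files that clustering writes.
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_file_mb * MB),
    }
    if sort_columns:
        # Optional: sort rows by these columns while rewriting the files.
        opts["hoodie.clustering.plan.strategy.sort.columns"] = ",".join(sort_columns)
    return opts

# Merged with the existing writer setting from the job described above.
hudi_options = {
    "hoodie.upsert.shuffle.parallelism": "100",
    **clustering_options(sort_columns=[]),
}
```

The resulting dict would be passed to the Hudi writer in the usual way, e.g. `df.write.format("hudi").options(**hudi_options)`, alongside the table name, record key, and other required options that are omitted here.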

Albert Wong

How does clustering layout choice (linear, z-ordering) impact SQL? Does z-ordering benefit specific types of SQL queries?

Shiyan Xu

Layout choices improve query performance in general and are not specific to SQL. SQL is just one way to compose queries; execution is no different from, for example, queries written with the Spark DataFrame API.

If you query a housing dataset with a filter on a specific range of latitudes and longitudes, running clustering ordered by lat/lon with z-ordering or a Hilbert curve can improve query performance. The benefit depends on the dataset and the predicates, not on the type of SQL query (after all, you are just running SELECT statements with different clauses).
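
To make the lat/lon point concrete, here is a toy sketch (not Hudi's actual implementation) that compares a plain linear sort with a z-order sort on a 4x4 grid of integer coordinates. It assumes files are pruned using per-file min/max values on each column, and the helpers `interleave_bits`, `chunk`, and `files_touched` are hypothetical names made up for this illustration.

```python
from itertools import product

def interleave_bits(x, y, bits=16):
    """Morton (z-order) value: bits of x on even positions, bits of y on odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def chunk(points, size):
    """Split the sorted points into fixed-size 'files'."""
    return [points[i:i + size] for i in range(0, len(points), size)]

def files_touched(files, x_range, y_range):
    """Count files whose min/max on both columns overlap the query ranges."""
    def overlaps(values, lo, hi):
        return min(values) <= hi and max(values) >= lo
    return sum(
        1 for f in files
        if overlaps([p[0] for p in f], *x_range)
        and overlaps([p[1] for p in f], *y_range)
    )

points = list(product(range(4), range(4)))  # (x, y) stand in for lat/lon buckets
linear_files = chunk(sorted(points), 4)     # sorted by x, then y
z_files = chunk(sorted(points, key=lambda p: interleave_bits(*p)), 4)

box = ((0, 1), (0, 1))     # filter on both columns
y_only = ((0, 3), (0, 1))  # filter on the second column only

print(files_touched(linear_files, *box), files_touched(z_files, *box))        # 2 1
print(files_touched(linear_files, *y_only), files_touched(z_files, *y_only))  # 4 2
```

With the linear sort, each file spans the full range of the second column, so a filter on that column cannot skip any file. With the z-order sort, each file covers a compact 2x2 block, so filters on either column (or both) can skip files. The same query text benefits or not depending only on how the data is laid out and which columns the predicate touches.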
