Apache Hudi: From Zero To One (6/10)

Shiyan Xu

Nov 13, 2023

Demystify clustering and space-filling curves


3 Comments

Mahantesh Angadi

Aug 19

Hi Shiyan Xu,

It's a nice article. Thanks for posting this.

Do you have a code sample for implementing clustering, i.e. stitching small files into bigger ones, that runs asynchronously?

We have implemented a Glue job that reads raw data from S3 and, after some transformations, writes it to a Hudi table (the processed table). For performance, we set "hoodie.upsert.shuffle.parallelism": '100'.

This has created a lot of small files. We would like to stitch these small files into bigger ones.
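
For the small-file question above, here is a minimal sketch of how clustering might be switched on through Hudi writer options from a PySpark job. The helper name `clustering_options`, the size thresholds, and the sort column `event_ts` are assumptions made for illustration; they do not come from the article or from the job described. The sketch uses inline clustering because it is the simplest fit for a batch job. As far as I understand, asynchronous clustering (`hoodie.clustering.async.enabled`) is usually driven by a streaming writer or a separately scheduled clustering job, so check the Hudi documentation for the version you run.

```python
MB = 1024 * 1024


def clustering_options(target_file_mb=1024, small_file_mb=300,
                       max_commits=4, sort_columns=None):
    """Build Hudi writer options that enable inline clustering.

    All default values are illustrative assumptions; tune them for your table.
    """
    opts = {
        # Schedule and execute clustering as part of the write itself.
        "hoodie.clustering.inline": "true",
        # Trigger clustering after this many commits.
        "hoodie.clustering.inline.max.commits": str(max_commits),
        # Files smaller than this size are candidates for clustering.
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_mb * MB),
        # Upper bound on the size of the files that clustering writes.
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_file_mb * MB),
    }
    if sort_columns:
        # Optionally sort records by these columns while rewriting files.
        opts["hoodie.clustering.plan.strategy.sort.columns"] = ",".join(sort_columns)
    return opts


# Merge with the option already used by the job in the comment above.
hudi_options = {
    "hoodie.upsert.shuffle.parallelism": "100",
    **clustering_options(sort_columns=["event_ts"]),  # hypothetical column
}

# In the Glue job, the options would be passed to the writer, for example:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The write call is left commented out because the table name, record key, and base path are specific to each job.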

Albert Wong

Jul 9, 2024

How does clustering layout choice (linear, z-ordering) impact SQL? Does z-ordering benefit specific types of SQL queries?


Shiyan Xu

Aug 26, 2024

Layout choices improve query performance in general; the effect is not specific to SQL. SQL is just one way to compose queries, and it executes no differently from, for example, queries written with the Spark DataFrame API.

If you query a housing dataset with a filter on a specific range of latitudes and longitudes, clustering the data by lat/lon with Z-ordering or a Hilbert curve can improve query performance. The benefit depends on the dataset and the predicates, not on the type of SQL query (after all, you just run SELECT statements with different clauses).
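
To make the lat/lon example concrete, here is a small sketch of the idea behind Z-ordering: quantize each coordinate to an integer, then interleave the bits so that one sortable number reflects both dimensions. This is a textbook Morton-code illustration, not Hudi's implementation; the function names and the 16-bit resolution are assumptions chosen for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) value.

    Bits of x land on even positions, bits of y on odd positions.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] to an integer in [0, 2**bits - 1]."""
    scale = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * scale)


def z_value(lat: float, lon: float, bits: int = 16) -> int:
    """Z-order value for a latitude/longitude pair."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)


# Two points in the same city and one on another continent.
nearby = z_value(37.77, -122.42)
also_nearby = z_value(37.78, -122.41)
far_away = z_value(-33.87, 151.21)
```

Sorting records by such a value tends to place rows that are close in both latitude and longitude into the same files, which is why a filter on both columns can skip more files than it could with a linear sort on one column. Locality is not guaranteed for every pair of neighbors, since points on either side of a quantization boundary can end up far apart in the ordering.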

Datumagic
