Hi Shiyan Xu,
It's a nice article. Thanks for posting this.
Do you have a code sample that implements clustering (stitching small files into big files) and runs asynchronously?
We have implemented a Glue job that reads raw data from S3 and, after some transformations, writes the data to a Hudi table (the processed table). For performance we have set "hoodie.upsert.shuffle.parallelism": '100'.
This has created a lot of small files. We would like to stitch these small files into bigger files.
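For the small-file question above, here is a minimal sketch of the writer options involved, not a definitive setup. It assumes the clustering keys from Hudi's configuration reference (please verify them against the Hudi version bundled with your Glue job). The helper name `clustering_options`, the size defaults, and the `event_date` sort column are hypothetical, chosen only for illustration.

```python
MB = 1024 * 1024


def clustering_options(sort_columns, small_file_limit_mb=300,
                       target_file_mb=1024, max_commits=4):
    """Build Hudi clustering options as a plain dict (hypothetical helper).

    Files smaller than small_file_limit_mb become clustering candidates and
    are rewritten into files of up to target_file_mb. The defaults here are
    illustrative assumptions, not recommendations.
    """
    return {
        # Schedule and run clustering asynchronously instead of inline.
        "hoodie.clustering.async.enabled": "true",
        # Trigger clustering after this many commits.
        "hoodie.clustering.async.max.commits": str(max_commits),
        # Files below this size (bytes) are picked up for stitching.
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_limit_mb * MB),
        # Upper bound (bytes) for the files produced by clustering.
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_file_mb * MB),
        # Columns used to sort records while rewriting them.
        "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
    }


# Merge with the options the job already passes to the Hudi writer.
existing_options = {"hoodie.upsert.shuffle.parallelism": "100"}
hudi_options = {**existing_options, **clustering_options(["event_date"])}

# In the Glue job, the merged dict would be passed to the writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(table_path)
```

One caveat: as far as I know, asynchronous execution is tied to streaming writers or a separately submitted clustering job, so a batch Glue job may need `hoodie.clustering.inline` set to `true` instead, or a follow-up job that runs the scheduled clustering plan.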
How does the choice of clustering layout (linear, z-ordering) affect SQL queries? Does z-ordering benefit specific types of SQL queries?
Layout choices improve query performance in general; they are not specific to SQL. SQL is just a way to compose queries, and it executes no differently from queries written with Spark DataFrames, for example.
If you query a housing info dataset with a specific range of latitudes and longitudes as a filter, running clustering with lat/lon ordering using z-ordering or Hilbert curves can improve query performance. This depends on the dataset and the predicates, not on the type of SQL query (after all, you just run SELECT statements with different clauses).
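To make the lat/lon point concrete, here is a toy illustration in plain Python, not Hudi code. It assumes an 8x8 grid of integer (lat, lon) points split into files of 8 records each, and it counts how many files a min/max-based filter would have to read under a linear sort versus a z-order sort. All names in it are made up for the example.

```python
from itertools import product


def z_value(lat, lon, bits=3):
    """Interleave the bits of lat and lon into one Morton (z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((lat >> i) & 1) << (2 * i)       # lat bits go to even positions
        z |= ((lon >> i) & 1) << (2 * i + 1)   # lon bits go to odd positions
    return z


def split_into_files(points, per_file):
    """Chunk an already sorted list of points into fixed-size 'files'."""
    return [points[i:i + per_file] for i in range(0, len(points), per_file)]


def files_scanned(files, lat_range, lon_range):
    """Count files whose min/max lat and lon overlap the filter ranges."""
    count = 0
    for f in files:
        lats = [p[0] for p in f]
        lons = [p[1] for p in f]
        if (min(lats) <= lat_range[1] and max(lats) >= lat_range[0]
                and min(lons) <= lon_range[1] and max(lons) >= lon_range[0]):
            count += 1
    return count


points = list(product(range(8), range(8)))  # (lat, lon) pairs
linear_files = split_into_files(sorted(points), 8)
z_files = split_into_files(sorted(points, key=lambda p: z_value(*p)), 8)

lon_only = ((0, 7), (2, 3))  # filter on lon only, any lat
box = ((2, 3), (2, 3))       # filter on both lat and lon

print("lon-only filter:", files_scanned(linear_files, *lon_only),
      "linear vs", files_scanned(z_files, *lon_only), "z-order")
print("box filter:", files_scanned(linear_files, *box),
      "linear vs", files_scanned(z_files, *box), "z-order")
```

In this toy setup the lon-only filter touches all 8 files under the linear sort but only 2 under z-order, and the box filter touches 2 versus 1. The gain comes from the predicate matching the clustered columns, which is why the benefit is predicate dependent rather than tied to a type of SQL query.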