<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Datumagic]]></title><description><![CDATA[About magics made using data.]]></description><link>https://blog.datumagic.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!b4vo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89aea865-161a-4078-a03b-1fac2a3a2045_926x926.png</url><title>Datumagic</title><link>https://blog.datumagic.ai</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 10:49:53 GMT</lastBuildDate><atom:link href="https://blog.datumagic.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Shiyan Xu]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datumagic@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datumagic@substack.com]]></itunes:email><itunes:name><![CDATA[Shiyan Xu]]></itunes:name></itunes:owner><itunes:author><![CDATA[Shiyan Xu]]></itunes:author><googleplay:owner><![CDATA[datumagic@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datumagic@substack.com]]></googleplay:email><googleplay:author><![CDATA[Shiyan Xu]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Building a RAG-based AI Recommender (2/2)]]></title><description><![CDATA[An end-to-end code walkthrough.]]></description><link>https://blog.datumagic.ai/p/building-a-rag-based-ai-recommender-147</link><guid isPermaLink="false">https://blog.datumagic.ai/p/building-a-rag-based-ai-recommender-147</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Fri, 29 Aug 2025 05:00:06 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!k34H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a href="https://blog.datumagic.ai/p/building-a-rag-based-ai-recommender">part 1</a> of this blog, we explored the core concepts of Retrieval-Augmented Generation (RAG) and how Apache Hudi's incremental processing capabilities provide a critical foundation for an efficient RAG data pipeline.</p><p>In this post, we'll move from theory to practice with a hands-on example, building an end-to-end demo RAG application from the ground up.</p><h2>Workflow Overview</h2><p>As illustrated in the figure below, our RAG application consists of five main components:</p><ul><li><p><strong>User-Facing API:</strong> The primary interface for the recommendation service.</p></li><li><p><strong>Product Service (Retrieval):</strong> For our e-commerce use case, this service manages access to and updates of product information. 
It represents the "retrieval" phase of RAG.</p></li><li><p><strong>LLM Service (Generation):</strong> This service builds context from the retrieved product information and connects to an LLM to generate user-facing responses. This constitutes the "generation" phase.</p></li><li><p><strong>Product Store:</strong> A Hudi table that stores the primary product information, which is built incrementally and serves as the single source of truth.</p></li><li><p><strong>Vector Store:</strong> Responsible for storing product information embeddings and handling similarity search requests from the Product Service.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k34H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k34H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!k34H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!k34H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!k34H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k34H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75862,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.ai/i/172215217?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k34H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!k34H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!k34H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!k34H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c8103ac-2e87-45ab-adec-41f67aa525c6_960x540.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p>For the user-facing API, we define an async endpoint, <code>/recommend</code>, to handle incoming requests. This endpoint processes natural language queries from users and uses the product and LLM services to retrieve context and generate a response. The API server also runs a periodic sync task that incrementally reads the latest product information from the Product Store, converts it to vectors, and upserts them into the vector store to keep it up to date.</p><p>The product service is responsible for embedding the user's query and performing a similarity search on the vector store. 
When the similarity search returns relevant product information, the Product Service passes this context to the LLM Service.</p><p>The LLM service then uses this context to compose a prompt, which it sends to an LLM API. Once a response is generated, the service formats it and delivers the final API response to the user.</p><p>Let's now examine each step, highlighting the key code snippets involved.</p><h2>API Server</h2><p>We use <a href="https://fastapi.tiangolo.com/">FastAPI</a> to quickly prototype the API endpoint <code>/recommend</code> as:</p><pre><code>@app.post("/recommend")
async def get_recommendations(request: RecommendationRequest) -&gt; dict:
    return recommendation_service.get_recommendations(request.query)</code></pre><p>We also define a background sync task that runs periodically while the server app is running, keeping the vector store up to date:</p><pre><code>async def background_sync(svc: RecommendationService):
    while True:
        try:
            logger.info("Starting background sync...")
            svc.sync_data_source()
            logger.info("Background sync completed")
        except Exception as e:
            logger.error(f"Background sync failed: {e}")

        await asyncio.sleep(config.update_interval)</code></pre><h3>Hudi Incremental Read</h3><p>The <code>sync_data_source()</code> function leverages the <a href="https://github.com/apache/hudi-rs">Hudi-rs</a> library&#8212;a native Rust implementation of Hudi&#8212;to incrementally read from the products table. The key snippet below shows how we initialize the Hudi table and perform an incremental read:</p><pre><code># Initialize the Hudi table from its base path
hudi_table = HudiTableBuilder.from_base_uri(
  config.hudi_table_path
).build()

# Perform an incremental read starting from the last sync timestamp
incr_batches = hudi_table.read_incremental_records(start_ts, None)</code></pre><p>The <code>start_ts</code> parameter is set to the timestamp of the last sync, which is maintained as a property within the product service. This ensures that only records added or updated since that timestamp are fetched during each cycle, so each sync does work roughly in proportion to the changes rather than to the full table.</p><h3>Upsert Embeddings</h3><p>To prepare the product data for the vector store, we first need to process the raw text. The product names and descriptions must be broken down into smaller, semantically meaningful 'chunks.' This is a crucial step because embeddings are more effective when generated from concise, focused pieces of text.</p><p>While chunking strategies can be quite complex and highly dependent on the data, we'll use the <a href="https://pypi.org/project/semantic-text-splitter/">semantic-text-splitter library</a> for this demo. It allows us to divide the text into coherent semantic units. In the code, we define a <code>SemanticChunker</code> with configurable settings to handle this process.</p><pre><code>from semantic_text_splitter import TextSplitter


class SemanticChunker:
    def __init__(self):
        self.splitter = TextSplitter(
            capacity=config.max_chunk_tokens,
            overlap=config.chunk_overlap_tokens,
        )</code></pre><p>Once the text is chunked, the next step is to convert these chunks into numerical representations called embeddings. For this, we use the <a href="https://sbert.net/">SentenceTransformers library</a>, which transforms each text chunk into a dense vector (as a NumPy array) that captures its semantic meaning.</p><pre><code>class EmbeddingService:
    def __init__(self):
        self.model = SentenceTransformer(config.embedding_model)

    def create_embeddings(
        self, chunks: list[ProductChunk]
    ) -&gt; np.ndarray:
        texts = [chunk.content for chunk in chunks]
        return self.model.encode(
            texts, convert_to_tensor=True
        ).cpu().numpy()</code></pre><p>For the vector store, we chose <a href="https://qdrant.tech/">Qdrant</a>, a high-performance vector database. The embedding arrays are packaged into <code>PointStruct</code> objects (Qdrant's primary data structure) before being saved into a collection.</p><pre><code>class VectorStore:
    def __init__(self):
        self.client = QdrantClient(":memory:")
        self.collection_name = "products"
    
    def upsert(
        self, chunks: list[ProductChunk], embeddings: np.ndarray
    ):
        points = [
            PointStruct(
                id=chunk.chunk_id,
                # Convert the scaled NumPy row to a plain list of floats
                vector=(embeddings[i] * chunk.importance_score).tolist(),
                payload={
                    "product_id": chunk.product_id,
                    "content": chunk.content,
                    "importance_score": chunk.importance_score,
                },
            )
            for i, chunk in enumerate(chunks)
        ]

        self.client.upsert(
            collection_name=self.collection_name,
            points=points,
            wait=False,
        )</code></pre><p>This process of combining incremental reads with vector upserts ensures that the vector store remains synchronized with the latest product information, making it ready for accurate similarity searches.</p><h2>Product Service</h2><p>When a user query is received, the product service uses the same embedding model to encode the query into a vector. It then performs a similarity search against the vector store to retrieve the top-K most relevant product chunks.</p><pre><code>class VectorStore:
    def __init__(self):
        self.client = QdrantClient(":memory:")
        self.collection_name = "products"

    def search(self, query_vector: np.ndarray) -&gt; list[ProductChunk]:
        top_k = int(config.vector_search_top_k)
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            limit=top_k,
            score_threshold=config.similarity_threshold,
        )</code></pre><p>Because each chunk is associated with a product ID, the service can reconstruct a list of relevant product information from the search results. This list forms the context that is passed to the LLM Service for the final generation step.</p><h2>LLM Service</h2><p>The LLM service receives the list of <code>ProductSearchResult</code> objects from the product service to use as context. It then composes a prompt incorporating this information and sends it to the connected AI service provider (in this case, OpenAI).</p><pre><code>class LLMService:
    def __init__(self):
        import openai

        self.client = openai.OpenAI(api_key=config.openai_api_key)

    def generate_recommendation(
        self,
        query: str,
        matching_products: list[ProductSearchResult],
    ) -&gt; str:
        """Generate recommendations with chunk context"""
        context = self._prepare_context(matching_products)
        prompt = f"""
User Query: {query}

Based on this context:

{context}

(End of context)

Instructions:
1. Provide product recommendations to the user with brief explanations.
2. Only recommend the products shown in the context.
3. Do not mention the product IDs in the recommendations.
"""
        try:
            response = self.client.chat.completions.create(
                model=config.openai_model,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful e-commerce recommendation assistant.",
                    },
                    {"role": "user", "content": prompt},
                ],
                max_tokens=1000,
                temperature=0.7,
            )
            return response.choices[0].message.content or ""
        except Exception as e:
            logger.error(f"LLM generation failed: {e}")
            return "Unable to generate recommendations at the moment."</code></pre><p>Once OpenAI returns a response, the LLM service forwards it to the main recommendation service, which formats and delivers the final API response to the user.</p><h2>Test Run</h2><p>To test the end-to-end workflow, we first need a Hudi table populated with sample e-commerce product data. The schema can be minimal, as the key information is contained in the <code>name</code> and <code>description</code> columns.</p><p>First, create the Hudi table using Spark SQL:</p><pre><code>CREATE TABLE products (
    id BIGINT,
    ts BIGINT,
    name STRING,
    description STRING
) USING HUDI
TBLPROPERTIES (
    primaryKey = 'id',
    preCombineField = 'ts'
);</code></pre><p>Next, insert some realistic sample products into the table. Also, set your OpenAI secret key in the <code>OPENAI_API_KEY</code> environment variable.</p><p>Now, you can start the API server by running <code>./uvicorn.sh</code>. Upon startup, the application will bootstrap the vector store with the initial records from the Hudi table. As you upsert more records into the Hudi table, the periodic sync process will incrementally process these changes and keep the vector store up to date.</p><p>To test the API, issue a sample e-commerce query to the endpoint, such as <code>"I like drinking coffee and listening to music with wireless headphone"</code>:</p><pre><code>&#10140; curl -X POST "http://localhost:8000/recommend" \
  -H "Content-Type: application/json" \
  -d '{"query": "I like drinking coffee and listening to music with wireless headphone"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2483  100  2401  100    82    409     13  0:00:06  0:00:05  0:00:01   516
{
  "success": true,
  "query": "I like drinking coffee and listening to music with wireless headphone",
  "message": "Based on your love for coffee and music, here are two product recommendations that would enhance your experience:\n\n1. **Wireless Bluetooth Headphones**: Enjoy your coffee while immersing yourself in premium audio quality with these state-of-the-art wireless headphones. They feature active noise cancellation, a long battery life of 30 hours, and comfortable padding, making them perfect for listening to music without distractions. The water-resistant design also means you can wear them while enjoying your coffee outdoors or during workouts.\n\n2. **Programmable Coffee Maker**: Elevate your coffee experience with this professional-grade stainless steel coffee maker. It offers programmable brewing options and a thermal carafe that keeps your coffee hot for hours. With features like a brew strength selector and a self-cleaning function, you can enjoy a perfect cup of coffee tailored to your taste, all while you relax and listen to your favorite tunes. \n\nThese products will complement your routines beautifully!",
  "products": [
    {
      "id": 1001,
      "matched_content": [
        "Experience premium audio quality with these state-of-the-art wireless Bluetooth headphones. Featuring active noise cancellation technology, 30-hour battery life, and premium comfort padding. Perfect for music lovers, commuters, and professionals who demand crystal-clear sound quality. The headphones include a portable charging case, multiple ear tip sizes, and support for high-resolution audio codecs. Compatible with all major devices including smartphones, tablets, and laptops. Water-resistant design makes them ideal for workouts and outdoor activities."
      ]
    },
    {
      "id": 1004,
      "matched_content": [
        "Professional-grade stainless steel coffee maker with programmable brewing options and thermal carafe. Brews up to 12 cups of perfect coffee with precision temperature control and optimal extraction time. Features include auto-start timer, brew strength selector, and self-cleaning function. The thermal carafe keeps coffee hot for hours without a heating plate, preserving flavor and aroma. Built-in water filtration system removes impurities for better taste. Compact design fits most kitchen countertops while the sleek stainless steel finish complements any d&#233;cor. Includes permanent gold-tone filter and measuring scoop."
      ]
    }
  ]
}</code></pre><p>The response payload contains the AI-generated recommendation in the <code>message</code> field, along with the most relevant product information retrieved by the product service in the <code>products</code> field. You can find the complete, runnable code for this project in <a href="https://github.com/datumagic/hands-on-ai/tree/main/00-build-a-rag-based-ai-recommender">this GitHub repository</a>.</p><h2>Recap</h2><p>In this final part of the series, we built a RAG-based AI recommender for e-commerce, leveraging <a href="https://fastapi.tiangolo.com/">FastAPI</a>, <a href="https://hudi.apache.org/">Apache Hudi</a>, <a href="https://qdrant.tech/">Qdrant</a>, and <a href="https://openai.com/api/">OpenAI</a>. While this application serves as a comprehensive local demonstration, transitioning to a production environment requires further consideration.</p><p>Real-world data is far more complex and varied, so a production system needs significant tuning in several areas, including chunking strategies, embedding model selection, similarity search parameters, and LLM prompt engineering.</p><p>Ultimately, the success of any production-grade RAG application hinges on a solid data foundation. 
A reliable and efficient data lakehouse is not just a prerequisite but the core component for this architecture to thrive.</p><div><hr></div><blockquote><p><em>Follow me on LinkedIn and X for more updates.</em></p><ul><li><p><a href="https://www.linkedin.com/in/xushiyan/">linkedin.com/in/xushiyan/</a></p></li><li><p><a href="https://x.com/_xushiyan">x.com/_xushiyan</a></p></li></ul></blockquote>]]></content:encoded></item><item><title><![CDATA[Building a RAG-based AI Recommender (1/2)]]></title><description><![CDATA[What is RAG, and how to go from data to AI.]]></description><link>https://blog.datumagic.ai/p/building-a-rag-based-ai-recommender</link><guid isPermaLink="false">https://blog.datumagic.ai/p/building-a-rag-based-ai-recommender</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Fri, 11 Jul 2025 01:02:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HCo8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You run an online shopping website that sells approximately 100,000 products, and you want to add a new AI recommender to help your website visitors find their desired products by answering questions like "I'm looking for headphones for swimming with a great value-to-price ratio." With new products being listed daily, old products being removed regularly, and sellers updating product details throughout the day, you need to ensure your AI recommender provides accurate and up-to-date information for your users. 
This scenario is perfectly suited for a RAG-based AI application with incremental processing capabilities provided by an Apache Hudi lakehouse.</p><p>This two-part blog is structured as follows:</p><ul><li><p><strong>Part 1:</strong> I'll introduce RAG from a conceptual perspective and explore why a solid data architecture is crucial for AI success.</p></li><li><p><strong>Part 2:</strong> We'll get hands-on with working code and demonstrate the complete end-to-end flow of an AI recommender.</p></li></ul><p>By the end of this two-part blog series, you'll have a solid understanding of basic RAG concepts and know how to build a RAG-based AI app. You'll get hands-on experience running incremental queries on Hudi tables, building vector search indexes with Qdrant, and setting up a FastAPI app that handles user questions and connects to OpenAI, so you can see a complete working example in action.</p><h2>RAG is gonna be Ubiquitous</h2><p>Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for building AI applications that need to work with private or domain-specific data. At its core, RAG consists of two essential steps: retrieval and generation. 
While Large Language Models (LLMs) possess impressive general knowledge, they don't know anything about your company's internal documents, customer data, or proprietary information. RAG solves this by first retrieving relevant pieces of your private data based on a user's query, then feeding this context to the LLM for generation. This approach allows the model to ground its responses in your specific data, transforming a general-purpose AI into a knowledgeable assistant that can reason about your unique business context.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YTqR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YTqR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 424w, https://substackcdn.com/image/fetch/$s_!YTqR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 848w, https://substackcdn.com/image/fetch/$s_!YTqR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 1272w, https://substackcdn.com/image/fetch/$s_!YTqR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YTqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png" width="728" height="264.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:211162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.ai/i/167665372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YTqR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 424w, https://substackcdn.com/image/fetch/$s_!YTqR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 848w, https://substackcdn.com/image/fetch/$s_!YTqR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 1272w, 
https://substackcdn.com/image/fetch/$s_!YTqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696e1e75-5d35-4fa1-bb94-d62624eea9dc_1920x697.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">RAG at a high level</figcaption></figure></div><p>RAG is set to become ubiquitous because it solves a real problem every organization faces: how to make AI work with their specific data. 
Companies have tons of proprietary information&#8212;internal docs, customer records, product manuals&#8212;that could make AI incredibly useful if the models could actually access it. RAG provides a practical way to connect your existing data with powerful LLMs, putting AI that understands your business context within reach of most organizations. As more businesses realize they can get contextually aware AI with manageable engineering effort, RAG adoption is likely to keep accelerating.</p><h3>Data Preparation</h3><p>Before you can retrieve relevant information, your data needs to be transformed into a format that machines can efficiently search and understand. This preparation process involves two critical steps that lay the foundation for effective retrieval.</p><h4>Chunking</h4><p>The first step involves splitting your data into manageable chunks. Chunking is essential because you don't want to retrieve massive documents for the LLM to process: an LLM can only work effectively with a limited context window. Each chunk should contain enough context to be meaningful on its own while remaining small enough to fit comfortably within that window. Choosing a good chunking strategy&#8212;whether by paragraphs, sentences, or semantic boundaries&#8212;is critical for improving your RAG system's efficiency and accuracy.</p><h4>Embedding</h4><p>Once your data is chunked, the next step involves converting these chunks into vectors using an embedding model. Vectors are arrays of numbers that represent human-understandable information in a machine-understandable format; no single dimension carries the meaning, which is encoded jointly across all of them. To understand this concept, imagine a 3D vector in a coordinate system representing a point in space, where the three elements are its coordinates along the x, y, and z axes. 
Similarly, text embeddings use hundreds or thousands of dimensions to mathematically compare the similarity of different text chunks, even when they use completely different words to express similar concepts.</p><h3>Retrieval</h3><p>The retrieval process takes your prepared data and finds the most relevant pieces to answer a user's query. This involves converting the user's question into the same vector format and then searching your knowledge base for similar content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HCo8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HCo8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HCo8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:321964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.ai/i/167665372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HCo8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!HCo8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2db68d-c7cd-41ab-8bf0-d676bb18116e_1920x1080.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The data preparation and retrieval flow in RAG</figcaption></figure></div><p>With all your data chunks converted to vectors and stored in a vector database, it's time to retrieve the ones most relevant to a user's query. The user's question first gets converted into a vector using the same embedding model that processed your data chunks. Then, similarity functions calculate how close this query vector is to each stored vector in your database. 
Common similarity functions include cosine similarity, Euclidean distance, and dot product&#8212;each a different mathematical way of measuring the "distance" between two vectors in high-dimensional space. Vector databases support these calculations out of the box and can quickly return the top-k most similar vectors along with their associated original data chunks. These retrieved chunks then serve as the contextual foundation for the LLM to generate its response.</p><h4>Re-ranking</h4><p>In practice, retrieval is often a two-stage process. After the initial vector database search, an additional step called re-ranking can improve accuracy. Re-ranking uses a cross-encoder model, which reads the user's query and a candidate chunk together and scores the pair directly, rather than comparing two independently computed embeddings. Distance-based similarity search is fast because the chunk embeddings are computed ahead of time, but it is not always precise in capturing semantic relevance. Cross-encoder models are more computationally intensive, so they are applied only to the small set of candidates returned by the vector database. This two-stage approach improves overall retrieval accuracy while keeping latency manageable.</p><h3>Generation</h3><p>Once the retrieval process identifies and refines the most relevant context from your datasets, the generation step becomes relatively straightforward. 
You combine the retrieved context with the user's original query and send this information to an LLM via API calls, which then returns a response to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NQRm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NQRm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 424w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 848w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 1272w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NQRm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png" width="1333" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1333,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.ai/i/167665372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NQRm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 424w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 848w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 1272w, https://substackcdn.com/image/fetch/$s_!NQRm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f78a274-eb8b-4637-9f8f-ce551fcd59b6_1333x815.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The generation flow in RAG</figcaption></figure></div><p>In practice, you'll typically compose the context information and user query into a prompt template. This template allows you to specify detailed instructions for the LLM to follow, tailoring the response format, tone, and focus to match your specific business needs. For example, you might instruct the model to cite sources, maintain a professional tone, or format answers in a particular structure that aligns with your application's requirements.</p><h2>From Data to AI</h2><p>Now you have gone through the end-to-end flow of a RAG system, but it's still not the full picture of reality. Where do you gather and store the data for chunking? Your data is changing dynamically&#8212;how do you deal with updates to ensure your RAG apps are using up-to-date data? 
And what if the data contains noise that could cause your system to send misleading information to users? There is no shortcut here: a solid data platform serving high-quality data is the prerequisite for fully utilizing AI's power.</p><p>Without a proper data infrastructure, even the most sophisticated LLMs can't reach their full potential&#8212;it's the classic "garbage in, garbage out" principle that still holds true in the age of AI. To get the full benefit of your RAG apps, you'll likely spend a significant portion of your efforts&#8212;sometimes around 70%&#8212;on data engineering: ingesting and cleaning data, scaling clusters appropriately, setting up monitoring and alerting, and managing access. These efforts ensure that you have an efficient, reliable, and business-ready data source for AI applications.</p><h3>&#8220;Garbage In, Garbage Out&#8221;</h3><p>Data quality fundamentally limits the quality of your AI inference results. No matter how sophisticated your models are, poor-quality input data will inevitably lead to subpar AI performance. This becomes even more critical in RAG systems, where AI responses are directly grounded in the retrieved information.</p><p>The medallion architecture provides a foundation to build quality control mechanisms, systematically refining data through bronze, silver, and gold layers (see <a href="https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-810">this previous post</a> for details). Raw data often contains inconsistencies, duplicates, and formatting issues, and embeddings calculated from such data lead to poor retrieval accuracy. The gold layer provides business-ready, curated datasets that are well suited for AI consumption. 
Even for unstructured binary data like images and videos, you'll want structured tables in the gold layer to track properly formatted metadata for them, enabling LLMs to have more accurate context for generating high-quality responses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FRHl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FRHl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 424w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 848w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 1272w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FRHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png" width="1456" height="942" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:942,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.ai/i/167665372?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FRHl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 424w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 848w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 1272w, https://substackcdn.com/image/fetch/$s_!FRHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18479bc7-3167-4d4c-90f4-230c6afed3d8_1477x956.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Medallion architecture provides high-quality data for AI apps</figcaption></figure></div><p>AI applications are also compute-intensive&#8212;from semantic chunking to calculating embeddings, from performing similarity searches to querying LLMs for response generation&#8212;each step involves heavy compute resources and significant costs. This makes two things essential: first, ensuring the data you put through this process is high-quality and worth the compute investment; second, running the process only on changes that affect the system. Rather than reprocessing entire table snapshots, you should work with incremental changes to update just a subset of all the embeddings.</p><h3>Why Hudi for Incremental Processing?</h3><p>Hudi was designed with incremental processing in mind from day one. 
Hudi's timeline tracks every change made to the table, and the commit files on the timeline let a reader identify which data files changed during a specific time window. With the relevant files identified, Hudi uses the record-level meta field <code>_hoodie_commit_time</code> to filter down to the records written in that window. The detailed design discussion can be found in <a href="https://blog.datumagic.ai/i/140342063/incremental-query">this post</a>.</p><p>This combination of timeline and record-level filtering makes Hudi tables well suited to the medallion architecture, where changes propagate through multiple layers. The same incremental processing semantics can be applied throughout your data lakehouse as a standard practice, establishing data engineering and architecture patterns that make data operations and maintenance easier.</p><p>Looking ahead, Hudi's <a href="https://hudi.apache.org/roadmap">roadmap</a> includes plans for vector search index support out of the box, which would further simplify the architecture across the entire data storage layer. You would be able to incrementally calculate embeddings and store them in a Hudi table, then perform similarity searches with Hudi readers, giving your data architecture a more unified reader and writer stack.</p><h2>Recap</h2><p>In this first part of the blog, we've introduced RAG at a conceptual level by walking through the end-to-end flow. We've also discussed the importance of good data architecture for AI applications and how Hudi's incremental processing capabilities can support such an architecture. 
In part 2, we'll go through an end-to-end implementation of RAG, building the AI recommender example that we introduced at the beginning of this blog.</p><div><hr></div><blockquote><p><em>Follow me on LinkedIn and X for more updates.</em></p><ul><li><p><a href="https://www.linkedin.com/in/xushiyan/">linkedin.com/in/xushiyan/</a></p></li><li><p><a href="https://x.com/_xushiyan">x.com/_xushiyan</a></p></li></ul></blockquote>]]></content:encoded></item><item><title><![CDATA[Apache Hudi does XYZ (1/10)]]></title><description><![CDATA[File pruning with multi-modal index]]></description><link>https://blog.datumagic.ai/p/apache-hudi-does-xyz-110</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-does-xyz-110</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Mon, 16 Jun 2025 13:24:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ul-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Hudi has a ton of awesome features, but honestly, the sheer number of them can feel pretty overwhelming when you're just starting out. That's why I'm putting together this 10-post blog series&#8212;to break down all those capabilities and highlight some of the coolest features in Hudi 1.0, which is a huge milestone release that really pushes Hudi toward being a full-fledged Data Lakehouse Management System.</p><p>Think of this series as the follow-up to my earlier "<a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Apache Hudi: From Zero To One</a>" series, where I dove deep into Hudi's design concepts based on version 0.x. 
The good news is that almost everything from that earlier series still applies to Hudi 1.x, so it's still great for building a solid foundation&#8212;in fact, I'd highly recommend reading through that series first, or checking out the consolidated e-book version available at <a href="https://www.onehouse.ai/whitepaper/ebook-apache-hudi---zero-to-one">this link</a>. While that earlier series was pretty heavy on concepts and internals, this one's going to be different. I'm aiming for a good mix of theory and practical stuff&#8212;complete with sample code and real examples you can actually use. My goal is to help you not just understand what makes Hudi so powerful, but get you up and running with these features quickly through hands-on, practical guidance.</p><p>Let's start with one of Hudi's core performance features: the multi-modal index and file pruning. This is what makes your queries fast by helping engines figure out exactly which files to read and which ones to skip entirely.</p><h2>The Multi-Modal Index in Hudi</h2><p>Every data lakehouse table&#8212;whether it uses Delta, Hudi, or Iceberg&#8212;contains a metadata directory that describes the data stored in that table. For Hudi tables, this is the <code>.hoodie/</code> directory, and you can learn more about Hudi's complete storage layout (including this metadata directory) in <a href="https://datumagic.substack.com/i/135356155/storage-format">this post</a>.</p><p>Hudi's multi-modal index lives within the <code>.hoodie/metadata/</code> directory and has an interesting design: it's actually implemented as its own Hudi Merge-on-Read table, known as the metadata table. 
This metadata table gets updated synchronously alongside any write operations to your main data table, ensuring everything stays consistent and in sync.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w5BC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w5BC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w5BC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48fb4b72-4d8d-4050-aec6-fa9401e68a5b_3840x2160.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.com/i/164211774?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48fb4b72-4d8d-4050-aec6-fa9401e68a5b_3840x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w5BC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!w5BC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99cebc6e-1621-4ed3-adc9-1eb453205696_3840x2160.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Indexing is what separates a good lakehouse table from a great one&#8212;it can make or break your read and write performance. The challenge is that different queries need different types of indexes: range pruning relies on min/max values, point lookups need exact value matching, and vector searches use similarity calculations to find the closest matches. There's no single "one-size-fits-all" index that can handle everything efficiently.</p><p>That's why lakehouse tables need versatile indexing capabilities to perform well across all kinds of workloads. Hudi was actually a pioneer in this space, introducing the multi-modal index back in version 0.11 in 2022. 
The "multi-modal" name reflects how the underlying metadata table is partitioned by different index types, with each index using its own record schema designed for its specific purpose.</p><h2>Files, Partitions, and Statistics</h2><p>When you have a collection of columnar files like Parquet stored somewhere, partitioning by some columns at the physical storage level is the best indexing you can get without a lakehouse format&#8212;but it's also very coarse-grained and basic. Here's where it gets limiting: if your table is partitioned by column A and someone runs a query filtering on column B (like "find all records where column B &gt; X"), the query engine can't do much optimization. It still has to list all partitions and files, then scan through and filter all the records.</p><p>When you create a table in Hudi 1.x, three essential indexes are automatically enabled in the metadata table: <code>files</code>, <code>partition_stats</code>, and <code>column_stats</code>. These indexes provide the core information that query engines need to plan and execute queries efficiently. 
For more details about how query engines work with Hudi tables, check out <a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">this earlier post</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ul-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ul-k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ul-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c4db8ce-5ef5-43fe-a6c0-7f03b8d27475_3840x2160.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.com/i/164211774?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c4db8ce-5ef5-43fe-a6c0-7f03b8d27475_3840x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ul-k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Ul-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3dd709-772a-4236-acd8-7906c584a0cf_3840x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">File pruning process within the metadata table</figcaption></figure></div><p>When we say a query engine "supports Hudi," it means the engine has a component that understands Hudi's table layout, including how to read from the metadata table. Here's how the query planning process works:</p><p>First, the query engine reads the <code>files</code> index to get a list of partitions to examine. Then it uses the <code>partition_stats</code> index to prune that list by comparing your query predicates against partition-level statistics like min, max, and count values. 
For example, if your query is looking for records where <code>price &gt;= 300</code>, any partitions with a max price below 300 can be completely skipped.</p><p>With the pruned partition list in hand, the engine goes back to the <code>files</code> index to get the actual file lists for each remaining partition. But it's not done yet&#8212;it can prune those file lists even further using the <code>column_stats</code> index, which provides the same kind of statistics but at the file level instead of the partition level.</p><p>This multi-layered pruning process means the query engine only reads the files it actually needs, significantly reducing the amount of data it has to process.</p><h2>Running in Spark SQL</h2><p>Let's see file pruning in action by creating a Hudi table with sample data and running some Spark SQL queries. We'll start by creating a table with both <code>partition_stats</code> and <code>column_stats</code> disabled to establish a baseline.</p><pre><code>CREATE TABLE order (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    'hoodie.metadata.index.column.stats.enable' = 'false',
    'hoodie.metadata.index.partition.stats.enable' = 'false'
);</code></pre><p>And insert some sample data:</p><pre><code>INSERT INTO order VALUES
('ORD001', 389.99, 'PENDING',    17495166353, DATE('2023-01-01'), 'A'),
('ORD002', 199.99, 'CONFIRMED',  17495167353, DATE('2023-01-01'), 'A'),
('ORD003', 59.50,  'SHIPPED',    17495168353, DATE('2023-01-11'), 'B'),
('ORD004', 99.00,  'PENDING',    17495169353, DATE('2023-02-09'), 'B'),
('ORD005', 19.99,  'PENDING',    17495170353, DATE('2023-06-12'), 'C'),
('ORD006', 5.99,   'SHIPPED',    17495171353, DATE('2023-07-31'), 'C');</code></pre><p>Here is the test query:</p><pre><code>SELECT order_id, price, shipping_country
FROM order
WHERE price &gt; 300;</code></pre><p>This query looks for orders with price greater than 300, which only exist in the partition of <code>shipping_country=A</code>. After running the SQL, here's what we see in the Spark UI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8d4q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8d4q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 424w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 848w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 1272w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8d4q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png" width="747" height="504" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:747,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.com/i/164211774?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8d4q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 424w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 848w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 1272w, https://substackcdn.com/image/fetch/$s_!8d4q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6e84a6-9e3e-4924-b30b-887354dface6_747x504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Spark read all 3 partitions and 3 files to find potential matches, but only 1 record from partition <code>A</code> actually satisfied the query condition.</p><h3>Enable <code>column_stats</code></h3><p>Now let's enable <code>column_stats</code> while keeping <code>partition_stats</code> disabled. Note that we can't do it the other way around&#8212;<code>partition_stats</code> requires <code>column_stats</code> to be enabled first.</p><pre><code>CREATE TABLE order (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    'hoodie.metadata.index.column.stats.enable' = 'true',
    'hoodie.metadata.index.partition.stats.enable' = 'false'
);</code></pre><p>Running the same SQL gives us this in the Spark UI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Ekj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Ekj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 424w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 848w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 1272w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Ekj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png" width="692" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:692,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.com/i/164211774?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Ekj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 424w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 848w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 1272w, https://substackcdn.com/image/fetch/$s_!2Ekj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64b2526a-dd9e-4cba-a435-e31aae65cbd2_692x461.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now it shows all 3 partitions but only 1 file was scanned. Without <code>partition_stats</code>, the query engine couldn't prune partitions, but <code>column_stats</code> successfully filtered out the non-matching files. The compute cost of examining those 2 irrelevant partitions and their files could have been avoided with <code>partition_stats</code> enabled.</p><h3>Enable <code>column_stats</code> and <code>partition_stats</code></h3><p>Now let's enable <code>partition_stats</code> as well. Since both indexes are enabled by default in Hudi 1.x, we can simply omit those additional configs from the CREATE statement.</p><pre><code>CREATE TABLE order (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts'
);</code></pre><p>Running the same SQL gives us this in the Spark UI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mxkw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mxkw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 424w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 848w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 1272w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mxkw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png" width="685" height="457" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b745f8c7-443d-435e-a07f-91574143b62d_685x457.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:685,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.datumagic.com/i/164211774?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mxkw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 424w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 848w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 1272w, https://substackcdn.com/image/fetch/$s_!mxkw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb745f8c7-443d-435e-a07f-91574143b62d_685x457.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now the full pruning effect is visible: only 1 relevant partition and 1 relevant file were scanned, thanks to both indexes working together.</p><h3>Configure columns to be indexed</h3><p>By default, Hudi indexes the first 32 columns for both <code>partition_stats</code> and <code>column_stats</code>. This limit prevents excessive metadata overhead, because each indexed column requires computing min, max, null-count, and value-count statistics for every partition and data file. In most cases, you only need to index a small subset of columns that are frequently used in query predicates. To reduce maintenance costs, you can specify which columns to index:</p><pre><code>CREATE TABLE order (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    'hoodie.metadata.index.column.stats.column.list' = 'price,shipping_date'
);</code></pre><p>The config <code>hoodie.metadata.index.column.stats.column.list</code> applies to both <code>partition_stats</code> and <code>column_stats</code>. By indexing just the <code>price</code> and <code>shipping_date</code> columns, queries filtering on price comparisons or shipping date ranges will already see significant performance improvements.</p><h2>Recap</h2><p>In this post, we explored Hudi's multi-modal index from a storage layout perspective and demonstrated the file pruning capabilities of the three default indexes in Hudi 1.x: <code>files</code>, <code>partition_stats</code>, and <code>column_stats</code>. Through our SQL examples, you can see how these indexes could dramatically reduce the number of files that need to be scanned, which is crucial for query performance.</p><p>In upcoming posts, we'll explore the multi-modal index's additional capabilities and discuss how its design benefits not just performance, but also scalability, extensibility, and maintenance.</p>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (10/10)]]></title><description><![CDATA[Becoming "One" - the upcoming 1.0 highlights]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-1010</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-1010</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Sat, 13 Apr 2024 07:26:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a 
href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>Throughout the last nine posts, I have explored Hudi concepts pertinent to release 0.14, ideas that are relevant across most of the 0.x versions. 
For the blog series finale, I aim to cast a glance into the future and delve into the exciting new features in the upcoming 1.0 release. In doing so, this ending post will effectively accomplish the purpose of the series: guiding readers from the foundational beginnings to the groundbreaking future - from zero to one.</p><h2>The Hudi Stack</h2><p>Let's take a step back to our initial discussion in the <a href="https://blog.datumagic.com/i/135356155/overview">first post</a> and revisit the Hudi stack, a framework that has remained relevant across both the 0.x and 1.x versions. Illustrated below, this stack functions on top of storage systems, executing read and write operations against open file formats. It is structured into three layers: the transactional database, the programming API, and the user interface.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IJR5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IJR5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IJR5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IJR5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!IJR5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IJR5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg" width="1456" height="1029" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1122054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IJR5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IJR5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IJR5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!IJR5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61c50db7-f432-46dc-abeb-fe0b7ac92011_4758x3362.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The Hudi stack</figcaption></figure></div><p>The transactional database layer, viewed as "the Hudi core", comprises several key components: the table format defines the storage layout, the table services keep the table optimized, the indexes speed up reads and writes, the concurrency control upholds the isolation principle, the lake cache improves read efficiency, and the metaserver centralizes metadata access. These components act together to establish a robust foundation, delivering a database experience for Hudi Lakehouses.</p><p>The programming API layer introduces a suite of writer and reader interfaces, standardizing the integration with various execution and query engines. These APIs empower users across the ecosystem to fully harness Hudi's advanced capabilities such as efficient upserts and incremental processing.</p><p>The user interface layer provides higher-level integrated tools that broadly fall into two categories: platform services, which include ingestion utilities, catalog sync tools, and an admin CLI; and query engines such as Spark, Flink, Presto, and Trino, among others. This diverse array of tools further aids users in adopting Hudi and building comprehensive Lakehouse solutions.</p><h2>Release 1.0 Highlights</h2><p>While the Hudi stack remains consistent in version 1.0, the new release features redesigns and updates at the table format level compared to the 0.x versions. 
These changes, along with other innovative new features, have enhanced overall efficiency and throughput, significantly upgrading Hudi Lakehouse's capabilities.</p><h3>LSM Tree Timeline</h3><p>The Hudi Timeline fundamentally consists of a series of immutable transaction logs that record all changes made to a table. In the 0.x versions, the volume of transaction logs increases linearly over time. To save storage, older Timeline instants are archived, at the cost of increased compute when they are accessed. For Hudi 1.0, a key design goal is to support a near-infinite Timeline that balances optimized storage with efficient access. To achieve this, a Log-Structured Merge-Tree (LSM Tree) is adopted to define the Timeline layout in 1.0 tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tu_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tu_d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tu_d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tu_d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!tu_d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tu_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg" width="1456" height="1030" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1030,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:622203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tu_d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tu_d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tu_d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!tu_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf4b9d0-d0cd-4a9f-b0cc-bd496ce39de8_4756x3363.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Store Hudi Timeline as Log-Structured Merge-Tree</figcaption></figure></div><p>LSM Tree is a multi-layered data structure designed for high write throughput. 
At the top layer, the Hudi Timeline stores transactions as active instants, with individual Avro files recording the metadata for each transaction state: requested, inflight, and completed. When the active instants exceed a certain threshold, they are flushed to Parquet files, forming the first storage-optimized layer. The transactions are grouped and sorted chronologically, and the file names contain time range information, allowing efficient retrieval through manifest files. When the number of Parquet files stored at one level exceeds a certain limit, the files are compacted and pushed down to the next level as larger files holding more Timeline instants, further improving storage efficiency. Additionally, these highly compressed Parquet files are optimized for query performance, particularly when time-range filters or specific columns are targeted, ensuring fast data retrieval.</p><h3>Non-Blocking Concurrency Control</h3><p>When a streaming writer is present in a concurrent writing scenario, contention can frequently arise from random updates (e.g., running a separate GDPR deleter job). Using Optimistic Concurrency Control (OCC) in such a scenario can lead to repeated retries, wasting compute resources. Hudi has adopted Multi-Version Concurrency Control (MVCC) to prevent blocking and retries caused by contention between a single writer and table service runners. Hudi also offers early conflict detection for OCC to reduce resource wastage upon retries. 
To advance further, Hudi 1.0 introduces Non-Blocking Concurrency Control (NBCC) for MOR tables to maximize writer throughput.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FRer!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FRer!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FRer!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FRer!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FRer!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FRer!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg" width="1456" height="1145" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1145,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:562881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FRer!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FRer!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FRer!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FRer!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F427091ac-29e5-4eed-bf4d-f9d87c5a8f30_4511x3546.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Writing to MOR table with Non-Blocking Concurrency Control</figcaption></figure></div><p>NBCC allows multiple writers to freely write Log Files containing updates to the same File Slice, and defers conflict resolution to the compaction stage. Unlike in the 0.x versions, Log Files in 1.0 also record the commit completion time in addition to the start time. This new piece of information enables proper sorting of the Log Files and helps determine File Slice boundaries. Merging semantics based on a configurable ordering field are applied to the updated records during compaction. To resolve the clock skew issue, <code>TrueTimeGenerator</code> is implemented to ensure monotonically increasing timestamps for all writers' commits.</p><h3>File Group Reader &amp; Writer</h3><p>Since the very beginning, Hudi has incorporated the concept of record keys, a design choice that unlocks significant potential for record-level operations. 
Paired with the File Group model, this approach lays a robust foundation for efficient upserts and look-ups. In Hudi 1.0, File Group Reader and Writer APIs are introduced to fully capitalize on the design advantages offered by the record keys and the File Group model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dSO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dSO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg" width="1456" height="985" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:985,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:579074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dSO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7dSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ecfd6c3-813c-4cb8-ad84-42328004ba80_4861x3290.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">File Group APIs: partial update and position-based merge</figcaption></figure></div><p>On the writer side, Hudi 1.0 employs partial updates, which involve only the updated columns and values, to greatly reduce the Log File sizes. By leveraging Hudi's advanced indexing systems, records targeted for updates are efficiently located, and positional information can be encoded alongside the data log blocks. On the reader side, having the minimized Log File data and the positional information to pinpoint the updating rows and columns, a snapshot query against an un-compacted File Slice can be fully optimized.</p><h3>Expression Index</h3><p>In the 0.x versions, Hudi has supported a variety of indexing capabilities, including the Bucket Index and Record-level Index, among others. 
To enhance flexibility and improve access speeds, Hudi 1.0 introduces the Expression Index, enabling faster retrieval methods and incorporating partitioning schemes into the indexing system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6xYo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6xYo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6xYo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg" width="1456" height="861" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:861,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:673174,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://datumagic.substack.com/i/143463746?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6xYo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6xYo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4a1310c-e87e-4964-9175-760c96cf4e84_5200x3076.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Expression Index: create and update</figcaption></figure></div><p>For example, consider a column "ts" that holds Epoch timestamps. Users might want to filter the data by different time precisions, such as monthly or hourly. By building an Expression Index on the "ts" column using the following SQL, it's possible to achieve effective data-skipping without the need to physically partition the table by hour or add a separate "hour" column.</p><pre><code>CREATE INDEX ts_hour ON hudi_table USING column_stats(ts) options(func='hour');</code></pre><p>Hudi stores user-created index definitions under a dedicated directory under the metadata path <code>.hoodie/</code>. These definitions inform query engines about the available indexes, facilitating more optimized query planning. 
The index entries are maintained under separate partitions within the Metadata Table, which serves as the indexing subsystem for the enclosing Hudi table. When writers commit changes, all the available Expression Indexes are updated to reflect these changes, in a manner similar to other enabled indexing features in the Metadata Table. This ensures that indexes remain up to date, maintaining high levels of efficiency for both read and write operations.</p><h2>Recap</h2><p>In this post, we revisited the Hudi stack diagram and introduced four noteworthy features set to debut in the upcoming 1.0 release: the LSM Tree Timeline, Non-Blocking Concurrency Control, File Group Reader &amp; Writer, and the Expression Index. In short, Hudi 1.0 marks a defining release, setting a new standard in its development trajectory. As a closing note, here is a key excerpt from the <a href="https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md">1.x RFC</a> that succinctly captures the essence of this significant upgrade:</p><blockquote><p>We propose Hudi 1.x as a reimagination of Hudi, as the <em>transactional database for the lake</em>, with <a href="https://en.wikipedia.org/wiki/Polyglot_persistence">polyglot persistence</a>, raising the level of abstraction and platformization even higher for Hudi data lakes.</p></blockquote><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on 
Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (9/10)]]></title><description><![CDATA[Hudi Streamer - a "Swiss Army knife" for ingestion]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-910</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-910</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Fri, 01 Mar 2024 07:59:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and 
space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>Over the course of the last eight posts, I've explored many topics and internal designs of Hudi, including its storage layout, read and write operations, indexing, table services, and concurrency control mechanisms. The timing now feels right to broaden our perspective and start implementing some practical pipelines to streamline data flow into Hudi. In this post, my focus will shift to Hudi Streamer, a comprehensive data ingestion tool designed for deploying production-grade pipelines for Hudi tables. 
Given its versatility, a topic I will delve into further within this blog, I frequently liken it to a "Swiss Army knife" for importing data into Lakehouses.</p><h2>Overview</h2><p>Hudi Streamer is a Spark application<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> designed to offer a wide range of customizable interfaces for managing the write process to Hudi tables. It enables users to configure source data, define schemas, schedule table services, keep data catalogs in sync, and so on. The diagram below presents a high-level view of Hudi Streamer's components and their workflow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t6lg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t6lg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t6lg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t6lg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!t6lg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t6lg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg" width="1456" height="1189" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1189,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:607518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t6lg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t6lg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t6lg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!t6lg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbc8462f-6b14-446a-9efa-6e2bdb5a5be3_4426x3614.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The Hudi Streamer's ingestion workflow</figcaption></figure></div><p>The workflow illustrates an ingestion pipeline, which users would typically configure step by step. Hudi Streamer greatly simplifies this setup through its rich set of options. 
The key to mastering this tool is understanding what the options are meant for and how to configure them properly. Below, we explain some of the foundational options.</p><ul><li><p>The "--table-type" (CoW or MoR), "--target-table" (the name identifying the table), and "--target-base-path" (the physical location of the table) are three required options for writing to a Hudi table.</p></li><li><p>The "--continuous" flag indicates whether Hudi Streamer should run continuously or only once. If the flag is present, the application keeps fetching source data and writing to storage in a loop. Without the flag, Hudi Streamer performs a one-time fetch and write before terminating. The "continuous" mode is ideal when there is a steady stream of data from upstream sources, whereas the "run-once" mode is tailored for batch or bootstrap use cases.</p></li><li><p>The "--min-sync-interval-seconds" option works with the "continuous" mode, specifying the shortest allowable interval in seconds between ingestion cycles. For instance, if an ingestion cycle takes 40 seconds to complete and the min-sync-interval is configured to 60 seconds, Hudi Streamer will pause for 20 seconds before initiating the next cycle, so that the interval adheres to the configured minimum. Conversely, if the ingestion takes 70 seconds, surpassing the minimum interval, the application proceeds to the next cycle immediately. This helps ensure that sufficient data accumulates at the upstream source for processing, thereby avoiding the inefficiency of handling numerous small writes.</p></li><li><p>The "--op" option represents the type of operation to be executed by Hudi Streamer, which fundamentally serves as another Hudi writer. It supports three write operations: UPSERT (default), INSERT, and BULK_INSERT. 
For a comprehensive review of write operations, please revisit <a href="https://blog.datumagic.com/i/136915529/write-operations">post 3</a>.</p></li><li><p>The "--filter-dupes" flag corresponds to the write client configuration <code>hoodie.combine.before.insert=false|true</code>. This setting allows users to pre-combine records by key within the incoming batch, effectively reducing the amount of data to process. The flag is applicable when the write operation is set to INSERT or BULK_INSERT; however, it should not be present when "--op" is set to UPSERT, since we don't want to lose potential updates before merging them with on-storage versions.</p></li><li><p>The "--props" and "--hoodie-conf" options offer flexible ways to take in arbitrary Hudi properties. The former points to a file containing a collection of properties, and the latter accepts a single configuration in the format of "key=value". Note that properties specified through "--hoodie-conf" take precedence over those loaded via "--props".</p></li></ul><p>In the following sections, we will go through the major components depicted in the workflow diagram and cover additional options along the way.</p><h2>Source</h2><p>Source is an abstraction for providing upstream source data to Hudi Streamer. Its primary responsibility is fetching data from the source system as an input batch for processing and writing. By extending the Source abstract class, Hudi Streamer can be integrated with a wide range of data systems. Designed with a platform vision from day one, Hudi currently offers more than a dozen Source implementations off the shelf, as shown in the following picture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zfsa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zfsa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 
1272w, https://substackcdn.com/image/fetch/$s_!zfsa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg" width="1456" height="890" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:633248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zfsa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zfsa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!zfsa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff56486ec-3c46-4da3-b76e-b36038636999_5116x3127.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Available source implementations for Hudi Streamer</figcaption></figure></div><p>To use a Source with Hudi Streamer, pass its fully qualified class name to "--source-class" and configure any Source-specific properties where applicable. For example, a KafkaSource would require setting <code>hoodie.streamer.source.kafka.topic</code>. 
You may consult the <a href="https://hudi.apache.org/docs/configurations#DELTA_STREAMER_SOURCE">configurations page</a> for more details. Additionally, the "--source-limit" option sets an upper limit on the amount of data read during each fetch, giving more control over the ingestion process.</p><h2>Transformer</h2><p>After incoming data is retrieved from the Source, it is often necessary to perform lightweight transformations, such as adding or dropping specific columns or flattening the schema. The Transformer interface supports these modifications in a straightforward manner: it takes a Spark Dataset and returns the transformed Dataset, allowing the data to be shaped to meet the requirements of the ingestion pipeline.</p><p>The "--transformer-class" option takes in one or more class names of Transformer implementations. When multiple Transformers are given, they are applied sequentially, with the output of one serving as the input for the next. This chained approach provides flexibility and facilitates code maintenance.</p><h2>Run Table Services</h2><p>Table services, as introduced in <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-510">post 5</a>, can be easily managed by Hudi Streamer alongside data writing. When configured as "async", compaction and clustering will be scheduled inline by the Hudi write client internal to Hudi Streamer, and will be executed asynchronously by <code>HoodieAsyncTableService</code>, which uses a thread pool to submit and control table service jobs.</p><p>While async table service jobs are running, it might not always be desirable to write new data; for instance, the same cluster that is executing the table services may not have enough resources to perform ingestion. Furthermore, it's sometimes advisable to avoid running too many concurrent compaction or clustering jobs to prevent resource contention. 
Use "--max-pending-compactions" and "--max-pending-clustering" to cap the number of outstanding table service operations; once a limit is reached, no new ingestion job will be scheduled.</p><p>When ingestion jobs and table service jobs run concurrently within the same Spark application, it's crucial to allocate the cluster's resources appropriately. Hudi Streamer lets users pass scheduling configurations through the following options, which control how resources are distributed among the ingestion, compaction, and clustering jobs.</p><pre><code># for ingestion
--delta-sync-scheduling-weight
--delta-sync-scheduling-minshare

# for compaction
--compact-scheduling-weight
--compact-scheduling-minshare

# for clustering
--cluster-scheduling-weight
--cluster-scheduling-minshare</code></pre><p>The options shown above are used to generate an XML file, which is then referenced by the Spark property <code>spark.scheduler.allocation.file</code>. To activate these settings, users should set <code>spark.scheduler.mode=FAIR</code> for the Spark application.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> For more details on the scheduling mechanism, please consult this Spark <a href="https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application">documentation page</a>.</p><h2>Sync to Catalogs</h2><p>Data catalogs play a crucial role in the data ecosystem, and Hudi supports multi-catalog sync out of the box via its SyncTool classes. Hudi Streamer can integrate with SyncTools through the "--sync-tool-classes" option, which takes in a list of SyncTool class names:</p><pre><code># for AWS Glue Data Catalog
org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool

# for Google BigQuery
org.apache.hudi.gcp.bigquery.BigQuerySyncTool

# for Hive Metastore
org.apache.hudi.hive.HiveSyncTool

# for DataHub
org.apache.hudi.sync.datahub.DataHubSyncTool</code></pre><p>After each write, if catalog sync is enabled via the "--enable-sync" flag, the configured SyncTools run synchronously, one after another, to upload metadata to their target data catalogs. For example, if the write created some new partitions and added a new column to the table, the <code>AwsGlueCatalogSyncTool</code> will update the partition list and the schema stored in the catalog table.</p><p>For SyncTools to function properly, users should supply additional SyncTool-specific properties through the "--props" or "--hoodie-conf" options. For detailed configurations, please refer to <a href="https://hudi.apache.org/docs/configurations#META_SYNC">this section</a> of the documentation page.</p><h2>Other Notable Features</h2><p>The Schema Provider, specified through "--schemaprovider-class", supplies the schemas for reading from the Source and writing to the target table. A notable implementation is the <code>SchemaRegistryProvider</code>, which proves particularly useful when integrating with a <code>KafkaSource</code>. This implementation enables Hudi Streamer to access Kafka's schema registry, ensuring that data ingested from Kafka is accurately interpreted and processed.</p><p>The "--checkpoint" and "--initial-checkpoint-provider" options facilitate pausing and resuming data fetching from the Source, avoiding data loss or duplication. The "--post-write-termination-strategy-class" option allows for a graceful shutdown of Hudi Streamer in the "continuous" mode. The "--run-bootstrap" flag instructs Hudi Streamer to perform a one-time bootstrap operation for a new Hudi table.</p><h2>Recap</h2><p>In this post, we've provided an overview of Hudi Streamer's workflow, followed by an in-depth exploration of its diverse options. These discussions aim to highlight Hudi's rich platform capabilities in building an end-to-end ingestion pipeline. 
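</p><p>To tie the options covered in this post together, here is a minimal Python sketch, not taken from the Hudi codebase, that assembles a <code>spark-submit</code> command for a continuous MoR ingestion job and reproduces the pause arithmetic of "--min-sync-interval-seconds". The helper names, the bundle jar path, the table name, the base path, the topic, and the properties file are hypothetical placeholders. The class names and the "--target-table" flag reflect my understanding of recent Hudi releases and should be verified against the version in use.</p><pre><code>import shlex


def pause_seconds(cycle_seconds, min_sync_interval_seconds):
    """Pause before the next cycle so that cycles start at least the minimum interval apart."""
    return max(0, min_sync_interval_seconds - cycle_seconds)


def build_streamer_command(jar, table_name, base_path, props_file):
    """Assemble a spark-submit argument list for a continuous Hudi Streamer job.

    All argument values are hypothetical placeholders supplied by the caller.
    """
    return [
        "spark-submit",
        # FAIR mode is needed for the scheduling weights to take effect.
        "--conf", "spark.scheduler.mode=FAIR",
        # Class names below are assumptions based on recent Hudi releases.
        "--class", "org.apache.hudi.utilities.streamer.HoodieStreamer",
        jar,
        "--table-type", "MERGE_ON_READ",
        "--target-table", table_name,
        "--target-base-path", base_path,
        "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
        "--props", props_file,
        # A single key=value override; it takes precedence over --props.
        "--hoodie-conf", "hoodie.streamer.source.kafka.topic=orders",
        "--op", "UPSERT",
        "--continuous",
        "--min-sync-interval-seconds", "60",
        "--max-pending-compactions", "2",
        "--delta-sync-scheduling-weight", "2",
        "--compact-scheduling-weight", "1",
    ]


command = build_streamer_command(
    jar="hudi-utilities-bundle.jar",
    table_name="orders",
    base_path="/tmp/hudi/orders",
    props_file="/tmp/hudi/kafka-source.properties",
)
print(shlex.join(command))
print(pause_seconds(40, 60))  # 20: a 40-second cycle waits 20 more seconds
print(pause_seconds(70, 60))  # 0: a 70-second cycle starts the next one immediately</code></pre><p>Building the command as a list keeps each flag next to its value and leaves quoting to <code>shlex.join</code>, which makes it easier to add or drop options per environment.</p><p>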
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For running with Flink, Hudi offers a similar utility tool <code>HoodieFlinkStreamer</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>To activate the scheduling options, Hudi Streamer also needs to be running in the "continuous" mode and the target table type should be MoR.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (8/10)]]></title><description><![CDATA[Read and process incrementally]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-810</link><guid 
isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-810</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Fri, 05 Jan 2024 23:08:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1T5k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a 
href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In this post, we will explore the topic of incremental processing in Hudi, addressing the missing piece mentioned in <a href="https://blog.datumagic.com/i/136325921/incremental-query">post 2</a>. We'll start with a concise overview of the incremental architecture before examining two related features in Hudi: incremental query and change data capture (CDC).</p><h2>Overview</h2><p>Incremental processing, a technique of extracting, loading, and transforming (ELT) subsets of data to keep end results up-to-date, has become a standard in constructing data pipelines for data lakehouses. Unlike traditional methods, which often involve pulling a complete data snapshot for storage overwriting or using costly join operations to identify updates, modern data lakehouses typically utilize a storage format inherently supportive of incremental processing to simplify the architecture. Benefiting from the native support, the <a href="https://www.onehouse.ai/glossary/medallion-architecture">medallion architecture</a> has gained popularity and has been adopted in production pipelines by numerous companies. 
This architecture is characterized by three key layers: the bronze layer, essential for reprocessing needs; the silver layer, ensuring data quality; and the gold layer, delivering business value.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1T5k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1T5k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 424w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 848w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 1272w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1T5k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png" width="1456" height="1556" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1556,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221635,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1T5k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 424w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 848w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 1272w, https://substackcdn.com/image/fetch/$s_!1T5k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31ffc122-5ccf-4ccc-a2dd-261425170134_1650x1763.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Medallion architecture: from applications to AI &amp; BI</figcaption></figure></div><p>In the next sections, we'll discuss how Hudi achieves incremental processing, which is well-suited to supporting a robust implementation of the medallion architecture.</p><h2>Incremental Query</h2><p>Hudi effectively tracks changes in the form of transaction logs by persisting commit metadata within the Timeline, and thereby naturally facilitates incremental processing which, in most cases, relies on timestamp-based checkpointing. Hudi's incremental query feature is enabled through these configurations:</p><pre><code>hoodie.datasource.query.type=incremental
hoodie.datasource.read.begin.instanttime=202305150000
hoodie.datasource.read.end.instanttime=202305160000 # optional</code></pre><p>These settings retrieve the data that changed within a defined time window. For more usage examples, please check out the <a href="https://hudi.apache.org/docs/quick-start-guide/#incremental-query">documentation page</a>. A few behaviors are worth noting:</p><ul><li><p>Setting <code>hoodie.datasource.read.begin.instanttime=0</code> effectively requests all changes made to the table from the very beginning of its history.</p></li><li><p>Omitting <code>hoodie.datasource.read.end.instanttime</code> fetches the changes up to the most recent completed commit in the table.</p></li><li><p>The data returned by incremental queries contains records that were updated during the specified time window<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Each record is returned in its version as of the latest completed commit in the table. If <code>hoodie.datasource.read.end.instanttime</code> is set, each record is instead returned in its version as of the commit denoted by that end time.</p></li><li><p>When the begin time is set to 0 and the end time is omitted, the incremental query effectively becomes equivalent to a snapshot query, retrieving all the latest records in the table.</p></li></ul><p>Now that we understand the behavior of incremental queries, we can look at the details. 
The following diagram shows the workflow involved in fetching incremental data from a Hudi MoR table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nCOm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nCOm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 424w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 848w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 1272w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nCOm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png" width="1456" height="930" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:103747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nCOm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 424w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 848w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 1272w, https://substackcdn.com/image/fetch/$s_!nCOm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b80890-6bcf-4618-ae24-3c1fbfa1d33b_1751x1119.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hudi incremental query flow</figcaption></figure></div><p>Incremental queries follow the read flow as depicted in <a href="https://blog.datumagic.com/i/136325921/spark-hudi-read-flow">post 2</a>, implementing two internal APIs: <code>collectFileSplits()</code> and <code>composeRDD()</code>. The implementation is largely divided into these steps:</p><ul><li><p><code>collectFileSplits()</code> is responsible for identifying all files relevant to the query. This function derives start and end timestamps based on user input to define a specific time range. This time range is then used to filter commits on the Timeline.</p></li><li><p>Hudi's Timeline, comprising a series of transaction logs, inherently represents the changes made over time. 
With a specified time range, it becomes straightforward to filter down to the relevant files needed for the <code>composeRDD()</code> function to process.</p></li><li><p>In a Hudi table, each record includes a metadata field named <code>_hoodie_commit_time</code>, which links the record to a specific commit in the Timeline. During the process of loading target files for records, incremental queries construct a commit time filter to further minimize the amount of data read. This filter is pushed to the level of file reading, allowing <code>composeRDD()</code> to be optimized to load only those records that are intended to be returned.</p></li></ul><h2>Change Data Capture</h2><p>Incremental queries effectively reveal which records have been changed and their final states. However, they don't provide specific details about the nature of these changes. For instance, if record X is identified as having been modified, the incremental query doesn't clarify its column values prior to the update, or whether it was a newly inserted record. Additionally, it doesn't indicate if any records were hard-deleted. To address these limitations, Hudi 0.13.0 introduced <a href="https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md">Change Data Capture (CDC)</a>. This enhanced format of incremental processing provides a more comprehensive view of data modifications, including inserts, updates, and deletes, thereby enabling a clearer understanding of the changes within the dataset.</p><p>To enable the CDC functionality, users need to set this table property <code>hoodie.table.cdc.enabled=true</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Writers writing to the table will honor this setting and activate the process of creating CDC log files alongside Base Files. Thanks to Hudi's file grouping mechanism, these CDC log files are included in the same File Groups that hold the changed data. 
This makes it easy to extend table services like cleaning, and recovery operations like restore, to manage CDC log files and data files together, which keeps file management coherent.</p><p>To pull the CDC data, users just need to set the incremental format to <code>cdc</code> when performing incremental queries. The time-range behaviors described earlier still apply to the <code>cdc</code> query format.</p><pre><code>hoodie.datasource.query.type=incremental
<strong>hoodie.datasource.query.incremental.format=cdc</strong>
hoodie.datasource.read.begin.instanttime=202305150000
hoodie.datasource.read.end.instanttime=202305160000 # optional</code></pre><p>The following diagram shows a brief overview of how the writer and reader interact with CDC files and data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oyTb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oyTb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 424w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 848w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 1272w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oyTb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png" width="646" height="469.6878698224852" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1352,&quot;resizeWidth&quot;:646,&quot;bytes&quot;:60144,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oyTb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 424w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 848w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 1272w, https://substackcdn.com/image/fetch/$s_!oyTb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05e5b080-80a7-4ea7-8f19-8d4507bb5812_1352x983.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hudi CDC write and read</figcaption></figure></div><p>On the writer side, Hudi's write handle holds the information<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> about the intended operations for the writing records (insert, update, or delete). This information is then encoded into a specific CDC log file format, containing four fields as shown in the diagram. The nullable "before" and "after" fields store the complete record snapshot before and after the change. 
Users can reduce the volume of logged data by adjusting <code>hoodie.table.cdc.supplemental.logging.mode</code>: use <code>DATA_BEFORE</code> to skip the "after" field, or set <code>OP_KEY_ONLY</code> to store the record key instead of the "before" and "after" fields.</p><p>On the reader side, CDC log files are loaded to construct the results, following a process similar to that of normal incremental queries (whose incremental format is called <code>latest_state</code>). If both "before" and "after" fields are logged, the results are extracted directly from the CDC log files. If a less verbose logging mode is used, the results are computed on the fly by looking up existing records in the table. This is essentially a trade-off between storage space and the efficiency of running CDC queries.</p><h3>Richer Insights</h3><p>The introduction of CDC capabilities broadens the use of Hudi tables, supporting a wider range of scenarios and offering valuable insights. Take, for example, an account balance subject to frequent debit and credit transactions. Without CDC, periodic snapshot queries or <code>latest_state</code> incremental queries might see only a small change in the balance, or none at all, potentially missing critical fluctuations. CDC queries reveal every change, offering a comprehensive view of the account's activities. This level of detail is essential for fraud detection algorithms to act accordingly.</p><h2>Recap</h2><p>In this post, we provided a concise introduction to incremental processing and the <a href="https://www.onehouse.ai/glossary/medallion-architecture">medallion architecture</a>, followed by an in-depth exploration of Hudi's approach to supporting incremental queries and Change Data Capture (CDC). Finally, we discussed the significance of CDC in deriving valuable business insights. 
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The updating actions and the time window here correspond to the processing time in the Hudi table, not the event time in the business domain.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Once enabled, users are not allowed to switch the setting on and off during the table's lifespan. 
This restriction is enforced because the setting affects the storage layout, and toggling it is not a use case worth accommodating.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Depending on the execution engine and the index configuration, either the writer or the compaction runner has access to this information.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (7/10)]]></title><description><![CDATA[Concurrently run writers and table services]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-710</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-710</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Thu, 07 Dec 2023 03:08:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce 
table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In the <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-610">previous post</a>, we concluded our discussion on table services by exploring the intricacies of the clustering process and space-filling curves. With the knowledge gained in previous posts, we can seamlessly transition to the next topic: concurrency control, focusing specifically on managing concurrency for writers and table services.</p><h2>A Primer on Concurrency Control</h2><p>Every commit to a Hudi table is a transaction, whether it stems from adding new data or executing a table service job. Concurrency control is about orchestrating concurrently executed transactions to ensure correctness and consistency while maintaining optimal performance. 
A wealth of valuable resources is available online, such as <a href="https://15445.courses.cs.cmu.edu/fall2023/">this course</a> and <a href="https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf">this paper</a>. This primer aims to offer just enough context for the subsequent sections that delve into Hudi's implementation of concurrency control.</p><p>In databases, ACID refers to four essential properties that maintain the integrity and reliability of transactions. The chart below gives a brief summary of ACID, intended as a clear and easily memorable overview.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cCQk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!cCQk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 424w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 848w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 1272w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cCQk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png" width="1356" height="1129" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1129,&quot;width&quot;:1356,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58466,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!cCQk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 424w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 848w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 1272w, https://substackcdn.com/image/fetch/$s_!cCQk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23c0b4d1-7ed8-4733-a6b3-b4c972173b37_1356x1129.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Atomicity</em> requires that each transaction be treated as an indivisible unit of work; any changes made by the transaction must be rolled back if it fails partway through. <em>Consistency</em> is about application-specific constraints; for example, a primary key field cannot have duplicates, or the product price column must be non-negative. <em>Isolation</em> ensures that concurrent transactions do not interfere with each other, so their changes take effect as though the transactions were executed sequentially. <em>Durability</em> mandates that committed data be preserved on storage, ensuring resilience against incidents such as hardware failures.</p><p>If the Isolation property is not honored, concurrent transactions can incur read and write anomalies, such as dirty reads, dirty writes, and lost updates. Enforcing strictly serial execution of all transactions eliminates the anomalies but severely impacts performance, rendering the system practically unusable. Therefore, we should allow transactions to execute concurrently for performance, and coordinate them so that the result is equivalent to a serial schedule for correctness. 
In other words, what we need is a <em>serializable</em> schedule.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WuPt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WuPt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 424w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 848w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 1272w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WuPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png" width="973" height="964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:964,&quot;width&quot;:973,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37526,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WuPt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 424w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 848w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 1272w, https://substackcdn.com/image/fetch/$s_!WuPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38cd972d-89ea-4f5e-a7ed-89b430d41a6d_973x964.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>MVCC (Multi-Version Concurrency Control) and OCC (Optimistic Concurrency Control) stand out as two widely adopted strategies for enforcing serializable schedules in various database systems. MVCC keeps multiple record versions on storage and associates them with monotonically increasing transaction IDs (e.g., timestamps). OCC "optimistically" allows concurrent transactions to proceed on their own first and resolves any conflicts later. Hudi adopted MVCC to let a single writer run concurrently with table services without locking. Later releases added an OCC implementation to support multi-writer scenarios. In the upcoming sections, we will explore how Hudi employs these strategies in dealing with concurrent writers and table services.</p><h2>MVCC in Hudi</h2><p>The Timeline and File Slices serve as the foundation of Hudi's MVCC implementation. The Timeline uses monotonically increasing commit start times to keep track of transactions on the table. 
File Slices handle record versioning and correspond to transaction timestamps. One layer above these, Hudi constructs a view object, namely <code>TableFileSystemView</code>, which provides APIs that return the table's most recent storage state, such as the latest File Slices under a partition path and the File Groups undergoing clustering. Writers and readers always consult the table file-system view to decide where to perform the actual IO operations. This design provides read-write isolation out of the box, since writing new data does not interfere with readers accessing past versions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JbAw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JbAw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JbAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg" width="1456" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:917004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JbAw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JbAw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c0d4cf3-a6dc-4a32-9766-d3af1a24588a_5409x2957.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MVCC: Hudi table services run concurrently with a writer</figcaption></figure></div><p>When a write operation is in progress, a commit action indicating this write will be marked as "requested" or "inflight" on the Timeline. This makes the table file-system view aware of the ongoing action, ensuring that table service planners do not include the File Slices currently being written for subsequent execution. This logic also holds true in the scenario of concurrent table service jobs. Hudi's table services are idempotent operations because the plans containing information about which File Slices to read are persisted. 
Therefore, retries in the event of failure won't impact the final result.</p><p>While a compaction is ongoing, any new write to the MoR table either routes new records to new File Groups or appends updates/deletes to Log Files. The Base File that the compaction job is producing is excluded by the view to prevent misuse. When clustering is pending, users can configure the writer's behavior when it updates a File Group that is undergoing clustering: abort the write, roll back the clustering, defer to later conflict resolution (OCC), or perform dual-write to both the source and target clustering File Groups. Cleaning is always executed in a way that retains the latest File Slices, keeping the deletes clear of new writes.</p><h2>OCC in Hudi</h2><p>An OCC protocol typically comprises three phases: read, validation, and write. In the read phase, concurrent writers perform the necessary IO operations to complete their work in isolation. The validation phase involves collecting the list of changes from each writer and determining whether any conflicts exist. Lastly, during the write phase, all changes are accepted if no conflicts are found; if conflicts arise, the changes from the writer with the later transaction time are rolled back. This is similar to the GitHub workflow, where contributors submit pull requests to the upstream repository. Merging is blocked for pull requests that have conflicts, akin to the validation phase in OCC.</p><p>As concurrent updates could lead to write anomalies, Hudi implements OCC at file-level granularity to handle multi-writer scenarios. 
To enable this feature, users need to set <code>hoodie.write.concurrency.mode</code> to <code>OPTIMISTIC_CONCURRENCY_CONTROL</code> and configure a lock provider accordingly.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> The following diagram demonstrates how OCC is integrated into Hudi's write flow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M4Px!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M4Px!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 424w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 848w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M4Px!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png" width="1456" height="986" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:986,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:223422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M4Px!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 424w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 848w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!M4Px!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ff9930-ad6a-4209-a193-7ad5a65841b9_2349x1590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of the Hudi OCC flow for a multi-writer scenario</figcaption></figure></div><p>Some key steps in the diagram are highlighted below:</p><ul><li><p>Write client 1 is writing t1.commit and first acquires a lock from the lock provider, which is usually implemented using an external service such as ZooKeeper, Hive Metastore, or DynamoDB.</p></li><li><p>While holding the lock, write client 1 has exclusive access to check the Timeline for any concurrent commits that completed before its own attempt. In this example, t2.commit by write client 2 is the only candidate Timeline instant to check against, and it is still inflight; therefore, client 1 can proceed to commit and release the lock.</p></li><li><p>Write client 2 is writing t2.commit and acquires the lock after client 1 releases it. In the pre-commit phase, the files changed by client 2, obtained from WriteStatus, conflict with the files changed by client 1, derived from t1.commit. 
Consequently, client 2 will abort the write.</p></li></ul><p>Aborted writes are rolled back, which means all the written files, both data and metadata, are deleted as if the writes never occurred. While this fulfills Atomicity in the ACID properties, it can also be wasteful, particularly when conflicts are likely. To reduce this waste, Hudi offers an early-conflict-detection mode for OCC. In this mode, before the actual files are written, lightweight marker files are created in a temporary folder. These markers enable a preliminary conflict check before the data is written. For a detailed explanation of the design and implementation of early conflict detection, please refer to this <a href="https://www.youtube.com/watch?v=sgfMdeD-yk4">community talk</a>.</p><h2>Recap</h2><p>In this blog post, we gave a brief overview of concurrency control before delving into how Hudi implements two strategies, MVCC and OCC. Hudi employs these strategies to handle two scenarios: a single writer running with table services, and multiple writers. 
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Readers may refer to the official <a href="https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing">documentation page</a> for more details on enabling multi-writing.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (6/10)]]></title><description><![CDATA[Demystify clustering and space-filling curves]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-610</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-610</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Mon, 13 Nov 2023 08:31:19 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!45HX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-510">the previous post</a>, we covered the concept of table services and discussed compaction, cleaning, and indexing. To conclude this topic, we will now delve into the remaining service: clustering.</p><h2>Overview</h2><p>In the context of machine learning, clustering is a technique used to categorize data points into groups, unveiling underlying structures within the dataset. Many clustering algorithms employ specific methods to measure distances between data points, thereby determining the groups they belong to. When talking about clustering within the data storage domain, we can consider records as the data points and physical files as the groups. Therefore, the clustering process can be viewed as putting "proximate" records into the same files. You might naturally pose two follow-up questions: a) How can we determine if records are "proximate"? b) Why is clustering necessary? 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!45HX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!45HX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 424w, https://substackcdn.com/image/fetch/$s_!45HX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 848w, https://substackcdn.com/image/fetch/$s_!45HX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!45HX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!45HX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png" width="612" height="395.5302197802198" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:612,&quot;bytes&quot;:179281,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!45HX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 424w, https://substackcdn.com/image/fetch/$s_!45HX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 848w, https://substackcdn.com/image/fetch/$s_!45HX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!45HX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7ed39-e46e-4836-b94d-23ab54684a1c_2145x1386.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To illustrate the concept of "proximity", let&#8217;s use the analogy of a 2-dimensional plane with X and Y axes. 
In this analogy, if a dataset&#8217;s schema has two columns, X and Y, the records are considered "close" when their value pairs (X, Y) are close to each other on the 2D plane. For a wide schema with numerous columns, more dimensions are added accordingly. While visualizing high-dimensional spaces is challenging for 3-dimensional beings like ourselves, proximity can still be determined mathematically, allowing computers to process the information.</p><p>Clustering, in the context of data storage, is a valuable optimization technique that improves the storage layout by preserving data locality for better read efficiency. There are three main motivations to perform clustering:</p><ul><li><p>Low-latency, high-throughput writes often result in too many small files, hurting query performance. A clustering task that consolidates and rewrites these data files into larger ones can effectively address the issue, especially when executed asynchronously to the writer.</p></li><li><p>During the process of rewriting data files, "proximate" records are more likely to be clustered in the same files, thereby facilitating data-skipping techniques. Clustered records tend to show better alignment with the file-level statistics like column min/max values, allowing data files to be skipped more effectively based on given predicates.</p></li><li><p>Reading clustered data can also take advantage of cache systems. The principle of spatial locality suggests that, following the access of certain data elements, nearby data elements are likely to be accessed in the near future. As clustered data exhibits good locality, utilizing block cache (e.g., in HDFS) can increase the hit rate, resulting in faster reads.</p></li></ul><h2>Clustering Workflow</h2><p>Similar to other table services mentioned in <a href="https://blog.datumagic.com/i/137970869/overview">post 5</a>, clustering can be run in three modes: inline, semi-async, and full-async. 
Users are encouraged to consult the <a href="https://hudi.apache.org/docs/configurations">official documentation</a> and configure these flags to control the running mode as needed.</p><pre><code>hoodie.clustering.inline
hoodie.clustering.schedule.inline
hoodie.clustering.async.enabled</code></pre><p>Since clustering involves rewriting data, a <code>.replacecommit</code> will be generated upon the completion of the table service job, indicating that the eligible File Groups have been rewritten into new ones. The clustering workflow, consisting of scheduling and execution, is illustrated below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1rj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1rj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 424w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 848w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 1272w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1rj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png" width="1456" height="1584" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1584,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a1rj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 424w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 848w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 1272w, https://substackcdn.com/image/fetch/$s_!a1rj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe864eb4a-95ea-497b-a875-7d54476393b2_2113x2299.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hudi clustering workflow</figcaption></figure></div><p>The clustering workflow is akin to <a href="https://blog.datumagic.com/i/137970869/compaction">compaction</a>. During the scheduling phase, eligible partitions and File Slices are selected based on <code>ClusteringPlanStrategy</code>. Users have the flexibility to define partition patterns (e.g., using regex) to target specific partitions. Within these partitions, File Slices meeting certain criteria - such as not being part of a pending compaction and falling under the configured small-file size limit - are added to <code>HoodieClusteringGroup</code>s. These entities store information about the input and output for subsequent clustering execution. Typically, a <code>HoodieClusteringGroup</code> adheres to size limits, such as the maximum total bytes of File Slices to include for rewriting. 
The total number of <code>HoodieClusteringGroup</code>s is also capped by default, preventing unintentional submission of resource-intensive clustering jobs.</p><p>The execution phase involves the following high-level steps:</p><ul><li><p>Deserialize the clustering plan</p></li><li><p>Load the designated input File Slices</p></li><li><p>Merge the loaded records</p></li><li><p>Bulk Insert the merged records into new File Groups</p></li><li><p>Report write statistics through the returned <code>WriteStatus</code></p></li></ul><p>Users can customize the execution by supplying their own implementation of <code>ClusteringExecutionStrategy</code>. By default, each <code>HoodieClusteringGroup</code> defined in a clustering plan will be submitted as a separate job to perform parallel rewriting of File Slices.</p><p>For File Groups undergoing clustering, writers will, by default<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, abort if they attempt updates or deletes on those File Groups. However, failing writes just because a table service is running may not be ideal. Other pluggable strategies exist that allow updates to proceed, followed by resolving conflicts or enforcing dual-writes on both the old and new File Groups.</p><p>We have illustrated what the clustering workflow looks like as a Hudi table service. However, one crucial piece of information is still missing: where does the record "proximity" we mentioned in the <a href="https://blog.datumagic.com/i/138369118/overview">overview</a> come into play during the process? This occurs at the Bulk Insert step, where records are re-partitioned and sorted according to <code>hoodie.layout.optimize.strategy</code>, which I&#8217;ll elaborate on in the next section.</p><h2>Layout Optimization Strategies</h2><p>Hudi offers three layout optimization strategies, namely Linear, Z-order, and Hilbert. 
Each of these defines how records should be sorted during Bulk Insert. The default strategy is Linear, which performs <a href="https://en.wikipedia.org/wiki/Lexicographic_order">lexicographical sorting</a>. The other two, Z-order and Hilbert, are based on space-filling curves, which sort records while preserving good spatial locality.</p><p>The Linear strategy is highly effective for datasets where record "proximity" relies on just one column. For instance, consider a table containing transaction records with a timestamp column. Analysts mostly run queries that fetch all records between transaction times A and B. Since records count as "close" whenever their transaction timestamps are close, Linear is a good fit: sorting by the timestamp preserves exactly that locality.</p><p>The Linear strategy may not perform well with datasets that require two or more columns to determine record "proximity". Take, for example, a house inventory dataset with columns for latitude and longitude. Lexicographical sorting by latitude followed by longitude would group geographically distant house records together simply because their latitudes are close. In such cases, sorting algorithms capable of handling N-dimensional records are needed.</p><p>Space-filling curves are specifically designed to map N-dimensional points to one dimension. The term "space-filling" comes from the way the curve traverses the space, hitting every possible point to fill it. Once the curve is straightened, all the multi-dimensional points are mapped to a one-dimensional space and assigned a single-value coordinate. 
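</p><p>To make this mapping concrete, here is a minimal Python sketch of a 2D Z-order index computed by interleaving the bits of two coordinates. It is only an illustration and not Hudi's actual implementation; the function name <code>z_order_index</code>, the <code>bits</code> parameter, and the sample points are assumptions made for this example.</p><pre><code>def z_order_index(x, y, bits=16):
    """Interleave the bits of two non-negative integers into one Z-order value."""
    code = 0
    for i in range(bits):
        x_bit = (x // 2**i) % 2         # i-th bit of x
        y_bit = (y // 2**i) % 2         # i-th bit of y
        code += x_bit * 2**(2 * i)      # x bits land on even positions
        code += y_bit * 2**(2 * i + 1)  # y bits land on odd positions
    return code


# Hypothetical records identified by integer grid coordinates (x, y).
points = [(0, 0), (3, 3), (1, 0), (0, 1), (2, 2), (1, 1)]

# Sorting by the single-value coordinate orders the points along the curve.
print(sorted(points, key=lambda p: z_order_index(*p)))
# prints [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 3)]
</code></pre><p>Because the bits of both coordinates alternate in the result, points that are close in both dimensions tend to receive nearby index values, which is the kind of locality a multi-column sort needs.</p><p>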
Among various curve-drawing methods, Z-order and Hilbert, as shown below, are two approaches that can effectively preserve spatial locality through this mapping - the majority of nearby points on the curve are also close to each other in the original space.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!phqQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!phqQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 424w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 848w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!phqQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png" width="1456" 
height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!phqQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 424w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 848w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!phqQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527484be-f427-4062-bf2a-70ae898ca302_2573x1406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Z-order and Hilbert curves in a 2D plane; credits to <a href="https://eisenwave.github.io/voxel-compression-docs/rle/space_filling_curves.html">eisenwave.github.io</a></figcaption></figure></div><p>When we treat records as multi-dimensional points, drawing a Z-order or Hilbert curve essentially defines the way to sort them. Given that spatial locality is well preserved, records that are actually "nearby" are more likely to be stored in the same files. This fulfills the proximity condition explained in the <a href="https://blog.datumagic.com/i/138369118/overview">overview</a> and enhances read efficiency.</p><h2>Recap</h2><p>In this post, we completed the topic of table services by elaborating on clustering. Additionally, we discussed space-filling curves and how they are used in the clustering process to optimize the storage layout for reads. 
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This behavior is controlled by <code>hoodie.clustering.updates.strategy</code>. Users may supply a subclass of <code>org.apache.hudi.table.action.cluster.strategy.UpdateStrategy</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The Z-order curve has big jumps, indicating that some adjacent 1D points could actually be far away from each other in the original space. 
The Hilbert curve may perform better in preserving locality due to the absence of such cases.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (5/10)]]></title><description><![CDATA[Introduce table services: compaction, cleaning, and indexing]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-510</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-510</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Thu, 19 Oct 2023 03:41:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zoiw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table 
services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>The previous four posts in this series delved into the details of reads and writes, offering ample context for the topic of this post: table services. The content is divided into two parts: the first introduces the high-level concepts of table services, while the second covers three specific table services - compaction, cleaning, and indexing.</p><h2>Overview</h2><p>Table services can be defined as a type of maintenance job that operates on a table without adding new data. When ingesting new records, we often prioritize low latency, which may involve trade-offs that leave the storage layout sub-optimal. 
Running table service jobs results in an improved storage layout, paving the way for more efficient read and write processes in the future.</p><p>A table service job comprises two steps: scheduling and execution. The scheduling step aims to generate a plan of execution, while the execution step carries out the plan and makes actual changes to the table. 
We can categorize the methods of running table services in Hudi into three modes: Inline, Semi-async, and Full-async, as depicted below, to provide flexibility for various real-world scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zoiw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zoiw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 424w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 848w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zoiw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png" width="1456" height="1096" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zoiw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 424w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 848w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!zoiw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa91a7644-354f-41a8-8c14-3cb32e52ce51_1984x1494.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Table services running modes</figcaption></figure></div><p>In the inline mode, both "schedule" and "execute" occur synchronously after the writer commits, making them "inline". This mode requires the least operational effort, as the two steps are automatically executed in sequence within the existing writer process. The evident trade-off is that it may add significant latency to the write process.</p><p>The semi-async mode keeps scheduling inline but separates execution from it, i.e., execution runs asynchronously to the writer process. In this mode, users have the flexibility to deploy the service runner as a separate job or even to a different cluster, which might be necessary due to the high computational requirements of service execution.</p><p>The full-async mode is the most flexible: it fully decouples running table services from the writer processes. 
This is particularly helpful in managing a large number of tables in a lakehouse project, where a dedicated scheduler can be employed to optimize both scheduling and execution.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>As of release 0.14.0, Hudi offers four table services: compaction, clustering, cleaning, and indexing. In the following sections, we will explore compaction, cleaning, and indexing, reserving clustering for a subsequent post.</p><h2>Table Services</h2><h3>Compaction</h3><p>Recall from <a href="https://blog.datumagic.com/i/135356155/storage-format">post 1 on storage layout</a> that a File Slice can contain multiple Log Files and a Base File in MoR tables. As new data comes in, we evolve the File Slice by merging all Log Files against the Base File, creating a new version of the File Slice represented in a new Base File. This process is called compaction and is specific to MoR tables. It doesn&#8217;t apply to CoW tables, where every write produces new Base Files and there are no Log Files to compact.</p><p>There are quite a few configurations to manage when scheduling and executing compaction. The <a href="https://hudi.apache.org/docs/compaction/">official documentation</a> provides detailed examples that showcase the usage. 
In this post, our focus is on the generalized internal workflow illustrated in the diagram below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pJrn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pJrn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 424w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 848w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 1272w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pJrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png" width="1456" height="1598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1598,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pJrn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 424w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 848w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 1272w, https://substackcdn.com/image/fetch/$s_!pJrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83817569-5f39-4ec5-a05d-e25045b176e0_2078x2280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hudi compaction workflow</figcaption></figure></div><p>The scheduling step determines whether compaction is necessary based on the configurable <code>CompactionTriggerStrategy</code>; users can set the triggering threshold based on factors such as the number of commits or elapsed time. If the criteria are met, it generates a compaction plan and saves it to the Timeline as a <code>.compaction.requested</code> action. To build the plan, a compaction plan generator scans the table based on the <code>CompactionStrategy</code>, which essentially controls which File Slices should be compacted, and produces a <code>CompactionOperation</code> for each selected File Slice.</p><p>The execution step loads all the serialized <code>CompactionOperation</code>s from the plan and runs them in parallel. 
Depending on whether a Base File is present in the target File Slice, either <code>MergeHandle</code> or <code>CreateHandle</code> is used to write the merged records into a new File Slice. As in a regular write process, a group of <code>WriteStatus</code> objects is returned, reporting statistics collected during the execution, and a <code>.commit</code> action is saved on the Timeline, marking the success of the compaction.</p><p>Compaction jobs can be quite resource-intensive due to the high write amplification when rewriting Base Files. An experimental table service named Log Compaction<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> was introduced in release 0.13.0 to address the write-amplification issue by compacting only Log Files into larger ones.</p><h3>Cleaning</h3><p>As new data arrives, Hudi tables continually add File Slices to represent newer versions, taking up more disk space. Cleaning is the table service designed to reclaim storage space by deleting old and unwanted versions. 
For detailed usage information, please refer to the <a href="https://hudi.apache.org/docs/hoodie_cleaner/">documentation page</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGVj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGVj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 424w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 848w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 1272w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AGVj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png" width="1456" height="1390" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1390,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGVj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 424w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 848w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 1272w, https://substackcdn.com/image/fetch/$s_!AGVj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aac15bf-63c1-4124-9848-c13062748d76_2089x1995.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Hudi cleaning workflow</figcaption></figure></div><p>Similar to compaction, Hudi utilizes <code>CleaningTriggerStrategy</code> to determine if cleaning is required at the time of scheduling. Currently, the only supported triggering criterion is the number of commits. After N commits (as configured), a cleaning planner will scan relevant partitions and determine if any File Slice meets the criteria for cleaning, as defined by <code>HoodieCleaningPolicy</code>. Physical paths of either Base Files or Log Files from the eligible File Slices will be used to generate a group of <code>CleanFileInfo</code>. A cleaning plan is then formulated based on that and saved into a <code>.clean.requested</code> action. 
As of now, three cleaning policies are supported: clean-by-commits, clean-by-file-versions, and clean-by-hours.</p><p>The cleaning execution is relatively straightforward: after loading the plan and deserializing the <code>CleanFileInfo</code>, the job performs file-system deletes for the target files in parallel. Statistics are first collected at the partition level, then aggregated and saved into a <code>.clean</code> action, marking the completion of the cleaning.</p><h3>Indexing</h3><p>The indexing table service was first added in release 0.11.0 as an experimental feature. Currently, it is designed for building indexes for the <a href="https://hudi.apache.org/docs/metadata">metadata table</a>. I will not delve into the indexing process here, as it requires prior knowledge of the metadata table. Instead, I will provide a brief overview of the design. To learn more, I recommend consulting the <a href="https://hudi.apache.org/docs/metadata_indexing">official documentation</a>, <a href="https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi">this blog</a>, and <a href="https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md">RFC-45</a>.</p><p>In <a href="https://blog.datumagic.com/i/137354585/indexing-apis">post 4</a>, we mentioned an indexing API named <code>updateLocation()</code> that is required by certain indexes to keep the indexing data in sync with the written data. From a table service perspective, we can view it as indexing running in inline mode, i.e., scheduled inline and executed inline. The current indexing service, by contrast, is considered to be in full-async mode. The metadata table can be seen as another index type that encompasses multiple indexes, also known as a multi-modal index. As the data table size grows, updating the metadata table inline with each write can be time-consuming. 
Therefore, we need the async table service to maintain high write-throughput while keeping the indexes up-to-date.</p><h2>Recap</h2><p>In this post, we introduced the concept of table service, delved into the detailed processes of compaction and cleaning, and briefly touched on indexing. Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The on-going <a href="https://github.com/apache/hudi/pull/4309">RFC-43 table service manager</a> is designed to support this platform feature out-of-the-box.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md">RFC-48</a> has the design details for log 
compaction.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (4/10)]]></title><description><![CDATA[All about writer indexes]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-410</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-410</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Thu, 28 Sep 2023 00:44:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and 
process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-310">the previous post</a>, we walked through Hudi&#8217;s write operation flow. Among all the steps involved, indexing is a crucial one that verifies the existence of records in the table and helps achieve efficient update and delete operations. This post will introduce the indexing APIs and explore various types of indexes. Please note that the indexes covered in this post are intended for writers, which differs from reader-side indexing.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><h2>Indexing APIs</h2><p>Writer indexing abstractions are defined in <code>HoodieIndex</code>. I'll describe some key APIs below to provide a high-level understanding of what indexing entails.</p>
<ul><li><p><code>tagLocation()</code>: when a set of input records is passed to the index component during writing, this API is invoked to tag each record, determining whether it is present in the table, and then associating it with its location information. The resulting set of records is referred to as "tagged records". In <a href="https://datumagic.substack.com/i/136915529/transform-input">the HoodieRecord model</a>, the "currentLocation" field will be populated by this tagging process.</p></li><li><p><code>updateLocation()</code>: after writing to storage, certain indexes require location information to be updated to synchronize with the data table. This process is only executed during the post-IO phase for those applicable index types.</p></li><li><p><code>isGlobal()</code>: Hudi categorizes indexes into global and non-global types. Global indexes identify unique records across all table partitions, hence being "global" in relation to the table. Non-global indexes, on the other hand, validate uniqueness at the partition level. Typically, non-global indexes exhibit better performance due to their smaller scan space. However, they are not suitable for tables with records that can shift between partitions.</p></li><li><p><code>canIndexLogFiles()</code>: due to the implementation specifics, certain indexes are able to index on Log Files for Merge-on-Read tables. 
This characteristic affects how writers create file-writing handles: when this is true for the configured index, inserts will be routed to Log Files through <code>AppendHandle</code>.</p></li><li><p><code>isImplicitWithStorage()</code>: this characteristic indicates whether the index is implicitly "persisted" along with data files on storage. Some indexes store their indexing data separately.</p></li></ul><h2>Index Types</h2><p>Hudi offers several out-of-the-box index types to suit different traffic patterns and table sizes. Selecting the most appropriate index for each table is a crucial tuning step. <a href="https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/">This article</a> explains well why making the right choice matters. In the following sections, I will illustrate the internal workings of writer indexes to enhance understanding.</p><h3>Simple Index</h3><p>The Simple Index is a non-global index and currently serves as the default type. The idea behind it is to scan all Base Files within the relevant partitions, extract their record keys, and determine whether incoming records match any of the extracted keys.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XoHf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XoHf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 424w, 
https://substackcdn.com/image/fetch/$s_!XoHf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 848w, https://substackcdn.com/image/fetch/$s_!XoHf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!XoHf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XoHf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png" width="1456" height="1284" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1284,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XoHf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 424w, 
https://substackcdn.com/image/fetch/$s_!XoHf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 848w, https://substackcdn.com/image/fetch/$s_!XoHf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!XoHf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe46cfee6-0d46-4191-9306-5ee35ab11a3e_1558x1374.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Tagging flow for the Simple Index</figcaption></figure></div><p>In the left-join operation, if an input record matches an extracted key, the join result includes the location information, which is then used to populate the "currentLocation" field of the <code>HoodieRecord</code>. This produces the so-called "tagged records". Unmatched records are kept as-is and unioned with the tagged records for further processing.</p><p>The Simple Index has a global version known as the Global Simple Index. Unlike its non-global counterpart, it matches input against Base Files from all partitions rather than just the relevant ones. When a record's partition value is updated, the respective File Group (including Log Files for MoR tables) is loaded for an additional tagging step: it merges the incoming record with its existing old version and tags the merged result to the location in the new partition.</p><p>Because the Simple Indexes tend to load all Base Files at either the partition level or the table level, they are well suited to traffic patterns with random or evenly distributed data access.</p><h3>Bloom Index</h3><p>The Bloom Index follows a similar high-level flow to the Simple Index. 
However, the distinguishing concept behind the Bloom Index lies in its approach to minimizing the number of keys and files for look-ups while maintaining a low read cost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mcPs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mcPs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 424w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 848w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 1272w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mcPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png" width="1341" height="1455" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1455,&quot;width&quot;:1341,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:103633,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mcPs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 424w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 848w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 1272w, https://substackcdn.com/image/fetch/$s_!mcPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b70a59-001d-454b-b7b4-5d1bfc288187_1341x1455.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Bloom Index: filter keys and files for look-ups</figcaption></figure></div><p>The Bloom Index employs two-stage filtering to reduce the number of keys and files for look-ups. </p><ul><li><p>The first stage compares input keys against an interval tree constructed from the minimum and maximum record key values stored in Base Files&#8217; footers. Keys falling outside these ranges represent new inserts, while the remaining keys are considered candidates for the next stage. </p></li><li><p>The second stage checks the candidate keys against deserialized Bloom filters, which separate the definitely absent keys from the potentially present ones. 
Actual file look-ups are then carried out using the filtered keys and the associated Base Files, which subsequently return the key and location tuples for tagging.</p></li></ul><p>Please note that the filtering process before the look-ups only involves reading the file footers, thereby incurring low read costs.</p><p>Just like the Simple Index, the Bloom Index also has a global version known as the Global Bloom Index. It operates similarly to the non-global version, albeit at the table level, and employs the same logic as the Global Simple Index for handling partition-update scenarios.</p><h3>Bucket Index</h3><p>The Bucket Index is based on hashing: a fixed hashing function consistently maps a key to a File Group, eliminating the need for any disk reads and resulting in significant time savings.</p><p>The Bucket Index comes in two variations - the Simple Bucket Index and the Consistent Bucket Index. The Simple Bucket Index assigns a fixed number of buckets, each mapping to one File Group, which in turn limits the total number of File Groups in the table. This makes it harder to handle data skew and to scale out. </p><p>The Consistent Bucket Index, on the other hand, is designed to overcome these drawbacks by dynamically re-hashing an existing bucket into sub-buckets when the corresponding File Group exceeds a certain size threshold.</p><h3>HBase Index</h3><p>The HBase Index is implemented using an externally running HBase server. It stores the mappings between record keys and the relevant File Group information, and it is a global index. This offers efficient look-ups for tagging in most cases and can readily scale out as the table size increases. 
However, the drawback is the operational overhead involved in managing an additional server.</p><h3>Record Index</h3><p>The Record Index was added in release 0.14.0 and is logically similar to the HBase Index: it is also a global index that stores the mappings between record keys and File Groups. The key improvement lies in keeping the indexing data local to the Hudi tables, thus avoiding the cost of operating an extra server. Please refer to <a href="https://hudi.apache.org/blog/2023/11/01/record-level-index">this blog</a> for a detailed discussion.</p><h2>Recap</h2><p>In this post, we discussed Hudi indexing APIs for writers, delved into the detailed flows of the Simple Index and the Bloom Index, and briefly introduced the Bucket Index, the HBase Index, and the Record Index. Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div 
class="footnote-content"><p>Designs for reader-side indexes, such as the secondary index (<a href="https://github.com/apache/hudi/pull/5370">RFC-52</a>) and the functional index (<a href="https://github.com/apache/hudi/pull/7235">RFC-63</a>), are under discussion.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (3/10)]]></title><description><![CDATA[Understand write flows and operations]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-310</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-310</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Fri, 15 Sep 2023 12:30:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Kyt2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling 
curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-210">the previous post</a>, we discussed Hudi query types and their integration with Spark. In this post, we will delve into the other aspect - write flows, with Spark as the example engine. There are numerous configurations and settings you can adjust when it comes to writing data. Therefore, this post does not aim to serve as a complete usage guide. Instead, my primary goal is to present the internal data flows and break down the steps involved. This will provide readers with a deeper understanding of running and fine-tuning Hudi applications. For various practical usage examples, please consult <a href="https://hudi.apache.org/docs/overview">Hudi&#8217;s official documentation page</a>.</p><h2>Overall Write Flow</h2><p>The picture below illustrates the typical high-level steps involved in a Hudi write operation within the context of an execution engine. 
I will provide a brief introduction to each step in this section.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kyt2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kyt2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 424w, https://substackcdn.com/image/fetch/$s_!Kyt2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 848w, https://substackcdn.com/image/fetch/$s_!Kyt2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Kyt2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kyt2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png" width="1456" height="1164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kyt2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 424w, https://substackcdn.com/image/fetch/$s_!Kyt2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 848w, https://substackcdn.com/image/fetch/$s_!Kyt2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Kyt2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88ede8d3-aa4c-485b-ade1-59d5503a13d1_1581x1264.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Hudi overall write flow</figcaption></figure></div><h4>Create write client</h4><p>A Hudi write client serves as the entry point for write operations, and Hudi write support is achieved by creating an engine-compatible write client instance. 
For instance, Spark utilizes the <code>SparkRDDWriteClient</code>, Flink employs the <code>HoodieFlinkWriteClient</code>, and Kafka Connect generates the <code>HoodieJavaWriteClient</code>. Typically, this step involves reconciling user-provided configurations with the existing Hudi table properties and subsequently passing the final configuration set to the client.</p><h4>Transform input</h4><p>Before a write client processes the input data, several transformations occur, including the construction of <code>HoodieRecord</code>s and schema reconciliation. Let's delve deeper into the <code>HoodieRecord</code>, as it is a fundamental model in the write paths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BvIs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BvIs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 424w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 848w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 1272w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BvIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png" width="1410" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BvIs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 424w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 848w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 1272w, https://substackcdn.com/image/fetch/$s_!BvIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea48865d-a988-4784-a0aa-5ff1fc1df7a8_1410x728.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">HoodieRecord and some important properties</figcaption></figure></div><p>Hudi identifies unique records using the <code>HoodieKey</code> model, which consists of "recordKey" and "partitionPath". These values are populated by implementing the <code>KeyGenerator</code> API. This API offers flexibility in extracting and transforming custom fields into the key based on the input schema. For usage examples, please refer to <a href="https://hudi.apache.org/docs/key_generation/">the documentation page</a>.</p><p>Both "currentLocation" and "newLocation" consist of a Hudi Timeline's action timestamp and a FileGroup's ID. 
Recalling <a href="https://blog.datumagic.com/i/135356155/data">the logical FileGroup and FileSlice concepts</a> from post 1, the timestamp points to a FileSlice within a specific FileGroup. The "location" properties are employed to locate physical files using logical information. If "currentLocation" is not null, it indicates where a record with the same key exists in the table, while "newLocation" specifies where the incoming record should be written.</p><p>The "data" field is a generic type that contains the actual bytes for the record, also known as the payload. Typically, this property implements <code>HoodieRecordPayload</code>, which guides engines on how to merge an old record with a new one. Starting from <a href="https://hudi.apache.org/releases/release-0.13.0#optimizing-record-payload-handling">release 0.13.0</a>, a new experimental interface, <code>HoodieRecordMerger</code>, has been introduced to replace <code>HoodieRecordPayload</code> and serve as the unified merging API.</p><h4>Start commit</h4><p>At this step, a write client always checks if there are any failed actions remaining on the table's Timeline and performs a rollback accordingly before initiating the write operation by creating a "requested" commit action on the Timeline.</p><h4>Prepare records</h4><p>The provided <code>HoodieRecord</code>s may optionally undergo deduplication and indexing based on user configurations and the operation type. If deduplication is necessary, records with the same key will be merged into one. If indexing is required, the "currentLocation" will be populated if the record exists.</p><p>The topic of indexing logic with various index types is crucial and warrants a dedicated post. 
For the purpose of understanding write flows, it is important to remember that an index is responsible for locating physical files for the given records.</p><h4>Partition records</h4><p>This is an essential pre-write step that determines which record goes into which FileGroup and, ultimately, which physical file. Incoming records will be assigned to update buckets and insert buckets, implying different strategies for subsequent file writing. Each bucket represents one RDD partition for distributed processing, as is the case with Spark.</p><h4>Write to storage</h4><p>This is when the actual I/O operations occur. Physical data files are either created or appended to using file writing handles. Before that, marker files may also be created in the <code>.hoodie/.temp/</code> directory to indicate the type of write operation that will be performed for the corresponding data files. This is valuable for efficient rollback and conflict resolution scenarios.</p><h4>Update index</h4><p>After data is written to disk, there may be a need to immediately update the index data to ensure read/write correctness. This applies specifically to index types that are not synchronously updated during writing, such as the HBase index hosted in an HBase server.</p><h4>Commit changes</h4><p>In this final step, the write client will undertake multiple tasks to correctly conclude the transactional write. For example, it may run pre-commit validation if configured, check for conflicts with concurrent writers, save commit metadata to the Timeline, reconcile WriteStatus with marker files, and so on.</p><h2>Write Operations</h2><p>Upserting data is a common scenario in Lakehouse pipelines. 
In this section, we will delve into the Upsert flow for a CoW table in detail, followed by a brief overview of all other supported write operations.</p><h3>Upsert</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!De5z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!De5z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 424w, https://substackcdn.com/image/fetch/$s_!De5z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 848w, https://substackcdn.com/image/fetch/$s_!De5z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 1272w, https://substackcdn.com/image/fetch/$s_!De5z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!De5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png" width="1456" height="1840" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1840,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!De5z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 424w, https://substackcdn.com/image/fetch/$s_!De5z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 848w, https://substackcdn.com/image/fetch/$s_!De5z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 1272w, https://substackcdn.com/image/fetch/$s_!De5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb5a8fa-08c8-40c6-bc7a-d333b82beba8_2522x3188.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Hudi upsert flow for a CoW table</figcaption></figure></div><ol><li><p>Write client starts the commit and creates the "requested" action on Timeline.</p></li><li><p>Input records undergo the preparation step: duplicates are merged, and target file locations are populated by the index. At this point in the process, we have the exact records to be written and know which of those exist in the table, along with their respective locations (FileGroups).</p></li><li><p>Prepared records are categorized into "update" and "insert" buckets. Initially, a WorkloadProfile is constructed to gather information on the number of updates and inserts in the relevant physical partitions. This data is then serialized into an "inflight" action on the Timeline. Subsequently, based on the WorkloadProfile, buckets are generated to hold the records. For updates, each updating FileGroup is assigned as an update bucket. 
In the case of inserts, the small-file handling logic comes into play: any BaseFile smaller than a specified threshold (determined by <code>hoodie.parquet.small.file.limit</code>) becomes a candidate for accommodating the inserts, with its enclosing FileGroup being designated as an update bucket. If no such BaseFile exists, insert buckets will be allocated, and new FileGroups will be created for them later.</p></li><li><p>The bucketized records are then processed through file-writing handles for actual persistence to storage. In the case of records in the update buckets, "merge" handles are used, resulting in the creation of new FileSlices within the existing FileGroups (achieved by merging with data from the old FileSlices). For records in the insert buckets, "create" handles are utilized, leading to the creation of entirely new FileGroups. This process is done by <code>HoodieExecutor</code>s, which employ a producer-consumer pattern for reading and writing records.</p></li><li><p>Once all data has been written, the file-writing handles return collections of WriteStatus that contain metadata about the writes, including the number of errors, the number of inserts performed, the total written size in bytes, and more. This information is sent back to the Spark driver for aggregation. If no errors have occurred, the write client will generate commit metadata and persist it as a completed action on the Timeline.</p></li></ol><p>Upserting to a MoR table follows a very similar flow, with a different set of conditions to determine the types of file-writing handles used for both updates and inserts.</p><h3>Insert &amp; Bulk Insert</h3><p>The Insert flow is very similar to Upsert, with the key difference being the absence of an indexing step. 
This makes the entire writing process faster (and faster still if deduplication is turned off), but it may result in duplicates in the table.</p><p>Bulk Insert follows the same semantics as Insert, meaning it can also result in duplicates due to the absence of indexing. However, the distinction lies in the absence of small-file handling for Bulk Insert. The records partitioning strategy is determined by setting <code>BulkInsertSortMode</code> or can be customized by implementing <code>BulkInsertPartitioner</code>. Bulk Insert also enables row-writing mode by default for Spark, bypassing Avro data model conversion at the "transform input" step and working directly with the engine-native data model <code>Row</code>. This mode makes writes even more efficient.</p><p>Overall, Bulk Insert is generally more performant than Insert but may require additional configuration tuning to address small-file issues.</p><h3>Delete</h3><p>The Delete flow can be viewed as a special case of the Upsert flow. The primary difference is that, during the "transform input" step, input records are transformed into <code>HoodieKey</code>s and passed on to subsequent stages, as these are the minimum required data for identifying the records to be deleted. It's important to note that this process results in a hard delete, meaning that the target records will not exist in the new FileSlices of the corresponding FileGroups.</p><h3>Delete Partition</h3><p>Delete Partition follows a completely different flow compared to those introduced above. Instead of input records, it takes a list of physical partition paths, which is configured via <code>hoodie.datasource.write.partitions.to.delete</code>. Because there are no input records, processes such as indexing, partitioning, and writing to storage do not apply. 
Delete Partition saves all the FileGroup IDs of the target partition paths in a <code>.replacecommit</code> action on the Timeline, ensuring that subsequent writers and readers treat them as deleted.</p><h3>Insert Overwrite &amp; Insert Overwrite Table</h3><p>Insert Overwrite completely rewrites partitions with the provided records. This flow can be effectively seen as a combination of Delete Partition and Bulk Insert: it extracts affected partition paths from the input records, marks all existing FileGroups in those partitions as deleted, and creates new FileGroups to store the incoming records.</p><p>Insert Overwrite Table is a variation of Insert Overwrite. Instead of extracting affected partition paths from input records, it fetches all partition paths of the table for the purpose of overwriting.</p><h2>Recap</h2><p>In this post, we have explored the common high-level steps in Hudi write paths, delved into the CoW Upsert flow with a detailed explanation of record partitioning logic, and introduced all other write operations. 
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (2/10)]]></title><description><![CDATA[Dive into read operation flow and query types]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-210</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-210</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Wed, 06 Sep 2023 08:49:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a 
href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>In <a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-110">the previous post</a>, we discussed the data layout within a Hudi table and introduced the two table types, CoW and MoR, along 
with their respective trade-offs. Building on that, we will now explore how read operations work in Hudi.</p><p>There are several engines, such as Spark, Presto, and Trino, integrated with Hudi that enable you to execute analytical queries. Although the integration APIs may differ, the fundamental process in distributed query engines remains consistent. This process entails interpreting the input SQL, creating a query plan for execution on worker nodes, and collecting the results to return to users.</p><p>In this post, I have selected Spark<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> as the example engine to illustrate the flow of read operations and provide code snippets to showcase the usage of various Hudi query types. I will begin by introducing Spark queries with a primer, then delve into the Hudi-Spark integration points, and finally, explain the different query types.</p><h2>Spark Query Primer</h2><p>Spark SQL is a distributed SQL engine that performs analytical tasks for large-scale data. A typical analytics query begins with user-provided SQL, aiming to retrieve results from a table on storage. 
Spark SQL takes this input and proceeds through multiple phases, as depicted in the diagram below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DHVH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DHVH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 424w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 848w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 1272w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DHVH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png" width="1040" height="513" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:1040,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DHVH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 424w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 848w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 1272w, https://substackcdn.com/image/fetch/$s_!DHVH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390afb68-b600-4c06-8399-cbb12e66de0f_1040x513.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Spark SQL query planning flow</figcaption></figure></div><p>During the analysis phase, the input is parsed, resolved, and converted into a tree structure that works as an abstraction of the SQL statement. The table catalog is consulted for information such as table names and column types.</p><p>At the logical optimization step, the tree is evaluated and optimized at the logical layer. Some common optimizations include predicate pushdown, schema pruning, and null propagation. This step generates a Logical Plan that outlines the necessary computations for the query. Since it is a logical representation, the Logical Plan lacks the specifics needed for running on actual nodes.</p><p>Physical planning serves as the bridge between the logical layer and the physical layer. A Physical Plan specifies the precise manner in which computations should be executed. 
For instance, in a Logical Plan, there may be a join node indicating a join operation, whereas in the Physical Plan, the join operation could be specified as a sort-merge join or a broadcast-hash join, depending on size estimates from the relevant tables. The optimal Physical Plan is selected for code generation and actual execution.</p><p>The three phases are features provided by <a href="https://www.databricks.com/glossary/catalyst-optimizer">Catalyst Optimizer</a>. For further study on this topic, you may explore excellent talks like the ones linked <a href="https://www.youtube.com/watch?v=RmUn5vHlevc">here</a> and <a href="https://www.youtube.com/watch?v=ywPuZ_WrHT0">here</a>.</p><p>During execution, a Spark application operates on the foundational data structure known as RDD (Resilient Distributed Dataset). RDDs are collections of JVM objects that are immutable, partitioned across nodes, and fault-tolerant due to the tracking of data lineage information. As the application runs, the planned computations are performed: RDDs are transformed and acted upon to produce results. This process is also commonly referred to as "materializing" the RDDs.</p><h3>Data Source API</h3><p>While Catalyst Optimizer is formulating query plans, connecting to the data source becomes advantageous, enabling optimizations to be pushed down. Spark's Data Source API is designed to provide extensibility for integrating with a wide range of data sources. Some sources are supported out-of-the-box, such as JDBC, Hive tables, and Parquet files. 
Hudi tables, owing to the specific data layout, represent another type of custom data source.</p><h2>Spark-Hudi Read Flow</h2><p>The diagram below illustrates some key interfaces and method calls in the Spark-Hudi read flow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W4G9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W4G9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 424w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 848w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 1272w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W4G9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png" width="1456" height="1367" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/acb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W4G9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 424w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 848w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 1272w, https://substackcdn.com/image/fetch/$s_!W4G9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb15e52-7fad-436c-a855-4dbcf5539c05_1691x1588.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Hudi read operation flow using Spark</figcaption></figure></div><ol><li><p><code>DefaultSource</code> serves as the entry point of the integration, defining the data source&#8217;s format as <code>org.apache.hudi</code> or <code>hudi</code>. It provides a <code>BaseRelation</code>, which Hudi uses to implement the data extraction process.</p></li><li><p><code>buildScan()</code> is a core API to pass filters to data sources for optimizations. Hudi defines <code>collectFileSplits()</code> for gathering relevant files.</p></li><li><p><code>collectFileSplits()</code> passes all the filters to a <code>FileIndex</code> object that helps identify the necessary files to read.</p></li><li><p><code>FileIndex</code> locates all the relevant <code>FileSlice</code>s for further processing.</p></li><li><p><code>composeRDD()</code> is invoked after <code>FileSlice</code>s are identified.</p></li><li><p><code>FileSlice</code>s are loaded and read out as <code>RDD</code>s. 
For columnar files like Base Files in Parquet, this read operation minimizes the transferred bytes by reading only the necessary columns.</p></li><li><p><code>RDD</code>s are returned from the API for further planning and code generation.</p></li></ol><p>Please note that the steps above give only a high-level overview of the read flow, omitting details such as support for schema-on-read and advanced indexing techniques like data skipping using a metadata table.</p><p>The flow is common to all Hudi query types with Spark<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. In the following sections, I will explain how the various query types work. All of them, except for Read-Optimized, are applicable to both CoW and MoR tables.</p><h3>Snapshot Query</h3><p>This is the default query type when reading Hudi tables. It aims to retrieve the latest records from the table, essentially capturing a "snapshot" of the table at the time of the query. When performed on MoR tables, Log Files are merged with the Base File at read time, which adds some read latency.</p><p>After launching a spark-sql shell with the Hudi dependency<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, you can run these SQL statements to set up a MoR table with one record inserted and then updated twice.</p><pre><code>create table hudi_mor_example (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
) location '/tmp/hudi_mor_example';

set hoodie.spark.sql.insert.into.operation=UPSERT;
insert into hudi_mor_example select 1, 'foo', 10, 1000;
insert into hudi_mor_example select 1, 'foo', 20, 2000;
insert into hudi_mor_example select 1, 'foo', 30, 3000;</code></pre><p>You can execute a snapshot query by running a SELECT statement as shown below, and it will retrieve the latest value of the record.</p><pre><code>spark-sql&gt; select id, name, price, ts from hudi_mor_example;
1       foo     30.0    3000
Time taken: 0.161 seconds, Fetched 1 row(s)</code></pre><h3>Read-Optimized (RO) Query</h3><p>The RO query type trades data freshness for lower read latency, so it applies only to MoR tables. For such queries, <code>collectFileSplits()</code> fetches only the Base Files of the FileSlices and skips the Log Files.</p><p>The setup code above automatically generates a catalog table named <code>hudi_mor_example_ro</code>, which carries the property <code>hoodie.query.as.ro.table=true</code>. This property instructs query engines to always perform RO queries on it. Running the SELECT statement below returns the original value of the record, because the subsequent updates still sit in Log Files and have not yet been compacted into the Base File.</p><pre><code>spark-sql&gt; select id, name, price, ts from hudi_mor_example_ro;
1       foo     10.0    1000
Time taken: 0.114 seconds, Fetched 1 row(s)</code></pre><h3>Time Travel Query</h3><p>By specifying a timestamp, users can request a historical snapshot of a Hudi table at the given time. As previously discussed in <a href="https://datumagic.substack.com/i/135356155/data">post 1</a>, FileSlices are associated with specific commit times and can therefore be filtered by time. For a time travel query, the <code>FileIndex</code> locates only the FileSlices whose commit time matches the specified time or, if there is no exact match, the latest ones committed before it.</p><pre><code>spark-sql&gt; select id, name, price, ts from hudi_mor_example timestamp as of '20230905221619987';
1       foo     30.0    3000
Time taken: 0.274 seconds, Fetched 1 row(s)
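
-- A hedged sketch, not run here (so no output is shown): assuming the commit instant above,
-- the same point in time written in the 'yyyy-MM-dd HH:mm:ss.SSS' form should resolve to the
-- same snapshot. The statement is left commented out so the transcript still has two queries.
-- select id, name, price, ts from hudi_mor_example timestamp as of '2023-09-05 22:16:19.987';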

spark-sql&gt; select id, name, price, ts from hudi_mor_example timestamp as of '20230905221619986';
1       foo     20.0    2000
Time taken: 0.241 seconds, Fetched 1 row(s)</code></pre><p>The first SELECT statement executes a time travel query precisely at the deltacommit time of the latest insert, providing the most recent snapshot of the table. The second query sets a timestamp earlier than the latest insert&#8217;s, resulting in a snapshot as of the second-to-last insert. </p><p>The timestamp in the example follows the Hudi Timeline&#8217;s format <code>'yyyyMMddHHmmssSSS'</code>. You may also set it in the form of <code>'yyyy-MM-dd HH:mm:ss.SSS'</code> or <code>'yyyy-MM-dd'</code>.</p><h3>Incremental Query</h3><p>Users can set a starting timestamp, with or without an ending timestamp, to retrieve changed records within the specified time window. If no ending time is set, the time window will include the most recent records. Hudi also offers full Change-Data-Capture (CDC) capabilities by enabling additional logs on the writer's side and activating CDC mode for incremental readers. Further details will be covered in a separate post dedicated to incremental processing.</p><h2>Recap</h2><p>In this post, we provided an overview of Spark's Catalyst Optimizer, explored how Hudi implements the Spark Data Source API for reading data, and introduced four distinct Hudi query types. In the upcoming post, I will demonstrate the write flow to further enhance our understanding of Hudi. 
Please feel free to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Versions used: Spark 3.2, Scala 2.12, Hudi 0.14</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Incremental queries have a slightly different flow on the Hudi internals, where FileIndex may not be involved for locating FileSlices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The <a href="https://hudi.apache.org/docs/quick-start-guide/#setup">Hudi quick-start guide</a> has the detailed 
steps.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Hudi: From Zero To One (1/10)]]></title><description><![CDATA[A first glance at Hudi's storage format]]></description><link>https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-110</link><guid isPermaLink="false">https://blog.datumagic.ai/p/apache-hudi-from-zero-to-one-110</guid><dc:creator><![CDATA[Shiyan Xu]]></dc:creator><pubDate>Mon, 28 Aug 2023 21:56:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://hudi.apache.org/">Apache Hudi</a>: From Zero To One</em></p><ul><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-110">Post 1: A first glance at Hudi's storage format</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-210">Post 2: Dive into read operation flow and query types</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-310">Post 3: Understand write flows and operations</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-410">Post 4: All about writer indexes</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-510">Post 5: Introduce table services: compaction, cleaning, and indexing</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-610">Post 6: Demystify clustering and space-filling curves</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-710">Post 7: Concurrently run writers and table services</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-810">Post 8: Read 
and process incrementally</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-910">Post 9: Hudi Streamer - a "Swiss Army knife" for ingestion</a></em></p></li><li><p><em><a href="https://datumagic.substack.com/p/apache-hudi-from-zero-to-one-1010">Post 10: Becoming "One" - the upcoming 1.0 highlights</a></em></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/apache-hudi&quot;,&quot;text&quot;:&quot;Follow on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/company/apache-hudi"><span>Follow on LinkedIn</span></a></p><p>After dedicating approximately 4 years to working on <a href="https://hudi.apache.org/">Apache Hudi</a>, including 3 years as a committer, I decided to start this blog series with the intention of presenting Hudi's design and usage in an organized and beginner-friendly manner. My aim is to ensure the series is easy to follow for people with some knowledge of distributed data systems. The series will comprise 10 posts, each delving into a key aspect of Hudi. (Why 10? Purely a playful nod to 0 and 1, echoing the series title :) ) The ultimate goal is to help readers understand Hudi with both breadth and depth, enabling them to confidently utilize and contribute to this open-source project.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><h2>Hudi Overview</h2><p>Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. 
The <a href="https://hudi.apache.org/docs/next/hudi_stack">Hudi stack</a> shown below clearly illustrates the major features of the platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z2JJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg" width="1456" height="1029" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1122054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Z2JJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e56f2b-1de1-45cc-8e78-10ba2870b37f_4758x3362.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The Hudi stack</figcaption></figure></div><p>At its core, Hudi defines a table format that organizes the 
data and metadata files within storage systems, allowing for features such as ACID transactions, efficient indexing, and incremental processing to be achieved. The remainder of this post will explore the format details, essentially showcasing the structure of a Hudi table on storage and explaining the roles of different files.</p><h2>Storage Format</h2><p>The diagram below depicts a typical data layout of a Hudi table under the table's base path in storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rpsz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rpsz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 424w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 848w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 1272w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Rpsz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png" width="1153" height="907" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:907,&quot;width&quot;:1153,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117977,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rpsz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 424w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 848w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 1272w, https://substackcdn.com/image/fetch/$s_!Rpsz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35a69da-d721-42e7-94f5-de097b7d5fc6_1153x907.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A typical data layout of a Hudi table</figcaption></figure></div><p>There are two main types of files:</p><ul><li><p>Metadata files located in <code>&lt;base_path&gt;/.hoodie/</code></p></li><li><p>Data files stored within partition paths for partitioned tables, or under the base path for non-partitioned tables</p></li></ul><h3>Metadata</h3><p>The <code>&lt;base_path&gt;/.hoodie/hoodie.properties</code> file contains essential table configurations, such as the table name and version, which both writers and readers of the table adhere to and use.</p><p>Alongside <code>hoodie.properties</code>, there are meta-files that record transactional actions on the table, forming the Hudi table's Timeline.</p><pre><code># an example of deltacommit 
actions on Timeline
20230827233828740.deltacommit.requested
20230827233828740.deltacommit.inflight
20230827233828740.deltacommit</code></pre><p>These meta-files follow the naming pattern below:</p><pre><code>&lt;action timestamp&gt;.&lt;action type&gt;[.&lt;action state&gt;]</code></pre><p>The <strong>&lt;action timestamp&gt;</strong></p><ul><li><p>marks when an action was first scheduled to run.</p></li><li><p>uniquely identifies an action on the Timeline.</p></li><li><p>is monotonically increasing across different actions on a Timeline.</p></li></ul><p>The <strong>&lt;action type&gt;</strong> shows what kind of change the action made. Write action types, such as <code>commit</code> and <code>deltacommit</code>, indicate new write operations (insert, update, or delete) on the table. There are also table service actions, such as <code>compaction</code> and <code>clean</code>, and recovery actions, such as <code>savepoint</code> and <code>restore</code>. We will discuss the different action types in more detail in future posts.</p><p>The <strong>&lt;action state&gt;</strong> can be <strong>requested</strong>, <strong>inflight</strong>, or <strong>completed</strong> (without a suffix). As the names suggest, <strong>requested</strong> means the action is scheduled to run, <strong>inflight</strong> means execution is in progress, and <strong>completed</strong> means the action is done.</p><p>The meta-files for these actions, in JSON or Avro format, contain information about the changes that should be applied to the table or that have been applied. Keeping these transaction logs makes it possible to recreate the table's state, achieve snapshot isolation, and reconcile writer conflicts through concurrency control mechanisms.</p><p>There are other metadata files and directories stored under <code>.hoodie/</code>. For example, the <code>metadata/</code> directory contains further metadata related to actions on the Timeline and serves as an index for readers and writers. 
The <code>.heartbeat/</code> directory stores files for heartbeat management, while the <code>.aux/</code> directory is reserved for various auxiliary purposes.</p><h3>Data</h3><p>Hudi categorizes physical data files into Base Files and Log Files:</p><ul><li><p>A Base File contains the main stored records in a Hudi table and is optimized for reads.</p></li><li><p>A Log File contains changes to the records in its associated Base File and is optimized for writes.</p></li></ul><p>Within a partition path of a Hudi table (as shown in the previous layout diagram), a single Base File and its associated Log Files (of which there can be none or many) are grouped together as a File Slice. Multiple File Slices constitute a File Group. Both the File Group and the File Slice are logical concepts designed to enclose physical files, simplifying access and manipulation for both readers and writers. By defining these models, Hudi can</p><ul><li><p>fulfill both read and write efficiency requirements. Typically, the Base File uses a columnar file format (e.g., Apache Parquet) and the Log File uses a row-based file format (e.g., Apache Avro).</p></li><li><p>achieve versioning across commit actions. Each File Slice is tied to the timestamp of a specific action on the Timeline, and the File Slices within a File Group essentially track how the contained records evolved over time.</p></li></ul><h3>Table Types</h3><p>Hudi defines two table types: Copy-on-Write (CoW) and Merge-on-Read (MoR). The layout differences are as follows: unlike MoR, CoW has no Log Files, and its write operations result in <code>.commit</code> actions instead of <code>.deltacommit</code>. Throughout our discussion, we have been using MoR as the example. 
Understanding CoW becomes straightforward once you grasp MoR: you can treat CoW as a special case of MoR where the records in a Base File and their changes are implicitly merged into a new Base File during each write operation.</p><p>When choosing the table type for a Hudi table, it is important to take the read and write patterns into account, because each type has trade-offs:</p><ul><li><p>CoW has high write amplification because every write rewrites records into new File Slices, while read operations are always optimized. This is well-suited for read-heavy analytical workloads or small tables.</p></li><li><p>MoR has low write amplification because changes are buffered in Log Files and batch-processed to merge and create new File Slices. However, read latency is affected, since reading the latest records requires merging Log Files with the Base File on the fly.</p></li></ul><p>Users may also opt to read only the Base Files of an MoR table to gain read efficiency at the cost of result freshness. We will discuss more about Hudi's different read modes in forthcoming posts. As the Hudi project evolves, the merging costs associated with reading from MoR tables have been reduced over past releases. It is foreseeable that MoR will become the preferred table type for most workload scenarios.</p><h2>Recap</h2><p>In this initial post of the zero-to-one series, we have explored the fundamental concepts of Hudi's storage format to show how metadata and data are structured within Hudi tables. We also briefly explained the different table types and their trade-offs. As shown in the overview diagram, Hudi serves as a comprehensive lakehouse platform offering features across various dimensions. In the upcoming nine posts, I will progressively cover other significant facets of the platform. 
Please don't hesitate to share your feedback and suggest content in the comments section.</p><p><em>Apache Hudi has a thriving community - come and engage with us via <a href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g">Slack</a>, <a href="https://github.com/apache/hudi">GitHub</a>, <a href="https://www.linkedin.com/company/apache-hudi/">LinkedIn</a>, <a href="https://twitter.com/apachehudi">X (Twitter)</a>, and <a href="https://www.youtube.com/@apachehudi">YouTube</a>!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g&quot;,&quot;text&quot;:&quot;Engage on Slack&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://join.slack.com/t/apache-hudi/shared_invite/zt-2ggm1fub8-_yt4Reu9djwqqVRFC7X49g"><span>Engage on Slack</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The concepts introduced in the series will be based on 0.14.0, the latest version at the time of writing.</p><p></p></div></div>]]></content:encoded></item></channel></rss>