Mastering Text Chunking with Ollama: Advanced Techniques for Processing Large Documents with Local LLMs
A comprehensive guide to advanced text chunking strategies for Ollama, including semantic, hierarchical, and sliding window approaches for processing large documents while maintaining context and coherence.
Daniel Kliewer
Author, Sovereign AI


Mastering Text Chunking with Ollama: A Comprehensive Guide to Advanced Processing
In today's world of AI and large language models, one of the most common challenges developers face is handling text that exceeds a model's context window. Ollama, while powerful for running local language models, shares this limitation with other LLMs. This comprehensive guide will explore advanced chunking techniques to effectively process large documents with Ollama while maintaining coherence and context.
Understanding Chunking in the Context of Ollama
Chunking is the process of dividing large text into smaller, manageable segments that fit within a model's token limit. Ollama, which provides access to models like Llama, Mistral, and others, has specific token limitations depending on the model you're using. Effective chunking isn't just about breaking text apart—it's about doing so intelligently to preserve meaning across segments.
Why Advanced Chunking Matters for Ollama
When working with Ollama, proper chunking techniques become essential for several reasons:
- Context Window Constraints: Many models commonly run through Ollama operate with context windows of roughly 2K to 8K tokens (the effective size also depends on the `num_ctx` setting used for the request), limiting how much text they can process at once.
- Memory Efficiency: Even if a model technically supports larger contexts, processing smaller chunks can reduce RAM usage, allowing Ollama to run smoothly on machines with limited resources.
- Coherence Across Chunks: Without proper chunking strategies, the model might lose the thread of thought between segments, resulting in disjointed or contradictory outputs.
- Processing Efficiency: Well-designed chunking allows for parallel processing and can significantly reduce the time needed to handle large documents.
Advanced Chunking Strategies for Ollama
Let's explore several sophisticated chunking approaches that go beyond basic text splitting:
1. Semantic Chunking
Rather than chunking based solely on character or token count, semantic chunking divides text based on meaning and context.
```python
import nltk
import numpy as np
import spacy
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# sent_tokenize needs the NLTK Punkt data; download it once if it is missing
# (newer NLTK releases name the resource "punkt_tab").
# nltk.download("punkt")

# Load spaCy model for semantic understanding
nlp = spacy.load("en_core_web_md")

def semantic_chunking(text, max_tokens=1000, overlap=100):
    # Break into sentences first
    sentences = sent_tokenize(text)

    # Get sentence embeddings
    sentence_embeddings = [nlp(sentence).vector for sentence in sentences]

    # Track token count (approximate: whitespace-separated words)
    token_counts = [len(sentence.split()) for sentence in sentences]

    chunks = []
    current_indices = []  # Indices into `sentences` that make up the current chunk
    current_token_count = 0

    for i in range(len(sentences)):
        # If adding this sentence would exceed our limit, start a new chunk
        if current_indices and current_token_count + token_counts[i] > max_tokens:
            chunks.append(" ".join(sentences[j] for j in current_indices))

            if overlap > 0:
                # Rank the finished chunk's sentences by similarity to the incoming sentence
                current_embs = [sentence_embeddings[j] for j in current_indices]
                similarities = cosine_similarity([sentence_embeddings[i]], current_embs)[0]
                num_overlap = max(1, overlap // 10)  # Heuristic for number of sentences
                top = sorted(np.argsort(similarities)[-num_overlap:])  # Keep document order

                # Carry the most similar sentences over into the new chunk
                current_indices = [current_indices[k] for k in top]
                current_token_count = sum(token_counts[j] for j in current_indices)
            else:
                current_indices = []
                current_token_count = 0

        current_indices.append(i)
        current_token_count += token_counts[i]

    # Add the last chunk if it's not empty
    if current_indices:
        chunks.append(" ".join(sentences[j] for j in current_indices))

    return chunks
```
This approach helps keep semantically related content together: when a chunk fills up, the sentences most similar to the next sentence are carried over, giving Ollama more coherent chunks to process.
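A complementary idea is to place chunk boundaries where the similarity between neighboring sentences drops. The sketch below illustrates that under loud assumptions: it swaps the spaCy vectors for a plain bag-of-words cosine so it runs without any model download, and the names `bow_cosine` and `split_on_topic_shift` as well as the `threshold` value are hypothetical choices, not part of any library.

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between two sentences using word-count vectors."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def split_on_topic_shift(sentences, threshold=0.1):
    """Start a new chunk whenever adjacent sentences share too little vocabulary."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if bow_cosine(prev, cur) < threshold:
            chunks.append([cur])      # Similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(cur)    # Still on topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Ollama runs language models locally.",
    "Local models in Ollama have limited context windows.",
    "Bananas contain plenty of potassium.",
    "Potassium from bananas supports muscle function.",
]
chunks = split_on_topic_shift(sentences)
for chunk in chunks:
    print(chunk)
```

With real embeddings the threshold would need tuning per model and per corpus; word overlap is only a stand-in to show the boundary logic.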
2. Hierarchical Chunking
Hierarchical chunking creates a tree-like structure where larger documents are first divided into major sections, then subsections, and finally into token-sized chunks.
```python
import re

def hierarchical_chunking(document, max_tokens=1000):
    # First level: Split by major section headers ("# Title" at the start of a line)
    sections = re.split(r'^# .+\n', document, flags=re.MULTILINE)

    # Second level: For each section, split by sub-headers ("## Title")
    subsections = []
    for section in sections:
        if not section.strip():
            continue
        subsecs = re.split(r'^## .+\n', section, flags=re.MULTILINE)
        subsections.extend([s for s in subsecs if s.strip()])

    # Final level: Split subsections into token-sized chunks
    # (whitespace-separated words are used as a rough token proxy)
    final_chunks = []
    for subsection in subsections:
        words = subsection.split()
        for i in range(0, len(words), max_tokens):
            chunk = ' '.join(words[i:i+max_tokens])
            if chunk.strip():
                final_chunks.append(chunk)

    return final_chunks
```
This method is particularly useful for processing structured documents like academic papers or technical documentation with Ollama.
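One possible extension, sketched here under stated assumptions, keeps the heading path with each chunk so the model can see where a chunk sits in the hierarchy. It assumes Markdown-style `#` and `##` headers, uses word counts as a stand-in for tokens, and the function name `chunk_with_heading_path` and its dictionary output format are hypothetical.

```python
def chunk_with_heading_path(document, max_words=1000):
    """Split a Markdown-style document into chunks tagged with their heading path."""
    chunks = []
    section, subsection, body = "", "", []

    def flush():
        # Emit the accumulated body as one or more word-limited chunks.
        words = " ".join(body).split()
        path = " > ".join(p for p in (section, subsection) if p)
        for i in range(0, len(words), max_words):
            chunks.append({"path": path, "text": " ".join(words[i:i + max_words])})
        body.clear()

    for line in document.splitlines():
        if line.startswith("## "):
            flush()
            subsection = line[3:].strip()
        elif line.startswith("# "):
            flush()
            section, subsection = line[2:].strip(), ""
        else:
            body.append(line)
    flush()
    return chunks

doc = (
    "# Intro\nOllama runs models locally.\n"
    "## Limits\nContext windows are finite.\n"
    "# Methods\nSplit by headers first."
)
result = chunk_with_heading_path(doc)
for item in result:
    print(f"[{item['path']}] {item['text']}")
```

The path can then be prepended to the prompt for each chunk, which is one way to supply the positional metadata the model otherwise lacks.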
3. Sliding Window Chunking with Context Retention
This advanced technique maintains continuity by creating overlapping windows of text:
```python
def sliding_window_chunking(text, window_size=800, stride=600, context_size=200):
    """
    Process text using a sliding window approach that maintains context
    - window_size: The main processing window size (in whitespace-separated words, a rough token proxy)
    - stride: How far to move the window for each chunk (smaller than window_size creates overlap)
    - context_size: How much previous context to include with each chunk
    """
    if stride > window_size - context_size:
        raise ValueError("stride must not exceed window_size - context_size, or text would be skipped")

    words = text.split()
    chunks = []

    for i in range(0, len(words), stride):
        if i == 0:
            # First chunk has no previous context
            end = window_size
            chunk = " ".join(words[:end])
        else:
            # Calculate how much previous context to include
            context_start = max(0, i - context_size)

            # Separate the previous context from the new content
            context_part = words[context_start:i]
            end = i + window_size - len(context_part)
            new_part = words[i:end]

            # Combine with a special separator
            chunk = (
                "--- PREVIOUS CONTEXT ---\n" +
                " ".join(context_part) +
                "\n--- NEW CONTENT ---\n" +
                " ".join(new_part)
            )

        if chunk:
            chunks.append(chunk)

        # Stop once the new content has reached the end of the text
        if end >= len(words):
            break

    return chunks
```
This approach is particularly effective for narrative text where continuity between chunks is critical for Ollama to maintain the flow of ideas.
Implementing Advanced Chunking with Ollama
Now let's see how we can apply these chunking strategies with Ollama's API for practical use cases:
```python
import requests

def process_with_ollama(chunks, model="llama2", system_prompt=None, timeout=300):
    """
    Process a list of text chunks with Ollama
    """
    responses = []

    # Base URL for Ollama API
    url = "http://localhost:11434/api/generate"

    for i, chunk in enumerate(chunks):
        # Create a metadata-rich prompt for context
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"

        # Prepare the request payload
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }

        # Add system prompt if provided
        if system_prompt:
            payload["system"] = system_prompt

        # Make the API call
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()  # Check for HTTP errors

            # Extract and store the response
            result = response.json()
            responses.append(result["response"])

            print(f"Processed chunk {i+1}/{len(chunks)}")
        except (requests.RequestException, KeyError, ValueError) as e:
            print(f"Error processing chunk {i+1}: {e}")
            responses.append(f"Error: {e}")

    return responses
```
Advanced Example: Document Analysis with Context Maintenance
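Let's create a more complex workflow that uses semantic chunking for document analysis and then synthesizes the per-chunk analyses into one result. Each request to Ollama is independent, so continuity between chunks comes only from the overlap built into the chunks and from the final synthesis step: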
```python
import requests

# Relies on semantic_chunking() and process_with_ollama() defined earlier.
def analyze_document_with_ollama(document_path, model="llama2:13b"):
    """
    Analyze a large document by:
    1. Reading the document
    2. Creating semantic chunks
    3. Processing each chunk while maintaining context
    4. Synthesizing a coherent analysis
    """
    # Read the document
    with open(document_path, 'r', encoding='utf-8') as f:
        document = f.read()

    # Create semantic chunks
    print("Creating semantic chunks...")
    chunks = semantic_chunking(document, max_tokens=1800, overlap=200)
    print(f"Document divided into {len(chunks)} semantic chunks")

    # Process each chunk with Ollama
    system_prompt = """
    You are analyzing a document that has been divided into chunks.
    For each chunk:
    1. Identify key points, arguments, and evidence
    2. Note how these connect to previous chunks if applicable
    3. Maintain a coherent understanding of the document as it progresses
    """

    print("Processing chunks with Ollama...")
    chunk_analyses = process_with_ollama(chunks, model=model, system_prompt=system_prompt)

    # Create a final synthesis prompt
    synthesis_prompt = "Below are analyses of different sections of a document:\n\n"
    for i, analysis in enumerate(chunk_analyses):
        synthesis_prompt += f"SECTION {i+1} ANALYSIS:\n{analysis}\n\n"

    synthesis_prompt += """
    Based on these section analyses, provide a comprehensive synthesis of the entire document.
    Include:
    1. The main thesis or argument
    2. Key supporting points and evidence
    3. Any significant counterarguments or limitations
    4. Overall evaluation of the document's effectiveness
    """

    # Process the synthesis with Ollama
    print("Creating final synthesis...")
    synthesis_payload = {
        "model": model,
        "prompt": synthesis_prompt,
        "stream": False
    }

    response = requests.post("http://localhost:11434/api/generate", json=synthesis_payload, timeout=600)
    response.raise_for_status()  # Check for HTTP errors
    synthesis = response.json()["response"]

    return {
        "num_chunks": len(chunks),
        "chunk_analyses": chunk_analyses,
        "synthesis": synthesis
    }
```
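Because each call to `/api/generate` stands alone, one way to actually carry context forward is to keep a running summary and put it into every prompt. The following is a minimal sketch of that idea, not the workflow above: `ollama_generate` and `analyze_with_rolling_summary` are hypothetical names, the prompt wording and the 150-word limit are arbitrary assumptions, and the standard library's `urllib` is used in place of `requests` only so the block has no third-party dependency.

```python
import json
import urllib.request

def ollama_generate(prompt, model="llama2", url="http://localhost:11434/api/generate"):
    """Send one non-streaming request to a local Ollama server and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]

def analyze_with_rolling_summary(chunks, generate=ollama_generate):
    """Analyze chunks in order, feeding a running summary into each prompt."""
    summary = ""
    analyses = []
    for i, chunk in enumerate(chunks, start=1):
        prompt = (
            f"Summary of the document so far:\n{summary or '(nothing yet)'}\n\n"
            f"[Chunk {i} of {len(chunks)}]\n{chunk}\n\n"
            "Analyze this chunk in light of the summary."
        )
        analyses.append(generate(prompt))
        # Ask for an updated summary so later chunks see earlier context.
        summary = generate(
            f"Previous summary:\n{summary}\n\nNew material:\n{chunk}\n\n"
            "Rewrite the summary in under 150 words."
        )
    return analyses, summary
```

This doubles the number of requests, so it trades speed for continuity and cannot be combined with parallel processing, since each step depends on the previous summary. Passing a different `generate` callable makes the loop easy to exercise without a running server.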
Advanced Chunking Considerations for Ollama
Token Estimation with Different Models
Different Ollama models have varying tokenization methods. Here's a simple utility to help estimate token counts across models:
```python
def estimate_tokens(text, model_type="llama2"):
    """
    Estimate token count for different Ollama models
    """
    # Average ratios of tokens to characters for different model families
    # These are approximations and will vary
    token_ratios = {
        "llama2": 0.25,   # ~4 characters per token
        "mistral": 0.23,  # ~4.3 characters per token
        "mpt": 0.22,      # ~4.5 characters per token
        "falcon": 0.26    # ~3.8 characters per token
    }

    ratio = token_ratios.get(model_type.lower(), 0.25)  # Default to llama2 ratio

    # Simple estimation based on character count
    return int(len(text) * ratio)
```
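Since fixed ratios are only approximations, one option is to calibrate the ratio against token counts observed for your own content. The sketch below assumes you have collected `(text, token_count)` pairs yourself, for example from the `prompt_eval_count` field that Ollama returns with a non-streaming `/api/generate` response (that count may also include prompt-template tokens, so treat the result as an estimate). The function name `calibrate_token_ratio` and the sample numbers are made up for illustration.

```python
def calibrate_token_ratio(samples):
    """Return tokens-per-character from (text, observed_token_count) pairs."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(count for _, count in samples)
    if total_chars == 0:
        raise ValueError("samples must contain at least one non-empty text")
    return total_tokens / total_chars

# Illustrative, made-up observations: 400 chars -> 100 tokens, 600 chars -> 160 tokens
samples = [("a" * 400, 100), ("b" * 600, 160)]
ratio = calibrate_token_ratio(samples)

# Apply the calibrated ratio to a new 2000-character text
estimated = int(2000 * ratio)
print(ratio, estimated)
```

A ratio measured on prose will usually underestimate code or non-English text, so it is worth calibrating separately per content type.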
Handling Code and Technical Content
Code and technical content require special chunking considerations:
```python
import re

# Relies on estimate_tokens() defined above.
def chunk_code_document(document):
    """
    Specialized chunking for technical documents with code blocks
    """
    # Split document by Markdown code blocks
    parts = re.split(r'(```[\w]*\n[\s\S]*?\n```)', document)

    chunks = []
    current_chunk = ""
    current_token_est = 0

    for part in parts:
        # If this is a code block, try to keep it intact
        is_code_block = part.startswith('```') and part.endswith('```')
        part_token_est = estimate_tokens(part)

        # If adding this part would exceed our limit, start a new chunk
        if current_token_est + part_token_est > 1800 and current_chunk:
            chunks.append(current_chunk)
            current_chunk = ""
            current_token_est = 0

        # If it's a code block that alone exceeds token limit, we need to split it
        if is_code_block and part_token_est > 1800:
            # Process the large code block separately
            if current_chunk:  # Save any accumulated content first
                chunks.append(current_chunk)
                current_chunk = ""
                current_token_est = 0

            # Split code by lines, preserving syntax highlighting info
            code_lang = re.match(r'```([\w]*)\n', part)
            code_lang = code_lang.group(1) if code_lang else ""

            # Strip only surrounding newlines so the first line keeps its indentation
            code_content = part[3+len(code_lang):-3].strip('\n')
            code_lines = code_content.split('\n')

            code_chunks = []
            current_code_chunk = f"```{code_lang}\n"
            current_code_tokens = estimate_tokens(current_code_chunk)

            for line in code_lines:
                line_tokens = estimate_tokens(line + '\n')
                if current_code_tokens + line_tokens > 1700:  # Leave room for the closing fence
                    current_code_chunk += "```"
                    code_chunks.append(current_code_chunk)
                    current_code_chunk = f"```{code_lang}\n{line}\n"
                    current_code_tokens = estimate_tokens(current_code_chunk)
                else:
                    current_code_chunk += line + '\n'
                    current_code_tokens += line_tokens

            # Add the last code chunk if not empty
            if current_code_chunk != f"```{code_lang}\n":
                current_code_chunk += "```"
                code_chunks.append(current_code_chunk)

            chunks.extend(code_chunks)
        else:
            # Regular text or small code block
            current_chunk += part
            current_token_est += part_token_est

    # Add the last chunk if not empty
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
```
Parallel Processing with Ollama
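For large documents, you can process multiple chunks in parallel to save time. The actual speedup depends on how many requests the Ollama server is configured to serve concurrently (see the `OLLAMA_NUM_PARALLEL` setting) and on available memory; requests beyond that limit are queued: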
```python
import concurrent.futures

import requests

def process_chunks_in_parallel(chunks, model="llama2", max_workers=4, timeout=300):
    """
    Process multiple chunks in parallel with Ollama
    """
    def process_chunk(chunk_data):
        i, chunk = chunk_data
        url = "http://localhost:11434/api/generate"
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"

        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }

        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()["response"]
        except (requests.RequestException, KeyError, ValueError) as e:
            return f"Error processing chunk {i+1}: {e}"

    # Pre-size the list so results stay in document order
    results = [None] * len(chunks)

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all chunks for processing
        future_to_index = {executor.submit(process_chunk, (i, chunk)): i
                           for i, chunk in enumerate(chunks)}

        # Process results as they complete
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as e:
                results[index] = f"Error: {e}"

    return results
```
Best Practices for Ollama Chunking
Based on extensive testing with various Ollama models, here are some best practices:
- Retain Document Structure: When possible, align chunk boundaries with natural document divisions like paragraphs, sections, or sentences.
- Context Windows: Use a smaller effective window size than the model's maximum to leave room for the model's response.
- Model-Specific Tuning:
  - Llama models generally perform better with slightly smaller chunks (1500-1800 tokens)
  - Mistral models can often handle larger coherent chunks (2000+ tokens)
  - Adjust based on your specific model
- Metadata Enhancement: Include metadata in each chunk that indicates its position and relationship to other chunks.
- Adaptive Chunking: Consider the content type: code, technical text, and narrative content may benefit from different chunking strategies.
- System Prompts: Use clear system prompts to tell Ollama how to handle chunked content.
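The advice about leaving room for the response can be turned into simple arithmetic. The sketch below is one way to do it, with every name and number being an assumption: `chunk_token_budget` is a hypothetical helper, the 10% safety margin is an arbitrary cushion for tokenizer estimation error, and `context_window` should match what the server really uses (for example the value passed as `num_ctx` in the request's `options`).

```python
def chunk_token_budget(context_window, response_reserve, prompt_overhead=0, safety_margin=0.1):
    """Tokens available for chunk text after reserving room for everything else."""
    # Shrink the window first to absorb token-estimation error.
    usable = int(context_window * (1 - safety_margin))
    budget = usable - response_reserve - prompt_overhead
    if budget <= 0:
        raise ValueError("context window too small for the requested reserves")
    return budget

# Example: 4096-token window, 512 tokens kept for the answer,
# 150 tokens for the system prompt and chunk header.
budget = chunk_token_budget(context_window=4096, response_reserve=512, prompt_overhead=150)
print(budget)
```

The resulting number is what you would pass as the maximum chunk size to a chunking function, rather than the model's advertised context length.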
Common Chunking Pitfalls with Ollama
When implementing chunking with Ollama, be aware of these common issues:
- Mid-Sentence Splitting: Avoid splitting sentences between chunks when possible, as this can disrupt the model's understanding.
- Losing Key Context: Critical information mentioned early in a document might be missing from later chunks if not properly carried forward.
- Tokenizer Mismatches: Remember that character or word counts aren't perfect proxies for token counts, which can lead to chunks that exceed token limits.
- Neglecting Document Structure: Splitting without respect to document structure (e.g., cutting across headers or code blocks) often produces poor results.
- Overloading Context Windows: Very dense information-rich chunks may overwhelm the model even if they're within token limits.
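Several of these pitfalls can be caught with a cheap check before any chunk is sent to the model. The following sketch is a heuristic only: `find_chunk_problems` is a hypothetical helper, the 4-characters-per-token figure is a rough assumption, and the sentence-ending test will misfire on headings or list items that legitimately lack final punctuation.

```python
FENCE = "`" * 3  # Markdown code fence, built this way to keep this example readable

def find_chunk_problems(chunks, max_tokens=1800, chars_per_token=4):
    """Return (index, problem) pairs for chunks that look risky."""
    problems = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if not text:
            problems.append((i, "empty"))
            continue
        if len(text) / chars_per_token > max_tokens:
            problems.append((i, "over token estimate"))
        if text[-1] not in ".!?\"')`":
            problems.append((i, "ends mid-sentence"))
        if text.count(FENCE) % 2:
            problems.append((i, "unbalanced code fence"))
    return problems

chunks = [
    "Complete sentence.",
    "This one stops in the",
    "x" * 8000 + ".",
    FENCE + "python\nprint(1)\n",
]
problems = find_chunk_problems(chunks)
for index, problem in problems:
    print(index, problem)
```

Running such a check on a sample of chunks is a quick way to compare chunking strategies before spending inference time on them.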
Conclusion: Mastering Ollama with Advanced Chunking
Advanced chunking techniques are essential for getting the most out of Ollama, especially when working with larger documents or complex content. By implementing semantic, hierarchical, or sliding window chunking approaches, you can process content that far exceeds the model's native context window while maintaining coherence and accuracy.
The techniques outlined in this guide will help you build more sophisticated applications with Ollama that can handle real-world document processing tasks efficiently. By understanding the nuances of different chunking strategies and how they interact with different Ollama models, you can create systems that make the most of local LLM capabilities without being constrained by context window limitations.
Remember that the ideal chunking strategy depends on your specific use case, content type, and chosen model. Experiment with the approaches outlined here and adapt them to your particular needs for optimal results.
