
Mastering Text Chunking with Ollama: Advanced Techniques for Processing Large Documents with Local LLMs

A comprehensive guide to advanced text chunking strategies for Ollama, including semantic, hierarchical, and sliding window approaches for processing large documents while maintaining context and coherence.


Daniel Kliewer

Author, Sovereign AI


This is from Sovereign AI: Building Local-First Intelligent Systems.


Mastering Text Chunking with Ollama: A Comprehensive Guide to Advanced Processing

A common challenge when working with large language models is handling text that exceeds the model's context window. Ollama makes it easy to run language models locally, but those models have the same limitation as any other LLM. This guide explores advanced chunking techniques for processing large documents with Ollama while maintaining coherence and context.

Understanding Chunking in the Context of Ollama

Chunking is the process of dividing large text into smaller, manageable segments that fit within a model's token limit. Each model that Ollama serves, such as Llama or Mistral, has its own token limit. Effective chunking isn't just about breaking text apart; it's about doing so intelligently to preserve meaning across segments.

Why Advanced Chunking Matters for Ollama

When working with Ollama, proper chunking techniques become essential for several reasons:

  1. Context Window Constraints: Every model has a finite context window. Many models run through Ollama operate with a window of roughly 2K to 8K tokens unless you configure a larger one, which limits how much text they can process at once.

  2. Memory Efficiency: Even if a model technically supports larger contexts, processing smaller chunks can reduce RAM usage, allowing Ollama to run smoothly on machines with limited resources.

  3. Coherence Across Chunks: Without proper chunking strategies, the model might lose the thread of thought between segments, resulting in disjointed or contradictory outputs.

  4. Processing Efficiency: Well-designed chunking allows for parallel processing and can significantly reduce the time needed to handle large documents.
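
The context window constraint becomes concrete once you budget tokens for a single request. Here's a minimal sketch of that arithmetic. The function name `chunk_token_budget` and the default numbers are assumptions chosen for illustration, not values taken from Ollama.

```python
def chunk_token_budget(context_window, prompt_overhead=200, response_reserve=512):
    """Return how many tokens of document text fit in one request.

    context_window   -- context length the model is run with (assumed value)
    prompt_overhead  -- tokens used by the system prompt and chunk markers (assumed)
    response_reserve -- tokens left free for the model's answer (assumed)
    """
    budget = context_window - prompt_overhead - response_reserve
    if budget <= 0:
        raise ValueError("context window too small for the reserved overhead")
    return budget


print(chunk_token_budget(4096))  # 4096 - 200 - 512 = 3384
```

Whatever chunking strategy you pick, the chunk size you pass to it should come from a budget like this rather than from the model's full context length.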

Advanced Chunking Strategies for Ollama

Let's explore several sophisticated chunking approaches that go beyond basic text splitting:

1. Semantic Chunking

Rather than chunking based solely on character or token count, semantic chunking divides text based on meaning and context.

python
import numpy as np
import spacy
from nltk.tokenize import sent_tokenize  # needs the NLTK "punkt" tokenizer data
from sklearn.metrics.pairwise import cosine_similarity

# Load spaCy model with word vectors (python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")

def semantic_chunking(text, max_tokens=1000, overlap=100):
    # Break into sentences first
    sentences = sent_tokenize(text)

    # Get sentence embeddings
    sentence_embeddings = [nlp(sentence).vector for sentence in sentences]

    # Track token count (approximate: whitespace-separated words)
    token_counts = [len(sentence.split()) for sentence in sentences]

    chunks = []
    current = []  # indices of the sentences in the current chunk
    current_token_count = 0

    for i in range(len(sentences)):
        # If adding this sentence would exceed our limit, start a new chunk
        if current and current_token_count + token_counts[i] > max_tokens:
            chunks.append(" ".join(sentences[j] for j in current))

            # For overlap, carry over the sentences most similar to the
            # upcoming sentence, up to `overlap` tokens in total
            carried = []
            if overlap > 0:
                current_embs = [sentence_embeddings[j] for j in current]
                similarities = cosine_similarity([sentence_embeddings[i]], current_embs)[0]
                budget = overlap
                for k in np.argsort(similarities)[::-1]:  # most similar first
                    j = current[k]
                    if token_counts[j] <= budget:
                        carried.append(j)
                        budget -= token_counts[j]

            # Keep carried-over sentences in their original document order
            current = sorted(carried)
            current_token_count = sum(token_counts[j] for j in current)

        current.append(i)
        current_token_count += token_counts[i]

    # Add the last chunk if it's not empty
    if current:
        chunks.append(" ".join(sentences[j] for j in current))

    return chunks

This approach keeps sentences intact and carries the sentences most relevant to the next chunk across each boundary, giving Ollama more coherent chunks to process.
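
Another common way to chunk by meaning is to start a new chunk wherever neighbouring sentences stop resembling each other. The sketch below shows that idea with no external dependencies. It uses bag-of-words cosine similarity as a crude stand-in for real sentence embeddings, and the names `split_on_topic_shift` and `threshold` are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Bag-of-words counts; a crude stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_on_topic_shift(sentences, threshold=0.1):
    """Start a new chunk whenever adjacent sentences share too little vocabulary."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Ollama runs language models locally.",
    "Local models in Ollama have a context window.",
    "Bread dough needs time to rise.",
    "Rising dough doubles in size.",
]
for chunk in split_on_topic_shift(sentences):
    print(chunk)
```

With these four sentences the split falls between the Ollama sentences and the bread sentences. In practice you would swap `bow_vector` for embeddings from a model and combine the breakpoints with a token limit.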

2. Hierarchical Chunking

Hierarchical chunking creates a tree-like structure where larger documents are first divided into major sections, then subsections, and finally into token-sized chunks.

python
import re

def hierarchical_chunking(document, max_tokens=1000):
    # First level: split on top-level "# " headers, anchored to line starts
    # so that "## " sub-headers are not matched here
    sections = re.split(r'^# .*\n', document, flags=re.MULTILINE)

    # Second level: For each section, split by sub-headers
    subsections = []
    for section in sections:
        if not section.strip():
            continue
        subsecs = re.split(r'^## .*\n', section, flags=re.MULTILINE)
        subsections.extend(s for s in subsecs if s.strip())

    # Final level: split subsections into chunks of at most max_tokens words
    # (word count is a rough proxy for tokens)
    final_chunks = []
    for subsection in subsections:
        words = subsection.split()
        for i in range(0, len(words), max_tokens):
            final_chunks.append(' '.join(words[i:i + max_tokens]))

    return final_chunks

This method is particularly useful for processing structured documents like academic papers or technical documentation with Ollama.
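
One thing a hierarchical split can preserve is where each chunk sits in the document tree. The following sketch records a heading path for every chunk so it can be shown to the model alongside the text. It assumes Markdown-style `#` and `##` headers, and the function name `chunk_with_heading_path` is made up for this example.

```python
import re


def chunk_with_heading_path(document):
    """Split on '#' and '##' headers, recording each chunk's heading path."""
    path = ["", ""]  # current level-1 and level-2 headings
    body, chunks = [], []

    def flush():
        # Emit the text collected so far under the current heading path
        text = "\n".join(body).strip()
        if text:
            heading = " > ".join(p for p in path if p)
            chunks.append({"heading": heading, "text": text})
        body.clear()

    for line in document.splitlines():
        match = re.match(r"^(#{1,2}) (.+)$", line)
        if match:
            flush()
            if len(match.group(1)) == 1:
                path[0], path[1] = match.group(2).strip(), ""
            else:
                path[1] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Setup\nInstall Ollama.\n## Models\nPull a model.\n# Usage\nSend a prompt."
for chunk in chunk_with_heading_path(doc):
    print(chunk["heading"], "|", chunk["text"])
```

Each chunk can then be capped to a token limit as in the final level above, with the heading path prepended so the model knows which section it is reading.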

3. Sliding Window Chunking with Context Retention

This advanced technique maintains continuity by creating overlapping windows of text:

python
def sliding_window_chunking(text, window_size=800, stride=600, context_size=200):
    """
    Process text using a sliding window approach that maintains context
    - window_size: Total size of each chunk in words (a rough proxy for tokens)
    - stride: How far to move the window for each chunk
    - context_size: How much previous context to include with each chunk
    """
    if stride <= 0 or context_size >= window_size:
        raise ValueError("need stride > 0 and context_size < window_size")
    if stride > window_size - context_size:
        raise ValueError("stride must not exceed window_size - context_size, or words are skipped")

    words = text.split()
    chunks = []

    for i in range(0, len(words), stride):
        if i == 0:
            # First chunk has no previous context
            end = window_size
            chunk = " ".join(words[:end])
        else:
            # Previous context, followed by new content filling the rest of the window
            context_part = words[max(0, i - context_size):i]
            end = i + window_size - len(context_part)
            new_part = words[i:end]

            # Mark where previous context ends and new content begins
            chunk = (
                "--- PREVIOUS CONTEXT ---\n" +
                " ".join(context_part) +
                "\n--- NEW CONTENT ---\n" +
                " ".join(new_part)
            )

        chunks.append(chunk)

        # Stop once the new content has reached the end of the text
        if end >= len(words):
            break

    return chunks

This approach is particularly effective for narrative text where continuity between chunks is critical for Ollama to maintain the flow of ideas.
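
To see how window size and stride interact, it helps to look at the word offsets a plain overlapping window produces, without the context markers. This is a small sketch for that purpose only. The helper name `window_spans` is an assumption, and sizes are counted in words.

```python
def window_spans(n_words, window_size=800, stride=600):
    """Return (start, end) word offsets for a plain overlapping sliding window."""
    if not 0 < stride <= window_size:
        raise ValueError("stride must be between 1 and window_size")
    spans, start = [], 0
    while start < n_words:
        end = min(start + window_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        start += stride
    return spans


print(window_spans(2000))  # [(0, 800), (600, 1400), (1200, 2000)]
print(800 - 600)           # words shared by neighbouring windows: 200
```

For a text longer than one window, the number of chunks works out to ceil((n_words - window_size) / stride) + 1, so a smaller stride buys more overlap at the cost of more requests.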

Implementing Advanced Chunking with Ollama

Now let's see how we can apply these chunking strategies with Ollama's API for practical use cases:

python
import requests

def process_with_ollama(chunks, model="llama2", system_prompt=None, timeout=300):
    """
    Process a list of text chunks with Ollama
    """
    responses = []

    # Base URL for Ollama API
    url = "http://localhost:11434/api/generate"

    for i, chunk in enumerate(chunks):
        # Create a metadata-rich prompt for context
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"

        # Prepare the request payload
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }

        # Add system prompt if provided
        if system_prompt:
            payload["system"] = system_prompt

        # Make the API call
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()  # Check for HTTP errors

            # Extract and store the response
            result = response.json()
            responses.append(result["response"])

            print(f"Processed chunk {i+1}/{len(chunks)}")
        except (requests.RequestException, KeyError, ValueError) as e:
            print(f"Error processing chunk {i+1}: {e}")
            responses.append(f"Error: {e}")

    return responses
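
The request body is also where you tell Ollama how much context to run the model with. The generate endpoint accepts an `options` object, and `num_ctx` and `num_predict` are two of its settings. The helper below is a sketch of how a payload might be built; the function name and the default values are assumptions for illustration, and it makes no network call.

```python
def build_generate_payload(model, prompt, system_prompt=None, num_ctx=4096, num_predict=512):
    """Build a request body for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # context length to run the model with
            "num_predict": num_predict,  # cap on the number of generated tokens
        },
    }
    if system_prompt:
        payload["system"] = system_prompt
    return payload


payload = build_generate_payload("llama2", "[Chunk 1 of 3]\n\nSome text")
print(payload["options"])
```

Keeping chunk size, `num_ctx`, and `num_predict` in agreement avoids prompts being silently truncated when a chunk plus its instructions is longer than the context the model is actually running with.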

Advanced Example: Document Analysis with Context Maintenance

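Let's create a more complex workflow that uses semantic chunking for document analysis while maintaining context between chunks. Keep in mind that each call to /api/generate is independent, so the model sees earlier chunks only through whatever the prompt itself carries: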

python
def analyze_document_with_ollama(document_path, model="llama2:13b"):
    """
    Analyze a large document by:
    1. Reading the document
    2. Creating semantic chunks
    3. Processing each chunk while maintaining context
    4. Synthesizing a coherent analysis

    Relies on semantic_chunking(), process_with_ollama(), and the requests
    import defined earlier.
    """
    # Read the document
    with open(document_path, 'r', encoding='utf-8') as f:
        document = f.read()

    # Create semantic chunks
    # (max_tokens is counted in words, so real token counts will be higher)
    print("Creating semantic chunks...")
    chunks = semantic_chunking(document, max_tokens=1800, overlap=200)
    print(f"Document divided into {len(chunks)} semantic chunks")

    # Process each chunk with Ollama
    system_prompt = """
    You are analyzing a document that has been divided into chunks.
    For each chunk:
    1. Identify key points, arguments, and evidence
    2. Note how these connect to previous chunks if applicable
    3. Maintain a coherent understanding of the document as it progresses
    """

    print("Processing chunks with Ollama...")
    chunk_analyses = process_with_ollama(chunks, model=model, system_prompt=system_prompt)

    # Create a final synthesis prompt
    synthesis_prompt = "Below are analyses of different sections of a document:\n\n"
    for i, analysis in enumerate(chunk_analyses):
        synthesis_prompt += f"SECTION {i+1} ANALYSIS:\n{analysis}\n\n"

    synthesis_prompt += """
    Based on these section analyses, provide a comprehensive synthesis of the entire document.
    Include:
    1. The main thesis or argument
    2. Key supporting points and evidence
    3. Any significant counterarguments or limitations
    4. Overall evaluation of the document's effectiveness
    """

    # Process the synthesis with Ollama
    print("Creating final synthesis...")
    synthesis_payload = {
        "model": model,
        "prompt": synthesis_prompt,
        "stream": False
    }

    response = requests.post("http://localhost:11434/api/generate", json=synthesis_payload, timeout=600)
    response.raise_for_status()  # Fail loudly instead of parsing an error body
    synthesis = response.json()["response"]

    return {
        "num_chunks": len(chunks),
        "chunk_analyses": chunk_analyses,
        "synthesis": synthesis
    }
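
Because each generate request starts fresh, one way to give later chunks a view of earlier ones is to pass a running summary forward in the prompt. The sketch below shows that pattern under stated assumptions: `generate` is a hypothetical callable that takes a prompt string and returns the model's text (for example a thin wrapper around /api/generate), and the prompt wording and `max_summary_chars` limit are illustrative.

```python
def analyze_with_rolling_summary(chunks, generate, max_summary_chars=1500):
    """Analyze chunks in order, feeding a running summary into each prompt.

    generate -- hypothetical callable: prompt string in, model text out.
    """
    summary = ""
    analyses = []
    for i, chunk in enumerate(chunks, start=1):
        prompt = (
            f"Summary of the document so far:\n{summary or '(nothing yet)'}\n\n"
            f"[Chunk {i} of {len(chunks)}]\n{chunk}\n\n"
            "Write an updated summary of the whole document so far, "
            "including the key points of this chunk."
        )
        analysis = generate(prompt)
        analyses.append(analysis)
        # Carry the latest summary forward, truncated so prompts stay bounded
        summary = analysis[:max_summary_chars]
    return analyses
```

The trade-off is that chunks must now be processed sequentially, so this pattern cannot be combined directly with parallel processing.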

Advanced Chunking Considerations for Ollama

Token Estimation with Different Models

Models served by Ollama use different tokenizers, so the same text can yield different token counts from one model to the next. Here's a simple utility that estimates token counts from character length:

python
def estimate_tokens(text, model_type="llama2"):
    """
    Estimate token count for different Ollama models
    """
    # Average ratios of tokens to characters for different model families
    # These are approximations and will vary
    token_ratios = {
        "llama2": 0.25,   # ~4 characters per token
        "mistral": 0.23,  # ~4.3 characters per token
        "mpt": 0.22,      # ~4.5 characters per token
        "falcon": 0.26    # ~3.8 characters per token
    }

    ratio = token_ratios.get(model_type.lower(), 0.25)  # Default to llama2 ratio

    # Simple estimation based on character count
    return int(len(text) * ratio)
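
Rather than relying on fixed ratios, you can calibrate the ratio against counts observed from the model you actually run. The sketch below assumes you have collected pairs of text and token count, for instance from the `prompt_eval_count` field that Ollama reports in a generate response (which may also include prompt template tokens, so treat it as approximate). The function name `calibrate_token_ratio` is an assumption.

```python
def calibrate_token_ratio(samples):
    """Estimate tokens per character from (text, observed_token_count) pairs."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(count for _, count in samples)
    if total_chars == 0:
        raise ValueError("need at least one non-empty sample")
    return total_tokens / total_chars


# Made-up sample data for illustration only
ratio = calibrate_token_ratio([("a" * 400, 100), ("b" * 600, 160)])
print(round(ratio, 3))  # 260 tokens / 1000 characters = 0.26
```

A ratio measured this way on your own content type (prose, code, tables) is usually a better input to a chunker than a single per-family constant.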

Handling Code and Technical Content

Code and technical content require special chunking considerations:

python
import re

def chunk_code_document(document, max_tokens=1800):
    """
    Specialized chunking for technical documents with code blocks
    (relies on estimate_tokens() defined above)
    """
    # Split document by Markdown code blocks; the capture group keeps them in the result
    parts = re.split(r'(```\w*\n[\s\S]*?\n```)', document)

    chunks = []
    current_chunk = ""
    current_token_est = 0

    for part in parts:
        # If this is a code block, try to keep it intact
        is_code_block = part.startswith('```') and part.endswith('```')
        part_token_est = estimate_tokens(part)

        # If adding this part would exceed our limit, start a new chunk
        if current_token_est + part_token_est > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = ""
            current_token_est = 0

        # If it's a code block that alone exceeds token limit, we need to split it
        # (any accumulated text was already saved by the check above)
        if is_code_block and part_token_est > max_tokens:
            # Split code by lines, preserving syntax highlighting info
            lang_match = re.match(r'```(\w*)\n', part)
            code_lang = lang_match.group(1) if lang_match else ""
            code_lines = part[3 + len(code_lang):-3].strip('\n').split('\n')

            header = f"```{code_lang}\n"
            current_code_chunk = header
            current_code_tokens = estimate_tokens(header)

            for line in code_lines:
                line_tokens = estimate_tokens(line + '\n')
                # Leave ~100 tokens of headroom for the closing fence
                if current_code_tokens + line_tokens > max_tokens - 100 and current_code_chunk != header:
                    chunks.append(current_code_chunk + "```")
                    current_code_chunk = header
                    current_code_tokens = estimate_tokens(header)
                current_code_chunk += line + '\n'
                current_code_tokens += line_tokens

            # Add the last code chunk if not empty
            if current_code_chunk != header:
                chunks.append(current_code_chunk + "```")
        else:
            # Regular text or small code block
            current_chunk += part
            current_token_est += part_token_est

    # Add the last chunk if not empty
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
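
After chunking a document that contains code, it is worth checking that no chunk ends inside an open code block. Here's a small sanity check, offered as a sketch: the helper name `has_balanced_fences` is an assumption, and it only looks at lines that begin with a triple-backtick fence.

```python
FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def has_balanced_fences(chunk):
    """Return True if every code fence opened in the chunk is also closed."""
    fence_lines = [line for line in chunk.splitlines() if line.strip().startswith(FENCE)]
    return len(fence_lines) % 2 == 0


good = f"Intro text\n{FENCE}python\nprint('hi')\n{FENCE}\nMore text"
bad = f"Intro text\n{FENCE}python\nprint('hi')"
print(has_balanced_fences(good), has_balanced_fences(bad))  # True False
```

Running a check like this over every chunk before sending it to the model catches the most damaging kind of split, where the model receives half a code block with no closing fence.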

Parallel Processing with Ollama

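For large documents, you can send multiple chunks to Ollama concurrently to save time. How many requests actually run at once depends on the server's configuration (for example the OLLAMA_NUM_PARALLEL setting) and on available memory; requests beyond that are queued: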

python
import concurrent.futures
import requests

def process_chunks_in_parallel(chunks, model="llama2", max_workers=4, timeout=300):
    """
    Process multiple chunks in parallel with Ollama
    """
    def process_chunk(chunk_data):
        i, chunk = chunk_data
        url = "http://localhost:11434/api/generate"
        prompt = f"[Chunk {i+1} of {len(chunks)}]\n\n{chunk}"

        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False
        }

        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()["response"]
        except Exception as e:
            return f"Error processing chunk {i+1}: {e}"

    # Pre-size the list so each result lands at its chunk's original position
    results = [None] * len(chunks)

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all chunks for processing
        future_to_index = {executor.submit(process_chunk, (i, chunk)): i
                           for i, chunk in enumerate(chunks)}

        # Process results as they complete
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as e:
                results[index] = f"Error: {e}"

    return results
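
A local server under load can fail individual requests, so a parallel runner often benefits from retries. The sketch below is one possible variation rather than a replacement for the function above: `worker` is a hypothetical callable that sends one chunk to the model, and the retry count and backoff delay are illustrative defaults.

```python
import concurrent.futures
import time


def map_chunks(chunks, worker, max_workers=4, retries=2, backoff=0.5):
    """Run worker(index, chunk) over all chunks in threads, preserving order.

    Failed calls are retried with a growing delay before the error is recorded.
    """
    def attempt(item):
        index, chunk = item
        for try_number in range(retries + 1):
            try:
                return worker(index, chunk)
            except Exception as exc:
                if try_number == retries:
                    return f"Error processing chunk {index + 1}: {exc}"
                time.sleep(backoff * (try_number + 1))

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, whatever the completion order
        return list(executor.map(attempt, enumerate(chunks)))
```

Using `executor.map` keeps results aligned with the input chunks without tracking indices by hand, which matters when the outputs are later stitched into a synthesis prompt.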

Best Practices for Ollama Chunking

Based on extensive testing with various Ollama models, here are some best practices:

  1. Retain Document Structure: When possible, align chunk boundaries with natural document divisions like paragraphs, sections, or sentences.

  2. Context Windows: Use a smaller effective window size than the model's maximum to leave room for the model's response.

  3. Model-Specific Tuning:

    • Llama models generally perform better with slightly smaller chunks (1500-1800 tokens)
    • Mistral models can often handle larger coherent chunks (2000+ tokens)
    • Adjust based on your specific model

  4. Metadata Enhancement: Include metadata in each chunk that indicates its position and relationship to other chunks.

  5. Adaptive Chunking: Consider the content type—code, technical text, and narrative content may benefit from different chunking strategies.

  6. System Prompts: Use clear system prompts to tell Ollama how to handle chunked content.
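
The metadata practice above can be as simple as a header line on every chunk. Here's a minimal sketch; the function name `add_chunk_metadata` and the header wording are assumptions, and you may want to add section titles or page ranges where your documents have them.

```python
def add_chunk_metadata(chunks, title):
    """Prefix each chunk with its position and the document title."""
    total = len(chunks)
    labelled = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[Document: {title} | Chunk {i} of {total}]"
        if i > 1:
            header += " (continues from the previous chunk)"
        labelled.append(f"{header}\n\n{chunk}")
    return labelled


for text in add_chunk_metadata(["First part.", "Second part."], "Guide"):
    print(text)
    print("-----")
```

Pairing a header like this with a system prompt that explains the format gives the model a consistent signal about where it is in the document.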

Common Chunking Pitfalls with Ollama

When implementing chunking with Ollama, be aware of these common issues:

  1. Mid-Sentence Splitting: Avoid splitting sentences between chunks when possible, as this can disrupt the model's understanding.

  2. Losing Key Context: Critical information mentioned early in a document might be missing from later chunks if not properly carried forward.

  3. Tokenizer Mismatches: Remember that character or word counts aren't perfect proxies for token counts, which can lead to chunks that exceed token limits.

  4. Neglecting Document Structure: Splitting without respect to document structure (e.g., cutting across headers or code blocks) often produces poor results.

  5. Overloading Context Windows: Very dense information-rich chunks may overwhelm the model even if they're within token limits.
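
Several of these pitfalls can be caught with a quick audit before any chunk reaches the model. The sketch below flags chunks that look too long and chunks that appear to stop mid-sentence. The name `audit_chunks` and the 4-characters-per-token ratio are assumptions, and the sentence check is a heuristic that will also flag chunks ending in code or a heading.

```python
def audit_chunks(chunks, max_tokens, chars_per_token=4):
    """Flag chunks that look too long or that stop in the middle of a sentence.

    Token counts are approximated as characters / chars_per_token, an assumed
    ratio that should be tuned for the model in use.
    """
    problems = []
    for i, chunk in enumerate(chunks, start=1):
        text = chunk.rstrip()
        estimated = len(text) // chars_per_token
        if estimated > max_tokens:
            problems.append((i, f"about {estimated} tokens, limit {max_tokens}"))
        if text and text[-1] not in ".!?\"')":
            problems.append((i, "does not end at a sentence boundary"))
    return problems


chunks = ["A complete sentence.", "This one stops in the", "x" * 100]
for index, issue in audit_chunks(chunks, max_tokens=20):
    print(index, issue)
```

An empty result does not prove the chunks are good, but a non-empty one is a cheap early warning that the chunker's settings need another look.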

Conclusion: Mastering Ollama with Advanced Chunking

Advanced chunking techniques are essential for getting the most out of Ollama, especially when working with larger documents or complex content. By implementing semantic, hierarchical, or sliding window chunking approaches, you can process content that far exceeds the model's native context window while maintaining coherence and accuracy.

The techniques outlined in this guide will help you build more sophisticated applications with Ollama that can handle real-world document processing tasks efficiently. By understanding the nuances of different chunking strategies and how they interact with different Ollama models, you can create systems that make the most of local LLM capabilities without being constrained by context window limitations.

Remember that the ideal chunking strategy depends on your specific use case, content type, and chosen model. Experiment with the approaches outlined here and adapt them to your particular needs for optimal results.


Sovereign AI: Building Local-First Intelligent Systems

by Daniel Kliewer · Paperback · 72 pages

The hands-on guide to building AI that runs on your hardware, keeps your data private, and eliminates cloud dependence. Working code included.