AI Text Chunker
CSV, TXT, PDF & DOCX to JSONL
Split large documents instantly for AI applications. Process unlimited file sizes with live progress tracking and intelligent chunking methods.
Choose Your Target Platform
Select where you'll use these chunks
Upload & Configure
Upload your file and configure chunking settings
No file selected
Supports: CSV, TXT, Markdown, PDF, DOCX
Export format: OpenAI Fine-tuning
Live Preview
Integration Examples
Quick-start code for your selected platform
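All of the Node.js snippets below call a small loadJSONL helper that is not part of any SDK. A minimal sketch, assuming the exported file sits on local disk (in a Cloudflare Worker you would instead load it from R2, KV, or a bundled asset):
const fs = require('fs');
// Read a JSONL export and parse each non-empty line into an object
// ({ id, text, metadata, ... } depending on the export format)
async function loadJSONL(path) {
  const raw = await fs.promises.readFile(path, 'utf8');
  return raw
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line));
}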
OpenAI Fine-tuning
Upload JSONL file directly to OpenAI for model fine-tuning
const { OpenAI } = require('openai');
const fs = require('fs');
const openai = new OpenAI();
// Already formatted for OpenAI!
// Each line: {"messages":[...]}
// Upload directly
const response = await openai.files.create({
file: fs.createReadStream('data.jsonl'),
purpose: 'fine-tune'
});
// Create fine-tuning job
await openai.fineTuning.jobs.create({
training_file: response.id,
model: 'gpt-4o-mini-2024-07-18'
});
Cloudflare Workers AI + Vectorize
Generate embeddings and store in Cloudflare Vectorize
// Cloudflare Worker
export default {
async fetch(request, env) {
const chunks = await loadJSONL('data.jsonl');
// Generate embeddings
const vectors = await Promise.all(
chunks.map(async (chunk) => {
const { data } = await env.AI.run(
'@cf/baai/bge-base-en-v1.5',
{ text: chunk.text }
);
return {
id: chunk.id,
values: data[0],
metadata: chunk.metadata
};
})
);
// Insert to Vectorize
await env.VECTORIZE.insert(vectors);
return new Response(`Inserted ${vectors.length} vectors`);
}
};
Cloudflare Workers AI Batch API
Process large-scale embeddings with batch inference
// Cloudflare Batch API
const chunks = await loadJSONL('data.jsonl');
// Prepare batch requests
const batchInput = chunks.map(chunk => ({
text: chunk.text
}));
// Submit batch job
const response = await fetch(
'https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/baai/bge-base-en-v1.5/batch',
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_TOKEN',
'Content-Type': 'application/json'
},
body: JSON.stringify({ inputs: batchInput })
}
);
// Poll for results (pollBatchJob: helper, not shown, that polls the batch job status until it completes)
const job = await response.json();
const results = await pollBatchJob(job.result.id);
Nebius AI Batch Inference
Large-scale batch processing with Nebius AI models
// Nebius Batch Inference
const chunks = await loadJSONL('data.jsonl');
// Prepare batch file (JSONL format)
const batchRequests = chunks.map((chunk, i) => ({
custom_id: `request-${i}`,
method: "POST",
url: "/v1/embeddings",
body: {
model: "text-embedding-3-small",
input: chunk.text
}
}));
// Write batch requests to a local JSONL file
const fs = require('fs');
fs.writeFileSync('batch.jsonl',
batchRequests.map(r => JSON.stringify(r)).join('\n')
);
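// The batch job below references uploadedFileId, so the file must be uploaded first.
// Sketch of that step, assuming Nebius exposes an OpenAI-compatible /v1/files endpoint
// on the same host (verify the exact endpoint against the Nebius docs):
const form = new FormData();
form.append('purpose', 'batch');
form.append('file', new Blob([fs.readFileSync('batch.jsonl')]), 'batch.jsonl');
const fileUpload = await fetch(
'https://api.tokenfactory.nebius.com/v1/files',
{
method: 'POST',
headers: { 'Authorization': 'Bearer YOUR_API_KEY' },
body: form
}
);
const { id: uploadedFileId } = await fileUpload.json();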
// Submit batch job
const response = await fetch(
'https://api.tokenfactory.nebius.com/v1/batches',
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
input_file_id: uploadedFileId,
endpoint: "/v1/embeddings",
completion_window: "24h"
})
}
);
Cohere Embed Jobs API
Async batch embeddings with Cohere's Embed Jobs
const { CohereClient } = require('cohere-ai');
const cohere = new CohereClient({ token: 'YOUR_API_KEY' });
// Load chunks
const chunks = await loadJSONL('data.jsonl');
const texts = chunks.map(c => c.text);
// Batch-embed the texts (Cohere's async Embed Jobs API requires uploading a dataset first; the synchronous embed call is shown here)
const response = await cohere.embed({
texts: texts,
model: 'embed-english-v3.0',
inputType: 'search_document',
embeddingTypes: ['float']
});
// Get embeddings
const embeddings = response.embeddings.float;
// Store with metadata
const vectors = embeddings.map((emb, i) => ({
id: chunks[i].id,
values: emb,
metadata: chunks[i].metadata
}));
LangChain / LlamaIndex
Load chunks as documents for RAG pipelines
import { Document } from 'langchain/document';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
// Load chunks
const chunks = await loadJSONL('data_langchain.jsonl');
// Convert to LangChain documents
const docs = chunks.map(c => new Document({
pageContent: c.page_content,
metadata: c.metadata
}));
// Create vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
docs,
new OpenAIEmbeddings()
);
Pinecone / Weaviate / Chroma
Generate embeddings and upsert to vector databases
import { Pinecone } from '@pinecone-database/pinecone';
import { OpenAI } from 'openai';
const pinecone = new Pinecone();
const openai = new OpenAI();
const index = pinecone.index('my-index');
// Load chunks
const chunks = await loadJSONL('data.jsonl');
// Generate embeddings and upsert
const vectors = await Promise.all(
chunks.map(async (chunk) => {
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: chunk.text
});
return {
id: chunk.id,
values: embedding.data[0].embedding,
metadata: { text: chunk.text, ...chunk.metadata }
};
})
);
await index.upsert(vectors);
Python / Jupyter Notebooks
Load and process chunks in Python environments
import json
# Load JSONL file
chunks = []
with open('data.jsonl', 'r') as f:
    for line in f:
        chunks.append(json.loads(line))
# Process chunks
for chunk in chunks:
    print(f"ID: {chunk['id']}")
    print(f"Text: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['metadata']['estimatedTokens']}")
    print("---")
About Text Chunking
Why chunk text for AI?
- Stay within limits: Most LLMs have token limits (4K-128K tokens)
- Better embeddings: Smaller chunks create more focused vector embeddings
- RAG applications: Essential for semantic search and retrieval
- Cost optimization: Process only relevant chunks to reduce API costs
Chunking methods
- Recursive (smart): tries to keep related content together by splitting on multiple separators in order (paragraphs → sentences → words → characters); see the sketch below.
- Sentence: splits on sentence boundaries to maintain grammatical coherence. Best for question-answering systems.
- Fixed-size: simple splitting at exact token/character counts. Fast, but may break in the middle of sentences.
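A rough sketch of the recursive strategy described above (the separator order and chunk size here are illustrative, not the tool's exact parameters):
// Try the coarsest separator first, greedily pack pieces up to maxChars,
// recurse with finer separators on oversized pieces, and hard-cut as a last resort.
function recursiveSplit(text, maxChars = 2000, separators = ['\n\n', '. ', ' ']) {
  if (text.length <= maxChars) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: cut at exact character counts
    const pieces = [];
    for (let i = 0; i < text.length; i += maxChars) pieces.push(text.slice(i, i + maxChars));
    return pieces;
  }
  const parts = text.split(sep);
  const chunks = [];
  let current = '';
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = '';
      if (part.length > maxChars) {
        // This piece alone is too big: recurse with the remaining, finer separators
        chunks.push(...recursiveSplit(part, maxChars, rest));
      } else {
        current = part;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}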
Note: Token estimation uses the standard ~4 characters per token ratio. For production use with a specific model, use that model's exact tokenizer (e.g., tiktoken for OpenAI GPT models, or Anthropic's token-counting API for Claude).
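For reference, the estimate is just a character count divided by four (a sketch; swap in the model's real tokenizer for exact counts):
// Rough estimate used for the estimatedTokens metadata: ~4 characters per token
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
// Exact count for OpenAI models with the tiktoken package, for comparison:
// const { get_encoding } = require('tiktoken');
// const exact = get_encoding('cl100k_base').encode(text).length;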