AI Text Chunker

CSV, TXT, PDF & DOCX to JSONL

Split large documents instantly for AI applications. Process unlimited file sizes with live progress tracking and intelligent chunking methods.

Memory Optimized · Token-Based · Instant Processing

Choose Your Target Platform

Select where you'll use these chunks

Upload & Configure

Upload your file and configure chunking settings

Supports: CSV, TXT, Markdown, PDF, DOCX

Chunk size: 128-2048 tokens
Chunk overlap: 0%-50%

Export format: OpenAI Fine-tuning

Integration Examples

Quick-start code for your selected platform
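
Most of the Node snippets below reference a loadJSONL helper. It is not part of any SDK; it is a small function you define yourself to read the exported file. A minimal sketch (the file path is whatever you exported from the chunker):

// loadJSONL: read a .jsonl file and parse one JSON object per line (Node)
const fs = require('fs/promises');

async function loadJSONL(path) {
  const raw = await fs.readFile(path, 'utf8');
  return raw
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line));
}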

OpenAI Fine-tuning

Upload JSONL file directly to OpenAI for model fine-tuning

const { OpenAI } = require('openai');
const fs = require('fs');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The exported JSONL is already in OpenAI chat format
// Each line: {"messages":[...]}

// Upload directly
const response = await openai.files.create({
  file: fs.createReadStream('data.jsonl'),
  purpose: 'fine-tune'
});

// Create fine-tuning job
await openai.fineTuning.jobs.create({
  training_file: response.id,
  model: 'gpt-4o-mini-2024-07-18'
});
Cloudflare Workers AI + Vectorize

Generate embeddings and store in Cloudflare Vectorize

// Cloudflare Worker
export default {
  async fetch(request, env) {
    // Workers have no filesystem: read the exported JSONL from the request
    // body (or from R2/KV) and parse one JSON object per line
    const body = await request.text();
    const chunks = body.trim().split('\n').map(line => JSON.parse(line));

    // Generate embeddings
    const vectors = await Promise.all(
      chunks.map(async (chunk) => {
        const { data } = await env.AI.run(
          '@cf/baai/bge-base-en-v1.5',
          { text: chunk.text }
        );
        return {
          id: chunk.id,
          values: data[0],
          metadata: chunk.metadata
        };
      })
    );

    // Insert into Vectorize
    await env.VECTORIZE.insert(vectors);
    return Response.json({ inserted: vectors.length });
  }
};
Cloudflare Workers AI Batch API

Process large-scale embeddings with batch inference

// Cloudflare Batch API
const chunks = await loadJSONL('data.jsonl');

// Prepare batch requests
const batchInput = chunks.map(chunk => ({
  prompt: chunk.text
}));

// Submit batch job
const response = await fetch(
  'https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/@cf/baai/bge-base-en-v1.5/batch',
  {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_TOKEN',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ inputs: batchInput })
  }
);

// Poll for results (pollBatchJob is a helper you implement yourself;
// a sketch follows below)
const job = await response.json();
const results = await pollBatchJob(job.result.id);
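
The snippet above calls a pollBatchJob helper that is not defined anywhere. Roughly, it should fetch the job status on an interval until the batch finishes. A minimal polling sketch follows; the status URL and the status/results fields are assumptions for illustration only, so check the Workers AI batch documentation for the actual route and response shape:

// Hypothetical polling helper: the status URL and response fields below are
// assumptions, not the documented Workers AI batch API
async function pollBatchJob(jobId, intervalMs = 5000) {
  const statusUrl =
    `https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/batch/${jobId}`;
  while (true) {
    const res = await fetch(statusUrl, {
      headers: { 'Authorization': 'Bearer YOUR_API_TOKEN' }
    });
    const job = await res.json();
    if (job.status === 'completed') return job.results;
    if (job.status === 'failed') throw new Error('Batch job failed');
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}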
Nebius AI Batch Inference

Large-scale batch processing with Nebius AI models

// Nebius Batch Inference
const chunks = await loadJSONL('data.jsonl');

// Prepare batch file (JSONL format)
const batchRequests = chunks.map((chunk, i) => ({
  custom_id: `request-${i}`,
  method: "POST",
  url: "/v1/embeddings",
  body: {
    model: "text-embedding-3-small",
    input: chunk.text
  }
}));

// Upload batch file
const fs = require('fs');
fs.writeFileSync('batch.jsonl',
  batchRequests.map(r => JSON.stringify(r)).join('\n')
);

// Submit batch job (uploadedFileId is the file ID returned after uploading
// batch.jsonl to the provider's files endpoint; that step is not shown here)
const response = await fetch(
  'https://api.tokenfactory.nebius.com/v1/batches',
  {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      input_file_id: uploadedFileId,
      endpoint: "/v1/embeddings",
      completion_window: "24h"
    })
  }
);
Cohere Embed Jobs API

Async batch embeddings with Cohere's Embed Jobs

const { CohereClient } = require('cohere-ai');
const cohere = new CohereClient({ token: 'YOUR_API_KEY' });

// Load chunks
const chunks = await loadJSONL('data.jsonl');
const texts = chunks.map(c => c.text);

// Embed the texts (the synchronous endpoint caps how many texts fit in one
// request, so slice large arrays; the async Embed Jobs API takes an
// uploaded dataset for very large batches)
const response = await cohere.embed({
  texts: texts,
  model: 'embed-english-v3.0',
  inputType: 'search_document',
  embeddingTypes: ['float']
});

// Get embeddings
const embeddings = response.embeddings.float;

// Store with metadata
const vectors = embeddings.map((emb, i) => ({
  id: chunks[i].id,
  values: emb,
  metadata: chunks[i].metadata
}));
LangChain / LlamaIndex

Load chunks as documents for RAG pipelines

// LangChain JS (older import paths shown; newer releases import from
// @langchain/core/documents and @langchain/openai)
import { Document } from 'langchain/document';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

// Load chunks
const chunks = await loadJSONL('data_langchain.jsonl');

// Convert to LangChain documents
const docs = chunks.map(c => new Document({
  pageContent: c.page_content,
  metadata: c.metadata
}));

// Create vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings()
);
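
Once the store is built, retrieval is a single call. A short follow-up using the vectorStore created above (the query string is just an example):

// Retrieve the 4 chunks most similar to an example query
const results = await vectorStore.similaritySearch(
  'What does the document say about pricing?',
  4
);
console.log(results.map(doc => doc.pageContent));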
Pinecone / Weaviate / Chroma

Generate embeddings and upsert to vector databases

import { Pinecone } from '@pinecone-database/pinecone';
import { OpenAI } from 'openai';

const pinecone = new Pinecone();
const openai = new OpenAI();
const index = pinecone.index('my-index');

// Load chunks
const chunks = await loadJSONL('data.jsonl');

// Generate embeddings and upsert
const vectors = await Promise.all(
  chunks.map(async (chunk) => {
    const embedding = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunk.text
    });
    return {
      id: chunk.id,
      values: embedding.data[0].embedding,
      metadata: { text: chunk.text, ...chunk.metadata }
    };
  })
);

await index.upsert(vectors);
Python / Jupyter Notebooks

Load and process chunks in Python environments

import json

# Load JSONL file
chunks = []
with open('data.jsonl', 'r') as f:
    for line in f:
        chunks.append(json.loads(line))

# Process chunks
for chunk in chunks:
    print(f"ID: {chunk['id']}")
    print(f"Text: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['metadata']['estimatedTokens']}")
    print("---")

About Text Chunking

Why chunk text for AI?

  • Stay within limits: Most LLMs have context-window limits (typically 4K-128K tokens)
  • Better embeddings: Smaller chunks create more focused vector embeddings
  • RAG applications: Essential for semantic search and retrieval
  • Cost optimization: Process only relevant chunks to reduce API costs

Chunking methods

Recursive (Recommended):

Smart splitting that tries to keep related content together by using multiple separators in order (paragraphs → sentences → words → characters).
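
To make the separator cascade concrete, here is a simplified sketch of recursive splitting. It is an illustration of the idea, not the tool's actual implementation, and it works in characters rather than tokens:

// Simplified recursive splitter: try coarse separators first (paragraphs),
// fall back to finer ones (sentences, words, characters) when a piece is
// still larger than the chunk limit
function recursiveSplit(text, maxChars, separators = ['\n\n', '. ', ' ', '']) {
  if (text.length <= maxChars) return [text];

  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // Nothing left to split on: hard-cut at the limit
    const out = [];
    for (let i = 0; i < text.length; i += maxChars) {
      out.push(text.slice(i, i + maxChars));
    }
    return out;
  }

  const pieces = sep === '' ? [...text] : text.split(sep);
  const chunks = [];
  let current = '';

  for (const piece of pieces) {
    if (piece.length > maxChars) {
      // Piece is still too big: recurse with the next, finer separator
      if (current) { chunks.push(current); current = ''; }
      chunks.push(...recursiveSplit(piece, maxChars, rest));
      continue;
    }
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}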

Sentence-Based:

Splits on sentence boundaries to maintain grammatical coherence. Best for question-answering systems.

Fixed Size:

Simple splitting at exact token/character counts. Fast but may break in the middle of sentences.
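
A minimal sketch of fixed-size chunking with overlap, using the ~4 characters per token estimate described in the note below (again an illustration, not the tool's exact implementation):

// Fixed-size chunking: cut at an exact size, carrying over a percentage of
// the previous chunk so context is not lost at the boundaries.
// Sizes are in tokens, estimated with the rough 4-characters-per-token rule.
function fixedSizeChunks(text, chunkTokens = 512, overlapPercent = 10) {
  const chunkChars = chunkTokens * 4;
  const overlapChars = Math.floor(chunkChars * (overlapPercent / 100));
  const step = chunkChars - overlapChars;

  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    const piece = text.slice(start, start + chunkChars);
    chunks.push({
      text: piece,
      metadata: { estimatedTokens: Math.ceil(piece.length / 4) }
    });
  }
  return chunks;
}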

Note: Token estimation uses a rough ~4 characters per token heuristic, which is a reasonable approximation for English prose. For production use with specific models, count tokens with the model's own tokenizer (e.g. tiktoken for GPT models; Anthropic provides a token-counting API for Claude).