MCP Core - Content Preprocessing
MCP Core is the content preprocessing engine powering the LanOnasis Memory Platform. It handles intelligent text transformation, chunking, validation, and context building before data is indexed and embedded for semantic search.
Overview
MCP Core transforms raw content into optimized memory items through a sophisticated pipeline:
Raw Content
↓
[Text Cleaning] → [Format Parsing] → [Chunking Strategy] → [Metadata Extraction]
↓
[Validation] → [Context Building] → [Token Optimization] → [Ready for Embedding]
Core Responsibilities
- Text Cleaning: Remove noise, normalize formatting, decode entities
- Content Recognition: Detect and handle markdown, code, HTML, JSON, XML
- Intelligent Chunking: Split content using 6 strategies (fixed-size, semantic, paragraph, sentence, code-aware, custom)
- Metadata Extraction: Pull entities, keywords, language, complexity scores
- Content Validation: Security checks, encoding validation, auto-fixing issues
- Context Building: Generate AI-optimized context windows respecting token limits (GPT-4, Claude, etc.)
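In practice these responsibilities collapse into two calls: one `preprocess` pass over raw content and, later, one `buildContext` pass over the stored results. The following is a minimal sketch of that flow using the API described later in this document; the sample document text is illustrative.

```typescript
import { MCPCore } from "@lanonasis/mcp-core";

// Minimal end-to-end sketch: clean, chunk, and extract metadata, then pack the
// resulting chunks into an AI context window. Values mirror the documented defaults.
const mcp = new MCPCore({
  chunking: { strategy: "semantic", size: 512, overlap: 50 },
});

const rawDocument = "# Q2 Planning\n\nOur Q2 goals focus on AI features...";

const processed = await mcp.preprocess({
  content: rawDocument,
  extractMetadata: true,
  validateContent: true,
});

const context = await mcp.buildContext({
  query: "What are our Q2 goals?",
  memories: processed.chunks.map((c) => ({ id: c.id, content: c.content })),
  strategy: "relevance",
  modelTokenLimit: 4096,
});
```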
Architecture
Service Structure
mcp-core/
├── src/
│ ├── protocols/ # Protocol management (SSE, WebSocket)
│ ├── workers/ # Async queue workers (Embedding, Batch)
│ ├── ui/ # Visualization components (Web, IDE)
│ ├── services/ # Business logic (Queue, Search, Content)
│ ├── utils/ # Utility modules (Chunking, Cleaning)
│ ├── api/ # API controllers (Dashboard)
│ └── index.ts # Service entrypoint
├── tests/
│ ├── unit/ # Unit tests for each module
│ └── integration/ # End-to-end pipeline tests
└── README.md
Module Responsibilities
| Module | Purpose | Key Components |
|---|---|---|
| Protocols | Multi-transport support | SSEHandler, WebSocketHandler, ProtocolManager |
| Workers | Asynchronous processing | WorkerLauncher, EmbeddingWorker, BatchWorker |
| UI / Visualization | Memory data visualization | MemoryDashboard (Web), MemoryPanel (IDE) |
| Cleaning | Remove noise, normalize | TextCleaner, HTML Entity Decoder |
| Parsing | Detect content type | Markdown/HTML/Plain Text Parsers |
| Chunking | Split intelligently | Fixed-Size, Semantic, Context-Aware |
| Queue Management | Task orchestration | QueueService, pg-boss Integration |
Key Features
1. Text Cleaning
Remove formatting noise while preserving content structure:
# Input
"Hello, World!!!\n\n Multiple spaces"
# Output after cleaning
"Hello World Multiple spaces"
Features:
- Control character removal
- HTML entity decoding (`&nbsp;` → space, `&lt;` → `<`)
- URL extraction (preserved or removed based on config)
- Whitespace normalization
- Custom pattern removal
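These operations are easy to reason about in isolation. The helper below is an illustrative sketch of the cleaning steps (control-character removal, common entity decoding, whitespace normalization); it is not the library's internal `TextCleaner`.

```typescript
// Illustrative sketch of the cleaning steps; mcp-core's internal TextCleaner
// may differ. Handles control characters, a few common HTML entities, and
// whitespace normalization.
function cleanText(input: string): string {
  return input
    // strip control characters (keep \n and \t until whitespace normalization)
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    // decode a handful of common HTML entities
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    // collapse runs of whitespace (including newlines) into single spaces
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanText("Hello&nbsp;World\n\n   Multiple    spaces"));
// "Hello World Multiple spaces"
```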
2. Intelligent Chunking Strategies
Choose the right strategy for your content type:
Fixed-Size Chunking (Default)
Size: 512 tokens, Overlap: 50 tokens (the documented defaults)
→ Predictable chunks, simple to implement
Semantic Chunking (Recommended for documents)
Splits at natural boundaries:
- Section headers
- Paragraph breaks
- Thematic changes
→ Preserves meaning across chunk boundaries
Paragraph Chunking (For essays, articles)
One chunk per paragraph
→ Maintains narrative flow
Sentence Chunking (For fine-grained search)
One chunk per sentence
→ Enables precise keyword matching
Code-Block Chunking (For source code)
Language-aware splitting:
- Keep functions/classes intact
- Preserve import statements
- Maintain scoping
→ Enables code search by function/class
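As a concrete reference point, fixed-size chunking with overlap can be sketched in a few lines. This illustrates the strategy rather than mcp-core's implementation: it approximates model tokens with whitespace-delimited words so the example stays self-contained.

```typescript
// Illustrative fixed-size chunker with overlap. Real implementations count
// model tokens via a tokenizer; this sketch uses whitespace-delimited words.
function fixedSizeChunks(text: string, size = 512, overlap = 50): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap);
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    if (start + size >= tokens.length) break; // last window reached
  }
  return chunks;
}
```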
3. Content Type Support
Automatically detect and handle multiple formats:
| Format | Handling | Example |
|---|---|---|
| Markdown | Preserve heading structure, code blocks | # Title\n\n## Section |
| Code | Language detection, function extraction | Python, JavaScript, SQL, etc. |
| HTML | Tag stripping, structure preservation | Convert to semantic markdown |
| JSON | Flatten structure, extract values | {"user": {"name": "Alice"}} |
| Plain Text | Line/paragraph detection | Standard text files |
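Detection is typically heuristic. The sketch below shows one plausible approach, not mcp-core's actual detector: try strict JSON first, then look for HTML tags, then markdown markers, and fall back to plain text.

```typescript
type ContentType = "json" | "html" | "markdown" | "text";

// Plausible detection heuristic for illustration only; the real detector is
// likely more robust (XML, code fences, programming-language detection, ...).
function detectContentType(content: string): ContentType {
  const trimmed = content.trim();
  if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
    try {
      JSON.parse(trimmed);
      return "json";
    } catch {
      /* not valid JSON, fall through to other checks */
    }
  }
  if (/<\/?[a-z][\s\S]*>/i.test(trimmed)) return "html";
  if (/^#{1,6}\s.+|\*\*[^*]+\*\*|\[[^\]]+\]\([^)]+\)/m.test(trimmed)) return "markdown";
  return "text";
}
```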
4. Metadata Extraction
Automatically pull structured information:
{
"entities": {
"emails": ["alice@example.com"],
"urls": ["https://example.com"],
"dates": ["2026-01-15"],
"mentions": ["@alice", "@bob"]
},
"keywords": ["machine learning", "AI", "data science"],
"language": "en",
"complexity": 7.2,
"sentiment": 0.6,
"statistics": {
"word_count": 1245,
"sentence_count": 42,
"avg_word_length": 5.3
}
}
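Entity extraction of this kind is largely pattern-driven. The sketch below illustrates the idea with deliberately simplified regular expressions; the library's patterns, plus its language, complexity, and sentiment scoring, are more involved.

```typescript
// Simplified illustration of regex-based entity extraction (requires a runtime
// with lookbehind support, e.g. Node 18+ or Bun).
function extractEntities(text: string) {
  return {
    emails: text.match(/[\w.+-]+@[\w-]+\.[\w.]+/g) ?? [],
    urls: text.match(/https?:\/\/[^\s)]+/g) ?? [],
    dates: text.match(/\b\d{4}-\d{2}-\d{2}\b/g) ?? [],
    mentions: text.match(/(?<![\w.])@\w+/g) ?? [],
  };
}

extractEntities("Ping @alice at alice@example.com about https://example.com by 2026-01-15");
// { emails: ["alice@example.com"], urls: ["https://example.com"],
//   dates: ["2026-01-15"], mentions: ["@alice"] }
```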
5. Context Building for AI
Generate optimized context windows for AI models:
// Input: multiple memories + user query
const context = await mcpCore.buildContext({
query: "What are our Q2 goals?",
memories: [...], // 20+ memories to search
modelTokenLimit: 4096, // token budget for the target model's context window
strategy: 'relevance' // Use relevance scoring
});
// Output: optimized, ranked memories
{
context: [
{ rank: 1, memory: "Q2 Goals: AI features...", score: 0.95 },
{ rank: 2, memory: "Team roadmap...", score: 0.82 },
...
],
tokenCount: 2048,
coverage: 0.92
}
Context Building Strategies:
- Relevance – Multi-factor relevance scoring
- Temporal – Recent content prioritized
- Conversational – Build from conversation history
- Diverse – Cover different topics
- Hierarchical – Parent-child relationships
- Hybrid – Combination of multiple factors
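Whatever the strategy, the core problem has the same shape: rank candidate memories, then pack the highest-value ones into a fixed token budget. The sketch below shows a greedy relevance-ranked packer; the scoring and token counting are simplified stand-ins for the library's internals.

```typescript
interface Memory { id: string; content: string; score: number }

// Greedy illustration of relevance-based context packing: sort by score, then
// add memories until the token budget is exhausted. Token counting is
// approximated by word count to keep the sketch self-contained.
function packContext(memories: Memory[], tokenLimit: number, minScore = 0.5) {
  const approxTokens = (text: string) => text.split(/\s+/).length;
  const ranked = [...memories]
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score);

  const context: Array<Memory & { rank: number }> = [];
  let tokenCount = 0;
  for (const memory of ranked) {
    const cost = approxTokens(memory.content);
    if (tokenCount + cost > tokenLimit) continue; // skip what no longer fits
    context.push({ ...memory, rank: context.length + 1 });
    tokenCount += cost;
  }
  return { context, tokenCount };
}
```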
Enterprise Features (v2.0+)
6. Multi-Transport Protocol Manager
mcp-core supports multiple communication protocols for real-time and high-performance applications.
Supported Protocols:
- SSE (Server-Sent Events): Optimized for unidirectional real-time updates (default for web clients).
- WebSocket: Full-duplex communication for interactive sessions and IDE integrations.
Usage:
import { ProtocolManager } from './protocols/protocol-manager';
const pm = new ProtocolManager(config, logger);
await pm.initialize(httpServer);
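On the client side, SSE consumption needs nothing beyond the platform's `EventSource` API. The endpoint path in the sketch below (`/mcp/events`) is a placeholder; substitute whatever route your ProtocolManager actually registers.

```typescript
// Browser-side SSE consumption sketch; the endpoint path is a placeholder.
const events = new EventSource("/mcp/events");

events.onmessage = (event) => {
  const update = JSON.parse(event.data);
  console.log("memory update:", update);
};

events.onerror = () => {
  // EventSource reconnects automatically; log for visibility.
  console.warn("SSE connection interrupted, retrying...");
};
```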
7. Background Worker Architecture
For high-volume content processing, mcp-core offloads intensive tasks to specialized background workers via a queue system.
- Embedding Worker: Calculates vector embeddings for chunks in parallel.
- Batch Worker: Handles massive imports and re-indexing tasks.
Worker Management:
# Start worker processes
bun run workers:start
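A worker is essentially a queue consumer. The sketch below shows what an embedding worker could look like on top of pg-boss (the queue library mcp-core integrates with); the queue name, payload shape, and `embedChunk` helper are illustrative assumptions rather than mcp-core's actual contract, and handler details vary by pg-boss version.

```typescript
import PgBoss from "pg-boss";

interface EmbedJobData { chunkId: string; content: string }

// Hypothetical embedding call; swap in your actual embedding provider.
async function embedChunk(chunkId: string, content: string): Promise<void> {
  console.log(`embedding ${chunkId} (${content.length} chars)`);
}

// Illustrative embedding worker. Queue name and payload shape are assumptions.
const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

await boss.work("embed-chunk", async (jobs) => {
  // Recent pg-boss versions pass an array of jobs to the handler, older ones a
  // single job; normalize so the sketch works either way.
  const batch = (Array.isArray(jobs) ? jobs : [jobs]) as Array<{ data: EmbedJobData }>;
  for (const job of batch) {
    await embedChunk(job.data.chunkId, job.data.content);
  }
});

// Producers enqueue work like this (recent pg-boss versions also require the
// queue to be created first, e.g. via boss.createQueue("embed-chunk")):
await boss.send("embed-chunk", { chunkId: "chunk_1", content: "..." });
```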
8. Visualization UI
Monitor and visualize your memory ecosystem with built-in UI components.
- Memory Dashboard: A React-based web interface for monitoring statistics, queue status, and memory distribution.
- IDE Panels: Specialized panels for VSCode, Cursor, and Windsurf that bring memory insights directly into the developer workflow.
Dashboard Features:
- Real-time queue metrics
- Search analytics & trend visualization
- Tag cloud & distribution charts
Installation & Setup
Prerequisites
- Node.js 18+ or Bun 1.1+
- PostgreSQL 13+ (optional; needed for the pg-boss queue workers and distributed caching)
Installation
# Install from npm
npm install @lanonasis/mcp-core
# Or using Bun
bun add @lanonasis/mcp-core
Local Development
# Clone repository
git clone https://github.com/lanonasis/mcp-core.git
cd mcp-core
# Install dependencies
bun install
# Build
bun run build
# Run tests
bun run test
# Start dev server
bun run dev
Configuration
Environment Variables
# Preprocessing config
MCP_CORE_CHUNK_SIZE=512 # Default chunk size in tokens
MCP_CORE_CHUNK_OVERLAP=50 # Token overlap between chunks
MCP_CORE_CHUNKING_STRATEGY=semantic # Strategy: fixed, semantic, paragraph, sentence, code
# Content validation
MCP_CORE_MAX_CONTENT_SIZE=10000000 # Max input size (10 MB)
MCP_CORE_ENABLE_SECURITY_CHECK=true # Enable XSS/injection checks
MCP_CORE_AUTO_FIX_ENCODING=true # Auto-fix encoding issues
# Logging
MCP_CORE_LOG_LEVEL=info # debug, info, warn, error
MCP_CORE_ENABLE_METRICS=true # Prometheus metrics
# Caching (optional)
MCP_CORE_CACHE_ENABLED=false
MCP_CORE_CACHE_TTL=3600
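These variables are read at startup. The sketch below shows one way to load them with fallbacks to the documented defaults; the helper names are illustrative and not part of `@lanonasis/mcp-core`.

```typescript
// Illustrative environment loading with fallbacks to the documented defaults.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isFinite(parsed) ? parsed : fallback;
}

const preprocessingConfig = {
  chunkSize: envInt("MCP_CORE_CHUNK_SIZE", 512),
  chunkOverlap: envInt("MCP_CORE_CHUNK_OVERLAP", 50),
  chunkingStrategy: process.env.MCP_CORE_CHUNKING_STRATEGY ?? "semantic",
  maxContentSize: envInt("MCP_CORE_MAX_CONTENT_SIZE", 10_000_000),
  enableSecurityCheck: process.env.MCP_CORE_ENABLE_SECURITY_CHECK !== "false",
};
```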
Configuration File (mcp-core.config.ts)
import { MCPCoreConfig } from "@lanonasis/mcp-core";
export const config: MCPCoreConfig = {
chunking: {
strategy: "semantic",
size: 512,
overlap: 50,
preserveMarkdownHeadings: true,
preserveCodeBlocks: true,
},
cleaning: {
removeUrls: false,
decodeHtmlEntities: true,
normalizeWhitespace: true,
removeCustomPatterns: [/\[REDACTED\]/g],
},
validation: {
enableSecurityChecks: true,
maxContentSize: 10_000_000,
allowedContentTypes: ["text/plain", "text/markdown", "application/json"],
},
contextBuilder: {
strategies: ["relevance", "temporal"],
modelTokenLimit: 4096,
deduplicationThreshold: 0.85,
},
};
API Reference
Preprocess Content
Transform raw content into optimized chunks:
import { MCPCore } from "@lanonasis/mcp-core";
const mcp = new MCPCore(config);
const result = await mcp.preprocess({
content: "Your content here...",
contentType: "markdown", // auto-detected if omitted
chunkingStrategy: "semantic",
extractMetadata: true,
validateContent: true,
});
console.log(result);
// {
// chunks: [
// { id: "chunk_1", content: "...", tokens: 512 },
// { id: "chunk_2", content: "...", tokens: 498 }
// ],
// metadata: { keywords: [...], entities: [...] },
// statistics: { totalTokens: 1010, validationPassed: true }
// }
Build AI Context
Generate optimized context window:
const context = await mcp.buildContext({
query: "What are our goals?",
memories: [
{ id: "m1", content: "Q2 goals: AI features..." },
{ id: "m2", content: "Team roadmap..." },
],
strategy: "relevance",
modelTokenLimit: 4096,
minRelevanceScore: 0.5,
});
console.log(context);
// {
// context: [
// { rank: 1, memoryId: 'm1', score: 0.95, content: '...' },
// { rank: 2, memoryId: 'm2', score: 0.82, content: '...' }
// ],
// totalTokens: 2048,
// coverage: 0.92
// }
Validate Content
Check content for security and integrity issues:
const validation = await mcp.validate({
content: userInput,
contentType: "html",
autoFix: true,
securityLevel: "strict", // strict, moderate, permissive
});
console.log(validation);
// {
// valid: true,
// issues: [
// { type: 'xss', pattern: 'onclick=', severity: 'high', fixed: true }
// ],
// fixedContent: '...'
// }
Common Workflows
Scenario 1: Process Uploaded Document
const file = await req.file;
const content = await file.text();
const processed = await mcp.preprocess({
content,
contentType: "markdown",
chunkingStrategy: "semantic", // Smart boundaries for documents
});
// Store chunks in memory service
for (const chunk of processed.chunks) {
await memory.create({
text: chunk.content,
namespace: "documents",
metadata: { source: file.name, chunkId: chunk.id },
});
}
Scenario 2: Optimize for AI Response
// User asks a question
const query = "What are our Q2 priorities?";
// Fetch relevant memories
const relevant = await memory.search({
query,
namespace: "planning",
limit: 20,
});
// Build optimized context for Claude
const context = await mcp.buildContext({
query,
memories: relevant.items,
strategy: "relevance",
modelTokenLimit: 4096, // context token budget for the target model
});
// Feed to model
// "claude" is an Anthropic SDK client instance
const response = await claude.messages.create({
  model: "claude-3-5-sonnet-latest",
  max_tokens: 1024, // required by the Anthropic Messages API
  system: `You have access to organizational context:\n\n${context.context.map((c) => c.content).join("\n\n")}`,
  messages: [{ role: "user", content: query }],
});
Scenario 3: Security-First Content Validation
// Validate user-submitted content
const validation = await mcp.validate({
content: userSubmittedHTML,
contentType: "html",
autoFix: true,
securityLevel: "strict",
});
if (!validation.valid) {
return {
error: "Content contains security issues",
issues: validation.issues,
};
}
// Use fixed content
const processed = await mcp.preprocess({
content: validation.fixedContent,
contentType: "markdown",
});
Performance & Optimization
Latency (Typical Values)
| Operation | Latency (p50) | Latency (p99) | Notes |
|---|---|---|---|
| Text Cleaning | 10ms | 50ms | For 10KB text |
| Chunking | 20ms | 100ms | Semantic strategy slower than fixed |
| Metadata Extraction | 15ms | 75ms | Depends on content complexity |
| Context Building | 50ms | 200ms | Scales with # of memories |
| Full Pipeline | 100ms | 300ms | All steps combined |
Throughput
- Text Cleaning: 5 MB/second per instance
- Chunking: 2 MB/second (semantic), 10 MB/second (fixed-size)
- Validation: 3 MB/second per instance
Optimization Tips
- Reuse instances: Don't create a new `MCPCore` instance for each request
- Batch processing: Process multiple items together (see the sketch below)
- Cache metadata: Store extracted metadata instead of re-extracting it
- Choose the appropriate strategy: Use `fixed` for speed, `semantic` for quality
- Adjust token limits: Larger chunks mean fewer API calls; smaller chunks give finer-grained retrieval
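The first two tips are the easiest wins. Below is a sketch of what they look like in practice; the `documents` array and the concurrency cap are illustrative.

```typescript
import { MCPCore } from "@lanonasis/mcp-core";

// One shared instance for the whole process instead of one per request.
const mcp = new MCPCore({ chunking: { strategy: "fixed", size: 512, overlap: 50 } });

// Illustrative batch: preprocess many documents concurrently with a small
// concurrency cap so a large import does not overwhelm the instance.
async function preprocessBatch(documents: string[], concurrency = 5) {
  const results: Awaited<ReturnType<typeof mcp.preprocess>>[] = [];
  for (let i = 0; i < documents.length; i += concurrency) {
    const slice = documents.slice(i, i + concurrency);
    const processed = await Promise.all(
      slice.map((content) => mcp.preprocess({ content, extractMetadata: true })),
    );
    results.push(...processed);
  }
  return results;
}
```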
Troubleshooting
Issue: Chunks are too small/large
Solution: Adjust chunking strategy and size
// Increase size for faster processing
const config = {
chunking: {
size: 1024, // Double the size
overlap: 100,
},
};
Issue: Encoding errors in non-English content
Solution: Enable auto-fix encoding
MCP_CORE_AUTO_FIX_ENCODING=true
Issue: Context is incomplete (missing relevant memories)
Solution: Lower relevance threshold
const context = await mcp.buildContext({
  query,
  memories,               // same inputs as before
  strategy: "relevance",
  minRelevanceScore: 0.3, // lower threshold to admit more memories
});
Issue: Performance degradation with large files
Solution: Use streaming API
const stream = await mcp.preprocessStream({
content: largeContent,
chunkSize: 512,
});
for await (const chunk of stream) {
await memory.create({ text: chunk.content });
}
Related Services
- Memory Overview – Where MCP Core output is stored
- Memory CLI – How to manage preprocessed content
- MCP Integration – AI agent integration patterns
- Data Masking – Privacy-preserving content handling
Support & Resources
- GitHub: lanonasis/mcp-core
- Issues: Report bugs
- Discussions: Q&A and feature requests
- Email: support@lanonasis.com
Last Updated: February 3, 2026
Version: 1.2.0+