MCP Core - Content Preprocessing
MCP Core is the content preprocessing engine powering the LanOnasis Memory Platform. It handles intelligent text transformation, chunking, validation, and context building before data is indexed and embedded for semantic search.
Overview
MCP Core transforms raw content into optimized memory items through a sophisticated pipeline:
Raw Content
↓
[Text Cleaning] → [Format Parsing] → [Chunking Strategy] → [Metadata Extraction]
↓
[Validation] → [Context Building] → [Token Optimization] → [Ready for Embedding]
Core Responsibilities
- Text Cleaning: Remove noise, normalize formatting, decode entities
- Content Recognition: Detect and handle markdown, code, HTML, JSON, XML
- Intelligent Chunking: Split content using 6 strategies (fixed-size, semantic, paragraph, sentence, code-aware, custom)
- Metadata Extraction: Pull entities, keywords, language, complexity scores
- Content Validation: Security checks, encoding validation, auto-fixing issues
- Context Building: Generate AI-optimized context windows respecting token limits (GPT-4, Claude, etc.)
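In practice these responsibilities collapse into two calls: one `preprocess` pass over raw content and, later, one `buildContext` pass over the stored results. The following is a minimal sketch of that flow using the API described later in this document; the sample document text is illustrative.

```typescript
import { MCPCore } from "@lanonasis/mcp-core";

// Minimal end-to-end sketch: clean, chunk, and extract metadata, then pack the
// resulting chunks into an AI context window. Values mirror the documented defaults.
const mcp = new MCPCore({
  chunking: { strategy: "semantic", size: 512, overlap: 50 },
});

const rawDocument = "# Q2 Planning\n\nOur Q2 goals focus on AI features...";

const processed = await mcp.preprocess({
  content: rawDocument,
  extractMetadata: true,
  validateContent: true,
});

const context = await mcp.buildContext({
  query: "What are our Q2 goals?",
  memories: processed.chunks.map((c) => ({ id: c.id, content: c.content })),
  strategy: "relevance",
  modelTokenLimit: 4096,
});
```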
Architecture
Service Structure
mcp-core/
├── src/
│ ├── protocols/ # Protocol management (SSE, WebSocket)
│ ├── workers/ # Async queue workers (Embedding, Batch)
│ ├── ui/ # Visualization components (Web, IDE)
│ ├── services/ # Business logic (Queue, Search, Content)
│ ├── utils/ # Utility modules (Chunking, Cleaning)
│ ├── api/ # API controllers (Dashboard)
│ └── index.ts # Service entrypoint
├── tests/
│ ├── unit/ # Unit tests for each module
│ └── integration/ # End-to-end pipeline tests
└── README.md
Module Responsibilities
| Module | Purpose | Key Components |
|---|---|---|
| Protocols | Multi-transport support | SSEHandler, WebSocketHandler, ProtocolManager |
| Workers | Asynchronous processing | WorkerLauncher, EmbeddingWorker, BatchWorker |
| UI / Visualization | Memory data visualization | MemoryDashboard (Web), MemoryPanel (IDE) |
| Cleaning | Remove noise, normalize | TextCleaner, HTML Entity Decoder |
| Parsing | Detect content type | Markdown/HTML/Plain Text Parsers |
| Chunking | Split intelligently | Fixed-Size, Semantic, Context-Aware |
| Queue Management | Task orchestration | QueueService, pg-boss Integration |
Key Features
1. Text Cleaning
Remove formatting noise while preserving content structure:
# Input
"Hello, World!!!\n\n Multiple spaces"
# Output after cleaning
"Hello World Multiple spaces"
Features:
- Control character removal
- HTML entity decoding (`&nbsp;` → space, `&lt;` → `<`)
- URL extraction (preserved or removed based on config)
- Whitespace normalization
- Custom pattern removal
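These operations are easy to reason about in isolation. The helper below is an illustrative sketch of the cleaning steps (control-character removal, common entity decoding, whitespace normalization); it is not the library's internal `TextCleaner`.

```typescript
// Illustrative sketch of the cleaning steps; mcp-core's internal TextCleaner
// may differ. Handles control characters, a few common HTML entities, and
// whitespace normalization.
function cleanText(input: string): string {
  return input
    // strip control characters (keep \n and \t until whitespace normalization)
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    // decode a handful of common HTML entities
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    // collapse runs of whitespace (including newlines) into single spaces
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanText("Hello&nbsp;World\n\n   Multiple    spaces"));
// "Hello World Multiple spaces"
```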
2. Intelligent Chunking Strategies
Choose the right strategy for your content type:
Fixed-Size Chunking (Default)
Size: 512 tokens, Overlap: 50 tokens (the documented defaults)
→ Predictable chunks, simple to implement
Semantic Chunking (Recommended for documents)
Splits at natural boundaries:
- Section headers
- Paragraph breaks
- Thematic changes
→ Preserves meaning across chunk boundaries
Paragraph Chunking (For essays, articles)
One chunk per paragraph
→ Maintains narrative flow
Sentence Chunking (For fine-grained search)
One chunk per sentence
→ Enables precise keyword matching
Code-Block Chunking (For source code)
Language-aware splitting:
- Keep functions/classes intact
- Preserve import statements
- Maintain scoping
→ Enables code search by function/class
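As a concrete reference point, fixed-size chunking with overlap can be sketched in a few lines. This illustrates the strategy rather than mcp-core's implementation: it approximates model tokens with whitespace-delimited words so the example stays self-contained.

```typescript
// Illustrative fixed-size chunker with overlap. Real implementations count
// model tokens via a tokenizer; this sketch uses whitespace-delimited words.
function fixedSizeChunks(text: string, size = 512, overlap = 50): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap);
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    if (start + size >= tokens.length) break; // last window reached
  }
  return chunks;
}
```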
3. Content Type Support
Automatically detect and handle multiple formats:
| Format | Handling | Example |
|---|---|---|
| Markdown | Preserve heading structure, code blocks | # Title\n\n## Section |
| Code | Language detection, function extraction | Python, JavaScript, SQL, etc. |
| HTML | Tag stripping, structure preservation | Convert to semantic markdown |
| JSON | Flatten structure, extract values | {"user": {"name": "Alice"}} |
| Plain Text | Line/paragraph detection | Standard text files |
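Detection is typically heuristic. The sketch below shows one plausible approach, not mcp-core's actual detector: try strict JSON first, then look for HTML tags, then markdown markers, and fall back to plain text.

```typescript
type ContentType = "json" | "html" | "markdown" | "text";

// Plausible detection heuristic for illustration only; the real detector is
// likely more robust (XML, code fences, programming-language detection, ...).
function detectContentType(content: string): ContentType {
  const trimmed = content.trim();
  if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
    try {
      JSON.parse(trimmed);
      return "json";
    } catch {
      /* not valid JSON, fall through to other checks */
    }
  }
  if (/<\/?[a-z][\s\S]*>/i.test(trimmed)) return "html";
  if (/^#{1,6}\s.+|\*\*[^*]+\*\*|\[[^\]]+\]\([^)]+\)/m.test(trimmed)) return "markdown";
  return "text";
}
```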
4. Metadata Extraction
Automatically pull structured information:
{
"entities": {
"emails": ["alice@example.com"],
"urls": ["https://example.com"],
"dates": ["2026-01-15"],
"mentions": ["@alice", "@bob"]
},
"keywords": ["machine learning", "AI", "data science"],
"language": "en",
"complexity": 7.2,
"sentiment": 0.6,
"statistics": {
"word_count": 1245,
"sentence_count": 42,
"avg_word_length": 5.3
}
}
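Entity extraction of this kind is largely pattern-driven. The sketch below illustrates the idea with deliberately simplified regular expressions; the library's patterns, plus its language, complexity, and sentiment scoring, are more involved.

```typescript
// Simplified illustration of regex-based entity extraction (requires a runtime
// with lookbehind support, e.g. Node 18+ or Bun).
function extractEntities(text: string) {
  return {
    emails: text.match(/[\w.+-]+@[\w-]+\.[\w.]+/g) ?? [],
    urls: text.match(/https?:\/\/[^\s)]+/g) ?? [],
    dates: text.match(/\b\d{4}-\d{2}-\d{2}\b/g) ?? [],
    mentions: text.match(/(?<![\w.])@\w+/g) ?? [],
  };
}

extractEntities("Ping @alice at alice@example.com about https://example.com by 2026-01-15");
// { emails: ["alice@example.com"], urls: ["https://example.com"],
//   dates: ["2026-01-15"], mentions: ["@alice"] }
```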
5. Context Building for AI
Generate optimized context windows for AI models:
// Input: multiple memories + user query
const context = await mcpCore.buildContext({
query: "What are our Q2 goals?",
memories: [...], // 20+ memories to search
modelTokenLimit: 4096, // token budget for the target model's context window
strategy: 'relevance' // Use relevance scoring
});
// Output: optimized, ranked memories
{
context: [
{ rank: 1, memory: "Q2 Goals: AI features...", score: 0.95 },
{ rank: 2, memory: "Team roadmap...", score: 0.82 },
...
],
tokenCount: 2048,
coverage: 0.92
}
Context Building Strategies:
- Relevance – Multi-factor relevance scoring
- Temporal – Recent content prioritized
- Conversational – Build from conversation history
- Diverse – Cover different topics
- Hierarchical – Parent-child relationships
- Hybrid – Combination of multiple factors
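Whatever the strategy, the core problem has the same shape: rank candidate memories, then pack the highest-value ones into a fixed token budget. The sketch below shows a greedy relevance-ranked packer; the scoring and token counting are simplified stand-ins for the library's internals.

```typescript
interface Memory { id: string; content: string; score: number }

// Greedy illustration of relevance-based context packing: sort by score, then
// add memories until the token budget is exhausted. Token counting is
// approximated by word count to keep the sketch self-contained.
function packContext(memories: Memory[], tokenLimit: number, minScore = 0.5) {
  const approxTokens = (text: string) => text.split(/\s+/).length;
  const ranked = [...memories]
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score);

  const context: Array<Memory & { rank: number }> = [];
  let tokenCount = 0;
  for (const memory of ranked) {
    const cost = approxTokens(memory.content);
    if (tokenCount + cost > tokenLimit) continue; // skip what no longer fits
    context.push({ ...memory, rank: context.length + 1 });
    tokenCount += cost;
  }
  return { context, tokenCount };
}
```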
Enterprise Features (v2.0+)
6. Multi-Transport Protocol Manager
mcp-core supports multiple communication protocols for real-time and high-performance applications.
Supported Protocols:
- SSE (Server-Sent Events): Optimized for unidirectional real-time updates (default for web clients).
- WebSocket: Full-duplex communication for interactive sessions and IDE integrations.
Usage:
import { ProtocolManager } from './protocols/protocol-manager';
const pm = new ProtocolManager(config, logger);
await pm.initialize(httpServer);
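On the client side, SSE consumption needs nothing beyond the platform's `EventSource` API. The endpoint path in the sketch below (`/mcp/events`) is a placeholder; substitute whatever route your ProtocolManager actually registers.

```typescript
// Browser-side SSE consumption sketch; the endpoint path is a placeholder.
const events = new EventSource("/mcp/events");

events.onmessage = (event) => {
  const update = JSON.parse(event.data);
  console.log("memory update:", update);
};

events.onerror = () => {
  // EventSource reconnects automatically; log for visibility.
  console.warn("SSE connection interrupted, retrying...");
};
```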
7. Background Worker Architecture
For high-volume content processing, mcp-core offloads intensive tasks to specialized background workers via a queue system.
- Embedding Worker: Calculates vector embeddings for chunks in parallel.
- Batch Worker: Handles massive imports and re-indexing tasks.
Worker Management:
# Start worker processes
bun run workers:start
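A worker is essentially a queue consumer. The sketch below shows what an embedding worker could look like on top of pg-boss (the queue library mcp-core integrates with); the queue name, payload shape, and `embedChunk` helper are illustrative assumptions rather than mcp-core's actual contract, and handler details vary by pg-boss version.

```typescript
import PgBoss from "pg-boss";

interface EmbedJobData { chunkId: string; content: string }

// Hypothetical embedding call; swap in your actual embedding provider.
async function embedChunk(chunkId: string, content: string): Promise<void> {
  console.log(`embedding ${chunkId} (${content.length} chars)`);
}

// Illustrative embedding worker. Queue name and payload shape are assumptions.
const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();

await boss.work("embed-chunk", async (jobs) => {
  // Recent pg-boss versions pass an array of jobs to the handler, older ones a
  // single job; normalize so the sketch works either way.
  const batch = (Array.isArray(jobs) ? jobs : [jobs]) as Array<{ data: EmbedJobData }>;
  for (const job of batch) {
    await embedChunk(job.data.chunkId, job.data.content);
  }
});

// Producers enqueue work like this (recent pg-boss versions also require the
// queue to be created first, e.g. via boss.createQueue("embed-chunk")):
await boss.send("embed-chunk", { chunkId: "chunk_1", content: "..." });
```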
8. Visualization UI
Monitor and visualize your memory ecosystem with built-in UI components.
- Memory Dashboard: A React-based web interface for monitoring statistics, queue status, and memory distribution.
- IDE Panels: Specialized panels for VSCode, Cursor, and Windsurf that bring memory insights directly into the developer workflow.
Dashboard Features:
- Real-time queue metrics
- Search analytics & trend visualization
- Tag cloud & distribution charts
Installation & Setup
Prerequisites
- Node.js 18+ or Bun 1.1+
- PostgreSQL 13+ (optional; needed for the pg-boss queue workers and distributed caching)
Installation
# Install from npm
npm install @lanonasis/mcp-core
# Or using Bun
bun add @lanonasis/mcp-core
Local Development
# Clone repository
git clone https://github.com/lanonasis/mcp-core.git
cd mcp-core
# Install dependencies
bun install
# Build
bun run build
# Run tests
bun run test
# Start dev server
bun run dev
Configuration
Environment Variables
# Preprocessing config
MCP_CORE_CHUNK_SIZE=512 # Default chunk size in tokens
MCP_CORE_CHUNK_OVERLAP=50 # Token overlap between chunks
MCP_CORE_CHUNKING_STRATEGY=semantic # Strategy: fixed, semantic, paragraph, sentence, code
# Content validation
MCP_CORE_MAX_CONTENT_SIZE=10000000 # Max input size (10 MB)
MCP_CORE_ENABLE_SECURITY_CHECK=true # Enable XSS/injection checks
MCP_CORE_AUTO_FIX_ENCODING=true # Auto-fix encoding issues
# Logging
MCP_CORE_LOG_LEVEL=info # debug, info, warn, error
MCP_CORE_ENABLE_METRICS=true # Prometheus metrics
# Caching (optional)
MCP_CORE_CACHE_ENABLED=false
MCP_CORE_CACHE_TTL=3600
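These variables are read at startup. The sketch below shows one way to load them with fallbacks to the documented defaults; the helper names are illustrative and not part of `@lanonasis/mcp-core`.

```typescript
// Illustrative environment loading with fallbacks to the documented defaults.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isFinite(parsed) ? parsed : fallback;
}

const preprocessingConfig = {
  chunkSize: envInt("MCP_CORE_CHUNK_SIZE", 512),
  chunkOverlap: envInt("MCP_CORE_CHUNK_OVERLAP", 50),
  chunkingStrategy: process.env.MCP_CORE_CHUNKING_STRATEGY ?? "semantic",
  maxContentSize: envInt("MCP_CORE_MAX_CONTENT_SIZE", 10_000_000),
  enableSecurityCheck: process.env.MCP_CORE_ENABLE_SECURITY_CHECK !== "false",
};
```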
Configuration File (mcp-core.config.ts)
import { MCPCoreConfig } from "@lanonasis/mcp-core";
export const config: MCPCoreConfig = {
chunking: {
strategy: "semantic",
size: 512,
overlap: 50,
preserveMarkdownHeadings: true,
preserveCodeBlocks: true,
},
cleaning: {
removeUrls: false,
decodeHtmlEntities: true,
normalizeWhitespace: true,
removeCustomPatterns: [/\[REDACTED\]/g],
},
validation: {
enableSecurityChecks: true,
maxContentSize: 10_000_000,
allowedContentTypes: ["text/plain", "text/markdown", "application/json"],
},
contextBuilder: {
strategies: ["relevance", "temporal"],
modelTokenLimit: 4096,
deduplicationThreshold: 0.85,
},
};
API Reference
Preprocess Content
Transform raw content into optimized chunks:
import { MCPCore } from "@lanonasis/mcp-core";
const mcp = new MCPCore(config);
const result = await mcp.preprocess({
content: "Your content here...",
contentType: "markdown", // auto-detected if omitted
chunkingStrategy: "semantic",
extractMetadata: true,
validateContent: true,
});
console.log(result);
// {
// chunks: [
// { id: "chunk_1", content: "...", tokens: 512 },
// { id: "chunk_2", content: "...", tokens: 498 }
// ],
// metadata: { keywords: [...], entities: [...] },
// statistics: { totalTokens: 1010, validationPassed: true }
// }
Build AI Context
Generate optimized context window:
const context = await mcp.buildContext({
query: "What are our goals?",
memories: [
{ id: "m1", content: "Q2 goals: AI features..." },
{ id: "m2", content: "Team roadmap..." },
],
strategy: "relevance",
modelTokenLimit: 4096,
minRelevanceScore: 0.5,
});
console.log(context);
// {
// context: [
// { rank: 1, memoryId: 'm1', score: 0.95, content: '...' },
// { rank: 2, memoryId: 'm2', score: 0.82, content: '...' }
// ],
// totalTokens: 2048,
// coverage: 0.92
// }
Validate Content
Check content for security and integrity issues:
const validation = await mcp.validate({
content: userInput,
contentType: "html",
autoFix: true,
securityLevel: "strict", // strict, moderate, permissive
});
console.log(validation);
// {
// valid: true,
// issues: [
// { type: 'xss', pattern: 'onclick=', severity: 'high', fixed: true }
// ],
// fixedContent: '...'
// }
Common Workflows
Scenario 1: Process Uploaded Document
const file = await req.file;
const content = await file.text();
const processed = await mcp.preprocess({
content,
contentType: "markdown",
chunkingStrategy: "semantic", // Smart boundaries for documents
});
// Store chunks in memory service
for (const chunk of processed.chunks) {
await memory.create({
text: chunk.content,
namespace: "documents",
metadata: { source: file.name, chunkId: chunk.id },
});
}
Scenario 2: Optimize for AI Response
// User asks a question
const query = "What are our Q2 priorities?";
// Fetch relevant memories
const relevant = await memory.search({
query,
namespace: "planning",
limit: 20,
});
// Build optimized context for Claude
const context = await mcp.buildContext({
query,
memories: relevant.items,
strategy: "relevance",
modelTokenLimit: 4096, // context token budget for the target model
});
// Feed to model
// "claude" is an Anthropic SDK client instance
const response = await claude.messages.create({
  model: "claude-3-5-sonnet-latest",
  max_tokens: 1024, // required by the Anthropic Messages API
  system: `You have access to organizational context:\n\n${context.context.map((c) => c.content).join("\n\n")}`,
  messages: [{ role: "user", content: query }],
});
Scenario 3: Security-First Content Validation
// Validate user-submitted content
const validation = await mcp.validate({
content: userSubmittedHTML,
contentType: "html",
autoFix: true,
securityLevel: "strict",
});
if (!validation.valid) {
return {
error: "Content contains security issues",
issues: validation.issues,
};
}
// Use fixed content
const processed = await mcp.preprocess({
content: validation.fixedContent,
contentType: "markdown",
});
Performance & Optimization
Latency (Typical Values)
| Operation | Latency (p50) | Latency (p99) | Notes |
|---|---|---|---|
| Text Cleaning | 10ms | 50ms | For 10KB text |
| Chunking | 20ms | 100ms | Semantic strategy slower than fixed |
| Metadata Extraction | 15ms | 75ms | Depends on content complexity |
| Context Building | 50ms | 200ms | Scales with # of memories |
| Full Pipeline | 100ms | 300ms | All steps combined |
Throughput
- Text Cleaning: 5 MB/second per instance
- Chunking: 2 MB/second (semantic), 10 MB/second (fixed-size)
- Validation: 3 MB/second per instance
Optimization Tips
- Reuse instances: Don't create a new `MCPCore` instance for each request
- Batch processing: Process multiple items together (see the sketch below)
- Cache metadata: Store extracted metadata instead of re-extracting it
- Choose the appropriate strategy: Use `fixed` for speed, `semantic` for quality
- Adjust token limits: Larger chunks mean fewer API calls; smaller chunks give finer-grained retrieval
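The first two tips are the easiest wins. Below is a sketch of what they look like in practice; the `documents` array and the concurrency cap are illustrative.

```typescript
import { MCPCore } from "@lanonasis/mcp-core";

// One shared instance for the whole process instead of one per request.
const mcp = new MCPCore({ chunking: { strategy: "fixed", size: 512, overlap: 50 } });

// Illustrative batch: preprocess many documents concurrently with a small
// concurrency cap so a large import does not overwhelm the instance.
async function preprocessBatch(documents: string[], concurrency = 5) {
  const results: Awaited<ReturnType<typeof mcp.preprocess>>[] = [];
  for (let i = 0; i < documents.length; i += concurrency) {
    const slice = documents.slice(i, i + concurrency);
    const processed = await Promise.all(
      slice.map((content) => mcp.preprocess({ content, extractMetadata: true })),
    );
    results.push(...processed);
  }
  return results;
}
```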
Troubleshooting
Issue: Chunks are too small/large
Solution: Adjust chunking strategy and size
// Increase size for faster processing
const config = {
chunking: {
size: 1024, // Double the size
overlap: 100,
},
};
Issue: Encoding errors in non-English content
Solution: Enable auto-fix encoding
MCP_CORE_AUTO_FIX_ENCODING=true
Issue: Context is incomplete (missing relevant memories)
Solution: Lower relevance threshold
const context = await mcp.buildContext({
  query,
  memories,               // same inputs as before
  strategy: "relevance",
  minRelevanceScore: 0.3, // lower threshold to admit more memories
});
Issue: Performance degradation with large files
Solution: Use streaming API
const stream = await mcp.preprocessStream({
content: largeContent,
chunkSize: 512,
});
for await (const chunk of stream) {
await memory.create({ text: chunk.content });
}
Related Services
- Memory Overview – Where MCP Core output is stored
- Memory CLI – How to manage preprocessed content
- MCP Integration – AI agent integration patterns
- Data Masking – Privacy-preserving content handling
Support & Resources
- GitHub: lanonasis/mcp-core
- Issues: Report bugs
- Discussions: Q&A and feature requests
- Email: support@lanonasis.com
Last Updated: February 3, 2026
Version: 1.2.0+