⚡
End-to-End Automation

ChunkIQ Processor

The complete processing engine: document extraction, semantic chunking, vector embedding, Azure AI Search indexing, and hybrid semantic search, fully automated and running on Azure Functions.

Get started free See the tech stack
4
Pipeline stages
<5 min
Full run time
4
Source platforms
Zero
Manual steps

Four stages. Fully automated.

Each stage is independent, observable, and runs on Azure Functions; trigger the full pipeline on demand or on a schedule.

1
📥

Ingest

The connector authenticates to your Microsoft 365 tenant and enumerates all files across the configured source platforms. Files are downloaded to Azure Data Lake Storage Gen2 with full provenance metadata and content-hash deduplication to skip unchanged files.

๐Ÿ“ SharePoint ๐Ÿ’ฌ Microsoft Teams ๐Ÿ““ OneNote โ˜๏ธ OneDrive
Microsoft 365 Connector ADLS Gen2 Content hashing Delta sync
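The content-hash deduplication described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: `needs_ingest` and the in-memory `seen_hashes` dict are hypothetical stand-ins for hashes persisted alongside the ADLS Gen2 blobs.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 digest used as the dedup key for a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_ingest(path: str, data: bytes, seen_hashes: dict) -> bool:
    """Return True only when the file is new or its content changed."""
    digest = content_hash(data)
    if seen_hashes.get(path) == digest:
        return False            # unchanged -> skip re-processing
    seen_hashes[path] = digest  # record the new hash for the next run
    return True


store = {}
print(needs_ingest("Docs/plan.docx", b"v1", store))  # first sight -> True
print(needs_ingest("Docs/plan.docx", b"v1", store))  # unchanged -> False
print(needs_ingest("Docs/plan.docx", b"v2", store))  # changed -> True
```

Because the key is a hash of the content rather than a timestamp, a re-uploaded but unchanged file is still skipped.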
2
📄

Document Extraction

A format-aware dispatcher routes each file to the appropriate extractor. No external OCR or Document Intelligence service is required; all parsing runs natively, keeping costs minimal and extraction fully portable.

.pdf .docx .xlsx / .xlsm .pptx .html .csv / .json / utf-8 → structured
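A format-aware dispatcher of this kind is essentially a lookup from file extension to parser. The sketch below uses hypothetical extractor stubs (the real parsers are not shown here); only the routing pattern is the point.

```python
from pathlib import Path


# Stand-in extractors -- placeholders for the real native parsers.
def extract_pdf(data: bytes) -> str:
    return "<pdf text>"


def extract_docx(data: bytes) -> str:
    return "<docx text>"


def extract_plain(data: bytes) -> str:
    # csv / json / utf-8 text all decode directly.
    return data.decode("utf-8", errors="replace")


EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".txt": extract_plain,
    ".csv": extract_plain,
    ".json": extract_plain,
}


def dispatch(filename: str, data: bytes) -> str:
    """Route a file to its extractor by extension, defaulting to plain text."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower(), extract_plain)
    return extractor(data)


print(dispatch("notes.TXT", b"hello"))  # case-insensitive match -> "hello"
```

Falling back to a plain-text extractor keeps unknown formats from failing outright.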
3
✂️

Chunk & Embed

Extracted text is split into semantic chunks using a hybrid chunking strategy that respects paragraph boundaries while keeping chunk sizes within the token budget. Each chunk is then embedded using Azure OpenAI's text-embedding-3-small model, producing a 1,536-dimensional vector per chunk.

Hybrid semantic chunker Token-aware splitting Azure OpenAI Embeddings 1,536-dim vectors Provenance metadata
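A paragraph-respecting, token-budgeted chunker can be sketched as below. This is an illustration under simplifying assumptions: `hybrid_chunks` is a hypothetical helper, and the token count is approximated by whitespace word count rather than a real tokenizer.

```python
def hybrid_chunks(text: str, max_tokens: int = 100) -> list:
    """Split text on paragraph boundaries, packing paragraphs into chunks
    that stay within the (approximate) token budget."""
    chunks, current, count = [], [], 0
    for para in [p.strip() for p in text.split("\n\n") if p.strip()]:
        words = len(para.split())  # crude token proxy
        if current and count + words > max_tokens:
            chunks.append("\n\n".join(current))  # close chunk at a boundary
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "para one words here\n\n" + ("word " * 90).strip() + "\n\nshort tail"
print([len(c.split()) for c in hybrid_chunks(doc, max_tokens=50)])  # -> [4, 90, 2]
```

Note that a single paragraph larger than the budget still becomes its own chunk here; a production chunker would additionally split oversized paragraphs. The resulting chunks would then be sent in batches to the embedding model.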
4
⚡

Index & Search

Chunks are upserted to Azure AI Search with their embedding vectors and all metadata fields. The index supports hybrid BM25 + vector search, combining the two result lists with Reciprocal Rank Fusion and applying semantic re-ranking, delivering strong retrieval accuracy for RAG pipelines and search applications.

Azure AI Search BM25 keyword search HNSW vector index RRF score fusion Semantic re-ranking
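Reciprocal Rank Fusion combines ranked lists by scoring each document as the sum of 1/(k + rank) over the lists it appears in. The sketch below shows the idea in isolation; `k = 60` is the commonly used constant, an assumption here rather than the product's configuration.

```python
def rrf(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in either list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["a", "b", "c"]     # keyword ranking
vector_hits = ["b", "c", "d"]   # vector-similarity ranking
print(rrf([bm25_hits, vector_hits]))  # -> ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") rise above documents found by only one retriever, which is what makes the fusion robust.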

Built for reliability at scale

Azure Functions runtime with full observability, error handling, and incremental processing, designed to run reliably on large tenants.

⚙️

Azure Functions Runtime

Each pipeline stage is a separate Azure Function. Stages can be triggered individually or run end-to-end on a timer trigger or HTTP call from the Laravel portal.

🔄

Incremental Processing

Content hashing on ingest and delta query tokens on OneDrive/SharePoint ensure only changed content is re-processed, keeping run times short on large tenants.

📊

Pipeline Monitoring

The Laravel dashboard shows live status of each source platform, chunks indexed, active sources, and last pipeline run time, all in one place.

🔀

Parallel Extraction

Files are processed in parallel across Azure Functions workers. Large batches of documents are extracted concurrently to minimise total pipeline run time.
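Parallel per-file extraction can be sketched with a worker pool. The `extract` stub below is a hypothetical stand-in for the real parsers; only the fan-out pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(name: str):
    """Stand-in extractor: returns the file name and its 'extracted' length."""
    text = f"text of {name}"
    return name, len(text)


files = ["a.pdf", "b.docx", "c.pptx"]

# Extract a batch of files concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(extract, files))

print(results["a.pdf"])  # -> 13
```

On Azure Functions the same fan-out can also happen across worker instances rather than threads, but the shape of the code is similar.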

🛡️

Error Isolation

Each file is processed in an isolated try/except block. A malformed document triggers a logged warning and processing moves on to the next file; it never halts the pipeline.
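The per-file isolation pattern looks like the sketch below (hypothetical function names; the real pipeline's logging and extractors are not shown). A file whose bytes cannot be decoded is logged and skipped, and the loop continues.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")


def extract(data: bytes) -> str:
    # Stand-in extractor: raises on malformed input.
    return data.decode("utf-8")


def process_all(files: dict) -> list:
    """Process every file; a failure is logged and isolated to that file."""
    extracted = []
    for name, data in files.items():
        try:
            extracted.append(extract(data))
        except Exception as exc:  # isolate the failure to this one file
            log.warning("skipping %s: %s", name, exc)
    return extracted


batch = {"good.txt": b"hello", "bad.bin": b"\xff\xfe", "also.txt": b"world"}
print(process_all(batch))  # bad.bin is skipped -> ['hello', 'world']
```

Catching at the per-file level, rather than around the whole batch, is what keeps one malformed document from failing the run.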

🔒

Zero Data Egress

All compute and storage run within your Azure subscription, with managed identity authentication throughout; no API keys are stored in code or config files.

From file to searchable in under 5 minutes

Typical run times on a mid-size Microsoft 365 tenant.

~60s
📥

Ingest

File enumeration and download to ADLS Gen2. Time scales with number of new/changed files, not total tenant size.

~90s
📄

Extraction

Parallel extraction across all file types. PDF and PowerPoint files are typically the most time-intensive formats to parse.

~60s
✂️

Chunking & Embedding

Chunking is near-instant. Embedding time is proportional to the number of new chunks; batched calls to Azure OpenAI keep latency low.

~30s
⚡

Indexing

Batched upsert to Azure AI Search. The index is updated incrementally, so live search continues to work throughout the upsert.

Every component, at a glance

Ingest
Microsoft 365 Connectors
Storage
Azure Data Lake Storage Gen2
Runtime
Azure Functions
Document Extraction
Native format parsers
Chunking
Hybrid chunker
Embeddings
Azure OpenAI text-embedding-3-small
Search
Azure AI Search (Hybrid + Semantic)
Vector Index
1,536-dim HNSW Index
Portal
Laravel 12 · Blade · Tailwind CSS

Process, index, and search โ€” all in one product

Deploy ChunkIQ Processor to your Azure subscription and run the complete pipeline from raw ADLS files to a live hybrid search index today.

Create your account Sign in