The complete processing engine: document extraction, semantic chunking, vector embedding, Azure AI Search indexing, and hybrid semantic search, all fully automated on Azure Functions.
Each stage is independent, observable, and runs on Azure Functions; trigger the full pipeline on demand or on a schedule.
The connector authenticates to your Microsoft 365 tenant and enumerates all files across the configured source platforms. Files are downloaded to Azure Data Lake Storage Gen2 with full provenance metadata and content-hash deduplication to skip unchanged files.
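The dedup check can be sketched as follows. This is an illustrative stand-in, not the shipped implementation: the in-memory dict represents the provenance metadata store in ADLS Gen2, and the path names are hypothetical.

```python
import hashlib

# Sketch of content-hash deduplication. The dict stands in for the real
# metadata store; in production the hash would live alongside the file's
# provenance metadata in ADLS Gen2.
def needs_processing(path: str, content: bytes, hash_store: dict) -> bool:
    """Return True if the file is new or its bytes changed since the last run."""
    digest = hashlib.sha256(content).hexdigest()
    if hash_store.get(path) == digest:
        return False           # unchanged -> skip download/re-extraction
    hash_store[path] = digest  # remember the hash for the next run
    return True
```

Because the hash is computed over file content rather than timestamps, renamed or re-saved-but-identical files are also skipped.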
A format-aware dispatcher routes each file to the appropriate extractor. No external OCR or Document Intelligence service is required; all parsing runs natively, keeping costs minimal and extraction fully portable.
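A minimal sketch of the dispatch pattern, assuming an extension-to-extractor map. The extractor functions here are placeholders; the real parsers are not shown.

```python
from pathlib import Path
from typing import Callable

# Placeholder extractors -- stand-ins for the pipeline's native parsers.
def extract_pdf(path: str) -> str: return f"pdf text from {path}"
def extract_docx(path: str) -> str: return f"docx text from {path}"
def extract_plain(path: str) -> str: return f"plain text from {path}"

# Format-aware routing table: file extension -> extractor.
EXTRACTORS: dict[str, Callable[[str], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".txt": extract_plain,
    ".md": extract_plain,
}

def dispatch(path: str) -> str:
    ext = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        raise ValueError(f"unsupported format: {ext}")
    return extractor(path)
```

New formats are supported by registering one more entry in the table, with no changes to the dispatch logic.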
Extracted text is split into semantic chunks using a hybrid chunking strategy that respects paragraph boundaries while keeping chunk sizes within the token budget. Each chunk is then embedded using Azure OpenAI's text-embedding-3-small model, producing a 1,536-dimensional vector per chunk.
Chunks are upserted to Azure AI Search with their embedding vectors and all metadata fields. The index supports hybrid BM25 + vector search, with the two result sets fused via Reciprocal Rank Fusion and semantic re-ranking applied on top, delivering best-in-class retrieval accuracy for RAG pipelines and search applications.
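Azure AI Search performs the fusion server-side; the sketch below only illustrates the Reciprocal Rank Fusion formula itself, where each document scores the sum of 1/(k + rank) over the ranked lists it appears in (k = 60 is a commonly used constant, assumed here).

```python
# Illustration of Reciprocal Rank Fusion over two ranked result lists.
# In the actual pipeline this fusion happens inside Azure AI Search.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both BM25 and vector search accumulates score from both lists, so it outranks documents that only one retriever favoured.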
Azure Functions runtime with full observability, error handling, and incremental processing, designed to run reliably on large tenants.
Each pipeline stage is a separate Azure Function. Stages can be triggered individually or run end-to-end on a timer trigger or HTTP call from the Laravel portal.
Content hashing on ingest and delta query tokens on OneDrive/SharePoint ensure only changed content is re-processed, keeping run times short on large tenants.
The Laravel dashboard shows live status of each source platform, chunks indexed, active sources, and last pipeline run time, all in one place.
Files are processed in parallel across Azure Functions workers. Large batches of documents are extracted concurrently to minimise total pipeline run time.
Each file is processed in an isolated try/except block. A malformed document triggers a logged warning and the pipeline moves on to the next file; a single bad file never halts the run.
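The isolation pattern looks roughly like this; `extract` and the file list are illustrative placeholders.

```python
import logging

logger = logging.getLogger("pipeline")

# Sketch of per-file fault isolation: a failing extractor is logged and
# skipped, and the loop continues with the remaining files.
def process_all(files: list[str], extract) -> tuple[list[str], list[str]]:
    done: list[str] = []
    skipped: list[str] = []
    for path in files:
        try:
            extract(path)
            done.append(path)
        except Exception as exc:  # failure is contained to this one file
            logger.warning("skipping %s: %s", path, exc)
            skipped.append(path)
    return done, skipped
```

The skipped list can then be surfaced in the dashboard, so failures are visible without ever blocking the healthy files.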
All compute and storage run within your Azure subscription. Managed identity authentication is used throughout; no API keys are stored in code or config files.
Typical run times on a mid-size Microsoft 365 tenant.
File enumeration and download to ADLS Gen2. Time scales with number of new/changed files, not total tenant size.
Parallel extraction across all file types. PDF and PowerPoint files are typically the most time-intensive formats to parse.
Chunking is near-instant. Embedding time is proportional to the number of new chunks; batched calls to Azure OpenAI keep latency low.
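The batching step can be sketched as below, with `embed_batch` standing in for the Azure OpenAI embeddings call; the batch size is an illustrative choice.

```python
# Sketch of batched embedding: one API round trip per batch of chunks
# instead of one per chunk.
def embed_all(chunks: list[str], embed_batch, batch_size: int = 16) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i:i + batch_size]))
    return vectors
```

For N chunks this issues ceil(N / batch_size) calls, which is where most of the embedding-stage latency saving comes from.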
Batched upsert to Azure AI Search. The index is updated incrementally, so live search continues to work throughout the upsert.
Deploy ChunkIQ Processor to your Azure subscription and run the complete pipeline from raw ADLS files to a live hybrid search index today.