Connect to any SharePoint site, crawl document libraries, and extract every file — PDFs, Word docs, Excel sheets, PowerPoints, and more — directly into the ChunkIQ pipeline.
Automatic discovery, recursive crawling, and format-specific extraction — all without leaving your Azure tenant.
Authenticates via Azure AD app registration using client credentials. Discovers all document libraries across every site collection automatically.
Traverses nested folder structures of any depth. Captures full file paths, modification dates, and author metadata for every item.
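As a rough illustration of traversal that does not depend on nesting depth, the sketch below walks an in-memory folder tree with an explicit stack and records each file's full path, modification date, and author. The dictionary layout (`name`, `folders`, `files`) and the `walk_library` helper are hypothetical stand-ins for illustration, not ChunkIQ's or SharePoint's actual data model.

```python
from typing import Iterator


def walk_library(root: dict) -> Iterator[dict]:
    """Yield one record per file, with its full path, at any nesting depth."""
    # An explicit stack avoids Python's recursion limit on very deep trees.
    stack = [(root, root["name"])]
    while stack:
        folder, path = stack.pop()
        for f in folder.get("files", []):
            yield {
                "path": f"{path}/{f['name']}",
                "modified": f["modified"],
                "author": f["author"],
            }
        for sub in folder.get("folders", []):
            stack.append((sub, f"{path}/{sub['name']}"))


# Hypothetical library: one file at the top level, one three folders deep.
library = {
    "name": "Documents",
    "files": [{"name": "readme.pdf", "modified": "2024-01-05", "author": "ana"}],
    "folders": [
        {
            "name": "2024",
            "folders": [
                {
                    "name": "Q1",
                    "files": [
                        {"name": "report.docx", "modified": "2024-03-01", "author": "li"}
                    ],
                }
            ],
        }
    ],
}

paths = sorted(item["path"] for item in walk_library(library))
```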
Detects and extracts files embedded inside Word, Excel, and PowerPoint documents. Each attachment is processed and indexed with lineage back to its parent file.
Content hashing ensures only new or modified files are reprocessed on subsequent runs, keeping the index fresh without redoing work on unchanged content.
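One way to picture hash-based incremental sync is the minimal sketch below. It assumes SHA-256 as the hash and a plain dictionary as the store of previously seen digests; the page does not say which hash or store ChunkIQ uses, and `content_hash` and `files_to_process` are hypothetical names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def files_to_process(current: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return paths whose content is new or changed, updating `seen` in place."""
    changed = []
    for path, data in current.items():
        digest = content_hash(data)
        if seen.get(path) != digest:
            changed.append(path)
            seen[path] = digest
    return changed


seen: dict[str, str] = {}
# First run: every file is new, so both are selected.
first = files_to_process({"a.docx": b"v1", "b.pdf": b"v1"}, seen)
# Second run: only a.docx changed, so b.pdf is skipped.
second = files_to_process({"a.docx": b"v2", "b.pdf": b"v1"}, seen)
```

Comparing digests rather than timestamps means a file that was re-saved without changes is not reprocessed.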
Every chunk is tagged with site URL, library name, folder path, file name, page number, chunk index, content type, and last-modified timestamp.
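The metadata fields listed above could be modeled roughly as follows. This is only a sketch: the class name, field names, types, and sample values (including the `contoso` URL) are assumptions chosen for illustration, not ChunkIQ's actual schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record mirroring the fields named above."""
    site_url: str
    library_name: str
    folder_path: str
    file_name: str
    page_number: int
    chunk_index: int
    content_type: str
    last_modified: datetime


meta = ChunkMetadata(
    site_url="https://contoso.sharepoint.com/sites/finance",
    library_name="Documents",
    folder_path="2024/Q1",
    file_name="report.docx",
    page_number=3,
    chunk_index=0,
    content_type="docx",
    last_modified=datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc),
)

# Convert to a plain dict and serialize the timestamp as ISO 8601.
record = asdict(meta)
record["last_modified"] = meta.last_modified.isoformat()
```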
Files are written directly to Azure Data Lake Storage Gen2 within your own subscription. No data leaves your Azure environment at any point.
Native extraction — no Azure Document Intelligence or OCR service required.
ChunkIQ authenticates to your Microsoft 365 tenant via an Azure AD app registration with the required SharePoint and Files.Read permissions.
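For readers unfamiliar with the client credentials flow, the sketch below builds (but does not send) the standard OAuth 2.0 token request for the Microsoft identity platform. It assumes the app targets Microsoft Graph through the `.default` scope; whether ChunkIQ calls Graph or another SharePoint API is not stated here. The `build_token_request` helper and the placeholder IDs are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build the client credentials token request without sending it."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # ".default" requests the application permissions already granted
            # to the app registration (assumed here to be Microsoft Graph).
            "scope": "https://graph.microsoft.com/.default",
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values; real ones come from your app registration.
req = build_token_request("my-tenant-id", "my-client-id", "my-secret")
```

Sending the request with `urllib.request.urlopen(req)` would return a JSON body containing an `access_token` when the credentials are valid. In practice a library such as MSAL is usually a better fit than hand-built requests.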
ChunkIQ enumerates all site collections, document libraries, and folder hierarchies. Files are downloaded to Azure Data Lake Storage Gen2.
Dedicated extractors parse each file format, clean the text, and split it into semantic chunks with token-based length control.
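Chunking with a token budget can be sketched as below. This is a simplified illustration, not ChunkIQ's chunker: it treats blank-line-separated paragraphs as the semantic unit and counts whitespace-separated words as tokens, whereas a production pipeline would more likely use the embedding model's tokenizer. The `chunk_text` name and `max_tokens` parameter are hypothetical.

```python
def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    """Pack paragraphs into chunks of at most `max_tokens` whitespace tokens."""
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        tokens = para.split()
        if not tokens:
            continue
        if len(tokens) > max_tokens:
            # Oversized paragraph: flush what we have, then hard-split it.
            if current:
                chunks.append(" ".join(current))
                current = []
            for i in range(0, len(tokens), max_tokens):
                chunks.append(" ".join(tokens[i:i + max_tokens]))
            continue
        if current and len(current) + len(tokens) > max_tokens:
            # Adding this paragraph would exceed the budget: start a new chunk.
            chunks.append(" ".join(current))
            current = []
        current.extend(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine ten eleven"
chunks = chunk_text(sample, max_tokens=5)
```

Small paragraphs are merged until the budget is reached, and a paragraph is split only when it cannot fit in a single chunk on its own.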
Each chunk is embedded with Azure OpenAI and pushed to Azure AI Search for hybrid BM25 + vector + semantic retrieval.
Connect your Microsoft 365 tenant and have your SharePoint documents extracted, chunked, and indexed in minutes.