📁

Microsoft 365 Connector

SharePoint Extractor

Connect to any SharePoint site, crawl document libraries, and extract every file — PDFs, Word docs, Excel sheets, PowerPoints, and more — directly into the ChunkIQ pipeline.

File formats

∞

Sites & libraries

Secure Auth

Authentication

100%

Native extraction

Capabilities

Everything from every SharePoint library

Automatic discovery, recursive crawling, and format-specific extraction — all without leaving your Azure tenant.

🔌

Microsoft 365 Integration

Authenticates through a Microsoft Entra ID (formerly Azure AD) app registration using the OAuth 2.0 client credentials flow, then automatically discovers the document libraries in every site collection.

📂

Recursive Library Crawling

Traverses nested folder structures of any depth. Captures full file paths, modification dates, and author metadata for every item.
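
As a rough illustration of the traversal described above, here is a minimal sketch that walks a nested folder listing and emits one record per file with its full path, modification date, and author. The `FileRecord` type, the `crawl` function, and the in-memory `library` dictionary are hypothetical stand-ins; the real connector's data model is not documented here.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FileRecord:
    path: str
    modified: str
    author: str


def crawl(root: dict) -> Iterator[FileRecord]:
    """Walk a nested folder listing with an explicit stack (no recursion limit)."""
    stack = [("", root)]
    while stack:
        prefix, folder = stack.pop()
        path = f"{prefix}/{folder['name']}"
        for item in folder.get("files", []):
            yield FileRecord(f"{path}/{item['name']}", item["modified"], item["author"])
        # Push subfolders in reverse so they are visited in listing order.
        for sub in reversed(folder.get("folders", [])):
            stack.append((path, sub))


# Hypothetical library listing used only for this example.
library = {
    "name": "Documents",
    "files": [{"name": "plan.docx", "modified": "2024-01-05", "author": "Ana"}],
    "folders": [
        {
            "name": "Finance",
            "files": [],
            "folders": [
                {
                    "name": "Q1",
                    "files": [{"name": "budget.xlsx", "modified": "2024-02-01", "author": "Raj"}],
                    "folders": [],
                }
            ],
        }
    ],
}

records = list(crawl(library))
```

An explicit stack is used instead of recursive calls so that very deep folder hierarchies do not hit Python's recursion limit.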

📎

Embedded Attachment Extraction

Detects and extracts files embedded inside Word, Excel, and PowerPoint documents. Each attachment is processed and indexed with lineage back to its parent file.
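
One way such detection could work, sketched under assumptions: modern Office files (.docx, .xlsx, .pptx) are ZIP archives, and embedded objects are commonly stored under an `embeddings/` folder inside them. The `extract_embedded` helper and the record layout below are hypothetical, not ChunkIQ's actual implementation.

```python
import io
import zipfile

# Folders where Office Open XML packages commonly store embedded objects.
EMBED_DIRS = ("word/embeddings/", "xl/embeddings/", "ppt/embeddings/")


def extract_embedded(data: bytes, parent: str) -> list[dict]:
    """Return embedded parts of an Office file, each tagged with its parent."""
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.startswith(EMBED_DIRS) and not name.endswith("/"):
                found.append({"parent": parent, "part": name, "content": zf.read(name)})
    return found


# Build a tiny fake .docx in memory so the example runs without real files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/embeddings/Microsoft_Excel_Worksheet1.xlsx", b"fake-xlsx-bytes")

attachments = extract_embedded(buf.getvalue(), parent="report.docx")
```

Objects embedded as OLE containers (typically `.bin` parts) would need an extra unwrapping step that this sketch does not cover.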

🔄

Incremental Sync

Content hashing identifies which files are new or modified, so subsequent runs reprocess only those files and the index stays fresh without touching unchanged content.
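
The idea can be sketched in a few lines. This example assumes SHA-256 as the hash and a simple path-to-digest dictionary as the stored state; both are illustrative choices, and `files_to_process` is a hypothetical helper name.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def files_to_process(current: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return paths whose content is new or changed, and update the stored hashes."""
    changed = []
    for path, data in current.items():
        digest = content_hash(data)
        if seen.get(path) != digest:
            changed.append(path)
            seen[path] = digest
    return changed


seen_hashes: dict[str, str] = {}

# First run: everything is new.
first_run = files_to_process({"a.pdf": b"v1", "b.docx": b"v1"}, seen_hashes)

# Second run: only b.docx changed, so only it is returned.
second_run = files_to_process({"a.pdf": b"v1", "b.docx": b"v2"}, seen_hashes)
```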

🗺️

Rich Provenance Metadata

Every chunk is tagged with site URL, library name, folder path, file name, page number, chunk index, content type, and last-modified timestamp.
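
For readers who want to picture the record, the fields listed above could be modeled as follows. The class name, field names, types, and sample values are assumptions made for illustration; only the list of fields comes from the description.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkProvenance:
    site_url: str
    library_name: str
    folder_path: str
    file_name: str
    page_number: int
    chunk_index: int
    content_type: str
    last_modified: str  # ISO 8601 timestamp (assumed format)


# Hypothetical values for a single chunk.
example = ChunkProvenance(
    site_url="https://contoso.sharepoint.com/sites/finance",
    library_name="Documents",
    folder_path="/Reports/2024",
    file_name="q1-summary.pdf",
    page_number=3,
    chunk_index=7,
    content_type="application/pdf",
    last_modified="2024-03-31T17:45:00Z",
)

metadata = asdict(example)  # plain dict, ready to attach to an index document
```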

🔒

Stays in Your Tenant

Files are ingested directly into Azure Data Lake Storage Gen2 within your own subscription. No data leaves your Azure environment at any point.

How it works

From SharePoint to searchable index in 4 steps

Step 01

🔑

Authenticate

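ChunkIQ authenticates to your Microsoft 365 tenant via an Azure AD app registration that has been granted the required SharePoint and file-read permissions. With client credentials these must be application permissions, such as Sites.Read.All or Files.Read.All in Microsoft Graph, rather than the delegated-only Files.Read.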
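
To make the authentication step concrete, the sketch below builds (but does not send) a client credentials token request for the Microsoft identity platform. It assumes the connector targets Microsoft Graph, which the page does not confirm; `build_token_request` and the sample identifiers are hypothetical.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> tuple[str, str]:
    """Return the token endpoint URL and form-encoded body for the client credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # Assumption: Microsoft Graph is the target API. ".default" requests
            # the application permissions already granted to the app registration.
            "scope": "https://graph.microsoft.com/.default",
        }
    )
    return url, body


# Placeholder values; a real secret should come from a vault, not source code.
token_url, token_body = build_token_request("contoso-tenant", "app-id", "s3cret")
```

The body would be sent as an `application/x-www-form-urlencoded` POST, and the JSON response carries an `access_token` to use as a bearer token on later calls.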

Step 02

🔍

Discover & Crawl

ChunkIQ enumerates all site collections, document libraries, and folder hierarchies. Files are downloaded to Azure Data Lake Storage Gen2.
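
Enumeration over large tenants usually means following paged responses. Assuming Microsoft Graph is used (not confirmed by this page), list responses carry their items in `value` and a continuation URL in `@odata.nextLink`. The `iter_pages` helper and the canned `FAKE_RESPONSES` below are hypothetical and stand in for real HTTP calls.

```python
from typing import Callable, Iterator


def iter_pages(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item across pages by following @odata.nextLink until it is absent."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")


# Canned responses so the example runs offline; SITE_ID is a placeholder.
START = "https://graph.microsoft.com/v1.0/sites/SITE_ID/drives"
FAKE_RESPONSES = {
    START: {"value": [{"name": "Documents"}], "@odata.nextLink": "page-2"},
    "page-2": {"value": [{"name": "Archive"}]},
}

libraries = [drive["name"] for drive in iter_pages(START, FAKE_RESPONSES.__getitem__)]
```

In practice `fetch` would perform an authenticated GET and return the parsed JSON; passing it in as a parameter keeps the paging logic easy to test.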

Step 03

📄

Extract & Chunk

Dedicated extractors parse each file format, clean the text, and split it into semantic chunks with token-based length control.
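
A simplified picture of length-controlled chunking: pack whole paragraphs into a chunk until a token budget would be exceeded, then start a new chunk. This sketch counts whitespace-separated words as a stand-in for a real tokenizer and treats paragraph breaks as the only semantic boundary; both are assumptions, and `chunk_paragraphs` is a hypothetical name.

```python
def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens (word-count proxy)."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())  # stand-in for a model tokenizer's token count
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = chunk_paragraphs(sample, max_tokens=5)
```

A paragraph longer than the budget ends up in its own oversized chunk here; a fuller version would split it further at sentence boundaries.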

Step 04

⚡

Embed & Index

Each chunk is embedded with Azure OpenAI and pushed to Azure AI Search for hybrid BM25 + vector + semantic retrieval.
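
Hybrid retrieval has to merge a keyword ranking and a vector ranking into one list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method; the sketch below shows the general technique with the commonly used constant k = 60. The function and the sample document IDs are illustrative and say nothing about how ChunkIQ configures the service.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]    # hypothetical keyword results
vector_ranking = ["b", "c", "d"]  # hypothetical vector results

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# → ["b", "c", "a", "d"]
```

Documents that appear in both lists ("b" and "c") rise above documents that rank first in only one, which is the main reason hybrid search tends to be more robust than either method alone.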