Automatically ingest files from SharePoint, Teams, OneNote, and OneDrive, then extract text with Python, chunk, embed, and index for AI-powered semantic search.
Connect to your Microsoft 365 environment and let the pipeline handle the rest: files of every format, automatically extracted and indexed.
Ingest documents from SharePoint document libraries across any site. Handles all Office formats, PDFs, and embedded attachments with full metadata.
Reads files shared in Teams channels and private chats. Discovers all team sites and libraries via Microsoft Graph API automatically.
Extracts notebook pages as HTML, parses clean text via BeautifulSoup, and processes any Office attachments embedded in notes.
Connects to personal and shared OneDrive drives. Crawls folders recursively and processes all supported document types.
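As a rough illustration of recursive folder crawling, the sketch below walks a drive through the Microsoft Graph `children` endpoint and follows `@odata.nextLink` for paging. The function name `walk_drive` and the injected `get_json` callable (anything that takes a URL and returns the parsed JSON response, with authentication handled elsewhere) are assumptions made for this example, not the product's actual code.

```python
from typing import Callable, Iterator

GRAPH = "https://graph.microsoft.com/v1.0"


def walk_drive(
    drive_id: str,
    get_json: Callable[[str], dict],  # hypothetical: authenticated GET returning parsed JSON
    item_id: str = "root",
) -> Iterator[dict]:
    """Yield every file item under item_id, descending into folders."""
    url = f"{GRAPH}/drives/{drive_id}/items/{item_id}/children"
    while url:
        page = get_json(url)
        for item in page.get("value", []):
            if "folder" in item:
                # Folder facet present: recurse into this folder's children.
                yield from walk_drive(drive_id, get_json, item["id"])
            elif "file" in item:
                # File facet present: hand the item to the caller.
                yield item
        # Graph returns a nextLink when more results remain on this level.
        url = page.get("@odata.nextLink")
```

Passing the HTTP call in as a parameter keeps the traversal logic separate from authentication and makes it easy to exercise with canned responses.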
Python-based extraction, with no external OCR service required. Each format has a dedicated extractor.
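One way to picture "a dedicated extractor per format" is a small registry keyed by file extension. The sketch below is illustrative only: the function names and the `EXTRACTORS` table are hypothetical, and the library imports are deferred so the registry itself loads even when a given parser is not installed.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_pdf(path: str) -> str:
    from pypdf import PdfReader  # lazy import: only needed when a PDF is parsed
    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path: str) -> str:
    from docx import Document  # provided by the python-docx package
    return "\n".join(p.text for p in Document(path).paragraphs)


def extract_xlsx(path: str) -> str:
    from openpyxl import load_workbook
    wb = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            cells = [str(c) for c in row if c is not None]
            if cells:
                lines.append("\t".join(cells))
    return "\n".join(lines)


# Hypothetical registry: file extension -> extractor function.
EXTRACTORS: dict[str, Callable[[str], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
}


def pick_extractor(filename: str) -> Optional[Callable[[str], str]]:
    """Return the extractor for this file, or None if the format is unsupported."""
    return EXTRACTORS.get(Path(filename).suffix.lower())
```

Adding a format then means writing one function and registering its extension, without touching the rest of the pipeline.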
A fully automated pipeline (ingest, extract, chunk, embed, and index) with no manual steps.
Microsoft Graph API pulls files from SharePoint, Teams, OneNote, and OneDrive into Azure Data Lake Storage Gen2 with full provenance metadata.
Python-native extractors parse every file type: pypdf for PDFs, python-docx for Word, openpyxl for Excel, BeautifulSoup for OneNote HTML. No external OCR service needed.
Extracted text is split into semantic chunks with full provenance metadata. Each chunk is embedded using Azure OpenAI (1,536-dim vectors).
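The chunking step can be sketched as packing whole paragraphs into chunks up to a size limit. This is a simplified stand-in for whatever "semantic" splitting the pipeline actually performs; `chunk_text` and `max_chars` are names assumed for the example, and the embedding call to Azure OpenAI is left out.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that one case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for para in paragraphs:
        candidate = f"{buffer}\n\n{para}" if buffer else para
        if len(candidate) <= max_chars or not buffer:
            # Still fits, or the buffer is empty and we must accept the paragraph.
            buffer = candidate
        else:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(buffer)
            buffer = para
    if buffer:
        chunks.append(buffer)
    return chunks
```

Each returned string would then be sent to the embedding model and stored with its position in the list as the chunk index.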
Chunks are pushed to Azure AI Search with hybrid BM25 + vector search and semantic re-ranking for best-in-class retrieval accuracy.
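Hybrid search has to merge a keyword (BM25) ranking and a vector ranking into one list. Azure AI Search's documentation describes Reciprocal Rank Fusion (RRF) for this; the sketch below shows the general idea in a few lines and is not the service's implementation. The function name and the constant `k=60` (a commonly used default) are assumptions here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores 1 / (k + rank) for every list it appears in, so
    items ranked well by both BM25 and vector search rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_hits = ["a", "b", "c"]    # hypothetical keyword ranking
vector_hits = ["b", "c", "a"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF uses only rank positions, it sidesteps the problem that BM25 scores and vector similarities live on different scales. Semantic re-ranking would then be applied to the top of the fused list.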
Built on Python and Azure with a fully configurable processing pipeline and no proprietary extraction service lock-in.
All file parsing uses pure Python libraries (pypdf, python-docx, python-pptx, openpyxl, BeautifulSoup). Fast, cost-free, and fully portable, with no Azure Document Intelligence (DI) dependency.
Hybrid BM25 full-text search combined with 1,536-dim vector embeddings and Azure AI semantic re-ranking for highly accurate retrieval.
Automatically extracts and indexes files embedded inside Word, Excel, and PowerPoint documents alongside their parent with full lineage tracking.
Smart deduplication via content hashes ensures only new or changed files are re-processed, keeping costs low and the index fresh.
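Content-hash deduplication can be illustrated with a SHA-256 digest per file. In this minimal sketch the `seen` dictionary stands in for whatever persistent store the pipeline uses, and the function names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex SHA-256 digest of the file's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_processing(path: str, data: bytes, seen: dict[str, str]) -> bool:
    """Return True if the file is new or changed since it was last seen.

    `seen` maps file path -> last processed digest and is updated in place.
    """
    digest = content_hash(data)
    if seen.get(path) == digest:
        return False  # identical bytes: skip extraction, embedding, indexing
    seen[path] = digest
    return True
```

Skipping unchanged files avoids repeat embedding calls, which is where most of the per-document cost would sit.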
Every chunk carries source platform, site, library, file path, page number, chunk index, block type, modification dates, and more.
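To make the per-chunk metadata concrete, the sketch below models a chunk as a dataclass and flattens it into a dictionary ready for indexing. The field names mirror the list above but are assumptions, as is the choice to derive the document key from a hash of the file path and chunk index.

```python
import hashlib
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ChunkRecord:
    # Hypothetical field names based on the metadata listed above.
    source_platform: str
    site: str
    library: str
    file_path: str
    chunk_index: int
    block_type: str
    text: str
    page_number: Optional[int] = None
    last_modified: Optional[str] = None  # ISO 8601 timestamp


def to_search_document(record: ChunkRecord) -> dict:
    """Flatten a record into a dict with a stable, key-safe `id` field."""
    raw_key = f"{record.file_path}#{record.chunk_index}"
    doc = asdict(record)
    # Hex digest avoids characters that are not allowed in index keys.
    doc["id"] = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
    return doc
```

A deterministic key means re-indexing a changed file overwrites its old chunks instead of creating duplicates.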
All data stays within your Azure tenant. Managed identity auth, ADLS Gen2 encryption at rest, and role-based access throughout.
Managed Azure services for storage, search, and embeddings; open-source Python for all document parsing.
Connect your Microsoft 365 tenant and have your first documents extracted, chunked, and indexed in minutes.