samsarka describes how to use Azure AI Search and OCR to extract and index PDF page numbers, enabling fast and context-aware knowledge discovery in large document repositories.

Extracting Page Numbers from PDFs with Azure AI Search and OCR

In document-heavy environments—contracts, manuals, or other large PDFs—being able to map search results to precise page numbers helps users navigate directly to relevant content. This tutorial outlines how to set up Azure AI Search (formerly Cognitive Search) to provide page-level granularity for knowledge mining solutions.

Why Page Numbers Matter in AI Search

Contextual navigation: Users can jump straight to relevant sections in large documents.
Precise citations: Better support for knowledge bases and chatbots, allowing exact references.
Increased user trust: Showing where the answer came from increases confidence in AI responses.

Prerequisites: Azure Blob Storage & AI Search Setup

Azure Blob Storage: Prepare a container to store your PDFs.
Permissions: The AI Search service requires ‘Storage Blob Data Reader’ access. If you’re using Azure RBAC, assign the managed identity accordingly.
- How to assign blob permissions
- How to index Azure Blobs

Technical Implementation

1. Skillsets: Document Cracking, OCR, and Index Projection

Document Cracking splits a PDF into pages and separates text/image content.
OCR Skill (#Microsoft.Skills.Vision.OcrSkill) processes images per page, extracting machine-readable text. This is crucial for PDFs that mix text and embedded images.
Index Projection: Map each page’s content, file name, URL, and page number into a search index. Parent-child indexing enables querying at the page level.

Example skillset configuration snippet:

{
  "name": "pagenumskillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "outputs": [
        { "name": "text", "targetName": "text" },
        { "name": "layoutText", "targetName": "layoutText" }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "pagenumidx",
        "parentKeyFieldName": "ParentKey",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          { "name": "DocText", "source": "/document/normalized_images/*/text" },
          { "name": "DocName", "source": "/document/metadata_storage_name" },
          { "name": "DocURL", "source": "/document/metadata_storage_path" },
          { "name": "PageNum", "source": "/document/normalized_images/*/pageNumber" }
        ]
      }
    ]
  }
}

2. Index Definition

Define an index schema that includes fields for document text, name, page number, and URLs. The index enables fast lookup and faceted search across documents and their pages.

Example fields:

ID (string, key)
DocText (string, searchable text for the page)
DocName (string, file name)
PageNum (int, page number)
DocURL (string, PDF link)
ParentKey (string, reference for parent-child relationship)

3. Indexer Configuration: Using OCR and Image Actions

Configure the indexer to render each PDF page as an image, then pass it to the OCR skill for text extraction. Map outputs according to the index definition.

Example indexer snippet:

{
  "name": "indexer-pagenum",
  "dataSourceName": "azureblob-...-datasource",
  "skillsetName": "pagenumskillset",
  "targetIndexName": "pagenumidx",
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "imageAction": "generateNormalizedImagePerPage"
    }
  }
}

4. Validation: Explore the Results

Use Azure AI Search’s Search Explorer to query your index and confirm that search results include page numbers:

{
  "@search.score": 1,
  "ID": "...",
  "DocText": "...PDF content...",
  "DocName": "docname.pdf",
  "PageNum": 33,
  "DocURL": "https://storageaccount.blob.core.windows.net/containername/docname.pdf",
  "ParentKey": "..."
}

Conclusion

With this setup, search and AI-driven applications can return not just relevant snippets, but exactly where in a PDF the answer resides—supporting granular, reliable, and user-friendly document discovery.

This post appeared first on “Microsoft Tech Community”. Read the entire article here