Compliance · 10 min read

SOC 2 Type II + AI Controls: Extending Your Audit for LLM Systems

Your SOC 2 auditor just asked about your LLM features. Here is the controls matrix, the evidence to collect, the common findings, and how to extend an existing audit scope without starting from scratch.

The Question You're Being Asked

If you've held a SOC 2 Type II report for more than a year, this conversation is already happening. Your auditor, prompted by updated AICPA guidance and customer security questionnaires, asks: "Do any of your production systems use large language models or other AI components, and how are they covered in your controls?" You say yes. They ask to see the controls. You don't have a clean answer.

This post is the map we walk clients through when they have to answer that question. It covers which Trust Service Criteria actually touch AI components, what controls to add (not replace), how to evidence them, and what goes wrong. It's framed for teams that already have a working SOC 2 program and need to extend scope — not for first-time SOC 2 engagements.

Which TSCs Apply To AI Components

SOC 2 reports cover five Trust Service Criteria: Security, Availability, Processing Integrity, Confidentiality, and Privacy. Security is mandatory; the others are optional. Here is how each maps to a typical LLM-backed feature.

TSC | Relevance to LLM systems | Priority
Security | Access control over model endpoints, prompt logs, training data. Same as any service. | Standard
Availability | Vendor dependency for OpenAI/Anthropic. Uptime SLAs, fallbacks. | Medium
Processing Integrity | Accuracy, completeness, validity of outputs. The big one for AI. | High
Confidentiality | Prompts can contain customer confidential data. Vendor data handling. | High
Privacy | PII in prompts, user-level data in training. GDPR overlap. | High if in scope

Most mature SOC 2 reports include Security, Availability, and Confidentiality. For LLM features, you'll almost certainly need to either add Processing Integrity or document that your AI components don't touch the flows Processing Integrity would govern.

Mapping Your AI Feature To Controls

Start with a simple inventory; a sketch of one entry follows the list. For each AI feature, answer:

  1. What does it do?
  2. What data enters the prompt?
  3. What data comes out?
  4. Who or what can invoke it?
  5. What vendor(s) does it depend on?
  6. What happens if it's wrong?
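
One way to keep the inventory auditable is a versioned record per feature, checked into the repo. A minimal sketch, assuming a Python dataclass; the field names mirror the six questions and are ours, not any standard:

# ai_inventory.py -- hypothetical inventory record, one entry per AI feature
from dataclasses import dataclass


@dataclass
class AIFeature:
    name: str                 # 1. what it does
    prompt_inputs: list[str]  # 2. data categories entering the prompt
    outputs: list[str]        # 3. data categories coming out
    invokers: list[str]       # 4. who or what can invoke it
    vendors: list[str]        # 5. model providers it depends on
    failure_impact: str       # 6. what happens if it's wrong


TICKET_SUMMARIZER = AIFeature(
    name="support-ticket-summarizer",
    prompt_inputs=["ticket body", "customer name"],
    outputs=["summary text"],
    invokers=["support-api service role"],
    vendors=["Anthropic"],
    failure_impact="misleading summary; agent reviews before sending",
)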

Then map each answer to a control. Below is a controls matrix we bring to client engagements. It's not exhaustive — you will add company-specific controls — but it is the baseline we use to structure the conversation with the auditor.

The Controls Matrix

Security (CC6.x)

Control ID | Description | Evidence
AI-SEC-01 | Access to LLM API keys and provider credentials is restricted to authorized services and rotated on schedule | Secrets manager policy, rotation logs
AI-SEC-02 | Access to training data and fine-tuning datasets is role-based and logged | IAM policies, CloudTrail, access reviews
AI-SEC-03 | MCP/tool servers exposed to agents require authentication and authorization per call | Auth middleware code, integration tests, audit logs
AI-SEC-04 | Prompt/response logs containing customer data are encrypted at rest with customer-managed or provider-managed keys | KMS configuration, S3 bucket policies
AI-SEC-05 | AI components are included in quarterly access reviews | Access review worksheet, sign-off
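
For AI-SEC-03 in particular, evidence is easiest to produce when authorization runs through a single chokepoint that also writes the audit log. A minimal sketch of per-call checks on a FastAPI tool server; the static token table is a stand-in for whatever real service identity you use (mTLS, OAuth), and the header name is illustrative:

# app/tools/server.py -- hypothetical per-call auth for an agent tool server (AI-SEC-03)
import logging

from fastapi import FastAPI, Header, HTTPException

logger = logging.getLogger("tool.audit")
app = FastAPI()

# Illustrative token-to-scope table; replace with real service identity.
TOKEN_SCOPES: dict[str, set[str]] = {
    "svc-support-agent": {"search_tickets", "summarize_ticket"},
}


@app.post("/tools/{tool}")
def invoke_tool(tool: str, x_service_token: str = Header(...)) -> dict[str, str]:
    scopes = TOKEN_SCOPES.get(x_service_token)
    if scopes is None or tool not in scopes:
        logger.warning("tool_denied tool=%s", tool)  # denial lands in the audit log
        raise HTTPException(status_code=403, detail="not authorized for this tool")
    logger.info("tool_invoked tool=%s caller=%s", tool, x_service_token)
    return {"status": "ok"}  # dispatch to the real tool implementation here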

Availability (A1.x)

Control ID | Description | Evidence
AI-AVL-01 | Model provider dependencies are documented with fallback strategies for each | Architecture doc, LLM gateway config
AI-AVL-02 | AI-dependent user journeys have graceful degradation when the LLM is unavailable | Feature flag config, tested failure modes
AI-AVL-03 | Incident response runbooks cover LLM provider outages | Runbook, game day records
AI-AVL-04 | Monitoring includes LLM latency, error rate, and cost SLOs | Grafana dashboards, alert rules
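
For AI-AVL-01 and AI-AVL-02, auditors want to see that the fallback path exists in code, not just in the architecture doc. A minimal sketch of a provider fallback chain; the provider callables are placeholders for your actual SDK calls:

# app/llm/fallback.py -- hypothetical fallback chain for AI-AVL-01/02
import logging
from collections.abc import Callable

logger = logging.getLogger("llm.gateway")


def complete_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
) -> str | None:
    """Try providers in order; return None so the caller can degrade gracefully."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            logger.exception("llm_provider_failed provider=%s", name)
    return None  # caller renders the non-AI fallback experience (AI-AVL-02)

Returning None instead of raising pushes the degradation decision to every caller, which is exactly the behavior AI-AVL-02 asks you to evidence.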

Processing Integrity (PI1.x)

Control ID | Description | Evidence
AI-PI-01 | Model changes (base model upgrade, prompt changes, fine-tune updates) follow a documented change management process | PR records, change advisory meeting notes
AI-PI-02 | Regression evaluation is run against a golden dataset before any material model change reaches production | Eval harness, CI logs, evaluation reports
AI-PI-03 | Outputs used for automated decisions are validated against a schema and rejected if malformed | Schema definitions, validation logs
AI-PI-04 | Hallucination and safety signals are monitored in production with alerting on regression | Monitoring dashboards, alert config
AI-PI-05 | Human oversight is required for irreversible actions generated by AI systems | Workflow definitions, human review logs
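
AI-PI-03 is often the cheapest Processing Integrity control to stand up. A minimal sketch using pydantic to gate model output before it drives an automated action; the decision schema and the amount ceiling are illustrative:

# app/llm/output_gate.py -- hypothetical output validation for AI-PI-03
import json

from pydantic import BaseModel, Field, ValidationError


class RefundDecision(BaseModel):
    ticket_id: str
    approve: bool
    amount_cents: int = Field(ge=0, le=50_000)  # hard ceiling on automated refunds


def parse_decision(raw: str) -> RefundDecision | None:
    """Validate model output; None routes the case to a human instead."""
    try:
        return RefundDecision.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None  # malformed output is rejected, per the control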

Confidentiality (C1.x)

Control ID | Description | Evidence
AI-CONF-01 | Sensitive data categories are documented; prompts are scanned for secrets and PII before transmission | Data classification, DLP logs
AI-CONF-02 | Data Processing Agreements with model providers include appropriate zero-retention, training opt-out, and region clauses | Signed DPAs
AI-CONF-03 | Prompt and response logs have defined retention periods and deletion workflows | Retention policy, lifecycle rules
AI-CONF-04 | Customer data is tenant-isolated in prompts, caches, and logs | Code review, penetration test
AI-CONF-05 | Confidentiality commitments are reflected in customer contracts and internal policies | MSAs, policy docs
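
For AI-CONF-03, the strongest evidence is the lifecycle rule itself, applied in code rather than clicked into a console. A minimal boto3 sketch; the bucket name, prefix, and 90-day period are assumptions to replace with your documented policy:

# scripts/prompt_log_retention.py -- hypothetical lifecycle rule for AI-CONF-03
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="acme-prompt-logs",  # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-prompt-logs",
                "Filter": {"Prefix": "prompts/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},  # match your documented retention period
            }
        ]
    },
)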

Privacy (P1.x to P8.x, if in scope)

Control ID | Description | Evidence
AI-PRIV-01 | Users are notified when they interact with an AI system | UI screenshots, product requirements
AI-PRIV-02 | User requests to delete or export data include AI-generated records and associated prompt logs | DSAR workflow, completion records
AI-PRIV-03 | Model training (if any) excludes customer data unless consent is obtained | Training data manifest, consent records
AI-PRIV-04 | Vendor DPAs cover cross-border data transfers with appropriate safeguards | SCCs, transfer impact assessment
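
For AI-PRIV-02, the recurring gap is a DSAR workflow that deletes application rows but not the prompt logs keyed to the same user. A minimal sketch of the deletion step; the three helpers are placeholders for your own storage layers:

# app/privacy/dsar.py -- hypothetical DSAR deletion covering AI records (AI-PRIV-02)
import logging

logger = logging.getLogger("dsar")


def delete_user_rows(user_id: str) -> None:
    """Placeholder: delete application database rows for the user."""


def delete_prompt_logs(user_id: str) -> None:
    """Placeholder: delete prompt/response log objects keyed to the user."""


def delete_ai_outputs(user_id: str) -> None:
    """Placeholder: delete stored AI-generated records about the user."""


def handle_dsar_deletion(user_id: str) -> None:
    delete_user_rows(user_id)
    delete_prompt_logs(user_id)  # the step teams forget
    delete_ai_outputs(user_id)
    logger.info("dsar_completed user_id=%s", user_id)  # completion evidence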

What Good Evidence Looks Like

SOC 2 auditors want evidence that is reproducible, dated, and tied to a specific control. For AI components we try to automate as much of it as possible.

Evidence For AI-PI-02 (Regression Eval)

The control says: a regression eval runs before any material model change reaches production. Good evidence looks like:

  • An evals/ directory in the repo with the golden dataset and the eval script.
  • CI logs showing the eval ran on every PR that changed the model or prompt config.
  • A summary output file committed to the release artifact.
  • An alert that fires if the eval fails or is skipped.
# .github/workflows/model-eval.yaml
name: model-eval

on:
  pull_request:
    paths:
      - "config/models/**"
      - "prompts/**"
      - "evals/**"

jobs:
  eval:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - name: Run golden dataset eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m evals.run \
            --dataset evals/golden.jsonl \
            --report out/eval-report.json
      - name: Fail on regression
        run: python -m evals.gate --report out/eval-report.json --min-score 0.88
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ github.sha }}
          path: out/eval-report.json

On audit day, you export the list of merged PRs for the period, cross-reference against artifacts, and show the auditor that every relevant change has a corresponding eval report. Job done.
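
That cross-reference is scriptable. A rough sketch against the GitHub REST API, assuming the job name eval from the workflow above; OWNER/REPO is a placeholder, and pagination plus filtering to PRs that actually touched model paths are omitted for brevity:

# scripts/audit_crossref.py -- rough sketch: confirm merged PRs ran the eval job
import os

import requests

API = "https://api.github.com/repos/OWNER/REPO"  # placeholder repo
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def eval_ran(head_sha: str) -> bool:
    """True if the eval job produced a successful check run on this commit."""
    resp = requests.get(f"{API}/commits/{head_sha}/check-runs", headers=HEADERS)
    resp.raise_for_status()
    return any(
        run["name"] == "eval" and run["conclusion"] == "success"
        for run in resp.json()["check_runs"]
    )


def main() -> None:
    resp = requests.get(f"{API}/pulls?state=closed&per_page=100", headers=HEADERS)
    resp.raise_for_status()
    for pr in resp.json():
        if not pr["merged_at"]:
            continue
        status = "ok" if eval_ran(pr["head"]["sha"]) else "MISSING EVAL"
        print(f"PR #{pr['number']} merged {pr['merged_at']}: {status}")


if __name__ == "__main__":
    main()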

Evidence For AI-CONF-01 (Prompt Scanning)

The control says: prompts are scanned for secrets and PII before transmission. Good evidence looks like:

  • The scanning code, with tests.
  • Production logs showing redaction events over a sample period.
  • A runbook for incidents where redaction fails.
# app/llm/redact.py
import re
from dataclasses import dataclass


PATTERNS: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"sk-ant-[A-Za-z0-9-]{20,}"), "[REDACTED_ANTHROPIC_KEY]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[REDACTED_CARD]"),
]


@dataclass
class RedactionResult:
    text: str
    hits: dict[str, int]


def redact(raw: str) -> RedactionResult:
    result = raw
    hits: dict[str, int] = {}
    for pattern, label in PATTERNS:
        matches = pattern.findall(result)
        if matches:
            hits[label] = len(matches)
            result = pattern.sub(label, result)
    return RedactionResult(text=result, hits=hits)

Paired with structured logging of redaction counts (not the content), the auditor can verify the control is operating over the audit period.
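
The call site is where that logging happens. A sketch of how middleware might use the helper above; the logger name and wrapper function are illustrative:

# app/llm/middleware.py -- hypothetical call site for the redaction helper
import logging

from app.llm.redact import redact

logger = logging.getLogger("llm.redaction")


def prepare_prompt(raw: str) -> str:
    result = redact(raw)
    if result.hits:
        # Log counts per label, never the matched content itself.
        logger.info("redaction_hits %s", result.hits)
    return result.text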

Vendor Due Diligence

For SOC 2 purposes, each model provider is a sub-service organization or a vendor, depending on how you scope. Either way the auditor wants to see your due diligence on them. Every major provider publishes a SOC 2 report under NDA and will share it on request.

Vendor evidence package we collect for every model provider in scope:

  • Latest SOC 2 Type II report (under NDA).
  • ISO 27001 certificate if available.
  • DPA with appropriate annexes.
  • Model-specific data handling terms (zero retention, training opt-out).
  • Incident notification commitments.
  • Contact for security issues.

File this in your vendor management system with an annual review date.

Common Findings And How To Avoid Them

In the SOC 2 engagements we've advised on, these are the findings that recur when AI features are added to scope.

Finding 1: Model changes bypass the change management process. Fine-tuning a model or updating a prompt feels like a config tweak. Auditors see it as a change to a system that produces customer-facing output. Require PR review and a change ticket for every prompt and model change.

Finding 2: No evidence of regression testing before release. "We eyeballed it" is not evidence. Build the eval harness even if it has ten test cases.

Finding 3: Prompt logs retained indefinitely. Without a lifecycle policy, S3 buckets pile up. Auditors flag this under Confidentiality and Privacy. Apply lifecycle rules and document the retention period.

Finding 4: No DPA with the model provider. Teams sign up with a credit card, paste production traffic through, and never sign a DPA. Fix before audit.

Finding 5: Training data provenance unclear. If you fine-tune, the auditor wants to know where the data came from, who approved it, and whether PII was removed. Maintain a dataset manifest.

Finding 6: No incident response plan for AI-specific issues. "What happens if the model outputs customer PII?" should have a documented answer and a tabletop exercise on file.

Finding 7: Customer contracts do not disclose AI use. Processing Integrity and Confidentiality commitments can be misaligned with product behavior. Update MSAs and privacy notices.

Extending An Existing Report

The mechanical steps to extend a SOC 2 to cover AI:

  1. Scope workshop with your auditor. Define which AI features are in scope and which TSCs apply.
  2. Gap assessment against the controls matrix above. Mark each as in place, partial, or missing.
  3. Remediate the partial and missing controls. For a team with existing hygiene, this is typically 4-8 weeks.
  4. Evidence collection for the audit period. If the new controls have only been in place for a month, your Type II will only attest to operating effectiveness for that month. Plan accordingly.
  5. Walkthrough with the auditor to confirm understanding.
  6. Fieldwork — standard SOC 2 sample testing.
  7. Report issuance.

In practice, most clients can add AI coverage to an in-progress audit at the next reporting period, not the current one — the auditor needs time-boxed evidence and you need a clean six months of operation.

A Pragmatic Short List

If you want to get ahead of this before your next audit, do these six things:

  1. Inventory every AI feature and classify it against the controls matrix.
  2. Build the regression eval harness. Run it in CI.
  3. Add prompt scanning middleware and log the hits.
  4. Sign DPAs with every model provider you use.
  5. Set a retention policy on prompt and response logs.
  6. Update your incident runbook with an AI branch.

Six items, roughly a quarter of focused work. The auditor will be happy; more importantly, so will your customers' security teams.

Next Steps

Extending SOC 2 to cover AI components is a project, not a rebuild. The foundations you already have — change management, access control, incident response — mostly carry over. The net-new work is in evaluation, data handling, and vendor management. Start with the matrix, map your features, and talk to your auditor early. If you want help running the gap assessment or building the evidence collection pipeline, get in touch.
