Dataset sanitization - HITRUST AI Security Assessment and Certification Specification

TOPIC: Filtering and sanitizing AI data, inputs, and outputs

HITRUST CSF requirement statement [?] (07.10bAISecSystem.2)

Data for 
(1) AI model training, 
(2) AI model fine-tuning, or 
(3) prompt enhancement via RAG—if used—is checked prior to usage (e.g., using statistical methods, through manual inspection, 
or through automated means) for suspicious unexpected values or patterns which could be adversarial or malicious in nature 
(e.g., poisoned samples).  
Identified anomalous entries are 
(4) removed.

Evaluative elements in this requirement statement [?]

1. Data for AI model training is checked prior to usage (e.g., using statistical 
methods, through manual inspection, or through automated means) for suspicious 
unexpected values or patterns which could be adversarial or malicious in nature 
(e.g., poisoned samples).

2. Data for AI model fine-tuning is checked prior to usage (e.g., using statistical 
methods, through manual inspection, or through automated means) for suspicious 
unexpected values or patterns which could be adversarial or malicious in nature 
(e.g., poisoned samples).

3. Data for prompt enhancement for RAG (if used) is checked prior to usage 
(e.g., using statistical methods, through manual inspection, or through automated 
means) for suspicious unexpected values or patterns which could be adversarial 
or malicious in nature (e.g., poisoned samples).

4. Identified suspicious unexpected values or patterns are removed.

Illustrative procedures for use during assessments [?]

Policy: Examine policies related to each evaluative element within the requirement statement. Validate the existence of a written or undocumented policy as defined in the HITRUST scoring rubric.
Procedure: Examine evidence that written or undocumented procedures exist as defined in the HITRUST scoring rubric. Determine if the procedures and address the operational aspects of how to perform each evaluative element within the requirement statement.
Implemented: Examine evidence that all evaluative elements within the requirement statement have been implemented as defined in the HITRUST scoring rubric, using a sample based test where possible for each evaluative element. Example test(s):
- For example, review the AI system to ensure data for AI model training, AI model tuning, or prompt enhancement via RAG if used, is checked prior to usage for anomalies such as unexpected values or patterns (e.g., using statistical methods, through manual inspection). Further, confirm that the identified anomalous entries are removed.
Measured: Examine measurements that formally evaluate and communicate the operation and/or performance of each evaluative element within the requirement statement. Determine the percentage of evaluative elements addressed by the organization’s operational and/or independent measure(s) or metric(s) as defined in the HITRUST scoring rubric. Determine if the measurements include independent and/or operational measure(s) or metric(s) as defined in the HITRUST scoring rubric. Example test(s):
- For example, measures indicate if the data for AI model training, AI model tuning, or prompt enhancement via RAG if used, is checked prior to usage for anomalies such as unexpected values or patterns (e.g., using statistical methods, through manual inspection). Reviews, tests, or audits are completed by the organization to measure the effectiveness of the implemented controls and to confirm the identified anomalous entries are removed.
Managed: Examine evidence that a written or undocumented risk treatment process exists, as defined in the HITRUST scoring rubric. Determine the frequency that the risk treatment process was applied to issues identified for each evaluative element within the requirement statement.

Placement of this requirement in the HITRUST CSF [?]

Assessment domain: 07 Vulnerability Management
Control category: 10.0 – Information Systems Acquisition, Development, and Maintenance
Control reference: 10.b – Input Data Validation

Specific to which parts of the overall AI system? [?]

AI application layer:

Prompt enhancement via RAG, and associated RAG data sources

AI platform layer

AI datasets and data pipelines

Discussed in which authoritative AI security sources? [?]

OWASP 2023 Top 10 for LLM Applications
Oct. 2023, © The OWASP Foundation
- Where:
  - LLM03: Training data poisoning > Prevention and Mitigation Strategies > Bullet #5
  - LLM06: Sensitive information disclosure > Prevention and Mitigation Strategies > Bullet #1

OWASP 2025 Top 10 for LLM Applications
2025, © The OWASP Foundation
- Where:
  - LLM02: Sensitive Information Disclosure > Prevention and Mitigation Strategies > Sanitization > Bullet #1
  - LLM08: Vector and embedding weaknesses > Prevention and Mitigation Strategies > Bullet #2

OWASP Machine Learning Security Top 10
2023, © The OWASP Foundation
- Where:
  - ML02:2023 Data poisoning attack > How to prevent > Bullet #1

OWASP AI Exchange
2024, © The OWASP Foundation
- Where:
  - #DATAQUALITYCONTROL

MITRE ATLAS
2024, © The MITRE Corporation
- Where:
  - AML.M0007

NIST AI 100-2 E2023: Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
Jan. 2024, National Institute of Standards and Technology (NIST)
- Where:
  - 2. Predictive AI taxonomy > 2.3. Poisoning attacks and mitigations > 2.3.1. Availability poisoning
  - 2. Predictive AI taxonomy > 2.3. Poisoning attacks and mitigations > 2.3.3. Backdoor poisoning

Guidelines for Secure AI System Development
Nov. 2023, Cybersecurity & Infrastructure Security Agency (CISA)
- Where:
  - 1. Secure design > Design your system for security as well as functionality and performance

Securing Machine Learning Algorithms
2021, © European Union Agency for Cybersecurity (ENISA)
- Where:
  - 4.1- Security Controls > Technical: Control all data used by the ML model > Use methods to clean the training dataset from suspicious samples

Discussed in which commercial AI security sources? [?]

Databricks AI Security Framework
Sept. 2024, © Databricks
- Where:
  - DASF 7: Enforce data quality checks on batch and streaming datasets
  - DASF 15: Explore datasets and identify problems
Snowflake AI Security Framework
2024, © Snowflake Inc.
- Where:
  - Adversarial samples > Mitigations > Input preprocessing
  - Model poisoning > Mitigations > Bullet 2
  - Training data poisoning > Mitigations > Dataset sanitization

Control functions against which AI security threats? [?]

Control function: Corrective
Control function: Detective

Additional information

Q: When will this requirement included in an assessment? [?]
- This requirement is included when the assessment’s in-scope AI system(s) leverage data-driven AI models (e.g., non-generative machine learning models, generative AI models).
- The Security for AI systems regulatory factor must also be present in the assessment.

Q: Will this requirement be externally inheritable? [?] [?]
- Yes, fully. This requirement may be the sole responsibility of the AI platform provider and/or model creator. Or, depending on the AI system’s architecture, only evaluative elements that are the sole responsibility of the AI platform provider and/or model creator apply.

TOPIC: Filtering and sanitizing AI data, inputs, and outputs

Input filtering