Dicom Tools

DICOM De-Identification: How to Safely Remove PHI from Medical Images

Introduction: Why DICOM De-Identification Matters

Medical imaging is one of the most valuable data sources in modern healthcare research. From training artificial intelligence models for radiology to conducting multi-site clinical trials, sharing DICOM files across institutions accelerates medical breakthroughs. However, every DICOM file carries a hidden cargo: Protected Health Information (PHI) embedded in its metadata tags, and sometimes even burned directly into the pixel data itself.

Under HIPAA and equivalent privacy regulations worldwide, sharing identifiable patient data without proper authorization is a serious violation that can result in fines, institutional sanctions, and loss of patient trust. De-identification is the process of systematically removing or transforming PHI so that DICOM files can be shared safely for secondary use such as research, education, and quality improvement.

This article provides a comprehensive guide to DICOM de-identification. We cover which tags contain PHI, the three main approaches to removing it, hidden risks like burned-in annotations and UID re-linking, and a practical audit workflow you can follow using a DICOM tag viewer to verify your results.

Which DICOM Tags Contain PHI

The DICOM standard defines hundreds of metadata tags that describe the patient, the study, the equipment, and the acquisition parameters. A significant subset of these tags contains information that can directly or indirectly identify a patient. The authoritative list is in DICOM PS3.15 Annex E (Attribute Confidentiality Profiles), which was introduced by Supplement 142, but understanding the key categories is essential for any de-identification effort.

Direct Patient Identifiers

These tags contain information that directly identifies the patient:

  • Patient Name (0010,0010) — the full name of the patient, often in the format LastName^FirstName^MiddleName.
  • Patient ID (0010,0020) — the medical record number or other institutional identifier assigned to the patient.
  • Patient Birth Date (0010,0030) — the date of birth, which in combination with other data points is a powerful identifier.
  • Patient Birth Time (0010,0032) — the time of birth, less commonly populated but still PHI when present.
  • Patient Sex (0010,0040) — not one of the 18 HIPAA Safe Harbor identifiers and not uniquely identifying on its own, but it narrows the identification pool when combined with other attributes.
  • Patient Age (0010,1010) — can reveal identity when combined with other demographic data; Safe Harbor requires ages over 89 to be aggregated into a single "90 or older" category.
  • Other Patient IDs (0010,1000) — additional identifiers from other systems, such as insurance numbers or national IDs.

Institutional and Provider Identifiers

These tags reveal where and by whom the study was performed, which can be used to trace back to the patient through scheduling records:

  • Institution Name (0008,0080) — the name of the hospital or imaging center.
  • Institution Address (0008,0081) — the physical address of the facility.
  • Referring Physician's Name (0008,0090) — the name of the ordering physician.
  • Performing Physician's Name (0008,1050) — the name of the physician who performed the procedure.
  • Operators' Name (0008,1070) — the technologist who operated the equipment.
  • Station Name (0008,1010) — the name or identifier of the imaging device, which can reveal facility details.

Study and Accession Identifiers

These tags link the DICOM file to specific medical events:

  • Accession Number (0008,0050) — the unique identifier assigned by the radiology information system for the imaging order.
  • Study ID (0020,0010) — an identifier for the study within the institution.
  • Study Date (0008,0020) and Study Time (0008,0030) — when the study was performed.
  • Acquisition Date (0008,0022) and Series Date (0008,0021) — dates recorded at the acquisition and series level.

Other PHI-Bearing Tags

Several less obvious tags also carry PHI:

  • Study Description (0008,1030) — free-text field that may contain patient-identifying information entered by the technologist.
  • Image Comments (0020,4000) — free-text annotations that may reference the patient by name.
  • Request Attributes Sequence (0040,0275) — may contain scheduling information with patient details.
  • Content Sequence (0040,A730) — the structured reporting container, whose nested items may embed patient identifiers.
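As a rough illustration of how the tag categories above can drive an automated check, the sketch below models a DICOM header as a plain Python dict keyed by (group, element) tuples. That representation, the PHI_TAGS catalogue, and the find_populated_phi helper are assumptions made for this example; a real pipeline would read the header with a DICOM library and use the full PS3.15 list.

```python
# Hypothetical, partial catalogue of the tags discussed above.
PHI_TAGS = {
    (0x0010, 0x0010): "Patient Name",
    (0x0010, 0x0020): "Patient ID",
    (0x0010, 0x0030): "Patient Birth Date",
    (0x0008, 0x0080): "Institution Name",
    (0x0008, 0x0090): "Referring Physician Name",
    (0x0008, 0x0050): "Accession Number",
    (0x0008, 0x0020): "Study Date",
    (0x0008, 0x1030): "Study Description",
    (0x0020, 0x4000): "Image Comments",
}


def find_populated_phi(header):
    """Return {"(gggg,eeee) Name": value} for catalogued tags that hold a value."""
    found = {}
    for tag, name in PHI_TAGS.items():
        value = header.get(tag)
        if value not in (None, ""):
            group, element = tag
            found[f"({group:04X},{element:04X}) {name}"] = value
    return found


example_header = {
    (0x0010, 0x0010): "Smith^John",
    (0x0010, 0x0020): "",            # present but empty, so not reported
    (0x0008, 0x0060): "US",          # Modality, not in the catalogue
    (0x0008, 0x1030): "Abdomen follow-up for J. Smith",
}
report = find_populated_phi(example_header)
```

A scan like this only flags which tags hold a value. Free-text fields such as Study Description still need a human or pattern-based review to decide whether the content is identifying.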

Three Approaches to De-Identification

There is no single correct way to remove PHI from DICOM files. The right approach depends on your use case, regulatory requirements, and whether you need to re-link data later. The three primary strategies are removal, replacement, and pseudonymization.

Approach 1: Remove (Delete) Tags

The simplest approach is to delete PHI-bearing tags entirely. The tag is removed from the DICOM header, leaving no trace of the original value. This is the most conservative approach and provides the strongest privacy guarantee.

Advantages: Maximum privacy protection. No residual data to leak. Simple to implement and verify.

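Disadvantages: Removes information that may be needed for research. For example, deleting Study Date makes it impossible to analyze temporal patterns in disease progression. Some DICOM applications may also fail if required attributes are missing: Type 1 attributes must be present with a value, and Type 2 attributes (such as Patient Name) must be present even if their value is empty.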

Best for: Public datasets, educational materials, and situations where no re-linking to clinical data is ever needed.
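A minimal sketch of the removal approach, assuming the header is held as a plain dict keyed by (group, element) tuples rather than read through a DICOM library. The remove_tags helper and its tags_to_blank option are hypothetical; the option is there because some attributes must remain present in the file and can only be emptied, not deleted.

```python
def remove_tags(header, tags_to_remove, tags_to_blank=()):
    """Return a cleaned copy of header plus the sorted list of deleted tags.

    tags_to_remove are dropped entirely; tags_to_blank are kept with an
    empty value, for attributes that must stay present in the file.
    """
    cleaned = {}
    deleted = []
    for tag, value in header.items():
        if tag in tags_to_blank:
            cleaned[tag] = ""
        elif tag in tags_to_remove:
            deleted.append(tag)
        else:
            cleaned[tag] = value
    return cleaned, sorted(deleted)


header = {
    (0x0010, 0x0010): "Smith^John",        # Patient Name
    (0x0010, 0x1000): "INS-99881",         # Other Patient IDs
    (0x0008, 0x0080): "General Hospital",  # Institution Name
    (0x0008, 0x0060): "CT",                # Modality
}
cleaned, deleted = remove_tags(
    header,
    tags_to_remove={(0x0010, 0x1000), (0x0008, 0x0080)},
    tags_to_blank={(0x0010, 0x0010)},
)
```

Returning the list of deleted tags alongside the cleaned copy makes it easy to record what was removed in the de-identification log.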

Approach 2: Replace with Dummy Values

Instead of deleting tags, you can replace PHI values with generic placeholders. Patient Name becomes "ANONYMOUS", Patient ID becomes "000000", dates are shifted or set to a fixed value, and institution names are replaced with "INSTITUTION".

Advantages: Preserves the DICOM structure. Applications that require certain tags to be present will still function correctly. Date shifting (adding or subtracting a random but consistent offset) preserves temporal relationships between studies.

Disadvantages: Requires careful selection of replacement values. Naive replacements (e.g., setting all dates to January 1, 2000) can inadvertently create patterns that aid re-identification. Replacement values must be consistent across all files for the same patient to maintain data integrity.

Best for: Multi-site research collaborations where DICOM-compliant file structure must be maintained.
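Date shifting can be sketched as follows. This example derives the per-patient offset from a keyed hash of the patient ID instead of drawing it at random and storing it, which is one way (not the only one) to guarantee that every file for the same patient gets the same shift. The function names, the 365-day window, and the key handling are assumptions for illustration.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def patient_offset_days(patient_id, secret_key, max_days=365):
    """Derive a stable negative offset (1..max_days into the past) per patient."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return -(int.from_bytes(digest[:4], "big") % max_days + 1)


def shift_da(da_value, offset_days):
    """Shift a DICOM DA string (YYYYMMDD) by offset_days."""
    shifted = datetime.strptime(da_value, "%Y%m%d") + timedelta(days=offset_days)
    return shifted.strftime("%Y%m%d")


key = b"example-key-kept-outside-the-dataset"  # placeholder secret
offset = patient_offset_days("MRN123", key)

# Two studies from the same patient keep their original spacing.
baseline = shift_da("20230110", offset)
follow_up = shift_da("20230410", offset)
```

Because both dates move by the same number of days, the interval between baseline and follow-up is unchanged, which is the property longitudinal analyses depend on.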

Approach 3: Hash or Pseudonymize

Pseudonymization replaces identifiers with coded values derived from the original data, typically using a one-way cryptographic hash (such as SHA-256) with a secret salt. The patient name "John Smith" might become "A7F3B2C1D4E5". The same input always produces the same output, enabling re-linking across datasets without exposing the original identity.

Advantages: Enables longitudinal research by linking records across time points and institutions. The original identity cannot be recovered without the salt, which is held separately under strict access controls.

Disadvantages: The data is pseudonymized, not truly anonymous. If the salt is compromised, an attacker can hash candidate names or record numbers and match them against the pseudonyms, so identities can be recovered. Regulatory frameworks like GDPR treat pseudonymized data as personal data, meaning additional safeguards are required.

Best for: Clinical trials, longitudinal studies, and biobank research where subjects must be tracked over time.
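A minimal pseudonymization sketch using Python's standard hmac module. It uses HMAC-SHA-256, a keyed variant of the salted hash described above; the pseudonymize function, the 16-character truncation, and the example keys are assumptions for illustration, not a prescribed scheme.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key, length=16):
    """Map an identifier to a stable uppercase hex code using a secret key."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length].upper()


site_key = b"example-secret-stored-separately"  # placeholder secret
code_visit_1 = pseudonymize("MRN123", site_key)
code_visit_2 = pseudonymize("MRN123", site_key)   # same patient, later study
code_other_key = pseudonymize("MRN123", b"a-different-secret")
```

The same identifier and key always give the same code, so records can be linked across time points, while a different key yields an unrelated code. Truncating the digest shortens the value to fit DICOM string fields at the cost of a small collision probability.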

Burned-In Annotations: The Hidden PHI Risk in Pixel Data

One of the most dangerous and frequently overlooked sources of PHI in DICOM files is burned-in annotations. These are text overlays that have been rendered directly into the image pixel data, making them invisible to tag-level de-identification tools.

Where Burned-In PHI Appears

Burned-in annotations are particularly common in certain modalities:

  • Ultrasound (US) — ultrasound machines frequently burn the patient name, date of birth, and facility name directly into the image frame. This is a legacy practice from the era of film-based recording.
  • Computed Radiography (CR) and Digital Radiography (DR) — some CR/DR systems include header bars at the top or bottom of the image containing patient demographics and study information.
  • Secondary Capture (SC) — images captured from screens or converted from non-DICOM formats often carry embedded text overlays from the original display.
  • Nuclear Medicine (NM) — some gamma camera systems burn patient information into the image matrix.

Detecting and Removing Burned-In PHI

Because burned-in annotations live in the pixel data, they cannot be removed by simply deleting or replacing DICOM tags. Detection and removal require different strategies:

Manual review: A human reviewer examines each image for visible text. This is the most reliable method but does not scale for large datasets.

Region masking: For modalities known to have burned-in annotations in consistent locations (e.g., the top 50 pixel rows of an ultrasound frame), you can apply a black rectangle to mask those regions. This is fast but risks obscuring diagnostic content or missing annotations in unexpected locations.

OCR-based detection: Optical character recognition can scan the pixel data for text strings that match known PHI patterns (names, dates, MRNs). This approach is more scalable than manual review but may miss unusual fonts or low-contrast text.

The DICOM tag Burned In Annotation (0028,0301) is supposed to indicate whether an image contains burned-in text, but it is not always populated accurately. Never rely solely on this tag to determine whether pixel-level PHI exists.
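Region masking can be sketched without any imaging library by treating each frame as a list of pixel rows. The mask_top_rows and mask_all_frames helpers are hypothetical, and the number of rows to mask is an assumption that has to be established per device and software version.

```python
def mask_top_rows(frame, rows_to_mask, fill_value=0):
    """Return a copy of a 2D frame (list of rows) with the top rows overwritten."""
    if rows_to_mask < 0:
        raise ValueError("rows_to_mask must be non-negative")
    masked = []
    for row_index, row in enumerate(frame):
        if row_index < rows_to_mask:
            masked.append([fill_value] * len(row))
        else:
            masked.append(list(row))
    return masked


def mask_all_frames(frames, rows_to_mask):
    """Apply the same mask to every frame of a multi-frame image."""
    return [mask_top_rows(frame, rows_to_mask) for frame in frames]


# Tiny 4x3 stand-in for an ultrasound frame; 7 marks unmasked pixels.
frame = [[7, 7, 7] for _ in range(4)]
clip = [frame, frame]
masked_clip = mask_all_frames(clip, rows_to_mask=2)
```

Masking every frame matters for cine loops, where the banner is repeated on each frame. A fixed mask still needs a visual spot check, because annotation positions can vary between presets.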

UID Re-Linking Risk: Why Instance UIDs Need Regeneration

Every DICOM study, series, and individual image carries unique identifiers (UIDs) that enable PACS systems to organize and retrieve imaging data. These UIDs include:

  • Study Instance UID (0020,000D) — identifies the entire imaging study.
  • Series Instance UID (0020,000E) — identifies a series within the study.
  • SOP Instance UID (0008,0018) — identifies the individual DICOM object.

While UIDs do not contain human-readable PHI, they pose a significant re-linking risk. If the original UIDs are preserved in the de-identified dataset, anyone with access to the source PACS can search for those UIDs and retrieve the original, fully identified study. This completely defeats the purpose of de-identification.

The solution is to regenerate all UIDs during de-identification, replacing them with new, globally unique values. The mapping between old and new UIDs must be maintained internally to preserve the study/series/instance hierarchy, but the mapping table itself must be stored securely and never shared with the de-identified data.

Pay special attention to UIDs in Referenced SOP Instance UID fields within structured reports, presentation states, and key image notes. These cross-references must be updated consistently to maintain data integrity.
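One possible way to regenerate UIDs while keeping cross-references consistent is a small mapping object. The UidMapper class below is a hypothetical helper; it builds new values under the 2.25 root, which DICOM allows for UIDs derived from a UUID, and the example UID is made up.

```python
import uuid


class UidMapper:
    """Hand out a fresh UID per original UID and remember the pairing."""

    def __init__(self):
        self._mapping = {}  # original UID -> replacement UID; never share this

    def replace(self, original_uid):
        if original_uid not in self._mapping:
            # "2.25." plus a UUID written as a decimal integer stays within
            # the 64-character UID limit (at most 44 characters).
            self._mapping[original_uid] = f"2.25.{uuid.uuid4().int}"
        return self._mapping[original_uid]


mapper = UidMapper()
original_sop_uid = "1.2.840.99999.1.2.3"  # made-up example UID
image_uid = mapper.replace(original_sop_uid)
# A structured report that references the same image gets the same new UID.
referenced_uid = mapper.replace(original_sop_uid)
```

Routing every UID field, including Referenced SOP Instance UID values, through the same mapper instance is what keeps the study, series, and instance relationships intact after regeneration.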

Private Tags: Vendor-Specific PHI

The DICOM standard reserves tag groups with odd numbers (e.g., 0009, 0019, 0029) for private tags defined by equipment vendors. These tags may contain proprietary acquisition parameters, calibration data, or reconstruction settings. Critically, some vendors also store PHI in private tags.

Examples of PHI found in private tags include:

  • Patient name or ID duplicated in vendor-specific fields for proprietary workflows.
  • Technologist names or operator IDs.
  • Facility identifiers or department names.
  • Free-text fields that may contain clinical notes with patient references.

Because private tags are not standardized, there is no universal list of which ones contain PHI. The safest approach is to remove all private tags unless you have verified with the vendor that specific tags are safe to retain. Some private tags contain valuable technical data (e.g., diffusion tensor imaging parameters on Siemens scanners), so researchers may want to whitelist specific known-safe private tags after careful review.
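The remove-by-default policy with an explicit allowlist can be sketched as below, again modelling the header as a dict keyed by (group, element) tuples. The helper names and the two private tags in the example are hypothetical and do not refer to any specific vendor's dictionary.

```python
def is_private(tag):
    """Private elements live in odd-numbered groups."""
    group, _element = tag
    return group % 2 == 1


def strip_private_tags(header, allowlist=frozenset()):
    """Drop every private tag unless it has been explicitly vetted."""
    return {
        tag: value
        for tag, value in header.items()
        if not is_private(tag) or tag in allowlist
    }


header = {
    (0x0008, 0x0060): "MR",          # standard tag (even group)
    (0x0009, 0x1001): "Smith^John",  # hypothetical private tag holding PHI
    (0x0019, 0x100C): "1000",        # hypothetical vetted acquisition value
}
vetted = frozenset({(0x0019, 0x100C)})
kept = strip_private_tags(header, allowlist=vetted)
```

Keeping the allowlist empty by default means a private tag survives only when someone has made a deliberate, reviewable decision to retain it.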

Step-by-Step Audit Workflow Using a DICOM Tag Viewer

De-identification is only as good as its verification. A rigorous audit workflow ensures that no PHI slips through. Here is a practical step-by-step process using a DICOM tag viewer to inspect and validate your de-identified files.

Step 1: Inspect the Original File

Before de-identification, load the original DICOM file into the tag viewer to establish a baseline. Document all tags that contain PHI. Pay attention to free-text fields like Study Description and Image Comments, which may contain patient names even though they are not primary identifier tags. Note the Study Instance UID and SOP Instance UID for later comparison.

Step 2: Run De-Identification

Apply your chosen de-identification tool or script to the file. Use the Basic Application Level Confidentiality Profile (DICOM PS3.15 Annex E, introduced by Supplement 142) as your starting point, then customize it for your specific research requirements.

Step 3: Verify Tag-Level Removal

Load the de-identified file into the tag viewer and systematically check every tag that was flagged in Step 1. Verify that Patient Name, Patient ID, Patient Birth Date, Accession Number, and all other identified PHI tags have been removed, replaced, or pseudonymized as expected. Check that UIDs have been regenerated and do not match the originals.
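The before-and-after comparison in this step can be partly automated. The sketch below assumes both headers have been exported to dicts keyed by (group, element) tuples; the unchanged_tags helper and the example values are hypothetical.

```python
def unchanged_tags(original, deidentified, tags_to_check):
    """Return the tags whose original non-empty value survives unchanged."""
    leaks = []
    for tag in tags_to_check:
        before = original.get(tag)
        after = deidentified.get(tag)
        if before not in (None, "") and before == after:
            leaks.append(tag)
    return leaks


original = {
    (0x0010, 0x0010): "Smith^John",           # Patient Name
    (0x0010, 0x0020): "MRN123",               # Patient ID
    (0x0020, 0x000D): "1.2.840.99999.1.1",    # Study Instance UID (made up)
}
deidentified = {
    (0x0010, 0x0010): "ANONYMOUS",
    (0x0010, 0x0020): "A7F3B2C1D4E5",
    (0x0020, 0x000D): "1.2.840.99999.1.1",    # UID was not regenerated
}
leaks = unchanged_tags(original, deidentified, list(original))
```

Here the check reports the Study Instance UID, which was carried over unchanged. An empty result only shows that values differ from the originals; it does not prove the new values are free of PHI.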

Step 4: Check Private Tags

Filter the tag list to show only private tags (odd-numbered groups). Verify that all private tags have been removed, or that any retained private tags have been individually reviewed and confirmed to be free of PHI.

Step 5: Review Pixel Data

View the image itself to check for burned-in annotations. Pay particular attention to the corners and edges of the image, header and footer regions, and any text overlays. For ultrasound images, check every frame in a multi-frame file, not just the first one.

Step 6: Spot-Check a Random Sample

For large datasets, perform the full audit on a random sample large enough to be representative (typically 5-10% of files, with a minimum of 30 files). Document the sample selection method and results. If any failures are found, increase the sample size or audit the entire dataset.
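The sampling rule can be written down so that the selection is reproducible and easy to document. In this sketch the audit_sample helper is hypothetical; its defaults follow the 5% and 30-file figures above, and the seed parameter is an assumption added so the same sample can be regenerated for the audit record.

```python
import random


def audit_sample(file_names, percent=5, minimum=30, seed=None):
    """Pick a random audit sample: percent of the files, at least `minimum`."""
    population = sorted(file_names)
    by_percent = (len(population) * percent + 99) // 100  # integer ceiling
    size = min(len(population), max(minimum, by_percent))
    return random.Random(seed).sample(population, size)


files = [f"img_{i:04d}.dcm" for i in range(1000)]
sample = audit_sample(files, seed=2024)
```

Recording the seed together with the sorted file list is enough to show a reviewer exactly how the sample was drawn.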

Step 7: Document and Archive

Maintain a de-identification log that records the date, the tool and profile used, the number of files processed, sample audit results, and the name of the reviewer. This documentation is essential for regulatory compliance and institutional review board (IRB) audits.
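A log entry with the fields listed above could be captured as one JSON line per run. The DeidLogEntry structure, its field names, and the example values are assumptions for illustration; institutions typically have their own required format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DeidLogEntry:
    date: str             # ISO date of the de-identification run
    tool: str             # tool or script name and version
    profile: str          # confidentiality profile or rule set applied
    files_processed: int
    files_audited: int
    audit_failures: int
    reviewer: str


def to_json_line(entry):
    """Serialize one log entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = DeidLogEntry(
    date="2024-03-01",
    tool="example-deid-script 1.0",   # placeholder name
    profile="Basic profile plus site rules",
    files_processed=1200,
    files_audited=60,
    audit_failures=0,
    reviewer="J. Doe",
)
log_line = to_json_line(entry)
```

Appending one line per run to a write-once log gives auditors a simple chronological record.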

Common De-Identification Failures and How to Catch Them

Even experienced teams encounter de-identification failures. Here are the most common mistakes and how to detect them:

  • Incomplete tag lists: Using an outdated or incomplete list of PHI tags. Always reference the current edition of DICOM PS3.15 Annex E (which incorporates Supplement 142) and check for new tags added in recent editions of the standard.
  • Missed sequence tags: PHI can be nested inside DICOM sequences (e.g., Request Attributes Sequence, Referenced Patient Sequence). Ensure your de-identification tool traverses nested sequences recursively.
  • Inconsistent date shifting: If dates are shifted by different offsets for the same patient across studies, temporal relationships are destroyed. Use a consistent offset per patient, stored securely.
  • Preserved UIDs: Forgetting to regenerate UIDs is one of the most common and dangerous failures. Always verify that Study, Series, and SOP Instance UIDs have changed.
  • Burned-in text ignored: Relying solely on tag-level de-identification without checking pixel data. This is especially risky for ultrasound and secondary capture modalities.
  • Private tag retention: Assuming private tags are safe because they do not appear in the standard PHI list. Always remove or individually vet private tags.
  • File name leakage: The DICOM file may be de-identified internally, but the filename on disk might still contain the patient name or MRN. Always rename files as part of the de-identification process.
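The nested-sequence failure in the list above is the easiest one to miss, so here is a sketch of a recursive traversal. It assumes a dataset is modelled as a dict keyed by (group, element) tuples, with a sequence represented as a list of such dicts; the helper names and the example content are hypothetical.

```python
def iter_elements(dataset, path=()):
    """Yield (path, tag, value) for every non-sequence element, at any depth."""
    for tag, value in dataset.items():
        is_sequence = isinstance(value, list) and all(
            isinstance(item, dict) for item in value
        )
        if is_sequence:
            for index, item in enumerate(value):
                yield from iter_elements(item, path + ((tag, index),))
        else:
            yield path, tag, value


def find_tag_everywhere(dataset, wanted_tag):
    """Return (path, value) for each occurrence of wanted_tag, nested or not."""
    return [
        (path, value)
        for path, tag, value in iter_elements(dataset)
        if tag == wanted_tag
    ]


PATIENT_NAME = (0x0010, 0x0010)
dataset = {
    PATIENT_NAME: "",                       # top level was blanked
    (0x0040, 0x0275): [                     # Request Attributes Sequence
        {(0x0040, 0x0007): "CT CHEST", PATIENT_NAME: "Smith^John"},
    ],
}
hits = find_tag_everywhere(dataset, PATIENT_NAME)
```

In this example the top-level value is empty, but the copy inside the first sequence item still holds the name. The path in each result shows which sequence and item index to inspect.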

Regulatory Context: HIPAA Safe Harbor vs Expert Determination

In the United States, HIPAA provides two methods for de-identifying health information, and understanding both is important for DICOM de-identification projects.

Safe Harbor Method

The Safe Harbor method (45 CFR 164.514(b)(2)) requires the removal of 18 specific categories of identifiers, including names, geographic subdivisions smaller than a state, all elements of dates (except year) directly related to the individual, phone numbers, email addresses, Social Security numbers, medical record numbers, and more. For DICOM files, this means removing all direct identifiers, truncating ZIP codes to the first three digits (replaced with 000 where the three-digit area contains 20,000 or fewer people), and ensuring that ages over 89 are aggregated into a single "90 or older" category.

The advantage of Safe Harbor is that it provides a clear, prescriptive checklist. The disadvantage is that it may require removing data elements that would be valuable for research.
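The age rule can be illustrated on the DICOM Age String format (three digits followed by D, W, M, or Y). The cap_age helper is hypothetical, and writing the aggregated category as "090Y" is an assumed convention that should be documented alongside the dataset.

```python
import re

_AS_PATTERN = re.compile(r"^(\d{3})([DWMY])$")


def cap_age(age_string):
    """Cap a DICOM Age String such as '093Y' at 90 years; keep younger ages."""
    match = _AS_PATTERN.match(age_string)
    if match is None:
        raise ValueError(f"not a valid Age String: {age_string!r}")
    number, unit = int(match.group(1)), match.group(2)
    # Only values in years can exceed 89; 999 months is still under 84 years.
    if unit == "Y" and number > 89:
        return "090Y"
    return age_string


capped = [cap_age(age) for age in ["093Y", "089Y", "090Y", "006M"]]
```

Ages of 89 and under pass through unchanged, while anything above is collapsed into the single top category.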

Expert Determination Method

The Expert Determination method (45 CFR 164.514(b)(1)) allows a qualified statistical expert to determine that the risk of identifying an individual from the data is "very small." This approach is more flexible and can permit retention of data elements that Safe Harbor would require removing, such as specific dates or geographic regions, provided the expert can demonstrate that the re-identification risk is acceptably low given the intended recipient and data environment.

For large-scale imaging research projects, Expert Determination is often preferred because it preserves clinically important data while still meeting the legal standard for de-identification. However, it requires engaging a qualified expert and documenting the statistical analysis, which adds cost and time to the project.

International Considerations

Outside the United States, regulations like the EU General Data Protection Regulation (GDPR), Canada's PIPEDA, and Australia's Privacy Act impose similar but not identical requirements. GDPR in particular treats pseudonymized data as personal data, meaning that even hashed identifiers trigger data protection obligations. When sharing DICOM data internationally, apply the most restrictive applicable standard.

Conclusion

DICOM de-identification is a critical step in enabling medical imaging research while protecting patient privacy. It requires more than simply deleting a few obvious tags. A thorough de-identification process addresses tag-level PHI across all standard and private tags, regenerates UIDs to prevent re-linking, detects and removes burned-in annotations from pixel data, and follows a documented audit workflow to verify results.

By understanding the risks and applying the techniques described in this article, you can share medical imaging data confidently and compliantly. Start by using a DICOM tag viewer to inspect your files before and after de-identification, and always document your process for regulatory review. The goal is not just to meet a compliance checkbox but to genuinely protect the patients whose images make research possible.
