"Unstructured data contains the REAL value in research. Everyone should be using it..."
Dr. Keith Argenbright, Associate Professor, Medical Director, UT Southwestern/Moncrief Cancer Institute
Removing PHI from textual records is the “last mile” challenge to fully realizing the benefits of healthcare data mining. Access to data is increasingly important for achieving clinical quality and outcomes objectives, but access to patient records is severely limited under Federal patient privacy laws. Anything that contains PROTECTED HEALTH INFORMATION (“PHI”) – defined by 18 categories – is protected. Even within an institution, access is limited by the “right to review,” which means that hospital administration is restricted in what kind of data it can use without permission. Data sharing across institutions – an important way to get good case aggregation – is virtually impossible.
These restrictions can generally be addressed through the use of appropriate agreements, but de-identified data is HIPAA-exempt and offers the highest degree of flexibility of access and use.
In this light, the ability to remove PHI becomes a critical success factor for clinical research and for the kind of analytics and data mining that is standard in the business world. De-identification options are limited: there are intensive and expensive manual methods and some very limited open-source software, but both approaches are cumbersome, prone to error, and have the added problem of removing too much information – overmarking – which leaves the records with limited usefulness from a data perspective.


