Vishal Gupta
Data classification can be viewed as the act of putting data in buckets, based on the criteria of confidentiality, criticality, sensitivity/access control and retention.
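To make the idea concrete, a classification label can be thought of as a small record attached to each piece of data, with one field per criterion. The Python sketch below is purely illustrative; the field names and values are assumptions, not a standard schema.

from dataclasses import dataclass
from enum import Enum

class Confidentiality(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3

@dataclass
class ClassificationLabel:
    # Hypothetical fields, one per criterion named above
    confidentiality: Confidentiality
    criticality: int        # e.g. 1 (low) to 3 (high)
    access_group: str       # who may read the data
    retention_years: int    # how long the data must be kept

label = ClassificationLabel(Confidentiality.CONFIDENTIAL, 3, "finance", 7)
print(label)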
The basis for framing a data classification policy generally decides the selection of controls. Whether the onus of data classification should be on the individual or on a central automated mechanism is the million-dollar question.
Both methods have their own challenges. When data classification is left to the end user, training is required for the classification to be applied correctly, and this challenge multiplies as the organization grows. One example of human-based data classification is the solution developed by security and compliance solution provider TITUS.
On the other hand, automated systems are based on contextual rules that look for patterns in data. The challenge with pattern-based data classification is the incidence of false positives, along with false negatives: no data classification policy can foresee every situation, so sensitive data that does not match the predefined patterns will be missed by the system. Vendors such as Websense, Symantec and Autonomy provide such automated services.
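As a rough illustration of how such contextual rules work, the sketch below matches text against a couple of regular-expression patterns. The patterns and tags are hypothetical and not drawn from any vendor's product; the example also shows why both false positives and false negatives are inherent to this approach.

import re

# Hypothetical contextual rules: each pattern maps to a classification tag.
RULES = {
    "Indian PAN number": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "Payment card number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify(text):
    # Return the tags whose patterns appear in the text.
    return [tag for tag, pattern in RULES.items() if pattern.search(text)]

print(classify("Paid with card 4111 1111 1111 1111"))   # -> ['Payment card number']
# A sensitive memo containing no such pattern returns [] (a false negative),
# while an innocuous 16-digit order ID would be flagged (a false positive).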
The need of the hour is a hybrid model that can seamlessly integrate human intelligence with an automated system for data classification.
For instance, one might conclude that combining the solution from Websense with that of TITUS fits the bill. This might not always work in practice – no single solution is available that handles both these aspects of data classification simultaneously.
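A hybrid model might look something like the following sketch: an automated pass proposes a label, and the author confirms or overrides it before the document is saved or sent. The function names and the toy keyword rule are assumptions for illustration only.

def suggest_label(text):
    # Automated pass: a toy keyword rule proposes a label.
    return "confidential" if "salary" in text.lower() else "internal"

def hybrid_classify(text, ask_user):
    # ask_user is any callable that shows the suggestion to the author
    # and returns the final label, possibly overriding it.
    suggestion = suggest_label(text)
    return ask_user(suggestion)

# The author accepts the suggestion for one document and overrides it for another.
print(hybrid_classify("Salary revision letter", ask_user=lambda s: s))      # -> confidential
print(hybrid_classify("Draft press release", ask_user=lambda s: "public"))  # -> public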
Mapping data classification policy to controls
Any data classification policy must primarily address two factors: confidentiality and length of retention. Controls thus broadly fall into two categories: control over access and distribution, and control over retention.
1. Control over access and distribution:
In a data classification policy, access control technology determines the sensitivity of data based on its content, owners and origin, in order to establish the extent to which it can be distributed. Depending on the maturity of the solution, the refinements it provides for translating data classification policies into controls differ. There are two kinds of controls:
a) By distribution:
Content-aware data loss prevention (DLP) is the data classification technology used to prevent the movement of data by either allowing or stopping its flow after content analysis. For example, if a user tries to attach a sensitive file to an email, the control mechanism prevents the attachment. A more granular system may allow the file to be attached but permit it to circulate only internally. A further refinement would be to allow the content to be distributed only within a select group.
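The sketch below illustrates this kind of graduated, distribution-time control. It is not the logic of any particular DLP product; the classification names, the corporate domain and the "select group" are assumed purely for illustration.

def dlp_verdict(classification, recipients):
    # Graduated responses to a sensitive attachment, from coarse to fine-grained.
    # "example.com" is the assumed corporate domain; the select group is made up.
    internal = all(r.endswith("@example.com") for r in recipients)
    select_group = {"cfo@example.com", "legal@example.com"}
    in_group = all(r in select_group for r in recipients)

    if classification == "restricted":
        return "allow" if in_group else "block"      # finest control: named group only
    if classification == "confidential":
        return "allow" if internal else "block"      # coarser control: internal only
    return "allow"                                   # public content flows freely

print(dlp_verdict("confidential", ["partner@outside.org"]))    # -> block
print(dlp_verdict("confidential", ["colleague@example.com"]))  # -> allow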
b) By access:
With information rights management (IRM), the data security controls are built into the information itself, leveraging data classification. Thus, irrespective of where the file goes, the controls travel with the document. Control persists post-distribution, and permissions on the information can be amended even after it has been shared.
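One way to picture post-distribution control is a protected file that checks back with a central permission store every time it is opened, as in the hypothetical sketch below. Because the check happens at open time, permissions can be tightened or relaxed long after the file has left the organization. The document IDs, users and actions are illustrative assumptions, not any vendor's API.

# Hypothetical central permission store that an IRM-protected file consults
# each time it is opened, wherever the copy happens to be.
PERMISSIONS = {
    ("doc-42", "alice@example.com"): {"view", "edit"},
    ("doc-42", "bob@example.com"): {"view"},
}

def can_perform(doc_id, user, action):
    return action in PERMISSIONS.get((doc_id, user), set())

print(can_perform("doc-42", "bob@example.com", "edit"))   # -> False
PERMISSIONS[("doc-42", "bob@example.com")].add("edit")    # amended post-distribution
print(can_perform("doc-42", "bob@example.com", "edit"))   # -> True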
2. Control over retention:
Data classification policy controls might also be based on how long the information needs to be retained. Most good enterprise archival and backup systems can read this data classification, enabling selective backup and secure erasure, which optimizes backups. For instance, such systems can be instructed, based on the data classification policy, to back up P&L (profit and loss) financial data for a period of seven years. Systems can also be configured to retain specific versions of files in circulation.
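A retention control of this kind reduces to a simple rule: look up the retention period for the file's classification and compare it with the file's age. The schedule and the default period in the sketch below are assumptions, not prescriptions.

from datetime import date, timedelta

# Hypothetical retention schedule derived from the data classification policy.
RETENTION_YEARS = {"pnl_financials": 7, "working_drafts": 1}

def should_retain(classification, created, today):
    # Keep the file while it is inside its retention window; outside the
    # window it becomes a candidate for secure erasure.
    years = RETENTION_YEARS.get(classification, 3)   # assumed default of 3 years
    return today < created + timedelta(days=365 * years)

print(should_retain("pnl_financials", created=date(2015, 4, 1), today=date(2020, 4, 1)))  # -> True
print(should_retain("working_drafts", created=date(2015, 4, 1), today=date(2020, 4, 1)))  # -> False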
Effective data classification policies
Any data classification policy needs to evolve. If an organization tries to make its data classification too granular at the outset, the result is an error-prone system with high training costs. The data classification policy must start coarse and undergo gradual refinement. For instance, an organization might initially classify documents as internal, external and public. Over time this data classification could be refined, with "internal" covering HR-specific documents, financial reports, R&D and compliance documents, and so on. At a later stage, HR could be refined further into salary data, resumés and so forth. If the data classification policy is too granular at the outset, a greenhorn user would be confronted with hundreds of options, potentially leading to confusion.

Educating the end user effectively on the data classification policy is the bedrock of an effective data control regime. The regulatory and compliance environment the organization operates in must also be taken into account. In an increasingly borderless world, access-based controls will carry the day, while flow-based controls will become more and more difficult to implement.
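The coarse-to-granular evolution described above can be modelled as a label taxonomy that gains depth over successive versions, as in the hypothetical sketch below; users only ever see the flattened list of labels that the current version exposes.

# Version 1: a deliberately coarse taxonomy.
TAXONOMY_V1 = ["internal", "external", "public"]

# Version 2: "internal" refined into departments, and HR refined further still.
TAXONOMY_V2 = {
    "internal": {
        "hr": ["salary_data", "resumes"],
        "financial_reports": [],
        "r_and_d": [],
        "compliance": [],
    },
    "external": {},
    "public": {},
}

def labels(taxonomy, prefix=""):
    # Flatten the hierarchy into the dotted labels a user would pick from.
    out = []
    for key, children in taxonomy.items():
        name = prefix + key
        out.append(name)
        if isinstance(children, dict):
            out.extend(labels(children, name + "."))
        else:
            out.extend(name + "." + leaf for leaf in children)
    return out

print(labels(TAXONOMY_V2))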