Building Regular Expressions

The DLP engine contains 3000+ predefined data identifiers that can be used in DLP rules. It also supports custom data identifiers that use either a keyword search or a regular expression search. This page describes how to write custom data identifiers for DLP using regular expressions.

When building a Regular Expression, always consider the following metrics:

  • Coverage – If a dartboard represents the data matched by the regular expression, how likely is it that all possible matches are included on the surface area of the dartboard? If you throw a dart at the board, miss the board, and still have the possibility of hitting real matches, then the dartboard Coverage is not complete.
  • Accuracy – Accuracy is a measure of whether something appropriate is matched. How often will the dart hit an appropriate match when it strikes the dartboard? More importantly, how often does the dart hit a bad match even though it hits the dartboard? Is the regular expression “too wide”, including too much garbage?
  • False Positives – Consider the string: 95054. While that is Netskope’s ZIP code in Santa Clara, it may not really be a reference to a ZIP code. Perhaps it’s the last five digits of an American Express credit card number, a LEGO accessory part number, a case number in the Prince George’s County planning department, etc.
  • Performance – Performance is a subjective measurement of how well the regular expression performs in different real-world conditions, and isn’t as simple as it seems. If you throw darts at the dartboard at an extremely high rate, but you’re not hitting the dartboard, what good is High Performance? Conversely, if you’re throwing darts so slowly that the darts don’t stick in the fibers or cork (the equivalent of engine timeouts), you’re going to miss all sensitive data.
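
One way to put numbers on Coverage and Accuracy is to run a candidate pattern against a small set of hand-labeled samples. The sketch below is illustrative only: Python's re module stands in for the DLP engine (whose syntax and matching behavior differ), and the score_pattern helper and the sample strings are hypothetical.

```python
import re

def score_pattern(pattern, samples):
    """Return (coverage, accuracy) for a pattern over labeled samples.

    samples is a list of (text, is_sensitive) pairs.
    Coverage: share of sensitive samples that the pattern matched.
    Accuracy: share of matched samples that were actually sensitive.
    """
    regex = re.compile(pattern)
    true_hits = false_hits = misses = 0
    for text, is_sensitive in samples:
        matched = regex.search(text) is not None
        if matched and is_sensitive:
            true_hits += 1
        elif matched:
            false_hits += 1          # matched, but not sensitive
        elif is_sensitive:
            misses += 1              # sensitive, but not matched
    wanted = true_hits + misses
    hits = true_hits + false_hits
    coverage = true_hits / wanted if wanted else 0.0
    accuracy = true_hits / hits if hits else 0.0
    return coverage, accuracy

# Hypothetical samples: only the first two really refer to a ZIP code.
samples = [
    ("Ship to Santa Clara, CA 95054", True),
    ("ZIP: 95054", True),
    ("LEGO part number 95054", False),
    ("Card ending in 95054", False),
    ("Extension 9505", False),
]

coverage, accuracy = score_pattern(r"[0-9]{5}", samples)
print(coverage, accuracy)
```

With these samples the bare five-digit pattern has complete Coverage but matches as many non-ZIP strings as real ones, which is the False Positives problem described above.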

Syntax

This section describes the regular expression syntax that the DLP engine supports. The DLP engine parser interprets regular expressions differently from UNIX regular expression syntax, so certain expressions and common usages must be adapted for the DLP engine.

The DLP engine is limited to matching up to 256 characters.

The following common regex features are supported:

  • Character classes – e.g. [a-z]
  • Common shorthand classes – e.g. \d, \w, \s, and others. These are not recommended, because an explicit class such as [0-9] is far clearer than \d.
  • Negative character classes – e.g. [^0-9] or \D. These are not recommended for performance reasons.
  • Groups and alternatives – e.g. (A|B)
  • Quantifiers – + and *. Range quantifiers {n,m} are also supported; however, the range should be small and the preceding class should be positive and narrow. Patterns that do not follow these guidelines may cause compile-time issues.

Supported Operators

Operator    Matched Pattern
\           Quote (escape) the next metacharacter, allowing it to be used as a literal. A metacharacter is a character that has a special meaning during pattern processing, such as any of these operators.
.           Match any character except newline. This includes letters, numbers, punctuation, Japanese characters, emoji, and so on, so the set of possible matches is very large (roughly 140,000 characters).
(           Start of subpattern
)           End of subpattern
|           Alternation. This is a logical OR. For example, (cat|dog) matches cat or dog.
[xy]        Character x or y
[x-z]       The range of characters between x and z
[^z]        Any character except z. Not recommended for performance reasons.

Supported Quantifiers

Operator    Matched Pattern
*           Match 0 or more times
+           Match 1 or more times
?           Match 0 or 1 times
{n}         Match exactly n times
{n,}        Match at least n times
{n,m}       Match at least n times, but no more than m times
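
The quantifiers above can be tried out quickly before a pattern is entered into a DLP rule. The following is a rough sketch that uses Python's re module as a stand-in; the DLP engine's parser differs from Python's, so treat the results as a guide to the quantifier semantics only. The matches helper and the sample strings are hypothetical.

```python
import re

def matches(pattern, text):
    """Return True when the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs covering each quantifier in the table.
examples = [
    ("ab*c", "ac"),            # * allows zero occurrences of b
    ("ab+c", "ac"),            # + requires at least one b
    ("ab?c", "abbc"),          # ? allows at most one b
    ("[0-9]{4}", "2022"),      # exactly four digits
    ("[0-9]{2,}", "1"),        # at least two digits
    ("[0-9]{2,4}", "12345"),   # between two and four digits
]

for pattern, text in examples:
    print(pattern, repr(text), matches(pattern, text))
```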

Frequently Asked Questions and Best Practices

Use Predefined Entities whenever possible.

Predefined entities generally perform better than custom entities. Unless Netskope Support has verified that there is an existing problem with a predefined entity, it is always best practice to use the predefined entity.

Some True Positives may be Functional False Positives

For example, if you’re looking for Passcode, you will find that Zoom meeting invitations contain a string like Passcode: 123456. This is a true positive when you’re looking for Passcode, but functionally it is a false positive because the impact of exposing this string is so low.

The most requested features that are not supported include:

  • positive/negative look-around assertions — e.g. (?! or (?<=
  • back-references, or indexed or named capture groups — e.g. \1
  • non-capturing groups — e.g. (?:
  • non-greedy (or lazy) quantifiers — e.g. .+?, \d*?, or (?U)
  • possessive quantifiers — e.g. .++, \w++, or (?-U)
  • mode-switching modifiers — e.g. (?i) or (?x)
  • atomic groups — e.g. (?>
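
Patterns copied from other tools often contain one of these constructs. A rough pre-check such as the sketch below can flag them before the pattern is pasted into a custom identifier. This is not part of the DLP engine: the find_unsupported helper and its detector list are hypothetical, and the checks are heuristics that ignore character classes, so a literal such as [*+] can be flagged by mistake.

```python
import re

# Heuristic detectors for the unsupported constructs listed above.
UNSUPPORTED = [
    ("look-around assertion", r"\(\?<?[=!]"),
    ("back-reference", r"\\[1-9]"),
    ("named capture group", r"\(\?P?<[A-Za-z_]"),
    ("non-capturing group", r"\(\?:"),
    ("lazy quantifier", r"[*+?}]\?"),
    ("possessive quantifier", r"[*+?}]\+"),
    ("mode-switching modifier", r"\(\?[A-Za-z-]+[):]"),
    ("atomic group", r"\(\?>"),
]

def find_unsupported(pattern):
    """Return labels of unsupported constructs that appear in pattern."""
    # Replace escaped characters (other than \1 to \9) with a placeholder
    # so that literals such as \( or \? are not mistaken for syntax.
    stripped = re.sub(r"\\[^1-9]", "_", pattern)
    return [label for label, detector in UNSUPPORTED
            if re.search(detector, stripped)]

print(find_unsupported(r"(?<=ID: )[0-9]{6}"))
print(find_unsupported(r"p(ass)?w(or)?d"))
```

An empty list only means that none of these heuristics fired; it does not guarantee that the DLP engine will accept the pattern.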

Always put something around your expected match in the Test Input field.

Remember to include spaces.

Merge your Entities whenever possible.

Consider the following phrases:

  • password
  • passwd
  • pwd
  • pword

Rather than creating a separate Entity for each of the four phrases, create a single pattern that matches them all, such as:

p(ass)?w(or)?d

Or even:

(password|passwd|pwd|pword)
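
A merged pattern is easy to get subtly wrong, so it is worth confirming that it still matches every original phrase. The check below is a minimal sketch using Python's re module as a stand-in for the DLP engine; the unmatched helper is hypothetical.

```python
import re

PHRASES = ["password", "passwd", "pwd", "pword"]
PATTERNS = [r"p(ass)?w(or)?d", r"(password|passwd|pwd|pword)"]

def unmatched(pattern, phrases):
    """Return the phrases that the pattern fails to match in full."""
    return [p for p in phrases if re.fullmatch(pattern, p) is None]

for pattern in PATTERNS:
    # An empty list means every phrase is covered by the pattern.
    print(pattern, unmatched(pattern, PHRASES))
```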

Form-validation Regex should NOT be used as Content Inspection Regex.

For example, to verify that a user input a valid year between 1900 and 2099, a form-validation RegEx might look like:

^(19|20)[0-9]{2}$

The caret (^) means that the pattern match must start at the beginning of the line or input. Conversely, the dollar sign ($) means that the match must end at the end of the line or input. Both are usually nonsensical when looking for data within text, a spreadsheet, or any typical document. The above RegEx cannot match a year within text like the following: January 31, 2022.

As a rule, avoid anchor assertions.
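
The effect of the anchors can be seen by running the form-validation pattern and an unanchored variant against the same text. This is a sketch in Python's re module, used here only as a stand-in for the DLP engine; the first_match helper and the sample sentence are hypothetical.

```python
import re

TEXT = "The report was filed on January 31, 2022 and revised later."
ANCHORED = r"^(19|20)[0-9]{2}$"    # form-validation style
UNANCHORED = r"(19|20)[0-9]{2}"    # content-inspection style

def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

print(first_match(ANCHORED, TEXT))     # anchored pattern finds nothing here
print(first_match(UNANCHORED, TEXT))   # unanchored pattern finds the year
print(first_match(ANCHORED, "2022"))   # anchored pattern only fits a bare year
```

Dropping the anchors also lets the pattern match inside longer digit strings (for example, the 2022 in 120225), so the surrounding context still needs to be considered when judging false positives.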

Examples of Regular Expressions

  • Looking for any occurrences of Acme, Budget Forecast, or Confidential in any document, with the upper and lower case letters exactly as shown

    (Acme|Budget\ Forecast|Confidential)

    | – The pipe symbol means “OR”, and the open and close parentheses define a group of choices. In this example, the parentheses are not actually required, but add them anyway because it’s a best practice and will keep you out of trouble when writing more complicated RegExes later.

    \ – This backslash symbol escapes the next character (in this case the space “ ”). When you create a long or complicated RegEx that wraps (or even has the potential to wrap) on screen or paper, you cannot be 100% certain if that space exists, should exist, or perhaps shouldn’t exist when at the wrapped edge. Escape the space to remove that ambiguity.

  • Matching Similar Phrases like Acme, Acme Inc, Acme Inc., or Acme Incorporated.

    Acme(\ Inc(\.|orporated)?)?

    ? – Matches 0 or 1 times. In this example, this looks for Acme, followed by an optional group (marked by ?) that consists of a space and Inc, which itself is optionally followed by a group of either a dot or the rest of the word orporated (marked by another ?).
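
Both example patterns can be exercised against sample text before they are used in a rule. The sketch below uses Python's re module as a stand-in for the DLP engine, which may behave differently; the all_hits helper and the sample sentences are hypothetical.

```python
import re

ANY_TERM = r"(Acme|Budget\ Forecast|Confidential)"
ACME_VARIANTS = r"Acme(\ Inc(\.|orporated)?)?"

def all_hits(pattern, text):
    """Return every non-overlapping match of pattern in text, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Matching is case-sensitive, so the lowercase "confidential" is skipped.
print(all_hits(ANY_TERM,
               "The Acme Budget Forecast is Confidential. Keep it confidential."))

# Each optional group extends the match only when the text continues to fit.
print(all_hits(ACME_VARIANTS,
               "Acme, Acme Inc, Acme Inc., and Acme Incorporated"))
```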
