URL List Best Practices
The following sections describe the supported formats for URL lists. Admins can specify the list type (regex or exact) when calling the Netskope REST API V2 to upload URL lists.
URL Supported Formats
This section gives examples of the syntax allowed in URL lists, such as supported characters and the handling of spaces in URLs, and describes how exact-match and wildcard URLs are validated. A specific error message is displayed for each validation failure.
- Malformed URLs (containing spaces or non-ASCII characters) are not allowed. URLs containing non-ASCII characters must be properly percent-encoded (“%” followed by two hexadecimal digits).
- Percent-encoding is not allowed in the host name. Punycode should be used instead.
- Top-level domain wildcards (for example, *.gov) are allowed.
- An entry with more than one ‘*’ is an error. If ‘*’ is present, it must be the first character and must be followed by ‘.’ (dot), which matches all subdomains of the specified domain. Thus “*.google.com” is valid, but “*google.com” and “www.google.*” are not.
- We encourage users NOT to add a scheme (http/https) to URLs, since the scheme is ignored.
- user:password@host is not supported
- Empty/Blank lines are ignored
- Lines beginning with comment characters ([#;]) are ignored.
- Hostname: constructed only from letters, digits, and dashes; cannot start or end with a dash; percent-encoding is not allowed; Punycode is accepted.
- Path, query, and fragment: any combination of unreserved characters ([a-z0-9-._~]), percent-encoding (%[0-9a-f]{2}), and delimiters ([!$&'()*+,;=]) is allowed.
- *.foo.com automatically includes foo.com as well. This allows you to have a single entry (*.foo.com) instead of two entries (*.foo.com and foo.com) in a URL list to derive a custom category. That custom category can be used in various places such as a policy. If you create other configurations where domain names are accepted directly (such as a policy), you will need to specify two separate entries to match subdomains as well as the domain itself. Changes in other Netskope subsystems to merge *.domain.com and domain.com are not currently supported.
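The host name and wildcard rules above can be sketched in a few lines of Python. This is an illustrative approximation, not Netskope's validator: the function name `check_host` and its messages are hypothetical, and it assumes any scheme, port, and path have already been stripped from the entry.

```python
import re
from typing import Optional

# One host label: letters, digits and dashes, not starting or ending with a dash.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def check_host(host: str) -> Optional[str]:
    """Return None if the host passes the rules above, else a reason (hypothetical wording)."""
    if "@" in host:
        return "user:password@host is not supported"
    if "%" in host:
        return "percent-encoding is not allowed in the host name"
    if host.count("*") > 1:
        return "more than one '*'"
    if "*" in host:
        if not host.startswith("*."):
            return "'*' must be the first character and be followed by '.'"
        host = host[2:]  # validate the rest as a plain host name
    if not host or not all(LABEL.fullmatch(label) for label in host.split(".")):
        return "invalid host name"
    return None


for candidate in ["*.google.com", "*.gov", "*google.com", "www.google.*"]:
    print(candidate, "->", check_host(candidate) or "ok")
```

The sketch checks only the host portion; path, query, and fragment rules would need a separate check.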
Valid URL Examples
# wildcard
"*.google.com",
# punycode
"xn--jp-cd2fp15c.xn--fsq.jp",
# percent encoded space
"www.apple.com/i%20mac",
"https://onedrive.live.com/?authkey=%21AKD9vi-K9pXhFlw&cid=F3CDA5103641D53D&id=F3CDA5103641D53D%21157&parId=F3CDA5103641D53D%21110&o=OneUp",
Invalid URL Examples
# url with space
"www.domain with space.com/some/path",
"www.acme.com/path with space",
"www.acme.com/some/path/foo.bar?q=query with space",
# percent-encoding not allowed in domain; must use punycode
"www.domain%20with%20space.com/some/path",
# invalid IP addresses
"http://0.0.0.0",
"http://1.2.3/some/path",
"http://1.1.1.1.1/foo/bar",
"http://123456789",
# invalid port
"WWW1.PYTHON2.ORG:65536/doc/#frag",
"www.example.net:foo",
"www.example.net:0",
# invalid wildcard
"www.google.*",
"*google.com",
# empty host
"?param=1",
# username/password in host
"https://user:password@www.acme.com:8080/path/to/search?P1=foo&P2=bar#Results"
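Entries that fail because of a space in the path, or a non-ASCII host name, can usually be repaired before upload. The following is a minimal sketch using only Python's standard library; the host `bücher.example` is an illustrative name that does not come from this page, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
from urllib.parse import quote

# Percent-encode path or query text: a space becomes %20, "!" becomes %21.
path = quote("i mac")
key = quote("!AKD9vi-K9pXhFlw")

# Convert an internationalized host name (bücher.example) to Punycode.
host = "b\u00fccher.example".encode("idna").decode("ascii")

print(f"www.apple.com/{path}")  # www.apple.com/i%20mac
print(key)                      # %21AKD9vi-K9pXhFlw
print(host)                     # xn--bcher-kva.example
```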
IP Address Ranges and CIDR Validation
The following considerations apply to IP address ranges and CIDR blocks; examples follow.
- An IP address range is specified as A.B.C.D-W.X.Y.Z. An IP address with CIDR is specified as A.B.C.D/<bits>.
- An error is flagged if an IP range is followed by other URL components such as a path or query. If an IP address/CIDR is followed by a path or query, it is interpreted as an exact URL.
- An IP address cannot be 0.0.0.0. In an IP address range, the start address must be less than the end address. An IP address in CIDR notation must have a host portion of zero.
- Overlapping ranges are supported. If such ranges are associated with different categories, a lookup of an IP address in the overlap returns multiple categories. For example, given the following two ranges, a lookup of 192.168.1.2 derives both “Category A” and “Category B”.
- 192.168.1.1-192.168.1.4 (Category A)
- 192.168.1.1-192.168.1.20 (Category B)
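The overlapping-range behavior can be illustrated with Python's `ipaddress` module. This is only a sketch of the lookup semantics described above, using two illustrative 192.168.1.x ranges; the `RANGES` table and the `categories_for` helper are hypothetical names, not part of any Netskope API.

```python
import ipaddress
from typing import List

# Hypothetical range table: (start, end, category).
RANGES = [
    ("192.168.1.1", "192.168.1.4", "Category A"),
    ("192.168.1.1", "192.168.1.20", "Category B"),
]


def categories_for(ip: str) -> List[str]:
    """Return every category whose inclusive range contains ip."""
    addr = ipaddress.IPv4Address(ip)
    return [
        category
        for start, end, category in RANGES
        if ipaddress.IPv4Address(start) <= addr <= ipaddress.IPv4Address(end)
    ]


print(categories_for("192.168.1.2"))   # ['Category A', 'Category B']
print(categories_for("192.168.1.10"))  # ['Category B']
```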
Valid Ranges and CIDR Examples
# Range ok
192.168.1.10-192.168.1.20
# CIDR ok
192.168.1.0/24
Invalid Ranges and CIDR Examples
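# Invalid IP ranges and CIDR
# start address is greater than end address
"http://1.2.3.20-1.2.3.10"
# range followed by a path
"http://1.2.3.10-1.2.3.20/some/path"
# host portion of the CIDR address is not zero
"http://1.2.3.4/24"
The following is treated as an exact URL rather than an IP address with CIDR, because a URL path can start with a number.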
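The range and CIDR rules in this section, including the case where an address followed by a longer path is treated as an exact URL, can be sketched as follows. The `classify` helper and its error messages are hypothetical, and the sketch leans on Python's `ipaddress` module for the address checks rather than reproducing Netskope's validator.

```python
import ipaddress
import re

RANGE = re.compile(r"(\d+\.\d+\.\d+\.\d+)-(\d+\.\d+\.\d+\.\d+)(.*)")
CIDR = re.compile(r"\d+\.\d+\.\d+\.\d+/\d+")


def classify(entry: str) -> str:
    """Return 'range', 'cidr' or 'exact-url'; raise ValueError for an invalid range or CIDR."""
    entry = re.sub(r"^https?://", "", entry.strip())  # the scheme is ignored
    m = RANGE.fullmatch(entry)
    if m:
        start, end, rest = m.groups()
        if rest:
            raise ValueError("IP range followed by other URL components")
        lo, hi = ipaddress.IPv4Address(start), ipaddress.IPv4Address(end)
        if int(lo) == 0:
            raise ValueError("IP address cannot be 0.0.0.0")
        if lo >= hi:
            raise ValueError("start address must be less than end address")
        return "range"
    if CIDR.fullmatch(entry):
        # strict mode (the default) raises ValueError when host bits are set
        ipaddress.IPv4Network(entry)
        return "cidr"
    return "exact-url"


print(classify("192.168.1.10-192.168.1.20"))    # range
print(classify("192.168.1.0/24"))               # cidr
print(classify("http://1.2.3.4/24/some/path"))  # exact-url
```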
"http://1.2.3.4/24/some/path"
Supported Regex
The following considerations apply to regex lists. This section also shows regex examples and the supported constructs.
- The supported syntax is PCRE, excluding lookahead and lookbehind.
- A scheme (http/https) in the regex is not supported. For example, “https://.google.com” will not match the incoming URL “travel.google.com”.
- URL path and query parameters are allowed in regex and are available for matching
- Back-reference and capturing subexpressions are not supported
- Literal characters and strings
- Character classes such as . (dot), [abc], and [^abc], as well as the predefined character classes \s, \d, \w, \v, and \h and their negated counterparts (\S, \D, \W, \V, and \H).
- Quantifiers:
Quantifiers such as ?, * and + are supported when applied to arbitrary supported sub-expressions.
Bounded repeat quantifiers such as {n}, {m,n}, and {n,} are supported with limitations. For arbitrary repeated sub-patterns, n and m should be either small or infinite, e.g. (a|b){4}, (ab?c?d){4,10} or (ab(cd)*){6,}.
- Parenthesization, including the named and unnamed capturing and non-capturing forms. However, capturing is ignored.
- Alternation with the | symbol, as in foo|bar.
- The anchors ^, $, \A, \Z and \z.
- Option modifiers – These allow behavior to be switched on (with (?<option>)) and off (with (?-<option>)) for a sub-pattern. The supported options are:
- Case-insensitive matching (i)
- Multi-line matching (m)
- Interpret . as “any character” (s)
- Extended syntax, which will ignore most whitespace in the pattern (x)
For example, the expression foo(?i)bar(?-i)baz will switch on case-insensitive matching only for the bar portion of the match.
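The effect of a locally scoped option can be tried out in Python, with one caveat: Python's `re` module is not the engine used for URL lists, and it writes a scoped flag as `(?i:...)` rather than switching it on and off with `(?i)` and `(?-i)`. Under that assumption, the pattern below behaves like the example above.

```python
import re

# (?i:bar) applies case-insensitive matching to "bar" only.
pattern = re.compile(r"foo(?i:bar)baz")

print(bool(pattern.fullmatch("fooBARbaz")))  # True: case is ignored for "bar"
print(bool(pattern.fullmatch("FOObarbaz")))  # False: "foo" stays case-sensitive
```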
Unsupported Regex
The following regex constructs are not supported:
- Backreferences and capturing sub-expressions
- Backtracking control verbs such as (*SKIP) and (*PRUNE)
- Subroutine references such as (?1), where 1 is the number of the capturing group
- Recursive patterns such as (?R) and (?0)
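Before uploading a regex list, it can help to scan each pattern for constructs that are not supported. The sketch below is a text-level heuristic only: the `UNSUPPORTED` table and `find_unsupported` helper are hypothetical, the checks do not parse the regex, and an escaped backslash followed by a digit would be reported as a false positive.

```python
import re
from typing import List

# Each entry looks for the literal spelling of one unsupported construct.
UNSUPPORTED = {
    "lookahead/lookbehind": re.compile(r"\(\?<?[=!]"),
    "backreference": re.compile(r"\\[1-9]|\\k[<{']|\(\?P="),
    "backtracking control verb": re.compile(r"\(\*[A-Z]+"),
    "subroutine reference or recursion": re.compile(r"\(\?(?:R|[+-]?\d+)\)"),
}


def find_unsupported(pattern: str) -> List[str]:
    """Return the names of unsupported constructs that appear in pattern."""
    return [name for name, rx in UNSUPPORTED.items() if rx.search(pattern)]


print(find_unsupported(r"^client[0-9]\.google\.com"))  # []
print(find_unsupported(r"(a)\1"))                      # ['backreference']
print(find_unsupported(r"a(*SKIP)b"))                  # ['backtracking control verb']
```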
Valid Regex Examples
"^client[0-9]\.google\.com" # match client1.google.com, client2.google.com ...
"^app\.slack\.com\/.*\/netskope" # app.slack.com/foobar/netskope etc.
"^google\.com" # match google.com
"^www.foobar.com\/api\?action=create" # Match specific query parameter and value
"^sgr\d{1,3}.apple.com" # match sgr0.apple.com through sgr999.apple.com
Invalid Regex Examples
"((foo|bar)" # Missing close parenthesis for group started at index 0
"http://beginwith^[/]*/path" # Embedded start anchors not supported
Tip
- Minimize the use of asterisks (*) in your regex, because they can cause backtracking and degrade performance.
- The Enhanced URL list feature currently does not support parsing and saving reserved characters or ASCII hex codes. Backslash is a reserved character and needs to be escaped before the URL list can be saved successfully. This applies to the API only; the UI is not affected.
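One way to read the backslash tip is that a regex sent through the API sits inside a JSON string, where every backslash must itself be escaped. That reading is an assumption, and the `urls` field name below is a placeholder rather than the documented request schema. A JSON serializer handles the escaping automatically:

```python
import json

pattern = r"^client[0-9]\.google\.com"  # one backslash before each escaped dot

# Hypothetical request body; json.dumps doubles each backslash for JSON.
body = json.dumps({"urls": [pattern]})

print(body)  # {"urls": ["^client[0-9]\\.google\\.com"]}

# Decoding the body restores the original single-backslash pattern.
assert json.loads(body)["urls"][0] == pattern
```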
URL List Limits
Currently, the total limit per tenant across all URL lists is 300K entries. The limit for regex entries across all URL lists in a tenant is 1K (this count includes only the regex as written, not its expanded form).
There is also a per-upload file size limit of 8MB, enforced together with the URL and regex count limits above for uploads through the Web UI and REST API V2. You can upload multiple 8MB files as long as the per-tenant URL list count limit is not exceeded.
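A pre-upload check against these limits might look like the sketch below. The constant and function names are hypothetical, "8MB" is assumed to mean 8 MiB, and the two count arguments are assumed to be tenant-wide totals after the upload, tracked separately for all entries and for regex entries.

```python
from typing import List

MAX_UPLOAD_BYTES = 8 * 1024 * 1024  # assumption: "8MB" read as 8 MiB
MAX_TOTAL_ENTRIES = 300_000         # per tenant, across all URL lists
MAX_REGEX_ENTRIES = 1_000           # per tenant, regex entries as written


def check_limits(upload_bytes: int, total_entries: int, regex_entries: int) -> List[str]:
    """Return a list of problems; an empty list means the upload fits the limits."""
    problems = []
    if upload_bytes > MAX_UPLOAD_BYTES:
        problems.append("upload larger than 8MB")
    if total_entries > MAX_TOTAL_ENTRIES:
        problems.append("more than 300K entries across all URL lists")
    if regex_entries > MAX_REGEX_ENTRIES:
        problems.append("more than 1K regex entries across all URL lists")
    return problems


print(check_limits(upload_bytes=2_000_000, total_entries=120_000, regex_entries=40))  # []
```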
Validation Errors
For any validation failures, a specific error message is returned. For example:
"errors": [
[ "www.domain with space.com/some/path", "Invalid host" ],
[ "www.acme.com/path with space", "Invalid Path" ],