Wednesday, 6 October 2021

Regular Expressions (RegEx) in Modern ABAP

In this blog post, I would like to share the latest news and changes made to Regular Expressions in modern ABAP, mainly from OP release 7.55 & 7.56.

Previously, ABAP used POSIX-style regular expressions (POSIX stands for “Portable Operating System Interface for uniX”). Regular expressions in POSIX syntax are now obsolete, and using them leads to a warning from the syntax check. Although this warning can be hidden with the pragma ##regex_posix, it is strongly recommended to migrate to one of the other regular expression syntaxes supported by ABAP: PCRE, XPath or XSD regular expressions.

Recap on RegEx

Regular expressions, or regex as they are commonly called, can look complicated and intimidating to new users. Before digging into the new features, I would like to give a short introduction to RegEx in general and present some examples written in ABAP. If you are already an expert on this topic, please feel free to skip this section.

The RegEx concept has been around for quite some time. It is used when complex patterns are expected, such as searching for digits, letters or special characters, or validating an email address. Many text search and replacement problems are difficult to handle without regular expression pattern matching. In ABAP, too, a search using a regular expression is more powerful than a search with traditional SAP patterns. Let’s take this simple example:

FIND 'A' IN 'ABCD1234EFG'
  MATCH COUNT DATA(cnt). " number of matches found (1 here)

WRITE: cnt.

Now, if you want to find all letters in the string with a plain search pattern instead of RegEx, you need a loop over all 26 letters of the alphabet. With RegEx, it is easy to find all seven letters:

FIND ALL OCCURRENCES OF PCRE '[A-Z]' IN 'ABCD1234EFG'
  MATCH COUNT DATA(cnt). " 7 uppercase letters

WRITE: cnt.

ABAP supports Regex in the statements FIND and REPLACE and via the classes CL_ABAP_REGEX and CL_ABAP_MATCHER. Class CL_ABAP_MATCHER applies a regular expression generated using CL_ABAP_REGEX to either a character string or an internal table.
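
As a rough sketch of the class-based approach, the following snippet searches the same sample string with CL_ABAP_REGEX and CL_ABAP_MATCHER. The pattern \d+ and the variable names are chosen here purely for illustration, and CREATE_PCRE assumes release 7.55 or higher.

```abap
" Sketch: build a regex object, bind it to a text, and collect all matches.
DATA(regex)   = cl_abap_regex=>create_pcre( pattern = `\d+` ).
DATA(matcher) = regex->create_matcher( text = `ABCD1234EFG` ).

" FIND_ALL returns one entry of type MATCH_RESULT per match.
DATA(matches) = matcher->find_all( ).

LOOP AT matches INTO DATA(match).
  cl_demo_output=>write( |offset { match-offset }, length { match-length }| ).
ENDLOOP.
cl_demo_output=>display( ).
```

For this sample string, the only run of digits is 1234, which starts at offset 4 and has length 4.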

Regular Expressions are generally composed of symbols and characters (literals). I try to cover some of the commonly used symbols in the table below.

Special Character   Meaning
\                   Escape character for special characters
.                   Placeholder for any single character
\d                  Placeholder for any single digit
[]                  Definition of a value set for single characters
[ - ]               Definition of a range in a value set for single characters
?                   One or no single characters
*                   Concatenation of any number of single characters including ‘no characters’
+                   Concatenation of any number of single characters excluding ‘no characters’
|                   Linking of two alternative expressions
^                   Anchor character for the start of a line
$                   Anchor character for the end of a line
\<                  Start of a word
\>                  End of a word
\b                  Start or end of a word
\w                  Matches any letter, digit or underscore character
\s                  Matches a whitespace character, such as a space or tab

Based on the symbols above, we can write regular expressions like these:

\w{5} matches any five-letter word or a five-digit number. a{5} will match “aaaaa”.

\d{11} matches an 11-digit number such as a phone number.

[a-z]{3,} will match any word with three or more letters such as “cat”, “room” or “table”.

Or, for example, the expression c+at will match “cat”, “ccat” and “ccccccat”, while the expression c*at will match “at”, “cat”, “ccat” and “ccccccat”. More symbols can be found in the ABAP documentation.
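
The difference between + and * can be tried out with a small sketch like the one below. It uses the built-in predicate function matches with the pcre argument (release 7.55 or higher assumed); the word list and variable names are made up for illustration.

```abap
" Sketch: compare the quantifiers + and * on a few sample words.
DATA(words) = VALUE string_table( ( `at` ) ( `cat` ) ( `ccat` ) ).

LOOP AT words INTO DATA(word).
  " matches( ) is true only if the whole word fits the pattern.
  DATA(plus) = xsdbool( matches( val = word pcre = `c+at` ) ).
  DATA(star) = xsdbool( matches( val = word pcre = `c*at` ) ).
  cl_demo_output=>write( |{ word }: c+at = '{ plus }', c*at = '{ star }'| ).
ENDLOOP.
cl_demo_output=>display( ).
```

An X marks a match. Only “at” should differ between the two patterns, because c+ requires at least one c.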

Greedy or Lazy?


Another concept worth knowing is the difference between greedy and lazy quantifiers. In greedy mode, which is what the plain quantifiers (*, +, …) use, a quantified character is repeated as many times as possible: the RegEx engine adds as many characters to the match as it can and then gives them back one by one if the rest of the pattern does not match. The opposite is lazy mode, which matches as few characters as possible. In ABAP, for example, placing a question mark after the * (as in .*?) makes the subexpression match as few characters as possible. Regular expressions are greedy by default (in fact, POSIX implies greedy quantifiers, which cannot be switched off).

Example

DATA(text) = `"Jack" and "Jill" went up the "hill"`.

FIND ALL OCCURRENCES OF PCRE  `"(.*?)"` IN text IGNORING CASE
    RESULTS DATA(result_tab).
IF sy-subrc = 0.
  LOOP AT result_tab ASSIGNING FIELD-SYMBOL(<result>).
    cl_demo_output=>write(  substring( val = text off = <result>-offset len = <result>-length )  ).
  ENDLOOP.
ENDIF.
cl_demo_output=>display( ).

With the greedy pattern “(.*)”, the output is the whole input sentence, while the lazy pattern “(.*?)” gives us the three quoted words “Jack”, “Jill” and “hill”. If you omit the addition ALL OCCURRENCES OF in the lazy case, only the substring between the first " and the following " is found, namely “Jack”.

Before release 7.55, ABAP only used the POSIX library for RegEx. Since then, the Perl library is also supported. The two libraries differ significantly in how matches are computed. As POSIX is outdated, we will use Perl-style regexes from here on. You can try out different expressions by running the report DEMO_REGEX in AS ABAP.
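
To see both modes side by side, a minimal sketch with the built-in function match could look like this. It reuses the sentence from the example above; the variable names greedy and lazy are only illustrative.

```abap
DATA(text) = `"Jack" and "Jill" went up the "hill"`.

" Greedy: .* runs to the last quotation mark of the text.
DATA(greedy) = match( val = text pcre = `".*"` ).
" Lazy: .*? stops at the next quotation mark.
DATA(lazy)   = match( val = text pcre = `".*?"` ).

cl_demo_output=>write( greedy ).
cl_demo_output=>write( lazy ).
cl_demo_output=>display( ).
```

Since match returns the first occurrence by default, the greedy call yields the whole sentence and the lazy call yields only "Jack".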


PCRE Syntax


PCRE stands for “Perl Compatible Regular Expressions”. The PCRE library is a set of functions written in C that implement regular expression pattern matching using the same syntax and semantics as Perl 5, and it has its own native API. The PCRE syntax is more powerful and flexible than the POSIX syntax and that of many other regular expression libraries, and it performs better than the POSIX regular expressions supported by ABAP.

RegEx in PCRE syntax can be specified after the addition PCRE of the statements FIND and REPLACE and with the argument pcre of built-in string functions. Objects for PCRE regular expressions can be created with the factory method CREATE_PCRE of the system class CL_ABAP_REGEX and used in the statements FIND and REPLACE or with the system class CL_ABAP_MATCHER.
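
As a small illustration of the first two options, the sketch below masks all digits of a sample string, once with the built-in function replace and once with the REPLACE statement. The sample text and the masking character are assumptions made for this example.

```abap
DATA(text) = `ABCD1234EFG`.

" Built-in function: occ = 0 replaces all occurrences.
DATA(masked) = replace( val = text pcre = `\d` with = `#` occ = 0 ).

" Statement: the same replacement, done in place.
REPLACE ALL OCCURRENCES OF PCRE `\d` IN text WITH `#`.

cl_demo_output=>write( masked ).
cl_demo_output=>write( text ).
cl_demo_output=>display( ).
```

Both variants produce ABCD####EFG. The function leaves its input untouched and returns a new string, while the statement changes the variable itself.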

Example

DATA(text) = `oooababboo`.

FIND PCRE 'a.|[ab]+|b.*' IN text
     MATCH OFFSET DATA(moff)
     MATCH LENGTH DATA(mlen).
IF sy-subrc = 0.
  cl_demo_output=>write( substring( val = text off = moff    len = mlen ) ).
ENDIF.

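The search uses PCRE regular expression syntax and finds the substring ‘ab’ at offset 3 with length 2. With the addition REGEX instead of PCRE, however, the search uses POSIX syntax and finds the substring ‘ababb’ at offset 3 with length 5. The reason is that PCRE takes the first alternative that matches, whereas POSIX looks for the longest match at the leftmost position.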
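
For comparison, the POSIX variant of the same search might be written as follows. This is only a sketch of obsolete syntax: it relies on the pragma ##regex_posix mentioned at the beginning to hide the syntax check warning, and the names poff and plen are chosen for illustration.

```abap
DATA(text) = `oooababboo`.

" Obsolete POSIX syntax; the pragma hides the syntax check warning.
FIND REGEX 'a.|[ab]+|b.*' IN text
     MATCH OFFSET DATA(poff)
     MATCH LENGTH DATA(plen) ##regex_posix.
IF sy-subrc = 0.
  cl_demo_output=>write( substring( val = text off = poff len = plen ) ).
ENDIF.
cl_demo_output=>display( ).
```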

Callouts in PCRE Regular Expressions


RegEx callouts make it possible to temporarily pass control to a function in the middle of regular expression pattern matching. PCRE callouts are specified with the syntax (?C…), where the dots stand for an optional argument. If you specify a callout function before calling PCRE’s matching function, then whenever the engine runs into (?C…), it temporarily suspends the match and passes control to that callout function, providing it with information about the match so far. The callout function performs whatever task it is supposed to do and then returns a code to the engine, letting it know whether to proceed normally with the rest of the match.

In ABAP, the PCRE syntax supports callouts that call ABAP methods while a regular expression is being matched with CL_ABAP_MATCHER. The special characters (?C…) of a PCRE regular expression then call the interface method CALLOUT when the method MATCH is executed. The following example demonstrates how to call an ABAP method from a PCRE regular expression.

Example

REPORT demo_pcre_callout.

CLASS handle_regex DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_abap_matcher_callout.
ENDCLASS.

CLASS handle_regex IMPLEMENTATION.
  METHOD if_abap_matcher_callout~callout.
    cl_demo_output=>write( |{ callout_num } { callout_string }| ).
  ENDMETHOD.
ENDCLASS.

CLASS demo_pcre DEFINITION.
  PUBLIC SECTION.
    CLASS-METHODS main.
ENDCLASS.

CLASS demo_pcre IMPLEMENTATION.
  METHOD main.
    DATA(regex) = cl_abap_regex=>create_pcre(
      pattern = `a(?C1)b(?C2)c(?C3)d(?C"D")e(?C"E")` ).

    DATA(matcher) = regex->create_matcher( text = `abcde` ).

    DATA(handler) = NEW handle_regex( ).
    matcher->set_callout( handler ).
    matcher->match( ).

    cl_demo_output=>display( ).
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  demo_pcre=>main( ).

The regular expression contains the special characters (?C…) for callouts. The first three callouts pass numerical data, the other two pass string data.

A local class ‘handle_regex’ implements the interface IF_ABAP_MATCHER_CALLOUT and an instance of that class is set as the callout handler. When the regular expression is matched, the interface method CALLOUT is called for each callout position and can access the passed parameter.

PCRE syntax for ABAP SQL and ABAP CDS


ABAP SQL and ABAP CDS also support the PCRE syntax with the built-in functions REPLACE_REGEXPR, LIKE_REGEXPR and OCCURRENCES_REGEXPR. These functions access the PCRE1 library implemented in the SAP HANA database. The regular expressions of general ABAP work with the PCRE2 library implemented in the ABAP Kernel.

SQL Function          Result                                                                    CDS View Entities   ABAP SQL
LIKE_REGEXPR          Checks whether a string contains any occurrence of a PCRE                 -                   ×
OCCURRENCES_REGEXPR   Counts and returns all occurrences of a PCRE                              -                   ×
REPLACE_REGEXPR       A PCRE is replaced in a string with another specified character string    ×                   ×

CDS View Entity


This SQL function searches a string for a regular expression pattern and returns the string with either one or every occurrence of the pattern replaced by a replacement string. In a CDS view entity, it has the following syntax:

REPLACE_REGEXPR(PCRE => pcre,
                VALUE => arg1,
                WITH => arg2,
                RESULT_LENGTH => res[,
                OCCURRENCE => occ][,
                CASE_SENSITIVE => case][,
                SINGLE_LINE => bool][,
                MULTI_LINE => bool][,
                UNGREEDY => bool])

The following table shows the requirements made on the arguments.

Result & Result Type Valid Argument Types

Result: A PCRE is replaced in arg1 with the character string specified in arg2. occ is optional and determines the number of occurrences of pcre to be replaced. By default, all occurrences are replaced.

The search is case-sensitive by default, but this can be overridden using the parameter case. Single-line, multi-line and ungreedy regular expression matching can be set with the parameter bool.

Result Type: SSTRING with the maximum possible length of res.

res: positive numeric literal greater than 0 and less than or equal to 1333

occ: positive numeric literal of type INT1, INT2, or INT4 greater than or equal to 1. Alternatively, ‘ALL’ can be specified. In this case, all occurrences of pcre in arg1 are replaced.

case: ‘X’ or ‘ ’. Alternatively, the character literals ‘true’ or ‘false’ (case-sensitive). The default value is ‘true’.

bool: ‘X’ or ‘ ’. Alternatively, the character literals ‘true’ or ‘false’ (case-sensitive). The default value is ‘false’.


The valid argument types for arg1, arg2 are CHAR, CLNT, LANG, NUMC, CUKY, UNIT, DATS, TIMS, and SSTRING.

If an argument of a string function has the null value, the result of the full string function is the null value.

Example

The following CDS view entity applies built-in SQL string functions in its SELECT list to columns of the DDIC database table SPFLI. Wherever the distance ID is MI, it is replaced with the conversion value in KM.

@AccessControl.authorizationCheck: #NOT_REQUIRED
define view entity ZI_regex_test
  as select from spfli
{
  concat_with_space( cityfrom, cityto, 4 ) as from_City_to,
  distance                                 as Distance,
  distid                                   as DistanceId,
  case
    when distid = 'MI' then
      replace_regexpr( pcre          => '[^<]+',
                       value         => distid,
                       with          => '1.6 KM',
                       result_length => 6 )
    else 'KM'
  end                                      as DistanceIdInKM
}

SQL Expressions


ABAP SQL now supports some new functions for processing regular expressions. The following table shows these string functions and the requirements made on their arguments.

Syntax: LIKE_REGEXPR( PCRE = pcre, VALUE = sql_exp[, CASE_SENSITIVE = case] )
Meaning: Checks whether sql_exp contains any occurrence of a PCRE and returns 1 if yes and 0 if no. The search is case-sensitive by default, but this can be overridden using the parameter case.
Valid Argument Types: case: ‘X’ or ‘ ’
Result Type: INT4

Syntax: OCCURRENCES_REGEXPR( PCRE = pcre, VALUE = sql_exp[, CASE_SENSITIVE = case] )
Meaning: Counts all occurrences of a PCRE in sql_exp and returns the number of occurrences. The search is case-sensitive by default, but this can be overridden using the parameter case.
Valid Argument Types: case: ‘X’ or ‘ ’
Result Type: INT4

Syntax: REPLACE_REGEXPR( PCRE = pcre, VALUE = sql_exp1, WITH = sql_exp2[, OCCURRENCE = occ][, CASE_SENSITIVE = case] )
Meaning: A PCRE is replaced in sql_exp1 with the character string specified in sql_exp2. occ is optional and determines the number of occurrences of pcre to be replaced. By default, all occurrences are replaced. The search is case-sensitive by default, but this can be overridden using the parameter case.
Valid Argument Types: occ: literal or host constant with the ABAP type b, s, i, or int8 greater than 0 and less than or equal to 1333; case: ‘X’ or ‘ ’
Result Type: SSTRING

The arguments sql_exp, sql_exp1 and sql_exp2 can be any SQL expressions with the appropriate data types. The possible data types are the dictionary types CHAR, CLNT, CUKY, DATS, LANG, NUMC, TIMS, UNIT, and SSTRING. The possible data types for literals, host variables, and host expressions are the ABAP types assigned to the dictionary types above. The result types are also dictionary types.

Example

This example selects the flights from the table spfli that depart from either Berlin or Tokyo and returns them in lt_table.

" LIKE_REGEXPR returns an INT4 value, so it is compared with the number 1.
SELECT * FROM spfli
  WHERE like_regexpr( pcre = '\bBERLIN\b|\bTOKYO\b', value = cityfrom ) = 1
  INTO TABLE @DATA(lt_table).

Example

The following example uses a regular expression to replace the destination ‘ROME’ with ‘Neapel’ for flights departing from Tokyo.

SELECT carrid   AS airline,
       connid   AS flightno,
       deptime  AS departure_time,
       cityfrom AS departure,
       replace_regexpr( pcre = '\bROME\b', value = cityto, with = 'Neapel' ) AS destination
  FROM spfli
  WHERE cityfrom = 'TOKYO'
  INTO TABLE @DATA(lt_replace).
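
The third function, OCCURRENCES_REGEXPR, can be used in the same way. The following sketch counts how often the letter O occurs in each departure city; the pattern and the names o_count and lt_counts are assumptions made for illustration.

```abap
" Sketch: count the occurrences of the letter O in each departure city.
SELECT carrid, connid, cityfrom,
       occurrences_regexpr( pcre = 'O', value = cityfrom ) AS o_count
  FROM spfli
  INTO TABLE @DATA(lt_counts).

cl_demo_output=>display( lt_counts ).
```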

Not all parameters that can be specified for the REPLACE_REGEXPR function in ABAP CDS view entities (UNGREEDY, for example) can also be specified in ABAP SQL. Such functionality can be achieved through the PCRE syntax itself instead.

The other two regex syntaxes that have found their way into the ABAP language are XPath and XSD.
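
As a hedged sketch of what “through the PCRE syntax itself” can mean, the lazy quantifier *? can be written directly into the pattern where CDS would use UNGREEDY. The pattern, the replacement character and the names lazy_result and lt_lazy are purely illustrative, and the snippet has not been verified on a particular database release.

```abap
" Sketch: the lazy quantifier *? inside the pattern takes the place of UNGREEDY.
SELECT carrid, connid, cityfrom,
       replace_regexpr( pcre = 'A.*?O', value = cityfrom, with = '*' ) AS lazy_result
  FROM spfli
  INTO TABLE @DATA(lt_lazy).

cl_demo_output=>display( lt_lazy ).
```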

XPath and XSD Syntax


Feature-packed PCRE regexes can be used in almost every situation. However, Perl-style regexes are not well suited to breaking XML/HTML down into its meaningful parts and parsing them easily. To tackle this difficulty, ABAP also supports XPath and XSD regular expressions, which it transforms to PCRE syntax internally.

XPath, which stands for “XML Path Language”, is an expression language used to specify parts of an XML document. XPath can also be used in documents with a structure that is similar to XML, like HTML. A regular expression in XPath syntax can be compiled in a normal and an extended mode. In the extended mode, most unescaped whitespace (blanks and line breaks) of the pattern is ignored outside character classes, and comments can be placed behind #. In ABAP built-in functions, the extended mode is switched on by default and can be switched off with (?-x) in the regular expression.

Unlike with regular expressions, we do not need to know the pattern of the data in advance when we use XPath. Since XML documents are structured as nodes, XPath uses that structure to navigate through the document and return the nodes we are looking for.
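
If the extended mode behaves as described, blanks inside an XPath pattern passed to a built-in function should simply be ignored. The following minimal sketch illustrates the idea; the sample text and the variable name hit are assumptions.

```abap
" Sketch: with extended mode on, the blanks in the pattern are ignored,
" so `b x c` is treated like `bxc`.
DATA(hit) = match( val = `abxcd` xpath = `b x c` ).
cl_demo_output=>display( hit ).
```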

Example

A special feature of XPath regular expressions is the subtraction of character sets. In the following example, the letters a to c are subtracted from character set BasicLatin and the first match is d at offset 3.

FIND REGEX
  cl_abap_regex=>create_xpath2( pattern = '[\p{IsBasicLatin}-[a-c]]' )
  IN 'abcd' MATCH OFFSET DATA(moff).

Example

Compared to PCRE, XPath regular expressions allow the escape character \ not only in front of special characters. In the following example, the match function with parameter xpath finds x while the match function with parameter pcre does not. Accordingly, the first FIND statement returns in sy-subrc the value 0 while the second FIND statement returns 4.

DATA(x) = match( val = `abxcd` xpath = `\x`  occ = 1  ).
DATA(y) = match( val = `abxcd` pcre =  `\x`  occ = 1  ).

FIND REGEX cl_abap_regex=>create_xpath2( pattern = '\x' ) IN 'abxcd'.
FIND REGEX cl_abap_regex=>create_pcre(   pattern = '\x' ) IN 'abxcd'.

XSD Syntax


XSD stands for “XML Schema Definition”, and its regular expression syntax is a subset of the XPath syntax. Compared to other flavors, the XML Schema flavor has its own syntax and specialized notation and is quite limited in features. This shortage of features is not an obstacle, as XSD regular expressions are only used to validate whether an entire element matches a pattern, not to extract matches from large blocks of data.

XML schema anchors the entire regular expression. So, you must not add regex delimiters and you don’t need to use anchors (i.e., the ^ at the beginning and $ at the end). The regex must match the whole element for the element to be considered as valid. The dot never matches line breaks, and patterns are case sensitive. XML regular expressions don’t have any tokens like \xFF or \uFFFF to match special characters nor provide a way to specify matching modes.

There is no XSD syntax for non-greedy behavior, so lazy quantifiers are not available. As the pattern is anchored at the start and the end of the subject string and only a success/failure result is returned, the only difference between a greedy and a lazy quantifier would be performance: it is not possible to make a fully anchored pattern match or fail by changing a greedy quantifier into a lazy one or vice versa. Besides, there is no XSD syntax for subgroups without registration or for back references.

Regardless of these limitations, XML Schema regular expressions provide two handy features. The special shorthand character classes \i and \c make it easy to match XML names. No other regex flavor supports these.

You cannot use XSD syntax directly in the FIND and REPLACE statements, but you can use objects of the class CL_ABAP_REGEX created with the method CREATE_XSD after the addition REGEX instead.
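
A typical validation use of \i and \c could be sketched as follows: each candidate string is checked as a whole against the pattern \i\c* with the method MATCH of CL_ABAP_MATCHER. The candidate names are invented for this example.

```abap
" Sketch: check whether whole strings are well-formed XML names.
DATA(names) = VALUE string_table( ( `option:A` ) ( `_id` ) ( `1abc` ) ).
DATA(regex) = cl_abap_regex=>create_xsd( pattern = `\i\c*` ).

LOOP AT names INTO DATA(name).
  " MATCH succeeds only if the pattern covers the complete text.
  DATA(is_valid) = regex->create_matcher( text = name )->match( ).
  cl_demo_output=>write(
    |{ name }: { COND string( WHEN is_valid = abap_true THEN `valid` ELSE `invalid` ) }| ).
ENDLOOP.
cl_demo_output=>display( ).
```

The last candidate is expected to fail, because an XML name must not start with a digit.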

Example

The following example uses XSD syntax that is invalid for PCRE and does not find any matches for POSIX. It would also work for XPath.

DATA(xml) = `<A><B>...<Y><Z>`.

REPLACE ALL OCCURRENCES OF
        REGEX cl_abap_regex=>create_xsd( pattern = `\i\c*` )
        IN xml WITH `option:$0`.
cl_demo_output=>display( xml ).

The result of the replacement is <option:A><option:B>...<option:Y><option:Z>.

Character class subtraction makes it easy to match a character that is in a certain list, but not in another list. This feature is particularly useful when working with Unicode properties. E.g. [\p{L}-[\p{IsBasicLatin}]] matches any letter that is not an English letter.
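
A minimal sketch of this subtraction, modeled on the XPath example above, might look like this. The sample word is an assumption, and CREATE_XPATH2 is used in the same way as earlier in this post.

```abap
" Sketch: find the first letter that is outside the Basic Latin block.
FIND REGEX
  cl_abap_regex=>create_xpath2( pattern = '[\p{L}-[\p{IsBasicLatin}]]' )
  IN `naïve` MATCH OFFSET DATA(moff).
IF sy-subrc = 0.
  " moff holds the offset of the first non-English letter found.
  cl_demo_output=>display( moff ).
ENDIF.
```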
