Preface:
Character encodings are no strangers to SAP systems, or to any computer system for that matter, as they form the basis for data storage in and communication between computer systems. In recent times, Unicode has become the dominant encoding standard, and its UTF-8 representation is especially popular for web content.
We rarely get to deal with encoding schemes directly in ABAP. Recently, however, there was a unique requirement to convert the emoji characters in a Unicode string to their equivalent Unicode code points in hexadecimal so that they could be displayed properly in an HTML-compliant client.
Interesting as it appeared at first, it also seemed very straightforward. The reality turned out to be quite different once I realised that I had only a bare understanding of how Unicode data is stored using UTF-8 encoding.
In this article, I’m going to explain what the actual requirement was and how an ABAP solution was provided for it. Though the need for such a solution is very uncommon, the key takeaways could be a better understanding of the following areas:
◈ How does SAP store the data in the default code page configured?
◈ How to convert from one code page to another in SAP?
◈ How to handle the conversion between the data types such as C, I and X and so on?
◈ How to perform bit manipulation in ABAP?
◈ How does UTF-8 bit distribution logic work?
OK, let’s get started.
Requirement:
The actual requirement goes as follows:
Let’s consider the below Unicode string as input.
Test emoji 😀
As we see, this string contains an emoji icon, technically a Unicode character, whose code point is U+1F600 (GRINNING FACE).
Now, as per the requirement, the emoji icon 😀 needs to be converted to &#x1F600; (the code point in hex, written as an HTML character reference). As & is an unsafe character in an HTML context, it needs to be escaped as &amp;, and hence the expected output would be:
Test emoji &amp;#x1F600;
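The escaping of & mentioned above could also be done as a separate step. The following is only a minimal sketch, assuming a release that provides the built-in function escape( ) and the constant cl_abap_format=>e_html_text; the report and variable names are hypothetical and not part of the original solution.

```abap
REPORT zdemo_html_escape.

* Hypothetical demo report; all names are illustrative only.
START-OF-SELECTION.
  " Text that already contains the character reference for U+1F600
  DATA(lv_entity) = `Test emoji &#x1F600;`.

  " e_html_text replaces the characters &, < and > by HTML entities
  DATA(lv_escaped) = escape( val    = lv_entity
                             format = cl_abap_format=>e_html_text ).

  WRITE / lv_escaped.   " should show: Test emoji &amp;#x1F600;
```

Keeping the escaping separate from the code point conversion makes it easy to switch it off for clients that expect the unescaped character reference.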
Let’s look at the ABAP solution below.
ABAP Solution:
*&---------------------------------------------------------------------*
*& Report ZGP_EMOJI_CONV
*&---------------------------------------------------------------------*
*& Convert Emoji Characters in a Unicode String to Unicode Codepoints
*& Author: Gopu Packirisamy
*&---------------------------------------------------------------------*
REPORT zgp_emoji_conv NO STANDARD PAGE HEADING.
* Constants
CONSTANTS c_semicolon TYPE c VALUE ';'.
CONSTANTS c_uc_codepoint TYPE string VALUE '&#x'.
* Selection screen
SELECTION-SCREEN BEGIN OF BLOCK b WITH FRAME.
PARAMETERS p_string TYPE char255 LOWER CASE.
SELECTION-SCREEN END OF BLOCK b.
* Processing Logic
PERFORM conv_emoji2codepoint.
*&---------------------------------------------------------------------*
* Convert Emojis to Unicode Codepoint
*&---------------------------------------------------------------------*
FORM conv_emoji2codepoint.
  DATA lv_xstr_idx TYPE sy-index.
  DATA lv_hex TYPE xstring.
  DATA lv_hex_i TYPE i.
  DATA lv_cur_pos TYPE sy-index.
  DATA lv_unicode_cp TYPE string.
  DATA lv_string_utf8 TYPE char255.
  DATA lv_string_conv TYPE string.
  FIELD-SYMBOLS <fs_char>.
  lv_xstr_idx = 0.
  lv_cur_pos = 0.
* Convert text to UTF-8 format (Hex string)
  DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = 'UTF-8' ).
  lo_converter->reset( ).
  lo_converter->write( EXPORTING data = p_string ).
* The xstring is stored as hex characters, i.e. 2 characters per byte
  lv_string_utf8 = lo_converter->get_buffer( ).
  DATA(lv_len) = strlen( p_string ).
* Parse through the Hex string and identify each Unicode character
* according to its UTF-8 bit distribution pattern and
* apply codepoint conversion, if necessary
  WHILE lv_cur_pos < lv_len.
    ASSIGN lv_string_utf8+lv_xstr_idx(2) TO <fs_char> TYPE 'C'.
    lv_hex = <fs_char>.
    lv_hex_i = lv_hex.
    IF lv_hex_i >= 240. " >= F0
      ASSIGN lv_string_utf8+lv_xstr_idx(8) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_4b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 8.
      " A 4-byte character takes 2 positions (surrogate pair) in p_string
      lv_cur_pos = lv_cur_pos + 2.
    ELSEIF lv_hex_i >= 224. " >= E0
      ASSIGN lv_string_utf8+lv_xstr_idx(6) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_3b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 6.
      lv_cur_pos = lv_cur_pos + 1.
    ELSEIF lv_hex_i >= 192. " >= C0
      ASSIGN lv_string_utf8+lv_xstr_idx(4) TO <fs_char> TYPE 'C'.
      lv_hex = <fs_char>.
      PERFORM conv_utf8_2b USING lv_hex lv_unicode_cp.
      lv_xstr_idx = lv_xstr_idx + 4.
      lv_cur_pos = lv_cur_pos + 1.
    ELSE. " Other cases
      lv_unicode_cp = COND #( WHEN p_string+lv_cur_pos(1) IS NOT INITIAL
                              THEN p_string+lv_cur_pos(1)
                              ELSE | | ).
      lv_xstr_idx = lv_xstr_idx + 2.
      lv_cur_pos = lv_cur_pos + 1.
    ENDIF.
    lv_string_conv = |{ lv_string_conv }{ lv_unicode_cp }|.
  ENDWHILE.
  WRITE: lv_string_conv.
ENDFORM.
*&---------------------------------------------------------------------*
* Convert 4 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*
FORM conv_utf8_4b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.
DATA lv_emoji_hex TYPE xstring VALUE '000000'.
PERFORM copy_hex_bits USING: 6 iv_hex 4 lv_emoji_hex,
7 iv_hex 5 lv_emoji_hex,
8 iv_hex 6 lv_emoji_hex,
11 iv_hex 7 lv_emoji_hex,
12 iv_hex 8 lv_emoji_hex,
13 iv_hex 9 lv_emoji_hex,
14 iv_hex 10 lv_emoji_hex,
15 iv_hex 11 lv_emoji_hex,
16 iv_hex 12 lv_emoji_hex,
19 iv_hex 13 lv_emoji_hex,
20 iv_hex 14 lv_emoji_hex,
21 iv_hex 15 lv_emoji_hex,
22 iv_hex 16 lv_emoji_hex,
23 iv_hex 17 lv_emoji_hex,
24 iv_hex 18 lv_emoji_hex,
27 iv_hex 19 lv_emoji_hex,
28 iv_hex 20 lv_emoji_hex,
29 iv_hex 21 lv_emoji_hex,
30 iv_hex 22 lv_emoji_hex,
31 iv_hex 23 lv_emoji_hex,
32 iv_hex 24 lv_emoji_hex.
ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.
ENDFORM.
*&---------------------------------------------------------------------*
* Convert 3 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*
FORM conv_utf8_3b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.
DATA lv_emoji_hex TYPE xstring VALUE '0000'.
PERFORM copy_hex_bits USING: 5 iv_hex 1 lv_emoji_hex,
6 iv_hex 2 lv_emoji_hex,
7 iv_hex 3 lv_emoji_hex,
8 iv_hex 4 lv_emoji_hex,
11 iv_hex 5 lv_emoji_hex,
12 iv_hex 6 lv_emoji_hex,
13 iv_hex 7 lv_emoji_hex,
14 iv_hex 8 lv_emoji_hex,
15 iv_hex 9 lv_emoji_hex,
16 iv_hex 10 lv_emoji_hex,
19 iv_hex 11 lv_emoji_hex,
20 iv_hex 12 lv_emoji_hex,
21 iv_hex 13 lv_emoji_hex,
22 iv_hex 14 lv_emoji_hex,
23 iv_hex 15 lv_emoji_hex,
24 iv_hex 16 lv_emoji_hex.
ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.
ENDFORM.
*&---------------------------------------------------------------------*
* Convert 2 bytes UTF-8 character to Unicode Codepoint
*&---------------------------------------------------------------------*
FORM conv_utf8_2b USING iv_hex TYPE xstring ev_emoji_cp TYPE string.
DATA lv_emoji_hex TYPE xstring VALUE '0000'.
PERFORM copy_hex_bits USING: 4 iv_hex 6 lv_emoji_hex,
5 iv_hex 7 lv_emoji_hex,
6 iv_hex 8 lv_emoji_hex,
7 iv_hex 9 lv_emoji_hex,
8 iv_hex 10 lv_emoji_hex,
11 iv_hex 11 lv_emoji_hex,
12 iv_hex 12 lv_emoji_hex,
13 iv_hex 13 lv_emoji_hex,
14 iv_hex 14 lv_emoji_hex,
15 iv_hex 15 lv_emoji_hex,
16 iv_hex 16 lv_emoji_hex.
ev_emoji_cp = |{ c_uc_codepoint }{ lv_emoji_hex }{ c_semicolon }|.
ENDFORM.
*&---------------------------------------------------------------------*
* Copy HEX bits from source byte to target byte
*&---------------------------------------------------------------------*
FORM copy_hex_bits USING iv_src_bit TYPE i
iv_src_str TYPE xstring
iv_trgt_bit TYPE i
CHANGING cv_trgt_str TYPE xstring.
GET BIT iv_src_bit OF iv_src_str INTO DATA(lv_bit).
SET BIT iv_trgt_bit OF cv_trgt_str TO lv_bit.
ENDFORM.
Solution Explanation:
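◉ Read the input Unicode string from the selection screen via the parameter p_string.
◉ Convert the input string to a UTF-8 hex string (xstring) using the ABAP conversion APIs.
◉ We can find the default code page of the system by running the FM RFC_SYSTEM_INFO and checking the exporting parameter RFCSI_EXPORT-RFCCHARTYP. In my case, it happened to be 4102.
◉ We can find the details of an SAP code page by running the FM SCP_CODEPAGE_INFO. It turns out that code page 4102 is UTF-16BE Unicode / ISO/IEC 10646.
◉ Loop through the converted UTF-8 text until the end of the xstring and parse each character as per the UTF-8 bit distribution logic (Table 3-6 of the Unicode Standard):
Scalar Value               | First Byte | Second Byte | Third Byte | Fourth Byte
00000000 0xxxxxxx          | 0xxxxxxx   |             |            |
00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |
zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |
000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx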
◉ As per the bit distribution logic above, check the first byte in the following order:
◉ Case 1: If the first byte value is >= F0 (hex), 240 (decimal) or 11110000 (binary), the Unicode character occupies 4 bytes.
◉ Case 2: Else if the first byte value is >= E0 (hex), 224 (decimal) or 11100000 (binary), the Unicode character occupies 3 bytes.
◉ Case 3: Else if the first byte value is >= C0 (hex), 192 (decimal) or 11000000 (binary), the Unicode character occupies 2 bytes.
◉ Case 4: Else, in the rest of the cases, the Unicode character occupies 1 byte.
◉ Once the bit distribution pattern has been identified as in the step above, read the required follow-up bytes and assemble the scalar value of the single Unicode character by setting its individual bits from the distributed bits (refer to Table 3-6 above).
◉ Apply the conversion logic in the following manner for each character in the UTF-8 string.
◉ For cases 1, 2 and 3 above, build the equivalent HTML entity (&#x followed by the scalar value in hexadecimal and a semicolon) for the emoji characters.
◉ For case 4 above, no conversion is required, as these are 7-bit ASCII characters.
◉ Concatenate the characters from the step above and output the converted string.
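The first-byte check in cases 1 to 4 can also be tried out in isolation. The sketch below is not the author's report but a reduced variant under a few assumptions: it reads the bytes directly from the xstring via offset/length access instead of going through a character field, it assumes well-formed UTF-8, and the report name, parameter and variables are hypothetical. The conversion calls are the same ones used in the solution above.

```abap
REPORT zdemo_utf8_lead_byte.

* Hypothetical demo report; all names are illustrative only.
PARAMETERS p_text TYPE string LOWER CASE.

START-OF-SELECTION.
  DATA lv_utf8 TYPE xstring.
  DATA lv_byte TYPE x LENGTH 1.
  DATA lv_lead TYPE i.
  DATA lv_size TYPE i.
  DATA lv_off  TYPE i.

  " Convert the input text to its UTF-8 byte sequence
  DATA(lo_conv) = cl_abap_conv_out_ce=>create( encoding = 'UTF-8' ).
  lo_conv->write( data = p_text ).
  lv_utf8 = lo_conv->get_buffer( ).

  DATA(lv_total) = xstrlen( lv_utf8 ).

  WHILE lv_off < lv_total.
    " Offset and length of a byte string are counted in bytes
    lv_byte = lv_utf8+lv_off(1).
    " Assigning a 1-byte x field to an integer yields 0..255
    lv_lead = lv_byte.

    IF lv_lead >= 240.        " F0: 4-byte sequence
      lv_size = 4.
    ELSEIF lv_lead >= 224.    " E0: 3-byte sequence
      lv_size = 3.
    ELSEIF lv_lead >= 192.    " C0: 2-byte sequence
      lv_size = 2.
    ELSE.                     " 7-bit ASCII
      lv_size = 1.
    ENDIF.

    WRITE: / 'Offset', lv_off, 'lead byte', lv_byte, 'length', lv_size.

    " Skip the continuation bytes of the current character
    lv_off = lv_off + lv_size.
  ENDWHILE.
```

Working on the xstring directly avoids the factor of 2 between bytes and hex characters that the report has to account for, at the cost of not being identical to the original approach.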
Sample Test Results:
Please note that the emoji character is not displayed in the SAP GUI screen, though it is part of the input to the parameter p_string.
Test 1:
Input: Emoji test =>
Output: [screenshot]
Code Point Reference: [screenshot]
Test 2:
Input: sap
Output: [screenshot]
Code Point Reference: [screenshot]
Test 3:
Input: No Emoji text
Output: [screenshot]
Code Explanation with an example:
OK, now it’s time for a deep dive. Let’s apply what we have learned to an example to better understand how the whole conversion logic works.
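As one possible illustration, take the emoji 😀 (U+1F600), whose UTF-8 encoding is the 4-byte sequence F0 9F 98 80. The sketch below decodes that sequence arithmetically: it masks off the marker bits of each byte and shifts by 6 bits through multiplication by 64. This is an alternative to the GET BIT / SET BIT approach of the report, not the author's code, and all names in it are hypothetical.

```abap
REPORT zdemo_utf8_decode_4b.

* Hypothetical demo report; all names are illustrative only.
START-OF-SELECTION.
  CONSTANTS lc_mask_lead TYPE x LENGTH 1 VALUE '07'.   " 00000111
  CONSTANTS lc_mask_cont TYPE x LENGTH 1 VALUE '3F'.   " 00111111

  DATA lv_seq  TYPE xstring VALUE 'F09F9880'.          " UTF-8 of U+1F600
  DATA lv_byte TYPE x LENGTH 1.
  DATA lv_part TYPE i.
  DATA lv_cp   TYPE i.
  DATA lv_off  TYPE i.
  DATA lv_hex  TYPE x LENGTH 4.

  " Lead byte 11110uuu: keep the lower 3 bits (F0 -> 00)
  lv_byte = lv_seq(1).
  lv_byte = lv_byte BIT-AND lc_mask_lead.
  lv_cp   = lv_byte.

  " Continuation bytes 10xxxxxx: keep the lower 6 bits of each
  DO 3 TIMES.
    lv_off  = sy-index.
    lv_byte = lv_seq+lv_off(1).
    lv_byte = lv_byte BIT-AND lc_mask_cont.   " 9F -> 1F, 98 -> 18, 80 -> 00
    lv_part = lv_byte.
    lv_cp   = lv_cp * 64 + lv_part.           " shift left by 6 bits, then add
  ENDDO.

  " lv_cp is now 128512, i.e. hex 1F600
  lv_hex = lv_cp.
  DATA(lv_txt) = |{ lv_hex }|.                " 0001F600
  SHIFT lv_txt LEFT DELETING LEADING '0'.     " 1F600

  DATA(lv_out) = |&#x{ lv_txt };|.
  WRITE / lv_out.                             " &#x1F600;
```

Step by step: the lead byte contributes 0, then 0 * 64 + 31 = 31, then 31 * 64 + 24 = 2008, and finally 2008 * 64 + 0 = 128512, which is 1F600 in hexadecimal.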