Useful Java functions that work on an incoming character or BLOB stream.

I came across an integration problem a few days ago while integrating an application that accepts any UTF-8 character, including emoji and accented letters.

For example, the source system could send a fancy gift card message in UTF-8 that includes many emoji and decorative characters, while the target system supports only the pure ASCII character set, \x00 to \x7F.

I could simply strip all the non-ASCII bytes from the input, which takes care of the emoji (multi-byte characters that typically occupy 3 to 4 bytes each in UTF-8, depending on the glyph). The real problem is accented characters such as 'ñ' or 'ń', for example in the name ñancy: stripping them drops the letter entirely and leaves "ancy" instead of "nancy".
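To make the problem concrete, here is a minimal sketch of the naive approach. The class and method names (NaiveStripDemo, stripNonAscii, utf8Length) are made up for illustration and are not part of the utility class in this post; the strings use Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class NaiveStripDemo {

    // Naive cleanup: drop every character outside the ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]+", "");
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String grinningFace = "\uD83D\uDE00";   // emoji U+1F600
        String nTilde = "\u00F1";               // n with tilde, U+00F1

        System.out.println(utf8Length(grinningFace));       // 4
        System.out.println(utf8Length(nTilde));             // 2
        System.out.println(stripNonAscii(nTilde + "ancy")); // ancy
    }
}
```

The emoji disappears, which is what we want, but the first letter of the name disappears with it.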

Since there are so many Latin letters with a diaeresis, acute, circumflex, grave and so on, it is very difficult to keep them all in some data structure (a lookup map, say) and convert them as needed.
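A rough sketch of how Unicode normalization avoids such a lookup table: in NFD form, an accented letter is split into its base letter followed by one or more combining marks, and the marks can then be removed separately. The class and helper names below (NfdDemo, decompose, codePoints) are assumptions made for this illustration. Note that letters without a canonical decomposition, such as 'ø' (U+00F8), come through NFD unchanged, so this trick does not turn them into a base letter.

```java
import java.text.Normalizer;

// Hypothetical demo class, for illustration only.
class NfdDemo {

    // Canonical decomposition: base letter followed by combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Renders each code point of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // n with tilde (U+00F1) becomes 'n' plus combining tilde
        System.out.println(codePoints(decompose("\u00F1"))); // U+006E U+0303
        // n with acute (U+0144) becomes 'n' plus combining acute
        System.out.println(codePoints(decompose("\u0144"))); // U+006E U+0301
        // o with stroke (U+00F8) has no decomposition and stays as it is
        System.out.println(codePoints(decompose("\u00F8"))); // U+00F8
    }
}
```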

Below are some Java functions that may come in handy some day in an integration developer's life.

They can be called from your ESQL code or from a Java Compute node, and they can be applied to a BLOB stream too.

package com.rmetta.utils;

import java.nio.charset.Charset;
import java.text.Normalizer;
import java.util.regex.Pattern;

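public class JavaUtils {

    // Matches one or more combining diacritical marks (U+0300 to U+036F).
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Returns true only if every character of inputStr can be encoded as US-ASCII.
    // A null input returns false.
    public static Boolean isPureAscii(String inputStr) {
        if (inputStr != null) {
            // or "ISO-8859-1" for ISO Latin 1
            return Charset.forName("US-ASCII").newEncoder().canEncode(inputStr);
        }
        return false;
    }

    // Replaces accented letters with their base letters (n with tilde becomes n).
    public static String deAccentUTF8Chars(String str) {
        if (str == null) {
            return null;
        }
        // NFD splits each accented letter into a base letter plus combining marks.
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
    }

    // De-accents the input, then drops every remaining non-ASCII character (emoji etc.).
    public static String cleanNonAscii(String inputStr) {
        if (inputStr != null) {
            String deAccentedStr = deAccentUTF8Chars(inputStr);
            return deAccentedStr.replaceAll("[^\\x00-\\x7F]+", "");
        }
        return inputStr;
    }

}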
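If the payload arrives as a BLOB rather than a string, the same steps can be applied after decoding the bytes. The sketch below is only an illustration: the class and method names (BlobCleanDemo, cleanUtf8Bytes) are hypothetical, it assumes the incoming bytes are UTF-8 encoded text, and it repeats the de-accent and strip steps inline so that it runs on its own.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, for illustration only.
class BlobCleanDemo {

    // Assumption: utf8Bytes holds UTF-8 encoded text. Returns pure ASCII bytes.
    static byte[] cleanUtf8Bytes(byte[] utf8Bytes) {
        if (utf8Bytes == null) {
            return null;
        }
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Step 1: split accented letters into base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Step 2: drop the combining marks, then anything still outside ASCII.
        String ascii = decomposed
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .replaceAll("[^\\x00-\\x7F]+", "");
        return ascii.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // n-tilde + "ancy", a space, then the wrapped-gift emoji U+1F381
        byte[] in = "\u00F1ancy \uD83C\uDF81".getBytes(StandardCharsets.UTF_8);
        String out = new String(cleanUtf8Bytes(in), StandardCharsets.US_ASCII);
        System.out.println("[" + out + "]"); // [nancy ]
    }
}
```

The accented letter survives as a plain 'n', while the emoji is removed; the space before the emoji is kept because it is ordinary ASCII.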

Thanks to the Stack Overflow member who provided the original solution. Follow this link for more on the internals: https://stackoverflow.com/a/46118158/7766764

Also, for a complete guide, “Unicode Demystified” by Richard Gillam covers Unicode in great depth.

Written by Ramesh Metta

