Truncating Strings Correctly Under Korean Text Encoding

In UTF-8, the string "안녕하세요" takes 15 bytes.
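
The byte count depends on the encoding. As a minimal sketch, the sizes can be checked with the standard `String.getBytes(Charset)` call (the class name `ByteCountDemo` and its helper methods are only illustrative, not part of the original post):

```java
import java.nio.charset.StandardCharsets;

public class ByteCountDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "안녕하세요";
        System.out.println("chars        : " + text.length());    // 5
        System.out.println("UTF-8 bytes  : " + utf8Bytes(text));  // 15
        System.out.println("UTF-16 bytes : " + utf16Bytes(text)); // 10
    }
}
```

Each Hangul syllable takes 3 bytes in UTF-8 but only 2 in UTF-16, which is why the same five characters come out at 15 and 10 bytes.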

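Now suppose the DB column is 10 bytes wide. Java's built-in substring and length work in characters, not bytes,
so a length check reports just 5 for this string no matter which encoding the database uses, and the 15-byte value slips through.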
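
A small sketch of how the character-based check goes wrong. The constant `COLUMN_BYTES` and both method names are made up for illustration, and the column is assumed to store UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class NaiveCheckDemo {
    // Hypothetical column limit in bytes, taken from the 10-byte example in the text.
    static final int COLUMN_BYTES = 10;

    // Naive check: compares the number of chars against a byte limit.
    static boolean fitsByChars(String text) {
        return text.length() <= COLUMN_BYTES;
    }

    // Byte-aware check: compares the UTF-8 encoded size against the limit.
    static boolean fitsByUtf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length <= COLUMN_BYTES;
    }

    public static void main(String[] args) {
        String text = "안녕하세요";
        System.out.println(fitsByChars(text));     // true  (5 chars <= 10)
        System.out.println(fitsByUtf8Bytes(text)); // false (15 bytes > 10)
    }
}
```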

So I wrote the two helpers below.

[Example]

String plainText = "동해물과 백두산이 마르고 닳도록";
System.out.println("Original: " + plainText);

int textLength = length(plainText);
System.out.println("Original length in bytes: " + textLength);

String convText = substring(plainText, 10);
System.out.println("Truncated to 10 bytes: " + convText);

[Result]

Original: 동해물과 백두산이 마르고 닳도록
Original length in bytes: 45
Truncated to 10 bytes: 동해물

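A character whose bytes get cut off at the limit is simply thrown away, lol. Here "동해물" is 9 bytes, and the lone first byte of "과" is dropped.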

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Truncate a string to at most maxLength bytes of UTF-8 without splitting a character
    public static String substring(String parameterName, int maxLength) {
        int DB_FIELD_LENGTH = maxLength;

        byte[] sba = parameterName.getBytes(StandardCharsets.UTF_8);
        // Nothing to cut if the encoded text already fits
        // (wrapping more bytes than exist would throw IndexOutOfBoundsException)
        if (sba.length <= DB_FIELD_LENGTH) {
            return parameterName;
        }

        CharsetDecoder cd = StandardCharsets.UTF_8.newDecoder();
        // Ignore an incomplete character at the cut point
        cd.onMalformedInput(CodingErrorAction.IGNORE);

        // Ensure truncation by limiting the byte buffer to DB_FIELD_LENGTH
        ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // length in bytes
        CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // chars needed <= number of bytes
        cd.decode(bb, cb, true);
        cd.flush(cb);
        return new String(cb.array(), 0, cb.position());
    }
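
For comparison, the same cut can be made without a decoder by walking the string code point by code point and stopping before the byte budget is exceeded. This is only a sketch of an alternative, not the approach used in this post; the class and method names are invented for illustration, and it assumes well-formed text (an unpaired surrogate is counted as 3 bytes, which errs on the safe side).

```java
public class CodePointTruncate {
    // UTF-8 size in bytes of a single Unicode code point.
    static int utf8Size(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    // Longest prefix of text whose UTF-8 encoding fits in maxBytes.
    static String truncateUtf8(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int size = utf8Size(cp);
            if (bytes + size > maxBytes) {
                break; // the next character would not fit completely
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        String plainText = "동해물과 백두산이 마르고 닳도록";
        System.out.println(truncateUtf8(plainText, 10)); // 동해물
    }
}
```

This version never encodes the whole string into a byte array, which may matter for long inputs, but it is tied to UTF-8, whereas the decoder approach can be pointed at another charset.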

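    // Count how many bytes the text takes when encoded as UTF-8
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);

            if (ch <= 0x7F) {
                count++; // ASCII: 1 byte
            } else if (ch <= 0x7FF) {
                count += 2; // U+0080..U+07FF: 2 bytes
            } else if (Character.isHighSurrogate(ch) && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                count += 4; // surrogate pair (supplementary character): 4 bytes
                ++i; // skip the low surrogate that was just counted
            } else {
                count += 3; // rest of the BMP, including Hangul: 3 bytes
            }
        }
        return count;
    }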
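
One way to gain confidence in the arithmetic counter is to compare it with what the encoder actually produces. The sketch below restates the counter so that it compiles on its own; the class name and the `matchesEncoder` helper are illustrative additions, and the comparison is only expected to hold for well-formed text.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthCheck {
    // Restatement of the UTF-8 byte counter, included so this sketch is self-contained.
    static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch) && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                count += 4;
                ++i; // skip the low surrogate of the pair
            } else {
                count += 3;
            }
        }
        return count;
    }

    // True when the counter agrees with the size of the real UTF-8 encoding.
    static boolean matchesEncoder(String text) {
        return length(text) == text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String plainText = "동해물과 백두산이 마르고 닳도록";
        System.out.println(length(plainText));         // 45
        System.out.println(matchesEncoder(plainText)); // true
    }
}
```

The counter avoids allocating a byte array just to measure it, which is the main reason to prefer it over `getBytes(...).length` in a hot path.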

Tags: java, length, substring, utf-8, string, encoding, truncation
