Java 9 Compact Strings Example

Tech Lead & Architect | 13+ Years in Cloud, Backend, and AI - Experienced software engineer with expertise in Java, Spring Boot, Microservices, Angular, React, Kafka, DevOps, Python, PySpark, Databricks, and Generative AI. Certified in TOGAF, AWS, and Google Cloud. Passionate about building scalable, secure, and high-performance systems. Enthusiast in Data Engineering & Agentic AI. Author of 1,200+ technical articles sharing insights across diverse tech stacks.

Date: 2017-07-04

The Evolution of Strings in Java: A Deep Dive into Compact Strings

Java, a language known for its robustness and widespread use, has undergone continuous evolution. One area of significant improvement, particularly noticeable in Java 9, is the optimization of string handling. Strings, being fundamental data structures used in virtually every Java application, are prime targets for performance enhancements. Java 9's introduction of "Compact Strings" exemplifies this ongoing effort to improve efficiency and reduce memory consumption.

The journey toward compact strings began long before Java 9. Early versions of Java used UCS-2, a fixed-width 16-bit encoding of Unicode capable of representing 65,536 characters. This meant each character in a Java string occupied two bytes of memory. The approach fell short in two ways: it could not represent characters once Unicode grew beyond 65,536 code points, and it wasted space when strings contained only characters representable in a single byte. The addition of UTF-16 support in Java 5 addressed the first problem by representing characters outside the original 16-bit range as pairs of 16-bit units (surrogate pairs), but the underlying two-byte unit remained.
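As a small illustration of the UTF-16 model described above, the following sketch compares the number of 16-bit units in a string with the number of Unicode code points. The class and method names (SupplementaryCharDemo, utf16Units, codePoints) are made up for this example; only the String and Character calls are standard library APIs.

```java
final class SupplementaryCharDemo {

    // Number of 16-bit char units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting each surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String basic = "A";
        // U+1F600 lies outside the 16-bit range, so it needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(utf16Units(basic) + " unit(s), " + codePoints(basic) + " code point(s)");
        System.out.println(utf16Units(supplementary) + " unit(s), " + codePoints(supplementary) + " code point(s)");
    }
}
```

The second string reports two units for a single code point, which is why the two-byte unit survived the move from UCS-2 to UTF-16.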

Internally, a Java string was not just a simple sequence of characters. It consisted of two parts: the String object itself and a character array holding the actual string data. This array of 16-bit characters was a source of significant memory overhead, especially for strings containing only characters from the Latin-1 (ISO-8859-1) character set. These characters, which cover English and many other Western European languages, can each be represented in eight bits, so the upper eight bits of every array element were wasted. The inefficiency was most pronounced in applications holding a large number of strings, which could consume a substantial share of the Java Virtual Machine's (JVM) heap. That memory stayed occupied until the strings became unreachable and were garbage collected, making efficient string storage crucial.

To address this memory inefficiency, Java 9 introduced JEP 254 (JDK Enhancement Proposal 254), focusing on Compact Strings. The core idea was to reduce the memory footprint of strings by storing characters using only eight bits when possible. This change was purely an internal implementation detail; the public interface for interacting with strings remained unchanged. Extensive analysis of various Java applications had revealed that many strings were primarily composed of LATIN-1 characters. This observation underpinned the decision to optimize for this common case.

The mechanism behind compact strings is an optimistic compression strategy applied at string creation. The system first attempts to store the string using one byte per character (the ISO-8859-1 encoding, also known as LATIN-1). If any character does not fit in eight bits, the entire string is instead stored in the standard UTF-16 representation, with two bytes per 16-bit unit. Strings consisting solely of LATIN-1 characters therefore need half the space for their character data, while strings containing any character outside this range keep the previous UTF-16 representation.
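The strategy above can be sketched roughly as follows. This is an illustrative model, not the JDK's actual code: the class name CompactEncodingSketch, the constants, and the choice of big-endian byte order for the UTF-16 fallback are all assumptions made for the example.

```java
final class CompactEncodingSketch {
    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    final byte[] value;
    final byte coder;

    private CompactEncodingSketch(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    // Optimistically pack into one byte per char; fall back to UTF-16
    // for the whole string as soon as one char does not fit in 8 bits.
    static CompactEncodingSketch encode(char[] chars) {
        byte[] compact = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c > 0xFF) {
                return new CompactEncodingSketch(toUtf16Bytes(chars), UTF16);
            }
            compact[i] = (byte) c;
        }
        return new CompactEncodingSketch(compact, LATIN1);
    }

    // Two bytes per char, high byte first (byte order is an assumption here).
    static byte[] toUtf16Bytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        CompactEncodingSketch latin = encode("hello".toCharArray());
        CompactEncodingSketch wide = encode("h\u20ACllo".toCharArray()); // contains the euro sign
        System.out.println("coder=" + latin.coder + " bytes=" + latin.value.length);
        System.out.println("coder=" + wide.coder + " bytes=" + wide.value.length);
    }
}
```

Note that a single non-LATIN-1 character doubles the storage for the entire string; the fallback is all or nothing.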

Internal changes were required to support this new approach. A new internal field named "coder" was added to the String class to indicate whether the string's byte array is encoded using LATIN-1 (one byte per character) or UTF-16 (two bytes per character). The coder field affects the implementation of many string methods, particularly those dealing with length and character access. Calculating the length, for example, requires checking the coder field to decide whether to use the byte array's length directly (LATIN-1) or halve it (UTF-16).
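To make the role of the coder field concrete, here is a minimal sketch of coder-aware length and character access. The class CoderAwareString is hypothetical and only mirrors the idea described above; it assumes the coder is 0 for LATIN-1 and 1 for UTF-16, and that UTF-16 data is stored high byte first.

```java
final class CoderAwareString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    CoderAwareString(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    // Shifting by the coder leaves the byte count unchanged for LATIN-1 (>> 0)
    // and halves it for UTF-16 (>> 1).
    int length() {
        return value.length >> coder;
    }

    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            // Mask to avoid sign extension of bytes above 0x7F.
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        CoderAwareString latin = new CoderAwareString(new byte[] {'h', 'i'}, LATIN1);
        CoderAwareString wide = new CoderAwareString(
                new byte[] {0x20, (byte) 0xAC, 0x00, 0x41}, UTF16); // euro sign, then 'A'
        System.out.println(latin.length() + " " + latin.charAt(1));
        System.out.println(wide.length() + " " + (int) wide.charAt(0));
    }
}
```

Every method that touches the character data needs a branch of this kind, which is why the change reached so much of the String implementation.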

The change also affected the performance of specific string operations. In some instances, such as finding the index of a character (indexOf(char)) within a LATIN-1 string, performance was unexpectedly slower compared to the same operation on a UTF-16 string. This was attributed to the lack of intrinsic (optimized) methods for the LATIN-1 case, which was subsequently identified as an issue targeted for improvement in later Java releases.

While Compact Strings are enabled by default in Java 9, there are situations where disabling this feature may be beneficial. For applications heavily reliant on strings predominantly composed of characters beyond the LATIN-1 range, the overhead of the optimistic compression attempt and the subsequent fallback to UTF-16 could outweigh any memory savings. The -XX:-CompactStrings JVM flag allows developers to disable this feature when deemed necessary.
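One way to confirm how a given JVM was launched is to inspect its startup arguments. The sketch below is a hedged example: the class and method names are invented, it only sees arguments reported by the runtime MXBean, and it assumes that when both forms of the flag are present the later one takes effect.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class CompactStringsFlagCheck {

    // Returns true if the last CompactStrings flag in the list disables the feature.
    // With no flag present the default (enabled) applies, so this returns false.
    static boolean disabledByFlag(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:-CompactStrings")) {
                disabled = true;
            } else if (arg.equals("-XX:+CompactStrings")) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Compact Strings disabled by flag: " + disabledByFlag(jvmArgs));
    }
}
```

Launching it as `java -XX:-CompactStrings CompactStringsFlagCheck` should report that the feature is disabled, while a plain `java CompactStringsFlagCheck` should not.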

In contrast to a previous attempt at string compression in Java 6 (Compressed Strings), the Compact Strings implementation avoids the performance pitfalls of its predecessor. Compressed Strings required significant unpacking and repacking, leading to runtime inefficiencies. It was a separate, less integrated implementation and was ultimately removed in a later Java version. Compact Strings are more tightly integrated into the String class and require less conversion overhead. Performance tests and independent analyses have shown that Compact Strings significantly reduce memory footprint and improve garbage collection performance, especially when processing strings consisting largely of LATIN-1 characters.

In summary, Java 9's Compact Strings represent a significant advancement in Java's string handling. By storing LATIN-1 strings in one byte per character, the feature reduces memory usage and improves performance without altering the familiar string API. Although some initial performance quirks were identified, the overall impact on many Java applications is overwhelmingly positive. The transition to compact strings reflects a sophisticated approach to memory management and the continuous evolution of this widely used programming language.

The Engineering Orbit