International Text Segmentation with Intl.Segmenter in JavaScript

Discover how Intl.Segmenter improves text segmentation in JavaScript for internationalization

Zachary Lee
JavaScript in Plain English



JavaScript's Intl object provides language-sensitive string comparison, number formatting, and date and time formatting. One of its lesser-known features is Intl.Segmenter. It performs locale-sensitive text segmentation, making it easier to extract meaningful items like graphemes, words, or sentences from a string.

The Need for Intl.Segmenter

You might be wondering why we need Intl.Segmenter when we have methods like split(). To understand this, let's take a simple example. Let's say we want to break down a string into sentences. A basic approach might look like this:

'Hello there! How are you doing?'.split(/[.!?]/);

This code would give us an array: ['Hello there', ' How are you doing', '']. While it seems to work, there are a few problems:

  1. The punctuation marks that were used as separators are now gone.
  2. There are leading spaces in some of the resulting strings.
  3. This approach isn’t language-sensitive. It won’t work well for languages that don’t use ., !, or ? to end sentences.

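Suppose we're dealing with a Japanese string: '吾輩は猫である。名前はたぬき。', which translates to "I am a cat. My name is Tanuki." Here the sentences end with the ideographic full stop 。, which our regular expression doesn't match, so split() returns the whole string as a single item. This is where Intl.Segmenter comes to the rescue!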

The Mechanics of Intl.Segmenter

Intl.Segmenter allows us to split strings into meaningful parts. We just need to define a locale and granularity (which could be a sentence, word, or grapheme), and it's ready to segment any string.

Here’s an example of using Intl.Segmenter:

const germanSegmenter = new Intl.Segmenter('de', {
  granularity: 'word'
});
const germanSegments = germanSegmenter.segment('Was geht ab, Freunde?');

In this snippet, we’re breaking down a German string into words.
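The same constructor also covers the sentence examples from earlier. Here is a minimal sketch, assuming a runtime that ships Intl.Segmenter; the helper name splitSentences is hypothetical and not part of the API.

```javascript
// `splitSentences` is a hypothetical helper name used only for this example.
function splitSentences(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  // Keep only the text of each segment.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const english = splitSentences('Hello there! How are you doing?', 'en');
const japanese = splitSentences('吾輩は猫である。名前はたぬき。', 'ja');

console.log(english);  // ['Hello there! ', 'How are you doing?']
console.log(japanese); // ['吾輩は猫である。', '名前はたぬき。']
```

Unlike split(), the segments keep their punctuation and trailing whitespace, so joining them reproduces the original string.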

The Return Value of segment()

The segment() method doesn't return an array. It returns an iterable Segments object. To access all segments, we can use array spreading, Array.from(), or a for...of loop.

Here’s how you can do it:

const germanSegmenter = new Intl.Segmenter('de', {
  granularity: 'sentence'
});
const germanSegments = germanSegmenter.segment('Was geht ab, Freunde?');

// Each of the three approaches below iterates the segments from the start.
console.log([...germanSegments]);
console.log(Array.from(germanSegments));
for (const segment of germanSegments) {
  console.log(segment);
}

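Each item produced by the iteration is a plain object that includes the original string (input), the position in the original string where the segment starts, counted in UTF-16 code units (index), and the actual segment string (segment). With word granularity, it also carries an isWordLike flag.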
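To make these properties concrete, here is a small sketch that prints the index and text of each sentence and checks that index lines up with the input string. The two-sentence German input is made up for illustration. The containing() method used at the end is part of the Segments object and looks up the segment covering a given code unit index.

```javascript
const text = 'Was geht ab? Alles gut.';
const segmenter = new Intl.Segmenter('de', { granularity: 'sentence' });
const segments = segmenter.segment(text);

// Collect the start index and text of every sentence segment.
const sentences = Array.from(segments, ({ segment, index }) => ({ index, segment }));
console.log(sentences);

// index is measured in UTF-16 code units, so slicing the input from
// index to index + segment.length gives back the same segment.
const roundTrips = sentences.every(
  ({ index, segment }) => text.slice(index, index + segment.length) === segment
);
console.log(roundTrips); // true

// containing() returns the segment data for the segment covering index 15.
const hit = segments.containing(15);
console.log(hit.segment); // 'Alles gut.'
```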

Mapping Segments to Their String Values

If you want to map the segments to their string values, you can use the second argument of Array.from(), which is a mapping function.

Here’s an example:

const germanSegmenter = new Intl.Segmenter('de', {
  granularity: 'sentence'
});
const germanSegments = germanSegmenter.segment('Was geht ab?');

console.log(Array.from(germanSegments, s => s.segment));

Using the isWordLike Property

When you split a string into words, the result also contains segments for spaces, punctuation marks, and line breaks. Thankfully, each segment produced with word granularity has an isWordLike property that can help filter these out. Here's how:

const germanSegmenter = new Intl.Segmenter('de', {
  granularity: 'word'
});
const germanSegments = germanSegmenter.segment('Was geht ab?');

console.log([...germanSegments].filter(s => s.isWordLike));

In this example, we filter out all segments that aren’t words. So, spaces, punctuation marks, and line breaks will be excluded from the result.
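One practical use of this filter is counting words. The following is a minimal sketch; countWords is a hypothetical helper name, not something Intl.Segmenter provides.

```javascript
// `countWords` is a hypothetical helper name used only for this example.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    // Skip spaces and punctuation; count only word-like segments.
    if (isWordLike) count += 1;
  }
  return count;
}

const wordCount = countWords('Was geht ab, Freunde?', 'de');
console.log(wordCount); // 4
```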

Handling Emojis with Intl.Segmenter

One of the exciting applications of Intl.Segmenter is its ability to split a string into visual emojis. Emojis can be quite complex, especially with the advent of compound emojis that consist of multiple code points.

Let’s consider this string of emojis: ‘🫣🫵👨‍👨‍👦‍👦’.

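If we try to split it by code units or even code points, we won't get the expected results. Calling split('') cuts every emoji into lone surrogate halves, and spreading by code points breaks the family emoji into its four people and the zero-width joiners between them: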

const emojis = '🫣🫵👨‍👨‍👦‍👦';

console.log(emojis.split('')); // Split by code units
console.log([...emojis]); // Split by code points

However, Intl.Segmenter handles this gracefully:

const emojis = '🫣🫵👨‍👨‍👦‍👦';

const segmenter = new Intl.Segmenter('en', {
  granularity: 'grapheme'
});
console.log(Array.from(segmenter.segment(emojis), s => s.segment));

In this case, we get each compound emoji as a separate grapheme.
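The difference between the three ways of counting is easy to see side by side. This is a sketch that writes the emoji string with escape sequences, assuming the family emoji is the man, man, boy, boy sequence joined by zero-width joiners (U+200D). The helper name countGraphemes is hypothetical.

```javascript
// Two single emojis followed by four people joined by zero-width joiners.
// The exact code points are an assumption about the string shown above.
const emojis =
  '\u{1FAE3}\u{1FAF5}\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';

// `countGraphemes` is a hypothetical helper name used only for this example.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

const codeUnits = emojis.length;          // UTF-16 code units
const codePoints = [...emojis].length;    // Unicode code points
const graphemes = countGraphemes(emojis); // user-perceived characters

console.log(codeUnits, codePoints, graphemes); // 15 9 3
```

A grapheme count like this is what you usually want for things such as character limits in a text field, because it matches what the user sees.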

Wrapping Up

With the ever-growing complexity and globalization of web applications, a tool like Intl.Segmenter can be a game-changer. It brings language-aware string handling to JavaScript, making tasks like segmentation not only possible but relatively straightforward.
