# `Unicode.String.Break.Word`
[🔗](https://github.com/elixir-unicode/unicode_string/blob/v2.1.0/lib/unicode/string/break/word.ex#L1)

A single-pass, DFA-style implementation of the UAX #29 word-break rules.

## State

Per-position state is intentionally compact:

* `prev`, `prev2` — the *effective* Word_Break property of the
  previous and second-previous codepoint, where `Extend`, `Format`,
  and `ZWJ` are skipped over (per WB4). `prev2` is only needed for
  WB7 (`AHLetter MidLetter|MidNumLetQ × AHLetter`),
  WB7c (`HebrewLetter DoubleQuote × HebrewLetter`), and
  WB11 (`Numeric MidNum|MidNumLetQ × Numeric`).

* `ri_parity` — `:odd` or `:even`, parity of the run of
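  Regional_Indicators ending at `prev` (WB15/16). A break between two
  Regional_Indicators is suppressed only when that run has odd length,
  which keeps flag pairs together.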

* `prev_actual` — the Word_Break property of the codepoint
  *immediately* preceding the current one (without WB4 skipping).
  Required by rules that don't allow transparent characters in
  between, namely WB3 (`CR × LF`), WB3c (`ZWJ × ExtPict`), and
  WB3d (`WSegSpace × WSegSpace`).
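As a rough illustration of how these fields could evolve, the sketch below advances a state record by one codepoint property at a time. The module name `WordBreakStateSketch`, the struct, and the lowercase property atoms (`:a_letter`, `:extend`, and so on) are assumptions made for this example and are not taken from the library source. It also ignores the WB4 exception for transparent characters that follow start of text, `CR`, `LF`, or `Newline`.

```elixir
defmodule WordBreakStateSketch do
  # Hypothetical state shape: field names follow the list above,
  # but the struct itself is an assumption for illustration.
  defstruct prev: nil, prev2: nil, prev_actual: nil, ri_parity: :even

  # Properties that WB4 treats as transparent.
  @transparent [:extend, :format, :zwj]

  def new, do: %__MODULE__{}

  # Transparent codepoint: only `prev_actual` changes, so rules that
  # consult `prev`/`prev2` see through it (WB4).
  def advance(%__MODULE__{} = state, prop) when prop in @transparent do
    %{state | prev_actual: prop}
  end

  # Any other codepoint: shift `prev` into `prev2` and update the
  # Regional_Indicator run parity.
  def advance(%__MODULE__{} = state, prop) do
    %{
      state
      | prev2: state.prev,
        prev: prop,
        prev_actual: prop,
        ri_parity: next_parity(state, prop)
    }
  end

  # An odd run extended by one more Regional_Indicator becomes even.
  defp next_parity(%{prev: :regional_indicator, ri_parity: :odd}, :regional_indicator),
    do: :even

  # Otherwise a Regional_Indicator starts a run or extends an even one.
  defp next_parity(_state, :regional_indicator), do: :odd

  # Any other property ends the run (length zero, hence even).
  defp next_parity(_state, _prop), do: :even
end
```

Feeding the properties `:a_letter, :extend, :mid_letter, :zwj` through `advance/2` leaves `prev` at `:mid_letter` and `prev2` at `:a_letter`, while `prev_actual` is `:zwj`. That is the distinction the WB3c and WB7 rules rely on.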

## Lookahead

Some rules require knowing the character *after* the candidate
break (WB6, WB7b, WB12). The walker therefore reads codepoints with
one codepoint of buffered lookahead and resolves these rules at
decision time.
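To make the role of the lookahead concrete, here is a minimal sketch covering only WB5, WB6, and WB7. It assumes the input is a list of Word_Break properties from which transparent characters have already been removed. `LookaheadSketch`, its function names, and the property atoms are hypothetical and do not reflect the library's internals.

```elixir
defmodule LookaheadSketch do
  @ah_letter [:a_letter, :hebrew_letter]
  # MidLetter | MidNumLetQ, where MidNumLetQ = MidNumLet | Single_Quote
  @mid_letter_q [:mid_letter, :mid_num_let, :single_quote]

  # Decide whether to break between `prev` and `current`.
  # `prev2` is the effective property before `prev`; `lookahead` is the
  # property after `current`. Either may be nil at the text edges.
  def break?(prev2, prev, current, lookahead) do
    cond do
      # WB5: AHLetter × AHLetter
      prev in @ah_letter and current in @ah_letter -> false
      # WB6: AHLetter × (MidLetter | MidNumLetQ) AHLetter, needs lookahead
      prev in @ah_letter and current in @mid_letter_q and lookahead in @ah_letter -> false
      # WB7: AHLetter (MidLetter | MidNumLetQ) × AHLetter, needs prev2
      prev2 in @ah_letter and prev in @mid_letter_q and current in @ah_letter -> false
      # WB999: otherwise break
      true -> true
    end
  end

  # One decision per interior position of `props`.
  def decisions(props) do
    ([nil] ++ props ++ [nil])
    |> Enum.chunk_every(4, 1, :discard)
    |> Enum.map(fn [prev2, prev, current, lookahead] ->
      break?(prev2, prev, current, lookahead)
    end)
  end
end
```

For the property sequence of a word like `can't` reduced to `[:a_letter, :single_quote, :a_letter]`, both interior positions resolve to "no break". Without the trailing letter, the position before the apostrophe becomes a break, because WB6 cannot fire without its lookahead.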

# `break?`

```elixir
@spec break?(String.t(), String.t()) :: boolean()
```

Boundary predicate: returns `true` if there is a word boundary at the
point where `string_before` ends and `string_after` begins.

# `next`

```elixir
@spec next(String.t()) :: {String.t(), String.t()} | nil
```

Returns `{first_word, rest}` for `string`, or `nil` for the empty string.

# `split`

```elixir
@spec split(String.t()) :: [String.t()]
```

Splits `string` into a list of word-break segments.
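The return shape documented for `next` (`{segment, rest}` or `nil`) is the one `Stream.unfold/2` expects, so a split can be expressed as repeated calls to a `next`-like function. The sketch below shows that relationship only; it is not a claim about how the library implements `split`. `SplitSketch` and `toy_next/1` are hypothetical, and `toy_next/1` is a stand-in that alternates runs of spaces and non-spaces rather than applying UAX #29.

```elixir
defmodule SplitSketch do
  # `next_fun` is any 1-arity function returning `{segment, rest}`,
  # or `nil` once the input is exhausted.
  def split(string, next_fun) when is_binary(string) and is_function(next_fun, 1) do
    string
    |> Stream.unfold(next_fun)
    |> Enum.to_list()
  end

  # Hypothetical stand-in for a real `next/1`: NOT UAX #29, just
  # alternating runs of ASCII spaces and non-space bytes.
  def toy_next(""), do: nil

  def toy_next(string) do
    [segment] = Regex.run(~r/\A(?: +|[^ ]+)/, string)
    size = byte_size(segment)
    {segment, binary_part(string, size, byte_size(string) - size)}
  end
end
```

With the stand-in, `SplitSketch.split("Hello world", &SplitSketch.toy_next/1)` yields the three segments `"Hello"`, `" "`, and `"world"`, and the empty string yields an empty list. Assuming the module's `next/1` behaves as its spec above suggests, it could be passed in place of `toy_next/1`.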
