BoredOS/docs/appdev/inputs_api_(utf8).md
Lluciocc 8d0e744991 doc: Add UTF-8 byte structure section and resources (#10)
Added a section on UTF-8 byte structure with a diagram and a recommended video for further understanding.
2026-04-25 00:51:54 +02:00

# UTF-8 Library — Application Development Guide
## Overview
The userland libc provides a lightweight UTF-8 utility module located in:
- src/userland/libc/utf-8.c
- src/userland/libc/utf-8.h
This module is designed for **direct use in applications** requiring UTF-8 handling. It provides basic primitives for decoding, encoding, and traversing UTF-8 strings safely.
It is intended for:
- text rendering
- terminal input/output
- cursor movement
- string processing at the character level
---
## Synopsis
```c
#include "utf-8.h"
uint32_t text_decode_utf8(const char *s, int *advance);
int text_encode_utf8(uint32_t cp, char *out);
const char* text_next_utf8(const char *s);
const char* text_prev_utf8(const char *start, const char *s);
int text_strlen_utf8(const char *s);
```
---
## API Reference
### text_decode_utf8
```c
uint32_t text_decode_utf8(const char *s, int *advance);
```
Decodes a UTF-8 sequence into a Unicode code point.
- `s`: pointer to current position in a UTF-8 string
- `advance`: receives number of bytes consumed
Returns:
- decoded Unicode code point (`uint32_t`)
- `0` if input is null or empty
- `0xFFFD` for invalid sequences
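To make the return contract concrete, here is a minimal sketch of a decoder that behaves as described above. It is not the BoredOS implementation: `sketch_decode_utf8` is a hypothetical stand-in, and two details are assumptions, namely that `advance` is set to `0` for null or empty input and that an invalid sequence advances past only the bytes examined. Overlong forms and surrogates are not rejected here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT the libc implementation. */
static uint32_t sketch_decode_utf8(const char *s, int *advance)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int len, i;

    /* Null or empty input: return 0 (advance = 0 is an assumption). */
    if (p == NULL || p[0] == 0) {
        *advance = 0;
        return 0;
    }

    /* The lead byte selects the sequence length and the first payload bits. */
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else {
        /* Stray continuation byte or invalid lead byte. */
        *advance = 1;
        return 0xFFFD;
    }

    /* Each continuation byte (10xxxxxx) contributes six payload bits. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) {
            /* Truncated or malformed: stop before the offending byte. */
            *advance = i;
            return 0xFFFD;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }

    *advance = len;
    return cp;
}
```

Because the loop stops at the first byte that is not a continuation byte, a sequence cut short by the terminating NUL is never read past the end of the string.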
---
### text_encode_utf8
```c
int text_encode_utf8(uint32_t cp, char *out);
```
Encodes a Unicode code point into UTF-8.
- `cp`: Unicode code point
- `out`: buffer receiving encoded bytes
Returns:
- number of bytes written (1 to 4)

If `cp` is invalid, the replacement character (`0xFFFD`) is written instead.
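The encoding rule can be sketched as follows. `sketch_encode_utf8` is a hypothetical stand-in rather than the libc source, and which values count as invalid is an assumption here (surrogates `0xD800` to `0xDFFF` and anything above `0x10FFFF`). The caller is assumed to provide at least 4 bytes, and the sketch does not NUL-terminate the output.

```c
#include <stdint.h>

/* Hypothetical stand-in, NOT the libc implementation. */
static int sketch_encode_utf8(uint32_t cp, char *out)
{
    /* Assumption: surrogates and values above 0x10FFFF are invalid
     * and are replaced by U+FFFD. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```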
---
### text_next_utf8
```c
const char* text_next_utf8(const char *s);
```
Advances to the next UTF-8 character.
Returns a pointer to the next character boundary.
---
### text_prev_utf8
```c
const char* text_prev_utf8(const char *start, const char *s);
```
Moves backward to the previous UTF-8 character.
- `start`: beginning of the buffer
- `s`: current position
Used for reverse traversal and cursor movement.
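Reverse traversal works because continuation bytes always have the bit pattern `10xxxxxx`, so stepping back until a byte does not match that pattern lands on a character boundary. The sketch below illustrates the idea with a hypothetical `sketch_prev_utf8`; returning `start` when the cursor is already at the beginning is an assumption, not documented behavior.

```c
#include <stddef.h>

/* Hypothetical stand-in, NOT the libc implementation. */
static const char *sketch_prev_utf8(const char *start, const char *s)
{
    /* Assumption: at (or before) the start of the buffer, stay at start. */
    if (s <= start)
        return start;

    /* Step back one byte, then skip any continuation bytes (10xxxxxx). */
    s--;
    while (s > start && ((unsigned char)*s & 0xC0) == 0x80)
        s--;

    return s;
}
```

The `start` parameter is what keeps the backward scan from running off the front of the buffer when the data begins with stray continuation bytes.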
---
### text_strlen_utf8
```c
int text_strlen_utf8(const char *s);
```
Counts UTF-8 characters (code points), not bytes.
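One common way to count code points is to count every byte that is not a continuation byte. The sketch below uses that approach in a hypothetical `sketch_strlen_utf8`; whether the libc function counts this way, and how it treats malformed input, is not stated in this guide.

```c
#include <stddef.h>

/* Hypothetical stand-in, NOT the libc implementation. */
static int sketch_strlen_utf8(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0;

    /* Every character starts with exactly one non-continuation byte,
     * so counting those bytes counts code points in well-formed input. */
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, the string "héllo" occupies 6 bytes but this sketch reports 5 characters.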
---
## Usage Examples
### Iterating over UTF-8 characters
```c
const char *p = text;
while (*p) {
    int adv;
    uint32_t cp = text_decode_utf8(p, &adv);
    /* process cp */
    p += adv;
}
```
---
### Cursor movement
```c
cursor = text_next_utf8(cursor);
cursor = text_prev_utf8(buffer_start, cursor);
```
---
### Encoding a character
```c
char out[4];                              /* a code point needs at most 4 bytes */
int len = text_encode_utf8(0x20AC, out);  /* U+20AC is '€', a 3-byte sequence */
```
---
### Backspace handling
```c
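/* Step back to the start of the previous character. The bytes between
   prev and the old cursor are the ones a backspace should remove. */
char *prev = (char*)text_prev_utf8(buffer, cursor);
cursor = prev;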
```
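The snippet above only moves the cursor. A full backspace on an editable, NUL-terminated buffer also has to close the gap left by the removed character. The sketch below shows one way to do that. `sketch_backspace` is a hypothetical helper, and to stay self-contained it steps back over continuation bytes inline where an application would call `text_prev_utf8`.

```c
#include <string.h>

/* Hypothetical helper: deletes the character before `cursor` in a
 * NUL-terminated buffer and returns the new cursor position. */
static char *sketch_backspace(char *buffer, char *cursor)
{
    char *prev = cursor;

    /* Nothing to delete at the start of the buffer. */
    if (cursor <= buffer)
        return buffer;

    /* Inline equivalent of stepping to the previous character boundary:
     * back up one byte, then skip continuation bytes (10xxxxxx). */
    do {
        prev--;
    } while (prev > buffer && ((unsigned char)*prev & 0xC0) == 0x80);

    /* Shift the tail (including the terminating NUL) over the removed bytes. */
    memmove(prev, cursor, strlen(cursor) + 1);

    return prev;
}
```

`memmove` is used rather than `memcpy` because the source and destination regions overlap.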
---
## Implementation Notes
### UTF-8 Encoding
The implementation supports:
- 1 byte: `0x00` to `0x7F`
- 2 bytes: `0x80` to `0x7FF`
- 3 bytes: `0x800` to `0xFFFF`
- 4 bytes: `0x10000` to `0x10FFFF`
---
### Replacement Character
Invalid sequences are replaced with:
- code point: `0xFFFD`
- UTF-8 encoding: `0xEF 0xBF 0xBD`
---
### UTF-8 Byte Structure
The following diagram illustrates how UTF-8 bytes are structured, including
ASCII, continuation bytes, and multi-byte sequence headers:
<img width="815" height="1003" alt="image" src="https://github.com/user-attachments/assets/0d289a94-6037-4039-87a3-125c0c0e83d0" />
<sub>Source: <a href="https://www.youtube.com/watch?v=vpSkBV5vydg">Nic Barker — "UTF-8, Explained Simply"</a> (YouTube)</sub>
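
The byte categories in the diagram can also be told apart with a few bit masks. The following sketch classifies a single byte by its high bits. The enum and function names are hypothetical and are not part of `utf-8.h`.

```c
/* Hypothetical names for illustration; not part of utf-8.h. */
enum sketch_byte_kind {
    SKETCH_ASCII,    /* 0xxxxxxx: single-byte character      */
    SKETCH_CONT,     /* 10xxxxxx: continuation byte          */
    SKETCH_LEAD2,    /* 110xxxxx: starts a 2-byte sequence   */
    SKETCH_LEAD3,    /* 1110xxxx: starts a 3-byte sequence   */
    SKETCH_LEAD4,    /* 11110xxx: starts a 4-byte sequence   */
    SKETCH_INVALID   /* 11111xxx: never appears in UTF-8     */
};

static enum sketch_byte_kind sketch_classify_byte(unsigned char b)
{
    if (b < 0x80)           return SKETCH_ASCII;
    if ((b & 0xC0) == 0x80) return SKETCH_CONT;
    if ((b & 0xE0) == 0xC0) return SKETCH_LEAD2;
    if ((b & 0xF0) == 0xE0) return SKETCH_LEAD3;
    if ((b & 0xF8) == 0xF0) return SKETCH_LEAD4;
    return SKETCH_INVALID;
}
```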
---
### Control Signals
Some decoded code points correspond to control signals instead of printable characters.
ASCII control range:
- `0x00` to `0x1F`
Examples:
- `0x08` → Backspace
- `0x09` → Tab
- `0x0A` → Line Feed
- `0x0D` → Carriage Return
- `0x1B` → Escape
These are typically interpreted by:
- terminal logic
- shell input handling
- system interfaces
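
An application that consumes decoded code points might separate control signals from printable characters before rendering. The sketch below shows one way to do that for the range and examples listed above. Both function names are hypothetical, and how BoredOS terminals or shells actually dispatch these codes is not covered here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: true for the ASCII control range 0x00 to 0x1F. */
static int sketch_is_ascii_control(uint32_t cp)
{
    return cp <= 0x1F;
}

/* Hypothetical helper: names the control codes listed in this guide,
 * or returns NULL for anything else. */
static const char *sketch_control_name(uint32_t cp)
{
    switch (cp) {
    case 0x08: return "Backspace";
    case 0x09: return "Tab";
    case 0x0A: return "Line Feed";
    case 0x0D: return "Carriage Return";
    case 0x1B: return "Escape";
    default:   return NULL;
    }
}
```

Note that `0x7F` (Delete) is also an ASCII control code but lies outside the range above, so an application may want to handle it separately.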
---
### Non-ASCII Characters
Characters outside the ASCII range (`0x00` to `0x7F`) are encoded using multi-byte UTF-8 sequences.
Examples:
- 'é' → `0xC3 0xA9`
- '€' → `0xE2 0x82 0xAC`
Decoded values:
- 'é' → `U+00E9`
- '€' → `U+20AC`
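
The mapping from bytes to code points in these two examples is plain bit arithmetic: mask off the marker bits of each byte and concatenate the remaining payload bits. The helpers below are hypothetical and exist only to work through that arithmetic. They perform no validation.

```c
#include <stdint.h>

/* Hypothetical helper: joins the payload bits of a 2-byte sequence
 * (110xxxxx 10xxxxxx). No validation is performed. */
static uint32_t sketch_join2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* Hypothetical helper: joins the payload bits of a 3-byte sequence
 * (1110xxxx 10xxxxxx 10xxxxxx). No validation is performed. */
static uint32_t sketch_join3(unsigned char b0, unsigned char b1,
                             unsigned char b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}

/* 'é': 0xC3 & 0x1F = 0x03, 0xA9 & 0x3F = 0x29
 *      (0x03 << 6) | 0x29 = 0xE9
 * '€': 0xE2 & 0x0F = 0x02, 0x82 & 0x3F = 0x02, 0xAC & 0x3F = 0x2C
 *      (0x02 << 12) | (0x02 << 6) | 0x2C = 0x20AC */
```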
---
### Modifiers and Layout
Character output depends on:
- keyboard layout
- modifier keys (Shift, Ctrl, AltGr)
Example:
- `KEY_E` → 'e'
- `KEY_E + SHIFT` → 'E'
- `KEY_E + AltGr` → '€'
---
## Also worth watching
If you want to dive deeper or simply get a better intuitive understanding of UTF-8, the video below is highly recommended:
[Nic Barker — "UTF-8, Explained Simply"](https://www.youtube.com/watch?v=vpSkBV5vydg)