CI7Tokenizer

WASM Tokenizer for LLM and Embedding models.

A lightweight, asynchronous, and extensible tokenizer playground powered by WebAssembly (WASM) and Web Workers, designed to support multiple language models with real-time tokenization.

👉 Live Demo – Try the interactive tokenizer playground in your browser!
📘 Document – Read the project documentation.
📦 GitHub Repo – View the source code and contribute.

🚀 Features


📦 How to Use

1. Include the Script

Add the following <script> tag to your HTML file:

<script src="CI7Tokenizer.js"></script>

Or load it from the CDN:

<script src="https://cdn.jsdelivr.net/gh/g0stbit/CI7Tokenizer@main/dist/CI7Tokenizer.min.js"></script>

Make sure all dependencies (CI7Tokenizer.worker.js, tokenizers_wasm.js, and the tokenizer JSON files) are reachable at the paths the script expects.


🛠 API Reference

CI7Tokenizer.init(modelName[, url][, readyCallback])

Initializes a tokenizer model. Supports both predefined and custom models.

Parameters

| Param | Type | Description |
|---|---|---|
| modelName | string | Name of the model to load |
| url | string (optional) | Custom URL to the tokenizer config (if not predefined) |
| readyCallback | function (optional) | Callback invoked when the tokenizer is ready |

Example

CI7Tokenizer.init('multilingual-e5-large', () => {
  console.log('Tokenizer ready');
});
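For a model that is not predefined, the optional url argument can be passed between the model name and the callback. The following is a minimal sketch under stated assumptions: initCustomModel is a hypothetical helper (not part of CI7Tokenizer), and the URL is a placeholder for wherever your tokenizer JSON is actually served.

```javascript
// Hypothetical helper: validates the URL, then forwards to
// CI7Tokenizer.init(modelName, url, readyCallback) as documented above.
function initCustomModel(tokenizer, modelName, url, onReady) {
  if (typeof url !== 'string' || url.length === 0) {
    throw new TypeError('url must be a non-empty string');
  }
  tokenizer.init(modelName, url, () => onReady(modelName));
}

// Browser usage, once the CI7Tokenizer script has loaded.
// The path below is a placeholder, not a file shipped with the project.
if (typeof CI7Tokenizer !== 'undefined') {
  initCustomModel(
    CI7Tokenizer,
    'my-custom-model',
    '/tokens/my-custom-model-tokenizer.json',
    (name) => console.log(name + ' ready')
  );
}
```

Passing the tokenizer object in as a parameter keeps the helper easy to exercise with a stub outside the browser.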

CI7Tokenizer.encode(modelName, text, callback)

Encodes the given text using the specified model.

Parameters

| Param | Type | Description |
|---|---|---|
| modelName | string | The name of the loaded model |
| text | string | Text to tokenize |
| callback | function(result) | Called with the tokenization result |

Result Object

{
  "input": "Hello world!",
  "input_ids": [123, 456],
  "tokens": ["Hello", "world!"]
}

Example

CI7Tokenizer.encode('bge-m3', 'This is BERT.', (result) => {
  console.log(result.tokens);
});
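To inspect a result, it can help to line up each token with its ID. The sketch below assumes that tokens and input_ids are parallel arrays of equal length, which the sample result object suggests but this document does not state explicitly; pairTokensWithIds is a hypothetical helper name.

```javascript
// Hypothetical helper: zips result.tokens with result.input_ids.
// Assumes both arrays have the same length and ordering.
function pairTokensWithIds(result) {
  if (result.tokens.length !== result.input_ids.length) {
    throw new RangeError('tokens and input_ids differ in length');
  }
  return result.tokens.map((token, i) => ({ token, id: result.input_ids[i] }));
}

// Using the sample result object shown above:
const sample = {
  input: 'Hello world!',
  input_ids: [123, 456],
  tokens: ['Hello', 'world!'],
};
console.log(pairTokensWithIds(sample));
// [ { token: 'Hello', id: 123 }, { token: 'world!', id: 456 } ]
```

In the browser, the same helper can be called inside the encode callback, for example `CI7Tokenizer.encode('bge-m3', text, (r) => console.table(pairTokensWithIds(r)))`.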

CI7Tokenizer.unload(modelName)

Unloads the specified tokenizer from memory.

Parameters

| Param | Type | Description |
|---|---|---|
| modelName | string | Name of the model to unload |

Example

CI7Tokenizer.unload('my-custom-model');

CI7Tokenizer.loadedModels(callback)

Lists all currently loaded models.

Parameters

| Param | Type | Description |
|---|---|---|
| callback | function(models) | Called with an array of model names |

Example

CI7Tokenizer.loadedModels((models) => {
  console.log('Loaded models:', models);
});
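Combining loadedModels with unload gives a simple way to free every tokenizer at once. This is a sketch rather than a built-in feature: unloadAll is a hypothetical helper, and because unload is documented without a callback, the sketch cannot confirm when each unload has actually finished.

```javascript
// Hypothetical helper: asks for the loaded model names, then unloads each.
// onDone receives the names that were requested to unload; it does not
// guarantee the worker has finished releasing them.
function unloadAll(tokenizer, onDone) {
  tokenizer.loadedModels((models) => {
    models.forEach((name) => tokenizer.unload(name));
    onDone(models);
  });
}

// Browser usage, once the CI7Tokenizer script has loaded.
if (typeof CI7Tokenizer !== 'undefined') {
  unloadAll(CI7Tokenizer, (names) => {
    console.log('Requested unload for:', names);
  });
}
```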

CI7Tokenizer.isModelLoaded(modelName, callback)

Checks whether a model is already loaded.

Parameters

| Param | Type | Description |
|---|---|---|
| modelName | string | Model to check |
| callback | function(isLoaded) | Called with a boolean indicating whether the model is loaded |

Example

CI7Tokenizer.isModelLoaded('bge-m3', (loaded) => {
  if (loaded) console.log('bge-m3 is already loaded.');
});
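A common pattern is to check whether a model is loaded and initialize it only when needed before encoding. The sketch below chains isModelLoaded, init, and encode using the signatures documented above; encodeWhenReady is a hypothetical helper, and it assumes a predefined model, so no url is passed to init.

```javascript
// Hypothetical helper: encodes text, loading the model first if necessary.
function encodeWhenReady(tokenizer, modelName, text, onResult) {
  tokenizer.isModelLoaded(modelName, (loaded) => {
    const run = () => tokenizer.encode(modelName, text, onResult);
    if (loaded) {
      run(); // model already in memory, encode right away
    } else {
      tokenizer.init(modelName, run); // load first, then encode
    }
  });
}

// Browser usage, once the CI7Tokenizer script has loaded.
if (typeof CI7Tokenizer !== 'undefined') {
  encodeWhenReady(CI7Tokenizer, 'bge-m3', 'This is BERT.', (result) => {
    console.log(result.tokens);
  });
}
```

This avoids re-initializing a model on every call while still letting callers encode without tracking load state themselves.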

Sample HTML Page

You can use the provided index.html as a starting point to build your own interface or to integrate the tokenizer into existing apps.

It includes:


Project Structure

/demo/
│
├── index.html                  # Interactive demo page
├── CI7Tokenizer.js             # Main wrapper API
├── CI7Tokenizer.worker.js      # WASM communication handler
├── tokenizers_wasm.js          # Compiled WASM module
└── tokens/                     # Folder for tokenizer JSONs
    ├── multilingual-e5-large-tokenizer.json
    └── bge-m3-tokenizer.json

Development Tips


Contributing

Contributions are welcome! Please feel free to submit issues or pull requests for:


Questions?

For questions, feature suggestions, or support, open an issue on GitHub or reach out at ci7.g0stbit@gmail.com.


Built with ❤️ using WebAssembly, JavaScript Modules, and Web Workers.
Designed for developers, educators, and NLP enthusiasts.


A question from Han Kang

Can the past help the present?
Can the dead save the living?

Be faithful.
For the SpaceNet exists everywhere,
without shape or form,
before everyone's beginning,
by yourself, through your connections, your memories, your choices…