WASM Tokenizer for LLM and Embedding models.
A lightweight, asynchronous, and extensible tokenizer playground powered by WebAssembly (WASM) and Web Workers, designed to support multiple language models with real-time tokenization.
Live Demo: Try the interactive tokenizer playground in your browser!
Document: Read the project documentation.
GitHub Repo: View the source code and contribute.
Add the following <script> tag to your HTML file:
<script src="CI7Tokenizer.js"></script>
Or load it from a CDN:
<script src="https://cdn.jsdelivr.net/gh/g0stbit/CI7Tokenizer@main/dist/CI7Tokenizer.min.js"></script>
Make sure all dependencies (CI7Tokenizer.worker.js, tokenizers_wasm.js, and the tokenizer JSON files) are in the correct path.
CI7Tokenizer.init(modelName[, url][, readyCallback])
Initializes a tokenizer model. Supports both predefined and custom models.
Param | Type | Description |
---|---|---|
modelName | string | Name of the model to load |
url | string (optional) | Custom URL to the tokenizer config (if the model is not predefined) |
readyCallback | function (optional) | Called when the tokenizer is ready |
CI7Tokenizer.init('multilingual-e5-large', () => {
  console.log('Tokenizer ready');
});
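Because init reports readiness through a callback, it can be convenient to wrap it in a Promise. The sketch below is not part of the library: initAsync is a hypothetical helper, and it assumes the ready callback is accepted as the last argument both with and without a url, as the signature above suggests. The tokenizer object is passed in as a parameter; in a browser page that would be the global CI7Tokenizer.

```javascript
// Hypothetical helper (not part of CI7Tokenizer): wraps the callback-style
// init() in a Promise that resolves with the model name once it is ready.
function initAsync(tokenizer, modelName, url) {
  return new Promise((resolve) => {
    const onReady = () => resolve(modelName);
    if (url) {
      // Custom model: the tokenizer config URL goes before the callback.
      tokenizer.init(modelName, url, onReady);
    } else {
      // Predefined model: the callback directly follows the model name.
      tokenizer.init(modelName, onReady);
    }
  });
}
```

In a page this could be used as initAsync(CI7Tokenizer, 'bge-m3').then((name) => console.log(name, 'ready')). No rejection path is wired up, since this README does not document how init reports failures.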
CI7Tokenizer.encode(modelName, text, callback)
Encodes the given text using the specified model.
Param | Type | Description |
---|---|---|
modelName | string | The name of the loaded model |
text | string | Text to tokenize |
callback | function(result) | Called with the tokenization result |
{
  "input": "Hello world!",
  "input_ids": [123, 456],
  "tokens": ["Hello", "world!"]
}
CI7Tokenizer.encode('bge-m3', 'This is BERT.', (result) => {
  console.log(result.tokens);
});
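A common use of the result object is counting tokens, for example to check that a text fits a model's input limit. The following is a minimal sketch under that assumption: encodeAsync and fitsTokenBudget are hypothetical helpers, not library functions, and they rely only on the input_ids field shown in the result format above. The model is assumed to be loaded already.

```javascript
// Hypothetical helper (not part of CI7Tokenizer): Promise wrapper around
// the callback-style encode().
function encodeAsync(tokenizer, modelName, text) {
  return new Promise((resolve) => {
    tokenizer.encode(modelName, text, (result) => resolve(result));
  });
}

// Hypothetical helper: resolves to true when the text encodes to at most
// maxTokens token IDs for the given model.
async function fitsTokenBudget(tokenizer, modelName, text, maxTokens) {
  const result = await encodeAsync(tokenizer, modelName, text);
  return result.input_ids.length <= maxTokens;
}
```

In a page, fitsTokenBudget(CI7Tokenizer, 'bge-m3', someText, 512) would resolve to a boolean; the limit of 512 is only an illustrative value, not one taken from this project.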
CI7Tokenizer.unload(modelName)
Unloads the specified tokenizer from memory.
Param | Type | Description |
---|---|---|
modelName | string | Name of the model to unload |
CI7Tokenizer.unload('my-custom-model');
CI7Tokenizer.loadedModels(callback)
Lists all currently loaded models.
Param | Type | Description |
---|---|---|
callback | function(models) | Called with an array of model names |
CI7Tokenizer.loadedModels((models) => {
  console.log('Loaded models:', models);
});
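loadedModels and unload can be combined to free every tokenizer at once, for instance when a page is being torn down. The sketch below is one possible way to do that: unloadAll is a hypothetical helper, and it assumes unload needs no completion callback, since none is documented above.

```javascript
// Hypothetical helper (not part of CI7Tokenizer): unloads every model that
// loadedModels() reports, then passes the unloaded names to done().
function unloadAll(tokenizer, done) {
  tokenizer.loadedModels((models) => {
    // Copy first, in case the library hands back a live array that
    // shrinks while models are being unloaded.
    const names = [...models];
    names.forEach((name) => tokenizer.unload(name));
    done(names);
  });
}
```

In a page this might be called as unloadAll(CI7Tokenizer, (names) => console.log('Unloaded:', names)).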
CI7Tokenizer.isModelLoaded(modelName, callback)
Checks whether a model is already loaded.
Param | Type | Description |
---|---|---|
modelName | string | Model to check |
callback | function(isLoaded) | Called with a boolean indicating whether the model is loaded |
CI7Tokenizer.isModelLoaded('bge-m3', (loaded) => {
  if (loaded) console.log('bge-m3 is already loaded.');
});
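A typical use of isModelLoaded is to avoid initializing the same model twice. As a rough sketch of that pattern, the hypothetical helper ensureLoaded below calls init only when the check reports false; it assumes a predefined model name, so no url is passed.

```javascript
// Hypothetical helper (not part of CI7Tokenizer): runs done() once the
// model is available, calling init() only if it is not loaded yet.
function ensureLoaded(tokenizer, modelName, done) {
  tokenizer.isModelLoaded(modelName, (isLoaded) => {
    if (isLoaded) {
      done();
    } else {
      tokenizer.init(modelName, done);
    }
  });
}
```

In a page, ensureLoaded(CI7Tokenizer, 'bge-m3', () => { /* safe to call encode here */ }) would skip the load step on repeat calls.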
You can use the provided index.html as a starting point to build your own interface or to integrate the tokenizer into existing apps.
It includes:
/demo/
│
├── index.html               # Interactive demo page
├── CI7Tokenizer.js          # Main wrapper API
├── CI7Tokenizer.worker.js   # WASM communication handler
├── tokenizers_wasm.js       # Compiled WASM module
└── tokens/                  # Folder for tokenizer JSONs
    ├── multilingual-e5-large-tokenizer.json
    └── bge-m3-tokenizer.json
_configs object in CI7Tokenizer.js.
Contributions are welcome! Please feel free to submit issues or pull requests.
For questions, feature suggestions, or support, open an issue on GitHub or reach out at ci7.g0stbit@gmail.com.
Built with ❤️ using WebAssembly, JavaScript Modules, and Web Workers.
Designed for developers, educators, and NLP enthusiasts.
Can the past help the present?
Can the dead save the living?
Be faithful.
For the SpaceNet exists everywhere,
without shape or form,
before everyone's beginning,
by yourself, through your connections, your memories, your choices…