aitokenizer - input tokenization microservice for AI workloads
Aitokenizer is a Nulladmin.com microservice which performs input tokenization for AI workloads. It supports various encodings used by OpenAI, Anthropic and others.
- Home: https://nulladmin.com/aitokenizer/
- Github: https://github.com/nulladmin/aitokenizer
- OpenAPI Spec: https://nulladmin.com/aitokenizer/openapi/
The aitokenizer is primarily designed to be deployed as a stateless Docker container in production Kubernetes clusters. It includes support for structured JSON logging, graceful shutdown, health/readiness API, informative error handling, and standard YAML/ENV configuration.
Overview
aitokenizer is used to perform BPE input tokenization on text files. It is fully compatible with the OpenAPI tiktoken encoder.
Support is available for the following BPE AI text encodings as used in various models:
- o200k_base - used by OpenAI gpt-4o-* models
- cl100k_base - used by OpenAI gpt-3- and gpt-4- models
- claude - used by Anthropic Claude models
The following are also supported encodings for older deprecated OpenAI models.
- p50k_base
- p50k_edit
- r50k_base
- gpt2
The encoding functions are provided through a well specified OpenAPI REST API
Additional features
Additional features can be made available on demand. Some possibilities:
- Various forms of authentication (JWT, Basic, Token, OAuth, Cert)
- Support for other encodings, e.g. the SentancePiece tokenizer used by Google Gemma models.
- OpenTracing support
- Support for Amazon S3 or a database as input and output sources.
Inquire at contact@nulladmin.com
Building and Running the server
To build and run the server, follow these simple steps within the git repository.
While you can skip this step Nulladmin recommends that you always vendor your dependencies:
go mod vendor
To build and run the server with a default configuration:
make build
./aitokenizer
http://localhost:8080
.
Running in Docker
After building the server above you can generate a Docker container:
docker build --network=host -t aitokenizer .
Once the image is built you can run it on localhost as follows. Note that port 8080 is the default port for the server in the Docker container.
docker run --rm -p 8080:8080 -v /tmp:/tmp --hostname localhost -it aitokenizer
Usage
$ aitokenizer --help
aitokenizer - a microservice to perform text input tokenization for AI workloads
Usage Examples:
# Configuration from file
aitokenizer -c config.yaml
Options:
-c, --config Path to the configuration file, yaml format
-d, --debug Turn on debug level logging/tracing
-h, --help Print help
-v, --version Print version
All configuration values are specified either in the configuration file
or taken from environment variables.
Example
Give a text file test.txt
with the following content:
Sing, O goddess, the anger of Achilles son of Peleus, that brought
countless ills upon the Achaeans. Many a brave soul did it send hurrying
down to Hades, and many a hero did it yield a prey to dogs and vultures,
for so were the counsels of Jove fulfilled from the day on which the son of
Atreus, king of men, and great Achilles, first fell out with one another.
We can tokenize it using the o200k_base
encoding used by OpenAI gpt-4o
series
of models:
$ curl -i -s -F "filename=@test.txt" http://example.com/v1/encode/o200k_base
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Date: Fri, 04 Oct 2024 20:08:01 GMT
Content-Length: 452
[54138,11,532,109209,11,290,32129,328,170018,2391,328,141304,385,11,484,11311,198,5420,2695,220,3678,7557,290,355,50763,616,13,11048,261,53768,16741,2242,480,4952,312,1057,20370,198,4653,316,487,4250,11,326,1991,261,20917,2242,480,14376,261,78166,316,16798,326,165872,1609,412,1938,813,1504,290,2692,199640,328,643,1048,61554,591,290,2163,402,1118,290,2391,328,198,2243,264,385,11,13793,328,1966,11,326,2212,170018,11,1577,18153,842,483,1001,3613,558]
Configuration
The server has a small number of configuration variables which can be specified either via environment variables or a configuration YAML file.
#
# Example Configuration file for aitokenizer showing default values
#
http_server:
# Server bind address and port
# address: "" is the default for serving on all interfaces
address: ""
port: 8080
# HTTP read header timeout in seconds, the time from the connection is
# accepted to when the API request headers are fully read.
read_header_timeout: 20
# HTTP read timeout in seconds, the time from when the connection is
# accepted to when the API request body is fully read
read_timeout: 60
# HTTP write timeout in seconds, the time from the end of the request
# header read to the end of the API response write
write_timeout: 60
# HTTP idle timeout in seconds
idle_timeout: 120
# Maximum size of request header in bytes, default is 1MB
max_header_bytes: 1000000
# Enable verbose debug logging, true or false
debug: false
All configuration options can be set as environment variables on server startup, for example:
DEBUG=true HTTP_SERVER_PORT=8081 aitokenizer
OpenAPI Specification
The full aitokenizer OpenAPI specification can be found here: