CoderAI multimodal capabilities

Capability families

Text and embeddings

LLM chat/completions, embeddings, streaming responses, prompt caching, tool calling, per-model context, quantization, and model capability discovery.

Image

Text-to-image, image-to-image, inpainting, upscaling, deblur, unpixelate, outfit change, face swap, depth estimation, segmentation, ControlNet-style conditioning, and restoration workflows.

Video

Text-to-video, image-to-video, video-to-video, Ti2V, interpolation, upscaling, subtitles, dubbing, Wav2Lip/SadTalker lip sync, and 3D video processing.

Audio

Kokoro TTS, Whisper STT, MusicGen/AudioGen/AudioLDM2, F5-TTS voice cloning, Seed-VC voice conversion, singing mode, saved voice profiles, and stem separation.

Profiles and consistency

CoderAI uses named profile collections to keep a character's identity, a scene's look, or a voice consistent across generations.

Character profiles: reference images for appearance conditioning via IP-Adapter. Up to six profiles can be selected per generation.
Environment profiles: reference scene/background images for environmental style. Same multi-slot selection model.
Voice profiles: reference audio plus transcript for reusable voice cloning and conversion workflows.
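
As an illustration of the multi-slot selection model, the sketch below enforces the six-profile limit on the client side before a generation request is built. The function name, the `library` dictionary layout, and the file names are hypothetical; the source only states the limit itself, not how CoderAI stores or validates profiles.

```python
MAX_PROFILES_PER_GENERATION = 6  # limit stated above for profile selection


def select_profiles(library, names, limit=MAX_PROFILES_PER_GENERATION):
    """Return the reference-image lists for the chosen profiles, in order.

    `library` is a hypothetical mapping of profile name -> reference images.
    """
    if len(names) > limit:
        raise ValueError(f"at most {limit} profiles per generation, got {len(names)}")
    unknown = [name for name in names if name not in library]
    if unknown:
        raise KeyError(f"unknown profiles: {unknown}")
    return [library[name] for name in names]


# Illustrative data only; these profile and file names are made up.
library = {
    "hero": ["hero_front.png", "hero_side.png"],
    "rival": ["rival.png"],
}
chosen = select_profiles(library, ["hero", "rival"])
```

Checking the limit before sending anything avoids a round trip for a request that would be rejected or silently truncated, whichever the server does.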

2D / 3D conversion

Image → stereo pair, anaglyph, depth map, or mesh.
3D model → rendered image from a specified viewpoint.
Video → frame-by-frame 3D/depth processing.
3D model → turntable video.
Text/image → GLB model with compatible 3D generation models.

Pipelines

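Built-in pipelines chain common multi-step workflows into a single request. The custom pipeline builder chains arbitrary step types and passes data between them with template variables such as `{{input}}`, `{{stepN.output}}`, and `{{stepN.url}}`, where `N` identifies an earlier step.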
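
To make the variable syntax concrete, here is a minimal sketch of how such placeholders could be resolved. It is not CoderAI's actual substitution code, which this document does not show: the `resolve` function, the `step_results` layout, and the sample values are all assumptions.

```python
import re

# Hypothetical resolver for {{input}} and {{stepN.output}} / {{stepN.url}}.
_VAR = re.compile(r"\{\{\s*(input|step(\d+)\.(output|url))\s*\}\}")


def resolve(template, pipeline_input, step_results):
    """Replace pipeline placeholders in a template string.

    `step_results` maps a step number to a dict with "output" and "url" keys
    (an assumed layout, chosen only for this example).
    """
    def _substitute(match):
        if match.group(1) == "input":
            return str(pipeline_input)
        step_number = int(match.group(2))
        field_name = match.group(3)
        return str(step_results[step_number][field_name])

    return _VAR.sub(_substitute, template)


# Illustrative values only.
results = {1: {"output": "a red fox at dawn", "url": "/files/step1.png"}}
prompt = resolve("Animate {{step1.url}} using: {{step1.output}}", "fox", results)
```

Text that does not match one of the three placeholder forms is left untouched, so a typo such as `{{step1.path}}` stays visible in the output instead of being replaced with an empty string.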

Endpoint	Description
`POST /v1/pipelines/image-to-video`	Generate an image, animate it, optionally add audio.
`POST /v1/pipelines/video-dub`	Transcribe, translate, TTS dub, and optionally burn subtitles.
`POST /v1/pipelines/story`	LLM script, images per scene, video, and narration.
`POST /v1/pipelines/audio-dub`	Transcribe audio/video, translate, clone voice, replace audio.
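
A client call to one of these endpoints might be assembled as in the sketch below. Only the paths and the `POST` method come from the table above; the base URL, the JSON body, and the `prompt` field are assumptions, so check the API reference for the real request schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host and port, not from the docs


def build_pipeline_request(pipeline, payload, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/pipelines/<pipeline>."""
    return urllib.request.Request(
        url=f"{base_url}/v1/pipelines/{pipeline}",
        data=json.dumps(payload).encode("utf-8"),  # JSON body is an assumption
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "prompt" is a hypothetical field name, not a documented parameter.
req = build_pipeline_request("image-to-video", {"prompt": "a lighthouse in a storm"})
# To send it against a running server: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example runnable without a server and makes the URL and payload easy to inspect.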

Bundled demo/example tools

The repository also includes three demo/example web applications in `tools/`. They are not required to use the API, but they show how CoderAI can act as a backend for larger media workflows. The Docker / OCI image exposes all three through nginx on the same published port.
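
Because the three tools share one published port, their URLs differ only by path. The sketch below derives them from a single base URL, using the routes listed for each tool in this section; the port `8080` is an assumption and depends on how the container is published.

```python
from urllib.parse import urljoin

# Routes as documented for each tool in this section.
TOOL_ROUTES = {
    "tools/video_editor.py": "/editor/",
    "tools/videogen.py": "/videogen/",
    "tools/gen_township_fighters.py": "/township/",
}


def tool_urls(base_url):
    """Map each demo tool to its URL behind the shared nginx port."""
    return {tool: urljoin(base_url, route) for tool, route in TOOL_ROUTES.items()}


urls = tool_urls("http://localhost:8080")  # assumed published port
```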

`tools/video_editor.py`

Browser video editor using CoderAI TTS plus local ffmpeg/ffprobe for timeline editing, generated voiceover, music tracks, speed ramps, uploads, and final rendering.

Docker / OCI route: `/editor/`

`tools/videogen.py`

VideoGen Studio manages character/environment profiles and builds multi-clip short movies with video generation, speech/lip-sync, music, and sound effects.

Docker / OCI route: `/videogen/`

`tools/gen_township_fighters.py`

Township Fighters is an example app for generating fighter-match videos in an MMA-style flow: characters, environments, fight clips, progress, and output review.

Docker / OCI route: `/township/`

Model capability indicators

The repository documents capability detection in the model UI and cache scanner. Search results and local model tables can show compact badges such as Text, T2I, I2T, T2V, STT, TTS, embeddings, lip sync, and video dubbing. This matters operationally: users can choose models by capability before downloading or routing work.
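
Choosing a model by capability can be pictured as a set-containment check over badges. The sketch below is illustrative only: `ModelEntry`, the catalog contents, and the model names are hypothetical, and the way CoderAI actually stores detected capabilities is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    """Hypothetical record pairing a model name with its capability badges."""
    name: str
    badges: set = field(default_factory=set)


def with_capabilities(models, required):
    """Return the models whose badges include every required capability."""
    required = set(required)
    return [model for model in models if required <= model.badges]


# Made-up catalog using badge names from the list above.
catalog = [
    ModelEntry("example-chat-model", {"Text", "embeddings"}),
    ModelEntry("example-speech-model", {"STT", "TTS"}),
    ModelEntry("example-video-model", {"T2V", "lip sync"}),
]
speech_models = with_capabilities(catalog, {"STT", "TTS"})
```

Filtering this way before a download means a model that lacks a required badge is never fetched in the first place.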