AI WebGPU Lab: Multimodal

Browser VLM Multimodal Readiness

`exp-vlm-browser-multimodal` records a deterministic browser vision-language baseline before real VLM runtimes, image preprocessors, and multimodal token pipelines land.

The harness fixes the image fixture metadata and prompt set, and records image preprocess time, image-to-first-token latency, total answer latency, and task accuracy in one schema-aligned result.
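A minimal TypeScript sketch of that result shape is below; every field name here is an illustrative assumption, not the finalized schema.

    interface VlmBaselineResult {
      status: "pending" | "complete" | "fallback";
      fixture: {
        imageId: string;             // fixed desk-scene fixture
        width: number;               // fixed image resolution
        height: number;
        patchCount: number;          // fixed patch grid
      };
      promptSetId: string;           // deterministic prompt set
      imagePreprocessMs: number;     // image preprocess time
      imageToFirstTokenMs: number;   // image-to-first-token latency
      answerTotalMs: number;         // answer total latency
      taskAccuracy: number;          // fraction of expected answers matched
      fallback?: { reason: string }; // fallback metadata when no GPU path exists
    }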

Run Controls

Probe capability first, then run the deterministic multimodal prompt set to export vision latency and answer accuracy metadata.
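A hedged sketch of that flow follows; `probeCapability`, `runBaseline`, and the fallback logging are assumptions, while `navigator.gpu` is the real WebGPU entry point (its TypeScript types come from the `@webgpu/types` package).

    // Probe WebGPU capability before attempting the multimodal run.
    async function probeCapability(): Promise<boolean> {
      // Cast because WebGPU typings may require the @webgpu/types package.
      const gpu = (navigator as any).gpu;
      if (!gpu) return false;                  // no WebGPU support at all
      const adapter = await gpu.requestAdapter();
      return adapter !== null;                 // adapter can be null on unsupported hardware
    }

    async function runBaseline(): Promise<void> {
      if (!(await probeCapability())) {
        // Export fallback metadata instead of latency numbers.
        console.log(JSON.stringify({ status: "fallback", reason: "webgpu-unavailable" }));
        return;
      }
      // The deterministic multimodal prompt set would run here once the
      // real VLM runtime and image preprocessor are wired in.
    }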

Fixture Image

Desk fixture with monitor, development board, notebook, keyboard, and mug

Checks

  • Keep image resolution, patch count, prompt set, and expected answers fixed (see the fixture sketch after this list).
  • Record image preprocess time, first-token latency, answer latency, accuracy score, and fallback metadata before real VLM wiring lands.
  • Use this surface as the seed input for the later multimodal latency benchmark and browser image app experiments.
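A sketch of what that fixed fixture could look like; the id, resolution, patch count, and expected answers are all illustrative placeholders, though the prompts follow the desk-scene fixture described above.

    const promptFixture = {
      imageId: "desk-scene-01",                  // hypothetical fixture id
      resolution: { width: 1024, height: 768 },  // illustrative; held constant across runs
      patchCount: 256,                           // illustrative; held constant across runs
      prompts: [
        { prompt: "Is there a notebook on the desk?", expected: "yes" },
        { prompt: "What is next to the keyboard?", expected: "mug" },
      ],
    } as const;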

Prompt Output

No multimodal run yet.

Metrics

Environment

Activity Log

    Schema-Aligned Result Draft

    {
      "status": "pending"
    }

    What This Unlocks

    • First capture path for browser multimodal raw JSON and screenshots
    • Stable image latency and answer accuracy metadata for later VLM runtime integration
    • Reusable prompt fixture for `bench-multimodal-latency` and `app-browser-image-lab`