Attribute-based people search in surveillance environments
Daniel A. Vaquero, Rogerio S. Feris, et al.
WACV 2009
Document processing pipelines traditionally cascade optical character recognition (OCR) engines with downstream models for structured information extraction, leading to multi-stage error propagation. We fine-tune SmolDocling, a compact 256M-parameter vision-language model (VLM), to perform end-to-end key-value extraction directly from document images, jointly solving identification, localization, and association in a single pass without OCR preprocessing. We extend DocTags with specialized key, value, region, and link tags, enabling many-to-many relationships in a unified output sequence. To address data limitations, we design an augmentation pipeline combining synthetic form filling and graph-based crops that preserve complete key-value subgraphs. We further introduce a layout-aware evaluation framework extending text matching with spatial bounding box verification. On FUNSD, XFUND, and a large-scale private dataset, our model outperforms larger zero-shot VLM baselines under layout-aware evaluation, while being 27× smaller than Qwen2.5-VL (7B) and over 5× faster at inference. The model weights will be released publicly after publication.
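The layout-aware evaluation mentioned above combines text matching with bounding box verification. The abstract does not give the exact criterion, so the following is only a minimal sketch under stated assumptions: a prediction counts as correct when its whitespace- and case-normalized text equals the reference text and the intersection-over-union (IoU) of the two boxes reaches a threshold. The names `Field`, `layout_aware_match`, `layout_aware_f1`, the normalization rule, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Axis-aligned box as (x0, y0, x1, y1); this layout is an assumption.
Box = tuple[float, float, float, float]


@dataclass(frozen=True)
class Field:
    """Hypothetical container for one extracted key or value."""
    text: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def layout_aware_match(pred: Field, gold: Field, iou_threshold: float = 0.5) -> bool:
    """Text must match AND the boxes must overlap sufficiently."""
    same_text = normalize(pred.text) == normalize(gold.text)
    return same_text and iou(pred.box, gold.box) >= iou_threshold


def layout_aware_f1(preds: list[Field], golds: list[Field],
                    iou_threshold: float = 0.5) -> float:
    """Greedy one-to-one matching followed by the usual F1 score."""
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if layout_aware_match(pred, gold, iou_threshold):
                unmatched.remove(gold)  # each reference is used at most once
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Under this sketch, a prediction with the right text but a box placed elsewhere on the page is scored as an error, which is the kind of case that text-only matching would miss.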