Gemma 4 VLM: Now Starting to Read Site Plans

Technology · 2026. 06. 12.

Gemma, Google DeepMind’s open-weight model family, has reached its fourth generation and expanded into multimodal models that accept text and images as input, with some models also handling audio. With support for long contexts of up to 256K tokens and more than 140 languages, the trend is clear: open models now handle long text, multimodal data, and multiple languages at once, and can be deployed anywhere from on-device to server.

What makes this change particularly interesting is that it significantly lowers the barrier to entry for vision-language models (VLMs), which “see and describe” images. Construction site drawings have long been a prime example of unstructured data that machines struggle to read, and Digitalpresso is preparing to apply this technology to drawing version management. In this article, we’ll explore what open VLMs like Gemma 4 are changing and why the technology is well suited to the long-standing challenge of drawing version management.

Gemma 4 and Open Multimodal Models
Why Version Control for Drawings Is Challenging
'What changed, and when'
VLM-Based Drawing Version Management
In Closing

Gemma 4 and Open Multimodal Models

Open Models Have Gone Multimodal

Gemma 4, released by Google DeepMind, is a series of open-weight models: the weights are publicly available, so users can deploy them directly on their own infrastructure. The key feature is the expanded scope of input. The models accept both text and images, and some also process audio.

This is complemented by support for long contexts of up to 256K tokens and over 140 languages. In simple terms, this means you can feed the model an entire bundle of long, complex documents and have it interpret even the images contained within.

VLM: Models That "Read" Images

At the center of this trend is the VLM. These are models that take images, such as photos or diagrams, as input and describe in natural language what they contain and what has changed. That this capability is now available in open models, which can be deployed on-device or on-premises without relying on cloud APIs, is particularly significant in industrial settings where security and cost are critical concerns.

Why Version Control for Drawings Is Challenging

Drawings are constantly changing

On construction sites, drawings are not documents that are finalized once and then set aside. They undergo repeated revisions (Rev) based on design changes, site conditions, and client requests. The problem is that these revised versions circulate simultaneously through various channels, such as printed copies, PDFs, and images shared via messaging apps.

As a result, some people end up working with an outdated version, while others are looking at the latest one. This is why questions like “Is this drawing I’m looking at really the latest version?” and “When exactly was this changed?” are constantly being asked on-site.

Limitations of Existing Version Management

Until now, drawing version management has largely relied on file naming conventions and manual organization, which means adding tags like “Rev_C” or “Final_ReallyFinal” to file names. However, this method reveals neither what changed and how, nor when the change occurred. To identify the differences between two versions and pinpoint the exact timing of those changes, people ultimately had to lay the two drawings side by side, compare them visually, and then piece together the details from memory and chat logs.
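To make the limitation concrete, here is a minimal Python sketch of what file-name-based versioning can and cannot recover. The `Rev_<letter>` pattern, the `parse_revision` helper, and the file names are assumptions made for illustration, not a convention taken from any particular project.

```python
import re
from typing import Optional

# Assumed naming convention for illustration: "Rev_<LETTER>" somewhere in the name.
REV_PATTERN = re.compile(r"Rev_([A-Z])", re.IGNORECASE)

def parse_revision(filename: str) -> Optional[str]:
    """Return the revision letter embedded in a file name, or None if absent."""
    match = REV_PATTERN.search(filename)
    return match.group(1).upper() if match else None

# Hypothetical file names of the kind that circulate on site.
files = [
    "A-101_floorplan_Rev_B.pdf",
    "A-101_floorplan_Rev_C.pdf",
    "A-101_floorplan_Final_ReallyFinal.pdf",
]

revisions = {name: parse_revision(name) for name in files}
# The name yields at most a label. It carries no information about what
# changed or when, and an ad hoc tag such as "Final_ReallyFinal" cannot
# even be placed in order relative to the lettered revisions.
```

Even when everyone follows the convention perfectly, the most this approach can say is that "C" comes after "B".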

'What changed, and when'

Images were difficult for machines to read

The root cause lies in the fact that drawings are “images.” While text documents are easy to search, compare, and track over time, drawings are images composed of lines and symbols, which made them structurally difficult for machines to process.

Consequently, if construction proceeds based on an incorrect version, it leads to rework and material waste. When determining liability, the core of the dispute becomes “which drawing from which point in time was used as the basis for the work.” In such an environment, version confusion over a single drawing translates directly into costs and risks.

“When” is just as important as “what”

On-site drawing history is more than just a list of differences. Only when both “which areas were changed” and “when and in which version those changes were reflected” are recorded together can proper tracking and verification be achieved. If the timing of changes is organized into a timeline, you can trace back to verify which revision a specific task was based on. Until now, this “when” was left only to human memory and scattered conversations, making it easily lost.
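As a rough illustration of why recording “when” alongside “what” matters, the sketch below stores each revision with its timestamp and looks up which revision was in effect when a task was carried out. The `RevisionEvent` structure, the `revision_in_effect` function, and the sample dates and change descriptions are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RevisionEvent:
    revision: str          # revision label, e.g. "B"
    applied_at: datetime   # when the revision took effect
    changed_areas: tuple   # short descriptions of what changed

def revision_in_effect(timeline, when):
    """Return the latest revision applied at or before `when`, or None.

    Assumes `timeline` is sorted by `applied_at` in ascending order.
    """
    times = [event.applied_at for event in timeline]
    idx = bisect_right(times, when)
    return timeline[idx - 1] if idx else None

# Made-up history for a single drawing.
timeline = [
    RevisionEvent("A", datetime(2026, 3, 2, 9, 0), ("initial issue",)),
    RevisionEvent("B", datetime(2026, 4, 10, 14, 30), ("stair core moved",)),
    RevisionEvent("C", datetime(2026, 5, 21, 11, 0), ("door D-12 widened",)),
]

# Which revision was a task on 1 May based on?
task_time = datetime(2026, 5, 1, 8, 0)
basis = revision_in_effect(timeline, task_time)
```

In this made-up history, `basis.revision` is `"B"`: the task predates revision C, so any later dispute can be checked against the drawing that was actually current at the time.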

That barrier is now coming down

This is precisely where the spread of VLMs becomes significant. If a model can read and describe images, it can take two revisions as input and identify “which areas have changed and how” on behalf of humans. Tasks that previously relied solely on the human eye have, for the first time, become something machines can assist with.
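To give a sense of what “identifying which areas changed” means at the simplest level, here is a small sketch that compares two rasterized revisions tile by tile. This is plain pixel comparison, not a VLM: it can only say where something differs, whereas a model would be needed to describe how. The `changed_tiles` function and the tiny 4×4 rasters are illustrative assumptions.

```python
def changed_tiles(old, new, tile=2, threshold=0):
    """Return (row, col) indices of tiles whose pixels differ between two rasters.

    `old` and `new` are non-empty, equal-sized 2D lists of grayscale values.
    """
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("rasters must have the same dimensions")
    height, width = len(old), len(old[0])
    changed = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Sum of absolute pixel differences inside this tile.
            diff = sum(
                abs(old[y][x] - new[y][x])
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            )
            if diff > threshold:
                changed.append((top // tile, left // tile))
    return changed

# Two toy "drawings": revision C adds a mark in the lower-right area.
rev_b = [
    [0, 0, 0, 0],
    [0, 255, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rev_c = [row[:] for row in rev_b]
rev_c[3][2] = 255

regions = changed_tiles(rev_b, rev_c)
```

Here only the lower-right tile, `(1, 1)`, is reported. A step like this could narrow down where to look, while turning “something differs here” into “the door was widened” is the part that calls for a model that reads images.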

VLM-Based Drawing Version Management

Digitalpresso already provides features for attaching drawings to site-specific conversations and for automatically mapping location and time metadata from photos to verify construction quality. Building on this foundation, we plan to take drawing version management to the next level by integrating open VLMs like Gemma 4.

For example, the system would analyze a newly uploaded drawing to identify differences from the previous revision, and automatically record the version history, including when the changes were applied, along with time metadata. When “what has changed” and “when it changed” are compiled into a single timeline, users can determine which version is the latest and even trace back to verify which drawings served as the basis for specific tasks. Because the model is open, sensitive data such as drawings can be handled within our own environment without being sent outside it, which is another reason we chose this direction.
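The flow described above could be sketched roughly as follows. This is not Digitalpresso's implementation: `register_upload`, the record fields, and the `describe_changes` callable (a stand-in for whatever would wrap the model) are all hypothetical, and the stub used here only reports a size difference.

```python
import hashlib
from datetime import datetime, timezone

def register_upload(history, drawing_id, data, describe_changes, now=None):
    """Record a new version of a drawing, skipping byte-identical re-uploads.

    `describe_changes(old_bytes, new_bytes)` is a placeholder for the model
    call; it is expected to return a list of change descriptions.
    """
    digest = hashlib.sha256(data).hexdigest()
    versions = history.setdefault(drawing_id, [])
    if versions and versions[-1]["sha256"] == digest:
        return None  # the same file was shared again; no new version
    previous = versions[-1]["data"] if versions else None
    record = {
        "version": len(versions) + 1,
        "sha256": digest,
        "recorded_at": now or datetime.now(timezone.utc),
        "changes": describe_changes(previous, data) if previous is not None else [],
        "data": data,  # kept in memory only to keep the sketch short
    }
    versions.append(record)
    return record

def fake_describe(old, new):
    # Stub standing in for a model; a real description would come from the VLM.
    return [f"{len(new) - len(old):+d} bytes (placeholder description)"]

history = {}
first = register_upload(history, "A-101", b"rev A content", fake_describe)
dup = register_upload(history, "A-101", b"rev A content", fake_describe)
second = register_upload(history, "A-101", b"rev B content!", fake_describe)
```

The point of the sketch is the shape of the record: each version carries both a description of what changed and a timestamp, so the history can be read as a timeline rather than a pile of files.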

Digitalpresso plans to gradually refine this feature and introduce it to the field. If you are looking for ways to reduce drawing confusion and the burden of documentation, we encourage you to stay tuned for future updates.

In Closing

The expansion of open multimodal models is more than just news that “AI has gotten smarter.” It is a sign that unstructured data, particularly images like on-site drawings, which had previously been entrusted solely to human eyes and hands, can now be read by machines and organized to track exactly what changed and when.

The pace at which technology matures and the pace at which the field adopts it are always different. However, it seems clear that the long-standing inefficiencies surrounding the versioning of a single drawing and the timing of its changes are now entering the realm of solvable problems. Digitalpresso intends to translate this change, step by step, into a concrete feature: drawing version management.