How AI Models Can Learn Human-Like Sketching Techniques

When conveying ideas, words can fall short. Often, the most effective approach is to sketch the concept: diagramming a circuit, for instance, can clarify how the system operates far better than describing it.

But what if artificial intelligence could enhance our exploration of these visualizations? While existing AI systems excel at producing realistic images and cartoon-like drawings, many struggle to replicate the essence of sketching—its iterative, stroke-by-stroke process that aids in brainstorming and refining ideas.

A new system from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University, known as “SketchAgent,” offers a more human-like drawing experience. The tool uses a multimodal language model, such as Anthropic’s Claude 3.5 Sonnet, to convert natural language prompts into sketches within seconds. It can sketch a concept like a house on its own, or draw collaboratively with a human, interpreting text-based input and producing each component of the sketch in sequence.
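To make that pipeline concrete, here is a hypothetical sketch of how one might prompt a multimodal language model to emit strokes on a grid, using the Anthropic Python SDK. The system prompt, grid convention, and output format below are invented for illustration; SketchAgent’s actual prompting scheme is more elaborate and is described in the paper.

```python
# Hypothetical sketch of prompting a multimodal language model to draw
# stroke by stroke, using the Anthropic Python SDK (pip install anthropic).
# The system prompt, grid convention, and output format are assumptions
# made for illustration, not SketchAgent's actual scheme.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You draw on a 50x50 grid. Respond only with strokes, one per line, "
    "in the form 'label: x1,y1 x2,y2 ...'. Draw one component per stroke."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=SYSTEM,
    messages=[{"role": "user", "content": "Sketch a simple house."}],
)

# Each line of the reply names a component and traces it as grid coordinates,
# e.g. "front door: 22,40 22,30 28,30 28,40".
print(response.content[0].text)
```

Because the model replies with a sequence of labeled strokes rather than a finished bitmap, the drawing unfolds step by step, which is what allows a human to join in mid-sketch.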

Research demonstrates that SketchAgent can produce abstract renderings of various concepts, including a robot, butterfly, DNA helix, flowchart, and even the Sydney Opera House. In the future, this tool could evolve into an interactive art application, assisting educators and researchers in diagramming intricate ideas or providing users with drawing lessons.

CSAIL postdoc Yael Vinker, the lead author of a research paper on SketchAgent, emphasizes the system’s goal of facilitating a more natural communication style between humans and AI.

“Many don’t realize how often we draw in our daily lives—sketching our thoughts or brainstorming ideas,” she explains. “Our tool seeks to mirror that process, making multimodal language models more effective for visual communication.”

SketchAgent teaches models to sketch stroke by stroke without relying on any sketch-specific training data. Instead, the researchers designed a “sketching language” that translates a sketch into a numbered sequence of strokes drawn on a grid. The system learns by observing a few examples, such as a house, with each stroke labeled according to its meaning (the seventh stroke might be the “front door,” for instance), which helps the model generalize to new concepts.
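A minimal illustration of this idea follows. The Stroke structure, grid size, and SVG rendering here are assumptions made for this sketch, not the paper’s actual format; the point is only to show how labeled strokes on a coarse grid can describe a drawing as an ordered sequence.

```python
# A minimal, hypothetical encoding of the "sketching language" idea: each
# stroke is an ordered list of cells on a coarse grid plus a semantic label,
# and strokes are emitted one at a time, like a person drawing. The grid
# size, data layout, and rendering are assumptions, not the paper's.

from dataclasses import dataclass

GRID_SIZE = 50  # assumed grid resolution

@dataclass
class Stroke:
    label: str                     # what the stroke depicts, e.g. "front door"
    points: list[tuple[int, int]]  # ordered (x, y) grid cells traced by the pen

def stroke_to_svg(stroke: Stroke, canvas_px: int = 500) -> str:
    """Scale grid coordinates to pixels and emit an SVG polyline."""
    scale = canvas_px / GRID_SIZE
    pts = " ".join(f"{x * scale:.0f},{y * scale:.0f}" for x, y in stroke.points)
    return f'<polyline points="{pts}" fill="none" stroke="black"/>'

# A toy "house" drawn as a labeled stroke sequence.
house = [
    Stroke("base", [(10, 40), (40, 40), (40, 20), (10, 20), (10, 40)]),
    Stroke("roof", [(10, 20), (25, 8), (40, 20)]),
    Stroke("front door", [(22, 40), (22, 30), (28, 30), (28, 40)]),
]

body = "\n".join(stroke_to_svg(s) for s in house)
print(f'<svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">\n{body}\n</svg>')
```

Because every stroke carries a label, a language model that has never seen pixel data can still reason about which component to draw next.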

Vinker wrote the paper with three CSAIL affiliates (postdoc Tamar Rott Shaham, undergraduate researcher Alex Zhao, and Professor Antonio Torralba), as well as Stanford University research fellow Kristine Zheng and assistant professor Judith Ellen Fan. The team will present its findings at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR) this month.

Evaluating AI’s Sketching Abilities

While text-to-image models like DALL-E 3 create interesting visuals, they often miss the vital element of sketching: the spontaneous, creative process where each stroke influences the overall design. In contrast, SketchAgent generates images that flow more naturally, resembling human sketches.

Previous efforts to imitate this process trained their models on human-drawn datasets, which are often limited in size and diversity. SketchAgent instead uses pre-trained language models, which are rich in conceptual knowledge but have never been taught to sketch. By teaching these models the sketching process, SketchAgent can draw diverse concepts it was never explicitly trained on.

To determine whether SketchAgent actively collaborates in the sketching process or draws independently of its human partner, the research team ran tests in collaboration mode and then removed SketchAgent’s contributions. The resulting drawings were often unrecognizable: removing the strokes the agent drew for a sailboat’s mast, for example, rendered the final sketch incomprehensible.
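In spirit, this ablation amounts to filtering a jointly drawn stroke sequence by author and re-rendering what remains. The snippet below is a minimal, hypothetical version of that test; the data and field names are illustrative, not the paper’s actual pipeline.

```python
# A minimal, hypothetical version of the collaboration ablation: drop the
# agent's strokes from a jointly drawn sketch and inspect what remains.
# The data and field names are illustrative, not the paper's pipeline.

sailboat = [
    {"author": "human", "label": "hull", "points": [(5, 30), (35, 30), (30, 36), (10, 36), (5, 30)]},
    {"author": "agent", "label": "mast", "points": [(20, 30), (20, 8)]},
    {"author": "agent", "label": "sail", "points": [(20, 8), (34, 26), (20, 26)]},
]

def ablate(strokes: list[dict], drop_author: str = "agent") -> list[dict]:
    """Return the sketch with one collaborator's strokes removed."""
    return [s for s in strokes if s["author"] != drop_author]

human_only = ablate(sailboat)
print([s["label"] for s in human_only])  # ['hull'] -- no mast or sail left
```

If the ablated sketch is no longer recognizable as a sailboat, the agent’s strokes were carrying real semantic weight rather than decorating a drawing the human had already completed.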

In further experiments, the researchers integrated various multimodal language models into SketchAgent, assessing their ability to generate recognizable sketches. The default model, Claude 3.5 Sonnet, produced the most human-like vector graphics, surpassing alternatives such as GPT-4o and Claude 3 Opus.

“Claude 3.5 Sonnet’s superior performance over models like GPT-4o and Claude 3 Opus indicates a distinct processing approach in generating visual information,” co-author Tamar Rott Shaham notes.

According to Shaham, SketchAgent could evolve as a valuable interface for AI collaboration beyond conventional text-based communication. “As models advance in understanding and generating different modalities, like sketches, they provide new ways for users to express concepts and obtain more intuitive, human-like responses,” she explains. “This could significantly enhance user interaction, making AI more accessible and versatile.”

Despite its promising capabilities, SketchAgent is not yet ready for professional sketching. It currently produces simple representations resembling stick figures and doodles, and it struggles with intricate subjects such as logos, sentences, or detailed creatures. The model also sometimes misinterprets a user’s intentions during collaborative drawing, for example, sketching a bunny with two heads. Vinker suggests this may stem from the model breaking tasks into smaller steps that do not always align with a human partner’s contributions. Potential improvements could involve training on synthetic data from diffusion models.

Moreover, SketchAgent may require multiple prompts before producing human-like doodles. The research team aims to refine user interactions and further enhance the sketching experience with multimodal language models.

Ultimately, this tool signals a shift in AI’s ability to draw concepts in a manner akin to humans, showcasing an interactive process that merges human and AI contributions for more cohesive final designs.

This research was supported by several organizations, including the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.

Photo credit & article inspired by: Massachusetts Institute of Technology
