
Form File Architecture Explained

Overview

The FormFile parser is one of the most complex components in vb6parse because VB6 Form files (.frm) combine several kinds of content in a single file: a version header, external object references, a hierarchical block of control definitions, file-level attributes, and VB6 code.

The parser must handle all of these sections efficiently while providing both full parsing capability and fast-path extraction when only UI information is needed.

VB6 Form File Structure

A typical .frm file follows this layout:

VERSION 5.00
Object = "{831FDD16-0C5C-11D2-A9FC-0000F8754DA1}#2.0#0"; "mscomctl.ocx"
Begin VB.Form Form1
   Caption         =   "My Form"
   ClientHeight    =   3195
   ClientWidth     =   4680
   BeginProperty Font 
      Name            =   "Verdana"
      Size            =   8.25
      Charset         =   0
   EndProperty
   Begin VB.CommandButton Command1 
      Caption         =   "Click Me"
      Height          =   495
      Left            =   120
   End
End
Attribute VB_Name = "Form1"
Attribute VB_GlobalNameSpace = False

Private Sub Command1_Click()
    MsgBox "Hello!"
End Sub

Key Sections:

  1. VERSION - File format version (e.g., 5.00)
  2. Object - External component references (OCX/DLL)
  3. BEGIN...END blocks - Hierarchical control definitions
  4. Attribute - File-level metadata
  5. Code - VB6 procedures and event handlers
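To make the section boundaries concrete, here is a minimal line-based sketch that labels each non-blank line of a .frm file with the section it falls in. It is not how vb6parse works internally (the real parser is token-based); the `Section` enum and `classify_lines` function are hypothetical names, and the sketch assumes exact-case keywords and that everything after the attributes is code.

```rust
/// Hypothetical labels mirroring the five key sections listed above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Section {
    Version,
    Object,
    ControlBlock,
    Attribute,
    Code,
}

/// Label every non-blank line with its section.
/// Nesting of Begin...End blocks is tracked with a depth counter.
fn classify_lines(source: &str) -> Vec<(Section, &str)> {
    let mut depth = 0usize;
    let mut in_code = false;
    let mut out = Vec::new();
    for line in source.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        let section = if in_code {
            Section::Code
        } else if depth > 0 {
            // "BeginProperty" and "EndProperty" do not match these checks.
            if trimmed.starts_with("Begin ") {
                depth += 1;
            } else if trimmed == "End" {
                depth -= 1;
            }
            Section::ControlBlock
        } else if trimmed.starts_with("VERSION ") {
            Section::Version
        } else if trimmed.starts_with("Object ") {
            Section::Object
        } else if trimmed.starts_with("Begin ") {
            depth = 1;
            Section::ControlBlock
        } else if trimmed.starts_with("Attribute ") {
            Section::Attribute
        } else {
            // Assumption: the first unrecognized line starts the code section.
            in_code = true;
            Section::Code
        };
        out.push((section, trimmed));
    }
    out
}
```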

Challenges

Parsing Architecture

Multi-Layer Pipeline

Bytes (Windows-1252 encoded)
  ↓
SourceFile (decode_with_replacement)
  ↓
SourceStream (character stream with tracking)
  ↓
tokenize() (keyword lookup via phf_map)
  ↓
TokenStream (Vec<(text, Token)>)
  ↙                ↘
CST (full)         Direct Extract (fast path)
  ↘                ↙
FormFile
  - version
  - objects
  - form Control
  - attributes
  - cst (code)
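The tokenize() stage can be pictured with a small, std-only sketch that turns text into (text, Token) pairs. The `Token` variants, the keyword subset, and the `match`-based `is_keyword` stand-in for the phf_map lookup are all assumptions made for illustration; the real token set is much larger.

```rust
/// Hypothetical, heavily reduced token kinds.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    Keyword,
    Identifier,
    Number,
    StringLiteral,
    Whitespace,
    Newline,
    Symbol,
}

/// Stand-in for the phf_map keyword lookup (case-insensitive, tiny subset).
fn is_keyword(text: &str) -> bool {
    matches!(
        text.to_ascii_lowercase().as_str(),
        "version" | "object" | "begin" | "end" | "attribute" | "private" | "sub"
    )
}

/// Split the source into (text, Token) pairs without dropping any character.
fn tokenize(source: &str) -> Vec<(&str, Token)> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let c = bytes[i];
        let kind = if c == b'\n' {
            i += 1;
            Token::Newline
        } else if matches!(c, b' ' | b'\t' | b'\r') {
            while i < bytes.len() && matches!(bytes[i], b' ' | b'\t' | b'\r') {
                i += 1;
            }
            Token::Whitespace
        } else if c.is_ascii_digit() {
            while i < bytes.len() && (bytes[i].is_ascii_digit() || bytes[i] == b'.') {
                i += 1;
            }
            Token::Number
        } else if c.is_ascii_alphabetic() || c == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            if is_keyword(&source[start..i]) {
                Token::Keyword
            } else {
                Token::Identifier
            }
        } else if c == b'"' {
            i += 1;
            while i < bytes.len() && bytes[i] != b'"' {
                i += 1;
            }
            i = (i + 1).min(bytes.len()); // include the closing quote if present
            Token::StringLiteral
        } else {
            // Any other character: advance to the next UTF-8 boundary.
            i += 1;
            while i < bytes.len() && !source.is_char_boundary(i) {
                i += 1;
            }
            Token::Symbol
        };
        tokens.push((&source[start..i], kind));
    }
    tokens
}
```

Because whitespace and newlines are kept as tokens, concatenating the token texts reproduces the input, which is the property a lossless CST relies on.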

Design Philosophy & Trade-offs

Core Principles

  1. Correctness over speed (but optimize where possible)
  2. Preserve all information (CST includes whitespace/comments)
  3. Memory efficiency (rowan's red-green tree, shared nodes)
  4. Partial success model (return what was parsed + collect errors)
  5. Type safety (strong Rust enums for properties and controls)

The Hybrid Approach Decision

The FormFile parser evolved through several iterations:

Phase 1: Full CST First (Original Design)

// Build complete CST, then extract everything from it
let cst = parse(token_stream);
let version = extract_version(&cst);
let objects = extract_objects(&cst);
let control = extract_control(&cst);
let attributes = extract_attributes(&cst);
✅ Pros
  • Simple, uniform approach
  • CST available for all sections
  • Easy to debug and visualize
❌ Cons
  • Expensive: Building CST for control blocks creates nodes for every token
  • Wasteful: Control properties extracted into Control structs, then CST discarded
  • Slow: For large forms, CST construction dominated parse time

Phase 2: Control-Only Extraction (Attempted Optimization)

// Skip CST, extract directly from tokens
let result = FormFile::parse_control_only(token_stream);
let (version, control, remaining_tokens) = result.unpack();
✅ Pros
  • Fast: No CST overhead for header/control sections
  • Memory efficient: Only creates final Control structs
  • Useful: Perfect for UI tools
❌ Cons
  • Incomplete: Doesn't parse code section
  • Separate API: Forces users to choose
  • Duplication: Logic exists in two places

Phase 3: Hybrid Strategy (Current Design) ✨

// Direct extraction for structured sections
let version = parser.parse_version_direct();
let objects = parser.parse_objects_direct();
let control = parser.parse_properties_block_to_control();
let attributes = parser.parse_attributes_direct();

// Build CST only for code section
let remaining_tokens = parser.into_tokens();
let cst = parse(TokenStream::from_tokens(remaining_tokens));
✅ Pros
  • Best of both worlds: Fast for headers, full CST for code
  • Single API: Users call FormFile::parse() regardless
  • Flexibility: parse_control_only() still available
  • Memory efficient: No CST nodes for extracted sections
  • Correct: Code section gets full CST with all information
⚠️ Trade-offs
  • Complexity: Parser has two modes
  • Maintenance: Changes may need updates in both paths
  • Learning curve: Developers must understand hybrid model

The Hybrid Parsing Strategy

Direct Extraction Methods

The Parser struct provides special methods for direct extraction:

1. new_direct_extraction(tokens, pos)

Creates a parser in "direct extraction mode" where tokens are consumed without building CST nodes.

let mut parser = Parser::new_direct_extraction(tokens, 0);
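As a rough illustration of what "consuming tokens without building CST nodes" can look like, the sketch below keeps only a token list and a position. The struct `DirectParser` and the helpers `peek`, `eat`, and `bump` are hypothetical; only the names `new_direct_extraction` and `into_tokens` come from this document, and their real signatures may differ.

```rust
/// Hypothetical cursor: advancing an index is the only side effect of parsing.
struct DirectParser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
}

impl<'a> DirectParser<'a> {
    fn new_direct_extraction(tokens: Vec<&'a str>, pos: usize) -> Self {
        Self { tokens, pos }
    }

    /// Look at the next token without consuming it.
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    /// Consume the next token only if it matches (ASCII case-insensitive).
    fn eat(&mut self, expected: &str) -> bool {
        match self.peek() {
            Some(t) if t.eq_ignore_ascii_case(expected) => {
                self.pos += 1;
                true
            }
            _ => false,
        }
    }

    /// Consume and return the next token, whatever it is.
    fn bump(&mut self) -> Option<&'a str> {
        let token = self.peek();
        if token.is_some() {
            self.pos += 1;
        }
        token
    }

    /// Hand back the unconsumed tokens, e.g. for building a CST of the code section.
    fn into_tokens(self) -> Vec<&'a str> {
        self.tokens.get(self.pos..).unwrap_or_default().to_vec()
    }
}
```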

2. parse_version_direct()

Extracts VERSION without CST:

// Parses: VERSION 5.00 [CLASS]
let (version_opt, failures) = parser.parse_version_direct().unpack();

Returns: FileFormatVersion { major, minor }
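For intuition, a string-based sketch of the same extraction follows. It works on a line of text rather than on tokens, and the `u8` field types and the function name `parse_version_line` are assumptions, not the library's definitions.

```rust
/// Assumed shape; the real struct's field types may differ.
#[derive(Debug, PartialEq)]
struct FileFormatVersion {
    major: u8,
    minor: u8,
}

/// Parse a line such as `VERSION 5.00` or `VERSION 1.0 CLASS`.
fn parse_version_line(line: &str) -> Option<FileFormatVersion> {
    let mut parts = line.split_whitespace();
    if !parts.next()?.eq_ignore_ascii_case("VERSION") {
        return None;
    }
    let (major, minor) = parts.next()?.split_once('.')?;
    // A trailing CLASS keyword, if present, is simply ignored here.
    Some(FileFormatVersion {
        major: major.parse().ok()?,
        minor: minor.parse().ok()?,
    })
}
```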

3. parse_objects_direct()

Extracts Object references without CST:

// Parses: Object = "{UUID}#version#flags"; "filename"
let objects = parser.parse_objects_direct();
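The quoted format shown in the comment above can be taken apart with plain string operations, as in this sketch. The struct `ObjectReference`, its field names, and `parse_object_line` are hypothetical, and only the one format spelled out in the comment is handled.

```rust
/// Hypothetical result type; all fields are kept as raw strings.
#[derive(Debug, PartialEq)]
struct ObjectReference {
    uuid: String,
    version: String,
    flags: String,
    file_name: String,
}

/// Parse: Object = "{UUID}#version#flags"; "filename"
fn parse_object_line(line: &str) -> Option<ObjectReference> {
    let rest = line
        .trim()
        .strip_prefix("Object")?
        .trim_start()
        .strip_prefix('=')?;
    // The semicolon separates the reference from the file name.
    let (reference, file) = rest.split_once(';')?;
    let reference = reference.trim().trim_matches('"');
    let file_name = file.trim().trim_matches('"').to_string();

    // The reference itself is three '#'-separated fields.
    let mut fields = reference.split('#');
    let uuid = fields
        .next()?
        .trim_start_matches('{')
        .trim_end_matches('}')
        .to_string();
    let version = fields.next()?.to_string();
    let flags = fields.next()?.to_string();
    Some(ObjectReference { uuid, version, flags, file_name })
}
```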

Handles two formats:

4. parse_properties_block_to_control()

This is the most complex direct extraction method. It recursively parses BEGIN...END blocks:

let (control_opt, failures) = parser.parse_properties_block_to_control().unpack();

Parses:

Returns: Fully constructed Control struct with name, tag, index, and typed properties
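The recursive shape of this method can be sketched on lines of text instead of tokens. `ControlNode` and `parse_block` are hypothetical names; this simplified version stores properties as string pairs, skips `BeginProperty`/`EndProperty` markers (flattening their contents into the enclosing control), and returns `None` when an `End` is missing rather than collecting failures.

```rust
/// Hypothetical, simplified control tree.
#[derive(Debug, PartialEq)]
struct ControlNode {
    kind: String,
    name: String,
    properties: Vec<(String, String)>,
    children: Vec<ControlNode>,
}

/// Parse one Begin...End block starting at `lines[*pos]`.
/// On success `*pos` points just past the matching `End`.
fn parse_block(lines: &[&str], pos: &mut usize) -> Option<ControlNode> {
    let mut header = lines
        .get(*pos)?
        .trim()
        .strip_prefix("Begin ")?
        .split_whitespace();
    let mut node = ControlNode {
        kind: header.next()?.to_string(),
        name: header.next()?.to_string(),
        properties: Vec::new(),
        children: Vec::new(),
    };
    *pos += 1;

    while let Some(line) = lines.get(*pos).map(|l| l.trim()) {
        if line == "End" {
            *pos += 1;
            return Some(node);
        } else if line.starts_with("Begin ") {
            // Nested control: recurse; the callee advances `pos` past its End.
            node.children.push(parse_block(lines, pos)?);
        } else {
            if let Some((key, value)) = line.split_once('=') {
                let value = value.trim().trim_matches('"');
                node.properties.push((key.trim().to_string(), value.to_string()));
            }
            *pos += 1;
        }
    }
    None // ran out of lines before the matching End
}
```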

5. parse_attributes_direct()

Extracts Attribute statements:

// Parses: Attribute VB_Name = "Form1"
let attributes = parser.parse_attributes_direct();
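A line-based sketch of the same idea: read leading `Attribute` lines into name/value pairs and report how many lines were consumed, so the caller knows where the code section begins. Both function names are hypothetical, and values are kept as raw strings.

```rust
/// Parse a single `Attribute Name = Value` line.
fn parse_attribute_line(line: &str) -> Option<(String, String)> {
    let rest = line.trim().strip_prefix("Attribute ")?;
    let (name, value) = rest.split_once('=')?;
    Some((
        name.trim().to_string(),
        value.trim().trim_matches('"').to_string(),
    ))
}

/// Collect consecutive attribute lines; stop at the first line that is not one.
/// Returns the attributes and the number of lines consumed.
fn parse_attributes(lines: &[&str]) -> (Vec<(String, String)>, usize) {
    let mut attributes = Vec::new();
    let mut consumed = 0;
    for line in lines {
        match parse_attribute_line(line) {
            Some(attribute) => {
                attributes.push(attribute);
                consumed += 1;
            }
            None => break,
        }
    }
    (attributes, consumed)
}
```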

Implementation Details

Control Type Mapping

The parser maps VB6 control type strings to Rust enum variants:

match control_type.as_str() {
    "VB.Form" => ControlKind::Form {
        properties: properties.into(),
        controls: child_controls,
        menus,
    },
    "VB.CommandButton" => ControlKind::CommandButton {
        properties: properties.into(),
    },
    "VB.TextBox" => ControlKind::TextBox {
        properties: properties.into(),
    },
    // ... 30+ built-in controls
    _ => ControlKind::Custom {
        properties: properties.into(),
        property_groups,
    },
}

Design decision: Default to Custom for unknown controls (e.g., third-party OCX controls).

Property Parsing

Properties are stored in a Properties struct (thin wrapper around HashMap):

pub struct Properties {
    key_value_store: HashMap<String, String>,
}

Type conversion happens at access time:

let width = properties.get_i32("ClientWidth", 600);  // Default: 600
let visible = properties.get_bool("Visible", true);
let color = properties.get_color("BackColor", VB_WINDOW_BACKGROUND);

Trade-off: Store as strings, convert on demand

  • ✅ Flexible: Can defer parsing errors
  • ✅ Simple: No complex property value enum
  • ⚠️ Repetitive: Same conversion code in multiple places
  • ⚠️ Type safety: Errors happen at runtime, not parse time
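The convert-on-demand trade-off can be illustrated with a minimal sketch of such accessors. The method bodies below are assumptions, not the library's code: in particular, treating "0" as false and "-1" as true is a guess at how booleans are stored, and `from_pairs` is a hypothetical helper added only to build test data.

```rust
use std::collections::HashMap;

struct Properties {
    key_value_store: HashMap<String, String>,
}

impl Properties {
    /// Hypothetical convenience constructor for examples.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let key_value_store = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Properties { key_value_store }
    }

    /// Missing keys and unparseable values both fall back to `default`,
    /// so a malformed value never fails the whole parse.
    fn get_i32(&self, key: &str, default: i32) -> i32 {
        self.key_value_store
            .get(key)
            .and_then(|v| v.trim().parse().ok())
            .unwrap_or(default)
    }

    /// Assumption: "0" means False and "-1" means True.
    fn get_bool(&self, key: &str, default: bool) -> bool {
        match self.key_value_store.get(key).map(|v| v.trim()) {
            Some("0") => false,
            Some("-1") => true,
            _ => default,
        }
    }
}
```

Note how the "errors happen at runtime" point shows up here: a bad value silently becomes the default unless the accessor is changed to report it.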

Property Groups

Property groups handle nested structures like Font properties:

BeginProperty Font {GUID}
   Name            =   "Verdana"
   Size            =   8.25
   Charset         =   0
EndProperty

Structure:

pub struct PropertyGroup {
    pub name: String,
    pub guid: Option<Uuid>,
    pub properties: HashMap<String, Either<String, PropertyGroup>>,
}

Uses Either<String, PropertyGroup> to support nesting:
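A small std-only sketch shows how such a recursive structure can be queried. The local `Either` enum stands in for the one used by the library, `guid` is a `String` instead of a `Uuid` to avoid external crates, and `group` and `lookup` are hypothetical helpers. The "Panels"/"Panel1" data is made up for illustration.

```rust
use std::collections::HashMap;

/// Local stand-in for an Either type.
enum Either<L, R> {
    Left(L),
    Right(R),
}

struct PropertyGroup {
    name: String,
    guid: Option<String>, // Uuid in the struct above; String keeps this sketch std-only
    properties: HashMap<String, Either<String, PropertyGroup>>,
}

/// Hypothetical builder used only to keep the example short.
fn group(name: &str, entries: Vec<(&str, Either<String, PropertyGroup>)>) -> PropertyGroup {
    PropertyGroup {
        name: name.to_string(),
        guid: None,
        properties: entries.into_iter().map(|(k, v)| (k.to_string(), v)).collect(),
    }
}

/// Follow a dotted path such as "Panel1.Text" through nested groups.
fn lookup<'a>(group: &'a PropertyGroup, path: &str) -> Option<&'a str> {
    let (head, tail) = match path.split_once('.') {
        Some((head, tail)) => (head, Some(tail)),
        None => (path, None),
    };
    match (group.properties.get(head)?, tail) {
        (Either::Left(value), None) => Some(value.as_str()),
        (Either::Right(nested), Some(rest)) => lookup(nested, rest),
        _ => None, // path and structure disagree
    }
}
```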

Error Handling

The parser uses a partial success model:

pub struct ParseResult<'a, T, E> {
    pub result: Option<T>,
    pub failures: Vec<ErrorDetails<'a, E>>,
}

Philosophy:

  • Best effort: Parse as much as possible
  • Collect errors: Don't stop on first failure
  • Return both: Result + error list
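The partial success model can be reduced to a few lines. In this sketch, `PartialResult` is a simplified stand-in for `ParseResult` (no lifetime, failures as plain strings), and `parse_numbers` is a hypothetical parser that keeps every good line and records one failure per bad line instead of stopping.

```rust
/// Simplified stand-in for ParseResult: a value plus everything that went wrong.
#[derive(Debug, PartialEq)]
struct PartialResult<T> {
    result: Option<T>,
    failures: Vec<String>,
}

impl<T> PartialResult<T> {
    /// Split into the parsed value and the collected failures.
    fn unpack(self) -> (Option<T>, Vec<String>) {
        (self.result, self.failures)
    }
}

/// Parse `Key = 123` lines; a bad line is reported but does not abort the parse.
fn parse_numbers(lines: &[&str]) -> PartialResult<Vec<(String, i32)>> {
    let mut parsed = Vec::new();
    let mut failures = Vec::new();
    for (index, line) in lines.iter().enumerate() {
        let pair = line
            .split_once('=')
            .and_then(|(k, v)| Some((k.trim().to_string(), v.trim().parse::<i32>().ok()?)));
        match pair {
            Some(pair) => parsed.push(pair),
            None => failures.push(format!("line {}: could not parse {:?}", index + 1, line)),
        }
    }
    let result = if parsed.is_empty() { None } else { Some(parsed) };
    PartialResult { result, failures }
}
```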

Example Usage:

let (form_file_opt, failures) = FormFile::parse(&source_file).unpack();

if let Some(form) = form_file_opt {
    // Use parsed data
    println!("Form: {}", form.form.name);
}

if !failures.is_empty() {
    // Report warnings
    for error in failures {
        eprintln!("Warning: {:?}", error);
    }
}

Control Hierarchy & Properties

Type-Safe Control System

Each control type has a dedicated properties struct:

pub enum ControlKind {
    Form {
        properties: FormProperties,
        controls: Vec<Control>,
        menus: Vec<MenuControl>,
    },
    CommandButton {
        properties: CommandButtonProperties,
    },
    TextBox {
        properties: TextBoxProperties,
    },
    // ... 30+ variants
    Custom {
        properties: CustomControlProperties,
        property_groups: Vec<PropertyGroup>,
    },
}

Property structs use strong types:

pub struct FormProperties {
    pub caption: String,
    pub back_color: Color,
    pub border_style: FormBorderStyle,
    pub client_height: i32,
    pub client_width: i32,
    pub max_button: MaxButton,
    pub min_button: MinButton,
    // ... 50+ fields
}

Enums for discrete values:

#[derive(TryFromPrimitive)]
#[repr(i32)]
pub enum FormBorderStyle {
    None = 0,
    FixedSingle = 1,
    Sizable = 2,
    FixedDialog = 3,
    FixedToolWindow = 4,
    SizableToolWindow = 5,
}
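The `TryFromPrimitive` derive produces a fallible conversion from the integer stored in the file to the enum. A hand-written, std-only equivalent is sketched below so the behaviour is visible; the error type (`i32`), the helper `border_style_or_default`, and the choice of `Sizable` as the fallback are assumptions for illustration.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(i32)]
enum FormBorderStyle {
    None = 0,
    FixedSingle = 1,
    Sizable = 2,
    FixedDialog = 3,
    FixedToolWindow = 4,
    SizableToolWindow = 5,
}

impl TryFrom<i32> for FormBorderStyle {
    /// The rejected value is returned as the error.
    type Error = i32;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Self::None),
            1 => Ok(Self::FixedSingle),
            2 => Ok(Self::Sizable),
            3 => Ok(Self::FixedDialog),
            4 => Ok(Self::FixedToolWindow),
            5 => Ok(Self::SizableToolWindow),
            other => Err(other),
        }
    }
}

/// Convert a raw property string, falling back to a default on any failure.
fn border_style_or_default(raw: &str) -> FormBorderStyle {
    raw.trim()
        .parse::<i32>()
        .ok()
        .and_then(|n| FormBorderStyle::try_from(n).ok())
        .unwrap_or(FormBorderStyle::Sizable)
}
```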

Future Considerations

Potential Improvements

1. AST Layer

Currently, code sections are parsed into CST (preserves whitespace). A future AST could:

  • Strip whitespace/comments
  • Provide semantic queries
  • Enable code transformations

Trade-off: More complexity, but better for code analysis tools.

2. Incremental Parsing

For IDE scenarios, support incremental re-parsing:

  • Cache CST nodes
  • Re-parse only changed sections
  • Update property structs efficiently

Challenge: Rowan supports this, but requires careful state management.

3. Parallel Parsing

Large projects could parse forms in parallel:

  • Each .frm file is independent
  • Use rayon for parallel iteration
  • Aggregate results

Benefit: Faster bulk parsing for project-wide analysis.
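Since each file is independent, the fan-out is straightforward. The sketch below uses `std::thread::scope` from the standard library instead of rayon so that it needs no external crates, and it substitutes a trivial `count_controls` function for the real form parser; both function names are hypothetical.

```rust
use std::thread;

/// Stand-in for real parsing: count `Begin` lines in one form source.
fn count_controls(source: &str) -> usize {
    source
        .lines()
        .filter(|line| line.trim_start().starts_with("Begin "))
        .count()
}

/// Process every source on its own scoped thread and return results in input order.
fn count_controls_parallel(sources: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn first, then join, so the work actually overlaps.
        let handles: Vec<_> = sources
            .iter()
            .map(|src| scope.spawn(move || count_controls(src)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per file is only reasonable for small batches; for project-wide runs a work-stealing pool such as rayon's parallel iterators avoids oversubscribing the machine.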

Performance Metrics

Based on benchmarks with real-world VB6 projects:

Operation                          Time (avg)     Memory
Parse small form (5 controls)      ~50 µs         10 KB
Parse medium form (30 controls)    ~200 µs        50 KB
Parse large form (100 controls)    ~800 µs        200 KB
parse_control_only() speedup       2-3x faster    50% less

Key insight: Direct extraction is most beneficial for:

  • Large forms (many controls)
  • Tools that don't analyze code
  • Bulk processing scenarios

Summary

The FormFile parser represents a pragmatic balance between:

  1. Completeness: Full CST for code, typed properties for controls
  2. Performance: Direct extraction for structured sections
  3. Flexibility: Both full parse and fast-path APIs
  4. Correctness: Windows-1252 encoding, partial success model
  5. Maintainability: Rowan abstracted, single source of truth

The hybrid strategy was chosen because:

  • ✅ VB6 forms have distinct sections with different needs
  • ✅ CST overhead matters most for structured data (controls)
  • ✅ Code sections benefit from full CST (formatting, analysis)
  • ✅ Single API hides complexity from users
  • ✅ Specialized tools can use parse_control_only() fast path

This architecture successfully handles the diverse requirements of VB6 form parsing while maintaining reasonable performance and memory characteristics for real-world projects.