Automated Business Process Discovery: Mining Workflows Without Expensive Software

Before you can automate a business process, you need to understand it. Process discovery reveals how work actually flows through your organisation, often exposing gaps between documented procedures and reality. Enterprise process mining platforms like Celonis, SAP Signavio, and UiPath promise automated discovery, but at costs that put them out of reach for most organisations. The alternative is not guesswork. Python and open source tools can deliver professional-grade process discovery without six-figure software investments.

This guide shows you how to implement automated business process discovery using PM4Py and related open source tools. You will learn to extract process insights from your data, visualise workflows, identify bottlenecks, and build the foundation for informed automation decisions.

What Is Process Discovery?

Process discovery extracts actual process behaviour from event data. Rather than relying on how people think processes work, process mining analyses system logs to reveal how processes actually execute.

The typical process discovery workflow (a compact code sketch follows this list):

  1. Extract event logs from source systems (ERP, CRM, ticketing systems)
  2. Transform data into standard process mining format
  3. Discover process models using mining algorithms
  4. Analyse for insights (bottlenecks, variations, compliance gaps)
  5. Visualise and communicate findings to stakeholders
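
Each of these steps maps onto a few lines of PM4Py. As a preview, here is a minimal end-to-end sketch, assuming a CSV export named events.csv with the three standard columns (the rest of this guide expands each step):

import pandas as pd
import pm4py

# 1. Extract: load an event log exported from a source system
df = pd.read_csv('events.csv')  # hypothetical export
df['timestamp'] = pd.to_datetime(df['timestamp'])

# 2. Transform: map columns to the standard process mining format
log = pm4py.format_dataframe(df, case_id='case_id',
                             activity_key='activity',
                             timestamp_key='timestamp')

# 3. Discover: mine a directly-follows graph from the log
dfg, start_activities, end_activities = pm4py.discover_dfg(log)

# 4. Analyse: case durations in seconds (one value per case)
durations = pm4py.get_all_case_durations(log)
print(f"Cases analysed: {len(durations)}")

# 5. Visualise: render the process map for stakeholders
pm4py.view_dfg(dfg, start_activities, end_activities)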

Enterprise tools automate these steps but charge accordingly. Celonis deployments commonly exceed $200,000 annually for mid-sized organisations. For many, this cost cannot be justified before proving process mining delivers value.

The Enterprise Process Mining Cost Problem

What Enterprise Tools Cost

| Platform | Typical Annual Cost | Implementation | Best For |
| --- | --- | --- | --- |
| Celonis | $150,000-$500,000+ | $50,000-$200,000 | Large enterprises with SAP |
| SAP Signavio | $100,000-$300,000+ | Included with SAP projects | SAP customers |
| UiPath Process Mining | $50,000-$150,000+ | Bundled with RPA | UiPath automation users |
| ABBYY Timeline | $30,000-$100,000+ | $20,000-$50,000 | Document-heavy processes |
| IBM Process Mining | $80,000-$200,000+ | Varies | IBM ecosystem |

These figures exclude the internal effort required to prepare data, validate models, and operationalise insights. Total cost of ownership often doubles the software expense.

When Enterprise Tools Make Sense

Enterprise platforms justify their costs when:

  • Process mining is a strategic, organisation-wide initiative
  • Real-time monitoring and alerting are required
  • Integration with enterprise systems (SAP, Salesforce) is critical
  • Compliance requires vendor support and certification
  • Scale exceeds what open source tools handle efficiently

For everyone else, open source alternatives deliver the essential capabilities at a fraction of the cost.

PM4Py: The Open Source Process Mining Standard

PM4Py is the leading open source process mining library, developed at Fraunhofer FIT and freely available under the GPL-3.0 license. It provides comprehensive capabilities for process discovery, conformance checking, and analysis.

Installing PM4Py

pip install pm4py

For visualisation support (note that the Graphviz system binaries must be installed as well):

pip install graphviz

Core Capabilities

PM4Py implements the algorithms and techniques that power commercial process mining platforms:

  • Process Discovery: Alpha Miner, Heuristic Miner, Inductive Miner
  • Conformance Checking: Token-based replay, alignment-based
  • Process Enhancement: Performance analysis, bottleneck detection
  • Filtering and Preprocessing: Activity, time, case filtering
  • Visualisation: Process maps, BPMN diagrams, DFG visualisation

Building a Process Discovery Pipeline

Let’s build a complete process discovery pipeline using PM4Py. This example analyses an order-to-cash process, but the approach applies to any process with event data.

Step 1: Prepare Your Event Log

Process mining requires event logs with three essential fields:

  • Case ID: Unique identifier for each process instance (order number, ticket ID)
  • Activity: The step or action that occurred
  • Timestamp: When the activity occurred

Additional fields enhance analysis:

  • Resource: Who performed the activity
  • Cost: Associated costs
  • Custom attributes: Any relevant process data

import pandas as pd
import pm4py

# Sample data structure
data = {
    'case_id': ['ORD001', 'ORD001', 'ORD001', 'ORD002', 'ORD002', 'ORD002'],
    'activity': ['Create Order', 'Approve Order', 'Ship Order',
                 'Create Order', 'Reject Order', 'Close Order'],
    'timestamp': ['2026-01-01 09:00', '2026-01-01 14:00', '2026-01-02 10:00',
                  '2026-01-01 10:00', '2026-01-01 16:00', '2026-01-01 17:00'],
    'resource': ['Sales', 'Manager', 'Warehouse',
                 'Sales', 'Manager', 'Sales']
}

df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Convert to PM4Py event log format
event_log = pm4py.format_dataframe(
    df,
    case_id='case_id',
    activity_key='activity',
    timestamp_key='timestamp'
)
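
Raw logs usually need cleaning before discovery. A brief sketch of some of PM4Py's built-in filters; the activity names, variant count, and date range here are illustrative:

# Keep only cases that start and end with expected activities
filtered = pm4py.filter_start_activities(event_log, {'Create Order'})
filtered = pm4py.filter_end_activities(filtered, {'Ship Order', 'Close Order'})

# Reduce noise by keeping only the 10 most frequent variants
filtered = pm4py.filter_variants_top_k(filtered, 10)

# Restrict to cases fully contained in a time window
filtered = pm4py.filter_time_range(
    filtered, '2026-01-01 00:00:00', '2026-03-31 23:59:59',
    mode='traces_contained'
)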

Step 2: Extract Data from Source Systems

Real process discovery starts with extracting event data from your systems. Common sources include:

ERP Systems (SAP, Oracle):

import pandas as pd
from sqlalchemy import create_engine

# Connect to database
engine = create_engine('postgresql://user:pass@host/erp_db')

# Extract order events
query = """
SELECT
    order_number as case_id,
    event_type as activity,
    event_timestamp as timestamp,
    user_name as resource
FROM order_events
WHERE event_timestamp >= '2025-01-01'
ORDER BY order_number, event_timestamp
"""

df = pd.read_sql(query, engine)

Ticketing Systems (ServiceNow, Jira):

# Example: Jira export processing
df = pd.read_csv('jira_export.csv')

# Map Jira fields to process mining format
df = df.rename(columns={
    'Issue Key': 'case_id',
    'Status Change': 'activity',
    'Changed': 'timestamp',
    'Assignee': 'resource'
})

Application Logs:

import re
from datetime import datetime

import pandas as pd

def parse_log_line(line):
    # Example: 2026-01-15 10:23:45 [ORDER-12345] ProcessPayment user=john
    pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]+)\] (\w+) user=(\w+)'
    match = re.match(pattern, line)
    if match:
        return {
            'timestamp': datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S'),
            'case_id': match.group(2),
            'activity': match.group(3),
            'resource': match.group(4)
        }
    return None

# Parse log file
events = []
with open('application.log', 'r') as f:
    for line in f:
        event = parse_log_line(line)
        if event:
            events.append(event)

df = pd.DataFrame(events)

Step 3: Discover the Process Model

PM4Py offers several discovery algorithms, each with different strengths:

Directly-Follows Graph (DFG): Simple, fast, intuitive visualisation

# Discover DFG
dfg, start_activities, end_activities = pm4py.discover_dfg(event_log)

# Visualise
pm4py.view_dfg(dfg, start_activities, end_activities)

Heuristic Miner: Handles noise and infrequent behaviour

# Discover process model using Heuristic Miner
heuristic_net = pm4py.discover_heuristics_net(event_log)

# Visualise
pm4py.view_heuristics_net(heuristic_net)

Inductive Miner: Guarantees sound process models

# Discover Petri net using Inductive Miner
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(event_log)

# Visualise
pm4py.view_petri_net(net, initial_marking, final_marking)

BPMN Discovery: Business-friendly notation

# Discover BPMN model
bpmn_model = pm4py.discover_bpmn_inductive(event_log)

# Visualise
pm4py.view_bpmn(bpmn_model)

Step 4: Analyse Process Performance

Process discovery reveals what happens. Performance analysis reveals how long it takes.

# Calculate case duration statistics
case_durations = pm4py.get_all_case_durations(event_log)

print(f"Average case duration: {sum(case_durations)/len(case_durations)/3600:.1f} hours")
print(f"Minimum: {min(case_durations)/3600:.1f} hours")
print(f"Maximum: {max(case_durations)/3600:.1f} hours")

# Estimate the case duration distribution (kernel density estimate);
# returns x/y points useful for plotting the distribution
from pm4py.statistics.traces.generic.pandas import case_statistics

stats = case_statistics.get_kde_caseduration(
    pm4py.convert_to_dataframe(event_log)
)

Bottleneck Detection:

# Analyse waiting times between consecutive activities
df = pm4py.convert_to_dataframe(event_log)
df = df.sort_values(['case:concept:name', 'time:timestamp'])
df['next_timestamp'] = df.groupby('case:concept:name')['time:timestamp'].shift(-1)
df['waiting_time'] = (df['next_timestamp'] - df['time:timestamp']).dt.total_seconds()

# Find the slowest hand-offs: average time spent waiting after each activity
avg_waiting = df.groupby('concept:name')['waiting_time'].mean()
print("Average waiting time after each activity:")
print(avg_waiting.sort_values(ascending=False).head(10))
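
PM4Py can also render these transition times directly on the process map. A short sketch using the performance DFG from the simplified API; the 'mean' aggregation shown here is one of several supported measures:

# Discover a DFG annotated with timing information on each edge
perf_dfg, start_activities, end_activities = pm4py.discover_performance_dfg(event_log)

# Render mean waiting times between activities on the process map
pm4py.view_performance_dfg(perf_dfg, start_activities, end_activities,
                           aggregation_measure='mean')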

Step 5: Identify Process Variants

Process variants are the different paths cases take through a process. Understanding variants helps identify standardisation opportunities.

# Get process variants
variants = pm4py.get_variants(event_log)

# Depending on the PM4Py version, values are case counts (int) or
# collections of cases; normalise to counts either way
variant_counts = {
    variant: (cases if isinstance(cases, int) else len(cases))
    for variant, cases in variants.items()
}
total_cases = sum(variant_counts.values())

# Sort by frequency
sorted_variants = sorted(variant_counts.items(), key=lambda x: x[1], reverse=True)

print("Top 5 process variants:")
for i, (variant, count) in enumerate(sorted_variants[:5]):
    print(f"\n{i+1}. Frequency: {count} cases ({count/total_cases*100:.1f}%)")
    print(f"   Path: {' -> '.join(variant)}")

Visualise variant distribution:

import matplotlib.pyplot as plt

# Create variant frequency chart
variant_names = [f"V{i+1}" for i in range(min(10, len(sorted_variants)))]
frequencies = [count for _, count in sorted_variants[:10]]

plt.figure(figsize=(10, 6))
plt.bar(variant_names, frequencies)
plt.xlabel('Process Variant')
plt.ylabel('Number of Cases')
plt.title('Process Variant Distribution')
plt.savefig('variant_distribution.png')

Advanced Analysis Techniques

Conformance Checking

Compare actual process execution against expected models to identify deviations.

# Define the expected process model (discovered from the log here for
# illustration; in practice, compare against a reference or designed model)
expected_net, im, fm = pm4py.discover_petri_net_inductive(event_log)

# Check conformance
from pm4py.algo.conformance.tokenreplay import algorithm as token_replay

replayed_traces = token_replay.apply(event_log, expected_net, im, fm)

# Analyse fitness
fitness = sum(1 for trace in replayed_traces if trace['trace_is_fit']) / len(replayed_traces)
print(f"Process fitness: {fitness:.1%}")

# Identify non-conforming cases
non_conforming = [
    trace for trace in replayed_traces
    if not trace['trace_is_fit']
]
print(f"Non-conforming cases: {len(non_conforming)}")
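
Token-based replay is fast but coarse. PM4Py also provides the alignment-based checking mentioned earlier, which computes an optimal alignment and a fitness value per case; a minimal sketch using the simplified API, noticeably slower on large logs:

# Alignment-based conformance: optimal alignment and fitness per case
alignments = pm4py.conformance_diagnostics_alignments(
    event_log, expected_net, im, fm
)

# Average alignment fitness across cases (1.0 = perfectly conforming)
avg_fitness = sum(a['fitness'] for a in alignments) / len(alignments)
print(f"Average alignment fitness: {avg_fitness:.1%}")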

Resource Analysis

Understand who does what and how workload distributes.

# Analyse resource involvement: activity counts per resource
df = pm4py.convert_to_dataframe(event_log)
resource_activity = df.groupby(['org:resource', 'concept:name']).size().unstack(fill_value=0)
print("Activity distribution by resource:")
print(resource_activity)

Social Network Analysis

Discover handover patterns and collaboration networks.

from pm4py.algo.organizational_mining.sna import algorithm as sna

# Handover of work network
hw_matrix = sna.apply(event_log, variant=sna.Variants.HANDOVER_LOG)
print("Handover patterns:")
print(hw_matrix)

# Visualise the network (pm4py.view_sna requires the pyvis library)
# pm4py.view_sna(hw_matrix)

Building a Complete Discovery Report

Combine analyses into a comprehensive report:

def generate_process_report(event_log, output_dir='./report'):
    import os
    os.makedirs(output_dir, exist_ok=True)

    report = []
    report.append("# Process Discovery Report\n")

    # Basic statistics
    df = pm4py.convert_to_dataframe(event_log)
    num_cases = df['case:concept:name'].nunique()
    num_events = len(df)
    activities = pm4py.get_event_attribute_values(event_log, 'concept:name')

    report.append(f"## Overview")
    report.append(f"- Total cases: {num_cases}")
    report.append(f"- Total events: {num_events}")
    report.append(f"- Unique activities: {len(activities)}")

    # Duration analysis
    durations = pm4py.get_all_case_durations(event_log)
    report.append(f"\n## Performance")
    report.append(f"- Average duration: {sum(durations)/len(durations)/3600:.1f} hours")
    report.append(f"- Median duration: {sorted(durations)[len(durations)//2]/3600:.1f} hours")

    # Generate process map
    dfg, start, end = pm4py.discover_dfg(event_log)
    pm4py.save_vis_dfg(dfg, start, end, f'{output_dir}/process_map.png')
    report.append(f"\n## Process Map")
    report.append(f"![Process Map](process_map.png)")

    # Variant analysis (normalise to counts across PM4Py versions)
    variants = pm4py.get_variants(event_log)
    variant_counts = {
        v: (c if isinstance(c, int) else len(c)) for v, c in variants.items()
    }
    sorted_variants = sorted(variant_counts.items(), key=lambda x: x[1], reverse=True)

    report.append(f"\n## Top Process Variants")
    for i, (variant, count) in enumerate(sorted_variants[:5]):
        report.append(f"\n### Variant {i+1} ({count} cases, {count/num_cases*100:.1f}%)")
        report.append(f"Path: {' → '.join(variant)}")

    # Write report
    with open(f'{output_dir}/report.md', 'w') as f:
        f.write('\n'.join(report))

    print(f"Report generated in {output_dir}/")

# Generate report
generate_process_report(event_log)

Alternative Open Source Tools

While PM4Py is the most comprehensive Python option, other tools serve specific needs:

bupaR (R)

For organisations with R expertise, bupaR provides similar capabilities:

# bupaverse loads bupaR together with the companion packages
# (processmapR, edeaR) that provide process_map() and throughput_time()
library(bupaverse)

# Load event log; eventlog() also requires an activity instance ID
# (a unique identifier per execution of an activity)
log <- eventlog(
  data,
  case_id = "case_id",
  activity_id = "activity",
  activity_instance_id = "activity_instance",
  timestamp = "timestamp",
  lifecycle_id = "status",
  resource_id = "resource"
)

# Discover process map
process_map(log)

# Performance analysis
throughput_time(log, units = "hours")

ProM

ProM is a comprehensive process mining framework with a GUI. While not Python-based, it offers advanced algorithms and is valuable for:

  • Academic research
  • Algorithm comparison
  • Complex analysis without coding

Apromore

Apromore provides a web-based open source process mining platform. It suits organisations wanting:

  • Collaborative process analysis
  • Web-based access without local installation
  • Visual-first approach

From Discovery to Automation

Process discovery is not the end goal. It informs automation decisions.

Identifying Automation Candidates

Use discovery insights to prioritise automation:

def score_automation_potential(event_log):
    """Score activities for automation potential"""
    df = pm4py.convert_to_dataframe(event_log)

    scores = {}
    for activity in df['concept:name'].unique():
        activity_df = df[df['concept:name'] == activity]

        # Frequency score (higher = more automation value)
        frequency = len(activity_df)

        # Consistency score (fewer distinct resources = easier to automate)
        resources = activity_df['org:resource'].nunique() if 'org:resource' in df.columns else 1

        # Duration score (longer = more value from automation)
        # This would require inter-activity timing

        scores[activity] = {
            'frequency': frequency,
            'distinct_resources': resources,
            'automation_score': frequency / (resources + 1)  # Simple heuristic
        }

    return pd.DataFrame(scores).T.sort_values('automation_score', ascending=False)

automation_candidates = score_automation_potential(event_log)
print("Automation candidates ranked by potential:")
print(automation_candidates.head(10))

Validating Assumptions

Before automating, validate that discovered patterns are intentional:

  1. Review variants with stakeholders: Are all paths legitimate?
  2. Investigate outliers: What causes exceptionally long cases? (see the sketch after this list)
  3. Confirm resource patterns: Are handovers intentional or workarounds?
  4. Check compliance: Do deviations indicate policy violations?
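
For point 2, PM4Py's case-performance filter makes it straightforward to pull the slow cases for stakeholder review. A minimal sketch, assuming anything over seven days counts as an outlier (thresholds are in seconds and purely illustrative):

# Extract unusually long cases for review (7 days to 1 year, in seconds)
slow_cases = pm4py.filter_case_performance(
    event_log,
    min_performance=7 * 24 * 3600,
    max_performance=365 * 24 * 3600
)

slow_df = pm4py.convert_to_dataframe(slow_cases)
print(f"Outlier cases to review: {slow_df['case:concept:name'].nunique()}")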

For implementation guidance after discovery, see our guides on business process automation with Python and choosing BPA tools.

How Tasrie IT Services Can Help

Process discovery is the foundation for successful automation. We help organisations:

  • Extract and prepare event data from disparate source systems
  • Conduct process mining analysis using PM4Py and appropriate tools
  • Interpret results and translate insights into actionable recommendations
  • Identify automation opportunities based on discovery findings
  • Implement automation using Python-based workflow tools

We do not sell process mining software licenses. Our focus is helping you understand your processes and make informed automation decisions using cost-effective, open source tools.

Explore our business process automation services or contact us to discuss process discovery for your organisation.

Key Takeaways

  • Process discovery reveals reality, exposing gaps between documented procedures and actual execution
  • Enterprise process mining platforms cost $100,000-$500,000+ annually, putting them out of reach for many organisations
  • PM4Py provides comprehensive, free process mining capabilities for Python users
  • Event log preparation is critical: ensure case ID, activity, and timestamp fields are properly extracted
  • Multiple discovery algorithms exist: DFG for simplicity, Heuristic Miner for noise tolerance, Inductive Miner for sound models
  • Performance and conformance analysis add depth beyond basic process discovery
  • Discovery informs automation: use insights to prioritise and validate automation candidates

Process discovery should not require enterprise software budgets. With Python and PM4Py, any organisation can gain the insights needed to automate intelligently and improve process performance.
