Before you can automate a business process, you need to understand it. Process discovery reveals how work actually flows through your organisation, often exposing gaps between documented procedures and reality. Enterprise process mining platforms like Celonis, SAP Signavio, and UiPath Process Mining promise automated discovery, but at costs that put them out of reach for most organisations. The alternative is not guesswork. Python and open source tools can deliver professional-grade process discovery without six-figure software investments.
This guide shows you how to implement automated business process discovery using PM4Py and related open source tools. You will learn to extract process insights from your data, visualise workflows, identify bottlenecks, and build the foundation for informed automation decisions.
What Is Process Discovery?
Process discovery extracts actual process behaviour from event data. Rather than relying on how people think processes work, process mining analyses system logs to reveal how processes actually execute.
The typical process discovery workflow:
- Extract event logs from source systems (ERP, CRM, ticketing systems)
- Transform data into standard process mining format
- Discover process models using mining algorithms
- Analyse for insights (bottlenecks, variations, compliance gaps)
- Visualise and communicate findings to stakeholders
Enterprise tools automate these steps but charge accordingly. Celonis deployments commonly exceed $200,000 annually for mid-sized organisations. For many, this cost cannot be justified before proving process mining delivers value.
The Enterprise Process Mining Cost Problem
What Enterprise Tools Cost
| Platform | Typical Annual Cost | Implementation | Best For |
|---|---|---|---|
| Celonis | $150,000-$500,000+ | $50,000-$200,000 | Large enterprises with SAP |
| SAP Signavio | $100,000-$300,000+ | Included with SAP projects | SAP customers |
| UiPath Process Mining | $50,000-$150,000+ | Bundled with RPA | UiPath automation users |
| ABBYY Timeline | $30,000-$100,000+ | $20,000-$50,000 | Document-heavy processes |
| IBM Process Mining | $80,000-$200,000+ | Varies | IBM ecosystem |
These figures exclude the internal effort required to prepare data, validate models, and operationalise insights. Total cost of ownership often doubles the software expense.
When Enterprise Tools Make Sense
Enterprise platforms justify their costs when:
- Process mining is a strategic, organisation-wide initiative
- Real-time monitoring and alerting are required
- Integration with enterprise systems (SAP, Salesforce) is critical
- Compliance requires vendor support and certification
- Scale exceeds what open source tools handle efficiently
For everyone else, open source alternatives deliver the essential capabilities at a fraction of the cost.
PM4Py: The Open Source Process Mining Standard
PM4Py is the leading open source process mining library, developed at Fraunhofer FIT and freely available under the Apache 2.0 license. It provides comprehensive capabilities for process discovery, conformance checking, and analysis.
Installing PM4Py
pip install pm4py
For visualisation support:
pip install graphviz
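Note that the graphviz Python package is only a binding: rendering also requires the Graphviz system binaries (installable via your OS package manager) to be on the PATH. A quick check that the install worked:
import pm4py
# Confirms the library imports cleanly and shows which version you have
print(pm4py.__version__)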
Core Capabilities
PM4Py implements the algorithms and techniques that power commercial process mining platforms:
- Process Discovery: Alpha Miner, Heuristic Miner, Inductive Miner
- Conformance Checking: Token-based replay, alignment-based
- Process Enhancement: Performance analysis, bottleneck detection
- Filtering and Preprocessing: Activity, time, case filtering
- Visualisation: Process maps, BPMN diagrams, DFG visualisation
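Most of these capabilities are exposed through PM4Py's simplified top-level API. As a rough sketch of representative calls (function names from the pm4py 2.x simplified interface; the file path is a placeholder):
import pm4py

log = pm4py.read_xes('orders.xes')                     # import an event log
net, im, fm = pm4py.discover_petri_net_inductive(log)  # process discovery
fitness = pm4py.fitness_token_based_replay(log, net, im, fm)  # conformance checking
recent = pm4py.filter_time_range(log, '2025-01-01 00:00:00', '2025-06-30 23:59:59')  # filtering
pm4py.view_petri_net(net, im, fm)                      # visualisation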
Building a Process Discovery Pipeline
Let’s build a complete process discovery pipeline using PM4Py. This example analyses an order-to-cash process, but the approach applies to any process with event data.
Step 1: Prepare Your Event Log
Process mining requires event logs with three essential fields:
- Case ID: Unique identifier for each process instance (order number, ticket ID)
- Activity: The step or action that occurred
- Timestamp: When the activity occurred
Additional fields enhance analysis:
- Resource: Who performed the activity
- Cost: Associated costs
- Custom attributes: Any relevant process data
import pandas as pd
import pm4py
# Sample data structure
data = {
    'case_id': ['ORD001', 'ORD001', 'ORD001', 'ORD002', 'ORD002', 'ORD002'],
    'activity': ['Create Order', 'Approve Order', 'Ship Order',
                 'Create Order', 'Reject Order', 'Close Order'],
    'timestamp': ['2026-01-01 09:00', '2026-01-01 14:00', '2026-01-02 10:00',
                  '2026-01-01 10:00', '2026-01-01 16:00', '2026-01-01 17:00'],
    'resource': ['Sales', 'Manager', 'Warehouse',
                 'Sales', 'Manager', 'Sales']
}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Convert to PM4Py event log format
event_log = pm4py.format_dataframe(
    df,
    case_id='case_id',
    activity_key='activity',
    timestamp_key='timestamp'
)
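If your logs already exist as XES files (the IEEE process mining standard), or you want to persist a prepared log for reuse, PM4Py reads and writes XES directly. A minimal sketch, with placeholder file names:
# Read an existing XES log
event_log = pm4py.read_xes('exported_log.xes')
# Persist a prepared log as XES (recent PM4Py versions also accept dataframe-based logs here)
pm4py.write_xes(event_log, 'prepared_log.xes')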
Step 2: Extract Data from Source Systems
Real process discovery starts with extracting event data from your systems. Common sources include:
ERP Systems (SAP, Oracle):
import pandas as pd
from sqlalchemy import create_engine
# Connect to database
engine = create_engine('postgresql://user:pass@host/erp_db')
# Extract order events
query = """
SELECT
order_number as case_id,
event_type as activity,
event_timestamp as timestamp,
user_name as resource
FROM order_events
WHERE event_timestamp >= '2025-01-01'
ORDER BY order_number, event_timestamp
"""
df = pd.read_sql(query, engine)
Ticketing Systems (ServiceNow, Jira):
# Example: Jira export processing
df = pd.read_csv('jira_export.csv')
# Map Jira fields to process mining format
df = df.rename(columns={
    'Issue Key': 'case_id',
    'Status Change': 'activity',
    'Changed': 'timestamp',
    'Assignee': 'resource'
})
# Parse timestamps so events can be ordered correctly
df['timestamp'] = pd.to_datetime(df['timestamp'])
Application Logs:
import re
from datetime import datetime
def parse_log_line(line):
    # Example: 2026-01-15 10:23:45 [ORDER-12345] ProcessPayment user=john
    pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]+)\] (\w+) user=(\w+)'
    match = re.match(pattern, line)
    if match:
        return {
            'timestamp': datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S'),
            'case_id': match.group(2),
            'activity': match.group(3),
            'resource': match.group(4)
        }
    return None
# Parse log file
events = []
with open('application.log', 'r') as f:
    for line in f:
        event = parse_log_line(line)
        if event:
            events.append(event)
df = pd.DataFrame(events)
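Whichever source you extract from, finish with the same conversion as Step 1 so every downstream function receives a consistently formatted log:
import pm4py
# Normalise ordering, then convert to PM4Py's expected column names
df = df.sort_values(['case_id', 'timestamp'])
event_log = pm4py.format_dataframe(
    df,
    case_id='case_id',
    activity_key='activity',
    timestamp_key='timestamp'
)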
Step 3: Discover the Process Model
PM4Py offers several discovery algorithms, each with different strengths:
Directly-Follows Graph (DFG): Simple, fast, intuitive visualisation
# Discover DFG
dfg, start_activities, end_activities = pm4py.discover_dfg(event_log)
# Visualise
pm4py.view_dfg(dfg, start_activities, end_activities)
Heuristic Miner: Handles noise and infrequent behaviour
# Discover process model using Heuristic Miner
heuristic_net = pm4py.discover_heuristics_net(event_log)
# Visualise
pm4py.view_heuristics_net(heuristic_net)
Inductive Miner: Guarantees sound process models
# Discover Petri net using Inductive Miner
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(event_log)
# Visualise
pm4py.view_petri_net(net, initial_marking, final_marking)
BPMN Discovery: Business-friendly notation
# Discover BPMN model
bpmn_model = pm4py.discover_bpmn_inductive(event_log)
# Visualise
pm4py.view_bpmn(bpmn_model)
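Discovered models can also be saved for sharing with stakeholders or for import into other tools; a minimal sketch with placeholder file names:
# Export the BPMN model to a .bpmn file openable in most BPMN editors
pm4py.write_bpmn(bpmn_model, 'order_to_cash.bpmn')
# Save the rendered diagram as an image for slide decks and reports
pm4py.save_vis_bpmn(bpmn_model, 'order_to_cash.png')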
Step 4: Analyse Process Performance
Process discovery reveals what happens. Performance analysis reveals how long it takes.
# Calculate case duration statistics
case_durations = pm4py.get_all_case_durations(event_log)
print(f"Average case duration: {sum(case_durations)/len(case_durations)/3600:.1f} hours")
print(f"Minimum: {min(case_durations)/3600:.1f} hours")
print(f"Maximum: {max(case_durations)/3600:.1f} hours")
# Kernel density estimate of case durations (x/y points you can plot)
from pm4py.statistics.traces.generic.pandas import case_statistics
kde_points = case_statistics.get_kde_caseduration(
    pm4py.convert_to_dataframe(event_log)
)
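Summary statistics can hide a long tail, so a histogram of case durations is often worth plotting; a short sketch using matplotlib:
import matplotlib.pyplot as plt
# Plot the distribution of case durations in hours
hours = [d / 3600 for d in case_durations]
plt.figure(figsize=(8, 5))
plt.hist(hours, bins=30)
plt.xlabel('Case duration (hours)')
plt.ylabel('Number of cases')
plt.title('Case Duration Distribution')
plt.savefig('case_durations.png')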
Bottleneck Detection:
# Analyse waiting times between consecutive activities
df = pm4py.convert_to_dataframe(event_log)
df = df.sort_values(['case:concept:name', 'time:timestamp'])
df['next_timestamp'] = df.groupby('case:concept:name')['time:timestamp'].shift(-1)
df['waiting_time'] = (df['next_timestamp'] - df['time:timestamp']).dt.total_seconds()
# Slowest transitions: average time from each activity to the next one in the case
avg_waiting = df.groupby('concept:name')['waiting_time'].mean()
print("Average waiting time after each activity (seconds):")
print(avg_waiting.sort_values(ascending=False).head(10))
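Recent PM4Py versions can also render these timings directly on the process map via a performance DFG, which annotates each edge with aggregate transition times:
# Discover and view a DFG annotated with mean transition times
performance_dfg, start_activities, end_activities = pm4py.discover_performance_dfg(event_log)
pm4py.view_performance_dfg(performance_dfg, start_activities, end_activities, aggregation_measure='mean')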
Step 5: Identify Process Variants
Process variants are the different paths cases take through a process. Understanding variants helps identify standardisation opportunities.
# Get process variants
# Depending on the PM4Py version, variant values are case lists (older) or counts (newer)
variants = pm4py.get_variants(event_log)
variant_counts = {
    variant: (value if isinstance(value, int) else len(value))
    for variant, value in variants.items()
}
total_cases = sum(variant_counts.values())
# Sort by frequency
sorted_variants = sorted(variant_counts.items(), key=lambda x: x[1], reverse=True)
print("Top 5 process variants:")
for i, (variant, count) in enumerate(sorted_variants[:5]):
    print(f"\n{i+1}. Frequency: {count} cases ({count/total_cases*100:.1f}%)")
    print(f"   Path: {' -> '.join(variant)}")
Visualise variant distribution:
import matplotlib.pyplot as plt
# Create variant frequency chart
variant_names = [f"V{i+1}" for i in range(min(10, len(sorted_variants)))]
frequencies = [count for _, count in sorted_variants[:10]]
plt.figure(figsize=(10, 6))
plt.bar(variant_names, frequencies)
plt.xlabel('Process Variant')
plt.ylabel('Number of Cases')
plt.title('Process Variant Distribution')
plt.savefig('variant_distribution.png')
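Once the dominant variants are known, filtering the log down to them removes noise before model discovery. Recent PM4Py versions provide a top-k variant filter:
# Keep only the 5 most frequent variants, then rediscover a cleaner model
filtered_log = pm4py.filter_variants_top_k(event_log, 5)
bpmn_model = pm4py.discover_bpmn_inductive(filtered_log)
pm4py.view_bpmn(bpmn_model)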
Advanced Analysis Techniques
Conformance Checking
Compare actual process execution against expected models to identify deviations.
# Expected model: in practice a reference model; here we discover one from the log for illustration
expected_net, im, fm = pm4py.discover_petri_net_inductive(event_log)
# Check conformance
from pm4py.algo.conformance.tokenreplay import algorithm as token_replay
replayed_traces = token_replay.apply(event_log, expected_net, im, fm)
# Analyse fitness
fitness = sum(1 for trace in replayed_traces if trace['trace_is_fit']) / len(replayed_traces)
print(f"Process fitness: {fitness:.1%}")
# Identify non-conforming cases
non_conforming = [
    trace for trace in replayed_traces
    if not trace['trace_is_fit']
]
print(f"Non-conforming cases: {len(non_conforming)}")
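The simplified interface also exposes the alignment-based variant mentioned above; alignments are slower than token replay but give more precise per-trace diagnostics. A minimal sketch:
# Alignment-based fitness (more precise, but slower than token replay)
alignment_fitness = pm4py.fitness_alignments(event_log, expected_net, im, fm)
print(alignment_fitness)  # dict including average trace fitness and % of fitting traces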
Resource Analysis
Understand who does what and how workload distributes.
# Analyse how work distributes: activities per resource via a pandas pivot
df = pm4py.convert_to_dataframe(event_log)
resource_activity = df.groupby(['org:resource', 'concept:name']).size().unstack(fill_value=0)
print("Activity distribution by resource:")
print(resource_activity)
Social Network Analysis
Discover handover patterns and collaboration networks.
from pm4py.algo.organizational_mining.sna import algorithm as sna
# Handover of work network
hw_matrix = sna.apply(event_log, variant=sna.Variants.HANDOVER_LOG)
print("Handover patterns:")
print(hw_matrix)
# Visualise the network; recent PM4Py versions provide pm4py.view_sna
# (interactive rendering, may require the pyvis package)
Building a Complete Discovery Report
Combine analyses into a comprehensive report:
def generate_process_report(event_log, output_dir='./report'):
    import os
    os.makedirs(output_dir, exist_ok=True)
    report = []
    report.append("# Process Discovery Report\n")
    # Basic statistics
    df = pm4py.convert_to_dataframe(event_log)
    num_cases = df['case:concept:name'].nunique()
    num_events = len(df)
    activities = pm4py.get_event_attribute_values(event_log, 'concept:name')
    report.append("## Overview")
    report.append(f"- Total cases: {num_cases}")
    report.append(f"- Total events: {num_events}")
    report.append(f"- Unique activities: {len(activities)}")
    # Duration analysis
    durations = pm4py.get_all_case_durations(event_log)
    report.append("\n## Performance")
    report.append(f"- Average duration: {sum(durations)/len(durations)/3600:.1f} hours")
    report.append(f"- Median duration: {sorted(durations)[len(durations)//2]/3600:.1f} hours")
    # Generate process map
    dfg, start, end = pm4py.discover_dfg(event_log)
    pm4py.save_vis_dfg(dfg, start, end, f'{output_dir}/process_map.png')
    report.append("\n## Process Map")
    report.append("![Process Map](process_map.png)")
    # Variant analysis (values are case lists or counts, depending on PM4Py version)
    variants = pm4py.get_variants(event_log)
    variant_counts = {
        v: (c if isinstance(c, int) else len(c)) for v, c in variants.items()
    }
    sorted_variants = sorted(variant_counts.items(), key=lambda x: x[1], reverse=True)
    report.append("\n## Top Process Variants")
    for i, (variant, count) in enumerate(sorted_variants[:5]):
        report.append(f"\n### Variant {i+1} ({count} cases, {count/num_cases*100:.1f}%)")
        report.append(f"Path: {' → '.join(variant)}")
    # Write report
    with open(f'{output_dir}/report.md', 'w') as f:
        f.write('\n'.join(report))
    print(f"Report generated in {output_dir}/")
# Generate report
generate_process_report(event_log)
Alternative Open Source Tools
While PM4Py is the most comprehensive Python option, other tools serve specific needs:
bupaR (R)
For organisations with R expertise, bupaR provides similar capabilities:
library(bupaR)
# Load event log; eventlog() also requires an activity instance identifier
# ("activity_instance" is a placeholder column name)
log <- eventlog(
    data,
    case_id = "case_id",
    activity_id = "activity",
    activity_instance_id = "activity_instance",
    timestamp = "timestamp",
    lifecycle_id = "status",
    resource_id = "resource"
)
# Discover process map (process_map() comes from processmapR in the bupaR suite)
process_map(log)
# Performance analysis (throughput_time() comes from edeaR in the bupaR suite)
throughput_time(log, units = "hours")
ProM
ProM is a comprehensive process mining framework with a GUI. While not Python-based, it offers advanced algorithms and is valuable for:
- Academic research
- Algorithm comparison
- Complex analysis without coding
Apromore
Apromore provides a web-based open source process mining platform. It suits organisations wanting:
- Collaborative process analysis
- Web-based access without local installation
- Visual-first approach
From Discovery to Automation
Process discovery is not the end goal. It informs automation decisions.
Identifying Automation Candidates
Use discovery insights to prioritise automation:
def score_automation_potential(event_log):
    """Score activities for automation potential"""
    df = pm4py.convert_to_dataframe(event_log)
    scores = {}
    for activity in df['concept:name'].unique():
        activity_df = df[df['concept:name'] == activity]
        # Frequency score (higher = more automation value)
        frequency = len(activity_df)
        # Resource spread (fewer distinct resources = more standardised, easier to automate)
        num_resources = activity_df['org:resource'].nunique() if 'org:resource' in df.columns else 1
        # A duration score would add further signal but requires inter-activity timing
        scores[activity] = {
            'frequency': frequency,
            'num_resources': num_resources,
            'automation_score': frequency / (num_resources + 1)  # Simple illustrative scoring
        }
    return pd.DataFrame(scores).T.sort_values('automation_score', ascending=False)
automation_candidates = score_automation_potential(event_log)
print("Automation candidates ranked by potential:")
print(automation_candidates.head(10))
Validating Assumptions
Before automating, validate that discovered patterns are intentional:
- Review variants with stakeholders: Are all paths legitimate?
- Investigate outliers: What causes exceptionally long cases? (see the sketch after this list)
- Confirm resource patterns: Are handovers intentional or workarounds?
- Check compliance: Do deviations indicate policy violations?
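For the outlier check, PM4Py can filter the log to exceptionally slow cases so you can inspect them directly; the 30-day threshold below is illustrative, not a recommendation:
# Keep only cases slower than a chosen threshold (durations in seconds)
slow_cases = pm4py.filter_case_performance(event_log, 30 * 86400, 10**10)
slow_df = pm4py.convert_to_dataframe(slow_cases)
print(slow_df['case:concept:name'].unique())  # case IDs worth reviewing with stakeholders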
For implementation guidance after discovery, see our guides on business process automation with Python and choosing BPA tools.
How Tasrie IT Services Can Help
Process discovery is the foundation for successful automation. We help organisations:
- Extract and prepare event data from disparate source systems
- Conduct process mining analysis using PM4Py and appropriate tools
- Interpret results and translate insights into actionable recommendations
- Identify automation opportunities based on discovery findings
- Implement automation using Python-based workflow tools
We do not sell process mining software licenses. Our focus is helping you understand your processes and make informed automation decisions using cost-effective, open source tools.
Explore our business process automation services or contact us to discuss process discovery for your organisation.
Key Takeaways
- Process discovery reveals reality, exposing gaps between documented procedures and actual execution
- Enterprise process mining platforms cost tens to hundreds of thousands of dollars annually, putting them out of reach for many organisations
- PM4Py provides comprehensive, free process mining capabilities for Python users
- Event log preparation is critical: ensure case ID, activity, and timestamp fields are properly extracted
- Multiple discovery algorithms exist: DFG for simplicity, Heuristic Miner for noise tolerance, Inductive Miner for sound models
- Performance and conformance analysis add depth beyond basic process discovery
- Discovery informs automation: use insights to prioritise and validate automation candidates
Process discovery should not require enterprise software budgets. With Python and PM4Py, any organisation can gain the insights needed to automate intelligently and improve process performance.