`aggregate_topology_points.py` User Guide

Use scripts/aggregate_topology_points.py to combine many *_topology_points.csv files into:

A single aggregated topology_stats CSV (same shape as per-crop stats).
Per-category histograms for selected scalars (CSV, plus PNG if matplotlib is available).
Composition bar chart + violin/box plots for four focus categories (clusters, filament-manifolds not clusters, walls not filaments/clusters, and unassigned).

The script is designed for large datasets (many crops and large per-point files).

What it does

Scans for *_topology_points.csv files (or uses explicit inputs).
Aggregates category statistics across all crops.
Writes histogram CSVs (and optional PNGs) per category and scalar.
Optionally renders a composition bar chart + violin/box plots.
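The aggregation step can be pictured as a running accumulator per category that is updated file by file, so no crop has to be held in memory in full. The following is a minimal sketch using only the standard library, not the script's actual implementation; the column names `category` and `field_value` and the function name `aggregate_stats` are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_stats(csv_streams, category_col="category", value_col="field_value"):
    """Accumulate count, sum, min and max of value_col per category across files.

    The column names are hypothetical; adjust them to the real per-point CSV header.
    """
    stats = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for stream in csv_streams:
        for row in csv.DictReader(stream):
            value = float(row[value_col])
            entry = stats[row[category_col]]
            entry["count"] += 1
            entry["sum"] += value
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    # Derive the mean once all files have been read.
    return {cat: {**s, "mean": s["sum"] / s["count"]} for cat, s in stats.items()}


# Two tiny in-memory "crops" stand in for *_topology_points.csv files.
crop_a = io.StringIO("category,field_value\nclusters,4.0\nwalls,1.0\n")
crop_b = io.StringIO("category,field_value\nclusters,6.0\n")
result = aggregate_stats([crop_a, crop_b])
```

Count, sum, min and max can be merged exactly across files; quantiles cannot, which is why the choice of engine matters for them.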

Requirements

Python in your active environment.
Optional but recommended: polars (fast path, default).
Optional: matplotlib for PNG histogram plots.
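Before a long run it can be useful to check which optional packages are importable. The snippet below is a small illustration using the standard library; the helper name `is_available` is hypothetical and is not part of aggregate_topology_points.py.

```python
import importlib.util


def is_available(module_name):
    """Return True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the optional dependencies named in this guide.
report = {name: is_available(name) for name in ("polars", "matplotlib")}
for name, found in report.items():
    print(f"{name}: {'found' if found else 'missing'}")
```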

CLI

python scripts/aggregate_topology_points.py \
  --root <OUTPUT_ROOT> \
  --output-dir <COMBINED_DIR> \
  --output-prefix <PREFIX> \
  [--glob "**/*_topology_points.csv"] \
  [--inputs file1.csv file2.csv ...] \
  [--bins 100] \
  [--hist-scalars field_value log_field_value|all] \
  [--hist-bin-mode per-category|global] \
  [--hist-percentile-range 1 99] \
  [--engine polars|stream|python] \
  [--polars-chunks N] \
  [--violin-scalar log_field_value] \
  [--box-scalar field_value] \
  [--plots-from-raw] \
  [--plot-sample-size 200000] \
  [--plot-fontscale 1.0] \
  [--plot-dpi 200] \
  [--log10-field-value] \
  [--plot-percentile-range <LOW> <HIGH>] \
  [--no-plots]

Engines

polars (default): fast, exact stats, scales well. Use --polars-chunks to split inputs into N chunks for large datasets; chunked mode uses global histogram bins and yields approximate quantiles.
stream: fast and memory-light, approximate quantiles (uses histogram bins).
python: exact but slow on large files.
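To see why histogram-based quantiles are approximate, consider estimating a quantile from bin counts alone: the position inside the bin that contains the target rank has to be interpolated. The sketch below shows one common way to do this (linear interpolation within the bin); it is an assumption about the general technique, not a copy of the stream engine, and `quantile_from_histogram` is a hypothetical name.

```python
def quantile_from_histogram(edges, counts, q):
    """Estimate the q-quantile (0 <= q <= 1) from bin edges and counts.

    Values are assumed to be spread uniformly inside each bin, so the
    result is only as precise as the bin width.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= target:
            fraction = (target - cumulative) / count
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        cumulative += count
    return edges[-1]


edges = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [10, 20, 50, 20]
median = quantile_from_histogram(edges, counts, 0.5)
```

With this approach, a larger --bins value narrows each bin and so reduces the interpolation error, at the cost of larger histogram files.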

Output

Files are written to --output-dir, each named with --output-prefix:

<prefix>_topology_stats.csv
<prefix>_<category>_<scalar>_hist.csv
<prefix>_<category>_<scalar>_hist.png (if matplotlib is available and --no-plots is not set)
<prefix>_filman_walls_composition.png (composition bar chart for the four focus categories)
<prefix>_filman_walls_<scalar>_violin.png (violin plot for the four focus categories)
<prefix>_filman_walls_<scalar>_box.png (box plot for the four focus categories)

Categories follow ndtopo_stats.py, including filament‑manifold and cluster variants when present. The violin/box/composition plots only include:

clusters
filament_manifolds_not_clusters
walls_not_filament_manifolds_or_clusters
unassigned_walls_filament_manifolds_clusters

Histograms use per-category bins by default (--hist-bin-mode per-category) or shared bins per scalar (--hist-bin-mode global). The default histogram scalars are field_value and log_field_value (use --hist-scalars to override). Histogram PNGs include the particle count n and a red dashed median line; histogram CSVs include total_count. Violin plots use raw values if --plots-from-raw is set; otherwise they sample from histogram bins. Box plots are rendered from the aggregated *_topology_stats.csv by default.
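The difference between the two bin modes can be illustrated with a short sketch. It assumes that bin edges span the chosen percentile range of the data and that global mode pools all categories of a scalar before computing that range; both are assumptions about the behaviour, and the function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def bin_edges(values, bins, pct_range=(1, 99)):
    """Equal-width bin edges covering the given percentile range of values."""
    ordered = sorted(values)
    lo = percentile(ordered, pct_range[0])
    hi = percentile(ordered, pct_range[1])
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def edges_by_mode(values_by_category, bins, mode="per-category", pct_range=(1, 99)):
    """Return bin edges per category, either individual or shared."""
    if mode == "global":
        pooled = [v for vals in values_by_category.values() for v in vals]
        shared = bin_edges(pooled, bins, pct_range)
        return {cat: shared for cat in values_by_category}
    return {
        cat: bin_edges(vals, bins, pct_range)
        for cat, vals in values_by_category.items()
    }


data = {
    "clusters": [float(v) for v in range(0, 101)],
    "walls": [float(v) for v in range(200, 301)],
}
per_cat = edges_by_mode(data, bins=2, mode="per-category", pct_range=(0, 100))
shared = edges_by_mode(data, bins=2, mode="global", pct_range=(0, 100))
```

Shared edges make histograms directly comparable across categories, while per-category edges give finer resolution for categories that occupy a narrow range.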

Examples

Aggregate all crops under a root, fast path:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine polars \
  --hist-bin-mode global \
  --hist-percentile-range 1 99

Use explicit files and skip PNG plots:

python scripts/aggregate_topology_points.py \
  --inputs outputs/crop_a/topology_points.csv outputs/crop_b/topology_points.csv \
  --output-dir outputs/combined \
  --output-prefix combined \
  --no-plots

Streaming mode (approximate quantiles):

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine stream \
  --bins 50

Chunked polars mode (lower memory, approximate quantiles):

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine polars \
  --polars-chunks 4 \
  --hist-bin-mode global \
  --hist-percentile-range 1 99

`compare_simulations.py` User Guide

Use scripts/compare_simulations.py to compare two simulations side-by-side. It reads one *_topology_stats.csv per simulation (auto-discovered from a folder or its combined/ subdirectory) and, when per-point CSVs are available, reads the individual *_topology_points.csv files across all crops.
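The auto-discovery described above can be approximated with pathlib. This is a sketch under stated assumptions, not the script's own code: the function names are hypothetical, and picking the first match in sorted order when several stats files exist is an arbitrary choice made here.

```python
import tempfile
from pathlib import Path


def find_stats_csv(sim_dir):
    """Look for *_topology_stats.csv in sim_dir, then in sim_dir/combined."""
    sim_dir = Path(sim_dir)
    for folder in (sim_dir, sim_dir / "combined"):
        if folder.is_dir():
            matches = sorted(folder.glob("*_topology_stats.csv"))
            if matches:
                return matches[0]
    return None


def find_points_csvs(sim_dir):
    """Collect per-point CSVs under sim_dir, skipping anything inside combined/."""
    sim_dir = Path(sim_dir)
    combined = sim_dir / "combined"
    return sorted(
        p for p in sim_dir.rglob("*_topology_points.csv") if combined not in p.parents
    )


# Build a throwaway directory layout to exercise both helpers.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "combined").mkdir()
    (root / "crop_a").mkdir()
    (root / "combined" / "run_topology_stats.csv").write_text("")
    (root / "combined" / "run_topology_points.csv").write_text("")
    (root / "crop_a" / "crop_a_topology_points.csv").write_text("")
    stats_name = find_stats_csv(root).name
    point_names = [p.name for p in find_points_csvs(root)]
```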

What it does

Reads aggregated topology_stats.csv files from two simulation output folders.
Renders side-by-side box plots for each of the four primary categories (clusters, filaments, walls, unassigned) across chosen scalars, with a summary-statistics table below.
Renders side-by-side violin plots (requires per-point CSVs) built from actual particle values with equal-area KDE normalization — every violin’s visual area is proportional to its sample size — plus a median line and IQR indicator.
Renders two scatter + marginal-density plots (requires per-point CSVs):
- Particle proximity: x = density^(−1/3) (a proxy for inter-particle separation), y = log₁₀(density).
- Voronoi cell volume: x = density^(−1) (Voronoi volume proxy), y = log₁₀(density).
- Each scatter plot includes marginal density curves along both axes.
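The two x-axis quantities follow directly from the formulas above. The sketch below computes them for a single positive density value; the function names are hypothetical, and the units of the results depend on the units of the density column.

```python
import math


def proximity_proxy(density):
    """density^(-1/3): a proxy for the mean inter-particle separation."""
    return density ** (-1.0 / 3.0)


def voronoi_volume_proxy(density):
    """density^(-1): a proxy for the Voronoi cell volume."""
    return 1.0 / density


def log10_density(density):
    """log10(density), used on the y axis of both scatter plots."""
    return math.log10(density)


density = 8.0
x_proximity = proximity_proxy(density)
x_voronoi = voronoi_volume_proxy(density)
y = log10_density(density)
```

Because both x quantities are monotonic functions of density, each scatter plot traces a single curve per formula; the category colouring and marginal densities show where each category sits along it.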

Requirements

Python with matplotlib, numpy, scipy (for KDE), and polars (or pandas).

CLI

python scripts/compare_simulations.py \
  --sim <SIM_A_DIR> \
  --sim <SIM_B_DIR> \
  [--stats-file <CSV_A> --stats-file <CSV_B>] \
  [--points-file <CSV_A> --points-file <CSV_B>] \
  [--labels "z=3" "z=0"] \
  [--scalars log_field_value field_value] \
  [--plot-sample-size N] \
  [--output-dir <OUTPUT_DIR>] \
  [--output-prefix <PREFIX>] \
  [--font-scale 1.0] \
  [--dpi 150]

Key options

--sim — Root output directory for a simulation (repeat exactly twice). The script searches for *_topology_stats.csv directly or inside a combined/ subdirectory, and finds all *_topology_points.csv files under the folder (excluding combined/) for violin and scatter plots.
--stats-file — Supply stats CSVs directly (repeat exactly twice) instead of --sim.
--points-file — Supply per-point CSVs directly (repeat exactly twice) instead of auto-discovery.
--labels — Display names for the two simulations (default: folder basenames). Pass the earlier epoch first (e.g. "z=3" "z=0") — violin plots always show sim A on the right and sim B on the left, so earlier-epoch data appears on the left.
--scalars — Which scalar columns to plot (default: all detected scalars).
--plot-sample-size — Reservoir-sample at most N rows per simulation when reading per-point CSVs (default: read all rows).
--output-dir — Where to write the figure (default: current directory).
--output-prefix — Filename prefix for output PNGs (default: comparison).
--font-scale — Multiplier for all font sizes (default: 1.0).
--dpi — Output image DPI (default: 150).
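Reservoir sampling, as used by --plot-sample-size, keeps a uniform random sample of at most N rows while reading the input once, without knowing the total row count in advance. The following is a textbook sketch (Algorithm R) for illustration; the function name and the fixed seed are assumptions and may differ from what the script does.

```python
import random


def reservoir_sample(rows, n, seed=0):
    """Return a uniform random sample of at most n items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Replace a random slot with probability n / (index + 1).
            j = rng.randint(0, index)
            if j < n:
                reservoir[j] = row
    return reservoir


sample = reservoir_sample(iter(range(1000)), n=10)
small = reservoir_sample(iter(range(3)), n=10)
```

When the input has fewer than N rows, every row is kept, which matches the "at most N" wording of the option.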

Output files

For each scalar:

<prefix>_<scalar>_box.png — side-by-side box plots + stats table
<prefix>_<scalar>_violin.png — side-by-side violin plots + stats table (requires per-point CSVs)

Once per run, independent of the chosen scalars (requires per-point CSVs):

<prefix>_scatter_proximity.png — log₁₀-density vs particle proximity scatter + marginal densities
<prefix>_scatter_voronoi.png — log₁₀-density vs Voronoi cell volume scatter + marginal densities

Violin plot conventions

Violins use equal-area KDE normalization: all violins share a single global scale, so each violin's visual area is proportional to its sample size instead of every violin being stretched to the same maximum width.
Outliers are trimmed to the 0.1–99.9% range per category before KDE and stats are computed.
A horizontal line marks the median; a thick vertical line spans the IQR.
Column order: sim B (the second --sim, typically the earlier epoch) is shown on the left; sim A is on the right. Pass --labels in chronological order (earlier first) to match the visual order.
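The trimming, the median/IQR markers and the shared width scale can be sketched without any plotting library. The code below is an illustration under assumptions: it takes density curves that have already been computed and normalized to unit area (for example by a KDE), and it reads "shared global scale" as one common factor applied to size-weighted curves. All function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a pre-sorted list."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def trim(values, low_pct=0.1, high_pct=99.9):
    """Keep only values inside the [low_pct, high_pct] percentile range."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, low_pct), percentile(ordered, high_pct)
    return [v for v in ordered if lo <= v <= hi]


def median_and_iqr(values):
    """Return (median, first quartile, third quartile)."""
    ordered = sorted(values)
    return percentile(ordered, 50), percentile(ordered, 25), percentile(ordered, 75)


def shared_scale_widths(densities, sizes, max_half_width=0.4):
    """Weight unit-area density curves by sample size, then apply one global scale."""
    weighted = {k: [sizes[k] * d for d in curve] for k, curve in densities.items()}
    peak = max(max(curve) for curve in weighted.values())
    return {k: [max_half_width * w / peak for w in curve] for k, curve in weighted.items()}


trimmed = trim([float(v) for v in range(1001)])
med, q1, q3 = median_and_iqr([1.0, 2.0, 3.0, 4.0, 5.0])
widths = shared_scale_widths(
    {"sim_a": [0.0, 1.0, 0.0], "sim_b": [0.0, 1.0, 0.0]},
    {"sim_a": 100, "sim_b": 50},
)
```

In this sketch, two violins with the same shape but different sample sizes end up with different widths, which is the visual cue the shared scale is meant to provide.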

Notes

The figure uses the same four-category colour scheme as aggregate_topology_points.py (clusters yellow, filaments orange, walls red, unassigned purple).
Box plots: simulation A uses a lighter fill with //// hatch; simulation B uses a solid fill.
Violin plot colours match the category colour; both simulations use the same hue.
Scatter plots colour points by category and show per-category marginal density curves. The legend uses coloured text placed at roughly the upper-right quadrant of the figure.
If no per-point CSVs are found for a simulation, violin and scatter plots are skipped with a warning.

Examples

python scripts/compare_simulations.py \
  --sim outputs/quijote_batches_000_w_clusters \
  --sim outputs/quijote_batches_004_w_clusters_points_6_0 \
  --labels "z=3" "z=0" \
  --scalars log10_field_value \
  --output-dir outputs/comparison_000_vs_004 \
  --output-prefix compare_z3_vs_z0_6_0

More Examples

Redshift 3:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000_w_clusters \
  --output-dir outputs/quijote_batches_000_w_clusters/combined \
  --output-prefix quijote_batches_000_w_clusters \
  --engine polars \
  --polars-chunks 4 \
  --log10-field-value \
  --violin-scalar log10_field_value \
  --hist-bin-mode global \
  --hist-percentile-range 1 99 \
  --plot-percentile-range .1 99.9 \
  --plot-fontscale 1.2 \
  --plot-dpi 600

Redshift 0:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_004_w_clusters_points_6_0 \
  --output-dir outputs/quijote_batches_004_w_clusters_points_6_0/combined \
  --output-prefix quijote_batches_004_w_clusters_points_6_0 \
  --engine polars \
  --polars-chunks 4 \
  --log10-field-value \
  --violin-scalar log10_field_value \
  --hist-bin-mode global \
  --hist-percentile-range 1 99 \
  --plot-percentile-range .1 99.9 \
  --plot-fontscale 1.2 \
  --plot-dpi 600
