aggregate_topology_points.py User Guide

Use scripts/aggregate_topology_points.py to combine many *_topology_points.csv files into aggregated category statistics, histogram CSVs, and optional PNG plots (all listed under Output).

This is designed for large datasets (many crops and large per-point files).

What it does

  • Scans for *_topology_points.csv files (or uses explicit inputs).
  • Aggregates category statistics across all crops.
  • Writes histogram CSVs (and optional PNGs) per category and scalar.
  • Optionally renders a composition bar chart and violin/box plots.
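
The scanning step can be pictured with a small sketch. This is not the script's own code: find_point_files is a hypothetical helper, and it assumes that --root and --glob behave like a pathlib glob evaluated under the root directory.

```python
from pathlib import Path


def find_point_files(root, pattern="**/*_topology_points.csv"):
    """Return the per-point CSVs under `root` that match `pattern`.

    Hypothetical helper: assumes --glob is applied relative to --root.
    """
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())
```

With the default pattern, the leading `**/` also matches files that sit directly in the root, so a flat layout and a one-folder-per-crop layout are both picked up.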

Requirements

  • Python in your active environment.
  • Optional but recommended: polars (fast path, default).
  • Optional: matplotlib for PNG histogram plots.

CLI

python scripts/aggregate_topology_points.py \
  --root <OUTPUT_ROOT> \
  --output-dir <COMBINED_DIR> \
  --output-prefix <PREFIX> \
  [--glob "**/*_topology_points.csv"] \
  [--inputs file1.csv file2.csv ...] \
  [--bins 100] \
  [--hist-scalars field_value log_field_value|all] \
  [--hist-bin-mode per-category|global] \
  [--hist-percentile-range 1 99] \
  [--engine polars|stream|python] \
  [--polars-chunks N] \
  [--violin-scalar log_field_value] \
  [--box-scalar field_value] \
  [--plots-from-raw] \
  [--plot-sample-size 200000] \
  [--plot-fontscale 1.0] \
  [--plot-dpi 200] \
  [--plot-percentile-range <LO> <HI>] \
  [--log10-field-value] \
  [--no-plots]

Engines

  • polars (default): fast, exact stats, scales well. Use --polars-chunks to split inputs into N chunks for large datasets; chunked mode uses global histogram bins and yields approximate quantiles.
  • stream: fast and memory-light, approximate quantiles (uses histogram bins).
  • python: exact but slow on large files.
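
To see why histogram-based engines give approximate rather than exact quantiles, here is a minimal sketch. It assumes all inputs are counted into the same bin edges and that values are spread uniformly inside each bin; merge_counts and quantile_from_histogram are hypothetical names, not functions from the script.

```python
def merge_counts(a, b):
    """Add two count vectors that were built on the same bin edges."""
    return [x + y for x, y in zip(a, b)]


def quantile_from_histogram(edges, counts, q):
    """Approximate the q-quantile (0 <= q <= 1) from binned counts.

    Assumes a uniform spread of values inside each bin, so the result
    is only as precise as the bin width allows.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    cumulative = 0
    for left, right, count in zip(edges[:-1], edges[1:], counts):
        if count and cumulative + count >= target:
            frac = (target - cumulative) / count  # position inside this bin
            return left + frac * (right - left)
        cumulative += count
    return edges[-1]


# Two files counted into the same four bins, then merged.
edges = [0, 1, 2, 3, 4]
merged = merge_counts([1, 0, 1, 0], [0, 1, 0, 1])
median = quantile_from_histogram(edges, merged, 0.5)
```

Because only counts are kept, memory use depends on the number of bins rather than the number of points, which is the trade-off behind a higher or lower --bins value.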

Output

In --output-dir with --output-prefix:

  • <prefix>_topology_stats.csv
  • <prefix>_<category>_<scalar>_hist.csv
  • <prefix>_<category>_<scalar>_hist.png (if matplotlib is available and --no-plots is not set)
  • <prefix>_filman_walls_composition.png (composition bar chart for the four focus categories)
  • <prefix>_filman_walls_<scalar>_violin.png (violin plot for the four focus categories)
  • <prefix>_filman_walls_<scalar>_box.png (box plot for the four focus categories)

Categories follow ndtopo_stats.py, including filament-manifold and cluster variants when present. The violin/box/composition plots only include:

  • clusters
  • filament_manifolds_not_clusters
  • walls_not_filament_manifolds_or_clusters
  • unassigned_walls_filament_manifolds_clusters

Histograms use per-category bins by default (--hist-bin-mode per-category) or shared bins per scalar (--hist-bin-mode global). The default histogram scalars are field_value and log_field_value (use --hist-scalars to override). Histogram PNGs include the particle count n and a red dashed median line; histogram CSVs include total_count. Violin plots use raw values if --plots-from-raw is set; otherwise they sample from histogram bins. Box plots are rendered from the aggregated *_topology_stats.csv by default.
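
The difference between the two bin modes can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: edges_by_mode, bin_edges, and percentile are hypothetical helpers, the bins are assumed to be equal-width, and the percentile range is assumed to set the outer bin edges. How the script treats values outside that range is not covered here.

```python
def percentile(sorted_vals, p):
    """Linear-interpolated percentile (p in 0..100) of pre-sorted values."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bin_edges(values, bins, pct_range=(1, 99)):
    """Equal-width edges spanning the given percentile range of `values`."""
    vals = sorted(values)
    lo = percentile(vals, pct_range[0])
    hi = percentile(vals, pct_range[1])
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


def edges_by_mode(values_by_category, bins, mode="per-category", pct_range=(1, 99)):
    """Return {category: edges} for one scalar.

    "global" pools every category and shares a single set of edges;
    "per-category" computes edges from each category's own values.
    """
    if mode == "global":
        pooled = [v for vals in values_by_category.values() for v in vals]
        shared = bin_edges(pooled, bins, pct_range)
        return {cat: shared for cat in values_by_category}
    return {cat: bin_edges(vals, bins, pct_range)
            for cat, vals in values_by_category.items()}
```

Shared edges make histograms of different categories directly comparable bin by bin, while per-category edges give each category the full bin resolution over its own range.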

Examples

Aggregate all crops under a root, fast path:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine polars \
  --hist-bin-mode global \
  --hist-percentile-range 1 99

Use explicit files and skip PNG plots:

python scripts/aggregate_topology_points.py \
  --inputs outputs/crop_a/topology_points.csv outputs/crop_b/topology_points.csv \
  --output-dir outputs/combined \
  --output-prefix combined \
  --no-plots

Streaming mode (approximate quantiles):

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine stream \
  --bins 50

Chunked polars mode (lower memory, approximate quantiles):

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000 \
  --output-dir outputs/quijote_batches_000/combined \
  --output-prefix quijote_batches_000 \
  --engine polars \
  --polars-chunks 4 \
  --hist-bin-mode global \
  --hist-percentile-range 1 99

compare_simulations.py User Guide

Use scripts/compare_simulations.py to compare two simulations side-by-side. It reads one *_topology_stats.csv per simulation (auto-discovered from a folder or its combined/ subdirectory) and, when per-point CSVs are available, reads the individual *_topology_points.csv files across all crops.
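
The auto-discovery described above could look roughly like this. It is a sketch, not the script's code: discover_inputs is a hypothetical helper, and picking the first match when several stats files exist is an assumption.

```python
from pathlib import Path


def discover_inputs(sim_dir):
    """Return (stats_csv_or_None, list_of_point_csvs) for one simulation folder.

    Looks for *_topology_stats.csv in the folder itself, then in combined/.
    Per-point CSVs are collected recursively, skipping anything under combined/.
    """
    sim_dir = Path(sim_dir)
    stats = sorted(sim_dir.glob("*_topology_stats.csv"))
    if not stats:
        stats = sorted((sim_dir / "combined").glob("*_topology_stats.csv"))
    points = sorted(
        p for p in sim_dir.rglob("*_topology_points.csv")
        if "combined" not in p.relative_to(sim_dir).parts
    )
    return (stats[0] if stats else None), points
```

An empty list of per-point files corresponds to the case where only box plots can be produced.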

What it does

  • Reads aggregated topology_stats.csv files from two simulation output folders.
  • Renders side-by-side box plots for each of the four primary categories (clusters, filaments, walls, unassigned) across chosen scalars, with a summary-statistics table below.
  • Renders side-by-side violin plots (requires per-point CSVs) built from actual particle values with equal-area KDE normalization — every violin’s visual area is proportional to its sample size — plus a median line and IQR indicator.
  • Renders two scatter + marginal-density plots (requires per-point CSVs):
    • Particle proximity: x = density^(−1/3) (a proxy for inter-particle separation), y = log₁₀(density).
    • Voronoi cell volume: x = density^(−1) (Voronoi volume proxy), y = log₁₀(density).
    • Each scatter plot includes marginal density curves along both axes.
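
The axis quantities for the two scatter plots follow directly from the formulas above. A minimal sketch for a single particle, assuming the density is a positive number; scatter_axes is a hypothetical name, and which CSV column supplies the density is not specified here.

```python
import math


def scatter_axes(density):
    """Return (proximity, voronoi_volume, log_density) for one density value.

    proximity      = density^(-1/3), a proxy for inter-particle separation
    voronoi_volume = density^(-1),   a proxy for the Voronoi cell volume
    log_density    = log10(density), the shared y-axis of both plots
    """
    if density <= 0:
        raise ValueError("density must be positive")
    proximity = density ** (-1.0 / 3.0)
    voronoi_volume = 1.0 / density
    log_density = math.log10(density)
    return proximity, voronoi_volume, log_density
```

Since both x-axes are monotonic functions of density, each scatter traces the same particles along a different curve; the marginal densities are what differ between the two views.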

Requirements

  • Python with matplotlib, numpy, scipy (for KDE), and polars (or pandas).

CLI

python scripts/compare_simulations.py \
  --sim <SIM_A_DIR> \
  --sim <SIM_B_DIR> \
  [--stats-file <CSV_A> --stats-file <CSV_B>] \
  [--points-file <CSV_A> --points-file <CSV_B>] \
  [--labels "z=3" "z=0"] \
  [--scalars log_field_value field_value] \
  [--plot-sample-size N] \
  [--output-dir <OUTPUT_DIR>] \
  [--output-prefix <PREFIX>] \
  [--font-scale 1.0] \
  [--dpi 150]

Key options

  • --sim — Root output directory for a simulation (repeat exactly twice). The script searches for *_topology_stats.csv directly or inside a combined/ subdirectory, and finds all *_topology_points.csv files under the folder (excluding combined/) for violin and scatter plots.
  • --stats-file — Supply stats CSVs directly (repeat exactly twice) instead of --sim.
  • --points-file — Supply per-point CSVs directly (repeat exactly twice) instead of auto-discovery.
  • --labels — Display names for the two simulations (default: folder basenames). Pass the earlier epoch first (e.g. "z=3" "z=0") — violin plots always show sim A on the right and sim B on the left, so earlier-epoch data appears on the left.
  • --scalars — Which scalar columns to plot (default: all detected scalars).
  • --plot-sample-size — Reservoir-sample at most N rows per simulation when reading per-point CSVs (default: read all rows).
  • --output-dir — Where to write the figure (default: current directory).
  • --output-prefix — Filename prefix for output PNGs (default: comparison).
  • --font-scale — Multiplier for all font sizes (default: 1.0).
  • --dpi — Output image DPI (default: 150).
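
For readers unfamiliar with the reservoir sampling mentioned under --plot-sample-size, here is a minimal sketch of the classic single-pass technique (Algorithm R). Whether the script uses this exact variant is an assumption; reservoir_sample and its seed parameter are hypothetical.

```python
import random


def reservoir_sample(rows, n, seed=0):
    """Keep a uniform random sample of at most `n` items from an iterable.

    Works in one pass without knowing the total row count in advance.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)         # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < n:
                sample[j] = row        # replace with probability n / (i + 1)
    return sample
```

The single pass is what makes this suitable for large per-point CSVs: rows can be streamed from disk and only N of them are held in memory at any time.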

Output files

For each scalar:

  • <prefix>_<scalar>_box.png — side-by-side box plots + stats table
  • <prefix>_<scalar>_violin.png — side-by-side violin plots + stats table (requires per-point CSVs)

Once per run, regardless of the chosen scalars (requires per-point CSVs):

  • <prefix>_scatter_proximity.png — log₁₀-density vs particle proximity scatter + marginal densities
  • <prefix>_scatter_voronoi.png — log₁₀-density vs Voronoi cell volume scatter + marginal densities

Violin plot conventions

  • Each violin uses equal-area KDE normalization: all violins share a single global scale so that visual area ∝ sample density rather than being scaled to a common maximum width per violin.
  • Outliers are trimmed to the 0.1–99.9% range per category before KDE and stats are computed.
  • A horizontal line marks the median; a thick vertical line spans the IQR.
  • Column order: sim B (the second --sim, typically the earlier epoch) is shown on the left; sim A is on the right. Pass --labels in chronological order (earlier first) to match the visual order.
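
The idea of a single global scale can be illustrated with a short sketch. It is only one reading of the convention above: each category's KDE is weighted by its sample count and all curves are divided by the same factor, so a category with twice as many points gets twice the area. The helper names, the Scott's-rule bandwidth, and the max_half_width parameter are assumptions, and the pure-Python KDE stands in for the scipy one the script requires.

```python
import math
import statistics


def gaussian_kde(samples, grid):
    """Evaluate a 1-D Gaussian KDE on `grid` (needs at least two samples)."""
    n = len(samples)
    h = statistics.stdev(samples) * n ** (-1 / 5)  # Scott's rule bandwidth
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
            for x in grid]


def violin_half_widths(samples_by_category, grid, max_half_width=0.4):
    """Return {category: half-widths on grid} using one shared scale."""
    # Count-weighted curves: each integrates to its category's sample count.
    weighted = {cat: [len(s) * d for d in gaussian_kde(s, grid)]
                for cat, s in samples_by_category.items()}
    peak = max(max(curve) for curve in weighted.values())
    scale = max_half_width / peak  # the same factor for every violin
    return {cat: [scale * w for w in curve] for cat, curve in weighted.items()}
```

Scaling every violin to its own maximum width would instead hide how many particles each category contains, which is the behaviour this convention avoids.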

Notes

  • All figures use the same four-category colour scheme as aggregate_topology_points.py (clusters yellow, filaments orange, walls red, unassigned purple).
  • Box plots: simulation A uses a lighter fill with //// hatch; simulation B uses a solid fill.
  • Violin plot colours match the category colour — both redshifts use the same hue.
  • Scatter plots colour points by category and show per-category marginal density curves. The legend uses coloured text placed at roughly the upper-right quadrant of the figure.
  • If no per-point CSVs are found for a simulation, violin and scatter plots are skipped with a warning.

Examples

Compare a z=3 and a z=0 run on log10_field_value:

python scripts/compare_simulations.py \
  --sim outputs/quijote_batches_000_w_clusters \
  --sim outputs/quijote_batches_004_w_clusters_points_6_0 \
  --labels "z=3" "z=0" \
  --scalars log10_field_value \
  --output-dir outputs/comparison_000_vs_004 \
  --output-prefix compare_z3_vs_z0_6_0

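More Examples

Aggregation runs that write the combined/ outputs for the two simulation folders used in the comparison example.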

Redshift 3:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_000_w_clusters \
  --output-dir outputs/quijote_batches_000_w_clusters/combined \
  --output-prefix quijote_batches_000_w_clusters \
  --engine polars \
  --polars-chunks 4 \
  --log10-field-value \
  --violin-scalar log10_field_value \
  --hist-bin-mode global \
  --hist-percentile-range 1 99 \
  --plot-percentile-range .1 99.9 \
  --plot-fontscale 1.2 \
  --plot-dpi 600

Redshift 0:

python scripts/aggregate_topology_points.py \
  --root outputs/quijote_batches_004_w_clusters_points_6_0 \
  --output-dir outputs/quijote_batches_004_w_clusters_points_6_0/combined \
  --output-prefix quijote_batches_004_w_clusters_points_6_0 \
  --engine polars \
  --polars-chunks 4 \
  --log10-field-value \
  --violin-scalar log10_field_value \
  --hist-bin-mode global \
  --hist-percentile-range 1 99 \
  --plot-percentile-range .1 99.9 \
  --plot-fontscale 1.2 \
  --plot-dpi 600