aggregate_topology_points.py User Guide
Use scripts/aggregate_topology_points.py to combine many *_topology_points.csv files into:
- A single aggregated topology_stats CSV (same shape as per-crop stats).
- Per-category histograms for selected scalars (CSV, plus PNG if matplotlib is available).
- Composition barchart + violin/box plots for four focus categories (clusters, filament-manifolds not clusters, walls not filaments/clusters, and unassigned).
This is designed for large datasets (many crops and large per-point files).
What it does
- Scans for *_topology_points.csv files (or uses explicit inputs).
- Aggregates category statistics across all crops.
- Writes histogram CSVs (and optional PNGs) per category and scalar.
- Optionally renders composition barchart + violin/box plots.
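The aggregation step can be pictured as a single pass that accumulates running statistics per category across every input file. The following is a minimal sketch of that idea, not the script's actual code; the column names category and field_value and the helper aggregate_category_stats are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_category_stats(csv_streams, category_col="category", value_col="field_value"):
    """Accumulate count, sum, min, and max per category across many CSV streams.

    The column names are hypothetical; adjust them to the real per-point schema.
    """
    stats = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for stream in csv_streams:
        for row in csv.DictReader(stream):
            value = float(row[value_col])
            entry = stats[row[category_col]]
            entry["count"] += 1
            entry["sum"] += value
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    # Derive the mean once all files have been consumed.
    return {name: {**entry, "mean": entry["sum"] / entry["count"]} for name, entry in stats.items()}


# Two tiny in-memory "crops" standing in for *_topology_points.csv files.
crop_a = io.StringIO("category,field_value\nclusters,2.0\nunassigned,0.5\n")
crop_b = io.StringIO("category,field_value\nclusters,4.0\n")
combined = aggregate_category_stats([crop_a, crop_b])
print(combined["clusters"]["mean"])
```

Because only running totals are kept, memory use depends on the number of categories rather than the number of points, which is the property that matters for large per-point files.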
Requirements
- Python in your active environment.
- Optional but recommended:
polars(fast path, default). - Optional:
matplotlibfor PNG histogram plots.
CLI
python scripts/aggregate_topology_points.py \
--root <OUTPUT_ROOT> \
--output-dir <COMBINED_DIR> \
--output-prefix <PREFIX> \
[--glob "**/*_topology_points.csv"] \
[--inputs file1.csv file2.csv ...] \
[--bins 100] \
[--hist-scalars field_value log_field_value|all] \
[--hist-bin-mode per-category|global] \
[--hist-percentile-range 1 99] \
[--engine polars|stream|python] \
[--polars-chunks N] \
[--violin-scalar log_field_value] \
[--box-scalar field_value] \
[--plots-from-raw] \
[--plot-sample-size 200000] \
[--plot-fontscale 1.0] \
[--plot-dpi 200] \
[--no-plots]
Engines
- polars (default): fast, exact stats, scales well. Use --polars-chunks to split inputs into N chunks for large datasets; chunked mode uses global histogram bins and yields approximate quantiles.
- stream: fast and memory-light, approximate quantiles (uses histogram bins).
- python: exact but slow on large files.
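To see why histogram-based engines yield approximate quantiles, consider how a quantile can be recovered from bin counts alone. This sketch assumes linear interpolation inside the bin that contains the target rank; the function name quantile_from_histogram is hypothetical and the script may use a different rule.

```python
def quantile_from_histogram(edges, counts, q):
    """Approximate the q-quantile (0 <= q <= 1) from histogram bins.

    Walks the cumulative counts and interpolates linearly inside the bin
    where the target rank falls. Accuracy is limited by the bin width.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= target:
            fraction = (target - cumulative) / count
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        cumulative += count
    return edges[-1]


edges = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [10, 20, 40, 30]
median = quantile_from_histogram(edges, counts, 0.5)
print(median)
```

The error of such an estimate is bounded by the width of one bin, so raising --bins tightens the approximation at the cost of larger histogram files.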
Output
In --output-dir with --output-prefix:
- <prefix>_topology_stats.csv
- <prefix>_<category>_<scalar>_hist.csv
- <prefix>_<category>_<scalar>_hist.png (if matplotlib is available and --no-plots is not set)
- <prefix>_filman_walls_composition.png (composition barchart for the four focus categories)
- <prefix>_filman_walls_<scalar>_violin.png (violin plot for the four focus categories)
- <prefix>_filman_walls_<scalar>_box.png (box plot for the four focus categories)
Categories follow ndtopo_stats.py, including filament‑manifold and cluster variants when present. The violin/box/composition plots only include:
- clusters
- filament_manifolds_not_clusters
- walls_not_filament_manifolds_or_clusters
- unassigned_walls_filament_manifolds_clusters
Histograms use per-category bins by default (--hist-bin-mode per-category) or shared bins per scalar (--hist-bin-mode global). The default histogram scalars are field_value and log_field_value (use --hist-scalars to override). Histogram PNGs include the particle count n and a red dashed median line; histogram CSVs include total_count. Violin plots use raw values if --plots-from-raw is set; otherwise they sample from histogram bins. Box plots are rendered from the aggregated *_topology_stats.csv by default.
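The difference between the two bin modes can be illustrated with a small sketch. It assumes the bin range is taken from the lower and upper percentiles given by --hist-percentile-range and that bins are equal-width; the helper names percentile, bin_edges, and edges_by_mode are hypothetical and do not come from the script.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (p / 100.0) * (len(sorted_values) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def bin_edges(values, bins, p_lo=1.0, p_hi=99.0):
    """Equal-width bin edges spanning the [p_lo, p_hi] percentile range."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, p_lo), percentile(ordered, p_hi)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


def edges_by_mode(values_by_category, bins, mode="per-category"):
    """Return bin edges per category, either individual or shared ("global")."""
    if mode == "global":
        pooled = [v for vals in values_by_category.values() for v in vals]
        shared = bin_edges(pooled, bins)
        return {name: shared for name in values_by_category}
    return {name: bin_edges(vals, bins) for name, vals in values_by_category.items()}


values = {
    "clusters": [float(v) for v in range(101)],
    "unassigned": [float(v) for v in range(200, 301)],
}
per_category = edges_by_mode(values, bins=2, mode="per-category")
shared = edges_by_mode(values, bins=2, mode="global")
print(per_category["clusters"], shared["clusters"])
```

Shared edges make histograms directly comparable across categories, while per-category edges give each category full resolution over its own range.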
Examples
Aggregate all crops under a root, fast path:
python scripts/aggregate_topology_points.py \
--root outputs/quijote_batches_000 \
--output-dir outputs/quijote_batches_000/combined \
--output-prefix quijote_batches_000 \
--engine polars \
--hist-bin-mode global \
--hist-percentile-range 1 99
Use explicit files and skip PNG plots:
python scripts/aggregate_topology_points.py \
--inputs outputs/crop_a/topology_points.csv outputs/crop_b/topology_points.csv \
--output-dir outputs/combined \
--output-prefix combined \
--no-plots
Streaming mode (approximate quantiles):
python scripts/aggregate_topology_points.py \
--root outputs/quijote_batches_000 \
--output-dir outputs/quijote_batches_000/combined \
--output-prefix quijote_batches_000 \
--engine stream \
--bins 50
Chunked polars mode (lower memory, approximate quantiles):
python scripts/aggregate_topology_points.py \
--root outputs/quijote_batches_000 \
--output-dir outputs/quijote_batches_000/combined \
--output-prefix quijote_batches_000 \
--engine polars \
--polars-chunks 4 \
--hist-bin-mode global \
--hist-percentile-range 1 99
compare_simulations.py User Guide
Use scripts/compare_simulations.py to compare two simulations side-by-side. It reads one *_topology_stats.csv per simulation (auto-discovered from a folder or its combined/ subdirectory) and, when per-point CSVs are available, reads the individual *_topology_points.csv files across all crops.
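The discovery rule described above can be sketched with pathlib. This is an illustration under assumptions, not the script's implementation: the helpers find_stats_file and find_points_files are hypothetical, and picking the first match in sorted order is a choice made here for determinism.

```python
import tempfile
from pathlib import Path


def find_stats_file(sim_dir):
    """Return a *_topology_stats.csv from sim_dir, else from sim_dir/combined, else None."""
    sim_dir = Path(sim_dir)
    for folder in (sim_dir, sim_dir / "combined"):
        matches = sorted(folder.glob("*_topology_stats.csv"))
        if matches:
            return matches[0]
    return None


def find_points_files(sim_dir):
    """Return all *_topology_points.csv files under sim_dir, skipping combined/."""
    sim_dir = Path(sim_dir)
    combined = sim_dir / "combined"
    return sorted(
        p for p in sim_dir.rglob("*_topology_points.csv") if combined not in p.parents
    )


# Build a throwaway folder layout to exercise both helpers.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "combined").mkdir()
    (root / "crop_a").mkdir()
    (root / "combined" / "run_topology_stats.csv").touch()
    (root / "combined" / "run_topology_points.csv").touch()
    (root / "crop_a" / "crop_a_topology_points.csv").touch()
    stats_name = find_stats_file(root).name
    points_names = [p.name for p in find_points_files(root)]

print(stats_name, points_names)
```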
What it does
- Reads aggregated topology_stats.csv files from two simulation output folders.
- Renders side-by-side box plots for each of the four primary categories (clusters, filaments, walls, unassigned) across chosen scalars, with a summary-statistics table below.
- Renders side-by-side violin plots (requires per-point CSVs) built from actual particle values with equal-area KDE normalization — every violin’s visual area is proportional to its sample size — plus a median line and IQR indicator.
- Renders two scatter + marginal-density plots (requires per-point CSVs):
- Particle proximity: x = density^(−1/3) (a proxy for inter-particle separation), y = log₁₀(density).
- Voronoi cell volume: x = density^(−1) (Voronoi volume proxy), y = log₁₀(density).
- Each scatter plot includes marginal density curves along both axes.
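The two x-axis proxies and the shared y-axis follow directly from the density value of each particle. A minimal sketch of the arithmetic, assuming strictly positive densities (the function names are hypothetical):

```python
import math


def proximity_proxy(density):
    """Inter-particle separation proxy: density^(-1/3)."""
    return density ** (-1.0 / 3.0)


def voronoi_volume_proxy(density):
    """Voronoi cell volume proxy: density^(-1)."""
    return 1.0 / density


def scatter_coordinates(densities):
    """Return (proximity_xs, volume_xs, ys), skipping non-positive densities."""
    valid = [d for d in densities if d > 0]
    proximity_xs = [proximity_proxy(d) for d in valid]
    volume_xs = [voronoi_volume_proxy(d) for d in valid]
    ys = [math.log10(d) for d in valid]
    return proximity_xs, volume_xs, ys


proximity_xs, volume_xs, ys = scatter_coordinates([8.0, 1.0, 0.0])
print(proximity_xs, volume_xs, ys)
```

A density of 8 maps to a separation proxy of 0.5 and a volume proxy of 0.125, so denser regions sit toward the left of both plots and higher on the y-axis.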
Requirements
- Python with matplotlib, numpy, scipy (for KDE), and polars (or pandas).
CLI
python scripts/compare_simulations.py \
--sim <SIM_A_DIR> \
--sim <SIM_B_DIR> \
[--stats-file <CSV_A> --stats-file <CSV_B>] \
[--points-file <CSV_A> --points-file <CSV_B>] \
[--labels "z=3" "z=0"] \
[--scalars log_field_value field_value] \
[--plot-sample-size N] \
[--output-dir <OUTPUT_DIR>] \
[--output-prefix <PREFIX>] \
[--font-scale 1.0] \
[--dpi 150]
Key options
- --sim: Root output directory for a simulation (repeat exactly twice). The script searches for *_topology_stats.csv directly or inside a combined/ subdirectory, and finds all *_topology_points.csv files under the folder (excluding combined/) for violin and scatter plots.
- --stats-file: Supply stats CSVs directly (repeat exactly twice) instead of --sim.
- --points-file: Supply per-point CSVs directly (repeat exactly twice) instead of auto-discovery.
- --labels: Display names for the two simulations (default: folder basenames). Pass the earlier epoch first (e.g. "z=3" "z=0"). Violin plots always show sim A on the right and sim B on the left, so earlier-epoch data appears on the left.
- --scalars: Which scalar columns to plot (default: all detected scalars).
- --plot-sample-size: Reservoir-sample at most N rows per simulation when reading per-point CSVs (default: read all rows).
- --output-dir: Where to write the figure (default: current directory).
- --output-prefix: Filename prefix for output PNGs (default: comparison).
- --font-scale: Multiplier for all font sizes (default: 1.0).
- --dpi: Output image DPI (default: 150).
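Reservoir sampling, as used by --plot-sample-size, keeps a uniform random sample of at most N rows while reading the input once. The sketch below uses the classic Algorithm R; whether the script uses this exact variant is an assumption, and the function name reservoir_sample and the seed parameter are hypothetical.

```python
import random


def reservoir_sample(rows, n, seed=0):
    """Return a uniform random sample of at most n items from an iterable.

    Algorithm R: fill the reservoir with the first n items, then replace a
    random slot with decreasing probability as more items arrive.
    """
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < n:
            sample.append(row)
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < n:
                sample[j] = row
    return sample


small = reservoir_sample(range(5), 10)
large = reservoir_sample(range(1000), 10)
print(small, large)
```

When the input has fewer than N rows, every row is kept, which matches the "at most N" wording of the option.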
Output files
For each scalar:
- <prefix>_<scalar>_box.png: side-by-side box plots + stats table
- <prefix>_<scalar>_violin.png: side-by-side violin plots + stats table (requires per-point CSVs)
Always (requires per-point CSVs):
- <prefix>_scatter_proximity.png: log₁₀-density vs particle proximity scatter + marginal densities
- <prefix>_scatter_voronoi.png: log₁₀-density vs Voronoi cell volume scatter + marginal densities
Violin plot conventions
- Each violin uses equal-area KDE normalization: all violins share a single global scale so that visual area ∝ sample density rather than being scaled to a common maximum width per violin.
- Outliers are trimmed to the 0.1–99.9% range per category before KDE and stats are computed.
- A horizontal line marks the median; a thick vertical line spans the IQR.
- Column order: sim B (the second --sim, typically the earlier epoch) is shown on the left; sim A is on the right. Pass --labels in chronological order (earlier first) to match the visual order.
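The trimming and summary markers described above can be sketched with the standard library. This example covers only the 0.1–99.9% trim, the median, and the IQR; it does not reproduce the KDE or the global scaling, and the helper names trim_outliers and violin_summary are hypothetical. It assumes the "inclusive" quantile method, which may differ from the script's percentile rule.

```python
import statistics


def trim_outliers(values):
    """Keep values inside the 0.1-99.9 percentile range (inclusive method)."""
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    lo, hi = cuts[0], cuts[-1]  # 0.1% and 99.9% cut points
    return [v for v in values if lo <= v <= hi]


def violin_summary(values):
    """Return the count, median, and quartiles of the trimmed values."""
    trimmed = trim_outliers(values)
    q1, median, q3 = statistics.quantiles(trimmed, n=4, method="inclusive")
    return {"n": len(trimmed), "median": median, "q1": q1, "q3": q3}


# 0..999 plus one extreme outlier that the trim should drop.
category_values = list(range(1000)) + [10**6]
summary = violin_summary(category_values)
print(summary)
```

Trimming before computing the statistics keeps a single extreme value from stretching the violin axis or shifting the quartile markers.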
Notes
- The figure uses the same four-category colour scheme as aggregate_topology_points.py (clusters yellow, filaments orange, walls red, unassigned purple).
- Box plots: simulation A uses a lighter fill with //// hatch; simulation B uses a solid fill.
- Violin plot colours match the category colour: both redshifts use the same hue.
- Scatter plots colour points by category and show per-category marginal density curves. The legend uses coloured text placed at roughly the upper-right quadrant of the figure.
- If no per-point CSVs are found for a simulation, violin and scatter plots are skipped with a warning.
Examples
python scripts/compare_simulations.py \
--sim outputs/quijote_batches_000_w_clusters \
--sim outputs/quijote_batches_004_w_clusters_points_6_0 \
--labels "z=3" "z=0" \
--scalars log10_field_value \
--output-dir outputs/comparison_000_vs_004 \
--output-prefix compare_z3_vs_z0_6_0
More Examples
Redshift 3:
python scripts/aggregate_topology_points.py \
--root outputs/quijote_batches_000_w_clusters \
--output-dir outputs/quijote_batches_000_w_clusters/combined \
--output-prefix quijote_batches_000_w_clusters \
--engine polars \
--polars-chunks 4 \
--log10-field-value \
--violin-scalar log10_field_value \
--hist-bin-mode global \
--hist-percentile-range 1 99 \
--plot-percentile-range .1 99.9 \
--plot-fontscale 1.2 \
--plot-dpi 600
Redshift 0:
python scripts/aggregate_topology_points.py \
--root outputs/quijote_batches_004_w_clusters_points_6_0 \
--output-dir outputs/quijote_batches_004_w_clusters_points_6_0/combined \
--output-prefix quijote_batches_004_w_clusters_points_6_0 \
--engine polars \
--polars-chunks 4 \
--log10-field-value \
--violin-scalar log10_field_value \
--hist-bin-mode global \
--hist-percentile-range 1 99 \
--plot-percentile-range .1 99.9 \
--plot-fontscale 1.2 \
--plot-dpi 600