ATISS
Reproduction Discrepancies: High FID, Overfitting Gives Nicer Results
Hello authors, first of all, thank you for open-sourcing your code and for the detailed paper! We are a team of beginner undergraduate students, and we have the following questions about reproducing the ATISS results:
1. FID much higher than reported (~95–115 vs. 35)
- We used your `compute_fid.py` unmodified on 256×256 photorealistic, top-down renders (after mesh retrieval) for both the real test scenes and our 1,000 generated test layouts.
- We repeated the sampling 10× and averaged the FID.
- We get FID in the range 95–115, despite matching your classifier accuracy and category KL divergence metrics (ours are even slightly better).
- We also tried using the same floor texture for the real and synthetic scenes generated from the same floor plan.
Could we be missing any rendering or preprocessing detail?
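For reference, here is roughly what our averaging loop looks like (a minimal sketch; the directory paths are placeholders, and we use the `clean-fid` package here only as an independent cross-check of `compute_fid.py`):

```python
# Sketch of our FID evaluation loop. Paths are placeholders; in practice we
# call your compute_fid.py and use clean-fid only to cross-check the numbers.
import numpy as np
from cleanfid import fid

REAL_DIR = "renders/real_test"             # 256x256 top-down renders of real test scenes
FAKE_DIR_TMPL = "renders/fake_run_{:02d}"  # one directory per sampling repetition

scores = []
for run in range(10):
    # Each run directory holds renders of 1,000 freshly sampled layouts.
    scores.append(fid.compute_fid(REAL_DIR, FAKE_DIR_TMPL.format(run)))

print(f"FID: {np.mean(scores):.1f} +/- {np.std(scores):.1f}")
```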
2. Validation loss “overfits” yet sample quality remains good
- The validation loss decreases for only ~50 epochs, then rises until it is worse than at random initialization.
- Yet scenes sampled from the "overfitted" checkpoint contain plausible layouts and correct furniture counts, whereas the lowest-validation-loss checkpoint often produces near-empty rooms.
- We also computed the evaluation metrics for the overfitted checkpoint; they are similar to those of the minimum-validation-loss checkpoint.
How would you suggest we interpret this, and which checkpoint should we use for evaluation?
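In case our measurement differs from yours, this is roughly how we compute the per-epoch validation loss (a minimal sketch; `model.compute_loss` and `val_loader` stand in for the corresponding pieces of our training code):

```python
import torch

@torch.no_grad()
def validation_loss(model, val_loader, device="cuda"):
    """Average the per-batch NLL over the validation set once per epoch."""
    model.eval()
    total, n_batches = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Placeholder for the repo's loss computation; if object orderings
        # are randomly permuted here as in training, this estimate is noisy.
        total += model.compute_loss(batch).item()
        n_batches += 1
    return total / n_batches
```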
3. Teacher forcing
- Why did you decide not to use teacher forcing?
- What would we need to change (in the loss or elsewhere) to train with teacher forcing? A sketch of what we mean follows below.
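To make the question concrete, this is the kind of teacher-forced update we have in mind (a generic autoregressive sketch, not your code; the forward pass and loss are placeholders):

```python
import torch.nn.functional as F

def teacher_forced_step(model, seq, optimizer):
    # Generic teacher forcing: every prediction is conditioned on the
    # ground-truth prefix rather than on the model's own samples.
    # `seq` is a (batch, length, features) tensor of ground-truth objects.
    inputs, targets = seq[:, :-1], seq[:, 1:]
    preds = model(inputs)              # placeholder forward pass
    loss = F.mse_loss(preds, targets)  # placeholder; the real objective
                                       # would be the model's NLL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```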
Thank you!