ddf-pipeline
Comparison between new DDF/kMS and old DDF/kMS
This issue is for the final checks of the new DDF/kMS before switching LoTSS processing to use the new rather than old versions.
6 LoTSS fields (P030+51, P054+36, P216+15, P269+63, P316+81 and P347+08) are being processed using the same parset and facet layout with the new and old versions of DDF/kMS.
I shall update this ticket as the fields finish processing. The new maps are stored in:
/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P???+??
Whereas the old ones are in:
/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/comparison-maps/P???+??
The first run to produce final maps is P054+36. Attached is a comparison image_full_ampphase_di_m.NS.app.restored.fits. Left is with new kMS/DDF (/home/tasse/DDFSingularity/ddf_dev.sif and using the pipeline version in this pull request - https://github.com/mhardcastle/ddf-pipeline/pull/336) and right is the old version (from our LoTSS archive) that was made a few months ago using the old kMS/DDF (/data/lofar/mjh/ddf_py3_d11_ctmaster.sif).
When blinking between the images it is pretty tough to see any difference at all.
Comparison between final P316+81 maps. Left new and right old.
Here the new version is arguably a little worse, with negative holes around some of the brighter sources. The rms on the new map is a few per cent higher too (probably due to artefacts from the negative holes).
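For reference, a minimal sketch of how the rms difference could be quantified, assuming astropy and numpy are available; the filenames and the clipping threshold are placeholders rather than the ddf-pipeline quality-check code:

```python
# Rough rms comparison of two restored maps (placeholder filenames).
import numpy as np
from astropy.io import fits

def map_rms(fname, clip_sigma=5.0, niter=10):
    """Iteratively sigma-clipped rms of a FITS image."""
    data = fits.getdata(fname).squeeze()
    data = data[np.isfinite(data)]
    for _ in range(niter):
        rms = data.std()
        data = data[np.abs(data) < clip_sigma * rms]
    return rms

rms_new = map_rms("new/image_full_ampphase_di_m.NS.app.restored.fits")
rms_old = map_rms("old/image_full_ampphase_di_m.NS.app.restored.fits")
print("new/old rms ratio: %.3f" % (rms_new / rms_old))
```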
That's bad... Needs investigation... Do you see the holes in all facets? At what step do the differences appear? It could be the smoothing; in that case we could try to smooth the new solutions with the old SmoothSols.py and see what happens.
Hmmm it might be something to do with the DDS3_full_slow_merged solutions. The holes seem to appear (well a bit hard to tell for sure because of differences in artefacts) for the first time in the final image (image_full_ampphase_di_m.NS) and are not really apparent in the previous image_full_ampphase_di_m map.
Left is the image_full_ampphase_di_m.NS and right is the image_full_ampphase_di_m map. Colourscale is the same.
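To localise where the holes first appear, the two maps could simply be differenced; a minimal sketch, assuming both images share the same pixel grid (they should within a single run) and using placeholder filenames:

```python
# Difference the .NS map against the previous map to localise the negative holes
# (placeholder filenames; assumes identical image geometry, as within one run).
import numpy as np
from astropy.io import fits

ns = fits.getdata("image_full_ampphase_di_m.NS.app.restored.fits").squeeze()
di = fits.getdata("image_full_ampphase_di_m.app.restored.fits").squeeze()

diff = ns - di
fits.writeto("NS_minus_di.diff.fits", diff, overwrite=True)

# Flag the strongest negative residuals as candidate hole positions.
ypix, xpix = np.where(diff < -10 * np.nanstd(diff))
print("pixels below -10x the rms of the difference map:", len(xpix))
```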
P030+51 is giving a stranger issue in that the bootstrap image comes out very oddly (see below).
The two low-declination fields both hang on the first kMS call for some reason. I've tried clearing the cache and using different nodes but they hang at 99% of the kMS run on the first msfile. Odd.
Well that's a killms fail at the first dd stage... but presumably works OK in the old pipeline? or not?
Yeah, I don't have those images from the old pipeline, but it must have worked OK because the bootstrap mask images are propagated, and for my new run those are pretty bad.
Indeed, I think something is wrong with the amplitude solutions from kMS, but I will do a test to check.
OK, so reimaging the bootstrap step with just P applied rather than AP more or less solves the issue. So the amplitude solutions are indeed really poor.
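A quick way to eyeball the amplitude solutions would be something like the sketch below. The SOLSDIR location and the DDS0 solution name come from the imaging command quoted later in this thread, but the npz filename pattern and the 'Sols'/'G' key names are assumptions; check np.load(...).files for the actual layout.

```python
# Quick look at the kMS amplitude solutions. Sketch only: the npz filename
# pattern and the 'Sols'/'G' keys are assumptions -- check np.load(f).files.
import glob
import numpy as np
import matplotlib.pyplot as plt

solfile = sorted(glob.glob("SOLSDIR/*/killMS.DDS0.sols.npz"))[0]  # assumed layout
sols = np.load(solfile, allow_pickle=True)["Sols"]
gains = sols["G"]                       # assumed shape (ntime, nfreq, nant, ndir, 2, 2)

amp = np.abs(gains[..., 0, 0])          # XX amplitudes
med_amp = np.median(amp, axis=(1, 2))   # median over freq and antenna -> (ntime, ndir)

plt.plot(med_amp)                       # one line per direction
plt.xlabel("solution interval")
plt.ylabel("median |G_xx|")
plt.savefig("kms_amp_check.png")
```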
Looking at other data products I also see that for these fields the image_dirin_SSD is better than the image_dirin_SSD_m_c_di, so there are still issues at the DI solving step. I'll do a direct comparison with the old pipeline so we can compare these early pipeline products.
Hmm.
So the bootstrap issue seems to be partly due to the fast dt_di that we use in the parset. Letting dt_di be chosen automatically gives a slower dt_di and seems to produce better results (I suspect in some cases it would be better still with somewhat longer intervals), with fewer diverging facets. I will continue the runs with an automatically chosen dt_di.
Note that for some reason the old kMS did not suffer as significantly when dt_di was very fast (still not good though, with the image_dirin_SSD generally being better than the image_dirin_SSD_m_c_di).
P269+63 - no major differences for this one between the new and old kMS/DDF versions (a different dt_di was used in the older reduction due to ticket #315).
So to-date:
P269+63 & P054+36 approximately comparable between new and old versions.
P216+15 -- having issues running so no results yet.
All the other fields get worse unfortunately.
P030+51 - big differences at the bootstrap step with new kMS apparently producing much worse solutions from the DI maps.
P316+81 - negative holes in new kMS/DDF final maps but not in old one.
P347+08 - bright source away from centre causes much bigger issues in new DDF/kMS compared to old one. Resulting image is far noisier.
The simplest field to test things on seems to be P030+51 because it goes wrong so early in the pipeline. I can try e.g. using the old kMS on the run with the new DDF and seeing the results, so we have a really clean comparison with just a single kMS step (no SmoothSols or DDF pipeline). Then I think it might require @cyriltasse to take a look at kMS.
If anyone has suggestions for things to try let me know.
For P030+51 I've put data in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-NEWSOFT and /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-OLDSOFT to investigate why the image_bootstrap maps are so different for this field when changing from the old to the new software.
These runs were stopped soon after the bootstrap imaging, so they contain all the appropriate weights. The attachment shows images made using the command below; the left is the new-software dirty image and the right is the old-software dirty image:
DDF.py --Misc-ConserveMemory=1 --Output-Name=image_test --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA_DI_CORRECTED --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=2 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="IMAGING_WEIGHT" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=5.00 --Mask-External=image_dirin_SSD_m_c_di_m.app.restored.fits.mask.fits --DDESolutions-GlobalNorm=None --DDESolutions-DDModeGrid=AP --DDESolutions-DDModeDeGrid=AP --DDESolutions-DDSols=DDS0 --Predict-InitDicoModel=image_dirin_SSD_m_c_di_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10
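To double-check that the two runs really carry identical weights, the IMAGING_WEIGHT columns (named in the command above) could be compared directly; a sketch using python-casacore with placeholder MS paths:

```python
# Check that the old- and new-software runs carry the same IMAGING_WEIGHT
# (placeholder MS paths; assumes python-casacore is available).
import numpy as np
from casacore.tables import table

ms_new = "P030+51-SAMEDICO-NEWSOFT/some.ms"   # placeholder
ms_old = "P030+51-SAMEDICO-OLDSOFT/some.ms"   # placeholder

w_new = table(ms_new, ack=False).getcol("IMAGING_WEIGHT")
w_old = table(ms_old, ack=False).getcol("IMAGING_WEIGHT")

frac = np.abs(w_new - w_old) / np.maximum(np.abs(w_old), 1e-30)
print("max fractional weight difference:", np.nanmax(frac))
```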
Hi -
So after 10 days of intense debugging I managed to understand what was going on, and now the new and old versions of killMS give virtually the same results (left/right are old/new).
There were two bugs:
- due to a misplaced if statement, the adaptive Q matrices were updated based on the iteration level rather than the recursion level
- when using the asynchronous predict, the rms used to estimate the noise matrix was based on t-1 instead of t (a toy illustration of this class of off-by-one follows below)
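Purely as a toy illustration of the second class of bug (an adaptive noise estimate that lags one step behind the data it is meant to describe), and emphatically not the actual killMS code:

```python
# Toy illustration only -- not killMS code. A noise estimate driven by the
# residual at t-1 instead of t lags the data: when the true noise jumps, the
# lagged estimate is wrong by the full jump for a step, so bad data slips
# through whatever uses that noise matrix.
import numpy as np

rng = np.random.default_rng(0)
true_sigma = np.concatenate([0.1 * np.ones(50), 1.0 * np.ones(50)])  # noise jumps at t=50
residuals = rng.normal(0.0, true_sigma)

for t in range(1, len(residuals)):
    correct = residuals[t] ** 2       # noise estimate from the current residual
    buggy = residuals[t - 1] ** 2     # bug: uses the previous time step
    if t == 50:                       # the step where the noise level changes
        print("t=50  correct estimate: %.3g   lagged estimate: %.3g" % (correct, buggy))
```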
The new kMS has many more functionalities and is faster than the old one by ~30-40%.
I'll design a kMS test function to do this more automatically in the future.
Ok, I've built the new singularity image: /home/tasse/DDFSingularity/ddf_dev.sif
@twshimwell ready for testing
Let me know if you need some nodes @twshimwell .
Great stuff Cyril. @mhardcastle I have 6 test fields and I already have node066. If possible, can I either get another 5 nodes so I can do all 6 fields in parallel, or alternatively 2 more nodes and then I'll do two batches of 3 runs? Many thanks.
I've reserved nodes001-005 for you, so go crazy (but remind me to remove the reservation again afterwards!).
Thanks. Just restarting all the runs. Should have some results by Monday hopefully. I had to choose a few different fields due to LINC reprocessing near A-team sources. Anyway, that shouldn't really matter.
Well - not sure why but I don't see it being faster now, it's just the same speed... I'll dig a bit.
Well, I guess one starts to see differences in computing time when the datasets are bigger...
We can look out for this in Tim's tests; he's using the same type of machine for both, and the timing info is all in the logs and can be extracted by the plotting code. Although timing differences of tens of per cent can easily be caused just by some process hitting the storage. I have 26 ddf-pipeline instances running at Herts right now, so things may be a bit slow.
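As a rough cross-check of the speed claim, the wall-clock time of a step can also be estimated from the leading HH:MM:SS timestamps in its log; a sketch, assuming that timestamp format (as in the log excerpt quoted later in this thread) and not the ddf-pipeline plotting code:

```python
# Estimate the elapsed wall-clock time of a DDF/kMS step from its log.
# Assumes log lines start with an HH:MM:SS timestamp; this is a sketch,
# not the ddf-pipeline plotting code.
import re
import sys
from datetime import timedelta

stamp = re.compile(r"^(\d{2}):(\d{2}):(\d{2}) ")

def elapsed(logfile):
    times = []
    with open(logfile, errors="ignore") as f:
        for line in f:
            m = stamp.match(line)
            if m:
                h, mn, s = (int(g) for g in m.groups())
                times.append(timedelta(hours=h, minutes=mn, seconds=s))
    if len(times) < 2:
        return None
    dt = times[-1] - times[0]
    if dt.total_seconds() < 0:        # crude handling of a midnight crossing
        dt += timedelta(days=1)
    return dt

print(elapsed(sys.argv[1]))
```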
Hmmm, I'm getting DDFacet issues on the image_dirin_SSD_m_c image (so the first one with the clustering). The command run is e.g.
DDF.py --Misc-ConserveMemory=1 --Output-Name=image_dirin_SSD_m_c --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=1 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=0.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="None" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Facets-CatNodes=/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/uv-misc-files/P090+77/image_dirin_SSD_m.npy.ClusterCat.npy --Mask-Auto=1 --Mask-SigTh=15.00 --Mask-External=image_dirin_SSD.app.restored.fits.mask.fits --Predict-InitDicoModel=image_dirin_SSD_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10
The error looks like:
17:23:57 - AsyncProcessPool [1.2/6.2 3.1/11.2 115.7Gb] process comp31: exception raised processing job buildpsf:88: Traceback (most recent call last):
  File "/usr/local/src/DDFacet/DDFacet/Other/AsyncProcessPool.py", line 873, in _dispatch_job
    result = call(*args, **kwargs)
  File "/usr/local/src/DDFacet/DDFacet/Imager/ClassFacetMachine.py", line 995, in _buildFacetSlice_worker
    psf[ch][pol] = psf[ch][pol].T[::-1, :]
ValueError: could not broadcast input array from shape (3675,3465) into shape (3465,3675)
Build PSF facet slices......... 87/98 [==================================== ] 88% - 0'01"
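The ValueError itself is just numpy refusing to broadcast the transpose of a non-square slice back into the original slot; a minimal reproduction with the shapes taken from the traceback (nothing DDFacet-specific):

```python
# Minimal reproduction: assigning the transpose of a non-square facet slice
# back into its original slot cannot broadcast, exactly as in the traceback.
import numpy as np

psf = np.zeros((1, 1, 3465, 3675))            # shapes taken from the traceback
ch, pol = 0, 0
try:
    psf[ch][pol] = psf[ch][pol].T[::-1, :]    # (3675, 3465) into (3465, 3675)
except ValueError as e:
    print(e)
# The same line is harmless when the slice happens to be square, which may be
# why only some facets/fields trigger it.
```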
(The previous singularity image I was using with the newer software was /beegfs/car/mjh/DDFSingularity/ddf.sif, which I believe is built from /beegfs/car/mjh/DDFSingularity/ddf-py3.singularity. With that image, DDFacet did not seem to have this issue. Also, I am using the latest version of the ddf-pipeline master branch, which is fine for the new singularity image.)
That's just a copy of Cyril's directory, so shouldn't be any different in itself.
Is this a clean run from the start?
Oh strange. Yes, these are all clean runs from the start. The only thing different from normal runs is that I specify the clusterfile so that we have the same directions as the old comparison runs.
@twshimwell can you point me to where the data is?
It's in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs. There is a folder for each field, and the pipeline parset files are in the /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/config-files folder.
An example logfile with the error is /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/P030+51/image_dirin_SSD_m_c.log
It's weird because I was working on P030+51 too and I did not have this error... For the clusterfile I just copied the one from the old processing into the new directory and let the pipeline skip the computation, but I can hardly see how this could cause it.
Ah OK. I wonder why I see the issue now then. I see it on all the fields I'm trying. So you simply copied the clusterfile and not any of the image_dirin_SSD_m_c products?