
Comparison between new DDF/kMS and old DDF/kMS

Open twshimwell opened this issue 1 year ago • 76 comments

This issue is for the final checks of the new DDF/kMS before switching LoTSS processing to use the new rather than old versions.

6 LoTSS fields (P030+51, P054+36, P216+15, P269+63, P316+81 and P347+08) are being processed using the same parset and facet layout with the new and old versions of DDF/kMS.

I shall update this ticket as the fields finish processing. The new maps are stored in:

/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P???+??

Whereas the old ones are in:

/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/comparison-maps/P???+??

twshimwell avatar Nov 13 '23 14:11 twshimwell

The first run to produce final maps is P054+36. Attached is a comparison image_full_ampphase_di_m.NS.app.restored.fits. Left is with new kMS/DDF (/home/tasse/DDFSingularity/ddf_dev.sif and using the pipeline version in this pull request - https://github.com/mhardcastle/ddf-pipeline/pull/336) and right is the old version (from our LoTSS archive) that was made a few months ago using the old kMS/DDF (/data/lofar/mjh/ddf_py3_d11_ctmaster.sif).

When blinking between the images it is pretty tough to see any difference at all.

Screenshot 2023-11-13 at 15 25 43

twshimwell avatar Nov 13 '23 14:11 twshimwell

Comparison between final P316+81 maps. Left new and right old.

Here arguably the new is a little worse, with some negative holes around some of the brighter sources. The rms on the maps is a few % higher too (probably due to artefacts from the negative holes).
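(For reference, the kind of comparison described here can be checked directly on the two restored maps. The sketch below is only illustrative, assumes astropy/numpy are available, and the file paths are placeholders for the new/old image_full_ampphase_di_m.NS.app.restored.fits products.)

import numpy as np
from astropy.io import fits

# Placeholder paths for the new and old restored maps (adjust to the actual files).
new_map = "new/image_full_ampphase_di_m.NS.app.restored.fits"
old_map = "old/image_full_ampphase_di_m.NS.app.restored.fits"

def quick_stats(path):
    data = np.squeeze(fits.getdata(path))            # drop degenerate freq/Stokes axes
    pixels = data[np.isfinite(data)]
    # Crudely clipped rms as a rough off-source noise estimate.
    clipped = pixels[np.abs(pixels) < 5 * np.std(pixels)]
    return np.std(clipped), np.min(pixels)            # (approx. rms, most negative pixel)

for label, path in [("new", new_map), ("old", old_map)]:
    rms, minpix = quick_stats(path)
    print(f"{label}: rms ~ {rms*1e6:.1f} uJy/beam, min pixel {minpix*1e3:.2f} mJy/beam")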

Screenshot 2023-11-14 at 11 21 14

twshimwell avatar Nov 14 '23 10:11 twshimwell

That's bad... Needs investigation... Do you see the holes in all facets? At what step do the differences appear? It could be the smoothing; in that case we could try to smooth the new solutions with the old SmoothSols.py and see what happens?

cyriltasse avatar Nov 14 '23 10:11 cyriltasse

Hmmm, it might be something to do with the DDS3_full_slow_merged solutions. The holes seem to appear (though it's a bit hard to tell for sure because of differences in artefacts) for the first time in the final image (image_full_ampphase_di_m.NS) and are not really apparent in the previous image_full_ampphase_di_m map.

Left is the image_full_ampphase_di_m.NS and right is the image_full_ampphase_di_m map. Colourscale is the same.

Screenshot 2023-11-14 at 11 48 15

twshimwell avatar Nov 14 '23 10:11 twshimwell

P030+51 is giving a weirder issue in that the bootstrap image comes out very oddly (see below).

The two low-declination fields both hang on the first kMS call for some reason. I've tried clearing the cache and using different nodes, but they hang at 99% of the kMS run on the first msfile. Odd.

Screenshot 2023-11-16 at 14 57 00

twshimwell avatar Nov 16 '23 13:11 twshimwell

Well that's a killms fail at the first dd stage... but presumably works OK in the old pipeline? or not?

mhardcastle avatar Nov 16 '23 14:11 mhardcastle

Yeah, I don't have those images from the old pipeline, but I can tell it worked OK because the bootstrap mask images are propagated, and for my new run those are pretty bad.

Indeed, I think something is wrong with the amplitude solutions from kMS, but I will do a test to check.


twshimwell avatar Nov 16 '23 14:11 twshimwell

OK, so reimaging the bootstrap step with just P applied rather than AP kind of solves the issue. So the amplitude solutions are indeed really poor.

Looking at other data products, I also see that for these fields the image_dirin_SSD is better than the image_dirin_SSD_m_c_di. So there are still issues at the DI solving step too. I'll do a direct comparison with the old pipeline so we can compare these early pipeline products.

Hmm.

twshimwell avatar Nov 17 '23 09:11 twshimwell

So the bootstrap issue seems partly due to the fast dt_di that we use in the parset. Letting the dt_di be chosen automatically gives a slower dt_di and seems to produce better results (I suspect in some cases it would be better still with somewhat longer intervals), with fewer diverging facets. I will continue the runs with an automatically chosen dt_di.

Note that for some reason the old kMS did not suffer so significantly with the dt_di being very fast (still not good though, with the image_dirin_SSD being generally better than the image_dirin_SSD_m_c_di).

twshimwell avatar Nov 21 '23 09:11 twshimwell

P269+63 - no major differences for this one between the new and old kMS/DDF versions (a different dt_di was used in the older reduction due to ticket #315).

twshimwell avatar Nov 30 '23 14:11 twshimwell

So, to date:

P269+63 & P054+36 approximately comparable between new and old versions.

P216+15 -- having issues running so no results yet.

All the other fields get worse unfortunately.

P030+51 - big differences at the bootstrap step with new kMS apparently producing much worse solutions from the DI maps.

P316+81 - negative holes in new kMS/DDF final maps but not in old one.

P347+08 - a bright source away from the centre causes much bigger issues in the new DDF/kMS compared to the old one. The resulting image is far noisier.

The simplest field to test things on seems to be P030+51 because it goes wrong so early in the pipeline. I can try e.g. using the old kMS on the run with the new DDF and seeing the results, so we have a really clean comparison with just a single kMS step changed (no smooth sols or DDF pipeline). Then I think it might require @cyriltasse to take a look at kMS.
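(A quick way to compare the two sets of kMS amplitude solutions directly, independent of the imaging, could look like the sketch below. It is only illustrative: the file names are placeholders, and it assumes the usual killMS .sols.npz layout with a structured 'Sols' array carrying a complex gain field 'G'; check the actual key names and axis ordering before relying on it.)

import numpy as np

# Placeholder paths to one solution file from the old-kMS and new-kMS runs.
old_sols = "OLDSOFT/SOLSDIR/ms1/killMS.DDS0.sols.npz"
new_sols = "NEWSOFT/SOLSDIR/ms1/killMS.DDS0.sols.npz"

def amp_summary(path):
    sols = np.load(path, allow_pickle=True)
    gains = sols["Sols"]["G"]                 # complex gains; axis ordering not assumed here
    amps = np.abs(gains[np.isfinite(gains)])
    return np.median(amps), np.std(amps), amps.max()

for label, path in [("old", old_sols), ("new", new_sols)]:
    med, std, amax = amp_summary(path)
    print(f"{label}: median |G| = {med:.3f}, scatter = {std:.3f}, max |G| = {amax:.2f}")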

If anyone has suggestions for things to try let me know.

twshimwell avatar Dec 01 '23 14:12 twshimwell

For P030+51 I've put data in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-NEWSOFT and /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/P030+51-SAMEDICO-OLDSOFT to investigate why the image_bootstrap maps are so different for this field when changing from the old to the new software.

These runs were stopped soon after the bootstrap mapping, so they contain all the appropriate weights. The attached shows images made using the command below; on the left is the new-software dirty image and on the right is the old-software dirty image:

DDF.py --Misc-ConserveMemory=1 --Output-Name=image_test --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA_DI_CORRECTED --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=2 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=3.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="IMAGING_WEIGHT" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Mask-Auto=1 --Mask-SigTh=5.00 --Mask-External=image_dirin_SSD_m_c_di_m.app.restored.fits.mask.fits --DDESolutions-GlobalNorm=None --DDESolutions-DDModeGrid=AP --DDESolutions-DDModeDeGrid=AP --DDESolutions-DDSols=DDS0 --Predict-InitDicoModel=image_dirin_SSD_m_c_di_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10
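(To make the comparison quantitative rather than by eye, one option is to write out a difference map from the two image_test dirty images. The sketch below is illustrative only: it assumes astropy is available, and the paths and output file names are placeholders for the NEWSOFT/OLDSOFT products.)

import numpy as np
from astropy.io import fits

new_path = "P030+51-SAMEDICO-NEWSOFT/image_test.dirty.fits"   # placeholder output names
old_path = "P030+51-SAMEDICO-OLDSOFT/image_test.dirty.fits"

new_hdu = fits.open(new_path)[0]
old_data = fits.getdata(old_path)

diff = new_hdu.data - old_data
print("max |difference| = %.4g Jy/beam" % np.nanmax(np.abs(diff)))

# Keep the header/WCS of the new image so the difference map can be blinked against both in ds9.
fits.writeto("image_test_new_minus_old.fits", diff, header=new_hdu.header, overwrite=True)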

Screenshot 2023-12-05 at 12 24 52

twshimwell avatar Dec 05 '23 11:12 twshimwell

Hi -

So after 10 days of intense debugging I managed to understand what was going on, and now the new and old versions of killMS give virtually the same results (left/right are old/new).

image

There were two bugs:

  • due to a misplaced if statement, the adaptive Q matrices were updated based on the iteration level rather than on the recursion level
  • when using the asynchronous predict, the rms used to estimate the noise matrix was based on t-1 instead of t (see the schematic sketch below)
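(For readers not deep in the killMS internals, the sketch below is a generic, schematic Kalman-style solver loop in Python, not the actual killMS code. It only illustrates where the two bug patterns above sit: the adaptive process-noise term Q is updated per recursion level, and the noise estimate uses the residual of the current time step t.)

import numpy as np

def kalman_style_solve(vis_per_time, n_recursions=3, q_scale=1e-2):
    # Schematic scalar Kalman-style gain solver (NOT killMS): one complex state x,
    # updated over time steps t, with n_recursions inner recursion levels per step.
    x, P = 1.0 + 0j, 1.0      # state estimate and its variance
    Q = q_scale               # adaptive process-noise term
    for t, vis in enumerate(vis_per_time):
        # Bug 2 analogue: the noise estimate must use the residual of the CURRENT
        # time step t; the reported bug used the rms from t-1 in the asynchronous predict.
        residual = vis - x
        R = max(abs(residual) ** 2, 1e-6)
        for level in range(n_recursions):
            P_pred = P + Q
            K = P_pred / (P_pred + R)        # Kalman gain
            x = x + K * (vis - x)
            P = (1.0 - K) * P_pred
            # Bug 1 analogue: the adaptive Q is updated once per recursion LEVEL;
            # the reported bug tied the update to the outer iteration counter instead.
            Q = 0.9 * Q + 0.1 * abs(K * (vis - x)) ** 2
    return x

# Toy usage: noisy measurements of a constant gain of ~1.2
rng = np.random.default_rng(0)
data = 1.2 + 0.05 * (rng.standard_normal(50) + 1j * rng.standard_normal(50))
print(kalman_style_solve(data))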

The new kMS has much more functionality and is ~30-40% faster than the old one.

I'll design a kMS test function to do this more automatically in the future

cyriltasse avatar Jan 05 '24 08:01 cyriltasse

OK, I've built the new singularity image: /home/tasse/DDFSingularity/ddf_dev.sif

@twshimwell ready for testing

cyriltasse avatar Jan 05 '24 09:01 cyriltasse

Let me know if you need some nodes, @twshimwell.

mhardcastle avatar Jan 05 '24 09:01 mhardcastle

Great stuff Cyril. @mhardcastle I have 6 test fields and I already have node066. If possible, can I get another 5 nodes so I can do all 6 fields in parallel? Alternatively, perhaps 2 more nodes and then I'll do two batches of 3 runs. Many thanks.

twshimwell avatar Jan 05 '24 12:01 twshimwell

I've reserved nodes001-005 for you, so go crazy (but remind me to remove the reservation again afterwards!).

mhardcastle avatar Jan 05 '24 13:01 mhardcastle

Thanks. Just restarting all the runs. Should have some results by Monday, hopefully. I had to choose a few different fields due to LINC reprocessing near A-team sources. Anyway, that shouldn't really matter.

twshimwell avatar Jan 05 '24 13:01 twshimwell

Well, not sure why, but I don't see it being faster now; it's just the same speed... I'll dig a bit.

cyriltasse avatar Jan 05 '24 17:01 cyriltasse

Well, I guess one starts to see differences in computing time when the datasets are bigger...

cyriltasse avatar Jan 05 '24 17:01 cyriltasse

We can look out for this in Tim's tests; he's using the same type of machine for both, and the timing info is all in the logs and can be extracted by the plotting code. Although timing differences of tens of per cent can easily be caused just by some process hitting the storage. I have 26 ddf-pipeline instances running at Herts right now, so things may be a bit slow.
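(A rough alternative to the plotting code for pulling wall-clock numbers out of a log, assuming only the "HH:MM:SS - " timestamp prefix visible in the log excerpts in this thread; the log path below is just an example from later in the thread.)

import re
from datetime import timedelta

def wallclock_from_log(path):
    # Elapsed time from the first to the last "HH:MM:SS - " stamp in a DDF/kMS log.
    stamps = []
    with open(path) as f:
        for line in f:
            m = re.match(r"\s*(\d{2}):(\d{2}):(\d{2}) - ", line)
            if m:
                h, mnt, s = map(int, m.groups())
                stamps.append(timedelta(hours=h, minutes=mnt, seconds=s))
    if len(stamps) < 2:
        return None
    elapsed = stamps[-1] - stamps[0]
    if elapsed.total_seconds() < 0:          # crude handling of a midnight crossing
        elapsed += timedelta(days=1)
    return elapsed

print(wallclock_from_log("P030+51/image_dirin_SSD_m_c.log"))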

mhardcastle avatar Jan 05 '24 17:01 mhardcastle

Hmmm, I'm getting DDFacet issues on the image_dirin_SSD_m_c image (so the first one with the clustering). The command run is e.g.:

DDF.py --Misc-ConserveMemory=1 --Output-Name=image_dirin_SSD_m_c --Data-MS=mslist.txt --Deconv-PeakFactor 0.001000 --Data-ColName DATA --Parallel-NCPU=32 --Beam-CenterNorm=1 --Deconv-CycleFactor=0 --Deconv-MaxMinorIter=1000000 --Deconv-MaxMajorIter=1 --Deconv-Mode SSD --Beam-Model=LOFAR --Weight-Robust -0.150000 --Image-NPix=20000 --CF-wmax 50000 --CF-Nw 100 --Output-Also onNeds --Image-Cell 1.500000 --Facets-NFacets=11 --SSDClean-NEnlargeData 0 --Freq-NDegridBand 1 --Beam-NBand 1 --Facets-DiamMax 1.5 --Facets-DiamMin 0.1 --Deconv-RMSFactor=0.000000 --SSDClean-ConvFFTSwitch 10000 --Data-Sort 1 --Cache-Dir=. --Cache-DirWisdomFFTW=. --Debug-Pdb=never --Log-Memory 1 --GAClean-RMSFactorInitHMP 1.000000 --GAClean-MaxMinorIterInitHMP 10000.000000 --DDESolutions-SolsDir=SOLSDIR --Cache-Weight=reset --Beam-PhasedArrayMode=A --Misc-IgnoreDeprecationMarking=1 --Beam-At=facet --Output-Mode=Clean --Predict-ColName DD_PREDICT --Output-RestoringBeam 12.000000 --Weight-ColName="None" --Freq-NBand=2 --RIME-DecorrMode=FT --SSDClean-SSDSolvePars [S,Alpha] --SSDClean-BICFactor 0 --Facets-CatNodes=/beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/uv-misc-files/P090+77/image_dirin_SSD_m.npy.ClusterCat.npy --Mask-Auto=1 --Mask-SigTh=15.00 --Mask-External=image_dirin_SSD.app.restored.fits.mask.fits --Predict-InitDicoModel=image_dirin_SSD_m.DicoModel --Selection-UVRangeKm=[0.100000,1000.000000] --GAClean-MinSizeInit=10

The error looks like:

17:23:57 - AsyncProcessPool [1.2/6.2 3.1/11.2 115.7Gb] process comp31: exception raised processing job buildpsf:88:
Traceback (most recent call last):
  File "/usr/local/src/DDFacet/DDFacet/Other/AsyncProcessPool.py", line 873, in _dispatch_job
    result = call(*args, **kwargs)
  File "/usr/local/src/DDFacet/DDFacet/Imager/ClassFacetMachine.py", line 995, in _buildFacetSlice_worker
    psf[ch][pol] = psf[ch][pol].T[::-1, :]
ValueError: could not broadcast input array from shape (3675,3465) into shape (3465,3675)

    Build PSF facet slices......... 87/98 [==================================== ] 88% - 0'01"
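(The ValueError itself is easy to reproduce outside DDFacet; a minimal numpy sketch, using small stand-in shapes rather than the (3675, 3465) facet slice from the log:)

import numpy as np

# Square facet slice: assigning the flipped transpose back in place is fine.
psf_sq = np.zeros((1, 1, 8, 8))
psf_sq[0][0] = psf_sq[0][0].T[::-1, :]        # shapes match, no error

# Rectangular facet slice: the transpose swaps the axes, so the in-place
# assignment cannot broadcast (this mirrors the buildpsf:88 traceback above).
psf_rect = np.zeros((1, 1, 7, 5))             # stand-in for the (3675, 3465) slice
try:
    psf_rect[0][0] = psf_rect[0][0].T[::-1, :]
except ValueError as err:
    print(err)   # could not broadcast input array from shape (5,7) into shape (7,5)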

twshimwell avatar Jan 05 '24 19:01 twshimwell

(The previous singularity image I was using with the newer software was /beegfs/car/mjh/DDFSingularity/ddf.sif, which I believe is built from /beegfs/car/mjh/DDFSingularity/ddf-py3.singularity. With that one, DDFacet did not seem to have this issue. Also, using the latest version of the ddf-pipeline master branch is fine with the new singularity image.)

twshimwell avatar Jan 05 '24 19:01 twshimwell

That's just a copy of Cyril's directory, so shouldn't be any different in itself.

Is this a clean run from the start?

mhardcastle avatar Jan 05 '24 21:01 mhardcastle

Oh, strange. Yes, these are all clean runs from the start. The only thing different from normal runs is that I specify the clusterfile, so that we have the same directions as the old comparison runs.

twshimwell avatar Jan 06 '24 05:01 twshimwell

@twshimwell can you point me to where the data is?

cyriltasse avatar Jan 06 '24 13:01 cyriltasse

It's in /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs. There is a folder for each field, and the pipeline parset files are in the /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/config-files folder.

twshimwell avatar Jan 06 '24 14:01 twshimwell

An example logfile with the error is /beegfs/car/shimwell/Cyril-tests/FINAL-TESTS/new-runs/P030+51/image_dirin_SSD_m_c.log

twshimwell avatar Jan 06 '24 14:01 twshimwell

It's weird because I was working on P030+51 too and I did not have this error... For the clusterfile I just copied the one from the old processing into the new directory and let the pipeline skip the computation, but I hardly see how this could cause the problem.

cyriltasse avatar Jan 06 '24 16:01 cyriltasse

Ah, OK. I wonder why I see the issue now then. I see it on all the fields I'm trying. So you simply copied the clusterfile and not any of the image_dirin_SSD_m_c products?
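(To rule out the clusterfile itself, one hedged sanity check is to load the ClusterCat .npy from both runs and confirm they are identical; nothing is assumed about the array layout beyond what numpy reports, and the paths below are placeholders.)

import numpy as np

# Placeholder paths; add allow_pickle=True if numpy complains about the dtype.
old_cat = np.load("OLD/image_dirin_SSD_m.npy.ClusterCat.npy")
new_cat = np.load("NEW/image_dirin_SSD_m.npy.ClusterCat.npy")

print(old_cat.dtype.names, len(old_cat))     # field names and number of directions
print(new_cat.dtype.names, len(new_cat))
print("identical:", np.array_equal(old_cat, new_cat))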

twshimwell avatar Jan 08 '24 13:01 twshimwell