Help to analyze and optimize a very big explorer
Scanning some BUFR files imported into arkimet, I get a very big explorer.
du -ksh *.xapian
1,7G report_fixed.xapian
4,0K report_mobile.xapian
2,5M sample_fixed.xapian
4,0K sample_mobile.xapian
The BUFR data are big, but such a big summary is strange, and I also get some strange values from the summary. The data was imported into arkimet, so I expected no special problems with this kind of data.
Size of the arkimet dataset with the BUFR data:
du -ksh report_fixed
30G report_fixed
To generate the explorer I use this:
#!/usr/bin/python3
import dballe
from pathlib import Path
dstypes=["report_fixed","report_mobile","sample_fixed","sample_mobile"]
for dstype in dstypes:
    for path in Path(dstype).rglob('*.bufr'):
        file=path.as_posix()
        print(file)
        # update from file
        with dballe.Explorer(dstype+".xapian") as explorer:
            with explorer.update() as updater:
                importer = dballe.Importer("BUFR")
                with importer.from_file(file) as message:
                    try:
                        updater.add_messages(message)
                    except Exception as e:
                        print (e)
            print ("updated from file")
            #print (explorer.all_reports)
            #print (explorer.all_levels)
            #print (explorer.all_tranges)
            #print (explorer.all_varcodes)
            print (explorer.stats)
I interrupted it after 24 hours of work. Inspecting the result with:
#!/usr/bin/python3
import dballe
from pathlib import Path
dstypes=["report_fixed","report_mobile","sample_fixed","sample_mobile"]
for dstype in dstypes:
    with dballe.Explorer(dstype+".xapian") as explorer:
        print (explorer.all_reports)
        print (explorer.all_levels)
        print (explorer.all_tranges)
        print (explorer.all_varcodes)
        print (explorer.stats)
explorer.stats says:
ExplorerStats(datetime_min=datetime.datetime(2000, 1, 1, 0, 0), datetime_max=datetime.datetime(2021, 6, 22, 23, 0), count=1117667617)
[]
[]
[]
[]
ExplorerStats(datetime_min=None, datetime_max=None, count=0)
['fixed', 'luftdaten']
[dballe.Level(103,1000,None,None), dballe.Level(103,2000,None,None), dballe.Level(265,1,None,None), None]
[dballe.Trange(0,0,60), dballe.Trange(254,0,0), None]
['B01011', 'B01019', 'B01194', 'B01213', 'B04001', 'B04002', 'B04003', 'B04004', 'B04005', 'B04006', 'B05001', 'B06001', 'B10004', 'B12101', 'B13003', 'B15195', 'B15198', 'B15202', 'B15203', 'B15242', 'B49193', 'B49194', 'B49195', 'B49196', 'B49197']
ExplorerStats(datetime_min=datetime.datetime(2021, 6, 23, 0, 0), datetime_max=datetime.datetime(2021, 6, 23, 23, 59, 59), count=8934327)
[]
[]
[]
[]
ExplorerStats(datetime_min=None, datetime_max=None, count=0)
I get the attached output: log.zip
Thanks in advance for any suggestions on how to analyze the problem.
It seems that some of the problems come from strange BUFR messages. I attach an example.
dbamsg dump problema.bufr
#0 BUFR message: 262 bytes, origin 200:0, category 0 255:255:0, bufr edition 4, tables 14:1, subsets 1, values: 51/51:
Subset 0:
001194 Report mnemonic(CCITTIA5): fixed
004001 YEAR(YEAR): 2019
004002 MONTH(MONTH): 7
004003 DAY(DAY): 31
004004 HOUR(HOUR): 13
004005 MINUTE(MINUTE): 45
004006 SECOND(SECOND): 0
001011 SHIP OR MOBILE LAND STATION IDENTIFIER(CCITTIA5): cmricci
005001 LATITUDE (HIGH ACCURACY)(DEGREE): 44.00035
006001 LONGITUDE (HIGH ACCURACY)(DEGREE): 12.65516
007192 First level type(NUMERIC): 1
004192 Time range type(NUMERIC): 1
004193 Time range P1(NUMERIC): 0
004194 Time range P2(NUMERIC): 900
013011 TOTAL PRECIPITATION / TOTAL WATER EQUIVALENT(KG/M**2): 0.0
007192 First level type(NUMERIC): 103
007193 Level L1(NUMERIC): 2000
004192 Time range type(NUMERIC): 0
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 301.84
013003 RELATIVE HUMIDITY(%): 57
004192 Time range type(NUMERIC): 1
014198 Global radiation flux (downward)(W/M**2): -1
004192 Time range type(NUMERIC): 581
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 1.57
013003 RELATIVE HUMIDITY(%): 62
004192 Time range type(NUMERIC): 57
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 2.21
013003 RELATIVE HUMIDITY(%): 60
004192 Time range type(NUMERIC): 56
004194 Time range P2(NUMERIC): -3670016
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 0.29
013003 RELATIVE HUMIDITY(%): 62
007193 Level L1(NUMERIC): -418906107
004192 Time range type(NUMERIC): 880
004194 Time range P2(NUMERIC): -536346624
011002 WIND SPEED(M/S): 360.0
004192 Time range type(NUMERIC): 0
011041 MAXIMUM WIND GUST SPEED(M/S): 0.8
011043 MAXIMUM WIND GUST DIRECTION(DEGREE TRUE): 0
004192 Time range type(NUMERIC): 0
011001 WIND DIRECTION(DEGREE TRUE): 100
011002 WIND SPEED(M/S): 0.0
004192 Time range type(NUMERIC): 0
004194 Time range P2(NUMERIC): -3670016
011001 WIND DIRECTION(DEGREE TRUE): 0
011002 WIND SPEED(M/S): 0.0
007192 First level type(NUMERIC): 0
007193 Level L1(NUMERIC): 19398656
025025 Battery voltage(V): 0.0
025192 Battery charge(%): 81
025193 Battery current(A): -6.584
It seems strange to me that BUFR messages with invalid metadata can be written through the API ...
I would say that if the size of an explorer grows, it means there is high entropy in the metadata that the explorer aggregates, which are: station (lat, lon, ident, rep_memo), level, timerange, varcode.
You can analyze the big explorer by printing how many elements there are in all_stations, all_levels, all_tranges, and all_varcodes, and see which of them are receiving high-entropy data.
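A minimal sketch of that count could look like the following. The helper names `count_metadata` and `report` are hypothetical, and the attribute names are the ones mentioned in this thread; the sketch assumes each attribute behaves like a list, as the printed output above suggests. It runs against a stand-in object so it does not need dballe installed.

```python
from types import SimpleNamespace

# Attribute names taken from this thread; all_stations is assumed to exist
# alongside the all_* attributes used in the script above.
METADATA_ATTRS = (
    "all_stations",
    "all_reports",
    "all_levels",
    "all_tranges",
    "all_varcodes",
)


def count_metadata(explorer, attrs=METADATA_ATTRS):
    """Return {attribute name: number of values} for an explorer-like object."""
    counts = {}
    for name in attrs:
        values = getattr(explorer, name, None)
        # Attributes missing on this object are reported as None instead of failing.
        counts[name] = len(values) if values is not None else None
    return counts


def report(counts):
    """Format the counts, largest first, so high-entropy metadata stands out."""
    known = {name: count for name, count in counts.items() if count is not None}
    ordered = sorted(known.items(), key=lambda item: item[1], reverse=True)
    return [f"{name}: {count}" for name, count in ordered]


# Stand-in object (hypothetical) mimicking the sample_fixed output above.
fake = SimpleNamespace(
    all_reports=["fixed", "luftdaten"],
    all_levels=[(103, 1000), (103, 2000), (265, 1), None],
    all_tranges=[(0, 0, 60), (254, 0, 0), None],
    all_varcodes=["B01011", "B12101"],
)

for line in report(count_metadata(fake)):
    print(line)
```

With dballe available, the same helper could presumably be called on an explorer opened as in the script above, for example inside `with dballe.Explorer("report_fixed.xapian") as explorer:`. The attribute with a count far larger than the others is the one to look at first.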
Once you understand what makes the explorer blow up, you need to work out whether it is legitimate. For example, when importing aircraft data, latitude and longitude change continuously, so stations acquire high entropy and the explorer cannot aggregate the data.
In your case, it seems to me that level and timerange unexpectedly contain noise, and if so I would recommend investigating the cause of that noise.
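One possible way to start looking for that noise is to scan the text printed by `dbamsg dump` for level and time range descriptors with implausible values. The sketch below is only an illustration: the function names are hypothetical, and the thresholds (negative values, or a time range type above 255) are assumptions chosen to catch the values visible in the dump above, not rules taken from dballe.

```python
import re

# Level and time range descriptor codes as they appear in the dump above.
LEVEL_TRANGE_CODES = {"007192", "007193", "004192", "004193", "004194"}

# Matches dump lines such as "004192 Time range type(NUMERIC): 581".
LINE_RE = re.compile(r"^\s*(\d{6}) (.+?)\(NUMERIC\): (-?\d+)\s*$")


def is_suspicious(code, value):
    """Heuristic, assumed for illustration only: treat negative values as noise,
    and treat a time range type above 255 as noise."""
    if value < 0:
        return True
    if code == "004192" and value > 255:
        return True
    return False


def find_noise(dump_text):
    """Return (code, name, value) for suspicious level/time range entries."""
    hits = []
    for line in dump_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        code, name, value = match.group(1), match.group(2), int(match.group(3))
        if code in LEVEL_TRANGE_CODES and is_suspicious(code, value):
            hits.append((code, name, value))
    return hits


# A few lines copied from the dump of problema.bufr above.
sample = """\
004192 Time range type(NUMERIC): 1
004194 Time range P2(NUMERIC): 900
004192 Time range type(NUMERIC): 581
004194 Time range P2(NUMERIC): -3670016
007193 Level L1(NUMERIC): -418906107
007193 Level L1(NUMERIC): 19398656
"""

for hit in find_noise(sample):
    print(hit)
```

The heuristic is deliberately loose: it does not flag the large positive Level L1 value in the sample, so the thresholds would need tuning against the level and time range values that are actually expected in the dataset.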