add NEOCC python interface to astroquery library
This is a python interface to data that the Near Earth Objects Coordination Centre (NEOCC) makes available through its API service (https://neo.ssa.esa.int/computer-access).
Hello @D-arioSpace! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2024-02-07 15:54:47 UTC
Dear @ceb8, it's a pleasure to meet you. Sorry for the delay in replying. Please take a look at this latest pull where we removed the dependency form the parse library. Unfortunately we cannot remove the pandas dependency because we do not have the necessary resources to implement the changes, it requires too much effort. Hopefully this may be one of these "limited circumstances" you speak of. In this case it's not a question of functionality (apparently there is nothing that astropy.tables can't do and panda does, but I can't be 100% sure either). It is just a matter of our internal resources that we do not have at the moment.
I perfectly understand that this can introduce dependencies and extra-feature that you would like to avoid. Please take also into account that, apart from the lack of resources, there are also other considerations that push us to this choice:
- there are users of some institutions who already use the library. We would like to encourage these users to use the library through astroquery (instead of the stand-alone version we provided) but with as little effort as possible for them. By using the library as it is now, they should just change their current import to
from astroquery.esa.neocc import neoccin their code, and everything else will be transparent to them. - all users who now use the library already have pandas pre-installed. For those who don't have it or start with a very basic version of python, the installation can easily be done separately.
Taking all this into consideration, please let us know if we can focus on the integration into astroquery without considering the pandas replacement.
Hi @D-arioSpace,
Unfortunately we really cannot allow a pandas dependency into our codebase, returning Astropy Tables to users is a very basic requirement in our API specs. However, we can provide the labor to make that change. If this course of action is acceptable, I will implement those changes within this PR.
There is one other issue that needs to be resolved before we proceed, which is the matter of the copyright notice you have on your code:
© Copyright [European Space Agency][2022] All rights reserved
Astroquery is released under a 3-clause BSD style license (see https://astroquery.readthedocs.io/en/latest/license.html), and individual modules cannot be released under different licenses. Would it be OK to remove this notice? We have other modules contributed by ESA developers, so I believe the 3-clause BSD license should be acceptable from that standpoint.
Let me know what you think. If I get your OK I will, implement the necessary changes and then ping you so you can review my work.
Thanks for your patience.
Hello @ceb8!
First of all, sorry for the delay in replying, unfortunately we had to resolve some legal issues due to the change of copyright requested.
From now on, you can change the license and insert a 3-clause BSD style license anywhere in the code.
Also, I would like to point out that we carried out (within this fork) another push containing the implementation of the new API provided by NEOCC web portal (https://neo.ssa.esa.int/PSDB-portlet/download?file=past_impactors_list), this time using directly astropy.table (see parse_impacted function).
Thanks again for your help. We are looking forward to see this library integrated into astroquery!
Hi @D-arioSpace!
Thanks for your response, and no worries about the delay. That's very good news about the copyright issue.
I am a bit confused about the new implementation you mention however, since I still see pandas imported all over. Do you mean this to be an example implementation for me? If so, thank you very much and I will get started on the work shortly. Just want to make sure not to step on your toes.
Hello @ceb8
you're right, it is just an example, which we hope to have implemented well (we did this to become familiar with the astropy.table data structure, taking advantage of the fact that this specific function had to be implemented almost from scratch). Feel free to improve it. And yes, pandas is still in all other functions, which need to be converted. Thanks again for your help. :)
Dario
@D-arioSpace Excellent, I am in this middle of one coding project, so I will start this one as soon as that one wraps up.
@D-arioSpace I'm pushing the de-pandafied lists.py functions so that you can go ahead and take a look at them while I am working on the tables.
Hi @ceb8! I had a look to the code and it looks fine. I also tested the main functionality within the list.py module and all of them are working! I would just remove the extra white spaces between the asteroid's number and its name (1) for all lists. I also saw that in the two 'priority' lists the time is in isot format, while in all the others it is iso (2). Both comments are not so relevant, but in case you think it's useful to implement them, I'll leave them to you.
I leave below the steps to reproduce the behavior I mentioned: (1)
ast = neocc.query_list(list_name='nea_list')
ast[1]
<Row index=1>
NEA
str27
----------------
719 Albert
(2)
out1 = neocc.query_list(list_name='priority_list')
out1[0][7]
<Time object: scale='utc' format='isot' value=2022-11-30T00:00:00.000>
out2 = neocc.query_list(list_name='risk_list')
out2[0][3]
<Time object: scale='utc' format='iso' value=2056-12-12 21:39:00.000>
Hi @ceb8 just to inform you I added a very small fix to tabs.py and to the 433.clolin file as the latter changed and the former has been adapted to that change
@D-arioSpace I’ve fully removed pandas from this module. In doing so I also tried to regularise the output formats, so now instead of query-specific objects the outputs are all either a single astropy table or a list of them.
I haven’t updated the documentation or the tests, so this is still very much a work in progress. But I wanted to make sure we are all on the same page in terms of the functionality before working on that.
I have a couple of big picture questions that @bsipocz and @keflavich may want to also weigh in on:
-
query_objectsometimes returns a single Astropy table, and sometimes a list of them, would it be better to separate out the list-returning queries into a separate function? (or alternately always return a list, but sometimes the list is only of length 1) - Two table types for
query_object(orbit properties and ephemeradies) require additional arguments, might it make sense to split these queries out into seperate functions? - This is mainly for @bsipocz and @keflavich, this module does not use the
async_to_syncparadigm. It would not be too hard to update it do that, but it seems like we are moving away from that structure, what do you think about changing it vs leaving as is? - Generally how do you feel about the functionality, and output formats? I put all the non-tabular information in the meta property of the table(s):
impacts_table = neocc.query_object(name='1979XB', tab='impacts')
pd_tble.meta
- A lot of the datetimes produce
dubious yearwarnings. I looked up why, and it seems basically unavoidable with dates too far in the past/future using UTC, what are your opinions on how to handle this?
Hi @ceb8 sorry for the big delay in answering and many thanks for the update of this interface!
Regarding your big picture questions, I personally don't have any issues with the current behavior of query_object returning either a single table or a list of tables, but it would be good to get feedback from @bsipocz and @keflavich as well, since they are the experts in this area.
Regarding the two table types that require additional arguments, I'm okay with leaving them as is for now, but if anyone else has strong opinions on this, we can revisit it later.
In terms of functionality and output formats, I don't have any major issues with them. However, I did notice an issue with error handling when calling query_object with an invalid tab name, for example:
temp = neocc.query_object(name='1979XB', tab='s')
it returns me a non-clean message like:
KeyError: 'impactsPlease introduce a valid tab name. valid tabs names are: , close_approachesPlease introduce a valid tab name. valid tabs names are: , observationsPlease introduce a valid tab name. valid tabs names are: , physical_propertiesPlease introduce a valid tab name. valid tabs names are: , orbit_propertiesPlease introduce a valid tab name. valid tabs names are: , ephemeridesPlease introduce a valid tab name. valid tabs names are: , summary'.
The error message that is returned is not very clean and includes multiple repeated messages.
Additionally, I encountered an error when trying to retrieve observations for object "433" (i.e. neocc.query_object(name='433', tab='observations') ). It looks like there may be a bug in the code related to radar observations and the "AsteroidObservation" class, in the “_get_rad_into” method. It would be helpful to investigate this further and see if we can resolve the issue (please let me know if there's anything I can do to help. I anticipate having more time available in the near future to dedicate to this project, so please don't hesitate to reach out to me if there's anything I can assist with.).
Overall, thanks for the updates! Let me know if there's anything I can do to help.
Hi @ceb8 I was wondering if the work is progressing smoothly or if I could assist you in clarifying any doubts or concerns you might have.
@D-arioSpace Thanks for checking in, I'm just getting back to this after a busy period. Seems pretty straightforward. I'll leave things as they are for now, fix the bugs, and move forward updating the testing to match the new functionality.
Thank you very much @ceb8, I confirm my availability to clarify any doubts you may have about files and functions. I hope to hear from you soon then! Thanks again.
Hi @ceb8 sorry for bothering you again. I'm working with NEOCC data for some projects and I was wondering if you believe it would be possible to have the code integrated into astroquery by September. This integration would greatly benefit these other projects, as they would be able to utilise the astroquery package directly instead of relying on the "standalone" version for NEOCC data.
Thank you again for your time and effort!
@D-arioSpace That seems like a reasonable timeframe.
@bsipocz @keflavich Could I get your opinions on the big picture questions I mentioned above?
@D-arioSpace Looking into the neocc.query_object(name='433', tab='observations') error, and I can't reproduce it. I get a table that looks superficially fine to me as output, can you describe what you're seeing?
query_object sometimes returns a single Astropy table, and sometimes a list of them, would it be better to separate out the list-returning queries into a separate function? (or alternately always return a list, but sometimes the list is only of length 1)
Yes, that's a good idea if there are two distinct routes that both make sense. For some services, like Vizier, queries are supposed to return lists of catalogs, so it just makes sense that that's the return.
Two table types for query_object (orbit properties and ephemeradies) require additional arguments, might it make sense to split these queries out into seperate functions?
No strong opinion here.
This is mainly for @bsipocz and @keflavich, this module does not use the async_to_sync paradigm. It would not be too hard to update it do that, but it seems like we are moving away from that structure, what do you think about changing it vs leaving as is?
I'm fine with the approach that is less work now. async_to_sync has proved useful in some debug cases, but I think the asynchronous queries have not seen widespread adoption and do not have particularly broad use cases. Since we don't really document how to use those, no one is using them.
Generally how do you feel about the functionality, and output formats? I put all the non-tabular information in the meta property of the table(s):
At a glance, that seems fine to me, but I'm not a NEOCC user - it would be better to find people using the data and get their opinions. From a structural perspective, this seems fine.
A lot of the datetimes produce dubious year warnings. I looked up why, and it seems basically unavoidable with dates too far in the past/future using UTC, what are your opinions on how to handle this?
Maybe we should silence those warnings in this context, since this is a context where far past/future dates are expected.
Thanks @keflavich! To your first point, I can't tell which paradigm you are advocating. Always returning a list?
I'm ambivalent, actually - always returning a list is right if we're sticking with one function, but I'm also OK with splitting into two functions if there is an understandable difference between two approaches. I would avoid having one function return lists sometimes and tables other times.
Hello @ceb8 , So, I have performed other tests, and most of them are okay. However, I noticed some errors in some calls (but not in the one I mentioned in the previous comment) specifically the call:
zzz = neocc.query_object(name='2023DW', tab='close_approaches')
returns me an error. I investigated it a bit, and it is due to an extra line NEOCC added at the beginning of the close approach file of each asteroid. To solve it, you can tell the program to skip the first line in the pd.read_csv, in class CloseApproaches in the static method clo_appr_parser(data_obj):
df_close_appr = pd.read_csv(df_impacts_d,
delim_whitespace=True, **skiprows=1**)
Additionally, we have a similar issue with the call:
zzz = neocc.query_object(name='433 Eros', tab='orbit_properties', orbital_elements="keplerian", orbit_epoch="middle");
which returns me the following error:
zzz = neocc.query_object(name='433 Eros', tab='orbit_properties', orbital_elements="keplerian", orbit_epoch="middle");
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/dario/dev/astroqyery_intergation/astroquery/astroquery/esa/neocc/core.py", line 477, in query_object
neocc_obj = tabs.parse_orbital_properties(resp_str)
File "/home/dario/dev/astroqyery_intergation/astroquery/astroquery/esa/neocc/tabs.py", line 655, in parse_orbital_properties
section = re.search("(\w+ parameters)", body_lst[3]).groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'
This time I'm not sure where the error come from, I'm still investigating it.
Furthermore, we have recently incorporated an additional column into the close approach lists (both the recent and upcoming lists). This new column, a float value, indicates the frequency of a specific event based on the miss distance from Earth, the size of the asteroid, and its relative velocity. It is important to note that both calls still function properly, so if you believe it is worthwhile to include this information, feel free to do so.
@ceb8 - There is a lot of commits here, I would even guess duplicates, certainly duplicated commit messages, so I wonder whether you could cleanup the PR, rebase and squash things out a bit?
Also, I wonder whether the total diff of ~150k is necessary, I assume it's due to bundled data files, but should they be this many/big?
Codecov Report
Attention: 215 lines in your changes are missing coverage. Please review.
Comparison is base (
ab7fdaa) 66.77% compared to head (30e854f) 67.77%.
Additional details and impacted files
@@ Coverage Diff @@
## main #2254 +/- ##
==========================================
+ Coverage 66.77% 67.77% +0.99%
==========================================
Files 237 246 +9
Lines 18303 19537 +1234
==========================================
+ Hits 12222 13241 +1019
- Misses 6081 6296 +215
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Hi @ceb8 do you know if the date I suggested weeks ago:
[...] I was wondering if you believe it would be possible to have the code integrated into astroquery by September. This integration would greatly benefit these other projects, as they would be able to utilise the astroquery package directly [...].
is still feasible? If not, please just inform me. Thank you again for your work!
@D-arioSpace At this point, everything substantive is finished, it's just a matter of getting all the tests passing and then the final sign off from @bsipocz/@keflavich. The main test issue right now is the windows failures, if you have a windows machine to test this on please do feel free to work on it, I don't have one regularly, which is the hold up there.
@keflavich @bsipocz Thoughts on the docs failure? It seems to be failing on alfalfa rather than neocc, so maybe a rebase is in order?
I don't see why you say it's an alfalfa issue, to me it looks more to be a generic rtd problem and therefore I restarted the build. If it keeps failing I would suggest trying to sort out the problem with local docs builds, it's just easier to debug than waiting on RTD in CI.
The windows failure is due to windows having int32 as default rather than int64. So either the code needs to be explicit to use int64 for int columns (if you have long int IDs it may be necessary, as I recall e.g. gaia or SDSS IDs doesn't fit into the 32bit), or the test should be more generic and check for int rather than int64.
Thank you @bsipocz @ceb8 for your effort! I briefly tried all the features of the library, and they all seem well implemented. In these days, I will invite some of my colleagues to play with this fork and see if we can identify any errors before integration. Regarding the warnings astropy Time, I tend for keeping them because in my opinion it's better to make the user aware of the inherited warnings from the library rather than silencing them. What are the next steps now?
@D-arioSpace I've got all the tests passing and addressed the ones of @bsipocz's comments that I can. All that needs to happen now is that you address the outstanding comments, and anything else that comes up when @bsipocz does a more careful check.