Doctor improvements for various audio / video file extensions
After the recent update to use Magika for MIME type and extension identification in Doctor, we’ve noticed that some audio files — particularly WMA (Windows Media Audio) — are being classified as application/octet-stream or otherwise unrecognized. This could lead to incorrect or missing extensions when files are ingested or exported.
As we may be moving forward with audio soon, we should test the range of possible file types and improve Doctor.
@quevon24 So far, only 28 rows were not recognized as mp3 or wma. You could test Magika with those; see the last query below.
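For testing, it could help to have a check that is independent of Magika to compare its answers against. The following is a minimal sketch, not Doctor's actual code: `sniff_audio_format` is a hypothetical helper that inspects the well-known header signatures of the formats discussed in this issue. Note that ASF and `ftyp` containers can also hold video, so a match only suggests WMA or M4A.

```python
from typing import Optional

# ASF header object GUID; WMA audio is stored in an ASF container.
ASF_GUID = bytes.fromhex("3026b2758e66cf11a6d900aa0062ce6c")


def sniff_audio_format(header: bytes) -> Optional[str]:
    """Guess the format from the first bytes of a file (hypothetical helper).

    Returns a lowercase extension, or None when no known signature matches.
    """
    if header.startswith(ASF_GUID):
        return "wma"  # ASF container (could also be WMV video)
    if header.startswith(b".RMF"):
        return "rm"  # RealMedia
    if header[4:8] == b"ftyp":
        return "m4a"  # ISO base media container (could also be MP4 video)
    if header.startswith(b"ID3"):
        return "mp3"  # MP3 with a leading ID3v2 tag
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame sync (11 set bits)
    return None
```

A file that Magika reports as application/octet-stream but that this check identifies as `wma` would be a useful test case to collect.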
courtlistener=> SELECT count(*) from audio_audio where local_path_original_file like 'mp3%';
count
-------
84945
(1 row)
courtlistener=> SELECT count(*) from audio_audio where local_path_original_file like 'wma%';
count
-------
13190
(1 row)
courtlistener=>
SELECT date_created, id, local_path_original_file
from audio_audio
where local_path_original_file not like 'mp3%' and local_path_original_file not like 'wma%'
order by 1 desc;
date_created | id | local_path_original_file
-------------------------------+-------+---------------------------------------------------------------------------------------
2023-02-06 19:30:38.940528+00 | 84992 | nsf/pages/unavailable2/2023/02/06/john_doe_v._judith_rodgers.nsfpagesunavailable2
2022-09-07 17:31:06.390409+00 | 82373 | nsf/pages/unavailable1/2022/09/07/thomas_hammett_v._janet_yellen.nsfpagesunavailable1
2021-06-11 00:31:32.899771+00 | 77029 | jpg/2021/06/10/mohammed_abdelsalam_v._merrick_garland.jpg
2020-01-14 19:31:14.571191+00 | 67935 | nsf/2020/01/13/aaron_ball_v._george_washington_university_1.nsf
2019-09-27 17:31:07.819259+00 | 65606 | nsf/2019/09/27/flat_wireless_llc_v._fcc.nsf
2019-03-08 16:31:32.156948+00 | 62026 | m4a/2019/03/07/feng_v._univ_of_delaware.m4a
2019-02-13 22:31:03.72909+00 | 61662 | m4a/2019/02/13/obasi_investment_v._tibet_pharm_inc.m4a
2019-01-28 22:30:51.989352+00 | 61201 | m4a/2019/01/24/united_states_v._payano_incomplete_recording_1.m4a
2019-01-25 16:32:40.049046+00 | 61190 | m4a/2019/01/24/furgess_v._pa_dept_corrections.m4a
2019-01-25 16:31:41.236007+00 | 61189 | m4a/2019/01/24/hess_v._comm_social_security_1.m4a
2019-01-24 16:31:57.753293+00 | 61166 | m4a/2019/01/23/deutsche_bank_v._bendex_properties.m4a
2019-01-16 16:33:25.290489+00 | 60996 | m4a/2019/01/15/charte_v._america_tutor_inc.m4a
2019-01-16 16:31:50.802025+00 | 60995 | m4a/2019/01/15/guariglia_v._united_food_commercial.m4a
2018-11-07 22:32:20.922348+00 | 59646 | m4a/2018/11/07/sec_v._gentile_1.m4a
2018-04-24 21:31:40.029489+00 | 36125 | m4a/2018/04/24/united_states_v._schonewolf.m4a
2018-04-24 21:30:56.598989+00 | 36124 | m4a/2018/04/24/united_states_v._wegeler.m4a
2018-04-23 21:31:20.699798+00 | 36112 | m4a/2018/04/23/in_re_energy_future_holdings.m4a
2018-01-05 21:33:05.887324+00 | 33904 | m4a/2017/12/12/united_states_v._city_of_pittsburgh.m4a
2017-09-25 20:33:22.086373+00 | 31945 | rm/2017/09/22/ameren_services_company_v._federal_energy_regulatory_commission.rm
2016-10-06 17:31:56.691762+00 | 26058 | nsf/pages/unavailable2/2016/10/06/sealed_case.nsfpagesunavailable2
2016-04-15 19:34:01.172686+00 | 15872 | rm/2016/04/15/flamingo_las_vegas_operating_v._nlrb.rm
2016-04-15 19:32:11.798292+00 | 15870 | rm/2016/04/15/state_of_west_virginia_v._hhs.rm
2016-04-15 19:31:46.77508+00 | 15869 | rm/2016/04/15/takeda_pharmaceuticals_u.s.a._v._sylvia_burwell.rm
2015-10-07 20:38:10.294664+00 | 13713 | nsf/Pages/Unavailable1/2015/10/02/frank_morello_v._dc.nsfPagesUnavailable1
2015-06-04 16:38:51.017465+00 | 12732 | MP3/2015/06/04/shukh_v._seagate_technology_llc_combined_w2014-1406.MP3
2015-05-07 03:12:13.363916+00 | 12372 | nsf/Pages/Unavailable1/2015/05/05/samuel_st._john_v._jeh_johnson.nsfPagesUnavailable1
2014-11-18 20:20:16.107161+00 | 9046 | MP3/2014/10/06/sarah_mcivor_v._credit_control_services_inc..MP3
2014-11-15 18:18:36.909428+00 | 4413 | MP3/2014/05/13/occidental_fire__casualty_co._v._adam_soczynski.MP3
(28 rows)
Thanks @grossir.
I already checked those rows and they appear to be correct.
Extensions:
rm: RealMedia audio file
MP3: mp3 in caps
m4a: MPEG-4 audio file
The only problem is with nsf files, but this seems to happen because there is no recording available, e.g. https://media.cadc.uscourts.gov/recordings/bydate/2019/9
https://www.courtlistener.com/api/rest/v4/audio/?id=65606
That audio in CL was created on 2019-09-27, but it wasn't until 2024 that the scraper was updated to verify that the URL contained the .mp3 extension.
Regarding the audio with the jpg extension (https://www.courtlistener.com/api/rest/v4/audio/?id=77029), I was unable to replicate it (the backscraper does not return that result), but this appears to be the correct URL for the file: https://www.ca9.uscourts.gov/datastore/media/2021/06/10/19-72018.mp3
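To make the review of those rows repeatable, the stored paths could be checked with something like the sketch below. It is only an illustration: `classify_path` and `KNOWN_AUDIO_EXTENSIONS` are hypothetical names, and the set of accepted extensions is an assumption based on the rows above. It lowercases the extension so that `MP3` counts as `mp3`, and flags anything else (such as the nsf placeholders or the jpg) for manual review.

```python
import posixpath
from typing import Tuple

# Assumed set of valid audio extensions, taken from the rows in this thread.
KNOWN_AUDIO_EXTENSIONS = {"mp3", "wma", "m4a", "rm"}


def classify_path(local_path: str) -> Tuple[str, bool]:
    """Return (normalized extension, is_known_audio) for a stored file path."""
    # splitext splits at the last dot of the final path component.
    ext = posixpath.splitext(local_path)[1].lstrip(".").lower()
    return ext, ext in KNOWN_AUDIO_EXTENSIONS


paths = [
    "MP3/2015/06/04/shukh_v._seagate_technology_llc_combined_w2014-1406.MP3",
    "m4a/2019/03/07/feng_v._univ_of_delaware.m4a",
    "nsf/2019/09/27/flat_wireless_llc_v._fcc.nsf",
    "jpg/2021/06/10/mohammed_abdelsalam_v._merrick_garland.jpg",
]
for path in paths:
    ext, ok = classify_path(path)
    print(f"{ext:>4} {'ok' if ok else 'REVIEW'} {path}")
```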
@flooie Do you have any examples of a file going unrecognized or being classified as application/octet-stream?