Doctor improvements for various audio / video file extensions

Open flooie opened this issue 6 months ago • 2 comments

After the recent update to use Magika for MIME type and extension identification in Doctor, we’ve noticed that some audio files — particularly WMA (Windows Media Audio) — are being classified as application/octet-stream or otherwise unrecognized. This could lead to incorrect or missing extensions when files are ingested or exported.

As we may be moving forward with audio soon, we should test the range of possible file types and improve Doctor.
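One way to cross-check a suspicious classification is to look at the leading bytes of the file directly. The sketch below is only an illustration and is not how Doctor or Magika work: `sniff_audio_container` is a hypothetical helper that matches a few well-known signatures (the ASF header GUID used by WMA, `.RMF` for RealMedia, `ftyp` at offset 4 for MPEG-4 containers such as m4a, and the ID3 tag or MPEG frame sync for mp3).

```python
from typing import Optional

# Hypothetical helper, not part of Doctor or Magika.
# ASF header object GUID; WMA (and WMV) files start with these 16 bytes.
ASF_HEADER_GUID = bytes.fromhex("3026b2758e66cf11a6d900aa0062ce6c")


def sniff_audio_container(head: bytes) -> Optional[str]:
    """Return a coarse container guess from a file's leading bytes, or None."""
    if head.startswith(ASF_HEADER_GUID):
        return "asf"  # container used by WMA
    if head.startswith(b".RMF"):
        return "realmedia"
    if len(head) >= 8 and head[4:8] == b"ftyp":
        return "mp4"  # ISO base media file, which includes m4a
    if head.startswith(b"ID3"):
        return "mp3"  # MP3 that begins with an ID3v2 tag
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mpeg-audio"  # bare MPEG audio frame sync, no ID3 tag
    return None
```

Reading the first 16 bytes of a file and passing them to this function is enough for these signatures. A `None` result for a file stored under `wma/` would be a good candidate to attach to this issue as a failing example.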

flooie avatar Oct 06 '25 15:10 flooie

@quevon24 So far there are only 28 rows whose original file path does not start with mp3 or wma. You could test Magika with those; see the last query below.

courtlistener=> SELECT count(*) from audio_audio where local_path_original_file like 'mp3%';
 count 
-------
 84945
(1 row)

courtlistener=> SELECT count(*) from audio_audio where local_path_original_file like 'wma%';
 count 
-------
 13190
(1 row)

courtlistener=> 
SELECT date_created, id, local_path_original_file 
from audio_audio 
where local_path_original_file not like 'mp3%' and local_path_original_file not like 'wma%' 
order by 1 desc;

         date_created          |  id   |                               local_path_original_file                                
-------------------------------+-------+---------------------------------------------------------------------------------------
 2023-02-06 19:30:38.940528+00 | 84992 | nsf/pages/unavailable2/2023/02/06/john_doe_v._judith_rodgers.nsfpagesunavailable2
 2022-09-07 17:31:06.390409+00 | 82373 | nsf/pages/unavailable1/2022/09/07/thomas_hammett_v._janet_yellen.nsfpagesunavailable1
 2021-06-11 00:31:32.899771+00 | 77029 | jpg/2021/06/10/mohammed_abdelsalam_v._merrick_garland.jpg
 2020-01-14 19:31:14.571191+00 | 67935 | nsf/2020/01/13/aaron_ball_v._george_washington_university_1.nsf
 2019-09-27 17:31:07.819259+00 | 65606 | nsf/2019/09/27/flat_wireless_llc_v._fcc.nsf
 2019-03-08 16:31:32.156948+00 | 62026 | m4a/2019/03/07/feng_v._univ_of_delaware.m4a
 2019-02-13 22:31:03.72909+00  | 61662 | m4a/2019/02/13/obasi_investment_v._tibet_pharm_inc.m4a
 2019-01-28 22:30:51.989352+00 | 61201 | m4a/2019/01/24/united_states_v._payano_incomplete_recording_1.m4a
 2019-01-25 16:32:40.049046+00 | 61190 | m4a/2019/01/24/furgess_v._pa_dept_corrections.m4a
 2019-01-25 16:31:41.236007+00 | 61189 | m4a/2019/01/24/hess_v._comm_social_security_1.m4a
 2019-01-24 16:31:57.753293+00 | 61166 | m4a/2019/01/23/deutsche_bank_v._bendex_properties.m4a
 2019-01-16 16:33:25.290489+00 | 60996 | m4a/2019/01/15/charte_v._america_tutor_inc.m4a
 2019-01-16 16:31:50.802025+00 | 60995 | m4a/2019/01/15/guariglia_v._united_food_commercial.m4a
 2018-11-07 22:32:20.922348+00 | 59646 | m4a/2018/11/07/sec_v._gentile_1.m4a
 2018-04-24 21:31:40.029489+00 | 36125 | m4a/2018/04/24/united_states_v._schonewolf.m4a
 2018-04-24 21:30:56.598989+00 | 36124 | m4a/2018/04/24/united_states_v._wegeler.m4a
 2018-04-23 21:31:20.699798+00 | 36112 | m4a/2018/04/23/in_re_energy_future_holdings.m4a
 2018-01-05 21:33:05.887324+00 | 33904 | m4a/2017/12/12/united_states_v._city_of_pittsburgh.m4a
 2017-09-25 20:33:22.086373+00 | 31945 | rm/2017/09/22/ameren_services_company_v._federal_energy_regulatory_commission.rm
 2016-10-06 17:31:56.691762+00 | 26058 | nsf/pages/unavailable2/2016/10/06/sealed_case.nsfpagesunavailable2
 2016-04-15 19:34:01.172686+00 | 15872 | rm/2016/04/15/flamingo_las_vegas_operating_v._nlrb.rm
 2016-04-15 19:32:11.798292+00 | 15870 | rm/2016/04/15/state_of_west_virginia_v._hhs.rm
 2016-04-15 19:31:46.77508+00  | 15869 | rm/2016/04/15/takeda_pharmaceuticals_u.s.a._v._sylvia_burwell.rm
 2015-10-07 20:38:10.294664+00 | 13713 | nsf/Pages/Unavailable1/2015/10/02/frank_morello_v._dc.nsfPagesUnavailable1
 2015-06-04 16:38:51.017465+00 | 12732 | MP3/2015/06/04/shukh_v._seagate_technology_llc_combined_w2014-1406.MP3
 2015-05-07 03:12:13.363916+00 | 12372 | nsf/Pages/Unavailable1/2015/05/05/samuel_st._john_v._jeh_johnson.nsfPagesUnavailable1
 2014-11-18 20:20:16.107161+00 |  9046 | MP3/2014/10/06/sarah_mcivor_v._credit_control_services_inc..MP3
 2014-11-15 18:18:36.909428+00 |  4413 | MP3/2014/05/13/occidental_fire__casualty_co._v._adam_soczynski.MP3
(28 rows)
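To group rows like these by file type, the extension can be taken from the end of the path and lowercased so that `MP3` and `mp3` count as the same type. This is a minimal sketch over a handful of paths copied from the output above; `normalized_extension` is a hypothetical helper, not an existing Doctor or CourtListener function.

```python
from collections import Counter
from pathlib import PurePosixPath


def normalized_extension(local_path: str) -> str:
    """Return the lowercased extension of a stored path, without the dot."""
    return PurePosixPath(local_path).suffix.lstrip(".").lower()


# A few values of local_path_original_file taken from the query output.
sample_paths = [
    "m4a/2019/03/07/feng_v._univ_of_delaware.m4a",
    "rm/2016/04/15/state_of_west_virginia_v._hhs.rm",
    "MP3/2015/06/04/shukh_v._seagate_technology_llc_combined_w2014-1406.MP3",
    "nsf/2019/09/27/flat_wireless_llc_v._fcc.nsf",
    "jpg/2021/06/10/mohammed_abdelsalam_v._merrick_garland.jpg",
]

# Tally how many sample rows fall under each normalized extension.
counts = Counter(normalized_extension(p) for p in sample_paths)
```

Lowercasing before counting folds the three `MP3` rows into the regular mp3 group, which leaves only the m4a, rm, nsf and jpg rows to inspect by hand.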

grossir avatar Nov 04 '25 19:11 grossir

Thanks @grossir.

I already checked those rows and they appear to be correct.

Extensions:
rm - RealMedia audio file
MP3 - mp3 in caps
m4a - MPEG-4 audio file

The only problem is with nsf files, but this seems to happen because there is no recording available, e.g. https://media.cadc.uscourts.gov/recordings/bydate/2019/9

https://www.courtlistener.com/api/rest/v4/audio/?id=65606

That audio in CL was created on 2019-09-27, but it wasn't until 2024 that the scraper was updated to verify that the URL contained the .mp3 extension.
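A check of that kind can be sketched as follows. This is an assumption about the general shape of such a check, not the scraper's actual code, and `url_has_extension` is a hypothetical name.

```python
from urllib.parse import urlparse


def url_has_extension(url: str, extension: str = ".mp3") -> bool:
    """Return True if the URL's path ends with the given extension.

    Only the path component is inspected, so query strings and
    fragments do not affect the result. The comparison ignores case.
    """
    path = urlparse(url).path
    return path.lower().endswith(extension.lower())


# URLs that appear in this thread.
listing_page = "https://media.cadc.uscourts.gov/recordings/bydate/2019/9"
audio_file = "https://www.ca9.uscourts.gov/datastore/media/2021/06/10/19-72018.mp3"

is_listing_audio = url_has_extension(listing_page)
is_file_audio = url_has_extension(audio_file)
```

With a check like this, a listing page that has no recording would be skipped instead of being saved under an nsf path.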

Regarding the audio with the jpg extension (https://www.courtlistener.com/api/rest/v4/audio/?id=77029), I was unable to replicate it (the backscraper does not return that result), but this appears to be the correct URL for the file: https://www.ca9.uscourts.gov/datastore/media/2021/06/10/19-72018.mp3

@flooie Do you have any examples showing the problem of the file not being recognized or classified as application/octet-stream?

quevon24 avatar Nov 04 '25 23:11 quevon24