rails Fix hanging test due to interlocking(?)

Alright, I'm not going to pretend that I understand exactly what's happening, but the fact is that this seems to fix something, so I'll try to explain what I understand.

The problem

Running the following test hangs forever:

$ bin/test test/engine/commands_test.rb:62
Run options: -v --seed 4349

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

Somehow I seem to have reproduced it with the command that would run on CI:

$ TEST_DIR=engine TESTOPTS="-v -n test_server_command_broadcast_logs"  bundle exec rake test

--- test/engine/commands_test.rb
/Users/sto/.rubies/ruby-3.3.5/bin/ruby -w -Itest -Ilib -I../activesupport/lib -I../actionpack/lib -I../actionview/lib -I../activemodel/lib test/engine/commands_test.rb
Run options: -v -n test_server_command_broadcast_logs --seed 59220

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 💥 Hangs here

I'm gonna venture a guess and assume that maybe this test (and 3 others) do not run in CI because of this condition:

https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/engine/commands_test.rb#L36

https://github.com/rails/rails/blob/aacbb5c0f5bdd11f0dee78da03bd6859f0cabeba/railties/test/console_helpers.rb#L23-L25

Is there a way we can check if these tests ever ran on CI?

What seems to be happening

I do not understand much of how PTY works, but what I have identified is that:

The test hangs at net.get("/"). Using a bit of debugging, I confirmed that I was not able to curl http://localhost:3000 after spawning rails server.
It appears the process is entering a lock where:
- it will not continue processing and complete the request until rails server's output has been "consumed" (read, readlines) out of the IO object returned by PTY.new.
- we won't be reading more output out of primary since net.get("/") is blocked

The workaround

I have no idea whether this is an acceptable fix, but parallelizing, and allowing our test thread to try and read (blocking) the PTY's output, while a different thread makes the HTTP request (blocking too), seems to be fixing the problem.

You can try running any of the two commands I shared above and confirm the test will succeed now.

Questions

Is there a way we can check if these tests ever ran on CI?
Is my fix an acceptable approach?
Has this test ever worked in the past? (Locally/On CI)
Has anything changed (maybe in Puma), making Puma's output blocking, and preventing it to continue processing a request until all its output has been consumed?

Motivation / Background

This Pull Request has been created because [REPLACE ME]

Checklist

Before submitting the PR make sure the following are checked:

[x] This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
[x] Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
[x] Tests are added or updated if you fix a bug or add a feature.
[x] CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

Oct 07 '24 08:10 davidstosik

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

Oct 19 '24 03:10 zzak

I was under the impression that buildkite/ci was using a tty and therefor these tests were running. @yahonda can probably verify that.

No you're right. I made a guess but I was wrong. For example on the CI build for this branch, I can see the test/engine/commands_test.rb test file has 6 runs, which matches.(here)

Now the question is, why do those tests do not hang on CI? @zzak I'm sure you've you tried running them locally before, do they still run on your machine?

Oct 19 '24 13:10 davidstosik

I'm unable to reproduce it locally:

bin/test -v test/engine/commands_test.rb:62
Run options: -v --seed 6467

# Running:

Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 2.14 s = .

Finished in 2.145408s, 0.4661 runs/s, 2.7967 assertions/s.
1 runs, 6 assertions, 0 failures, 0 errors, 0 skips

I wonder if it has something to do with the spawn_command in that file being a unique case, compared to the other tests. That is just a hunch though. https://github.com/rails/rails/blob/b8720e7cb8c9894d142b8576c21065c92cd1907a/railties/test/isolation/abstract_unit.rb#L345

Oct 19 '24 23:10 zzak

@zzak I just rebooted my laptop to be sure. Haven't got any Rails app running. The test still hangs.

(macOS Sequoia, Ruby 3.3.5)

While the test is hanging, I can see ruby listening to the port 3000, which disappears as soon as I kill the hanging test:

$  lsof -i -P | grep LISTEN | grep ruby
ruby      4656  sto    7u  IPv4 0x3ba087057954dfb8      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    8u  IPv6 0xc7ccd4d0b84fd561      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto    9u  IPv4 0x4593af10ba1eeabc      0t0  TCP localhost:3000 (LISTEN)
ruby      4656  sto   10u  IPv4 0x405f2a704fdff54e      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)
ruby      4656  sto   11u  IPv4 0xe41de747218cb552      0t0  TCP connectivity-check.warp-svc:3000 (LISTEN)

Update: tried on a different Mac. Same. Will try on a Linux machine soon.

Update again: took a while to set up and run, but yeah, this test passes on Ubuntu 24.04 with Ruby 3.3.5. Takes something like 13 seconds though.

Oct 21 '24 07:10 davidstosik

I'm on linux, but I remember there being some differences in the native threads they used. Worst case, we can try to spike on a macos runner on GH actions to get it to repro.

Oct 21 '24 11:10 zzak

Yep, seems to be a macOS issue, since I managed to get it to work on Linux too.

Oct 21 '24 12:10 davidstosik

@davidstosik - if you add a signal handler that dumps the thread backtraces, it might help track down where the test is hanging.

Something like:

Signal.trap "SIGPROF" do
  Thread.list.each do |thread| 
    puts thread.name; 
    puts thread.backtrace; 
    puts
  end
end

Then send a PROF signal to the test process when it is hanging.

Oct 22 '24 10:10 djmb

Thanks @djmb! I added this to the beginning of the hanging test:

      Signal.trap "SIGPROF" do
        Thread.list.each do |thread| 
          puts "🌊" * 10
          puts thread.name
          puts thread.backtrace
          puts
        end
      end

Ran the test until it hands, then executed kill -s SIGPROF {the hanged process pid}. Results are:


Rails::Engine::CommandsTest#test_server_command_broadcast_logs = 🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `backtrace'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:67:in `block (2 levels) in test_server_command_broadcast_logs'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `each'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:64:in `block in test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `wait_readable'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:229:in `rbuf_fill'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:199:in `readuntil'
/Users/sto/.gem/ruby/3.3.1/gems/net-protocol-0.2.2/lib/net/protocol.rb:209:in `readline'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2342:in `block in transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `catch'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2333:in `transport_request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2306:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2299:in `block in request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
<internal:kernel>:90:in `tap'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:76:in `test_server_command_broadcast_logs'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:94:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:190:in `capture_exceptions'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:89:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:367:in `time_it'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/test.rb:88:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:1207:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:446:in `run_one_method'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:433:in `block (2 levels) in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `each'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:429:in `block in run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:471:in `on_signal'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:458:in `with_info_handler'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:428:in `run'
/Users/sto/src/github.com/rails/rails/railties/lib/rails/test_unit/line_filtering.rb:10:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `block in __run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `map'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:331:in `__run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:287:in `run'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest.rb:85:in `block in autorun'

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊

<internal:thread_sync>:18:in `pop'
/Users/sto/.gem/ruby/3.3.1/gems/minitest-5.25.1/lib/minitest/parallel.rb:30:in `block (2 levels) in start'

# (abridging: the backtrace above appears 11 times) 

🌊🌊🌊🌊🌊🌊🌊🌊🌊🌊
Timeout stdlib thread
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `sleep'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `wait'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:110:in `block (2 levels) in create_timeout_thread'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `synchronize'
/Users/sto/.gem/ruby/3.3.1/gems/timeout-0.4.1/lib/timeout.rb:108:in `block in create_timeout_thread'

(Most threads don't seem to have a name., but at least I have traces.)

Looking at the first thread's backtrace:

...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
...
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1570:in `start'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:2297:in `request'
/opt/rubies/3.3.1/lib/ruby/3.3.0/net/http.rb:1917:in `get'
/Users/sto/src/github.com/rails/rails/railties/test/engine/commands_test.rb:77:in `block in test_server_command_broadcast_logs'
...

It does look like the test is stuck on net.get("/").

Oct 29 '24 01:10 davidstosik

I wonder if just adding a sleep after spawning the process also solves it for you, like the process may have output "Listening on..." but it's not accepting connections yet causing the request to hang.

Oct 29 '24 02:10 zzak

@zzak I tried adding sleep 5 after assert_output("Listening on", primary), before Net::HTTP.new, but the test is still hanging.

I'm still not pretending I understand whatever's happening with PTY, but for the record, the change below would work as well.

I'm not suggesting we should use it, but it might help understand where the problem is?

diff --git i/railties/test/console_helpers.rb w/railties/test/console_helpers.rb
index 10d3a54602..b871051aad 100644
--- i/railties/test/console_helpers.rb
+++ w/railties/test/console_helpers.rb
@@ -12,7 +12,7 @@ def assert_output(expected, io, timeout = 10)
     output = +""
     until output.include?(expected) || Time.now > timeout
       if IO.select([io], [], [], 0.1)
-        output << io.read(1)
+        output << (io.read(1) || "")
       end
     end
 
diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..1289759de3 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,23 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
       primary, replica = PTY.open
-      pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
-      assert_output("Listening on", primary)
+      out_io = File.open("test_out", "w")
+      pid = spawn_command(
+        "server", replica, env: { "RAILS_ENV" => "development" },
+        out_io: out_io
+      )
+
+      assert_output("Listening on", File.open("test_out", "r"))
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
@@ -73,7 +87,7 @@ def test_server_command_broadcast_logs
         assert_match("Processing by Rails::WelcomeController", logs)
       end
 
-      assert_output("Processing by Rails::WelcomeController", primary)
+      assert_output("Processing by Rails::WelcomeController", File.open("test_out", "r"))
     ensure
       kill(pid)
     end
@@ -84,9 +98,10 @@ def plugin_path
       "#{@destination_root}/bukkits"
     end
 
-    def spawn_command(command, fd, env: {})
+    def spawn_command(command, fd, env: {}, out_io: nil)
+      out_io ||= fd
       in_plugin_context(plugin_path) do
-        Process.spawn(env, "bin/rails #{command}", in: fd, out: fd, err: fd)
+        Process.spawn(env, "bin/rails #{command}", in: fd, out: out_io, err: fd)
       end
     end

As you can see, if I spawn the process to write into a file instead of the replica file descriptor returned by PTY.open, and if I read to that file instead of reading from the primary file descriptor, then the test works. (Whereas, I checked, it still does not work for me on macOS.)

Oct 30 '24 13:10 davidstosik

I wrote a lot more than what's below, but now I've identified the issue more clearly, what's next should suffice.

If you want to see the process that got me here, you can expand this.

Here's another experiment: on main, apply this diff:

diff --git i/railties/test/engine/commands_test.rb w/railties/test/engine/commands_test.rb
index 3e888ce0d8..a740054139 100644
--- i/railties/test/engine/commands_test.rb
+++ w/railties/test/engine/commands_test.rb
@@ -60,9 +60,18 @@ def test_server_command_work_inside_engine
     end
 
     def test_server_command_broadcast_logs
+      in_plugin_context(plugin_path) do
+        puts
+        puts `pwd`.strip
+        puts "Press Enter to continue"
+        $stdin.gets
+      end
+
       primary, replica = PTY.open
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
+
       assert_output("Listening on", primary)
+      puts "Press Enter to continue"
 
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")

It outputs the path to the dummy app used in the test, then pauses the test in two places:

Before starting the Rails server.
Before issuing the HTTP request to /.

This is how it goes:

Run the test.
At the first pause, I can open a second terminal and run tail -F {dummy app path}/test/dummy/log/development.log to see what the Rails app outputs in real time.
The file starts empty.
Then I hit Enter in the test process, to proceed to the Rails application start, and stop before making the HTTP request. The log file stays empty.

Then I press Enter again, and the tail -F process watching the log file now outputs this:

Started GET "/" for 127.0.0.1 at 2024-10-30 22:55:38 +0900
   (0.8ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
   (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
  ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
Processing by Rails::WelcomeController#index as */*
  Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 3.6ms | GC: 3.2ms)

Then the test process hangs.
Finally, I hit Ctrl-C on the test process.
Once does nothing, so I do a second time.
The test process is killed.

At the same time, this line gets added at the bottom of the log file:

Completed 200 OK in 71931ms (Views: 71890.0ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.2ms)

I don't know what it is, but it appears that the Rails server receives the request, renders its response, but is unable to complete?

If, instead of resuming the test execution after the second pause, I browse to http://localhost:3000, I get the same behaviour: Rails renders the page, but does not complete. On the browser side, spinner never stops. As soon as I kill the test, my browser shows the expected Rails Welcome page.

[!IMPORTANT]

TL;DR: macOS's pseudo terminal internal buffer size (that's a mouthful) is only 1kB, and that can't even accommodate the output generated by rails server booting and serving a single request. In other common OSes, for example Debian-based Linux, it'll be way more than enough, like 19kB, 131kB... Since we can't guarantee that buffer size will not decrease in the future, or that Rails server's output will not outgrow that size, I think the fix I originally suggested (use a thread so we can consume the output as it is produced) is justified.

Here's one last experiment. Just add primary.read(196) before sending the HTTP request:

diff --git c/railties/test/engine/commands_test.rb i/railties/test/engine/commands_test.rb
index 3e888ce0d8..1781650704 100644
--- c/railties/test/engine/commands_test.rb
+++ i/railties/test/engine/commands_test.rb
@@ -64,6 +64,8 @@ def test_server_command_broadcast_logs
       pid = spawn_command("server", replica, env: { "RAILS_ENV" => "development" })
       assert_output("Listening on", primary)
 
+      primary.read(196)
+
       Net::HTTP.new("127.0.0.1", 3000).tap do |net|
         net.get("/")
       end

This is another one of those things that are not a proper fix, but help understanding what's happening.

I empties primary buffer from all the stuff Rails outputs at startup, and apparently, that's all Rails needs to be able to continue serving the request and completing the response, making the test pass successfully.

Understanding a bit more about the issue, I wrote this simple Ruby script:

require "pty"

(500..).each do |size|
  size = size*10
  puts "Spawning a process that outputs #{size} characters."
  primary, replica = PTY.open
  pid = Process.spawn("echo -n #{'a' * size}", in: replica, out: replica, err: replica)
  Process.wait(pid)
  sleep 0.1
end

Running this script will keep spawning an echo aaaa command with more as on a new PTY, until it hangs. It hangs at 1030 characters. I dialed it down and identified that spawning a process that echoes 1024 characters works, but spawning a process that echoes 1025 characters will hang!

So PTY has a 1kB buffer and hangs when that buffer is full? Is this a macOS limitation? What would be that number on Linux? Can we change it?

I found this superuser post: Ptys (Pseudo terminals) internal buffer size, which as script that confirms my assumptions:

On macOS, the pseudo terminal internal buffer size is 1kB (confirmed by running ptsbufsize.py).
On Debian, it varies depending on the version, but seems to be way higher (19kB, 131kB...)
Overall, I guess we can't really trust it to be big enough?

Going back to our test, here's what the Rails server outputs:

When it starts:

=> Booting Puma
=> Rails 8.1.0.alpha application starting in development
=> Run `bin/rails server --help` for more startup options
Puma starting in single mode...
* Puma version: 6.4.3 (ruby 3.3.1-p55) ("The Eagle of Durango")
*  Min threads: 3
*  Max threads: 3
*  Environment: development
*          PID: 76984
* Listening on http://127.0.0.1:3000
* Listening on http://[::1]:3000
* Listening on http://127.255.255.0:3000
* Listening on http://127.0.2.2:3000
* Listening on http://127.0.2.3:3000
Use Ctrl-C to stop

That's 533 characters.

When it receives the request to /:

Started GET "/" for 127.0.0.1 at 2024-10-30 23:58:29 +0900
   (0.9ms)  CREATE TABLE "schema_migrations" ("version" varchar NOT NULL PRIMARY KEY) /*application='Dummy'*/
   (0.1ms)  CREATE TABLE "ar_internal_metadata" ("key" varchar NOT NULL PRIMARY KEY, "value" varchar, "created_at" datetime(6) NOT NULL, "updated_at" datetime(6) NOT NULL) /*application='Dummy'*/
  ActiveRecord::SchemaMigration Load (0.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application='Dummy'*/
Processing by Rails::WelcomeController#index as */*
  Rendering /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /Users/sto/src/github.com/rails/rails/railties/lib/rails/templates/rails/welcome/index.html.erb (Duration: 0.5ms | GC: 0.1ms)
Completed 200 OK in 67ms (Views: 2.8ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.7ms)

That's 1023 characters (including color escaping codes). It's a freaky coincidence that that number would be just below the 1kB limit!

Calling assert_output "eats" all characters in the buffer until it can find the string it looked for. So assert_output("Listening to") will eat 327 characters, which explains why I can fix the test with primary.read(196)(533-327 = 196).

So what can we do?

Add primary.read(196) to consume as much output as possible before we launch the HTTP request. I've confirmed above that it works.
Use an assert_output that will "eat" more of the early output. Ideally, we can assert_output("Use Ctrl-C to stop\r" to consume all the Rails server's boot output.

In my situation however, the size of the Rails server output is so big (it depends on the paths where the partials are), that even not including \r from the assert_output above would make the output bigger than 1024 characters long, hanging the process.

I tried cloning the Rails repository in a shorter path, to see if my long paths were to blame:

gh repo clone rails/rails /tmp/rails
cd /tmp/rails/railties
bin/test -v test/engine/commands_test.rb:62

Unfortunately, the output is still too long without a fix, and the test hangs.

It looks like all the fix ideas I could come up with above are limited, and not guaranteed to continue to work in the future:

what if the pseudo terminal buffer size is decreased again in the future?
what if Rails server outputs more stuff to the screen when serving a request?

In conclusion, I think the fix I'm suggesting in this PR is reasonable: run the HTTP request in a separate thread so we can keep consuming the rails server process' output as it is produced.

Oct 30 '24 14:10 davidstosik

@davidstosik Thanks for your detailed investigation.

I was able to reproduce the issue using a macos runner on github actions, and indeed your fix to use a thread to make the http request works well: https://github.com/zzak/rails/actions/runs/11639390507

Here is my config for reference:

name: Debug railtie test on macos

on:
  push:
  workflow_dispatch:

jobs:
  test:
    runs-on: macos-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3

      - name: setup-ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: 3.3
          bundler-cache: true
        env:
          BUNDLE_WITHOUT: "db:job"

      - name: run test
        env:
          BUNDLE_WITHOUT: "db:job"
        run: |
          cd railties
          bin/test test/engine/commands_test.rb:62

It indeed seems like this issue is due to pty, and there are few tests which rely on it, none of them are making a consecutive http request. I was thinking this might affect other tests, or that we should try to make it re-useable in case this comes up again. But after running the majority of the tests on that runner, and checking server_test, console_test, and dbconsole_test, this seems like a rare case and this patch is probably safe to merge.

Nov 03 '24 00:11 zzak

We can generalize if/when we find other hanging tests later. 👍🏻

Nov 03 '24 06:11 davidstosik

Great investigation on this!

I do think we should still go with a thread wrapping the get, even with the pipe: we're in less danger of platform variance etc, but I think your first instinct is still correct, in that it's "wrong" to perform a blocking operation and just hope that there's enough buffer space to handle whatever it produces before we start reading.

Nov 04 '24 07:11 matthewd

@matthewd Thank you for the review! I checked IO.pipe, and indeed, it also has a buffer size limit, although way bigger than PTY: it's 64kB, which feels a lot more comfortable than PTY's 1kB!

Happy to restore the earlier threaded approach if that still feels better.

Nov 04 '24 08:11 davidstosik

rails rails copied to clipboard

Fix hanging test due to interlocking(?)

The problem

What seems to be happening

The workaround

Questions

Motivation / Background

Checklist

rails
rails copied to clipboard