Bm_printf / serial payload crashes

I am running a customized mote firmware that I assembled using pieces of the bm_protocol user_code.cpp examples: hello_world, serial_payload_example and serial_payload_guide. The main difference is that I cannot do line-oriented input, mostly because my sensor sends a blank line immediately following every data line. Instead I am doing character-oriented input. In the simplest terms, I have swapped PLUART::lineAvailable() and PLUART::readLine() for PLUART::byteAvailable() and PLUART::readByte().
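
In rough terms, the new read loop looks something like this (a simplified sketch, not my exact code; payload_buffer and payload_size are the globals shown further down):

// Simplified sketch of the character-oriented read (not the exact firmware):
// drain whatever the payload UART has buffered on this pass through the loop
while (PLUART::byteAvailable()) {
  uint8_t byte = PLUART::readByte();
  if (payload_size < sizeof(payload_buffer)) {
    payload_buffer[payload_size++] = byte;
  }
}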

That all seems to work beautifully. Except when it crashes.

During longer tests the buoy drops a measurement every fifth reading, like clockwork. By watching the bmdk terminal output, the spotter terminal output, and what's written to the SD card, I know the whole buoy is rebooting when this happens. My best guess is a segmentation fault in my mote firmware.

In these lines...

// Print the payload data to a file, to the bm_printf console, and to the printf console.

bm_fprintf(0, "sami_data.log", "tick: %" PRIu64 ", rtc: %s, line: %.*s\n", uptimeGetMs(), rtcTimeBuffer, payload_size, payload_buffer);

bm_printf(0, "[sami] | tick: %" PRIu64 ", rtc: %s, line: %.*s", uptimeGetMs(), rtcTimeBuffer, payload_size, payload_buffer);

printf("[sami] | tick: %" PRIu64 ", rtc: %s, line: %.*s\n", uptimeGetMs(), rtcTimeBuffer, payload_size, payload_buffer);

...it is the bm_printf() line that crashes things. Which is driving me nuts. The bm_fprintf() line successfully writes its output to the SD card, and I believe the next line is trying to write a message to the bmdk terminal. The printf() line, if it got that far, would send its message to the spotter terminal.

My current theory is that something about the payload_buffer isn't properly reset after each measurement. Simply resetting payload_size to 0 isn't having the intended effect. So, when enough input has been received (in this context it happens predictably at the fifth measurement received from the payload), it crashes.

I can share the entire firmware source if necessary (it's only 153 lines right now), but what is it about my payload_buffer or payload_size, which are defined as...

// A buffer for our data from the payload uart
u_int8_t payload_buffer[2048];
static u_int32_t payload_size = 0;

...that could be causing a segfault on that second line when the first line is perfectly happy?

[Heads-up to @Taylor.gill who will be following this convo too.]

Mike J+

@zack_j: follow-up from our email.

Please let me know if I can provide any more information.

Hi @mike_j and @Taylor.gill! The whole buoy resetting would point to a crash on the Spotter, not on the mote. It then makes a bit more sense that the crash is happening at the bm_printf() since this is the function that sends the output to Spotter's terminal. The printf() line is sending it to the bmdk terminal. I am curious what would happen if you comment out the bm_printf() line. Does the crash go away? Does the printf line output anything?

Do you know what characters are being sent when Spotter crashes? What do the logs/outputs look like right before the crash occurs? Is the fifth reading in the sami_data.log?

If you are using PLUART::byteAvailable() and PLUART::readByte(), how are you determining the end of a reading? Is there a special character that the sensor sends, or is it a specific number of characters? Are there potentially variable lengths? Are you 100% sure the payload_size is reset to 0 each time? What happens if you do something like:

memset(payload_buffer, 0, sizeof(payload_buffer));

That way you know it is starting from a cleared state.
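
For example, right after each transmission (just a sketch of where the reset could go; place it wherever your code currently does payload_size = 0):

// Clear both the count and the contents so the next reading
// starts from a known-empty buffer (sketch only)
payload_size = 0;
memset(payload_buffer, 0, sizeof(payload_buffer));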

I hope some of these questions/thoughts are helpful! Good luck debugging and keep us posted! If you feel like it, you can share your code and I (or someone else) can look through it and see if there is anything else we can help with!

Victor

I just realized I left out what was probably the most useful part of the forensics... the output to the spotter terminal when it crashes. See below.

But in answer to your questions, YES the string makes it into the SD card file okay.

Note that our payload lines are 400+ characters so (to be conservative) I have been breaking lines at 295 characters for cell-transmission-sized pieces. In this example the first half of the payload output makes it to the SD card, bmdk terminal, spotter terminal, and cell transmission okay. It is the second part of the line that crashes during spotter terminal output.

The sensor sends \r\r after each measurement, i.e. two CR or ASCII 13 characters. But my code tests against both \r and \n. If it finds them, it doesn't add them to the buffer, and if there's anything in the buffer (i.e. not a blank line) it triggers a transmission. The other transmission/printout trigger is when the buffer reaches 295 chars.
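
In rough pseudocode the per-character handling looks like this (a sketch only; handlePayloadByte() and sendMeasurement() are illustrative stand-ins, the latter for the bm_fprintf/bm_printf/printf block shown earlier):

// Sketch of the per-character handling, not the exact firmware
void handlePayloadByte(uint8_t byte) {
  if (byte == '\r' || byte == '\n') {
    // Terminator: only transmit if the buffer isn't empty,
    // so the sensor's trailing blank line is ignored
    if (payload_size > 0) {
      sendMeasurement(payload_buffer, payload_size);
      payload_size = 0;
    }
  } else {
    payload_buffer[payload_size++] = byte;
    // Cell-transmission-sized chunk: flush once 295 characters have accumulated
    if (payload_size >= 295) {
      sendMeasurement(payload_buffer, payload_size);
      payload_size = 0;
    }
  }
}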

Here are the last two lines in the SD card file before a crash:

2024-04-30T14:29:38.585Z | tick: 2581147, rtc: 2024-04-30T14:29:38.324, line: :1DBE70AE256B7F1189E2509207222C31FE42512207022B91FE1250F207622B71FE12519207B22BA1FE2250E205222AF1FE1250E176B22B11FBC25120B4D22B31F5A250E048022AF1EDB251803A022AD1EC52510061C22AF1F0B2510096822AF1F4B25140D2B22B01F762518114822B31FA02512149E22B11FB42516179722B41FD0251119A522B01FD425151B3122B21FD3251A
2024-04-30T14:29:38.683Z | tick: 2581170, rtc: 2024-04-30T14:29:38.347, line: 1C7622B11FE5251D1D5522B51FDB25191E0C22AE1FE625171E8C22AC1FDB25141EF822AC1FE5251A1F5822B01FED25181F9D22B11FE8251A1FC822AB1FE8251E1FFB22B61FE7251B1FFF22B21FCE3B013C4F1899D6

And here are the last lines printed to the Spotter terminal output when it crashes:

hum_temp | tick: 2572189, rtc: 2024-04-30T14:29:29.367, hum: 41.159645, temp: 25.212315^M
pressure | tick: 2574198, rtc: 2024-04-30T14:29:31.375, temp: 25.217985, pressure: 1016.935608^M
power | tick: 2574205, rtc: 2024-04-30T14:29:31.382, addr: 67, voltage: 24.001600, current: 0.000000^M
power | tick: 2574207, rtc: 2024-04-30T14:29:31.382, addr: 65, voltage: 23.998400, current: -0.000500^M
Set RTC to 2024-04-30T14:29:35.210^M
[sami] | tick: 2581147, rtc: 2024-04-30T14:29:38.324, line: :1DBE70AE256B7F1189E2509207222C31FE42512207022B91FE1250F207622B71FE12519207B22BA1FE2250E205222AF1FE1250E176B22B11FBC25120B4D22B31F5A250E048022AF1EDB251803A022AD1EC52510061C22AF1F0B2510096822AF1F4B25140D2B22B01F762518114822B31FA02512149E22B11FB42516179722B41FD0251119A522B01FD425151B3122B21FD3251A^M
[sami-debug] | tick: 2581148, rtc: 2024-04-30T14:29:38.324, chars: 296, charsize: 1, totsize: 296, line: :1DBE70AE256B7F1189E2509207222C31FE42512207022B91FE1250F207622B71FE12519207B22BA1FE2250E205222AF1FE1250E176B22B11FBC25120B4D22B31F5A250E048022AF1EDB251803A022AD1EC52510061C22AF1F0B2510096822AF1F4B25140D2B22B01F762518114822B31FA02512149E22B11FB42516179722B41FD0251119A522B01FD425151B3122B21FD3251A^M
[sami-debug] | tick: 2581149, rtc: 2024-04-30T14:29:38.324 | Successfully sent Spotter transmit data request^M
[sami] | tick: 2581170, rtc: 2024-04-30T14:29:38.347, line: 1C7622B11FE5251D1D5522B51FDB25191E0C22AE1FE625171E8C22AC1FDB25141EF822B61FE7251B1FFF22B21FCE3B013C4F1899D6^M
8251A1FC822AB1FE8251E1FFBException in thread rx:^M
Traceback (most recent call last):^M
File "/Users/mike.j/Library/Python/3.11/lib/python/site-packages/serial/serialposix.py", line 575, in read^M
^M
— exit —^M
buf = os.read(self.fd, size - len(read))^M
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^M
OSError: [Errno 6] Device not configured^M

You can do the same puzzle-piecing that I did, and you will see that there is clearly some kind of payload_buffer corruption: the middle of that final payload string is missing but part of it reappears on a second line in the output. That's why at first I was wondering if there was some kind of control character (vertical tab, form feed, something unexpected) that the bm_fprintf() managed to ignore but bm_printf() couldn't forgive. But a later revision tests for that in the payload, doesn't find it, and still crashes.
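
(Roughly, that test is just a scan of the buffer for anything outside printable ASCII before the print calls, something like the sketch below; found_unprintable is an illustrative name, not the actual variable.)

// Rough sketch of the control-character check: flag anything outside
// printable ASCII before the buffer is handed to bm_printf()
bool found_unprintable = false;
for (uint32_t i = 0; i < payload_size; i++) {
  if (payload_buffer[i] < 0x20 || payload_buffer[i] > 0x7E) {
    found_unprintable = true;
    break;
  }
}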

It just breaks my brain what is so terribly wrong with the payload_buffer array that it kills the Spotter but not the output to the SD card.

btw thanks for the memset suggestion, @Taylor.gill is trying that workaround today. Ideally I'd prefer to have some idea of why things went toes-up in the earlier revisions but I would settle for a version that mysteriously and inexplicably resolves the problem.

Mike J+

Hmmm that is very strange behavior. Both the bm_printf() and bm_fprintf() should be referring to the same buffer, and they both use a buffer + length to write to the SD card/print to terminal output (i.e., they shouldn't differ in what they "allow" to be printed). Perhaps something is happening to the buffer on its way to the Spotter that is corrupting it. Or the Spotter is corrupting the received buffer somehow (I hope it's not this :sweat_smile:, but may be possible).

Since the crash is on Spotter can you see if there are any coredump files in the log directory on the SD card? If so could you please send the SD card contents (or just the coredumps) via email? I can then look into the coredump and see exactly where the crash is and hopefully gain more insight into the cause.

Has the memset done anything? Have you tried commenting out the bm_printf()? I would be curious to see if the crash moves from the Spotter to the mote when it tries to printf(). This would point more towards something in the buffer on the mote, rather than something corrupting the buffer as it makes its way to Spotter, or Spotter corrupting the buffer.

Victor

One thing you could try to add is:

  // Turn off the PLUART line buffer
  PLUART::setUseLineBuffer(false);

This will make sure that the line buffer is not used and you will not accidentally overflow the line buffer. Otherwise you might be crashing on line 59 of payload_uart.cpp. I'm trying to reproduce your issue and my bmdk was hitting this assert. This is a crash on the bmdk and not on the Spotter, but may be related to what you are observing.
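
That call just goes in your setup() alongside the rest of your PLUART configuration, something like this (sketch only, assuming a setup() based on the serial_payload_example):

void setup(void) {
  // ... your existing PLUART configuration from the serial_payload_example ...

  // Turn off the PLUART line buffer so unread bytes cannot overflow it
  PLUART::setUseLineBuffer(false);
}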

What version of the Spotter/bmdk are you guys using? I am currently testing on the latest development firmware we have and I haven't seen any crashes when I feed the bmdk your example serial lines. I would be interested in reverting my Spotter/bmdk back down to see if I can better reproduce the crashing behavior.

-Victor

Hi @vsowa12,

I ran the first new firmware test with the addition of the memset. The spotter did not crash this time, but again no 5th measurement was received. It looks like it was just skipped over?

At both 5th measurement drops it showed this notification:
2024-05-09T03:32:17.468Z [BRIDGE_SYS] [INFO] Neighbor c26a1077e2872d6a added

I believe that points to the dev kit disconnecting and reconnecting at these message drops?

I can try commenting out the bm_printf() today and see what happens.

We are using version v2.9.0.

I'm probably confused about whether 'version' in this context refers to hardware or software, and of which device.

But fwiw I've been following the bmdk guides and doing

git checkout v0.5.0

as part of my firmware compilation process. If I understood @zachary correctly, it would be possible for us to use later versions (v0.9.1 was mentioned?) only if we "update[d] [our] Bridge and main Spotter firmware for Spotter compatibility as well."

We'd be game to try that but it seemed like a new DevGuide explaining how to do those steps was still in the works.

Just to cover all the bases, I'll mention here that we're currently testing code changes that use your 'PLUART::setUseLineBuffer(false);' suggestion and also a version where all references to bm_printf() are commented out. Those test results are... pending!

Mike J+

So we're still sifting through today's test data. Things we tried:

  • Adding the memset() code.
  • Adding the PLUART::setUseLineBuffer(false) code.
  • Commenting out the bm_printf() calls entirely.
  • Leaving in the bm_printf() and printf() calls but without asking them to read anything from that payload_buffer array. In this version they spit out the current payload_size but no buffer contents:

// Print the payload data to a file, to the bm_printf console, and to the printf console.

bm_fprintf(0, "sami_data.log", "tick: %" PRIu64 ", rtc: %s, line: %.*s\n",
uptimeGetMs(), rtcTimeBuffer, payload_size, payload_buffer);

bm_printf(0, "[sami] | tick: %" PRIu64 ", rtc: %s, line length: %d", uptimeGetMs(), rtcTimeBuffer, payload_size);

printf("[sami] | tick: %" PRIu64 ", rtc: %s, line length: %d\n", uptimeGetMs(), rtcTimeBuffer, payload_size);

The first three versions still crash the bmdk at least. It seems like sometimes the spotter terminal remains connected but the 'tick' numbers from the bmdk always restart from zero and we still lose payload strings at regular intervals.

The fourth and final version appears to work without crashing...?!? More testing is needed but so far we've seen it survive 1.6x longer than any previous test. So, cautiously optimistic.

Mike J+

@Taylor.gill Yes! To me this means that the bmdk is resetting/crashing.

Ah yes, sorry @mike_j, I should have clarified that I meant software. Thank you for pointing that out. I will now try to reproduce your setup with Spotter on v2.9.0, and bmdk on v0.5.0.

We could update your bristlemouth code base to v0.9.1. The dev guide for this is not 100% ready yet, but I can write out rough steps here. It would involve updating the Spotter + Bridge. I think before I do that I would want to see if I can reproduce the crash on the same versions you are running on. That would tell me whether a change in the bristlemouth code base is what fixes your crashes.

Thanks for the latest update @mike_j! Do you have any examples of the payload_size being printed out? Are the payload_sizes matching what you would expect them to be?

Also, are you using (or do you have access to) an ST-LINK debugger/flashing device?

Victor

Hmmm, I am still not able to reproduce the issue. To provide a bit more context, I have hooked up the bmdk's payload UART to a USB-to-TTL device. I am then using a Python script to output two lines of data from your examples with a 5 second delay between them. With the bmdk receiving data, I am not seeing my devkit or Spotter reset.

I look forward to hearing how the fourth version test goes!

Victor

The sensor spits out a four-character preamble when it starts up and, about 2 minutes later, outputs a 466-character measurement string. My code divides the measurement string in two (296 and 170 characters) for transmission. The following is the output seen in the Spotter console in the latest trial, where we seem to be avoiding the crashes. This test endured for 9 measurements before the sensor reached its end time; prior to this we'd never seen it go more than five rounds before being knocked out.

1715279537.800 c26a1077e2872d6a, [sami] | tick: 92567, rtc: 2024-05-09T18:32:17.359, line length: 296

1715279537.875 c26a1077e2872d6a, [sami] | tick: 92603, rtc: 2024-05-09T18:32:17.398, line length: 170

1715280300.808 c26a1077e2872d6a, [sami] | tick: 855585, rtc: 2024-05-09T18:45:00.527, line length: 4

1715280437.847 c26a1077e2872d6a, [sami] | tick: 992597, rtc: 2024-05-09T18:47:17.566, line length: 296

1715280437.921 c26a1077e2872d6a, [sami] | tick: 992633, rtc: 2024-05-09T18:47:17.601, line length: 170

1715281200.851 c26a1077e2872d6a, [sami] | tick: 1755615, rtc: 2024-05-09T19:00:00.734, line length: 4

1715281337.890 c26a1077e2872d6a, [sami] | tick: 1892627, rtc: 2024-05-09T19:02:17.773, line length: 296

1715281337.968 c26a1077e2872d6a, [sami] | tick: 1892663, rtc: 2024-05-09T19:02:17.808, line length: 170

1715282100.898 c26a1077e2872d6a, [sami] | tick: 2655645, rtc: 2024-05-09T19:14:59.937, line length: 4

1715282237.937 c26a1077e2872d6a, [sami] | tick: 2792657, rtc: 2024-05-09T19:17:16.976, line length: 296

1715282238.019 c26a1077e2872d6a, [sami] | tick: 2792693, rtc: 2024-05-09T19:17:17.011, line length: 170

1715283000.941 c26a1077e2872d6a, [sami] | tick: 3555675, rtc: 2024-05-09T19:30:00.144, line length: 4

1715283137.984 c26a1077e2872d6a, [sami] | tick: 3692687, rtc: 2024-05-09T19:32:17.183, line length: 296

1715283138.054 c26a1077e2872d6a, [sami] | tick: 3692723, rtc: 2024-05-09T19:32:17.218, line length: 170

1715283900.988 c26a1077e2872d6a, [sami] | tick: 4455705, rtc: 2024-05-09T19:45:00.351, line length: 4

1715284038.035 c26a1077e2872d6a, [sami] | tick: 4592726, rtc: 2024-05-09T19:47:17.398, line length: 296

1715284038.113 c26a1077e2872d6a, [sami] | tick: 4592752, rtc: 2024-05-09T19:47:17.425, line length: 170

1715284801.031 c26a1077e2872d6a, [sami] | tick: 5355735, rtc: 2024-05-09T20:00:00.554, line length: 4

1715284938.082 c26a1077e2872d6a, [sami] | tick: 5492756, rtc: 2024-05-09T20:02:17.605, line length: 296

1715284938.160 c26a1077e2872d6a, [sami] | tick: 5492782, rtc: 2024-05-09T20:02:17.628, line length: 170

1715285701.082 c26a1077e2872d6a, [sami] | tick: 6255769, rtc: 2024-05-09T20:15:00.765, line length: 4

1715285838.125 c26a1077e2872d6a, [sami] | tick: 6392786, rtc: 2024-05-09T20:17:17.812, line length: 296

1715285838.199 c26a1077e2872d6a, [sami] | tick: 6392812, rtc: 2024-05-09T20:17:17.835, line length: 170

1715286601.121 c26a1077e2872d6a, [sami] | tick: 7155795, rtc: 2024-05-09T20:30:00.968, line length: 4

1715286738.171 c26a1077e2872d6a, [sami] | tick: 7292816, rtc: 2024-05-09T20:32:18.015, line length: 296

1715286738.246 c26a1077e2872d6a, [sami] | tick: 7292842, rtc: 2024-05-09T20:32:18.042, line length: 170

I'm guessing the answer is no but @Taylor.gill would know for sure.

We'll continue testing this latest workaround tomorrow but so far it looks pretty stable. Thanks for all the feedback!

Mike J+

Great to hear that the workaround is getting you past 5 readings!

I am still intrigued why the bm_printf and printf of the payload_buffer are causing issues. Does your final code still have the PLUART::setUseLineBuffer(false) in it? It would make sense to me that the fifth reading would cause the line buffer (which is enabled by default) to overflow, since it is of size 2048. Because 2048/466 = 4.39, the buffer would overflow during the fifth reading as the bytes come in, if it hasn't been cleared, causing the assert at payload_uart.cpp:59 to be hit. I may be going down a rabbit hole here and if this isn't helpful feel free to ignore.
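
Spelled out, and assuming that line buffer never gets drained in between: readings one through four would accumulate 466, 932, 1398, and 1864 bytes, and the fifth would push it to 1864 + 466 = 2330, overflowing the 2048-byte buffer partway through that reading.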

Regardless, if this workaround is good enough to keep you guys moving then keep going forward!

Victor

Hi Victor,

We are not using (and do not have) an ST-LINK debugger/flashing device. Is this something we should look into getting?

Yes, setUseLineBuffer() is still in there (see full source code sent separately). Did you look at any of the log files Taylor sent? I thought I saw 'memfault' references in there but couldn't tell if they were significant. Would an assertion failure leave some kind of trace?

IMO if this workaround survives longer tests (ongoing) then I think it'd be good enough to get us into the ocean, but yeah I'd love to know what was causing the crashes. After the present crunch is past I thought I might try to play with the sizing of payload_buffer. If the timing of the crashes has anything to do with cumulatively pushing past those 2048 bytes then a smaller buffer should crash more often, a larger buffer less often.

Mike J+

@Taylor.gill I think an ST-LINK debugger could be useful if you are going to continue developing, or, as @mike_j might do, go back and debug what was going on. It would allow you to pinpoint exactly where the code last was before crashing, and then at that spot you can inspect different variables, buffer contents and a whole bunch of other stuff. This guide describes how to order a handy ST-LINK adapter that allows you to debug through the USB while also getting serial output. You will also need to order the ST-LINK itself; those are available at DigiKey and other websites.

@mike_j I looked through the logs and did not see any coredump files. They would look like: ####_coredump.bin. An assertion on Spotter would leave a coredump file so long as there is an SD card inserted. There are some ####_MFLT.log files, but these are for us to track when we send memfault messages. Spotter uses memfault to track firmware stability and other metrics.

@mike_j I agree, if this workaround is stable for longer tests and you guys are in a crunch to get it in the water then go for it! It is super exciting to hear that your system will be deployed soon! I like the idea of playing with different buffer sizes.

One technique you could use, if you are waiting for the debugger or don't want to order it, is to print the reset reason + program counter/link register (PC/LR). The reset reason is an enum of different reasons for a reset. All bmdk apps come with this and you can see the reset reason in the output of the info command when connected to the bmdk's serial terminal. The bmdk apps also come with memfault built in and you can access the reset reason and PC/LR in your app like this:

#include "memfaul_platform_core.h"
#include "reset_reason.h"

static bool sent_reset_reason = false;

void setup(void) {
  // same setup as before
}

void loop(void) {
  // After the devkit has been on for 10 seconds
  // this will send/print/log the reset reason and PC/LR 
  // The PC/LR should be zero, unless the reset reason is a "memfault reset"
  if (uptimeGetMs() > 10000 && !sent_reset_reason) {
    sent_reset_reason = true;
    ResetReason_t resetReason = checkResetReason();
    uint32_t pc = memfault_get_pc();
    uint32_t lr = memfault_get_lr();
    bm_printf(0, "Reset Reason: %d: %s, PC: 0x%" PRIx32 ", LR: 0x%" PRIx32 "\n", resetReason, getResetReasonString(), pc, lr);
    bm_fprintf(0, "reset.log", "Reset Reason: %d: %s, PC: 0x%" PRIx32 ", LR: 0x%" PRIx32 "\n", resetReason, getResetReasonString(), pc, lr);
    printf("Reset Reason: %d: %s, PC: 0x%" PRIx32 ", LR: 0x%" PRIx32 "\n", resetReason, getResetReasonString(), pc, lr);
  }
  // your loop code
}

After a reset or crash occurs you can then look into the logs/serial output and see a line like:

1715107760.906 39edb8cdf0746373, Reset Reason: 2: memfault reset, PC: 0x80193b2, LR: 0x801755b

Now this is where it gets a bit tricky, but you can then use the PC/LR to pinpoint the spot in your code where the crash occurred. You need to go to your build directory, something like ~/bm_protocol/cmake-build/my_app. There should be a src directory; make sure there is a file in src that ends in .elf. If it is not there, try going back out to your build directory, ~/bm_protocol/cmake-build/my_app, and building your code again: make -j. If there is an .elf file you can then use arm-none-eabi-addr2line to find the lines of code. Here is an example:

(bristlemouth) ➜  rbr_mote git:(test/rtc-0-bug) ✗ arm-none-eabi-addr2line -pCf -e src/bm_mote_bristleback_v1_0-bm_rbr-dbg.elf 0x80193b2
vPortSuppressTicksAndSleep at ./src/./lib/common/lptimTick_u5.c:557 (discriminator 1)
(bristlemouth) ➜  rbr_mote git:(test/rtc-0-bug) ✗ arm-none-eabi-addr2line -pCf -e src/bm_mote_bristleback_v1_0-bm_rbr-dbg.elf 0x801755b
prvIdleTask at ./src/./third_party/FreeRTOS/Source/tasks.c:3525

(Note: in my example above I am in the ~/bm_protocol/cmake-build/rbr_mote directory and passing the elf file to arm-none-eabi-addr2line as src/bm_mote_bristleback_v1_0-bm_rbr-dbg.elf. As long as you are passing path_to_file/file.elf it will work.)

From here you can isolate the line that is causing the crash. In this example, the PC is pointing to lptimTick_u5.c:557, so I can go inspect line 557 and see if there is an assert, or if it is doing some operation on a buffer, or something else. From there you can make changes to that line, or to the variables that line operates on, and test again. With this version, however, you will not be able to see the values of the variables like you would if you were doing a "live" debugging session with the ST-LINK.

Hopefully these suggestions can be helpful for you and for anyone else reading this thread.

Victor
