Most embedded systems use hardware diagnostics to keep track of their hardware’s health. Diagnostics may also be used to confirm a problem that was discovered during normal operation. Faststream provides a number of diagnostic tests, including Power On Self Tests (POST), Out of Service Tests, and In-Service Monitoring.
Power On Self Tests are performed immediately after a board is powered on. The code for these tests usually resides in the EPROM that boots the card, and the tests run automatically when the EPROM boots. Their biggest drawback is that they can only test the card’s internal functionality. The Power On Self Tests described below therefore run diagnostics on hardware components in cases where external interface logic cannot be exercised.
The CPU test runs first and examines the CPU’s internal state. It executes CPU instructions and checks their results, exercising all of the processor’s registers along the way. For example, the data in a register may be shifted by one bit and the result of the shift compared with a precomputed value.
When an EPROM image is prepared, the last two bytes are deliberately set to zero. The EPROM programmer then fuses the checksum into those final two bytes. The EPROM checksum test computes a 16-bit exclusive OR (XOR) of the EPROM data, omitting the final two bytes, and compares the result with the checksum fused in the final two bytes. If the calculated and fused checksums match, the test passes.
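The checksum comparison can be sketched in a few lines. This is a minimal illustration, not the actual boot code: the function names are hypothetical, and it assumes the image has an even length and that 16-bit words are stored most significant byte first.

```python
def xor16_checksum(image: bytes) -> int:
    """Return the 16-bit XOR of the image, omitting the final two bytes."""
    body = image[:-2]
    if len(body) % 2 != 0:
        raise ValueError("image body must contain whole 16-bit words")
    checksum = 0
    for i in range(0, len(body), 2):
        # Assumption: words are stored big-endian (high byte first).
        word = (body[i] << 8) | body[i + 1]
        checksum ^= word
    return checksum


def eprom_checksum_ok(image: bytes) -> bool:
    """Compare the computed checksum with the one fused in the last two bytes."""
    fused = (image[-2] << 8) | image[-1]
    return xor16_checksum(image) == fused


# Example: two data words followed by the fused checksum.
good_image = bytes([0x12, 0x34, 0x56, 0x78, 0x44, 0x4C])
print(eprom_checksum_ok(good_image))
```

A single flipped bit in the data area changes the computed value, so the comparison fails for a corrupted image.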
The read-write memory test checks memory integrity against the faults described below.
Address Line Faults: The address lines on the board or within the memory chip may be shorted together or stuck at 0 or 1. In either case, a write may reach several locations or the wrong location. On a read, data corruption may occur when two distinct memory locations drive the data bus at the same time.
Data Line Faults: Data lines may be shorted to one another, or they may be stuck at 0 or 1. In either case, incorrect data will be written to or read from the memory.
Data Loss: Data written to a location may read back correctly shortly after it is written, yet be lost a short time later. The address and data lines are in good shape, but the memory cells lose their contents over time.
Memory testing procedures may be somewhat intricate, and the methodology utilized is also dependent on the memory banks’ arrangement. We’ll go through a basic test that covers all of the above-mentioned fault situations quite well. This test involves the stages listed below.
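One common way to cover these three fault classes can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions and is not necessarily the exact sequence of stages used by this test: `SimulatedRam` is a hypothetical byte-wide memory model with injectable faults, the data-line check walks a single 1 bit across the bus, and the address check stores each location's own address, optionally waits, and then reads everything back.

```python
import time


class SimulatedRam:
    """Hypothetical byte-wide RAM model with optional injected faults."""

    def __init__(self, size, stuck_low_addr_bit=None, data_mask=0xFF):
        self.cells = [0] * size
        self.stuck_low_addr_bit = stuck_low_addr_bit  # address bit stuck at 0
        self.data_mask = data_mask                    # cleared bits model data lines stuck at 0

    def _decode(self, addr):
        if self.stuck_low_addr_bit is not None:
            addr &= ~(1 << self.stuck_low_addr_bit)
        return addr

    def write(self, addr, value):
        self.cells[self._decode(addr)] = value & self.data_mask

    def read(self, addr):
        return self.cells[self._decode(addr)]


def data_lines_ok(ram, addr=0, width=8):
    """Walk a single 1 bit across the data bus at one address."""
    for bit in range(width):
        pattern = 1 << bit
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            return False
    return True


def address_and_retention_ok(ram, size, hold_seconds=0.0):
    """Store each location's own address, wait, then verify every location.

    Assumes size <= 256 so that every location holds a unique byte value.
    """
    for addr in range(size):
        ram.write(addr, addr & 0xFF)
    time.sleep(hold_seconds)  # a real test would wait long enough to expose data loss
    return all(ram.read(addr) == (addr & 0xFF) for addr in range(size))


healthy = SimulatedRam(16)
print(data_lines_ok(healthy), address_and_retention_ok(healthy, 16))
```

With an address bit stuck at 0, two addresses alias to the same cell, so the second write overwrites the first and the verification pass reports a mismatch.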
The interrupt test examines the processor’s interrupt and exception handling. It is carried out by creating interrupt and exception conditions, then looping until the expected interrupt is detected. For example, the test may start a timer interrupt and then poll a flag that the interrupt handler sets. Exception handling is tested using conditions such as “divide by zero”, after which the test verifies that control was passed to the correct handler.
Almost every board has a Direct Memory Access (DMA) controller, because data must be transferred to and from peripheral devices without loading the processor. The DMA operations on the board can be checked by starting a DMA transfer and then confirming, once the transfer completes, that the contents of the source and destination memory match.
Loopback tests are performed by connecting the transmitter to the receiver on the same device, which is done by setting the device to loopback mode. Once the device has been configured, the test transmits data and waits for it to arrive at the receiver after being looped back. The key benefit of this test is that it can be performed on the board under test by itself. However, because the loopback takes place within the chip, it does not verify the transmit and receive interface paths outside the chip. The Echo Back Tests, which address this limitation, are discussed later.
Peripheral devices used on a board must also be checked during the self-tests. These tests are highly specific to the device in question. Many vendors provide device testing support in the form of a test mode of operation, and the device is set to “test mode” to carry out these tests. When a device lacks a test mode, board designers add extra logic to the board so that the external device can be tested.
Power On Self Tests fall short when it comes to evaluating the interfaces with other boards in the system. Out of Service Tests address this: in this part, we’ll look at how to run tests in an active system by taking the board under test out of service and checking its interfaces with other boards.
Interface tests are used to determine how well a card’s interfaces work with other cards. In most cases, the neighbouring cards are asked to participate in these tests. The following are the basic steps for conducting interface tests.
The loopback test’s main limitation was that it did not exercise the hardware logic at the transmitter and receiver interfaces. The Echo Back Test can be used to overcome this problem.
The interfacing card is set to echo back mode, which means it receives data and sends it back to the card under test. As a result, the data transmitted by the card under test is returned to it. The significant difference from the loopback test is that the transmit and receive driver logic is exercised as well.
In an echo back test, the following steps are performed.
In-service monitoring keeps track of the card’s health while it’s in use.
Transient Error Monitoring
When a card is in use, it should keep track of any transient errors that the software detects. Transient errors are errors that occur infrequently, even when the hardware appears to be in good working order. Even in a healthy system, power glitches, spikes, and interference from other cards can cause such errors.
Spurious interrupts are a good example of transient errors. A spurious interrupt occurs when the CPU senses an interrupt but the interrupt handler cannot find the device that raised it. In such cases, a leaky bucket error counter is incremented. If spurious interrupts become too frequent, the leaky bucket counter will overflow.
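The leaky bucket idea can be sketched as follows. This is a simplified illustration, not the actual counter: the class name, the `capacity` and `leak_per_tick` parameters, and the periodic `tick()` call are all assumptions. Each error adds one unit to the bucket, a periodic timer drains it, and an overflow is reported only when errors arrive faster than they leak away.

```python
class LeakyBucket:
    """Hypothetical leaky bucket error counter."""

    def __init__(self, capacity, leak_per_tick=1):
        self.capacity = capacity            # highest level tolerated without overflow
        self.leak_per_tick = leak_per_tick  # units drained on every timer tick
        self.level = 0

    def record_error(self):
        """Count one transient error; return True if the bucket overflows."""
        self.level += 1
        return self.level > self.capacity

    def tick(self):
        """Called periodically (for example from a timer) to drain the bucket."""
        self.level = max(0, self.level - self.leak_per_tick)


bucket = LeakyBucket(capacity=3)
for _ in range(3):
    bucket.record_error()      # occasional errors fit in the bucket
bucket.tick()                  # time passes, one unit leaks away
print(bucket.record_error())   # back at capacity, no overflow yet
print(bucket.record_error())   # errors now outpace the leak
```

Isolated errors drain away before they accumulate, so only a sustained burst pushes the level past the capacity.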
Link monitoring is also a critical tool for in-service monitoring of a card. The bit error rate (BER) on the links can provide an early warning about the system’s health. When the BER reaches a specified threshold, diagnostics may need to be triggered.
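A threshold check on the bit error rate might look like the sketch below. The function names and the default threshold of 1e-6 are purely illustrative assumptions; an appropriate threshold depends on the link in question.

```python
def bit_error_rate(error_bits: int, total_bits: int) -> float:
    """Fraction of received bits that were in error over a measurement window."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    return error_bits / total_bits


def needs_diagnostics(error_bits: int, total_bits: int, threshold: float = 1e-6) -> bool:
    """Return True when the measured BER reaches the (hypothetical) threshold."""
    return bit_error_rate(error_bits, total_bits) >= threshold


print(needs_diagnostics(5, 10**9))
print(needs_diagnostics(2000, 10**9))
```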
Forward error correction (FEC) is a data transmission technique that divides data into blocks and adds parity bits to each block. The receiver can use these parity bits to detect bits that are in error and to correct them, provided the errors are sufficiently randomly distributed. Two FEC schemes are in use: Reed-Solomon and Fire code. Reed-Solomon (RS) is the more common and the “stronger” of the two. Fire code, also known as BASE-R or Clause 74 FEC, is “weaker” than RS-FEC but adds less latency to the link. An FEC’s strength is usually characterised by the worst-case BER it can correct. Another characteristic is its ability to tolerate bursts of errors.
The RS-FEC transmit process builds a codeword from 5140 bits of user data plus 140 bits of parity, giving a 5280-bit block. Internally, this codeword is divided into 10-bit “symbols”. These symbols are distributed round-robin across FEC lanes, which are then mapped onto PMD lanes for transmission. For 100GbE over 4x25G SerDes (100G-4), the mapping is 1:1, since there are four FEC lanes and four PMD lanes. In 100G-2 there are still four FEC lanes but only two PMD lanes, so each PMD lane carries two FEC lanes.
At the receiving end, the procedure is reversed. The codeword is reassembled from the arriving symbols, the parity bits are used to perform error correction, and the data is then passed on for further receive processing.
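The codeword arithmetic and the round-robin lane distribution can be checked with a short sketch. It models symbols as plain integers and performs no Reed-Solomon encoding or decoding. The function names are hypothetical, and grouping consecutive FEC lanes onto a PMD lane is an assumption made for illustration only.

```python
DATA_BITS = 5140
PARITY_BITS = 140
SYMBOL_BITS = 10
FEC_LANES = 4

CODEWORD_BITS = DATA_BITS + PARITY_BITS             # 5280 bits per codeword
SYMBOLS_PER_CODEWORD = CODEWORD_BITS // SYMBOL_BITS  # 528 symbols of 10 bits


def distribute_round_robin(symbols, lanes=FEC_LANES):
    """Deal symbols to FEC lanes in turn: symbol i goes to lane i % lanes."""
    return [symbols[i::lanes] for i in range(lanes)]


def map_to_pmd(fec_lanes, pmd_lane_count):
    """Group FEC lanes onto PMD lanes (assumption: consecutive lanes share a PMD lane)."""
    per_pmd = len(fec_lanes) // pmd_lane_count
    return [fec_lanes[i * per_pmd:(i + 1) * per_pmd] for i in range(pmd_lane_count)]


def reassemble(fec_lanes):
    """Receiver side: interleave the lanes back into the original symbol order."""
    symbols = []
    for group in zip(*fec_lanes):
        symbols.extend(group)
    return symbols


codeword = list(range(SYMBOLS_PER_CODEWORD))  # stand-in for real 10-bit symbols
lanes = distribute_round_robin(codeword)
print(len(lanes), len(lanes[0]))              # FEC lanes, symbols per lane
print(len(map_to_pmd(lanes, 4)[0]))           # 100G-4: FEC lanes per PMD lane
print(len(map_to_pmd(lanes, 2)[0]))           # 100G-2: FEC lanes per PMD lane
print(reassemble(lanes) == codeword)          # receive side restores the order
```

Each codeword therefore contributes 132 symbols to every FEC lane, and the receiver recovers the original order by reading one symbol from each lane in turn.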