LETTER TO ROBERT P. WAGNER (SANITIZED)

Document Type: 
Collection: 
Document Number (FOIA) /ESDN (CREST): CIA-RDP85B01152R001001310013-0
Release Decision: RIPPUB
Original Classification: K
Document Page Count: 5
Document Creation Date: December 21, 2016
Document Release Date: June 11, 2008
Sequence Number: 13
Case Number: 
Publication Date: November 1, 1983
Content Type: LETTER
File: CIA-RDP85B01152R001001310013-0.pdf (222.58 KB)
Body: 
Approved For Release 2008/06/11 : CIA-RDP85B01152R001001310013-0
November 1, 1983
Office of Data Processing
Central Intelligence Agency
Washington, DC 20505
Dear Bob:
Below are a number of points that caught my ear as we worked through the PDR. Some of them may not prove important simply because there is additional information that I do not have, but others may affect what is going to happen between now and CDR.
First of all, we all know that the schedule is impossibly tight. Delivery-2 is really the big step, when the VM and MVS environments get networked for the first time. It is also the first occasion on which there is a full-up SUL implementation to the user, and also when most of the software packages play together for the first time. There are other technical innovations as well; ones mentioned included distributed restart/recovery, selective operation when failures occur, and selective recovery.
o Has enough time been allowed to shake out this first delivery? Delivery-2 is the foundation on which all else hinges, so anything that can be done to give you additional schedule time for all the needed testing and integration will certainly be helpful.
I find myself concerned about the whole issue of recovery and restart. The SAFE configuration has about five large [illegible in source], and there are five or so large software packages. If one supposes that the mean free time between failures for each is 100 hours, then the system will have a failure every 10 hours on the average. In a scheduled 22-hour day there will be two or three failures, which means that the total time for diagnosing trouble and recovering from it must be of the order of 30 to 45 minutes. While we heard much talk during PDR of restart features in this, that, or the other software package, my intuition says that the whole issue needs a thorough examination to make sure that the restart features of all the parts mesh smoothly.
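The failure arithmetic above can be checked with a short sketch. It assumes, as the letter's estimate implicitly does, that the ten components fail independently at constant rates and that any single failure takes the system down. The function names and the reading of "30 to 45 minutes" as recovery time per failure are assumptions made for illustration; the letter states no availability target.

```python
def series_mtbf(component_mtbfs_hours):
    """Mean time between failures of a system that goes down whenever
    any one component fails (independent, constant failure rates)."""
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# About five large hardware items plus five large software packages,
# each assumed to have a 100-hour mean time between failures.
components = [100.0] * 10
scheduled_hours = 22.0

system_mtbf = series_mtbf(components)             # about 10 hours
failures_per_day = scheduled_hours / system_mtbf  # about 2.2 per scheduled day


def availability(recovery_minutes):
    """Fraction of the scheduled day the system is up, if each failure
    costs `recovery_minutes` of diagnosis and recovery (an assumption)."""
    downtime_hours = failures_per_day * recovery_minutes / 60.0
    return 1.0 - downtime_hours / scheduled_hours


for minutes in (30, 45):
    print(f"{minutes} min per failure -> availability {availability(minutes):.3f}")
```

Under these assumptions, recovery times of 30 to 45 minutes per failure correspond to the system being up roughly 92 to 95 percent of the scheduled day; a longer recovery time or a shorter component MTBF lowers that figure proportionally.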
I would argue that the issues of fault diagnosis and restart/recovery procedures are of such importance that a special effort to examine them, perhaps with a special contractor, is warranted.
THE RAND CORPORATION, 1700 MAIN STREET, SANTA MONICA, CALIFORNIA 90406, PHONE: (213) 393-0411
An advantage of using a separate contractor would be that he could do a thorough fault analysis of the proposed system and use the existing running one as a source of real data. He could also canvass other operators of large IBM configurations for pertinent information. On the basis of what he could learn, he could assist CSPO in analyzing and designing fault-diagnosis and restart/recovery procedures.
There is an important currency, I think, about this issue because it might very well influence the design of the system, particularly with respect to building in redundancy, analytic checks of various kinds, and what IBM calls fixer modules in its operating system. The goal for your reliability/availability contractor (or group) has to be graceful degradation with gracious recovery.
I know that IBM hardware and software have certain error-accommodation features, but do the other software packages that you are acquiring have adequate ones? I find myself wondering whether it might not be wise to design rather extensive diagnostic tests and fault-isolation aids into the several large software packages. If the operational concept for SAFE assumes that everything will report its failure state to the console operator, and if he is expected to go through a restart/recovery procedure all by himself, then I feel very queasy about holding to a 30- to 45-minute recovery period.
Since you are serving such a large population of users, it is simply unthinkable that the system be allowed to fail in such a way that 100 or 200 users have no awareness of their prior status and context of operations. This suggests that one might have to design special user-awareness files that allow users to get back on rapidly without going through all of the usual log-on, password, authentication, and menu actions.
As you know, the IBM operating systems incorporate special features to check on things in progress, and these invoke special fixer modules whenever trouble is detected. One wonders whether the big software packages that you are acquiring from commercial sources have such features and, if so, how they will integrate with the corresponding IBM ones. If they do not have such features, one wonders whether checker-fixer features ought not be added to each application program.
Hopefully, a UPS is being provided for the SAFE system, but if not, has consideration been given to the IBM power monitor that puts the system to sleep during misbehavior of the power source?
As I understood the discussion of the MEC analyst, he can decide on the fly to change the SLP for subsequent processing of messages. It strikes me that this might be a bit risky in that he could fix one thing but mess up many others. Perhaps this feature already exists in SEC and, if so, you will have the experience to answer my concern. If it does not exist, I wonder whether it might not be wise to mock up the feature and do some trials. Would it be worthwhile to provide the MEC analyst with system-provided checks and prompts to help assure that he cannot wreak damage? Might it even be so important that the process be put under two-person control?
During the PDR there was frequent reference to duplicate files for backup purposes.
At Sysgen time, one will have to be careful to make sure that duplicate files are on different spindles, which are in turn on different controllers; otherwise the full purpose of backup will not be served. I do not know whether this detail will influence any other aspects of the design, but I surface it for your consideration.
I am not quite sure whether to raise the security issue in the same way as I did the restart/recovery one above. The security features, to be sure, are not distributed as widely throughout the system, and it may be that everything is well. On the other hand, it might be useful for some one person to have a comprehensive system-wide look at all security controls to make sure that nothing has dropped in a crack.
Somewhere along the way a comment was made that MAP processing strips out all control characters except end-of-line ones. It occurred to me that the line length in DATEX traffic might be unsuitable for the preferred line length on SAFE terminals. System designers make different choices with regard to this problem. Some of them maintain text as a running character stream which is displayed by a smart text formatter, whereas others retain the end-of-line characters on traffic as it comes. This is a detail, but it could have extensive design ramifications; it might be worth a look.
Someone said, in effect, that "DATEX will restart and retransmit the message [in process] when a failure occurs." I am not familiar with the DATEX system, but the remark implies that the DATEX node which is feeding SAFE will be under the latter's control. I heard no discussion of this interface; is it a detail in the crack?
I am not familiar with the SANS algorithm, but I found myself wondering whether it clearly does produce a unique identification number. Within a given source probably the source plus the date-time-group is unique, but what happens if messages are picked up and retransmitted by a third party? Is there a problem here?
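The retransmission question can be made concrete with a small sketch. This is not the SANS algorithm, which the letter does not describe; it is a hypothetical identifier built from source plus date-time-group, with invented field names (`source`, `dtg`, `relayed_by`) and invented sample traffic, and it assumes a relayed copy keeps the originator's source and date-time-group.

```python
from typing import NamedTuple


class Message(NamedTuple):
    source: str      # originating station (hypothetical field)
    dtg: str         # date-time-group, kept as text
    relayed_by: str  # empty string if received directly (hypothetical field)
    text: str


def naive_id(msg):
    """Identifier from source plus date-time-group only."""
    return (msg.source, msg.dtg)


def relay_aware_id(msg):
    """Same identifier, extended with the relaying party."""
    return (msg.source, msg.dtg, msg.relayed_by)


def find_collisions(messages, make_id):
    """Return pairs of distinct messages that receive the same identifier."""
    seen = {}
    collisions = []
    for msg in messages:
        key = make_id(msg)
        if key in seen:
            collisions.append((seen[key], msg))
        else:
            seen[key] = msg
    return collisions


# Invented sample traffic: one message received directly and again via a relay.
traffic = [
    Message("STATION-A", "011530Z NOV 83", "", "original text"),
    Message("STATION-A", "011530Z NOV 83", "STATION-C", "original text"),
]

print(len(find_collisions(traffic, naive_id)))        # the two copies share an identifier
print(len(find_collisions(traffic, relay_aware_id)))  # the two copies are distinguished
```

Whether the relayed copy ought to share the original's identifier (so that it can be recognized as a duplicate) or receive its own (so that both arrivals are recorded) is a design decision; the sketch only shows that a source-plus-date-time-group scheme has to make that choice deliberately.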
Another detailed point: Does Logicon provide an initial set of patterns/tables/strings to handle messages? Or does it only provide a capability for someone else to exploit the generality of Merlin? Has this detail been dealt with in the contractual relationship with Logicon?
I think I found all of the points in my notes that I wanted to flag for you; but should others arise, I will get back to you with a second letter.
cc: Corporate Research Staff