[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Proposal for High level sftp api for uploads and downloads
[Thread Prev] | [Thread Next]
- Subject: Proposal for High level sftp api for uploads and downloads
- From: Eshan Kelkar <eshankelkar@xxxxxxxxxxxxx>
- Reply-to: libssh@xxxxxxxxxx
- Date: Wed, 31 May 2023 13:08:49 +0530
- To: libssh@xxxxxxxxxx
Libssh needs a high level api for uploading and downloading files. Functions like sftp_put() and sftp_get() need to be introduced which the users can call simply to upload and download files instead of having to write their own functions to perform uploads and downloads using the low level read/write api's. This mail suggests approaches that can be followed to develop that api. Some terminology : 1. local_fd - is the file descriptor [obtained using open() or creat()] of a file on the local machine of the user. 2. remote_file - is a sftp file handle [obtained using sftp_open()] of a file on the connected server. 3. concurrent_requests - This is the number of requests we initially issue before trying to get the responses. Say you are downloading and you have 20 concurrent requests, this means that first we have issued 20 read requests, and then when we are trying to get/wait for the response of the first request, the server may be processing or may have processed the other 19 requests (This is the advantage of asynchronous request-response pattern over synchronous request-response where we would have issued the next request [and server would have processed it] only after getting a response of the first request). And after getting the response of the first request, we issue another request if needed and then try to get the responses of the other outstanding requests and repeat this procedure for them too. Though you can see this name "concurrent_requests" is not the best as the requests are not being issued in a concurrent manner here, neither are they being processed concurrently by the server. So if a better name comes to your mind, kindly suggest. 4. chunk_size - The number of bytes we usually issue a read/write request for, if the number of bytes to read/write is less than this chunk_size then the request is issued for those many number of bytes and not chunk_size number of bytes. How the api can look from User's perspective : Approach 1 : ---------------------------------------------- The user has to pass each of the 4 required things for the transfer to the put, get functions. For the default values of concurrent_requests and chunk_size we can provide macros (which expand to default values suggested by libssh) which the user can use if he/she is not interested in setting some custom values. int sftp_put(int local_fd, sftp_file remote_file, int concurrent_requests, size_t chunk_size); int sftp_get(int local_fd, sftp_file remote_file, int concurrent_requests, size_t chunk_size); Approach 2 : ------------------------------------- Store these 4 things in a structure, make user be able to configure the number of concurrent_requests and chunk_size if he/she wants to according to the requirements. /* in sftp.h file */ typedef struct sftp_file_transfer_struct* sftp_file_transfer; /* in sftp.c file */ struct sftp_file_transfer_struct { int local_fd; sftp_file remote_file; int concurrent_requests; size_t chunk_size; }; sftp_file_transfer sftp_file_transfer_new(int local_fd, sftp_file file) { 1. Allocate a new structure of type struct sftp_file_transfer_struct. 2. Assign its members the received local_fd and remote file handle and set concurrent_requests and chunk_size as the default ones recommended by libssh. 3. Return the address of that structure. } int sftp_file_transfer_options_set(sftp_file_transfer ft, what_to_set, void *ptr) { /* what_to_set is just a temporary name to denote * the parameter in which we'll receive info about * the field of the struct that the user wants to set, * e.g if SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE * is received in the parameter what_to_set, then chunk_size * is to be set. */ According to the value received in what_to_set, typecast ptr and assign the appropriate member of the structure that ft points to the value of the variable that ptr points to. } int sftp_file_transfer_put(sftp_file_transfer ft) { Upload the data of the file associated with ft->local_fd and write it in the file associated with ft->remote_file. } int sftp_file_transfer_get(sftp_file_transfer ft) { Download the data of the file associated with ft->remote_file and store in the file associated with ft->local_fd. } The code for sftp_get() and sftp_post() that will be discussed further resembles a lot to the benchmark code for async download and upload added as a commit in this merge request. (See https://gitlab.com/libssh/libssh-mirror/-/merge_requests/375) ------------------------------------------------------ What happens inside the get (download) ------------------------------------------------------ 1. Query the remote file size using sftp_stat(). This is the data size that we will wish to read. 2. Initially issue at max concurrent_requests number of read requests to the server, while the requested number of bytes (to read) are less than size to read. The issued requests are to be stored in a request queue. 3. while (request queue is not empty) { 3.1) Dequeue a request from the request queue and get its response. 3.2.1) If the response is eof, the file is smaller than expected. 3.2.2) If the response is not eof, but a short read before reaching end of file, then its a short read (not yet decided what to do in this case) 3.2.3) If it's neither of the above cases then the read was successful, write the read data in the local file. 3.3) If the bytes requested to read < file_size, then issue one more read request and add it to the request queue. /* Since we'll make sure that the requested number of * bytes never exceed the queried file size, 3.2.1 should * never ideally occur, that is why if we get eof, its something * unexpected. */ } /* Issuing read requests for get (download) */ ----------------------------------------------------------- sftp_aio_begin_read() can be called as it is here. (This is a function introduced in the above linked merge request and is used to issue a read request for reading some number of bytes from a remote file) /* Getting response for the issued request for get (download) */ --------------------------------------------------------------------------------- First Iets analyse the steps involved when a user uses the existing low level read api to get a response for a previously issued read request and writes the received data in a local file. /* Step-1 to Step-3 are performed by the libssh api */ 1. Get response from the server in msg (of type sftp_message), the ssh_buffer containing the received read data is msg->payload. 2. Call ssh_buffer_get_ssh_string(msg->payload) which allocates a new ssh_string, takes out the data from the buffer msg->payload (i.e performs a memory copy) and stores it in the ssh_string and returns it. 3. Copy the data from the ssh_string to the application buffer /* Step-4 is performed by the libssh user */ 4. Write data from the application buffer to the file. In the case of sftp_get() all these 4 steps are to be performed by the libssh api. We can optimize this process and avoid some memory copies if we write directly from libssh buffer to the file using the local file descriptor. Something like : write(fd, address where data to write is in ssh buffer, byte_count); ----------------------------------------------------- What happens inside the post (upload) ----------------------------------------------------- 1) Use some function like stat() to get the local file size, these are the total number of bytes we wish to write. 2) Issue at max concurrent_requests number of requests while the number of bytes requested to write are less than the total file size. Add the requests in a request queue. 3) while(request queue is not empty) { 3.1) Dequeue a request from the request queue and get its response. 3.2) The response can be a successful write or a failure. 3.3) Issue one more write request if needed (i.e if requested_bytes < total number of bytes) and add it to the request queue. } /* Issuing write request for post (upload) */ ------------------------------------------------------------ First let's analyse the steps involved in sending a write request when the user is using the low level write api to upload a local file. /* Step-1 is performed by the user */ 1. Read data from the local file to the application buffer. /* Step-2 to Step-4 are performed by the libssh api */ 2. Copy the data from the application buffer to the libssh buffer. 3. Data of the libssh buffer is encrypted and written to session->out_buffer 4. The data of session->out_buffer is written to the socket write buffer. In case of sftp_put() all these 4 steps need to be performed by the libssh api while issuing each write request. We can optimize this process and avoid one memory copy if we read directly from the file to the libssh buffer, by-passing the application buffer. We will need to do something like : read(local_fd, address of location where data is to be read in libssh buffer, byte_count); /* Getting response for a write request */ ------------------------------------------------------ sftp_aio_wait_write() can be used here as it is. (This is a function introduced in the above linked merge request, and is used to get the response of a previously issued write request) Since the suggested optimisations need us to directly read from local file to libssh buffer or write from libssh buffer to local file, I think we will need to extend the buffer api in src/buffer.c to achieve this. Another improvement that can be done is that instead of using system calls read() and write() on local file we can use the higher level fread(), fwrite() since they are buffered and using them leads to lesser system calls in general. (which will improve performance) I am yet not sure that what should be done if we get short reads from the remote side without reaching end of file, if any approach comes to your mind kindly suggest. (According to the sftp protocol https://datatracker.ietf.org/doc/html/draft-ietf-secsh-filexfer-02#page-13 for reading, short reads before reaching end of file should not occur for regular disk files. But they can occur for other kinds of files like device files, but does using this put, get api make sense for device files or other kinds of special files? If it doesn't I think returning an error is the way short reads before reaching end of file should be dealt with) Kindly provide your thoughts on the proposed aspects (how the api should look from the user's perspective, the working of the api, the optimisations etc). Any suggestions to improve the proposed high level put, get api are appreciated.
Archive administrator: postmaster@lists.cynapses.org