Proposal for High level sftp api for uploads and downloads

Subject: Proposal for High level sftp api for uploads and downloads
From: Eshan Kelkar <eshankelkar@xxxxxxxxxxxxx>
Reply-to: libssh@xxxxxxxxxx
Date: Wed, 31 May 2023 13:08:49 +0530
To: libssh@xxxxxxxxxx
libssh needs a high level api for uploading and
downloading files. Functions like sftp_put() and
sftp_get() should be introduced so that users can
upload and download files with a single call instead
of writing their own transfer functions on top of the
low level read/write api.

This mail suggests approaches that can be followed
to develop that api.

Some terminology :
1. local_fd - is the file descriptor [obtained using open()
or creat()] of a file on the local machine of the user.

2. remote_file - is an sftp file handle [obtained using
sftp_open()] of a file on the connected server.

3. concurrent_requests -  This is the number of requests
we initially issue before trying to get the responses.

Say you are downloading with 20 concurrent requests.
We first issue 20 read requests, and while we wait for
the response to the first one, the server may be processing
(or may already have processed) the other 19. This is the
advantage of an asynchronous request-response pattern over
a synchronous one, where the next request is issued (and
processed by the server) only after the response to the
previous request has arrived.

After getting the response to the first request, we issue
another request if needed, then wait for the responses to
the other outstanding requests, repeating this procedure
for each of them.

The name "concurrent_requests" is not ideal, since the
requests are neither issued concurrently nor processed
concurrently by the server. If a better name comes to
mind, kindly suggest it.

4. chunk_size - The number of bytes a read/write request
normally asks for. If fewer than chunk_size bytes remain
to be read/written, the request is issued for the remaining
number of bytes instead.
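
To make the chunk_size rule concrete, here is a minimal sketch. The helper
name next_request_len() and its parameters are made up for illustration and
are not part of libssh.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libssh function): length of the next
 * read/write request, given the total number of bytes to transfer and
 * the number of bytes already covered by earlier requests. */
static size_t next_request_len(uint64_t total, uint64_t requested,
                               size_t chunk_size)
{
    uint64_t remaining;

    assert(requested <= total);
    remaining = total - requested;

    /* Ask for a full chunk unless fewer bytes are left. */
    return remaining < chunk_size ? (size_t)remaining : chunk_size;
}
```

For a 100 byte file and chunk_size 32, the requests would be for 32, 32, 32
and finally 4 bytes.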

How the api can look from the user's perspective :

Approach 1 :
----------------------------------------------
The user passes all 4 values required for the transfer
directly to the put and get functions.

For concurrent_requests and chunk_size we can provide
macros (which expand to default values suggested by
libssh) that users can pass if they are not interested
in setting custom values.

int sftp_put(int local_fd,
             sftp_file remote_file,
             int concurrent_requests,
             size_t chunk_size);

int sftp_get(int local_fd,
             sftp_file remote_file,
             int concurrent_requests,
             size_t chunk_size);

Approach 2 :
-------------------------------------
Store these 4 values in a structure and let the user
configure concurrent_requests and chunk_size if the
defaults do not fit the requirements.

/* in sftp.h file */
typedef struct sftp_file_transfer_struct* sftp_file_transfer;

/* in sftp.c file */

struct sftp_file_transfer_struct
{
    int local_fd;
    sftp_file remote_file;
    int concurrent_requests;
    size_t chunk_size;
};

sftp_file_transfer sftp_file_transfer_new(int local_fd,
                                          sftp_file file)
{
1. Allocate a new structure of type
struct sftp_file_transfer_struct.

2. Assign its members the received local_fd and
remote file handle and set concurrent_requests and
chunk_size as the default ones recommended by libssh.

3. Return the address of that structure.
}

int sftp_file_transfer_options_set(sftp_file_transfer ft,
                                   what_to_set,
                                   void *ptr)
{
/* what_to_set is just a temporary name to denote
 * the parameter in which we'll receive info about
 * the field of the struct that the user wants to set,
 * e.g if SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE
 * is received in the parameter what_to_set, then chunk_size
 * is to be set.
 */
According to the value received in what_to_set,
typecast ptr and assign the appropriate member of the
structure that ft points to the value of the variable that
ptr points to.
}

int sftp_file_transfer_put(sftp_file_transfer ft)
{
Upload the data of the file associated with ft->local_fd
and write it in the file associated with ft->remote_file.
}

int sftp_file_transfer_get(sftp_file_transfer ft)
{
Download the data of the file associated with ft->remote_file
and store in the file associated with ft->local_fd.
}
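
A minimal sketch of how the constructor and the option setter of Approach 2
could be filled in. Only SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE comes from the
text above. The enum type, the CONCURRENT_REQUESTS option name, the two
default macros and their values, and sftp_file_transfer_free() are all
assumptions made for illustration. The sftp_file typedef is repeated as an
opaque stand-in so the sketch compiles without the libssh headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Opaque stand-in so the sketch compiles without libssh headers. */
typedef struct sftp_file_struct *sftp_file;

/* Hypothetical defaults; the proposal does not fix these values. */
#define SFTP_FILE_TRANSFER_DEFAULT_CONCURRENT_REQUESTS 20
#define SFTP_FILE_TRANSFER_DEFAULT_CHUNK_SIZE ((size_t)32768)

enum sftp_file_transfer_options_e {
    SFTP_FILE_TRANSFER_OPTIONS_CONCURRENT_REQUESTS, /* hypothetical name */
    SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE
};

struct sftp_file_transfer_struct {
    int local_fd;
    sftp_file remote_file;
    int concurrent_requests;
    size_t chunk_size;
};
typedef struct sftp_file_transfer_struct *sftp_file_transfer;

/* Allocate a transfer object with the default tuning values. */
sftp_file_transfer sftp_file_transfer_new(int local_fd, sftp_file file)
{
    sftp_file_transfer ft = calloc(1, sizeof(*ft));

    if (ft == NULL) {
        return NULL;
    }
    ft->local_fd = local_fd;
    ft->remote_file = file;
    ft->concurrent_requests = SFTP_FILE_TRANSFER_DEFAULT_CONCURRENT_REQUESTS;
    ft->chunk_size = SFTP_FILE_TRANSFER_DEFAULT_CHUNK_SIZE;
    return ft;
}

/* Set one option; value points to an int or a size_t depending on type.
 * Returns 0 on success, -1 on invalid arguments. */
int sftp_file_transfer_options_set(sftp_file_transfer ft,
                                   enum sftp_file_transfer_options_e type,
                                   const void *value)
{
    if (ft == NULL || value == NULL) {
        return -1;
    }
    switch (type) {
    case SFTP_FILE_TRANSFER_OPTIONS_CONCURRENT_REQUESTS: {
        int n = *(const int *)value;
        if (n <= 0) {
            return -1;
        }
        ft->concurrent_requests = n;
        return 0;
    }
    case SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE: {
        size_t n = *(const size_t *)value;
        if (n == 0) {
            return -1;
        }
        ft->chunk_size = n;
        return 0;
    }
    }
    return -1; /* unknown option */
}

/* Hypothetical counterpart to sftp_file_transfer_new(). */
void sftp_file_transfer_free(sftp_file_transfer ft)
{
    free(ft);
}
```

A user who only wants a different chunk size would call
sftp_file_transfer_new(), then sftp_file_transfer_options_set() with
SFTP_FILE_TRANSFER_OPTIONS_CHUNK_SIZE and a pointer to a size_t, and leave
concurrent_requests at its default.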

The code for sftp_get() and sftp_put() discussed below
closely resembles the benchmark code for async download
and upload added as a commit in this merge request (see
https://gitlab.com/libssh/libssh-mirror/-/merge_requests/375).

------------------------------------------------------
What happens inside the get (download)
------------------------------------------------------
1. Query the remote file size using sftp_stat().
This is the data size that we will wish to read.

2. Initially issue at most concurrent_requests read
requests to the server, stopping early once the total
number of bytes requested reaches the size to read. The
issued requests are stored in a request queue.

3. while (request queue is not empty)
{
3.1) Dequeue a request from the request queue and
get its response.

3.2.1) If the response is eof, the file is smaller than expected.

3.2.2) If the response is not eof but contains fewer bytes than
were requested, it is a short read (not yet decided what to do in
this case).

3.2.3) If it's neither of the above cases then the read was
successful, write the read data to the local file.

3.3) If the total number of bytes requested so far < file_size,
then issue one more read request and add it to the request queue.

/* Since we'll make sure that the requested number of
 * bytes never exceeds the queried file size, 3.2.1 should
 * ideally never occur, which is why getting eof is
 * something unexpected.
 */
}
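
The loop above can be sketched as a small simulation. This is not libssh
code: the "server" is a memory buffer answered by fake_wait_read(), the
"local file" is another buffer, and every name here (read_req,
fake_wait_read, enqueue_next, pipelined_get, MAX_IN_FLIGHT) is made up for
illustration. expected_size plays the role of the size returned by
sftp_stat(), and remote_size is what the server really has, so a file that
shrank after the stat can be simulated. Short reads and eof are both treated
as errors here, which is only one of the possible policies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_IN_FLIGHT 64 /* hypothetical cap on the request queue */

struct read_req {
    size_t offset; /* where in the file the read starts */
    size_t len;    /* how many bytes were requested */
};

/* Stand-in for the server: serves a read request from memory.
 * Returns the number of bytes copied, 0 at end of file. */
static size_t fake_wait_read(const char *remote, size_t remote_size,
                             struct read_req req, char *out)
{
    size_t n;

    if (req.offset >= remote_size) {
        return 0;
    }
    n = remote_size - req.offset;
    if (n > req.len) {
        n = req.len;
    }
    memcpy(out, remote + req.offset, n);
    return n;
}

/* Append the next request to the ring buffer and advance *requested. */
static void enqueue_next(struct read_req *queue, size_t head, size_t *count,
                         size_t *requested, size_t expected_size,
                         size_t chunk_size)
{
    size_t len = expected_size - *requested;

    assert(*count < MAX_IN_FLIGHT);
    if (len > chunk_size) {
        len = chunk_size;
    }
    queue[(head + *count) % MAX_IN_FLIGHT] =
        (struct read_req){ *requested, len };
    (*count)++;
    *requested += len;
}

/* local must have room for expected_size bytes.
 * Returns 0 on success, -1 on bad arguments, eof or a short read. */
static int pipelined_get(const char *remote, size_t remote_size,
                         size_t expected_size, char *local,
                         size_t in_flight, size_t chunk_size)
{
    struct read_req queue[MAX_IN_FLIGHT];
    size_t head = 0, count = 0, requested = 0;

    if (in_flight == 0 || in_flight > MAX_IN_FLIGHT || chunk_size == 0) {
        return -1;
    }

    /* Step 2: issue the initial batch of requests. */
    while (count < in_flight && requested < expected_size) {
        enqueue_next(queue, head, &count, &requested,
                     expected_size, chunk_size);
    }

    /* Step 3: drain the queue, topping it up after each response. */
    while (count > 0) {
        struct read_req req = queue[head];
        size_t got;

        head = (head + 1) % MAX_IN_FLIGHT;
        count--;

        /* 3.1 and 3.2.3: get the response and store it locally. */
        got = fake_wait_read(remote, remote_size, req, local + req.offset);
        if (got != req.len) {
            return -1; /* 3.2.1 (eof) or 3.2.2 (short read) */
        }

        /* 3.3: keep the pipeline full while bytes remain. */
        if (requested < expected_size) {
            enqueue_next(queue, head, &count, &requested,
                         expected_size, chunk_size);
        }
    }
    return 0;
}
```

Because responses are consumed in the order the requests were issued, the
data arrives in file order and a real implementation could write it
sequentially to local_fd.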

/* Issuing read requests for get (download) */
-----------------------------------------------------------
sftp_aio_begin_read() can be called as is here.
(This is a function introduced in the above linked
merge request and is used to issue a read request
for some number of bytes from a remote file.)

/* Getting response for the issued request for get (download) */
 ---------------------------------------------------------------------------------
First let's analyse the steps involved when a user uses
the existing low level read api to get the response to a
previously issued read request and writes the received data
to a local file.

/* Step-1 to Step-3 are performed by the libssh api */
1. Get response from the server in msg (of type sftp_message),
the ssh_buffer containing the received read data is msg->payload.

2. Call ssh_buffer_get_ssh_string(msg->payload) which allocates
a new ssh_string, takes the data out of the buffer msg->payload
(i.e. performs a memory copy), stores it in the ssh_string and
returns it.

3. Copy the data from the ssh_string to the application buffer.

/* Step-4 is performed by the libssh user */
4. Write data from the application buffer to the file.

In the case of sftp_get() all these 4 steps are to be performed
by the libssh api. We can optimize this process and avoid some
memory copies if we write directly from libssh buffer to the file using
the local file descriptor.

Something like :
write(fd,
      address where data to write is in ssh buffer,
      byte_count);
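
One detail the direct write would have to handle: POSIX write() may write
fewer bytes than requested, so a single call is not enough in general. A
minimal sketch of a retry loop follows. The helper name write_all() is made
up here, and buf stands for wherever the data sits inside the libssh buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf to fd,
 * retrying after partial writes and EINTR.
 * Returns 0 on success, -1 on error (errno is left set by write()). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted before writing anything, retry */
            }
            return -1;
        }
        /* Advance past the bytes that were actually written. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```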

-----------------------------------------------------
What happens inside the put (upload)
-----------------------------------------------------
1) Use some function like stat() to get the local file
size; this is the total number of bytes we wish to
write.

2) Issue at most concurrent_requests write requests,
stopping early once the total number of bytes requested
to write reaches the file size. Add the requests to a
request queue.

3) while(request queue is not empty)
{
3.1) Dequeue a request from the request queue
and get its response.

3.2) The response can be a successful write or
a failure.

3.3) Issue one more write request if needed (i.e if
requested_bytes < total number of bytes) and add
it to the request queue.
}

/* Issuing write request for put (upload) */
------------------------------------------------------------
First let's analyse the steps involved in sending
a write request when the user is using the low level
write api to upload a local file.

/* Step-1 is performed by the user */
1. Read data from the local file to the application
buffer.

/* Step-2 to Step-4 are performed by the libssh api */
2. Copy the data from the application buffer to the libssh buffer.

3. Data of the libssh buffer is encrypted and written to
session->out_buffer

4. The data of session->out_buffer is written to the socket write
buffer.

In case of sftp_put() all these 4 steps need to be performed by the
libssh api while issuing each write request. We can optimize this
process and avoid one memory copy if we read directly from the file
to the libssh buffer, by-passing the application buffer. We will need to
do something like :

read(local_fd,
     address of location where data is to be read in libssh buffer,
     byte_count);

/* Getting response for a write request */
------------------------------------------------------
sftp_aio_wait_write() can be used here as it is.
(This is a function introduced in the above linked merge request,
and is used to get the response of a previously issued write
request)

Since the suggested optimisations need us to directly read from
local file to libssh buffer or write from libssh buffer to local file, I
think we will need to extend the buffer api in src/buffer.c to achieve this.

Another possible improvement is that instead of using the
system calls read() and write() on the local file we can use
the higher level fread() and fwrite(). Since they are buffered,
using them generally leads to fewer system calls, which should
improve performance.
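
A minimal sketch of what chunk-wise local I/O through stdio looks like. It
copies between two local streams only, to show the fread()/fwrite() pattern.
The function name copy_buffered() is made up here and nothing in it is libssh
api. How much this saves compared to plain read()/write() depends on the
chunk size relative to the stdio buffer size, so it would need measuring.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copy everything from in to out, chunk_size
 * bytes at a time, using buffered stdio.
 * Returns the number of bytes copied, or -1 on error. */
static long copy_buffered(FILE *in, FILE *out, size_t chunk_size)
{
    char *chunk;
    long total = 0;
    size_t n;
    int failed = 0;

    if (in == NULL || out == NULL || chunk_size == 0) {
        return -1;
    }
    chunk = malloc(chunk_size);
    if (chunk == NULL) {
        return -1;
    }

    /* fread() returns a short count only at end of file or on error. */
    while ((n = fread(chunk, 1, chunk_size, in)) > 0) {
        if (fwrite(chunk, 1, n, out) != n) {
            failed = 1;
            break;
        }
        total += (long)n;
    }

    /* Distinguish a read error from end of file, and push any data
     * still held in the stdio buffer out to the file. */
    if (ferror(in) || fflush(out) != 0) {
        failed = 1;
    }
    free(chunk);
    return failed ? -1 : total;
}
```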

I am not yet sure what should be done if we get short reads
from the remote side without reaching end of file; if an approach
comes to mind, kindly suggest it. According to the sftp protocol
(https://datatracker.ietf.org/doc/html/draft-ietf-secsh-filexfer-02#page-13),
short reads before reaching end of file should not occur for
regular disk files, but they can occur for other kinds of files
like device files. Does using this put, get api make sense for
device files or other kinds of special files? If it doesn't, I
think returning an error is how short reads before reaching end
of file should be dealt with.

Kindly provide your thoughts on the proposed aspects (how the api
should look from the user's perspective, the working of the api, the
optimisations etc). Any suggestions to improve the proposed high level
put, get api are appreciated.