
In terms of gate usage, a bicubic interpolation algorithm implemented on a Virtex-II Pro FPGA uses 890 CLBs (configurable logic blocks; I'm also pretty sure it's scaling from 640x480) and needs about 3.5 ms for execution.
For a rough comparison: I'm using an Artix-7, where even the lower-end models have about 5000 logic slices. Let's say the bicubic scaling uses ~1000 CLBs; on the Virtex-II Pro that's about 4000 slices, since each CLB holds four slices. A Virtex-II Pro slice has two 4-input lookup tables, while an Artix-7 slice has four 6-input LUTs. So I would think the lower-end chips in the Artix series could handle proper scaling in terms of logic capacity. I think the bottleneck would be memory, as I could see a scaling algorithm eating up tons of BRAM. That's a super basic comparison, so chances are it's not valid, but it's better than nothing.
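A quick sketch of that back-of-the-envelope arithmetic, in case anyone wants to plug in their own numbers. It assumes two 4-input LUTs per Virtex-II Pro slice and four slices per CLB, and it pessimistically maps each 4-input LUT onto one whole 6-input LUT. The memory part is my own guess at the architecture, not something from the bicubic implementation quoted above: it assumes 24-bit pixels and four line buffers for the 4x4 bicubic neighborhood, versus buffering a whole 640x480 frame. All function names are just made up for illustration.

```python
# Rough FPGA resource estimate. Every constant here is a ballpark assumption.

VIRTEX2P_SLICES_PER_CLB = 4  # Virtex-II Pro: four slices per CLB
VIRTEX2P_LUTS_PER_SLICE = 2  # two 4-input LUTs per slice
ARTIX7_LUTS_PER_SLICE = 4    # four 6-input LUTs per slice


def virtex2p_luts(clbs):
    """4-input LUTs contained in the given number of Virtex-II Pro CLBs."""
    return clbs * VIRTEX2P_SLICES_PER_CLB * VIRTEX2P_LUTS_PER_SLICE


def artix7_slices_needed(lut_count):
    """Artix-7 slices needed if every 4-input LUT takes one 6-input LUT.

    Pessimistic on purpose: a 6-input LUT can often absorb more logic.
    """
    return -(-lut_count // ARTIX7_LUTS_PER_SLICE)  # ceiling division


def line_buffer_bits(width_px, bits_per_px, lines=4):
    """Bits needed to hold `lines` rows of the source image (assumed design)."""
    return width_px * bits_per_px * lines


def frame_buffer_bits(width_px, height_px, bits_per_px):
    """Bits needed to hold one whole source frame."""
    return width_px * height_px * bits_per_px


if __name__ == "__main__":
    luts = virtex2p_luts(1000)
    print("Virtex-II Pro LUTs:", luts)
    print("Artix-7 slices (worst case):", artix7_slices_needed(luts))
    print("4 line buffers, bits:", line_buffer_bits(640, 24))
    print("Full frame, bits:", frame_buffer_bits(640, 480, 24))
```

With those assumptions, 1000 CLBs come to 8000 4-input LUTs, or roughly 2000 Artix-7 slices in the worst case. Four line buffers need about 61 kbit, while a full 640x480 frame at 24 bits per pixel needs about 7.4 Mbit, so how much BRAM gets eaten depends heavily on whether the design streams lines or stores frames.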
Actually implementing a real scaling algorithm would likely take me months to figure out. I've done edge detection with an FPGA, but that's a pretty simple filter.
Always open to suggestions. I'm hoping I'll have a chance this weekend to sit down and try out some of these ideas.